UFO-ViT: High Performance Linear Vision Transformer without Softmax
Jeong-geun Song
Kakao Enterprise
po.ai@kakaoenterprise.com

PREPRINT
arXiv:2109.14382v1 [cs.CV] 29 Sep 2021

Abstract

Vision transformers have become one of the most important models for computer vision tasks. While they outperform earlier convolutional networks, their complexity quadratic in N is one of the major drawbacks of the traditional self-attention algorithm. Here we propose UFO-ViT (Unit Force Operated Vision Transformer), a novel method that reduces the computation of self-attention by eliminating some of its non-linearity. Modifying only a few lines of self-attention, UFO-ViT achieves linear complexity without degradation of performance. The proposed models outperform most transformer-based models on image classification and dense prediction tasks across most of the capacity regime.

1. Introduction

Recently, several transformer results[10, 37, 42, 32, 19, 14, 12] have shown many breakthroughs in vision tasks as well as in natural language processing. The vision transformer[10] shows better scalability with large datasets by removing the inductive bias of CNN-based architectures. In recent studies, transformer-based architectures renew the state of the art on image classification, object detection and semantic segmentation[2, 26, 41, 49, 51], and generative models[23, 13, 11].

Figure 1: ImageNet1k Top-1 Acc. vs. FLOPS. X means distilled model.

Transformer-based models have shown competitive performance compared to earlier state-of-the-art models. Despite their great successes, models using self-attention have well-known drawbacks. One is that the self-attention mechanism has time and memory complexity quadratic in N. When computing self-attention, QK^T ∈ R^{N×N} is multiplied by the value matrix to extract pairwise global relations. This can be a critical issue on tasks that require high resolution, i.e., object detection or segmentation: if width and height are doubled, self-attention requires 16 times the resources to compute. Another major issue is data efficiency. ViT[10] has to be pre-trained on a large extra dataset to achieve performance competitive with CNN-based architectures.

We suggest a simple alternative self-attention mechanism that avoids these two major drawbacks. It is called the Unit Force Operated Vision Transformer, or simply UFO-ViT. This method eliminates the softmax function in order to use the associative law of matrix multiplication. It is not an approximation, but it is similar to low-rank approximation, as presented earlier in Performer[7] and Efficient Attention[31]. Here we suggest another mechanism to drop the softmax function by a two-step L2-normalization, called XNorm. It computes self-attention with linear complexity, exploiting the associative law as mentioned above.

Unlike previous linear approximation methods, UFO-ViT achieves higher or competitive results on vision benchmarks compared to state-of-the-art models. This is detailed in Section 4, including results on ImageNet1k[9] and COCO[25]. These results are obtained without any extra dataset like JFT-300M or ImageNet21k.

Our approach makes the following contributions:

1. We propose a novel constraint scheme, XNorm, which constrains features to the unit hypersphere to extract relational features. As detailed in Section 3, it prevents self-attention from being dependent on initialization. Furthermore, it has O(N) complexity, handling high-resolution inputs efficiently.

2. We demonstrate that UFO-ViT can be adopted for general purposes. Our models are evaluated on both image classification and dense prediction tasks. Even though it uses a linear self-attention scheme, UFO-ViT outperforms most state-of-the-art transformer-based models at lower capacity and FLOPS. In particular, our models perform well in the lightweight regime.

3. We propose a new module that can process the input image dynamically, meaning the size of the weights is independent of the input resolution. This is useful for dense prediction tasks like object detection or semantic segmentation, which usually require higher resolution than the pre-training stage. For MLP-based structures[35, 36], post-processing is required if the model is pre-trained at a different resolution.

2. Related Works

Vision transformers. Dosovitskiy et al.[10] have shown the potential of transformers in vision. After the achievements of ViT, DeiT[37] introduced the most efficient training strategies for vision transformers through detailed ablation studies. They solved the data efficiency problem of ViT successfully, and most concurrent transformer-based models have been following their schemes.

In further research, various architectural strategies to improve training stability have been presented. Touvron et al.[38] propose two simple modules to train deep vision transformer models. One is class attention layers, which are additional self-attention modules for the class token only. The proposed modules help the class token aggregate features from the patches by detaching them from the main modules. Another is the LayerScale module. They claim that it helps larger transformer models generalize better by scaling residual connections. A simple modified version of it has been presented in ResMLP[36], which is LayerScale with a bias term. While MLP-based models are not directly related to our models, we adopt Affine modules in our model for scaling normalized heads and residuals.

Hybrid architectures. To solve several limitations of transformers, design schemes used in CNNs have been revisited in further works. Liu et al.[26] propose the shifting window and patch merging. It generates hierarchical features by grouped attention and gradually merges the patches to shrink the input resolution. This method helps self-attention modules learn local structures as well as reduce the computation. Tokens-to-token ViT, introduced by Yuan et al.[48], targets a similar problem with a different approach. They present a method of overlapping tokens to correlate patches locally, but they do not use special tricks to reduce the computation beyond using small channels. Zhang et al.[50] propose a nested self-attention structure. They split the patches into several blocks and apply local attention to them. At the end of each stage, the blocks are flattened and pooled by a convolution with stride 2. Beyond this, various patch merging methods have been introduced[3].

Different methods to integrate convolutional layers[19, 41, 14, 12, 44, 15] have been introduced too. LeViT, designed by Graham et al.[14], applies multi-stage networks to transformers, inspired by the common structure of early CNN-based networks. Xiao et al.[44] found that replacing linear patch embedding layers with convolutions helps transformers capture better features. In XCiT[12], El-Nouby et al. introduce local patch interaction: by adding two depthwise convolutions[6] after XCA, XCiT achieves better performance. From another point of view, an alternative self-attention method is introduced by Chu et al.[8]. Their model uses global attention and local attention together.

For data efficiency, our models are mostly inspired by the intrinsic optimization strategies that XCiT[12] presents. But the UFO scheme is not based on their method, as detailed in the following paragraph.

Efficient self-attention. Instead of architectural strategies, several approaches have been proposed to directly solve the O(N²) problem of the self-attention mechanism. They can be summarized into several categories: special patterns[20, 5, 33], exploiting the associative law[7, 31], using low-rank factorization[40], linear approximation by sampling important tokens[24, 46], and using cross-covariance matrices instead of Gram matrices[12]. Although the detailed methods are quite different, our UFO scheme is mainly related to exploiting the associative law.

3. Methods

The structure of our model is depicted in Figure 2. It is a mixture of convolutional layers, the UFO module, and a simple feed-forward MLP layer. In this section, we point out how our proposed method can replace the softmax function and ensure linear complexity.

    Module Type                Complexity
    ViT[10]                    O(N²d)
    Linformer[40]              O(kNd)
    Efficient Attention[31]    O(hNd)
    Axial[20]                  O(N√N d)
    XCiT[12]                   O(Nd²)
    UFO-ViT                    O(hNd)

Table 1: Complexity of self-attention methods. h means the hidden dimension of key-query-value.
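To make the comparison in Table 1 concrete, the short sketch below (our own illustration, not code from the paper) evaluates the two bracketings of the same product for a single head: materializing QK^T costs O(N²) time and memory, while computing K^T V first only ever touches an h×d matrix, so the cost grows linearly in the number of tokens N. The sizes are arbitrary illustrative values.

    import torch

    N, d, h = 784, 384, 128                 # tokens (28x28), channels, key-query-value hidden dim (illustrative)
    Q = torch.randn(N, h, dtype=torch.float64)
    K = torch.randn(N, h, dtype=torch.float64)
    V = torch.randn(N, d, dtype=torch.float64)

    # Quadratic bracketing: (Q K^T) V materializes an N x N matrix -> N*N*(h + d) multiply-adds.
    quadratic = (Q @ K.T) @ V

    # Linear bracketing: Q (K^T V) only ever builds an h x d matrix -> 2*N*h*d multiply-adds.
    linear = Q @ (K.T @ V)

    # Without the softmax, the two bracketings are mathematically identical.
    print(torch.allclose(quadratic, linear))   # True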

Figure 2: Overview of the UFO-ViT module. Note that affine layers[36] follow each module.

Figure 3: UFO module.

Figure 4: Centroids generating harmonic potentials. Each patch, represented as a particle, is interfered with by parabolic potentials generated from h clusters.

3.1. Theoretical Interpretation

For an input x ∈ R^{N×C}, the traditional self-attention mechanism is formulated as follows:

    A(x) = \sigma(QK^T / \sqrt{d_k}) V                                        (1)
    Q = xW_Q,  K = xW_K,  V = xW_V                                            (2)

where A denotes the attention operator.

\sigma(QK^T)V cannot be decomposed into O(N × h + h × N) because of the non-linearity of the softmax. Our approach is to eliminate the softmax and compute K^T V first using the associative law. Since applying the identity instead causes degradation, we suggest a simple constraint to prevent it.

Our proposed method, called cross-normalization or XNorm, is defined as follows:

    A(x) = XN_{dim=filter}(Q) \, (XN_{dim=space}(K^T V))                      (3)
    XN(a) := \frac{\gamma a}{\sqrt{\sum_{i=0}^{h} \lVert a_i \rVert^2}}       (4)

where γ is a learnable parameter and h is the number of embedding dimensions. It is a simple L2-norm, but it is applied along two dimensions: the spatial dimension of K^T V and the channel dimension of Q. That is why it is called cross-normalization.

Using the associative law, key and value are multiplied first, and the query is multiplied after; this is depicted in Figure 3. Both multiplications have complexity O(hNd), so this process is linear in N. (To compare with other methods, see Table 1.)

3.2. XNorm

Replacing softmax with XNorm. Key and value are multiplied directly in our process. This is the same as the linear kernel method, generating h clusters:

    [K^T V]_{ij} = \sum_{k=1}^{n} K^T_{ik} V_{kj}                             (5)

XNorm is applied to both the query and this output:

    A(x) = \begin{pmatrix} \hat{q}_0 \cdot \hat{k}_0 & \hat{q}_0 \cdot \hat{k}_1 & \cdots & \hat{q}_0 \cdot \hat{k}_h \\ \hat{q}_1 \cdot \hat{k}_0 & \hat{q}_1 \cdot \hat{k}_1 & \cdots & \hat{q}_1 \cdot \hat{k}_h \\ \vdots & \vdots & \ddots & \vdots \\ \hat{q}_N \cdot \hat{k}_0 & \hat{q}_N \cdot \hat{k}_1 & \cdots & \hat{q}_N \cdot \hat{k}_h \end{pmatrix}     (6)
    \hat{q}_i = XN[(Q_{i0}, Q_{i1}, \cdots, Q_{ih})]                          (7)
    \hat{k}_i = XN[([K^T V]_{0i}, [K^T V]_{1i}, \cdots, [K^T V]_{hi})]        (8)

where x denotes the input. Finally, the projection weight scales and aggregates them by a weighted sum:

    [W_{proj} A(x)]_{ij} = \sum_{m=1}^{h} w_{mj} \, \hat{q}_i \cdot \hat{k}_j     (9)

In this formulation, relational features are defined by the cosine similarity between the patches and the clusters. XNorm restricts the query and the clusters to be unit vectors. This prevents their values from suppressing relational properties by regularizing them to a narrow range. If they had arbitrary values, the region of attention would be too dependent on initialization.
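The following PyTorch sketch is our minimal reading of Eqs. 3-9 for a multi-head module; it is not the authors' released implementation, and details such as the exact normalization axes, the per-head γ handling, and the placement of the Affine layers are assumptions. Default sizes follow the UFO-ViT-S row of Table 2 (8 heads, 128-dimensional key-query-value embedding).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class XNorm(nn.Module):
        """Eq. 4: L2-normalize along a chosen axis and rescale with a learnable gamma per head."""
        def __init__(self, num_heads: int):
            super().__init__()
            self.gamma = nn.Parameter(torch.ones(num_heads, 1, 1))

        def forward(self, x: torch.Tensor, dim: int) -> torch.Tensor:
            return self.gamma * F.normalize(x, p=2, dim=dim)

    class UFOAttention(nn.Module):
        """Softmax-free attention: XNorm(Q) @ XNorm(K^T V), computed head by head (Eqs. 3-9)."""
        def __init__(self, dim: int, num_heads: int = 8, head_dim: int = 16):
            super().__init__()
            self.num_heads, self.head_dim = num_heads, head_dim
            inner = num_heads * head_dim                    # 128 for UFO-ViT-S/M/B
            self.qkv = nn.Linear(dim, 3 * inner)
            self.norm_q, self.norm_kv = XNorm(num_heads), XNorm(num_heads)
            self.proj = nn.Linear(inner, dim)               # W_proj of Eq. 9

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, N, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                       for t in (q, k, v))                  # (B, heads, N, head_dim)
            kv = k.transpose(-2, -1) @ v                    # K^T V: (B, heads, head_dim, head_dim), O(hNd)
            out = self.norm_q(q, dim=-1) @ self.norm_kv(kv, dim=-2)   # Eqs. 6-8, again O(hNd)
            out = out.transpose(1, 2).reshape(B, N, -1)
            return self.proj(out)                           # weighted aggregation, Eq. 9

    tokens = torch.randn(2, 196, 384)                       # a batch of 14x14 patch embeddings
    print(UFOAttention(dim=384)(tokens).shape)              # torch.Size([2, 196, 384])

Note that the N × N attention matrix is never formed; only the small per-head K^T V matrix is stored, which is why the memory usage in Figure 5 grows linearly with the number of tokens.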

Model        Depth   #Dim   #Embed   #Head   GFLOPS   Params (M)   Res.   Patch Size
UFO-ViT-U    12      192    96       4       1.0      5.6          224    16
UFO-ViT-T    24      192    96       4       1.9      10.0         224    16
UFO-ViT-S    12      384    128      8       3.7      20.7         224    16
UFO-ViT-M    24      384    128      8       7.0      37.3         224    16
UFO-ViT-B    24      512    128      8       11.9     63.7         224    16
UFO-ViT-L    12      384    128      8       14.3     20.6         224    8
UFO-ViT-XL   24      384    128      8       27.4     37.3         224    8

                                                        Table 2: UFO-ViT models.
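For reference, Table 2 can be restated as a plain configuration mapping (values copied from the table; the head_dim field assumes the #Embed dimension is split evenly across heads, as in the sketch above):

    # Table 2 restated as configs; head_dim = embed // heads is our assumption.
    UFO_VIT_CONFIGS = {
        "U":  dict(depth=12, dim=192, embed=96,  heads=4, patch=16, head_dim=24),
        "T":  dict(depth=24, dim=192, embed=96,  heads=4, patch=16, head_dim=24),
        "S":  dict(depth=12, dim=384, embed=128, heads=8, patch=16, head_dim=16),
        "M":  dict(depth=24, dim=384, embed=128, heads=8, patch=16, head_dim=16),
        "B":  dict(depth=24, dim=512, embed=128, heads=8, patch=16, head_dim=16),
        "L":  dict(depth=12, dim=384, embed=128, heads=8, patch=8,  head_dim=16),
        "XL": dict(depth=24, dim=384, embed=128, heads=8, patch=8,  head_dim=16),
    }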

However, this interpretation is not enough to explain why XNorm has to take the form of an L2-normalization. Here we introduce another theoretical view, considering a simple physical simulation.

Details of XNorm. Considering the residual connection, the output of an arbitrary module is formulated as follows:

    x_{n+1} = x_n + f(x_n)                                                    (10)

where n and x denote the current depth and the input. If we assume that x is the displacement of a certain object and n is time, the above equation is re-defined as:

    x_{t+1} = x_t + f(x_t)                                                    (11)
    f(x) = \frac{\Delta x}{\Delta t}                                          (12)

Most neural networks are discrete, so Δt is constant. (Let Δt = 1 for simplicity.) The residual term expresses a velocity, and it has the same value as a force where Δt = 1.

In physics, Hooke's law is defined as the dot product of the elasticity vector k and the displacement vector x. The elastic force generates a harmonic potential U, a function of x². Physically, the potential energy interferes with the path of a particle moving through it. (For intuition, imagine a ball moving around a parabola-shaped track.)

    F = -k \cdot x                                                            (13)
    U = \frac{1}{2} k x^2                                                     (14)

The above formulation is generally used to approximate the potential energy of a molecule at x ≈ 0. For multiple molecules, using the linearity of elasticity,

    F = -\sum_{i=1}^{n} k_i \cdot x                                           (15)

This is a similar equation to Eq. 9, except that the above formula is not normalized. (Note that the w_{mj} terms are static; they are irrelevant to each batch, so they do not play a significant role in this simulation.) Assume that a small number of the k are too large. The shapes of their potentials are wide and deep (see Eq. 14). If a certain particle moves around them, it is not easy for it to escape. In this case, only two outcomes exist: 1. Collapse: the particles are aligned to the direction of the largest k. 2. Relational features are neglected: only a few particles with large x survive the collapse.

This is exactly the case mentioned in the previous part. XNorm enforces all vectors to be unit vectors to prevent this situation. In other words, XNorm is not a normalization but a constraint.

To show this empirically, we have demonstrated that other normalization methods do not work well. For detailed results, refer to our ablation study (see Table 4).
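The collapse argument can be illustrated numerically: if one cluster row of K^T V has a much larger magnitude than the others (a large "elasticity" k), the unconstrained product Q(K^T V) is scaled entirely by that cluster, whereas after constraining both factors to unit vectors every entry is a cosine similarity bounded by 1, independent of initialization. This toy example is our own illustration, not an experiment from the paper.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    N, h, d = 64, 8, 32
    q = torch.randn(N, h)                    # queries
    kv = torch.randn(h, d)                   # K^T V, one row per cluster
    kv[0] *= 100.0                           # one cluster with a very large "elasticity" k

    unconstrained = q @ kv                   # output scale is set by the dominant cluster
    constrained = F.normalize(q, dim=1) @ F.normalize(kv, dim=0)   # XNorm-style unit vectors

    print(unconstrained.abs().max())         # on the order of hundreds
    print(constrained.abs().max())           # <= 1 by construction (cosine similarities)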
Feed-forward layers (FFN). In an attention module, the FFN cannot be ignored. However, there is a simple, nice interpretation: the FFN is static and does not depend on the spatial dimension. Physically, this type of function represents the driving force of the harmonic oscillator equation:

    F = -k \cdot x + g(t)                                                     (16)

It performs the role of an amplifier or a reducer, which boosts or suppresses the features irrelevant to spatial relations.

3.3. UFO-ViT

To build our UFO-ViT model, we adopted some architectural strategies from earlier vision transformer models[14, 44, 12, 38]. In this section, we introduce several architectural optimization techniques. The overall structure is illustrated in Figure 2.

Patch embedding with convolutions. Several recent studies[14, 44] claim that vision transformers train well with convolutional patch embedding layers instead of a linear projection. Adopting their strategy, we use convolutional layers at the early stage of our models.
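The text does not specify the exact stem, so the block below is only one plausible layout in the spirit of [44]: a stack of stride-2 3×3 convolutions whose total stride matches the 16-pixel patch size, in place of a single 16×16 linear projection. The channel widths are our assumptions.

    import torch
    import torch.nn as nn

    def conv_stem(dim: int = 384) -> nn.Sequential:
        """Hypothetical convolutional patch embedding: four stride-2 3x3 convs give a total stride of 16."""
        layers, in_ch = [], 3
        for out_ch in (dim // 8, dim // 4, dim // 2, dim):   # assumed channel progression
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.GELU()]
            in_ch = out_ch
        return nn.Sequential(*layers)

    tokens = conv_stem()(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)
    print(tokens.shape)   # torch.Size([1, 196, 384]): 14 x 14 patch tokens of width 384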

Positional encoding. We use positional encodings as learnable parameters, as proposed earlier by Dosovitskiy et al.[10]. Using positional encoding does not affect the ablation study, but we decided to add it to keep the experimental environment consistent.

Multi-headed attention. Following the original transformer[39], our modules are multi-headed for better regularization. The γ parameter in Eq. 3 is applied to all heads to scale the importance of each head.

Local patch interaction. It is not a new idea to design an extra module that extracts local features for a self-attention module. We choose the most simplistic method among them: 3×3 depthwise separable convolutions. Adopting the LPI proposed in XCiT[12], we use two stacked Conv-BN[22]-GELU[18] layers.
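A direct transcription of the description above, two 3×3 depthwise Conv-BN-GELU layers operating on the tokens reshaped back to their 2D grid, could look as follows (a sketch; the released code may order or fuse these layers differently):

    import torch
    import torch.nn as nn

    def lpi_block(dim: int = 384) -> nn.Sequential:
        """Local patch interaction: two stacked depthwise 3x3 Conv-BN-GELU layers."""
        def conv_bn_gelu() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),   # 3x3 depthwise convolution
                nn.BatchNorm2d(dim),
                nn.GELU(),
            )
        return nn.Sequential(conv_bn_gelu(), conv_bn_gelu())

    x = torch.randn(1, 196, 384)                          # N = 14 x 14 tokens
    grid = x.transpose(1, 2).reshape(1, 384, 14, 14)      # tokens -> 2D feature map
    y = lpi_block()(grid).flatten(2).transpose(1, 2)      # back to (1, 196, 384)
    print(y.shape)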
Feed-forward network. Like traditional transformer-based models, our model uses a point-wise feed-forward MLP with a 4d hidden dimension.

Class attention. In the ImageNet1k experiments, we use the class attention layers presented in CaiT[38]. They help the class token gather spatial information. To reduce computation, class attention is computed on the class token only, as in the original paper. Note that our class attention layers consist of UFO modules, while CaiT uses normal self-attention modules for class attention.

    Hyperparam          Model               Value
    learning rate       UFO-ViT-S, L, XL    5e-4
                        UFO-ViT-M           4e-4
                        UFO-ViT-B           3.5e-4
    weight decay[27]    UFO-ViT-S, L        0.05
                        UFO-ViT-M, XL       0.07
                        UFO-ViT-B           0.09
    drop path[21]       UFO-ViT-S, L        0.1
                        UFO-ViT-M, XL       0.15
                        UFO-ViT-B           0.2
    grad clip[28]       UFO-ViT-S, L        1.0
                        UFO-ViT-M, XL       0.7
                        UFO-ViT-B           0.5

Table 3: Hyperparameters for image classification. All other hyperparameters are the same as DeiT[37].

4. Experiments

4.1. Image Classification

Dataset. For the image classification task, we evaluate our models on the ImageNet1k[9] dataset. No extra dataset or knowledge distilled from a convolutional teacher is used.

Figure 5: Memory consumption according to the number of input tokens. Maximum GPU memory allocated using the UFO-ViT-M model with 64 batches. It is linear in the number of input tokens. We use the torch.cuda API provided by the PyTorch[29] library.

Implementation details. Our setup is almost the same as DeiT[37]. However, we use a smaller learning rate and a larger weight decay for the up-scaled models for training stability (see Table 3 for details). The learning rate follows the linear scaling rule[47], scaled per 512 batch size. We train our models for 400 epochs with the AdamW optimizer[27]. The learning rate is linearly warmed up during the first 5 epochs and decayed with a cosine schedule after that.
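As a sketch of the recipe above (not the authors' training script, which builds on the full DeiT pipeline), the optimizer and schedule for UFO-ViT-S at an assumed batch size of 1024 would look roughly like this:

    import math
    import torch

    model = torch.nn.Linear(384, 1000)          # stand-in for a UFO-ViT-S backbone + classifier

    epochs, warmup_epochs, batch_size = 400, 5, 1024
    base_lr = 5e-4 * batch_size / 512           # linear scaling rule: Table 3 rates are per 512 images
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

    def lr_at(epoch: int) -> float:
        """Linear warmup for 5 epochs, then cosine decay over the remaining epochs."""
        if epoch < warmup_epochs:
            return base_lr * (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / (epochs - warmup_epochs)
        return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

    for epoch in range(epochs):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(epoch)
        # ... one epoch over ImageNet1k with the DeiT augmentation/regularization setup ...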

Ablation study on UFO-ViT-M. Our ablation study focuses on the importance of XNorm and the architectural optimizations explained in Section 3.3. We experimented with various normalization methods. Most other normalization methods do not decrease the loss, which is one piece of implicit evidence that our theoretical interpretation is reasonable. Interestingly, applying a single L2Norm also shows bad performance. All results are detailed in Table 4.

    Method                               Top-1 Acc. (%)
    Baseline (Linear Embed + XNorm)      81.8
    XNorm → LN[1], GN[43]                Failed
    XNorm → Learnable p-Norm             81.8
    XNorm → Single L2Norm                Failed
    Linear Embed[10] → Conv Embed        82.0
    + Tuned Hyperparameters              82.8

Table 4: Ablation study on ImageNet1k classification. Results of the ablation study on UFO-ViT-M. Note that single L2Norm means applying the L2Norm to only one of the query and the key-value interaction. The learnable parameter p of the p-norm is initialized to 2.

Comparison with state-of-the-art models. We experimented with three models that share the same architecture design schemes as DeiT[37] (see Table 5). As summarized in Figure 1, all our models show higher performance and parameter efficiency than most concurrent transformer-based models. Additionally, our proposed models have advantages in complexity and data efficiency, while the original ViT[10] requires a larger extra dataset like JFT-300M or ImageNet21k for competitive performance.

    Model                 Top-1 Acc   Res   Params (M)   FLOPs (G)
    RegNetY-1.6G[30]      78.0        224   11           1.6
    DeiT-Ti[37]           72.2        224   5            1.3
    XCiT-T12/16[12]       77.1        224   26           1.2
    UFO-ViT-T             78.3        224   10           1.9
    ResNet-50[17]         75.3        224   26           3.8
    RegNetY-4G[30]        80.0        224   21           4.0
    DeiT-S[37]            79.8        224   22           4.6
    Swin-T[26]            81.3        224   29           4.5
    XCiT-S12/16[12]       82.0        224   26           4.8
    UFO-ViT-S             81.8        224   21           3.7
    ResNet-101[17]        75.3        224   47           7.6
    RegNetY-8G[30]        81.7        224   39           8.0
    Swin-S[26]            83.0        224   50           8.7
    XCiT-S24/16[12]       82.6        224   48           9.1
    UFO-ViT-M             82.8        224   37           7.0
    RegNetY-16G[30]       82.9        224   84           16.0
    DeiT-B[37]            81.8        224   86           17.5
    Swin-B[26]            83.5        224   88           15.4
    XCiT-S12/8[12]        83.4        224   26           18.9
    UFO-ViT-L             83.1        224   21           14.3
    EfficientNet-B7[34]   84.3        600   66           37.0
    XCiT-S24/8[12]        83.9        224   48           36.0
    UFO-ViT-XL            83.9        224   37           27.4

Table 5: Comparison with state-of-the-art models. Note that the properties of the other models are taken from their original papers.

Figure 6: Visualized UFO-module outputs. UFO schemes cannot be visualized like traditional N × N self-attention maps. Instead, we visualize the tensor length map of the UFO-module output from UFO-ViT-M pretrained on ImageNet1k. All maps are extracted from the first layer, which gathers most of the spatial information. The map is scaled to (0, 1); red means the value is close to 1.
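The tensor length map of Figure 6 can be approximated from any UFO-module output by taking the per-token L2 norm, reshaping it to the patch grid, and min-max scaling it to (0, 1). The snippet below is our reconstruction of that visualization, using a random tensor in place of a real first-layer output.

    import torch

    out = torch.randn(1, 196, 384)                   # stand-in for a first-layer UFO-module output (B, N, C)
    length = out[0].norm(dim=-1).reshape(14, 14)     # per-token L2 norm ("tensor length")
    length = (length - length.min()) / (length.max() - length.min())   # scale to (0, 1) as in Figure 6
    print(length.shape, float(length.min()), float(length.max()))      # 14x14 map in [0, 1]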
                                                                     pacity. But XCiT[12] shows slightly better results with
                                                                     same design space. It might be that XCiT models have
4.2. Object Detection with Mask R-CNN                                larger embedding space d. UFO-ViT-B performs slightly
                                                                     worse than UFO-ViT-M on bounding box detection task,
   Implementation details. Our models are evaluated on               but better on smaller bounding boxes and overall instance
COCO benchmark dataset[25] for object detection task.                segmentation scores. All detailed results are reported on
We use UFO-ViT as backbone and Mask R-CNN[16] as                     Table 6.
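In mmdetection's 2.x config style, the settings described above map roughly to the fragment below. This is a hypothetical reconstruction for illustration, not the authors' published config, and the warmup values are our assumptions.

    # Hypothetical mmdetection-style fragment reflecting the settings described above.
    optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)
    lr_config = dict(policy='step', warmup='linear', warmup_iters=500, step=[27, 33])   # 3x schedule
    runner = dict(type='EpochBasedRunner', max_epochs=36)
    data = dict(samples_per_gpu=2)               # 2 images per GPU across 16 A100s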
Evaluation on the COCO dataset. We compare against CNNs[17, 45] and transformer-based vision models on object detection and instance segmentation tasks. For a fair comparison, the experimental environment is the same for all results. All models are pre-trained on the ImageNet1k dataset.

Our models significantly outperform CNN-based models. They also achieve higher or competitive results compared to state-of-the-art vision transformers at lower capacity. However, XCiT[12] shows slightly better results within the same design space; it might be that XCiT models have a larger embedding dimension d. UFO-ViT-B performs slightly worse than UFO-ViT-M on the bounding box detection task, but better on smaller bounding boxes and on overall instance segmentation scores. All detailed results are reported in Table 6.

    Backbone            Params (M)   AP^b   AP^b_50   AP^b_75   AP^m   AP^m_50   AP^m_75
    ResNet50[17]         44.2        41.0   61.7      44.9      37.1   58.4      40.1
    PVT-Small[41]        44.1        43.0   65.3      46.9      39.9   62.5      42.8
    Swin-T[26]           47.8        46.0   68.1      50.3      41.6   65.1      44.9
    XCiT-S12/16[12]      44.3        45.3   67.0      49.5      40.8   64.0      43.8
    UFO-ViT-S            39.7        44.6   66.7      48.7      40.4   63.6      42.9
    ResNet101[17]        63.2        42.8   63.2      47.1      39.2   60.1      41.3
    PVT-Medium[41]       63.9        44.2   66.0      48.2      40.5   63.1      43.5
    Swin-S[26]           69.0        48.5   70.2      53.5      43.3   67.3      46.6
    XCiT-S24/16[12]      65.8        46.5   68.0      50.9      41.8   65.2      45.0
    UFO-ViT-M            56.4        46.0   68.2      50.0      41.0   64.6      43.7
    ResNeXt101-64[45]   101.9        44.4   64.9      48.8      39.7   61.9      42.6
    PVT-Large[41]        81.0        44.5   66.0      48.3      40.7   63.4      43.7
    XCiT-M24/16[12]     101.1        46.7   68.2      51.1      42.0   65.6      44.9
    UFO-ViT-B            82.4        45.8   67.4      50.1      41.2   64.5      44.1

                            Table 6: Object detection performance on the COCO val2017.

5. Conclusion

In this paper, we propose a simple method to make self-attention have linear complexity without loss of performance. By replacing the softmax function, we remove the quadratic operation through the associative law of matrix multiplication. This type of factorization usually caused performance degradation in earlier research. UFO-ViT models outperform most existing state-of-the-art transformer-based and CNN-based models on image classification. We have shown that our models can be deployed well for general purposes: our UFO-ViT models show competitive or higher performance than earlier results on dense prediction tasks. With further strategies such as a multi-stage structure, we expect UFO-ViT to become more efficient and perform better.

References

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[3] Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899, 2021.
[4] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[5] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[6] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[7] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[8] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840, 2021.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[11] Ricard Durall, Stanislav Frolov, Andreas Dengel, and Janis Keuper. Combining transformer generators with convolutional discriminators. arXiv preprint arXiv:2105.10189, 2021.
[12] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. arXiv preprint arXiv:2106.09681, 2021.
[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[14] Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet's clothing for faster inference. arXiv preprint arXiv:2104.01136, 2021.
[15] Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704, 2021.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[19] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021.
[20] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[21] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[23] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074, 2021.
[24] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[28] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
[30] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
[31] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3531–3539, 2021.
[32] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16519–16529, 2021.
[33] Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799, 2019.

[34] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[35] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, et al. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
[36] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. Resmlp: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404, 2021.
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[38] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[40] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[41] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[42] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.
[43] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[44] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881, 2021.
[45] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[46] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021.
[47] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
[48] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986, 2021.
[49] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358, 2021.
[50] Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, and Tomas Pfister. Aggregating nested transformers. arXiv preprint arXiv:2105.12723, 2021.
[51] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.