VITAEV2: VISION TRANSFORMER ADVANCED BY EXPLORING INDUCTIVE BIAS FOR IMAGE RECOGNITION AND BEYOND
Noname manuscript No. (will be inserted by the editor)

ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

Qiming Zhang¹ (qzha2506@uni.sydney.edu.au) · Yufei Xu¹ (yuxu71166@uni.sydney.edu.au) · Jing Zhang¹ (jing.zhang1@sydney.edu.au) · Dacheng Tao²,¹ (dacheng.tao@gmail.com)
¹ School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.
² JD Explore Academy, China

arXiv:2202.10108v1 [cs.CV] 21 Feb 2022

Received: date / Accepted: date

Abstract  Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependency using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance, which is instead learned implicitly from large-scale training data with longer training schedules. In this paper, we propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. Technically, ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale-invariance IB and can learn robust feature representations for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. The proposed two kinds of cells are stacked in both isotropic and multi-stage manners to formulate two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over the baseline transformer models and concurrent works. Besides, we scale up our ViTAE model to 644M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 classification accuracy on the ImageNet real validation set, without using extra private data. It demonstrates that the introduced inductive bias still helps when the model size becomes large. Source code and pretrained models will be made publicly available.

Keywords  Vision transformer · Neural networks · Image classification · Object detection · Inductive bias

1 Introduction

Transformers [74,21] have become popular frameworks in NLP studies owing to their strong ability in modeling long-range dependencies by the self-attention mechanism. Such success and good properties of transformers have inspired many following works that apply them in various computer vision tasks [22,98,76]. Among them, ViT [22] is the pioneering work that adapts a pure transformer model for vision by embedding images into a sequence of visual tokens and modeling the global dependencies among them with stacked transformer blocks. Although it achieves promising performance on image classification, it experiences a severe data-hungry issue, i.e., it requires large-scale training data and a longer training schedule for better performance. One important reason is that ViT does not efficiently utilize the prior knowledge in vision tasks and lacks the inductive bias (IB) of convolutions in modeling local visual clues (e.g., edges and corners) and dealing with objects at various scales. Alternatively, ViT has to learn such IB implicitly from large-scale data.
Fig. 1  Comparison of data and training efficiency of T2T-ViT-7 and ViTAE-T on ImageNet: Top-1 accuracy versus the percentage of training data used (left) and versus the number of training epochs (right).

Unlike vision transformers, convolutional neural networks (CNNs) are naturally equipped with the intrinsic IBs of locality and scale-invariance and still serve as prevalent backbones in vision tasks [30,67,11,97]. The success of CNNs inspires us to explore the benefits of introducing intrinsic IBs into vision transformers. We start by analyzing the above two IBs of CNNs, i.e., locality and scale-invariance. Convolution, which computes local correlations among neighboring pixels, is good at extracting local features such as edges and corners. Consequently, CNNs can provide good low-level features at the shallow layers [93], which are then aggregated into high-level features progressively by a stack of sequential convolutions [34,66,68]. Moreover, CNNs have a hierarchical structure to extract multi-scale features at different layers [66,39,30]. Intra-layer convolutions can also learn features at different scales by varying their kernel sizes and dilation rates [29,67,11,45,97]. Consequently, scale-invariant feature representations can be obtained via intra- or inter-layer feature fusion. Nevertheless, CNNs are not well suited to model long-range dependencies¹, which is the key advantage of transformers. An interesting question then comes up: can we improve vision transformers by leveraging the good properties of CNNs? Recently, DeiT [72] explores the idea of distilling knowledge from CNNs to transformers to facilitate training and improve performance. However, it requires an off-the-shelf CNN model as the teacher and incurs extra training costs.

¹ Although the projection layer in a transformer can be viewed as a 1 × 1 convolution [12], the term convolution here refers to convolutions with larger kernels, e.g., 3 × 3, which are widely used in typical CNNs to extract spatial features.

Different from DeiT, we explicitly introduce intrinsic IBs into vision transformers by re-designing the network structures in this paper. Current vision transformers always obtain tokens with single-scale context [22,92,76,49] and learn to adapt to objects at different scales from data. For example, T2T-ViT [92] improves ViT by delicately generating tokens in a soft-split manner. Specifically, it uses a series of Tokens-to-Token transformation layers to aggregate single-scale neighboring contextual information and progressively structures the image into tokens. Motivated by the success of CNNs in dealing with scale variance, we explore a similar design in transformers, i.e., intra-layer convolutions with different receptive fields [67,89], to embed multi-scale context into tokens. Such a design allows tokens to carry useful features of objects at various scales, thereby naturally having the intrinsic scale-invariance IB and explicitly facilitating transformers to learn scale-invariant features more efficiently from data. On the other hand, low-level local features are fundamental elements for generating high-level discriminative features. Although transformers can also learn such features at shallow layers from data, they are not as skilled at it as convolutions are by design. Recently, [86,43,25] stack convolution and attention layers sequentially and demonstrate that locality is a reasonable complement to global dependency modeling. However, this serial structure ignores the global context during locality modeling (and vice versa). To avoid such a dilemma, we follow the "divide-and-conquer" idea and propose to model locality and long-range dependencies in parallel and fuse the features to account for both. In this way, we empower transformers to learn local and long-range features within each block more effectively.

Technically, we propose a new Vision Transformer Advanced by Exploring Intrinsic Inductive Bias (ViTAE), which is a combination of two types of basic cells, i.e., the reduction cell (RC) and the normal cell (NC). RCs are used to downsample and embed the input images into tokens with rich multi-scale context, while NCs aim to jointly model locality and global dependencies in the token sequence. Moreover, these two types of cells share a simple basic structure, i.e., a parallel attention module and convolutional layers followed by a feed-forward network (FFN). It is noteworthy that the RC has an extra pyramid reduction module with atrous convolutions of different dilation rates to embed multi-scale context into tokens. Following the setting in [92], we stack three reduction cells to reduce the spatial resolution to 1/16 and a series of NCs to learn discriminative features from data. ViTAE outperforms representative vision transformers in terms of data efficiency and training efficiency (see Figure 1) as well as classification accuracy and generalization on downstream image classification
tasks. In addition, we further scale up ViTAE to large models and show that the inductive bias still helps to obtain better performance, e.g., ViTAE-H with 644M parameters achieves 88.5% Top-1 classification accuracy on ImageNet without using extra private data.

Beyond image classification, backbone networks should adapt well to various downstream tasks such as object detection, semantic segmentation, and pose estimation. To this end, we extend the vanilla ViTAE to the multi-stage design, i.e., ViTAEv2. Specifically, a natural choice is to construct the model by re-arranging the reduction cells and normal cells according to the strategies in [76,49] to obtain multi-scale feature outputs, i.e., several consecutive NCs are used following one RC at each stage (feature resolution) rather than using a series of NCs only at the last stage. As a result, the multi-scale features from different stages can be utilized for various downstream tasks. One remaining issue is that the vanilla attention operations in transformers have a quadratic computational complexity, requiring a large memory footprint and computation cost, especially for feature maps with a large resolution. To mitigate this issue, we further explore another inductive bias, i.e., the local window attention introduced in [49], in the RC and NC modules. Since the parallel convolution branch in the proposed two cells can encode position information and enable inter-window information exchange, special designs like the relative position encoding and window-shifting mechanism in [49] can be omitted. Consequently, our ViTAEv2 models outperform state-of-the-art methods for various vision tasks, including image classification, object detection, semantic segmentation, and pose estimation, while keeping a fast inference speed and reasonable memory footprint.

The contribution of this study is threefold. First, we explore two types of intrinsic IB in transformers, i.e., scale invariance and locality, and demonstrate the effectiveness of this idea by designing a new transformer architecture named ViTAE based on two new reduction and normal cells that incorporate the above two IBs. ViTAE outperforms representative vision transformers regarding classification accuracy, data efficiency, training efficiency, and generalization on downstream vision tasks. Second, we scale up our ViTAE model to 644M parameters and obtain 88.5% Top-1 classification accuracy on ImageNet without using extra private data, which is better than the state-of-the-art Swin Transformer, demonstrating that the introduced inductive bias still helps when the model size becomes large. Third, we extend the vanilla ViTAE to the multi-stage design, i.e., ViTAEv2. It learns multi-scale features at different stages efficiently while keeping a fast inference speed and reasonable memory footprint for large-size input images. Experiments on popular benchmarks demonstrate that it outperforms state-of-the-art methods for various downstream vision tasks, including image classification, object detection, semantic segmentation, and pose estimation.

The remainder of this paper is organized as follows. Section 2 describes the works relevant to our paper. We then detail the two basic cells, the vanilla ViTAE model, the scaling strategy for ViTAE, as well as the multi-stage design of ViTAEv2 in Section 3. Next, Section 4 presents extensive experimental results and analysis. Finally, we conclude our paper in Section 5 and discuss potential applications and future research directions.

2 Related Work

2.1 CNNs with intrinsic inductive bias

CNNs [39,93,30] have explored several inductive biases with specially designed operations and have led to a series of breakthroughs in vision tasks, such as image classification, object detection, and semantic segmentation. For example, following the fact that local pixels are more likely to be correlated in images [42], the convolution operations in CNNs extract features from the neighboring pixels within the receptive field determined by the kernel size [41]. By stacking convolution operations, CNNs naturally have the inductive bias for modeling locality.

In addition to locality, another critical inductive bias in visual tasks is scale-invariance, where multi-scale features are needed to represent objects at different scales effectively [52,88]. For example, to effectively learn features of large objects, a large receptive field is needed, obtained either by using large convolution kernels [88,89] or a series of convolution layers in deeper architectures [30,34,66,68]. However, such operations may ignore the features of small objects. To construct multi-scale feature representations for objects at different scales effectively, various image pyramid techniques [11,1,55,6,40,19] have been explored, where features are extracted from a pyramid of images at different resolutions [44,11,53,62,35,4], either in a hand-crafted or a learned manner. Accordingly, features from the small-scale images mainly encode the large objects, while features from the large-scale images respond more to small objects. Then, features extracted from different resolutions are fused to form the scale-invariant feature, i.e., the inter-layer fusion. Another way to obtain the scale-invariant feature is to extract and aggregate multi-scale context by using multiple convolutions with different receptive fields in a parallel manner, i.e.,
the intra-layer fusion [97,68,67,69]. Either the inter-layer or the intra-layer fusion empowers CNNs with the scale-invariance inductive bias and helps improve their performance in recognizing objects at different scales.

However, it is unclear whether these inductive biases can help vision transformers achieve better performance. This paper explores the possibility of introducing two types of inductive biases into the vision transformer, namely locality, by introducing convolution into the vision transformer, and scale-invariance, by encoding a multi-scale context into each visual token using multiple convolutions with different dilation rates, following the convention of intra-layer fusion.

2.2 Vision transformers with inductive bias

ViT [22] is the pioneering work that applies a pure transformer to vision tasks and achieves promising results. It treats images as a 1D sequence, embeds them into several tokens, and then processes them by stacked transformer blocks to get the final prediction. However, since ViT simply treats images as 1D sequences and thus lacks inductive bias in modeling local visual structures, it has to implicitly learn the IB from a large amount of data. Similar phenomena can also be observed in models with fewer inductive biases in their structures [71,23,31].

To alleviate the data-hungry issue, subsequent works explicitly introduce inductive bias into vision transformers, e.g., leveraging the IB from CNNs to facilitate the training of vision transformers with less training data or shorter training schedules. For example, DeiT [72] proposes to distill knowledge from pre-trained CNNs to transformers during training via an extra distillation token that imitates the behavior of CNNs. However, it requires an off-the-shelf CNN model as a teacher, introducing extra computation cost. Recently, some works try to introduce the intrinsic IB of CNNs into vision transformers explicitly [26,58,25,43,18,86,80,91,9,49]. For example, [43,25,80,17] stack convolution and attention layers sequentially, resulting in a serial structure that models locality and global dependency in turn. [76] design sequential multi-stage structures while [49] apply attention within local windows. However, these serial structures may ignore the global context during locality modeling (and vice versa). [77] establishes connections across different scales at the cost of heavy computation. To jointly model global and local context, Conformer [58] and MobileFormer [13] employ a model-parallel structure, consisting of parallel individual convolution and transformer branches and a complicated bridge connection between the two branches. Different from them, we follow the "divide-and-conquer" idea and propose to model locality and global dependencies simultaneously via a parallel structure within each transformer layer. In this way, the convolution and attention modules are designed to complement each other within the transformer block, which is more beneficial for the models to learn better features for both classification and dense prediction tasks.

2.3 Self-supervised learning and model scaling

As demonstrated in previous studies, scaled-up models are naturally few-shot learners and tend to obtain better performance in language, image, and cross-modal domains [21,94,60]. Recently, many efforts have been made to scale up vision models, e.g., BiT [36] and EfficientNet [70] scale up CNN models to hundreds of millions of parameters by employing wider and deeper networks, and obtain superior performance on many vision tasks. However, they need to train the scaled-up models with a much larger scale of private data, i.e., JFT-300M [36]. Similar phenomena can be observed when training scaled-up vision transformer models for better performance [22,94].

However, it is not easy to gather such large amounts of labeled data to train the scaled-up models. On the other hand, self-supervised learning can help train scaled-up models using data without labels. For example, CLIP [60] adopts paired text and image data captured from the Internet and exploits the consistency between text and images to train a big transformer model, which obtains good performance on image and text generation tasks. [47] adopt masked language modeling (MLM) as the pretext task and generate supervisory signals from the input data. Specifically, they take sentences with several words replaced by a mask token and predict the masked words, using the words from the sentence before masking as supervision. In this way, these models do not require additional labels for the training data and achieve superior performance on translation, sentiment analysis, etc. Inspired by the superior performance of MLM tasks in language, masked image modeling (MIM) tasks have recently been explored in vision. For example, BEiT [3] tokenizes the images into visual tokens and randomly masks some tokens in a block-wise manner. The vision transformer model must then predict the original tokens for those masked positions. In this way, BEiT obtains superior classification and dense prediction performance using the publicly available ImageNet-22K dataset [20]. MAE [27] removes the need for a tokenizer and simply treats the image pixels as the targets for reconstruction. Using only ImageNet-1K training data, MAE obtains impressive performance.
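As a rough, self-contained illustration of the MIM recipe summarized above (random masking of patch tokens and a reconstruction loss computed only on the masked parts), and not the exact BEiT or MAE implementation, the masking step could be sketched in PyTorch as follows; the function name and shapes are ours:

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: [B, N, D] patch tokens. Keep a random (1 - mask_ratio) subset per image."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    keep_idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]            # random tokens to keep
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, keep_idx, 0.0)                                   # 1 where a token is masked
    return visible, mask.bool()

patches = torch.randn(2, 196, 768)              # e.g., 14x14 patches per image
visible, mask = random_masking(patches)
target = patches[mask]                          # the masked patches are the reconstruction targets
pred = torch.zeros_like(target)                 # stand-in for the decoder output
loss = ((pred - target) ** 2).mean()            # MSE on masked patches only
print(visible.shape, int(mask.sum()), float(loss))
```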
It is under-explored whether vision transformers with introduced inductive bias can be scaled up, e.g., in a self-supervised setting. Besides, whether inductive bias can still help these scaled-up models achieve better performance remains unclear. In this paper, we make an attempt to answer this question by scaling up the ViTAE model and training it in a self-supervised manner. Experimental results confirm the value of introducing inductive bias in scaled-up vision transformers.

2.4 Comparison to the conference version

A preliminary version of this work was presented in [85]. This paper extends the previous study with three major improvements.

1. We scale up the ViTAE model to different model sizes, including ViTAE-B, ViTAE-L, and ViTAE-H. With the help of inductive bias, the proposed ViTAE-H model with 644M parameters obtains state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on the ImageNet validation set and the best 91.2% Top-1 classification accuracy on the ImageNet real validation set, without using extra private data. It demonstrates that the introduced inductive bias still helps when the model size becomes large. We also show the excellent few-shot learning ability of the scaled-up ViTAE models.
2. We extend the vanilla ViTAE to the multi-stage design and devise ViTAEv2. The efficiency of the RC and NC modules is also improved by exploring another inductive bias from local window attention. ViTAEv2 outperforms state-of-the-art models on image classification tasks as well as downstream vision tasks, including object detection, semantic segmentation, and pose estimation.
3. We also present more ablation studies and experimental analysis regarding module design, inference speed, memory footprint, and comparisons with the latest works.

3 Methodology

3.1 Revisiting the vision transformer

We first give a brief review of the vision transformer. To adapt transformers to vision tasks, ViT [22] first splits an image x ∈ R^{H×W×C} into several non-overlapping patches with patch size p and embeds them into visual tokens (i.e., x_t ∈ R^{N×D}) in a patch-to-token manner, where H, W, C denote the height, width, and channel dimensions of the input image, respectively, N and D denote the token number and token dimension, respectively, and N = (H × W)/p². Then, an extra learnable embedding with the same dimension D, considered as a class token, is concatenated to the visual tokens before position embeddings are added to all the tokens in an element-wise manner. In the remainder of this paper, we use x_t to represent all tokens, and N is the total number of tokens after concatenation unless otherwise specified. These tokens are fed into several sequential transformer layers for the final prediction. Each transformer layer is composed of two parts, i.e., a multi-head self-attention module (MHSA) and a feed-forward network (FFN).

MHSA extends single-head self-attention (SHSA) by using different projection matrices for each head. In other words, MHSA is obtained by repeating SHSA h times, where h is the number of heads. Specifically, for SHSA, the input tokens x_t are first projected to queries (Q), keys (K), and values (V) using three different projection matrices, i.e., Q, K, V = x_t W_Q, x_t W_K, x_t W_V, where W_{Q/K/V} ∈ R^{D×(D/h)} denotes the projection matrix for the query/key/value, respectively. Then, the self-attention operation is calculated as

Attention(Q, K, V) = softmax(QK^T / √D) V,  (1)

where the output of each head is of size R^{N×(D/h)}. The features of all h heads are then concatenated along the channel dimension to form the MHSA module's output.

The FFN is placed on top of the MHSA module and applied to each token identically and separately. It consists of two linear transformations with an activation function in between. Besides, a layer normalization [2] and a shortcut are added before and aside from the MHSA and FFN, respectively.
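To make the above concrete, here is a minimal PyTorch sketch of one such transformer layer (layer normalization, MHSA, and a two-layer FFN, each with a residual connection). The class and argument names are ours for illustration and are not taken from any released ViTAE code:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Plain pre-norm transformer layer: MHSA + FFN, each with a residual shortcut."""
    def __init__(self, dim=256, num_heads=4, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                       # two linear layers with GELU in between
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                               # x: [B, N, D] token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # Eq. (1) happens inside MHSA
        x = x + self.ffn(self.norm2(x))
        return x

tokens = torch.randn(2, 197, 256)                       # 196 patch tokens + 1 class token
print(TransformerLayer()(tokens).shape)                 # torch.Size([2, 197, 256])
```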
Fig. 2  The structure of the proposed ViTAE. It is constructed by stacking three RCs and several NCs. Both types of cells share a simple basic structure, i.e., an MHSA module and a parallel convolutional module followed by an FFN. In particular, the RC has an extra pyramid reduction module using atrous convolutions with different dilation rates to embed multi-scale context.

3.2 The isotropic design of ViTAE

ViTAE aims to introduce the intrinsic IB of CNNs into vision transformers. As shown in Figure 2, ViTAE is composed of two types of cells, i.e., RCs and NCs. RCs are responsible for downsampling while embedding multi-scale context and local information into tokens, and NCs are used to further model the locality and long-range dependencies in the tokens. Taking an image x ∈ R^{H×W×C} as input, three RCs are used to gradually downsample x by a total ratio of 16×, i.e., by 4×, 2×, and 2×, respectively. Thereby, the output tokens of the RCs after downsampling are of size [H/16, W/16, D], where D is the token dimension (64 in our experiments). The output tokens of the RCs are then flattened as R^{(HW/256)×D}, concatenated with the class token (red in the figure), and added to the sinusoid position encoding. Next, the tokens are fed into the following NCs, which keep the length of the token sequence. Finally, the prediction probability is obtained using a linear classification layer on the class token from the last NC.

3.2.1 Reduction cell

Instead of directly splitting and flattening images into visual tokens based on a linear image patch embedding layer, we devise the reduction cell to embed multi-scale context and local information into visual tokens, introducing the intrinsic scale-invariance and locality IBs from convolutions into ViTAE. Technically, the RC has two parallel branches responsible for modeling locality and long-range dependency, followed by an FFN for feature transformation. We denote the input feature of the ith RC as f_i ∈ R^{H_i×W_i×D_i}. The input of the first RC is the image x. In the global dependency branch, f_i is first fed into a Pyramid Reduction Module (PRM) to extract multi-scale context, i.e.,

f_i^{ms} = PRM_i(f_i) = Cat([Conv_{ij}(f_i; s_{ij}, r_i) | s_{ij} ∈ S_i, r_i ∈ R]),  (2)

where Conv_{ij}(·) indicates the jth convolutional layer in the ith PRM (i.e., PRM_i(·)). It uses a dilation rate s_{ij} from the predefined dilation rate set S_i corresponding to the ith RC. Note that we use strided convolution to reduce the spatial dimension of features by a ratio r_i from the predefined reduction ratio set R. The features after convolution are concatenated along the channel dimension, i.e., f_i^{ms} ∈ R^{(W_i/r_i)×(H_i/r_i)×(|S_i|D_i)}, where |S_i| denotes the number of dilation rates in the set S_i. f_i^{ms} is then processed by an MHSA module to model long-range dependencies, i.e.,

f_i^g = MHSA_i(Img2Seq(f_i^{ms})),  (3)

where Img2Seq(·) is a simple reshape operation that flattens the feature map into a 1D sequence. In this way, f_i^g embeds the multi-scale context in each token. Note that the traditional MHSA attends to each token individually at a single scale and thus lacks the ability to model the relationship between tokens at different scales. By contrast, the multi-scale convolutions introduced in the reduction cells can (1) mitigate the information loss when merging tokens by looking at a larger field and (2) embed multi-scale information into tokens to help the following MHSA model better global dependencies based on features at different scales.

In addition, we use a Parallel Convolutional Module (PCM) to embed local context within the tokens, which is fused with f_i^g as follows:

f_i^{lg} = f_i^g + PCM_i(f_i).  (4)

Here, PCM_i(·) represents the PCM of the ith RC, which is composed of an Img2Seq(·) operation and three stacked convolution layers with BN layers and activation layers in between. It is noteworthy that the parallel convolution branch has the same spatial downsampling ratio as the PRM by using strided convolutions. In this way, the token features carry both local and multi-scale context, implying that the RC acquires the locality IB and scale-invariance IB by design. The fused tokens are then processed by the FFN and reshaped back to feature maps, i.e.,

f_{i+1} = Seq2Img(FFN_i(f_i^{lg}) + f_i^{lg}),  (5)

where Seq2Img(·) is a simple reshape operation that reshapes a token sequence back to feature maps, and FFN_i(·) represents the FFN in the ith RC. In our ViTAE, three RCs are stacked sequentially to gradually reduce the input image's spatial dimension by 4×, 2×, and 2×, respectively. As the first RC handles images at high resolution, we adopt Performer [14] to reduce the computational burden and memory cost.
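The following is a simplified PyTorch sketch of the reduction cell computation in Eqs. (2)-(5), assuming a single dilation set S = (1, 2, 3), one reduction ratio, vanilla full attention instead of Performer or window attention, and omitting normalization details; all class and variable names are ours and not from the released code:

```python
import torch
import torch.nn as nn

class ReductionCellSketch(nn.Module):
    """Simplified RC: PRM (parallel dilated convs) + MHSA, fused with a parallel conv module (PCM)."""
    def __init__(self, in_ch=3, dim=64, dilations=(1, 2, 3), stride=2, heads=1):
        super().__init__()
        # PRM: one strided conv per dilation rate; outputs are concatenated along channels (Eq. 2)
        self.prm = nn.ModuleList([
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=stride, padding=d, dilation=d)
            for d in dilations
        ])
        ms_dim = dim * len(dilations)
        self.attn = nn.MultiheadAttention(ms_dim, heads, batch_first=True)          # Eq. (3)
        # PCM: stacked convs on the raw input with the same total stride (locality branch, Eq. 4)
        self.pcm = nn.Sequential(
            nn.Conv2d(in_ch, ms_dim, 3, stride=stride, padding=1), nn.BatchNorm2d(ms_dim), nn.SiLU(),
            nn.Conv2d(ms_dim, ms_dim, 3, stride=1, padding=1),
        )
        self.ffn = nn.Sequential(nn.Linear(ms_dim, 4 * ms_dim), nn.GELU(), nn.Linear(4 * ms_dim, ms_dim))

    def forward(self, x):                                       # x: [B, C, H, W]
        ms = torch.cat([conv(x) for conv in self.prm], dim=1)   # f_ms, Eq. (2)
        B, C, H, W = ms.shape
        seq = ms.flatten(2).transpose(1, 2)                     # Img2Seq
        g = self.attn(seq, seq, seq, need_weights=False)[0]     # f_g, Eq. (3)
        local = self.pcm(x).flatten(2).transpose(1, 2)          # PCM branch, flattened to a sequence
        lg = g + local                                          # f_lg, Eq. (4)
        out = self.ffn(lg) + lg                                 # FFN with residual, Eq. (5)
        return out.transpose(1, 2).reshape(B, C, H, W)          # Seq2Img

x = torch.randn(1, 3, 32, 32)
print(ReductionCellSketch()(x).shape)                           # torch.Size([1, 192, 16, 16])
```

Note how padding each dilated convolution by its own dilation rate keeps all PRM branches at the same output resolution, so their features can be concatenated channel-wise.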
3.2.2 Normal cell

As shown in the bottom right part of Figure 2, NCs share a similar structure with the RC except for the absence of the PRM, which provides rich spatial information by aggregating multi-scale information and compensates for the spatial information loss caused by downsampling in the RC. Given features that already contain the multi-scale information, NCs are expected to focus on modeling long- and short-range dependencies among the features. Besides, omitting the PRM in NCs also helps reduce the computational cost, given the large number of NCs in the stacked models. Therefore, we do not use the PRM in the NC. Specifically, given f_3 from the third RC, we first concatenate it with the class token t_cls and then add it to the positional encodings to get the input tokens t for the following NCs. We ignore the subscript for clarity since all NCs have an identical architecture but different learnable weights. t_cls is randomly initialized at the start of training and fixed during inference. Similar to the RC, the tokens are fed into the MHSA module, i.e., t_g = MHSA(t). Meanwhile, they are reshaped into 2D feature maps and fed into the PCM, i.e., t_l = Img2Seq(PCM(Seq2Img(t))). Note that the class token is discarded in the PCM because it has no spatial connection with the other visual tokens. To further reduce the parameters in NCs, we use group convolutions in the PCM. The features from the MHSA and PCM are then fused via element-wise sum, i.e., t_lg = t_g + t_l. Finally, t_lg is fed into the FFN to get the output features of the NC, i.e., t_nc = FFN(t_lg) + t_lg. Similar to ViT [22], we apply layer normalization to the class token generated by the last NC and feed it to the classification head to get the final classification result.

3.3 Scaling up ViTAE via self-supervised learning

Besides stacking the proposed RCs and NCs to construct the isotropic ViTAE models with 4M, 6M, 13M, and 24M parameters, we also scale up ViTAE to evaluate the benefit of introducing the inductive bias in vision transformers with large model sizes. Specifically, we follow the setting in ViT [22] to scale up the proposed ViTAE model, i.e., we embed the image into visual tokens and process them using stacked NCs to extract features. The stacking strategy is exactly the same as that adopted in ViT [22]: we use 12 NCs with 768 embedding dimensions to construct the ViTAE base model (i.e., ViTAE-B with 89M parameters), 24 NCs with 1,024 embedding dimensions to construct the ViTAE large model (i.e., ViTAE-L with 311M parameters), and 36 normal cells with 1,248 embedding dimensions to construct the ViTAE huge model (i.e., ViTAE-H with 644M parameters). The normal cells are stacked sequentially. However, the scaled-up models easily overfit if only trained on the ImageNet-1K dataset under a fully supervised training setting. Self-supervised learning [27], on the contrary, can alleviate this issue and facilitate the training of scaled-up models. In this paper, we adopt MAE [27] to train the scaled-up ViTAE models owing to its simplicity and efficiency. Specifically, we first embed the input images into tokens and then randomly remove 75% of the tokens. The removed tokens are filled with randomly initialized mask tokens. After that, the remaining visible tokens are processed by the ViTAE model for feature extraction. The extracted features and the mask tokens are then concatenated and fed into the decoder network to predict the pixel values of the masked regions. The mean squared error between the prediction and the masked pixels is minimized during training.

However, as the encoder only processes the visible tokens, i.e., the tokens that remain after removal, the built-in locality property of images is broken among the visual tokens. To adapt the proposed ViTAE model to this self-supervised task, we simply use convolutions with kernel size 1×1 instead of 3×3 to formulate the ViTAE model for pretraining. This simple modification helps us preserve a similar architecture between the network's pretraining and finetuning stages and helps the convolution branch learn a meaningful initialization, as demonstrated in [95]. After the pretraining stage, we convert the convolution kernels from 1×1 to 3×3 by zero-padding to recover the complete ViTAE models, which are further finetuned on the ImageNet-1K training data for 50 epochs. Inspired by [3], we use layer-wise learning rate decay during finetuning to adapt the pre-trained models to specific vision tasks.
3.4 The multi-stage design for ViTAEv2

Apart from classification, other downstream tasks, including object detection, semantic segmentation, and pose estimation, are also very important for a general backbone to adapt to. These downstream tasks usually need to extract multi-level features from the backbone to deal with objects at different scales. To this end, we extend the vanilla ViTAE model to the multi-stage design, i.e., ViTAEv2. A natural choice for the design of ViTAEv2 is to re-construct the model by re-organizing RCs and NCs. As shown in Figure 3, ViTAEv2 has four stages, where four corresponding RCs are used to gradually downsample the features by 4×, 2×, 2×, and 2×, respectively. At each stage, a number of normal cells are sequentially stacked following the ith RC (note that in the isotropic design a series of NCs is used only at the most coarse stage). The number of normal cells, i.e., N_i, controls the model depth and size. By doing so, ViTAEv2 can extract a feature pyramid from different stages, which can be used by decoders specifically designed for various downstream tasks.

One remaining issue is that the vanilla attention operations in transformers have a quadratic computational complexity, therefore requiring a large memory footprint and computation cost, especially for feature maps with a large resolution. In contrast to the fast resolution reduction in the vanilla ViTAE design, we adopt a slow resolution reduction strategy in the multi-stage design, e.g., the resolution of the feature maps at the first stage is only 1/4 of the original image size, thereby incurring more computational cost, especially when the images in downstream tasks have high resolutions. To mitigate this issue, we further explore another inductive bias, i.e., the local window attention introduced in [49], in the RC and NC modules, as sketched below. Specifically, window attention splits the whole feature map into several non-overlapping local windows and conducts multi-head self-attention within each window, i.e., each query token within the same window shares the same key and value sets. Since the parallel convolution branch in the proposed two cells can encode position information and enable inter-window information exchange, special designs like the relative position encoding and window-shifting mechanism in [49] can be omitted. We empirically find that replacing the full attention with local window attention at early stages achieves a good trade-off between computational cost and performance. Therefore, we only use local window attention in the RC and NC modules at the first two stages. Consequently, our ViTAEv2 models deliver superior performance on various vision tasks, including image classification, object detection, semantic segmentation, and pose estimation, while keeping a fast inference speed and reasonable memory footprint.
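A minimal sketch of the non-overlapping window partition used for local window attention (no relative position encoding or window shifting, as discussed above); the helper names are ours, modeled on common window-attention implementations rather than the released ViTAEv2 code:

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a feature map [B, H, W, C] into non-overlapping windows [B*(H/ws)*(W/ws), ws*ws, C]."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows: torch.Tensor, ws: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: back to [B, H, W, C]."""
    B = windows.shape[0] // (H // ws * W // ws)
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

feat = torch.randn(2, 56, 56, 64)
wins = window_partition(feat, ws=7)        # [128, 49, 64]: MHSA is then applied within each window
print(wins.shape, window_reverse(wins, 7, 56, 56).shape)
```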
3.5 Model details

In this paper, we propose ViTAE and further extend it to the multi-stage version ViTAEv2 as described above. We devise several ViTAE and ViTAEv2 variants in our experiments to be compared with other models of similar size. Their details are summarized in Table 1. The 'dilation' column determines the dilation rate sets S in each RC. The 'RC' and 'NC' columns denote the specific configurations of RCs and NCs, respectively, where 'P', 'W', and 'F' refer to Performer [14], local window attention, and the vanilla full attention, and the accompanying numbers denote the number of heads in the corresponding attention module. The 'arrangement' column denotes the number of NCs at each stage, while 'embedding' denotes the token embedding size at each stage. Specifically, the default convolution kernel size in the first RC is 7 × 7 with a stride of 4 and dilation rates from S1 = [1, 2, 3, 4]. In the following two RCs (or three RCs for ViTAEv2), the convolution kernel size is 3 × 3 with a stride of 2 and dilation rates from S2 = [1, 2, 3] and S3 = [1, 2] (and S4 = [1, 2] for ViTAEv2), respectively. Since the number of tokens decreases at later stages, there is no need to use large kernels and dilation rates there. The PCM in both RCs and NCs comprises three convolutional layers with a kernel size of 3 × 3.

4 Experiments

4.1 Implementation details

Unless explicitly stated, we train and test the proposed ViTAE and ViTAEv2 models on the ImageNet-1K [39] dataset, which contains about 1.3 million images from 1K classes. The image size during training is set to 224×224. We use the AdamW [51] optimizer with a cosine learning rate scheduler and exactly the same data augmentation strategy as T2T [92] for a fair comparison regarding training strategies and model sizes. We use a batch size of 512 for training ViTAE and 1024 for ViTAEv2. The learning rate is set proportionally to the batch size, with a base value of 5e-4 for a batch size of 512. The results of our models can be found in Table 2, where all the models are trained for 300 epochs. The models are built on PyTorch [57] and TIMM [79].
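A rough optimizer and schedule setup matching the hyperparameters stated above (AdamW, cosine decay, base learning rate 5e-4 scaled by batch size / 512, 300 epochs); the warmup, weight decay value, and augmentation pipeline follow T2T-ViT and are not specified here, so the values below are placeholders:

```python
import torch

model = torch.nn.Linear(10, 10)            # stand-in for a ViTAE/ViTAEv2 model
batch_size, epochs, base_lr = 512, 300, 5e-4
lr = base_lr * batch_size / 512            # linear scaling of the base learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.05)   # weight decay assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
for epoch in range(epochs):
    # ... one training epoch over ImageNet-1K would go here ...
    scheduler.step()
```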
Fig. 3  The structure of the proposed multi-stage design ViTAEv2. The RCs and NCs are re-arranged in a stage-wise manner. At each stage, a number of NCs are sequentially stacked following each RC, which gradually downsamples the features by a certain ratio, i.e., 4×, 2×, 2×, and 2×, respectively.

Table 1  Model details of ViTAE and ViTAEv2 variants. The 'RC' and 'NC' columns give the configurations of RCs and NCs, i.e., the attention type and the number of heads per stage. '-' denotes that there is no RC at the corresponding stage. 'arrangement' and 'embedding' denote the number of NCs and the token embedding size at each stage. 'P', 'W', and 'F' represent Performer, local window attention, and the vanilla full attention, respectively.

Model | dilation | RC (attention / heads) | NC (attention / heads) | arrangement | embedding | Params (M) | Flops (G)
ViTAE-T | [1,2,3,4] | P, P, P / 1, 1, 1 | F / 4 | 0, 0, 7 | 64, 64, 256 | 4.8 | 1.5
ViTAE-6M | [1,2,3,4] | P, P, P / 1, 1, 1 | F / 4 | 0, 0, 10 | 64, 64, 256 | 6.5 | 2.0
ViTAE-13M | [1,2,3,4] | P, P, P / 1, 1, 1 | F / 4 | 0, 0, 11 | 64, 64, 320 | 13.2 | 3.3
ViTAE-S | [1,2,3,4] | P, P, P / 1, 1, 1 | F / 4 | 0, 0, 14 | 96, 192, 384 | 23.6 | 5.6
ViTAE-B | - | - | F / 12 | 0, 0, 12 | 768 | 89.3 | 36.9
ViTAE-L | - | - | F / 16 | 0, 0, 24 | 1024 | 311.7 | 125.8
ViTAE-H | - | - | F / 16 | 0, 0, 32 | 1280 | 644.3 | 335.7
ViTAEv2-S | [1,2,3,4] ↓ | W, W, F, F / 1, 1, 2, 4 | W, W, F, F / 1, 2, 4, 8 | 2, 2, 8, 2 | 64, 128, 256, 512 | 19.3 | 5.7
ViTAEv2-48M | [1,2,3,4] ↓ | W, W, F, F / 1, 1, 2, 4 | W, W, F, F / 1, 2, 4, 8 | 2, 2, 11, 2 | 96, 192, 384, 768 | 48.7 | 13.3
ViTAEv2-B | [1,2,3,4] ↓ | W, W, F, F / 1, 1, 2, 4 | W, W, F, F / 1, 2, 4, 8 | 2, 2, 12, 2 | 128, 256, 512, 1024 | 89.7 | 24.3
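For illustration, the stage-wise arrangement in Table 1 could be expressed as a configuration like the following, shown here for ViTAEv2-S; this is a hypothetical format of ours, not the layout used by the released code:

```python
# Hypothetical configuration mirroring the ViTAEv2-S row of Table 1.
vitaev2_s = dict(
    stages=4,
    rc_dilations=[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1, 2]],   # S1..S4, shrinking with depth
    rc_attention=["window", "window", "full", "full"],        # W, W, F, F
    rc_heads=[1, 1, 2, 4],
    nc_attention=["window", "window", "full", "full"],
    nc_heads=[1, 2, 4, 8],
    nc_per_stage=[2, 2, 8, 2],                                # 'arrangement' column
    embed_dims=[64, 128, 256, 512],                           # token embedding size per stage
    downsample=[4, 2, 2, 2],                                  # reduction ratio of each RC
)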
4.2 Comparison with the state of the art

We compare our ViTAE and ViTAEv2 with both CNN models and vision transformers of similar model sizes in Table 2 and Table 3. Both Top-1/5 accuracy and real Top-1 accuracy [5] on the ImageNet validation set are reported. We categorize the methods into CNN models, vision transformers with learned IB, and vision transformers with introduced intrinsic IB. Compared with CNN models, our ViTAE-T achieves 75.3% Top-1 accuracy, which is better than ResNet-18 despite having fewer parameters. The real Top-1 accuracy of ViTAE-T is 82.9%, which is comparable to that of ResNet-50, a model with more than four times as many parameters as ours. Similar phenomena can also be observed when comparing ViTAE-T with MobileNetV1 [33] and MobileNetV2 [64], where ViTAE obtains better performance with fewer parameters. ViTAE-S achieves 82.0% Top-1 accuracy with half of the parameters of ResNet-101 and ResNet-152, showing the superiority of learning both local and long-range features with structures that have the corresponding intrinsic IBs by design. When adopting the multi-stage design, ViTAEv2-S further improves the Top-1 accuracy to 82.6%. When finetuning the model using images of a larger resolution, e.g., 384 × 384 images as input, ViTAE-S's performance is further improved by 1.2% absolute Top-1 accuracy. ViTAEv2-48M in Table 3 also benefits from this, and its performance increases from 83.8% to 84.7%, which further shows the potential of vision transformers with intrinsic IBs for the large-resolution images that are common in downstream dense prediction tasks. When the model size increases to 88M, ViTAEv2-B reaches 84.6% Top-1 accuracy, significantly outperforming other transformer models including Swin-B [49], Focal-B [87], and CrossFormer-B [77]. When finetuning at a larger resolution or pretraining the model with ImageNet-22K, ViTAEv2-B's performance increases to 85.3% and 86.1% Top-1 accuracy, respectively, confirming that the benefit of the introduced IBs scales to large models and large-scale training datasets.
Table 2  Comparison with SOTA methods. ↑ 384 denotes finetuning the model using images of 384×384 resolution.

Type | Model | Params (M) | FLOPs (G) | Input size | ImageNet Top-1 | ImageNet Top-5 | Real Top-1
CNN | ResNet-18 [30] | 11.7 | 1.8 | 224 | 70.3 | 86.7 | 77.3
CNN | ResNet-50 [30] | 25.6 | 3.8 | 224 | 76.7 | 93.3 | 82.5
CNN | ResNet-101 [30] | 44.5 | 7.6 | 224 | 78.3 | 94.1 | 83.7
CNN | ResNet-152 [30] | 60.2 | 11.3 | 224 | 78.9 | 94.4 | 84.1
CNN | EfficientNet-B0 [70] | 5.3 | 0.4 | 224 | 77.1 | 93.3 | 83.5
CNN | EfficientNet-B4 [70] | 19.3 | 4.2 | 380 | 82.9 | 96.4 | 88.0
CNN | RegNetY-600M [61] | 6.1 | 0.6 | 224 | 75.5 | - | -
CNN | RegNetY-4GF [61] | 20.6 | 4.0 | 224 | 80.0 | - | 86.4
CNN | RegNetY-8GF [61] | 39.2 | 8.0 | 224 | 81.7 | - | 87.4
Transformer | DeiT-T [72] | 5.7 | 1.3 | 224 | 72.2 | 91.1 | 80.6
Transformer | DeiT-T ⚗ [72] | 5.7 | 1.3 | 224 | 74.5 | 91.9 | 82.1
Transformer | PiT-Ti [32] | 4.9 | 0.7 | 224 | 73.0 | - | -
Transformer | T2T-ViT-7 [92] | 4.3 | 1.2 | 224 | 71.7 | 90.9 | 79.7
Transformer | ViTAE-T | 4.8 | 1.5 | 224 | 75.3 | 92.7 | 82.9
Transformer | CeiT-T [91] | 6.4 | 1.2 | 224 | 76.4 | 93.4 | 83.6
Transformer | ConViT-Ti [18] | 6.0 | 1.0 | 224 | 73.1 | - | -
Transformer | CrossViT-Ti [9] | 6.9 | 1.6 | 224 | 73.4 | - | -
Transformer | ViTAE-6M | 6.5 | 2.0 | 224 | 77.9 | 94.1 | 84.9
Transformer | PVT-T [76] | 13.2 | 1.9 | 224 | 75.1 | - | -
Transformer | ConViT-Ti+ [18] | 10.0 | 2.0 | 224 | 76.7 | - | -
Transformer | PiT-XS [32] | 10.6 | 1.4 | 224 | 78.1 | - | -
Transformer | ConT-M [86] | 19.2 | 3.1 | 224 | 80.2 | - | -
Transformer | ViTAE-13M | 13.2 | 3.3 | 224 | 81.0 | 95.4 | 86.8
Transformer | DeiT-S [72] | 22.1 | 4.6 | 224 | 79.9 | 95.0 | 85.7
Transformer | DeiT-S ⚗ [72] | 22.1 | 4.6 | 224 | 81.2 | 95.4 | 86.8
Transformer | PVT-S [76] | 24.5 | 3.8 | 224 | 79.8 | - | -
Transformer | Conformer-Ti [58] | 23.5 | 5.2 | 224 | 81.3 | - | -
Transformer | Swin-T [49] | 29.0 | 4.5 | 224 | 81.3 | - | -
Transformer | CeiT-S [91] | 24.2 | 4.5 | 224 | 82.0 | 95.9 | 87.3
Transformer | CvT-13 [80] | 20.0 | 4.5 | 224 | 81.6 | - | 86.7
Transformer | ConViT-S [18] | 27.0 | 5.4 | 224 | 81.3 | - | -
Transformer | CrossViT-S [9] | 26.7 | 5.6 | 224 | 81.0 | - | -
Transformer | PiT-S [32] | 23.5 | 4.8 | 224 | 80.9 | - | -
Transformer | TNT-S [26] | 23.8 | 5.2 | 224 | 81.3 | 95.6 | -
Transformer | Twins-PCPVT-S [15] | 24.1 | 3.8 | 224 | 81.2 | - | -
Transformer | Twins-SVT-S [15] | 24.0 | 2.9 | 224 | 81.7 | - | -
Transformer | T2T-ViT-14 [92] | 21.5 | 5.2 | 224 | 81.5 | 95.7 | 86.8
Transformer | ViTAE-S | 23.6 | 5.6 | 224 | 82.0 | 95.9 | 87.0
Transformer | ViTAE-S ↑ 384 | 23.6 | 20.2 | 384 | 83.0 | 96.2 | 87.5
Transformer | ViTAEv2-S | 19.2 | 5.7 | 224 | 82.6 | 96.2 | 87.6
Transformer | ViTAEv2-S ↑ 384 | 19.2 | 17.8 | 384 | 83.8 | 96.7 | 88.3

4.3 Analysis of the isotropic design of ViTAE

4.3.1 Data efficiency and training efficiency

To validate the effectiveness of introducing intrinsic IBs in improving data efficiency and training efficiency, we compare our ViTAE-T model with the baseline T2T-ViT-7 under different training settings: (a) training them using 20%, 60%, and 100% of the ImageNet training set for the equivalent of 100 epochs over the full training set, e.g., we use five times as many epochs when training with 20% of the data as when using 100% of the data; and (b) training them on the full ImageNet training set for 100, 200, and 300 epochs, respectively. The results are shown in Figure 1. As can be seen, ViTAE-T consistently outperforms the T2T-ViT-7 baseline by a large margin in terms of both data efficiency and training efficiency. For example, ViTAE-T using only 20% of the training data achieves performance comparable to T2T-ViT-7 using all the data. When 60% of the training data is used, ViTAE-T significantly outperforms T2T-ViT-7 trained on all the data by about 3% absolute accuracy.
Table 3  Comparison with SOTA methods (Table 2 continued). ↑ 384 denotes finetuning the model using images of 384×384 resolution, while * denotes the model pretrained using ImageNet-22K. All listed models are transformers.

Model | Params (M) | FLOPs (G) | Input size | ImageNet Top-1 | ImageNet Top-5 | Real Top-1
ViT-B/16 [22] | 86.5 | 35.1 | 384 | 77.9 | - | -
ViT-L/16 [22] | 304.3 | 122.9 | 384 | 76.5 | - | -
DeiT-B [72] | 86.6 | 17.5 | 224 | 81.8 | 95.6 | 86.7
PVT-M [76] | 44.2 | 13.2 | 224 | 81.2 | - | -
PVT-L [76] | 61.4 | 9.8 | 224 | 81.7 | - | -
Conformer-S [58] | 37.7 | 10.6 | 224 | 83.4 | - | -
Swin-S [49] | 50.0 | 8.7 | 224 | 83.0 | - | -
ConT-B [86] | 39.6 | 6.4 | 224 | 81.8 | - | -
CvT-21 [80] | 32.0 | 7.2 | 224 | 82.5 | - | 87.2
ConViT-S+ [18] | 48.0 | 10.0 | 224 | 82.2 | - | -
ConViT-B [18] | 86.0 | 17.0 | 224 | 82.4 | - | -
ConViT-B+ [18] | 152.0 | 30.0 | 224 | 82.5 | - | -
PiT-B [32] | 73.8 | 12.5 | 224 | 82.0 | - | -
TNT-B [26] | 65.6 | 14.1 | 224 | 82.8 | 96.3 | -
T2T-ViT-19 [92] | 39.2 | 8.9 | 224 | 81.9 | 95.7 | 86.9
ViL-Base [96] | 55.7 | 13.4 | 224 | 83.2 | - | -
ViTAEv2-48M | 48.5 | 13.3 | 224 | 83.8 | 96.6 | 88.4
ViTAEv2-48M ↑ 384 | 48.5 | 41.1 | 384 | 84.7 | 97.0 | 88.8
Swin-B [49] | 88.0 | 15.4 | 224 | 83.3 | - | -
Twins-SVT-L [15] | 99.2 | 14.8 | 224 | 83.7 | - | -
PVTv2-B5 [75] | 82.0 | 11.8 | 224 | 83.8 | - | -
Focal-B [87] | 89.8 | 16.0 | 224 | 83.8 | - | -
DAT-B [81] | 88.0 | 15.4 | 224 | 84.0 | - | -
CrossFormer [77] | 92.0 | 16.1 | 224 | 84.0 | - | -
Conformer-B [58] | 83.3 | 23.3 | 224 | 84.1 | - | -
ViTAEv2-B | 89.7 | 24.3 | 224 | 84.6 | 96.9 | 88.7
ViTAEv2-B ↑ 384 | 89.7 | 74.4 | 384 | 85.3 | 97.1 | 89.2
ViTAEv2-B* | 89.7 | 24.3 | 224 | 86.1 | 97.9 | 89.9

It is also noteworthy that ViTAE-T trained for only 100 epochs already outperforms T2T-ViT-7 trained for 300 epochs. After training ViTAE-T for 300 epochs, its performance is further boosted to 75.3% Top-1 accuracy. With the proposed RCs and NCs, the transformer layers in our ViTAE only need to focus on modeling long-range dependencies, leaving the locality and multi-scale context modeling to their convolution counterparts, i.e., the PCM and PRM. Such a "divide-and-conquer" strategy facilitates ViTAE's training, making learning more efficient with less training data and fewer training epochs.

4.3.2 Generalization on downstream classification tasks

We further investigate the generalization of the proposed ViTAE models pre-trained on ImageNet-1K to downstream image classification tasks by fine-tuning them on the training sets of several fine-grained classification tasks, including Flowers [54], Cars [37], Pets [56], and iNaturalist19. We also fine-tune the proposed ViTAE models pre-trained on ImageNet-1K on Cifar10 [38] and Cifar100 [38]. The results are shown in Table 4. It can be seen that ViTAE achieves SOTA performance on most of the datasets with comparable or fewer parameters. These results demonstrate the good generalization ability of our ViTAE models.

4.3.3 Ablation study of the design of RC and NC

We use T2T-ViT [92] as our baseline model in the following ablation study of our ViTAE. As shown in Table 5, we investigate the hyper-parameter settings of the RC and NC in the ViTAE-T model by isolating them separately. All the models are trained for 100 epochs on ImageNet-1K, following the same training setting and data augmentation strategy as described in Section 4.1. We use ✓ and × to denote whether or not the corresponding module is enabled in the experiments. If all columns under RC and NC are marked ×, as in the first row, the model becomes the standard T2T-ViT-7 model. "Pre" denotes the early fusion strategy that fuses the output features of the PCM and MHSA before the FFN, while "Post" denotes a late fusion strategy instead. A ✓ in "BN" denotes that the PCM uses BN. "×3" in the first column denotes that the dilation rate set is the same in the three RCs.
“[1, 2, 3, 4] ↓” denotes using Table 4. It can be seen that ViTAE achieves SOTA per- smaller dilation rates in deeper RCs, i.e., S1 = [1, 2, 3, 4], formance on most of the datasets using comparable or S2 = [1, 2, 3], S3 = [1, 2].
Table 4  Generalization of ViTAE and SOTA methods on different downstream image classification tasks.

Model | Params (M) | Cifar10 | Cifar100 | iNat19 | Cars | Flowers | Pets
Grafit ResNet-50 [73] | 25.6 | - | - | 75.9 | 92.5 | 98.2 | -
EfficientNet-B5 [70] | 30 | 98.1 | 91.1 | - | - | 98.5 | -
ViT-B/16 [22] | 86.5 | 98.1 | 87.1 | - | - | 89.5 | 93.8
ViT-L/16 [22] | 304.3 | 97.9 | 86.4 | - | - | 89.7 | 93.6
DeiT-B [72] | 86.6 | 99.1 | 90.8 | 77.7 | 92.1 | 98.4 | -
T2T-ViT-14 [92] | 21.5 | 98.3 | 88.4 | - | - | - | -
ViTAE-T | 4.8 | 97.3 | 86.0 | 73.3 | 89.5 | 97.5 | 92.6
ViTAE-S | 23.6 | 98.8 | 90.8 | 76.0 | 91.4 | 97.8 | 94.2

Table 5  Ablation study of the RC and NC in ViTAE-T. "Pre" denotes the early fusion strategy that fuses the output features of the PCM and MHSA before the FFN, while "Post" denotes a late fusion strategy instead. A ✓ in "BN" denotes that the PCM uses BN. "×3" in the first column denotes that the dilation rate set is the same in the three RCs. "[1, 2, 3, 4] ↓" denotes using smaller dilation rates in deeper RCs, i.e., S1 = [1, 2, 3, 4], S2 = [1, 2, 3], S3 = [1, 2].

RC: Dilation (S1–S3) | RC: PCM | NC: Pre | NC: Post | NC: BN | Top-1
× | × | × | × | × | 68.7
× | × | ✓ | × | × | 69.1
× | × | × | ✓ | × | 69.0
× | × | × | ✓ | ✓ | 68.8
× | × | ✓ | × | ✓ | 69.9
[1, 2] × 3 | × | × | × | × | 69.5
[1, 2, 3] × 3 | × | × | × | × | 69.9
[1, 2, 3, 4] × 3 | × | × | × | × | 69.2
[1, 2, 3, 4, 5] × 3 | × | × | × | × | 68.9
[1, 2, 3, 4] ↓ | × | × | × | × | 69.8
[1, 2, 3, 4] ↓ | ✓ | × | × | × | 71.7
[1, 2, 3, 4] ↓ | ✓ | ✓ | × | ✓ | 72.6

As can be seen, using the early fusion strategy and BN in the NC achieves the best Top-1 accuracy of 69.9% among these settings. It is noteworthy that all the NC variants outperform the vanilla T2T-ViT, implying the effectiveness of the PCM, which introduces the intrinsic locality IB into transformers. It can also be observed that BN plays an important role in improving the model's performance, as it helps alleviate the scale deviation between the convolution features and the attention features. For the RC, we first investigate the influence of using different dilation rates in the PRM, as shown in the first column. As can be seen, using larger dilation rates (e.g., 4 or 5) does not deliver better performance. We suspect that larger dilation rates may lead to plain features in the deeper RCs due to the smaller resolution of their feature maps. To validate this hypothesis, we use smaller dilation rates in deeper RCs, denoted by [1, 2, 3, 4] ↓. As can be seen, it achieves performance comparable to [1, 2, 3] × 3. However, compared with [1, 2, 3, 4] ↓, [1, 2, 3] × 3 increases the number of parameters from 4.35M to 4.6M. Therefore, we select [1, 2, 3, 4] ↓ as the default setting. In addition, using the PCM in the RC introduces the intrinsic locality IB and the performance increases to 71.7% Top-1 accuracy. Finally, the combination of RCs and NCs achieves the best accuracy of 72.6%, demonstrating their complementarity.

4.3.4 Visual inspection of ViTAE

To further analyze the properties of our ViTAE, we apply Grad-CAM [65] to the MHSA's output in the last NC to qualitatively inspect ViTAE. The visualization results are provided in Figure 4. Compared with the baseline T2T-ViT, our ViTAE covers the single or multiple targets in the images more precisely and attends less to the background. Moreover, ViTAE can better handle the scale variance issue, as shown in Figure 4(b): it covers birds accurately whether they are small, medium, or large in size. These observations demonstrate that introducing the intrinsic IBs of locality and scale-invariance from convolutions into transformers helps ViTAE learn more discriminative features than the pure transformers.

Besides, we calculate the average attention distance of each layer in ViTAE-T and the baseline T2T-ViT-7 on the ImageNet validation set, respectively. The results are shown in Figure 5. It can be observed that with the usage of the PCM, which focuses on modeling locality, the transformer layers in the proposed NCs can better focus on modeling long-range dependencies, especially in shallow layers. In the deep layers, the average attention distances of ViTAE-T and T2T-ViT-7 are almost the same, where modeling long-range dependencies is much more important. This implies that the PCM does not affect the transformer's behavior in deep layers. These results confirm the effectiveness of the adopted "divide-and-conquer" idea in ViTAE, i.e., introducing the intrinsic locality IB from convolutions into vision transformers makes it possible for the transformer layers to be responsible only for long-range dependencies, since the convolutions in the PCM can model locality well.
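The average attention distance reported in Figure 5 can be computed roughly as follows: weight the spatial distance between token positions by the attention probabilities and average over queries and heads. The sketch below is our own, assuming a square patch grid and a given patch size in pixels, and is not taken from the paper's code:

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch_px: int) -> float:
    """attn: [heads, N, N] attention probabilities over N = grid*grid patch tokens.

    Returns the attention-weighted mean distance (in pixels) between query and key patches.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch_px   # [N, 2] pixel coords
    dist = torch.cdist(pos, pos)                   # [N, N] pairwise distances between patch centers
    return (attn * dist).sum(-1).mean().item()     # expected distance per query, averaged over heads

attn = torch.softmax(torch.randn(4, 196, 196), dim=-1)   # e.g., 14x14 tokens, 4 heads
print(mean_attention_distance(attn, grid=14, patch_px=16))
```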
Fig. 4  Visual inspection of T2T-ViT-7 and ViTAE-T using Grad-CAM [65]. (a) Images containing multiple or single objects and the heatmaps obtained by T2T-ViT-7 and ViTAE-T. (b) Images containing the same class of objects at different scales and the heatmaps obtained by T2T-ViT-7 and ViTAE-T. Best viewed in color.

Fig. 5  The average per-layer attention distance (in pixels) of T2T-ViT-7 and our ViTAE-T on the ImageNet validation set.

Table 6  The performance of scaled-up ViTAE models on the ImageNet-1K dataset. * denotes results that we re-implement on GPU with the PyTorch framework. † indicates that ImageNet-22K is used to further finetune the models at 224×224 resolution for 90 epochs.

Model | #Params | Test size | Method | ImageNet Top-1 | Real Top-1
T2TViT-24 | 65 M | 224 | Supervised [92] | 82.3 | 87.2
ViT-B* | 88 M | 224 | MAE [27] | 83.4 | 89.1
ViTAE-B | 89 M | 224 | MAE [27] | 83.8 | 89.4
ViTAE-B† | 89 M | 224 | MAE [27] | 84.8 | 89.9
Swin-L† | 197 M | 384 | Supervised [49] | 87.3 | 90.0
SwinV2-L† | 197 M | 384 | Supervised [48] | 87.7 | -
CoAtNet-4† | 275 M | 384 | Supervised [17] | 87.9 | -
CvT-W24† | 277 M | 384 | Supervised [80] | 87.7 | -
ViT-L* | 304 M | 224 | MAE [27] | 85.5 | 90.1
ViT-L | 304 M | 224 | MaskFeat [78] | 85.7 | -
ViTAE-L | 311 M | 224 | MAE [27] | 86.0 | 90.3
ViTAE-L† | 311 M | 224 | MAE [27] | 87.5 | 90.8
ViTAE-L† | 311 M | 384 | MAE [27] | 88.3 | 91.1
SwinV2-H | 658 M | 224 | SimMIM [84] | 85.7 | -
SwinV2-H | 658 M | 512 | SimMIM [84] | 87.1 | -
ViTAE-H | 644 M | 224 | MAE [27] | 86.9 | 90.6
ViTAE-H | 644 M | 512 | MAE [27] | 87.8 | 91.2
ViTAE-H† | 644 M | 224 | MAE [27] | 88.0 | 90.7
ViTAE-H† | 644 M | 448 | MAE [27] | 88.5 | 90.8

4.4 Analysis of the scaled-up ViTAE models

4.4.1 Image classification performance

We evaluate the performance of the scaled-up models on the ImageNet dataset. The scaled-up models are pre-trained for 1600 epochs using MAE [27] on images from the ImageNet-1K training set. Then, the models are finetuned for 100 (the base model) or 50 (the large and huge models) epochs using the labeled data from the ImageNet-1K training set. It should be noted that the original MAE is trained on TPU machines with TensorFlow, while our implementation adopts PyTorch as the framework and uses NVIDIA GPUs for training. This implementation difference may cause a slight difference in the models' classification accuracy. The results are summarized in Table 6. It demonstrates that the proposed ViTAE-B model with the introduced inductive bias outperforms the baseline ViT-B model by 0.4 Top-1 classification accuracy. For the ViTAE-L model with about 300M parameters, the inductive bias still brings about 0.3 performance gains. After using ImageNet-22K labeled data for finetuning, the classification accuracy on the ImageNet-1K validation set further increases by about 1%. These results show that the benefit of introducing inductive bias in vision transformers is scalable to large models and