Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Andrew Shin · Masato Ishii · Takuya Narihira

arXiv:2103.04037v2 [cs.CV] 9 Nov 2021


Abstract Transformer architectures have brought about fundamental changes to the computational linguistics field, which had been dominated by recurrent neural networks for many years. Their success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends in how transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

Keywords language and vision · transformer · attention · BERT

Andrew Shin, Sony Corporation. E-mail: andrew.shin@sony.com
Masato Ishii, Sony Corporation. E-mail: masato.a.ishii@sony.com
Takuya Narihira, Sony Corporation. E-mail: takuya.narihira@sony.com

1 Introduction

Ever since the advent of the deep learning revolution, the de facto standard for cross-modal tasks with language and vision has been to use convolutional neural networks (CNN) (LeCun et al. (1998); Krizhevsky et al. (2012)) to extract features from the visual domain, with VGG (Simonyan and Zisserman (2015)) or ResNet (He et al. (2016)) being the most frequently used CNN architectures, while employing recurrent neural networks (RNN) (Elman (1990)), such as long short-term memory (LSTM) (Hochreiter and Schmidhuber (1997)) or gated recurrent units (GRU) (Cho et al. (2014)), to learn the language representation. While a plethora of variations exist as to the specific ways to extract features and how to blend them into a common embedding space, the fundamental pipeline has almost invariably been restricted to the combination of CNN and RNN.

This steadfast landscape started to change with the introduction of the transformer architecture (Vaswani et al. (2017)). The transformer first demonstrated its capacity in natural language processing (NLP), achieving state-of-the-art performances on a large number of NLP tasks (Peters et al. (2018); Dai et al. (2019); Yang et al. (2019b)), and has rapidly replaced recurrent neural networks. Its application has also been expanded to the speech recognition domain (Dong et al. (2018); Wang et al. (2020c)). While a variety of transformer-based language models exist, BERT (Devlin et al. (2019)) has gained particularly wide attention, not only for its performance but also for its unique approach to pre-training and its adaptability to downstream tasks. The GPT line of works (Radford and Sutskever (2018); Radford et al. (2019); Brown et al. (2020)) has also demonstrated that pre-training with a very large corpus and fine-tuning the model to a target task can outperform conventional models by a large margin. In particular, GPT-3 shows that pre-training with an extremely large corpus and parameter count easily extends to high performance on few-shot learning tasks without performing any fine-tuning.

The success of the transformer architecture in the language domain has naturally led to its further application in cross-modal tasks involving language and vision. ViLBERT (Lu et al. (2019)) was one of the first models to demonstrate that the pre-training objectives of BERT can be extended to cross-modal learning, and that it obtains state-of-the-art or comparable performances to models based on the conventional CNN-RNN approach. Many other models followed with a similar approach, and now the assumption that pre-training with a large amount of data leads to superior performance holds true for the cross-modal domain as well.

Fig. 1: A brief timeline showing important models in vision (upper) and language (lower) domains respectively, and how
they have affected approaches in cross-modal domains (middle).
Yet, many important arguments can be raised with regard to the limitations and prospects of transformer-based cross-modal models. For example, while most models require that images or videos be tokenized and serialized in some way, they still fundamentally rely on CNN-based models to extract features for each visual token. It also remains arguable whether transformers learn fundamentally superior embeddings or whether their performance is simply due to a large amount of computation and data, bringing up the issue of computational efficiency. Furthermore, blending the transformer architecture with generative models is an open challenge, which is just beginning to be explored.

Figure 1 shows a brief timeline over the past few years with important models in the vision and language domains respectively, and how they have influenced the approaches for cross-modal tasks. Looking back at the important milestones and current trends, we project that transformer architecture may take over the vision representation part of cross-modal tasks, as its performance in the vision domain has been shown to be comparable to CNNs (Dosovitskiy et al. (2020)). More about this will be discussed in Sec 5 and Sec 6.

This paper attempts to review some of the representative works on transformer-based cross-modal models, with emphasis on pre-training schemes, discussing the common characteristics of various models along with their differences. We also investigate models of novel and promising directions, such as transformer-based vision representation, or image generation from text. By doing so, our goal is to acquire up-to-date insights into various aspects of cross-modal learning, as well as prospects on how they may continue to influence the deep learning field.

The rest of the paper is organized as follows: we first review the conventional visuolinguistic cross-modal tasks and recent benchmark tasks, along with frequently used evaluation metrics, in Sec 2. We also review the elementary transformer architecture and its variations, mostly restricted to the language domain and centered around BERT, in Sec 3. We then introduce and inspect the recent models on cross-modal tasks that employ transformer architecture in Sec 4, focusing on their architectural modifications and pre-training schemes. While the works introduced in Sec 4 mostly employ transformer architecture to acquire linguistic representation, Sec 5 introduces works that examine representing vision with transformer architecture, suggesting the possibility of replacing convolutional neural networks. Sec 6 discusses some of the prospects of transformer architecture for cross-modal tasks. Finally, we conclude the paper in Sec 7 by summarizing the important points raised in the paper, while also discussing the current limitations and future works.

2 Preliminaries I: Visuolinguistic Tasks

In this section and Sec 3, we briefly go over the preliminaries necessary to understand the topics and implications put forward by this paper. We first review some of the representative tasks in the visuolinguistic domain, namely image captioning and visual question answering, with commonly used approaches. We also visit other important tasks, some of which have become benchmark tasks for transformer-based cross-modal models. We then go over evaluation metrics for these tasks.

2.1 Classical Tasks

Prior to the deep learning era, early models frequently tackled visuolinguistic cross-modal tasks with template-based models (Barbu et al. (2012); Elliott and Keller (2013); Ushiku et al. (2012)) or ranking and retrieval models (Farhadi et al. (2010); Ordonez et al. (2011); Hodosh et al. (2013)).

With the advent of deep learning, however, the mainstream paradigm for tackling cross-modal tasks has rapidly shifted towards approaches incorporating convolutional and recurrent neural networks.

Image captioning and visual question answering (VQA) (Antol et al. (2015); Zhang et al. (2016); Goyal et al. (2017)) have conventionally been considered two of the most representative examples of cross-modal tasks involving language and vision. In image captioning, the model is trained with pairs of images and the captions describing those images, and learns to generate descriptive captions for unseen images. While early works relied on a straightforward combination of CNN and LSTM (Vinyals et al. (2015); Karpathy and Li (2014)), advanced models appeared with dense localization (Johnson et al. (2016)) and semantic attention (Xu et al. (2015); You et al. (2016); Zhou et al. (2016)), to name a few. Some works (Dai et al. (2017)) went steps further to incorporate generative adversarial networks (GANs) (Goodfellow et al. (2014)) for image captioning, in which GANs are used to predict whether the generated caption is natural.

VQA is a task in which a question in natural language is asked about an image, and the model is asked to provide an answer to the question. While many variations of the VQA task exist, VQA as a benchmark task usually refers to the one based on MS COCO (Lin et al. (2014)) with roughly 0.6M questions, each of which comes with 10 crowd-sourced answers. For VQA, most approaches can be categorized by four primary components: image representation, text representation, common embedding scheme, and attention mechanism. Image representation has mostly relied on CNNs, often employing region detection models, while text representations may rely on RNN-line models, Skip-thoughts (Kiros et al. (2015)), or more classical models such as word2vec (Mikolov et al. (2013)) or GloVe (Pennington et al. (2014)). While the cross-modal embedding scheme may be simple concatenation or element-wise addition and multiplication, more sophisticated schemes have also been proposed, such as compact bilinear pooling (Fukui et al. (2016)), low-rank bilinear pooling (Kim et al. (2017)), or cross-modal Tucker fusion (Ben-younes et al. (2017)). Various attention mechanisms (Yang et al. (2015); Lu et al. (2016)) have also been proposed and have demonstrated their effectiveness.[1]

[1] Note that the attention mechanism referred to here, while nearly identical in its motivation, differs from the same term as employed in the transformer architecture, and must not be confused with the works that do employ transformer architecture to tackle the same task, which will be introduced in Sec. 4.
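As an illustration of how the four components listed above are typically wired together, the following is a minimal, generic sketch of a pre-transformer VQA pipeline that uses concatenation as the common embedding scheme. The feature extractors, answer vocabulary, and all dimensions are illustrative stand-ins rather than components of any specific model discussed in this survey.

```python
# A generic sketch of the classical CNN-RNN VQA pipeline, using concatenation
# as the common embedding scheme. All components below are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def extract_image_features(image):
    # Stand-in for pooled features from a pre-trained CNN (e.g., VGG/ResNet).
    return rng.normal(size=512)

def encode_question(tokens):
    # Stand-in for the final hidden state of an LSTM/GRU question encoder.
    return rng.normal(size=256)

ANSWER_VOCAB = ["yes", "no", "2", "red", "dog"]          # toy answer vocabulary
W_cls = rng.normal(size=(len(ANSWER_VOCAB), 512 + 256))  # answer classifier weights
b_cls = np.zeros(len(ANSWER_VOCAB))

def answer_question(image, question_tokens):
    img_feat = extract_image_features(image)
    txt_feat = encode_question(question_tokens)
    joint = np.concatenate([img_feat, txt_feat])  # concatenation as the common embedding
    logits = W_cls @ joint + b_cls                # classify over a fixed answer vocabulary
    return ANSWER_VOCAB[int(np.argmax(logits))]

print(answer_question(None, "what animal is this".split()))
```

More sophisticated schemes such as bilinear pooling or Tucker fusion replace the concatenation step with richer interactions between the two feature vectors.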
Image captioning and VQA have also been extended to the cross-modal domain involving language and video, as video captioning (Das et al. (2013); Gella et al. (2018)) and video QA (Tapaswi et al. (2016)), respectively. As in image captioning and VQA, the models in the video domain (Donahue et al. (2014); Venugopalan et al. (2015b,a)) still heavily relied on CNNs and RNNs for representing video and language.

2.2 Benchmark Tasks

While image captioning and visual question answering have been exemplary visuolinguistic cross-modal tasks, other important and intriguing variations have emerged. In particular, some of these novel variations, along with the VQA task, have been consolidated as benchmark tasks for pre-trained cross-modal models. Interestingly, the image captioning task, despite being a representative visuolinguistic task, has only rarely been tackled with recent transformer-based models. We conjecture that this may be attributable to the fact that image captioning employs a different array of evaluation metrics, which makes it less intuitive to compare performance alongside other simple accuracy-based tasks, as we shall see in Sec. 2.3. However, it is still being actively examined in the video and language field, and is likely to remain an important indicator of cross-modal models' performances.

One of the frequently visited benchmark tasks is visual commonsense reasoning (VCR) (Zellers et al. (2019)), which extends VQA by not only asking a question about a referred agent's action, but also asking why the model chose that answer. This notably attempts to accomplish the logical inference made by humans. Three subtasks with varying difficulties exist within the task, where 1) the question is given and the model predicts the answer (Q→A), 2) the question and the answer are given and the model predicts the reason or intention (QA→R), or 3) the question is given and the model predicts both the answer and the reason for that answer (Q→AR). The VCR dataset contains roughly 290k multiple-choice questions accompanied by bounding boxes and semantic masks to indicate which agent the question or the answer candidate refers to.

Natural language for visual reasoning (NLVR) (Suhr et al. (2019)) also requires an advanced understanding of vision and language, by judging whether a statement is true with regard to a juxtaposition of two images, based on a dataset consisting of roughly 100k human statements made about the pairs of images. The unique setting of the task, in which there are two images to consider, brings new challenges as to how to embed the images, e.g., whether they should be embedded separately or together, and how that decision should affect the common embedding with text. RefCOCO (Yu et al. (2016)) is also a frequently visited benchmark task, where the goal is to identify which object in the image is referred to by the linguistic statement, based on a dataset consisting of 140k referring expressions from ReferitGame (Kazemzadeh et al. (2014)). Grounding referring expressions (Kazemzadeh et al. (2014)), caption-based image retrieval (Young et al. (2014)), and embodied visual recognition (Yang et al. (2019a)) are also noteworthy tasks involving vision and language.

Finally, image generation from text is also an important yet largely unexplored axis of visuolinguistic tasks. So far, it has mostly been limited to certain target domains, such as birds or flowers (Reed et al. (2016); Zhang et al. (2018b)), but promising works have started to appear, as we will see in Sec. 6.

As we shall see in Sec 4.1.2, benchmark tasks appear to be less consolidated in the video and language domain, although video captioning tasks with YouCook2 (Zhou et al. (2017)) and ActivityNet (Heilbron et al. (2015)) are frequently visited.

2.3 Evaluation Metrics

Evaluation metrics for visuolinguistic tasks can be categorized as those for classification tasks and those for generation tasks, such as image captioning. While the classification tasks mostly rely on straightforward accuracy, generation tasks usually employ multiple metrics for evaluation. BLEU (Papineni et al. (2002)), originally proposed for machine translation, is one of the most frequently used evaluation metrics and computes the portion of n-grams in the candidate caption that also appear in the ground truth. While BLEU places emphasis on precision, ROUGE (Lin (2004)) takes a more recall-oriented approach by counting the occurrences of exact matches of n-grams. While BLEU and ROUGE look for exact matches of words, METEOR (Banerjee and Lavie (2005)) evaluates n-grams considering synonyms and stems, based on WordNet (Miller (1995)). CIDEr (Vedantam et al. (2015)), designed specifically for the image captioning task, looks at the consensus between the generated caption and the reference captions, assigning more weight to n-grams that appear specifically for the image and less weight to n-grams that appear frequently across all images. SPICE (Anderson et al. (2016)) is also a popular metric for image captioning, where a set of tuples is extracted from the semantic parse graph of the sentences. All of these metrics are also widely used for the video captioning task.
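As a concrete illustration of the n-gram overlap idea behind BLEU, the sketch below computes modified n-gram precision for a single candidate caption; the full metric additionally applies a brevity penalty and combines several n-gram orders at the corpus level, which this simplified version omits.

```python
# Simplified sketch of BLEU-style modified n-gram precision for one candidate
# caption; the full metric also applies a brevity penalty and combines several
# n-gram orders over a whole corpus.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n=1):
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is clipped by its maximum count in any reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split(),
              "the dog runs across the grass".split()]
print(modified_precision(candidate, references, n=1))
```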
Most of the benchmark tasks apart from captioning evaluate performance with straightforward accuracy, although accuracy might be defined slightly differently depending on the task. For example, VQA defines accuracy with respect to the number of humans that provided the same answer out of all ground truth answers available for the question. Also, GQA (Hudson and Manning (2019)) proposes to supplement accuracy with additional metrics such as consistency, validity, and plausibility. While the straightforward nature of the accuracy metric may risk losing accountability for more subtle aspects of performance, such as the model's ability to yield the second-best answer when not giving the correct answer, it is usually reported over a set of frequently visited tasks, and such aggregated evaluations usually appear to retain mutual agreement to a fair extent. Also, when necessary, accuracy is measured at different recall levels, as in the image retrieval task shown in Table 1.
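For reference, the widely used VQA accuracy rule can be sketched as follows; the official evaluation additionally normalizes answer strings and averages over subsets of annotators, which this simplified version omits.

```python
# Sketch of the commonly used VQA accuracy rule: an answer counts as fully
# correct if at least 3 of the 10 crowd-sourced annotators gave it, and
# partially correct otherwise. The official evaluation also normalizes answer
# strings and averages over annotator subsets, omitted here for brevity.
def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

answers = ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]
print(vqa_accuracy("2", answers))    # 1.0
print(vqa_accuracy("two", answers))  # ~0.67
```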
3 Preliminaries II: Transformer-Based Models

In this section, we first describe the transformer architecture by looking at the multi-head self-attention mechanism. We then introduce BERT, a transformer-based model that has become a crucial component in the recent surge of transformer-based cross-modal models, with emphasis on its unique pre-training objectives. We also introduce pre-training objectives from models other than BERT for reference.

3.1 Transformer Architecture

Vaswani et al. (2017) proposed the transformer architecture and demonstrated that it outperforms the then-dominant RNN- or CNN-based approaches on several sequential transduction tasks, such as machine translation. The transformer comprises an encoder part and a decoder part, both of which consist of a series of self-attention based modules. Differently from CNNs and RNNs, it adopts self-attention as the basic operation in the model instead of convolution or memory gating, which leads to suitable properties for handling sequential data, as will be shown later.

In the self-attention process, each input vector is first transformed into three vectors called query, key, and value. The output is computed as a weighted sum of the values, where the weight of each value is assigned according to the similarity between the query and the corresponding key. Let Q, K, and V be matrices that contain all queries, keys, and values extracted from the given input vectors, respectively. The self-attention process can then be formulated as

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,    (1)

where d_k is the dimensionality of the key.

To boost the flexibility of the self-attention process, the transformer adopts a multi-head attention mechanism instead of simple self-attention, as shown in Eq. (2). In multi-head attention, several self-attention processes are conducted in parallel, and their outputs are integrated by concatenation followed by a linear projection to obtain the final output vector, as shown in the following equations:

    MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,    (2)
    head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),    (3)

where W^O is a projection matrix that integrates the outputs of all attention processes. The queries, keys, and values for each attention process are computed by a linear projection of the original ones, and their projection matrices W_i^Q, W_i^K, and W_i^V are jointly optimized with W^O and other trainable parameters during training.
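The following is a minimal NumPy sketch of Eqs. (1)-(3), i.e., scaled dot-product attention and multi-head self-attention; the shapes, head count, and the toy input at the end are illustrative assumptions rather than settings from any particular model.

```python
# Minimal NumPy sketch of Eqs. (1)-(3): scaled dot-product attention and
# multi-head self-attention. All shapes below are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (1): softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # query-key similarities
    return softmax(scores, axis=-1) @ V    # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Eqs. (2)-(3): run h heads in parallel, concatenate, and project with W_o."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 5 tokens, d_model = 16, h = 4 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q = [rng.normal(size=(16, 4)) for _ in range(4)]
W_k = [rng.normal(size=(16, 4)) for _ in range(4)]
W_v = [rng.normal(size=(16, 4)) for _ in range(4)]
W_o = rng.normal(size=(16, 16))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)  # shape (5, 16)
```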

In the self-attention process, the order of the input vectors does not affect the resulting output vectors. For example, a model's prediction on a certain sentence would not depend on the order of words in the sentence, which is inappropriate for many NLP tasks. To avoid this problem, position information for each input is encoded and added to the embedded input token before being fed to the transformer-based model. Specifically, the position information is encoded as follows:

    PE(p, 2i) = sin(p / 10000^(2i/d_model)),    (4)
    PE(p, 2i+1) = cos(p / 10000^(2i/d_model)),    (5)

where p is the position of the target token in the input sequence, and d_model and i are the dimensionality of the embedded token and the dimension index, respectively. Beyond this absolute positional embedding, other embedding methods have been proposed. For example, (Wang et al. (2021a)) have shown that relative sinusoidal positional embedding outperforms absolute positional embedding at longer distances. The modified attention matrix (MAM) (Dufter et al. (2021)) has also been proposed, where the positional embedding is added as a bias to the attention map.
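A minimal sketch of the absolute sinusoidal positional encoding of Eqs. (4) and (5) is given below; it assumes an even d_model, and the sequence length and dimensionality are arbitrary.

```python
# Minimal sketch of the sinusoidal positional encoding of Eqs. (4)-(5).
# Assumes an even d_model; sequence length and dimensionality are arbitrary.
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, None]                   # p
    i = np.arange(d_model // 2)[None, :]                      # dimension index i
    angles = positions / np.power(10000.0, 2 * i / d_model)   # p / 10000^(2i/d_model)
    PE[:, 0::2] = np.sin(angles)   # Eq. (4): even dimensions
    PE[:, 1::2] = np.cos(angles)   # Eq. (5): odd dimensions
    return PE

# The encoding is simply added to the token embeddings before the first layer.
embeddings = np.random.default_rng(0).normal(size=(12, 16))
inputs = embeddings + sinusoidal_positional_encoding(12, 16)
```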
Some of the distinctive merits of the transformer over previous models are its modeling of long-term dependency and its flexibility for parallel computing. For instance, RNNs generally find it difficult to preserve context as the distance between tokens becomes longer. Since they are executed in a sequential manner, it also becomes non-trivial to run them in parallel. Long-distance dependency is also challenging for CNNs, as it would require a proportionally large number of layers. On the other hand, the attention mechanism can model dependency between words of any distance within the pre-defined sequence length, and as it does not require sequential ordering of the input, it is highly suitable for parallel, distributed computation.

3.2 BERT

While a large number of variations on the transformer exist, BERT (bidirectional encoder representations from transformers) (Devlin et al. (2019)) is particularly important for our topic, as its architecture has been employed by many cross-modal models, and its pre-training tasks have been extended to account for the cross-modal setting. Following the transformer, BERT applies layer normalization (Ba et al. (2016)) on top of multi-head attention with residual connections, and applies feed-forward propagation with its unique choice of non-linear activation function, namely the Gaussian error linear unit (GELU) (Hendrycks and Gimpel (2016)).

BERT is also known for its unique choice of pre-training tasks, namely masked language modeling and next sentence prediction. In masked language modeling, tokens are randomly masked with 15% probability, and the model is trained to predict those masked tokens. For next sentence prediction, two sentences are provided, where the second sentence may be the actual succeeding sentence or a random sentence with 50% probability, and the model is given the task of binary classification of whether the second sentence is the actual succeeding sentence. While the first objective, masked language modeling, aims to learn token-level dependency, the motivation of the second task is to learn inter-sentence relationships. Although (Liu et al. (2019b)) has shown that next sentence prediction turns out not to be crucial for the model's performance, these two objectives, along with the bidirectional architecture, characterize and differentiate BERT from other transformer-based language models.
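The two objectives can be made concrete with a rough sketch of how pre-training examples are constructed; tokenization here is simplified to whole words, and the 80/10/10 replacement rule of the original BERT implementation is omitted.

```python
# Rough sketch of BERT-style pre-training example construction: 15% of tokens
# are masked for masked language modeling, and the second sentence is replaced
# by a random one with 50% probability for next sentence prediction.
# Simplifications: whole-word tokens, no subword pieces, no 80/10/10 rule.
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # model must recover the original token
        else:
            inputs.append(tok)
            labels.append(None)     # no loss on unmasked positions
    return inputs, labels

def make_nsp_example(sent_a, sent_b, corpus_sentences):
    if random.random() < 0.5:
        return (sent_a, sent_b), 1                       # actual next sentence
    return (sent_a, random.choice(corpus_sentences)), 0  # random sentence

print(make_mlm_example("the cat sat on the mat".split()))
print(make_nsp_example("a dog runs .", "it chases a ball .", ["the sky is blue ."]))
```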
For reference, we would like to highlight a few more notable pre-training objectives proposed by other models. XLNet (Yang et al. (2019b)) attempts to overcome some of the drawbacks of masked language modeling in BERT, e.g., the absence of masked tokens in real data, or its inability to model the joint probability with the product rule used in autoregressive language modeling. Specifically, they propose permutation language modeling, where the model maximizes the likelihood of the input sequence over all possible permutations of the factorization order, thereby learning bidirectional context while retaining the merits of autoregressive models. MT-DNN (Liu et al. (2019a)) performs multi-task learning, which leverages supervised data from multiple related tasks, such as single-sentence or pairwise text classification and relevance ranking, on top of language model pre-training. The paper claims that this prevents the model from overfitting to a specific task, with a complementary effect to general language model pre-training. (Chang et al. (2020)) proposes three pre-training objectives with emphasis on capturing the granularity of semantics between the query and the document, namely the inverse cloze task to capture local semantic context, body first selection to capture global context within the document, and wiki link prediction to capture distant inter-article context. Their experiments suggest that the proposed pre-training objectives can perform significantly better than masked language modeling.

4 Analysis of Cross-Modal Embeddings

We now look into cross-modal models that have employed transformer/BERT architectures. We first review their pre-training objectives, many of which have been directly inspired by BERT, albeit with notable variations, and speculate on how they relate to the models' performances. We also look at the models' attempts to construct network architectures that learn inter-modal dependency, along with how they deal with distinct modalities for mutual compatibility.
6                                                                                                                             Andrew Shin et al.

Fig. 2: Visualization of representative cross-modal models in terms of dataset size used for pre-training, performance as measured by VQA task,
the type of pre-training objectives employed, and the model size. The size of block for respective model corresponds to the size of the model, as
approximated by the implementation details reported in respective paper. VM1 and VM2 refer to two distinct tasks of masked visual modeling
with image regions.

4.1 Pre-training Tasks

One of the primary arguments laid out by the transformer-based language models, including BERT, is that large-scale pre-training is the key to their successful performance. It follows evidently that the choice of pre-training tasks is essential in obtaining a high-quality language model. Cross-modal models employing transformer architecture have generally followed the same presumption, placing a strong emphasis on the design of pre-training objectives, and the key challenge is consequently how to replicate large-scale pre-training in the cross-modal setting. For example, as shown in Sec 3.2, BERT is known for its unique pre-training tasks of masked language modeling and next sentence prediction, but since it is designed solely for language, these tasks inevitably require adjustments to be extended to the cross-modal setting. While most models adopt those two pre-training tasks in modified ways, some models propose additional pre-training tasks designed specifically for the cross-modal setting.

Masked language modeling, one of the two pre-training tasks of BERT, is used as-is for linguistic input tokens in cross-modal models, almost without exception. One notable variation is InterBERT (Lin et al. (2021)), which performs masked language modeling by replacing the masked word with continuous words, rather than random words as in most other models. B2T2 (Alberti et al. (2019)) also extends the task by training the language model while seeing the image.
The other pre-training task of BERT, next sentence prediction, is mostly converted to a binary classification of whether the input image and sentence are semantically matched. This, too, is performed by most cross-modal models, with few exceptions such as VL-BERT (Su et al. (2020)) or UnifiedVLP (Zhou et al. (2019)), which explicitly opt not to perform this pre-training task. InterBERT again makes a unique variation by explicitly involving matching with hard negative examples.
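A rough sketch of how such an image-sentence matching objective is typically set up is shown below: positive pairs come from the dataset, negative pairs are created by re-pairing an image with a randomly drawn caption (or, as in InterBERT, with hard negatives), and a binary cross-entropy loss is applied to a matching score. The scoring function here is a placeholder for the pooled output of a cross-modal transformer followed by a linear head.

```python
# Rough sketch of the image-sentence matching objective. score_pair is a
# stand-in for a cross-modal transformer's pooled matching logit; negative
# sampling is simplified (a real implementation would exclude the true caption).
import random
import numpy as np

rng = np.random.default_rng(0)

def score_pair(image_feats, caption_tokens):
    # Placeholder for the model's matching logit.
    return float(rng.normal())

def make_itm_batch(image_caption_pairs, neg_prob=0.5):
    batch = []
    captions = [c for _, c in image_caption_pairs]
    for image, caption in image_caption_pairs:
        if random.random() < neg_prob:
            batch.append((image, random.choice(captions), 0))  # mismatched pair
        else:
            batch.append((image, caption, 1))                  # matched pair
    return batch

def itm_loss(batch):
    losses = []
    for image, caption, label in batch:
        p = 1.0 / (1.0 + np.exp(-score_pair(image, caption)))  # sigmoid
        losses.append(-(label * np.log(p) + (1 - label) * np.log(1 - p)))
    return float(np.mean(losses))

pairs = [("img_0", "a man riding a horse"), ("img_1", "two dogs playing in snow")]
print(itm_loss(make_itm_batch(pairs)))
```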

                                           VQA           GQA                VCR                 RefCOCO          NLVR2           IR                  ZS IR
                                          dev/std       dev/std     Q/A     QA/R     Q/AR     val/ testA/testB    test-P      R1/R5/R10            R1/R5/R10
 ViLBERTnopt (Lu et al. (2019))           68.93/–          –        69.26   71.01    49.48   68.61/75.97/58.44     –       45.50/76.78/85.02     0.00/0.00/0.00
 ViLBERT                                70.55/70.92        –         73.3    74.6     54.8   72.34/78.52/62.61     –       58.20/84.90/91.52   31.86/61.12/72.80
 B2T2(Alberti et al. (2019))                 –             –         72.6    75.7     55.0            –            –                –                   –
 B2T2ensemble                                –             –         74.0    77.1     57.1            –            –                –                   –
 VL-BERTnopt (Su et al. (2020))           69.58/–          –          –       –        –     66.03/71.87/56.13     –                –                  –
 VL-BERT                                  71.16/–          –          –       –        –     71.60/77.72/60.99     –                –                  –
 VL-BERTlarge                           71.79/72.22        –         75.8    78.4     59.7   72.59/78.57/62.30     –                –                   –
 LXMERT(Tan and Bansal (2019))            –/72.5       60.0/60.3      –       –        –              –            –                –                  –
 Unicoder-VLzs (Li et al. (2020a))           –             –          –       –        –              –            –         48.4/76.0/85.2             –
 Unicoder-VLnopt                             –             –          –       –        –              –            –         57.8/82.2/88.9             –
 Unicoder-VL                                 –             –         73.4    74.4     54.9            –            –         71.5/90.9/94.9             –
 VisualBERT(Li et al. (2019))           70.80/71.00        –         71.6    73.2     52.4            –           67.0              –                   –
 UNITERbase (Chen et al. (2020c))       72.70/72.91        –         75.0    77.2     58.2   75.31/81.30/65.58   77.85     72.52/92.36/96.08   66.16/88.40/92.94
 UNITERlarge                            73.82/74.02        –         77.3    80.8     62.8    75.9/81.45/66.7    79.98     75.56/94.08/96.76    68.74/89.2/93.86
 MiniVLM(Wang et al. (2020b))           69.39/69.06        –          –       –        –              –          73.93              –                   –
 PixelBERTr50 (Huang et al. (2020b))    71.35/71.42        –          –       –        –              –           72.4        59.8/85./91.6             –
 PixelBERTx152                          74.45/74.55        –          –       –        –              –           77.2       71.5/92.1/95.8             –
 OSCARbase (Li et al. (2020c))            –/73.44          –          –       –        –              –          78.36              –                  –
 OSCARlarge                               –/73.82     61.58/61.62     –       –        –              –          80.37              –                   –
 UnifiedVLP(Zhou et al. (2019))          70.5/70.7         –          –       –        –              –            –                –                   –
 InterBERTnopt (Lin et al. (2021))           –             –         63.6    63.1     40.3            –            –         53.1/80.6/87.9             –
 InterBERT                                   –             –         73.1    74.8     54.9            –            –         61.9/87.1/92.7      49.2/77.6/86.0
 ERNIE-ViLbase (Yu et al. (2020))       72.62/72.85        –         77.0    80.3     62.1   74.02/80.33/64.74     –       74.44/92.72/95.94            –
 ERNIE-ViLlarge                         73.78/73.96        –        79.2    83.5     66.3    74.24/80.97/64.70     –       75.10/93.42/96.26           –
 DeVLBERTV (Zhang et al. (2020))             –             –          –       –        –              –            –         59.3/85.4/91.8      32.8/63.0/74.1
 DeVLBERTVL                                  –             –          –       –        –              –            –         60.3/86.7/92.2      34.9/65.5/77.0
 DeVLBERTVLC                             71.1/71.5         –          –       –        –              –            –         61.6/87.1/92.6      36.0/67.1/78.3
 SemVLP(Li et al. (2021a))              74.52/74.68   62.87/63.62     –       –        –           – /–/–        79.55     74.10/92.43/96.12            –
 CAPT(Luo et al. (2020a))               72.78/73.03   60.48/60.93     –       –        –              –          75.13              –                   –
 LAMPcoco (Guo et al. (2020))           70.85/71.0         –          –       –        –              –          74.34              –            42.5/70.9/80.8
 LAMPcoco+vg                            72.48/72.62     –/61.05       –       –        –              –          75.43              –            51.8/77.4/85.3
 (Kervadec et al. (2019))                    –           –/60.5       –       –        –              –           75.5              –                   –
 ImageBERT(Qi et al. (2020))                 –             –          –       –        –              –            –         73.1/92.6/96.0      54.3/79.6/87.5
 Human                                       –            89.3       91.0    93.0     85.0            –           96.3              –                   –

Table 1: Comparison of performances of cross-modal models on various tasks. The subscript nopt refers to the model without pre-training, zs
refers to zero-shot. Other subscripts follow respective papers. Human performances are compiled from respective papers for the tasks.

                            Masked LM        Img-Sentence Match    Masked Visual Modeling                  Other
  ViLBERT                   ✓                ✓                     class prediction                        –
  B2T2                      ✓                ✓                     ✗                                       –
  VL-BERT                   ✓                ✗                     class prediction                        –
  LXMERT                    ✓                ✓                     class prediction+feature regression     image QA
  Unicoder-VL               ✓                ✓                     class prediction                        –
  VisualBERT                ✓                ✓                     ✗                                       task-specific masked LM
  UNITER                    ✓                ✓                     class prediction+feature regression     word region alignment
  MiniVLM                   ✓                ✓                     ✗                                       detection+captioning+tagging
  PixelBERT                 ✓                ✓                     ✗                                       pixel random sampling
  OSCAR                     ✓                ✓                     ✗                                       –
  UnifiedVLP                ✓                ✗                     class prediction+feature regression     –
  InterBERT                 ✓ (same word)    ✓ (hard negative)     class prediction                        –
  ERNIE-ViL                 ✓                ✓                     class prediction                        –
  DeVLBERT                  ✓                ✓                     class prediction                        –
  SemVLP                    ✓                ✓                     class prediction+feature regression     image QA
  CAPT                      ✓                ✓                     class prediction                        contrastive
  LAMP                      ✓                ✓                     class prediction+feature regression     –
  (Kervadec et al. (2019))  ✓                ✓                     class prediction                        VQA
  ImageBERT                 ✓                ✓                     class prediction+feature regression     –

                                          Table 2: Pre-training tasks performed by various cross-modal models.

                                          Dataset                  Computation     Optimizer     Batch Size   BERT L/H/A    Cross-Modal Depth
    ViLBERT                                  CC                      8 TitanX        Adam           512        12/12/768            8
    B2T2                                     CC                          –           Adam             –       24/16/1024            –
    VL-BERT                                  CC                      16 V100         Adam           256       24/16/1024            –
    LXMERT                       COCO/VG/VQA/GQA/VG-QA              4 Titan Xp       Adam          256         12/12/768            9
    Unicoder-VL                           CC/SBU                      4 V100         Adam           192        12/12/768            –
    VisualBERT                             COCO                       4 V100         Adam             –        12/12/768            –
    UNITER                           COCO/VG/CC/SBU                      –             –              –       24/16/1024            –
    MiniVLM                  COCO/CC/SBU/Flickr/VQA/GQA/VG-QA            –           SGD            512         12/–/386            –
    PixelBERT                            COCO/VG                     64 V100      SGD/AdamW        4096             –               –
    OSCAR                         COCO/CC/SBU/Flickr/GQA                 –          AdamW           512       24/16/1024            –
    UnifiedVLP                               CC                       8 V100        AdamW           512        12/12/768            –
    InterBERT                          CC/COCO/SBU                       –          AdamW            –              –               –
    ERNIE-ViL                             CC/SBU                      8 V100         Adam           512       24/16/1024            6
    DeVLBERT                                 CC                       8 V100           –              –            –                –
    SemVLP                    COCO/VG/VQA/GQA/VG-QA/CC/SBU            4 V100         Adam           256        12/12/768            6
    CAPT                                      –                          –           Adam             –       24/16/1024            –
    LAMP                                 COCO/VG                   2 Titan RTXs        –            256        12/12/768            –
    Kervadec et al. (2019)               COCO/VG                      4 V100         Adam           512        12/12/768            5
    ImageBERT                               LAIT                      4 V100         Adam            48        12/12/768            –

Table 3: We report the setting for the largest model reported in respective papers. COCO refers to MS COCO (Lin et al. (2014)), CC to Conceptual
Captions (Sharma et al. (2018)), VG to Visual Genome (Krishna et al. (2016)), VG-QA to Q&A subset of Visual Genome, and SBU to SBU
Captions (Ordonez et al. (2011)). LAIT is collected by ImageBERT. BERT L/H/A refer to the number of layers, the number of attention heads, and
hidden size, respectively. Cross-modal depth refers to the number of inter-modality layers, after embedding each modality separately. We report
cross-modal depth only for two-stream models, since most single stream models follow the same network depth as their language embedding for
cross-modal embedding.

One of the key challenges for pre-training tasks among cross-modal embedding models is the way they implement the masked modeling task for visual inputs, as it cannot be naively extended to the vision domain in a straightforward manner, due to the non-sequential nature of vision. In fact, some models, such as B2T2 and VisualBERT (Li et al. (2019)), simply opt not to perform any extended masked modeling task for visual inputs. On the other hand, many models propose novel ways to apply masked modeling to visual tokens. For example, ViLBERT handles this challenge by proposing to mask the image regions extracted by Faster R-CNN (Ren et al. (2015)), and trains the model to predict the class distribution of the masked region, with the class distribution output by Faster R-CNN as the ground truth. VL-BERT (Su et al. (2020)) and Unicoder-VL (Li et al. (2020a)) are also notable examples that perform masked visual modeling with class prediction. In particular, VL-BERT proposes masked RoI classification with linguistic clues, where a region of interest (for example, a cat) is randomly masked out, and the model is trained to predict the category of the masked-out RoI by relying solely on linguistic clues, such as "kitten drinking from bottle." Here, the RoIs are obtained via Fast R-CNN (Girshick (2015)). Such a setting is derived from the concern that there may be cases where it is nearly impossible to identify a region when the region is masked. While VL-BERT opts not to perform the image-sentence matching task, this unique setup intends to make up for the learning of cross-modal dependency.

While the models above relied on the prediction of the class distribution, it has also been found beneficial to incorporate feature regression into the masked visual modeling task. LXMERT (Tan and Bansal (2019)) is one such example. On top of masked object classification, where the label for the masked RoI should be classified based on the other visual inputs and the linguistic inputs, they also perform RoI feature regression with an L2 loss. UnifiedVLP also reports that combining class prediction and feature prediction improves performance. UNITER (Chen et al. (2020c)) also performs both class prediction and feature regression, but they propose to add a third task for masked visual modeling, where they perform class prediction with KL divergence.
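To make these variants of masked visual modeling concrete, the sketch below computes, for a single masked region, (a) class prediction against the detector's hard label, (b) class prediction against the detector's full class distribution with KL divergence, and (c) L2 feature regression. All inputs are toy arrays; in practice the predictions come from the cross-modal transformer and the targets from the region detector.

```python
# Toy sketch of masked visual modeling objectives for one masked region:
# (a) cross-entropy against the detector's predicted class, (b) KL divergence
# against the detector's class distribution, (c) L2 regression of the RoI
# feature. Inputs are illustrative arrays only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def class_prediction_loss(pred_logits, detector_label):
    # (a) Predict the detector's hard class label for the masked region.
    return float(-np.log(softmax(pred_logits)[detector_label]))

def kl_class_distribution_loss(pred_logits, detector_probs, eps=1e-9):
    # (b) Match the detector's soft class distribution (KL divergence).
    p = detector_probs
    q = softmax(pred_logits)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def feature_regression_loss(pred_feature, roi_feature):
    # (c) Regress the original RoI feature with an L2 loss.
    return float(np.sum((pred_feature - roi_feature) ** 2))

# Toy usage.
rng = np.random.default_rng(0)
logits = rng.normal(size=1600)               # e.g., size of an object-class vocabulary
probs = softmax(rng.normal(size=1600))       # detector's class distribution
feat_pred, feat_true = rng.normal(size=2048), rng.normal(size=2048)
print(class_prediction_loss(logits, detector_label=3))
print(kl_class_distribution_loss(logits, probs))
print(feature_regression_loss(feat_pred, feat_true))
```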
Some models further propose novel pre-training tasks that do not fall into any of the three categories above, frequently depending on the target downstream task. For example, LXMERT, SemVLP (Li et al. (2021a)) and (Kervadec et al. (2019)) propose an image question answering task for pre-training, whereas PixelBERT (Huang et al. (2020b)) employs pixel random sampling. (Luo et al. (2020a)) proposes contrastive pre-training, where they encourage the model to learn similar representations for sequences that share the same semantics by matching the original sequence and the corresponding corrupted sequence.

While most models so far have relied on Fast/Faster R-CNN for region extraction, MiniVLM (Wang et al. (2020b)) employs a region extraction module inspired by EfficientDet (Tan et al. (2020)), and as such, they first attempt to enhance their visual features by performing large-scale image classification and object detection with the Objects365 dataset (Shao et al. (2019)). The cross-modal representation is subsequently fine-tuned with captioning and tagging, where the captions and tags are obtained by existing models.

4.1.1 Analysis of Performance

Table 1 compares the performance of various models on representative cross-modal tasks. Note that the models are trained with datasets of different sizes and with varying model sizes, so a direct comparison of performance does not necessarily imply the superiority of certain models over others. Table 2 compares the pre-training tasks performed by each model, with four rough categories of pre-training tasks. Table 3 compares the settings of the models, ranging from the datasets used for pre-training to their implementation details. Figure 2 visualizes representative models in terms of the dataset size used for pre-training, their performance on the VQA task, the types of pre-training objectives performed, and the respective model size. While it remains arguable whether we can attribute a certain aspect to a respective model's performance, the models with higher performance have one or more of the following characteristics in common: pre-trained on a relatively large amount of data, larger model size, and more pre-training objectives.

While it remains arguable whether we can attribute a particular aspect to each model's performance, the models with higher performance have one or more of the following characteristics in common: they are pre-trained on a relatively large amount of data, have a larger model size, and employ more pre-training objectives. While these observations are quite generic, note that they are consistent with the recent trend of pre-training models with a very large number of parameters on large-scale datasets, as affirmed by GPT-3.
    Table 4 shows some of the frequently used datasets for both images and videos, along with a description of their contents, annotation types, and sizes. Although it would be difficult to generalize from the small group of models described so far without overfitting to their particularities, further observations can be made with respect to datasets and pre-training objectives that may serve as a useful reference. It must first be noted that all of the image datasets are on the order of millions of examples, once again affirming the indispensable role played by pre-training with large-scale data. In fact, even models of relatively small size can attain fair performance when pre-trained with a large-scale dataset, as is the case with MiniVLM. Pre-training with a large-scale dataset also appears to overcome the noisiness of labels obtained via weak, web supervision. For example, ERNIE-ViL is pre-trained with Conceptual Captions and SBU Captions, both of which are "webly" supervised, yet demonstrates fairly high performance throughout the tasks. To be fair, however, under similar settings, pre-training with manually annotated datasets appears to provide an edge over pre-training with web-supervised datasets, particularly in terms of the dataset size necessary to reach a desirable level of performance, which makes sense considering the inevitable presence of noise in web supervision. The higher efficiency of human annotation in terms of the required dataset size is well illustrated by VisualBERT and a variation of LAMP, both of which are trained only with MS COCO yet display more reliable performance than ViLBERT trained with Conceptual Captions, although Conceptual Captions is more than 5 times larger than MS COCO. As noted above, however, it should be highlighted that this is equivalent to saying, alternatively, that pre-training with web supervision can demonstrate performance comparable to human annotation when given a sufficient amount of data, as is the case with models like ERNIE-ViL. As we shall see in Sec. 4.1.2, a similar tendency can also be observed with video and language models. For example, VideoBERT (Sun et al. (2019)), which is pre-trained with large-scale YouTube videos whose annotations were obtained by the YouTube video annotation system and which uses the human-annotated YouCook2 (Zhou et al. (2017)) for evaluation only, falls below the models directly pre-trained with YouCook2, such as UniVL (Luo et al. (2020b)).
    It may also be inferred that pre-training with a dataset designed for the same or a similar task as the target task helps boost performance on that target task. LXMERT, OSCAR, and SemVLP, which are trained with at least one dataset designed for image QA, all demonstrate higher performance on the VQA task, as shown in Figure 2. Similarly, in terms of the correlation between pre-training objectives and performance on specific tasks, it appears beneficial to perform a similar type of pre-training task for a given target downstream task. LXMERT and SemVLP are again notable examples, as they propose the unique pre-training objective of image QA and indeed achieve better performance on the VQA task. MiniVLM also performs an additional pre-training objective of image captioning to demonstrate reliable performance on the image captioning task. Furthermore, video and language models pre-trained with YouCook2 annotations can be considered a similar case.

4.1.2 Video and Language

Extending the vision domain of cross-modal tasks from images to videos naturally brings new challenges. For example, truly reliable supervision would require annotations at the frame level, which necessitates a prohibitive amount of manual labour. In fact, existing large-scale video datasets, such as YouTube-8M (Abu-El-Haija et al. (2016)) or Sports-1M (Karpathy et al. (2014)), often rely on user-attached tags, resulting in noisy labels. As such, self-supervised learning has become a frequently used approach for learning video representations (Vondrick et al. (2018); Agrawal et al. (2015); Wang and Gupta (2015)). Transformer-based cross-modal models tackling video and language are a relatively recent development, and they have displayed varying approaches with more task-oriented principles. While many models adopt approaches employed in the models for images, for example the BERT-inspired pre-training objectives, there are also models that opt to diverge, as we shall see below.
    When aligning video and language, where the language supervision is extracted from the audio corresponding to unannotated raw videos, a frequent problem is that the semantics of the language and the video may not align. For example, the speaker may be talking about cars while the video shows the speaker himself. (Sun et al. (2019)) notes that cooking videos have a high probability of visual and linguistic semantics being temporally well-aligned, and exploits this property to examine video caption generation and next frame prediction with their VideoBERT. In order to extend BERT's pre-training objectives to the video domain, they obtain visual tokens by applying hierarchical vector quantization to video features. On top of masked language modeling and masked visual token modeling, next-sentence classification is extended to alignment classification between visual and linguistic sentences. Even with cooking videos, however, the alignment may still be noisy, and they deal with this problem by concatenating neighboring sentences into a single long sentence and by varying the subsampling rate of video tokens, making the model robust to variations in video speed.
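    To make the notion of visual tokens more concrete, the following is a minimal sketch of how clip-level video features might be discretized through hierarchical vector quantization, in the spirit of VideoBERT; the two-level k-means, the feature source, and the vocabulary sizes are our own simplifications rather than the exact procedure of the original work.

    # Sketch: discretize video clip features into "visual tokens" with a
    # two-level (hierarchical) k-means vector quantization.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = rng.standard_normal((5000, 1024)).astype(np.float32)  # e.g. clip features

    coarse_k, fine_k = 8, 8   # 8 x 8 = 64 token vocabulary (toy sizes)
    coarse = KMeans(n_clusters=coarse_k, n_init=10, random_state=0).fit(features)
    coarse_ids = coarse.predict(features)

    token_ids = np.zeros(len(features), dtype=np.int64)
    for c in range(coarse_k):
        idx = np.where(coarse_ids == c)[0]
        # second-level quantization within each coarse cluster
        fine = KMeans(n_clusters=fine_k, n_init=10, random_state=0).fit(features[idx])
        token_ids[idx] = c * fine_k + fine.predict(features[idx])

    # token_ids now plays the role of discrete visual tokens that can be fed to a
    # BERT-style model alongside word-piece tokens.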
         Dataset                                                              Contents                           Annotation           Size
         Conceptual Captions (Sharma et al. (2018))               image + scene-level description             web supervision         3.3M
         MS COCO (Lin et al. (2014))                              image + scene-level description              crowd-sourced          0.6M
         Visual Genome (Krishna et al. (2016))                      image + region descriptions                crowd-sourced          5.4M
         Visual Genome - Q&A (VG-QA)                               image + question and answer                 crowd-sourced          1.7M
         SBU (Ordonez et al. (2011))                              image + scene-level description             web supervision          1M
         Flickr 30k (Young et al. (2014))                         image + scene-level description              crowd-sourced          0.2M
         VQA (Antol et al. (2015))                                 image + question and answer                 crowd-sourced          0.6M
         GQA (Hudson and Manning (2019))                           image + question and answer                    generated           22M
         YouCook2 (Zhou et al. (2017))                             video + temporal annotations              manual annotation         15k
         HowTo100M (Miech et al. (2019))                                videos + descriptions                audio transcription      136M
         ActivityNet Captions (Heilbron et al. (2015))             videos + temporal annotations               crowd-sourced           72k
         MSR-VTT (Xu et al. (2016))                                     videos + descriptions                  crowd-sourced          200k
         CrossTask (Zhukov et al. (2019))                             videos + task descriptions              manual annotation         4.7k
         CMU-MOSI (Zadeh et al. (2016))                    videos + sentence-level sentiment annotations        crowd-sourced             2k

Table 4: Comparison of frequently used datasets for pre-training visuolinguistic models. Web supervision refers to captions obtained by crawling
the tags or descriptions for corresponding images uploaded on the web pages. Generated refers to synthetic sentences generated in a rule-based
manner. Crowd-sourced and manual annotation both refer to human annotation, but differ in that the former refers specifically to an external workforce.
Size refers to the number of descriptions or questions available.

                                                              YouCook2                                            ActivityNet
 Method                                        BLEU-4      METEOR ROUGE-L                CIDEr    BLEU-4       METEOR ROUGE-L                 CIDEr
 Masked Transformer (Zhou et al. (2018))         1.42         11.20           –            –         2.23         9.56           –              –
 VideoBERT (Sun et al. (2019))                   4.33         11.94         28.80        0.55          –            –            –              –
 CBT (Sun et al. (2020))                         5.12         12.97         30.44        0.64          –            –            –              –
 COOT (Ging et al. (2020))                      11.30         19.85         37.94        0.57       10.85        15.99         31.45          0.28
 UniVL (Luo et al. (2020b))                     17.35         22.35         46.52        1.81         –            –             –              –
 ActBERT (Zhu and Yang (2020))                   5.41         13.30         30.56        0.65          –            –            –              –
 E2vid (Huang et al. (2020a))                   12.04         18.32         39.03        1.23         –            –             –              –

Table 5: Performances of transformer-based cross-modal models on video captioning for YouCook2 (Zhou et al. (2017)) and ActivityNet (Heilbron
et al. (2015)). Metrics used include BLEU (Papineni et al. (2002)), METEOR (Banerjee and Lavie (2005)), ROUGE (Lin (2004)), and CIDEr
(Vedantam et al. (2015)).
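    For reference, the snippet below shows one way a BLEU-4 score of the kind reported in Table 5 could be computed for a single generated caption, using NLTK's sentence-level BLEU as a stand-in; published numbers come from corpus-level evaluation scripts, so the values are not directly comparable.

    # Toy example of computing BLEU-4 for one generated caption with NLTK.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["add", "the", "chopped", "onions", "to", "the", "pan"]
    hypothesis = ["add", "onions", "to", "the", "pan"]

    bleu4 = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4: {bleu4:.3f}")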

    Similarly to VideoBERT, CBT (Sun et al. (2020)) employs a sliding window approach to extract visual tokens using the S3D network (Xie et al. (2018)). Visual and linguistic tokens are concatenated and passed to a shallow 1-layer cross-modal transformer, where the mutual information between the two is computed. For pre-training objectives, masked language modeling and masked visual token modeling are employed, and the HowTo100M dataset (Miech et al. (2019)) is used rather than videos crawled from YouTube, as is the case with VideoBERT. Also, unlike VideoBERT, where the features are clustered with vector quantization and the task is to estimate which cluster the masked feature belongs to, CBT attempts to directly regress the masked features. While the models above rely on densely sampled visual tokens, ClipBERT (Lei et al. (2021)) notably demonstrates that sparse visual tokens can be sufficient for video representation.
    Novel pre-training objectives that are specific to the video and language domain have also appeared. For example, ActBERT (Zhu and Yang (2020)) proposes masked action classification, where the objective is to predict the action label given the linguistic and object features, whereas HERO (Li et al. (2020b)) proposes frame order modeling. VideoAsMT (Korbar et al. (2020)) and UniVL (Luo et al. (2020b)) also incorporate a generative modeling approach by proposing to regard each modality as a translation of the other.
    Models that do not explicitly employ BERT-inspired approaches have also appeared. (Gabeur et al. (2020)) proposes a multi-modal transformer calibrated for the video retrieval task. Rather than directly extending the pre-training objectives of BERT, they rely on a bidirectional max-margin ranking loss, enforcing high similarity between videos and their captions. Notably, they extract video features with a combination of pre-trained models from 7 different domains, which differentiates them from other models that generally employ one pre-trained model per modality. Similarly, Cooperative hierarchical transformer (COOT) (Ging et al. (2020)) proposes a cross-modal cycle-consistency loss to enforce semantic alignment, on top of the alignment losses from (Zhang et al. (2018a)), rather than simply extending the pre-training objectives of BERT. (Zhou et al. (2018)) also proposes a masked transformer for dense video captioning, but simply uses it as a captioning decoder operating on inputs from a visual encoder and a proposal decoder, without an explicit pre-training objective designed for cross-modal learning.
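    As an illustration of the ranking-based alternative to BERT-style objectives mentioned above, here is a generic sketch of a bidirectional max-margin ranking loss computed over a batch of paired video and caption embeddings; the margin value, the use of in-batch negatives, and the cosine similarity are assumptions on our part and not necessarily the choices of (Gabeur et al. (2020)).

    # Sketch of a bidirectional max-margin ranking loss for video-text retrieval.
    # All other items in the batch serve as negatives.
    import torch
    import torch.nn.functional as F

    def bidirectional_ranking_loss(video_emb, text_emb, margin=0.2):
        # video_emb, text_emb: (B, D) embeddings of paired videos and captions
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        sim = v @ t.t()                         # sim[i, j] = cosine(video_i, caption_j)
        pos = sim.diag().unsqueeze(1)           # (B, 1) similarity of each true pair
        cost_v2t = (margin + sim - pos).clamp(min=0)      # video_i vs. wrong captions
        cost_t2v = (margin + sim - pos.t()).clamp(min=0)  # caption_j vs. wrong videos
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_v2t = cost_v2t.masked_fill(mask, 0.0)
        cost_t2v = cost_t2v.masked_fill(mask, 0.0)
        return (cost_v2t.sum() + cost_t2v.sum()) / sim.size(0)

    loss = bidirectional_ranking_loss(torch.randn(8, 256), torch.randn(8, 256))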
    As we have seen, while there are only a limited number of transformer-based models dealing with video and language, they tend to be specific to task and domain, taking varying approaches. Many models have employed approaches from cross-modal models for image and language, but it may still be too early to identify a definite common ground for these models, since much more remains to be explored. Table 5 summarizes the models' performances on video captioning, one of the most frequently visited tasks in the video and language cross-modal domain.

4.2 Network Architecture

Network architectures for cross-modal embedding using transformers can be roughly divided into two categories: single-stream models, where a single transformer block processes the concatenated inputs from both modalities, and two-stream models, where each modality is handled by its own transformer block and the inputs to each block are inter-modal (Figure 3).
    ViLBERT (Lu et al. (2019)) is a representative two-stream model, which proposes a co-attention transformer mechanism in which the input keys and values of one modality are passed as inputs to the transformer block of the other modality. In other words, the keys and values for the language are input to the transformer block for the vision part, and vice versa. Since the queries for each modality still go into the corresponding modality, the transformer block ends up learning to embed features for each modality conditioned on the other modality. In order to tokenize the vision input, they extract image regions using Faster R-CNN with a ResNet (He et al. (2016)) backbone, while also employing a 5-dimensional spatial location vector. LXMERT (Tan and Bansal (2019)) takes a similar approach, where, after embedding each modality and encoding them with separate transformers, a cross-modality encoder is applied, for which the query vector h_i^k from one modality and the context vector v_j^k from another modality are the inputs, with k being the number of respective single-modality encoders. DeVLBERT (Zhang et al. (2020)) and (Kervadec et al. (2019)) also follow ViLBERT by exchanging queries, or equivalently keys and values, for each modality. SemVLP (Li et al. (2021a)) notably performs two-stream cross-modal learning only at the upper parts of the blocks, in order to learn high-level semantic alignment. ERNIE-ViL (Yu et al. (2020)), while mostly following ViLBERT, incorporates a scene graph approach, where the input text is parsed into nodes for the objects, relations, and attributes that the text mentions. The scene graph representation is shown to improve the semantic alignment between vision and language, and, as the authors note, hints at the possibility of extending cross-modal tasks to graph neural networks.
    Single-stream cross-modal learning is fairly intuitive and simple, as the conventional transformer blocks can be extended in a straightforward manner, without architectural modifications, to concatenated inputs. Unicoder-VL (Li et al. (2020a)), UNITER (Chen et al. (2020c)), and VisualBERT (Li et al. (2019)) are some of the notable examples of single-stream cross-modal models.
    Notably, nearly all models perform early fusion of language and vision embeddings, in which they are concatenated prior to being fed to the cross-modal transformer blocks. In fact, B2T2 (Alberti et al. (2019)), designed specifically for VQA and VCR tasks, modifies the dual encoder (Wu et al. (2017); Gillick et al. (2018)) and compares early fusion with text and bounding boxes against late fusion with the feature from the entire image, and reports that early fusion outperforms late fusion.
    MiniVLM (Wang et al. (2020b)) aims at building a lightweight cross-modal model, and unlike the models that employ Fast R-CNN or Faster R-CNN to extract visual features, they propose a two-stage efficient feature extractor (TEE), consisting of an EfficientNet (Tan and Le (2019)) and a compact BERT fusion model. In particular, the visual feature extraction consists of an EfficientNet with a bidirectional feature pyramid network, as in EfficientDet (Tan et al. (2020)), followed by non-maximum suppression. MiniVLM is also unique in that it proposes a triplet input to the cross-modal learning block, consisting of visual features, tokenized sentences, and tokenized object names. The motivation is that inputting object labels explicitly may help further enforce the learning of dependencies between objects and the corresponding text. A similar approach is also employed by OSCAR (Li et al. (2020c)), which inputs object tags along with word tokens and region features.
    Apart from the cross-modal learning mechanism and input format, positional embedding is another source of variation among the models. Unicoder-VL (Li et al. (2020a)) extracts image regions with Faster R-CNN along with a 5-d spatial vector, but notably uses the same position embedding for all image regions, rather than random permutations. In VisualBERT (Li et al. (2019)), the position embedding for each visual token is matched to the corresponding input token whenever the alignments between image regions and input tokens are available.
    VL-BERT (Su et al. (2020)) proposes a format where each input element consists of 4 different types of embeddings, namely token embedding, visual feature embedding, segment embedding, and sequence position embedding. Most notable is the visual feature embedding, which is a combination of the visual appearance feature, corresponding to the fully-connected feature output by Fast R-CNN for the image region, and the visual geometry embedding, a 4-d vector for the coordinates of each corner of the region. Note that, for linguistic tokens, the entire image is used for the visual feature embedding. The sequence position embedding for image tokens can be randomized.
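    To make the single-stream and two-stream designs described in this subsection concrete (and summarized in Figure 3 below), the following sketch contrasts a single-stream block, which runs standard self-attention over the concatenated visual and word tokens, with a two-stream co-attention block, which swaps keys and values across modalities in the spirit of ViLBERT. The dimensions and module choices are illustrative assumptions, not a reproduction of any specific model.

    # Sketch contrasting single-stream and two-stream (co-attention) blocks.
    # Shapes: visual tokens (B, Nv, D), word tokens (B, Nw, D); batch_first=True.
    import torch
    import torch.nn as nn

    D, HEADS = 768, 12

    class SingleStreamBlock(nn.Module):
        """One transformer encoder layer over the concatenated token sequence."""
        def __init__(self):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(d_model=D, nhead=HEADS,
                                                    batch_first=True)

        def forward(self, visual_tokens, word_tokens):
            x = torch.cat([visual_tokens, word_tokens], dim=1)  # early fusion
            return self.layer(x)

    class CoAttentionBlock(nn.Module):
        """Two streams; each modality queries the other's keys and values."""
        def __init__(self):
            super().__init__()
            self.attn_v = nn.MultiheadAttention(D, HEADS, batch_first=True)
            self.attn_w = nn.MultiheadAttention(D, HEADS, batch_first=True)

        def forward(self, visual_tokens, word_tokens):
            # queries stay within the modality; keys/values come from the other one
            v_out, _ = self.attn_v(visual_tokens, word_tokens, word_tokens)
            w_out, _ = self.attn_w(word_tokens, visual_tokens, visual_tokens)
            return v_out, w_out

    if __name__ == "__main__":
        vis, txt = torch.randn(2, 36, D), torch.randn(2, 20, D)
        print(SingleStreamBlock()(vis, txt).shape)   # (2, 56, 768)
        v, w = CoAttentionBlock()(vis, txt)
        print(v.shape, w.shape)                      # (2, 36, 768) (2, 20, 768)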
Fig. 3: Comparison of single-stream (a) and two-stream (b) cross-modal transformer blocks. H_v^(i) and H_w^(i) refer to the embeddings of
visual and word tokens, respectively, output by the i-th layer.

    Both single-stream and two-stream architectures have appeared in the video and language domain as well. For example, ActBERT proposes a tangled transformer, where each of the query, key, and value comes from a different modality, while UniVL employs a cross encoder, where language and video encodings are combined along the sequence dimension. On the other hand, E2vid (Huang et al. (2020a)) relies on a separate-modality architecture and only partially employs a cross-modal co-attention transformer.
    In summary, we can make a few observations that are common to most models. While some models explicitly come up with a two-stream cross-attention scheme by exchanging the inputs to attention heads, many models do not explicitly design a cross-attention scheme and simply rely on pre-training objectives for learning cross-modal dependencies. Whether single-stream or two-stream, images are tokenized as regions without exception, although the entire image may replace regions in some models for position embedding. Also, regions are mostly accompanied by low-dimensional spatial vectors containing their coordinate information. Features are extracted with pre-trained object detection models rather than classification models, except for models like MiniVLM, which employs an EfficientNet backbone and further trains the detection model. How position embeddings are attached to image regions is another source of variation among the models. Many models employ coordinate-based ordering, while some models simply use a random permutation, or the same position embedding for all visual tokens.

5 Visual Representation with Transformer

So far, most works tackling cross-modal tasks with the transformer architecture have primarily applied it to the linguistic representation, with certain schemes to embed it together with the visual representation, mostly by tokenizing the visual representation in some way. Such tokenization of visual representations has for the most part relied on pre-trained convolutional neural networks. For example, ViLBERT (Lu et al. (2019)) relies on Faster R-CNN to extract image regions as tokens. However, some recent works have suggested that even convolution may be safely replaced by the transformer architecture, which may arguably imply that more fundamental changes are possible in how visual representations are obtained. While these works do not explicitly deal with cross-modal tasks, they carry critical implications for the prospects of cross-modal tasks, as will be discussed in Sec. 6. We thus briefly introduce recent important works on visual representation with transformers in this section.
    (Dosovitskiy et al. (2020)) proposes the Vision Transformer (ViT) and suggests that pure transformers can achieve comparable performance on image classification tasks. While closely following the original transformer architecture, they split images into patches and feed the sequence of linear embeddings of these patches as input to the transformer, so that 2D images are represented as a sequence of flattened patches. In a similar spirit to masked language modeling in BERT, they perform masked patch prediction, in which the model is trained to predict the mean 3-bit color of each patch. In particular, their ablation study with variations in dataset size shows that pre-training with the largest dataset, JFT-300M (Sun et al. (2017)), resulted in better performance, implying that, as in the language domain, training a transformer-based model with a massively large amount of image data can lead to models with outstanding performance. Pyramid Vision Transformer (Wang et al. (2021b)) expands upon the Vision Transformer and demonstrates that convolution-free models can be extended to a wider range of computer vision tasks, including object detection, semantic segmentation, and instance segmentation. VidTr (Li et al. (2021b)) further shows that videos can also be handled without using convolutions.
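    As a concrete reading of the patch tokenization described above, the sketch below flattens an image into a sequence of linearly embedded patches and also computes a mean-color target of the kind used for masked patch prediction; the patch size, the per-channel 3-bit quantization, and all other sizes are our own simplifying assumptions.

    # Sketch of ViT-style patch tokenization plus a mean 3-bit color target
    # for masked patch prediction. Sizes and the exact quantization are
    # illustrative assumptions.
    import torch
    import torch.nn as nn

    B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
    image = torch.rand(B, C, H, W)

    # (B, C, H, W) -> (B, num_patches, P*P*C): flatten non-overlapping patches
    patches = image.unfold(2, P, P).unfold(3, P, P)          # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

    embed = nn.Linear(C * P * P, D)                          # linear patch embedding
    pos = nn.Parameter(torch.zeros(1, patches.size(1), D))   # learned position embedding
    tokens = embed(patches) + pos                            # (B, 196, 768)

    # Masked patch prediction target: mean color per patch, quantized to 3 bits
    mean_color = patches.reshape(B, -1, C, P * P).mean(-1)   # (B, 196, 3) in [0, 1]
    target = (mean_color * 7).round().long()                 # 8 levels per channel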