BERTGEN: Multi-task Generation through BERT
Faidon Mitzalis¹, Ozan Caglayan¹, Pranava Madhyastha¹, Lucia Specia¹,²
¹ Department of Computing, Imperial College London, UK
² Department of Computer Science, University of Sheffield, UK
phaedonmit@gmail.com, {o.caglayan,pranava,lspecia}@ic.ac.uk

arXiv:2106.03484v1 [cs.CL] 7 Jun 2021

Abstract

We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multi-task setting. With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored. We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGEN substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.

1 Introduction

Recent work in unsupervised and self-supervised pre-training has revolutionised the field of natural language understanding (NLU), resulting in high performance ceilings across multiple tasks (Devlin et al., 2019; Yang et al., 2019; Dong et al., 2019). The recent success of language model pre-training with masked language modelling (MLM), such as BERT (Devlin et al., 2019), further paved the way for more complex approaches that combine language pre-training with images (Tan and Bansal, 2019; Su et al., 2020; Lu et al., 2020), video (Sun et al., 2019), and speech (Chuang et al., 2020). Most of these approaches follow a task-specific fine-tuning step after the model is pre-trained.

However, there has been little work on exploiting pre-trained MLMs for natural language generation (NLG) tasks. Previous work argues that the MLM objective is ill-suited for generation tasks such as machine translation (Yang et al., 2019; Rothe et al., 2020). Recent work in this direction has predominantly investigated the use of pre-trained models either to initialise Transformer-based encoder-decoder models (Imamura and Sumita, 2019; Clinchant et al., 2019; Yang et al., 2020; Rothe et al., 2020) or to distill knowledge for sequence generation tasks (Chen et al., 2020).

In this work, we present BERTGEN, which extends BERT in a generative setting (§ 2.1). This results in a single generator – without a separation between the encoder and the decoder – capable of consuming multiple input modalities and generating in multiple languages. The latter features are achieved by transferring knowledge from state-of-the-art pre-trained models, namely VL-BERT (Su et al., 2020) and multilingual BERT (M-BERT) (Devlin et al., 2019). We train BERTGEN on various tasks, including image captioning, machine translation and multimodal machine translation, with datasets in four different languages (§ 2.2).

Based on a number of experiments, our findings (§ 3) show that BERTGEN (i) is surprisingly versatile, as it is capable of describing images and performing translation in unimodal and multimodal settings, across all languages; (ii) generalises well across zero-shot image captioning, multimodal machine translation and out-of-domain news translation tasks; and finally (iii) is parameter efficient when compared to the state-of-the-art models for each of the tasks combined together.

2 Method

In this section, we describe BERTGEN and the tasks we explore. We then detail the baselines and SoTA systems that we compare against.

2.1 Model

This section details the main aspects of BERTGEN that distinguish it from existing work on vision & language pre-training.
[Figure 1 depicts the input stream for an MMT training sample: a target language specifier [DE], [CLS], source tokens x1 … x7, [SEP], the target prefix y1 y2, a [MASK] slot, [SEP], image positions [IMG] [IMG] [IMG] and [END], together with the associated RoI features, segment (A/B) embeddings and positional embeddings, all fed to the Transformer with an MLM head.]

Figure 1: A view of the BERTGEN model during the training of an MMT sample: solid and dashed borders around the images represent full-image features and regional features, respectively. At test time, the most likely token y3 = argmax P(y_t | x, v, y_<t) replaces the [MASK] slot and generation continues with a fresh [MASK].
[...] bounding box coordinates). When encoding the non-visual positions, the same RoI feature vector for the full image is repeated (see Figure 1). We note that we do not fine-tune the object detector during training.

Sequence unrolling. An important aspect of BERTGEN is that it does not explicitly distinguish between the encoder and decoder blocks usually seen in sequence-to-sequence models. This is accomplished by formalising both encoding and generation using the MLM framework. Formally, let us consider the MMT task and define the maximum log-likelihood objective for a given triplet {x^(i), v^(i), y^(i)} where the target y^(i) has n tokens:

    L^{(i)} = \sum_{t=1}^{n} \log P\big(y_t^{(i)} \mid x^{(i)}; v^{(i)}; y_{<t}^{(i)}\big)

[...] self-attentive representations of early positions are naturally re-computed, in contrast to typical Transformer decoders. Finally, due to how inputs and outputs are represented in a single stream and encoded through shared self-attention, BERTGEN enforces an inductive bias towards a truly multi-modal and multi-lingual representation space.

Figure 3: A look at BERTGEN's self-attention: the connections denote that self-attentive representations are re-computed in every step. The generation ends when STOP is predicted. The smileys refer to RoI features.

Target language specifiers. Finally, to select the language during generation, input sequences begin with special target language specifiers (Ha et al., 2016; Johnson et al., 2017) (Figure 1). The specifier is task-agnostic, i.e. the same specifier [DE] is used both when captioning into German and when translating into German.

Training & hyper-parameters. We extend the base configuration of VL-BERT, which is a Transformer with 12 self-attention layers and 12 heads. The model and feed-forward dimensions are 768 and 3072, respectively. On a single 32GB V100 GPU, one epoch (§ 3) takes approximately two days to complete, as we could only fit one example per task (i.e. a batch size of 13) into memory. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with the base learning rate set to 1.3×10⁻⁵. The learning rate is warmed up in the first 16K steps and then decays linearly. We set the weight decay to 10⁻⁴. During training, we let the model update [...]
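To make sequence unrolling concrete, the sketch below (our illustration, not the authors' released code) expands one source-target pair into per-token MLM examples. The special-token layout loosely follows Figure 1, the visual [IMG] positions are omitted for brevity, the final [STOP]-prediction step is an assumption based on Figure 3, and all names are illustrative.

```python
from typing import List, Tuple

def unroll_example(src_tokens: List[str], tgt_tokens: List[str],
                   lang_specifier: str = "[DE]") -> List[Tuple[List[str], str]]:
    """Expand one source/target pair into len(tgt_tokens) + 1 MLM examples.

    Each unrolled example ends in a single [MASK] whose gold label is the
    next target token (or [STOP] once the target is exhausted), mirroring
    the objective L = sum_t log P(y_t | x; v; y_<t).
    """
    labels = tgt_tokens + ["[STOP]"]
    examples = []
    for t, label in enumerate(labels):
        inputs = ([lang_specifier, "[CLS]"] + src_tokens + ["[SEP]"]
                  + tgt_tokens[:t] + ["[MASK]", "[SEP]"])
        examples.append((inputs, label))
    return examples

# A 4-token target yields 5 unrolled examples (incl. the stop-prediction step).
print(len(unroll_example(["a", "man", "with", "a", "hat"],
                         ["ein", "mann", "mit", "hut"])))
```

Under this reading, the "Augm." column of Table 1 counts the unrolled examples, i.e. roughly one example per target word piece.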
Name       Type  Task    Sents    Augm.
FLICKR8K   IC    IM→TR   13,828   263K
MULTI30K   MMT   EN→DE   29,000   582K
MULTI30K   MMT   DE→EN   29,000   464K
MULTI30K   MMT   EN→FR   29,000   560K
MULTI30K   MMT   FR→EN   29,000   464K
FLICKR30K  IC    IM→EN   145,000  2.39M
FLICKR30K  IC    IM→DE   145,000  2.48M
IWSLT      MT    DE→EN   158,388  3.85M
IWSLT      MT    EN→DE   158,388  4.43M
IWSLT      MT    FR→EN   163,328  4.01M
IWSLT      MT    EN→FR   163,328  4.78M
SETIMES    MT    TR→EN   185,318  6.01M
SETIMES    MT    EN→TR   185,318  8.40M

Table 1: Training statistics of BERTGEN: the last column is the number of samples after sequence unrolling.

2.2.1 Image Captioning

Image captioning (IC) involves describing images in a specified natural language. We train BERTGEN for English, German and Turkish captioning tasks. Specifically, we use the FLICKR30K dataset (Young et al., 2014), which provides 29K training images, each with five English captions collected through crowd-sourcing. The validation and test sets contain approximately 1K images each. We use the MULTI30K dataset (Elliott et al., 2016), which annotates FLICKR30K images with five German captions. Finally, we use the TASVIRET dataset (Unal et al., 2016), which provides two Turkish captions for each of the 8,092 images in the FLICKR8K dataset (Rashtchian et al., 2010). Since FLICKR8K is a subset of FLICKR30K, we create a new split of TASVIRET to avoid data leakage between training and test splits. The resulting training, validation and test splits contain 6,914, 543 and 543 images, respectively.

To evaluate BERTGEN's performance on IC, we compare it against previous work with strong performance on COCO (Chen et al., 2015) and FLICKR30K, namely ADAPTIVE ATTENTION (SENTINEL) (Lu et al., 2017), which uses a sentinel token to distinguish between visual and non-visual representations, and NEURAL BABY TALK (NBT) (Lu et al., 2018), which follows a slot-filling approach through explicit object region information.

2.2.2 Multimodal Machine Translation

Multimodal Machine Translation (MMT) attempts to improve MT quality by incorporating information from modalities other than language (Sulubacak et al., 2020). In our case, we train BERTGEN for EN↔DE and EN↔FR MMT tasks and use the MULTI30K dataset, the main dataset for image-informed translation, which provides caption translations for FLICKR30K images in German and French. To evaluate BERTGEN on MMT tasks, we use the original 2016 test set, which contains 1,000 examples.

For a comprehensive comparison with previous work, we train a SoTA recurrent MMT (Caglayan et al., 2020) solely on the MULTI30K dataset, which applies a secondary (visual) attention in the decoder over the RoI features, i.e. the same features that are also used by BERTGEN (§ 2.1). There are two GRU (Cho et al., 2014) layers in both the encoder and the decoder, and the embedding & hidden dimensions in the model are set to 200 and 320, respectively. Each model has ∼5.6M parameters excluding the word embeddings.

Besides the state-of-the-art constrained recurrent MMT model described above, we further compare BERTGEN – which is trained on various other MT and IC corpora – to an unconstrained Transformer-based MMT trained on ∼9M additional EN→DE sentences (Libovický, 2019)⁴ in addition to MULTI30K.

⁴ We obtained test set outputs from the author and pre-processed them with the M-BERT tokeniser to ensure comparability.

2.2.3 Text-only Machine Translation

We incorporate six text-only MT tasks into our training protocol. We use EN↔DE and EN↔FR MT datasets from IWSLT'14 (Cettolo et al., 2012), which consist of TED Talks' subtitles and their translations. We use the prepare-iwslt14 recipe from FAIRSEQ (Ott et al., 2019) to prepare the dev and test sets. This yields an EN↔DE test set of 6,750 sentences, which consists of dev2010, dev2012.TEDX, tst2010, tst2011 and tst2012. Similarly, the EN↔FR test set consists of dev2010, tst2010, tst2011 and tst2012, which amounts to 4,493 sentences.

For the EN↔TR directions, we use the SETIMES2 (Tiedemann, 2012) news dataset for training. For development and test sets, we take the official WMT test sets (Bojar et al., 2018), namely newstest2016 and newstest2017 as the development set (6,007 sentences), and newstest2018 (6,000 sentences) as the test set. Both the IWSLT and SETIMES2 corpora are medium-scale resources often used in the MT research community, and they have much harder test sets than the MMT and IC tasks, due to a significant domain shift.

Finally, for each translation direction, we train a Transformer NMT model (Vaswani et al., 2017) using the IWSLT-DE-EN recipe of the FAIRSEQ toolkit (Ott et al., 2019). This recipe has six encoder and six decoder layers, each equipped with 4-head self-attention. The model and feed-forward dimensions are set to 512 and 1024, respectively. Each model has ∼31.5M parameters excluding the word embeddings. Since BERTGEN is a general-purpose multilingual and multimodal generator, we expect it to perform in the same ballpark as these strong NMT baselines, but not necessarily to be SoTA compared to novel & sophisticated NMT models, which also make use of a lot more training data.
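As a rough illustration of how the thirteen tasks in Table 1 can be combined into the one-example-per-task batches mentioned in § 2.1, the sketch below cycles over per-task streams of unrolled examples; the task identifiers and the `loaders` mapping are assumptions made for illustration only.

```python
import itertools
from typing import Dict, Iterable, Iterator, List

TASKS: List[str] = [
    "flickr8k_im2tr",
    "multi30k_mmt_en2de", "multi30k_mmt_de2en",
    "multi30k_mmt_en2fr", "multi30k_mmt_fr2en",
    "flickr30k_ic_im2en", "flickr30k_ic_im2de",
    "iwslt_de2en", "iwslt_en2de", "iwslt_fr2en", "iwslt_en2fr",
    "setimes_tr2en", "setimes_en2tr",
]  # 13 tasks, matching the rows of Table 1

def multitask_batches(loaders: Dict[str, Iterable]) -> Iterator[List]:
    """Yield batches containing exactly one unrolled example per task."""
    streams = {name: itertools.cycle(loaders[name]) for name in TASKS}
    while True:
        yield [next(streams[name]) for name in TASKS]  # batch size == 13
```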
3 Results and Findings

We train BERTGEN on lowercased sentences for 45 epochs, after which the overall performance on the tasks reached a plateau. We define one BERTGEN epoch as a single pass over all of the training data for the MULTI30K EN→DE MMT task and denote this task as the reference task. We use greedy search for all systems that we trained and merge back the word pieces before evaluation. We compute tokenised⁵ BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) using coco-caption⁶. In what follows, we provide detailed quantitative and qualitative findings.

⁵ Since M-BERT is aggressive in splitting apostrophes and hyphens, our results may slightly differ from other work.
⁶ https://github.com/tylin/coco-caption

3.1 Image Captioning

TASK     APPROACH    BL    MT    CR
F30K EN  BERTGEN     27.0  23.2  0.587
         SENTINEL‡   25.1  20.4  0.531
         NBT‡        27.1  21.7  0.575
COCO EN  BERTGEN     15.9  20.4  0.487
         NBT‡        34.7  27.1  1.072
F30K FR  BERTGEN      5.2  18.1  0.397
F8K TR   BERTGEN      8.5  14.5  0.363
F30K DE  BERTGEN     17.8  34.2  0.500

Table 2: BLEU (BL), METEOR (MT) and CIDEr (CR) scores for image captioning: gray background indicates zero-shot generation, whereas ‡ denotes systems decoded with beam search.

Table 2 provides an overview of BERTGEN's image captioning performance on different test sets and languages. First of all, on English FLICKR30K, BERTGEN is clearly able to outperform the strong captioning models (§ 2.2.1) SENTINEL (Lu et al., 2017) and NBT (Lu et al., 2018), even though they use beam search for decoding. On COCO (Chen et al., 2015), an image captioning corpus much larger and more diverse than FLICKR30K, we evaluate BERTGEN on Karpathy's test split (Karpathy and Fei-Fei, 2015) and notice that the scores are reasonable given that BERTGEN is not trained on COCO: our model lags behind NBT (with beam search) by 6.7 METEOR.

For zero-shot French captioning (F30K FR), we resort to the reference MMT translations from the MULTI30K EN→FR task, as there are no human references for French. Although this is problematic, as the metrics will penalise captions that are not translations of the English captions, we provide the scores to show that the zero-shot outputs are valid descriptions. We note that the low range of scores reported here is also due to having one reference caption instead of five references⁷ as in FLICKR30K. Finally, we report results for our custom Turkish split (§ 2.2.1) (F8K TR) and German (F30K DE). Even though there are no comparable results in the literature for these three tasks, we demonstrate through some qualitative examples that BERTGEN produces sensible outputs.

⁷ As a reference, evaluating the English captions using one reference at a time yields 7.9 BLEU on average, compared to 27.0 BLEU in Table 2.

Qualitative examples. We now focus on a few examples to examine the multilingual image captioning ability of BERTGEN in action (Table 3). For the first image, all captions are almost the same, as the image has few salient points. For the second image, however, we observe much more variation across captions, in line with the complexity of the scene. We are particularly surprised by the zero-shot French captioning performance, a task that BERTGEN is not trained for at all. Upon manual inspection, we noticed that the captions are often short, objective gists of the images. These observations also hold for the captions generated for the COCO test set, as we can see in the third example. A set of additional examples in the Appendix shows that BERTGEN does not simply retrieve caption translations learned from the EN→FR task. Overall, both the quantitative and qualitative results provide evidence of the utility of multimodal and multilingual initialisation, as well as the efficacy of knowledge transfer across different tasks, for image captioning.
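All of the BERTGEN outputs above are produced with the greedy, mask-and-append procedure implied by Figures 1 and 3 and used for all systems we trained (§ 3). A minimal sketch of that loop is given below; `mlm_scores` stands in for a forward pass of the trained network and, like the other names, is an assumption rather than the released interface.

```python
from typing import Callable, Dict, List

def greedy_generate(src_tokens: List[str], lang_specifier: str,
                    mlm_scores: Callable[[List[str]], Dict[str, float]],
                    max_len: int = 60, stop_token: str = "[STOP]") -> List[str]:
    """Predict the token at the [MASK] slot, keep it, append a fresh [MASK],
    and re-encode the whole sequence until STOP (or max_len) is reached."""
    generated: List[str] = []
    for _ in range(max_len):
        inputs = ([lang_specifier, "[CLS]"] + src_tokens + ["[SEP]"]
                  + generated + ["[MASK]", "[SEP]"])
        scores = mlm_scores(inputs)               # token -> score at the [MASK] slot
        next_token = max(scores, key=scores.get)  # greedy choice
        if next_token == stop_token:
            break
        generated.append(next_token)
    return generated
```

Because the full sequence is re-encoded at every step, the self-attentive representations of earlier positions are recomputed each time, as noted in § 2.1.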
EN: a man wearing a hat and glasses.
DE: ein mann mit hut und brille. (a man with hat and glasses)
TR: şapkalı ve gözlüklü bir adam. (a man with a hat and glasses.)
FR: un homme avec un chapeau et des lunettes. (a man with a hat and glasses.)

EN: two men are on a rooftop working on something.
DE: zwei männer arbeiten auf einem dach. (two men working on a roof)
TR: iki binanın inşasında oturmuş, yanyana yerde duran iki kişi. (two people seated in the construction of two buildings, standing next to each other on the ground.)
FR: trois ouvriers du bâtiment construisent un toit. (three construction workers build a roof.)

EN: a man in a red shirt and helmet is riding a motorbike on a dirt road.
DE: ein mann fährt mit einem motorrad auf einem weg an einem fluß entlang. (a man rides a motorcycle on a path along a river.)
TR: çamurlu bir yolda motoruyla ilerlemekte olan kırmızı üstlü bir adam ve arkasındaki dağ manzarası. (a man in a red top riding his bike down a muddy road with a mountain landscape behind him.)
FR: un homme avec un casque fait du motocross. (a man with a helmet rides motocross.)

Table 3: Multilingual image captioning examples: the sentences in parentheses are Google Translate translations of the DE, TR and FR captions into English. Gray background indicates zero-shot outputs. The last example is from COCO while the others are from FLICKR30K.

3.2 Multimodal Machine Translation

MMT     APPROACH                 BL    MT
EN→DE   BERTGEN⋆                 42.2  61.6
        Libovický (2019)⋆‡       40.8  59.2
        Caglayan et al. (2020)   37.8  56.9
        FAIRSEQ NMT              37.5  56.1
EN→FR   BERTGEN⋆                 68.0  81.2
        Libovický (2019)⋆‡       63.4  77.3
        FAIRSEQ NMT              61.5  75.5
        Caglayan et al. (2020)   61.0  75.3
DE→FR   BERTGEN⋆                 44.8  64.1
        Caglayan et al. (2020)   43.8  62.1
        FAIRSEQ NMT              41.7  60.7
FR→DE   BERTGEN⋆                 35.1  56.9
        FAIRSEQ NMT              35.1  53.6
        Caglayan et al. (2020)   33.5  53.1

Table 4: BLEU (BL) and METEOR (MT) scores for MMT: zero-shot systems are highlighted with gray. Systems marked with ⋆ and ‡ denote the use of auxiliary resources (i.e. unconstrained) and beam-search decoding, respectively.

Table 4 summarises BERTGEN's performance on MMT. First of all, BERTGEN consistently outperforms the Transformer-based FAIRSEQ NMT models and the recurrent MMT (Caglayan et al., 2020) models on both the EN→DE and the EN→FR language pairs. Furthermore, BERTGEN is also substantially better than a state-of-the-art unconstrained MMT (Libovický, 2019) model trained on a ∼6x larger parallel corpus.

Adversarial evaluation. Following Elliott (2018), we probe BERTGEN's ability to integrate multiple modalities effectively. Specifically, we decode translations after shuffling the {image, source caption} mappings so that the images do not correspond to the sentences to be translated. The EN→DE results showed that the incongruence leads to drops of 1.1 BLEU and 0.9 METEOR. For EN→FR, the drops are much more prominent, with 3.1 BLEU and 2.3 METEOR. This indicates that the visual features are not ignored, unlike in Caglayan et al. (2019), who showed that sequence-to-sequence MMT models can learn to ignore the images when the linguistic signal is sufficient to perform the task.

Zero-shot performance. The results in Table 4 show the surprising ability of BERTGEN to perform MMT on directions unseen during training. Moreover, the zero-shot performance surpasses strong MMT and NMT systems by up to 2 and 3.3 METEOR for DE→FR and FR→DE, respectively. Similar to the image captioning results, this demonstrates the potential of BERTGEN to generalise over a variety of language pairs and tasks.
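One way to implement the incongruent decoding used in the adversarial evaluation above is to derange the image-to-sentence mapping before decoding, so that no source sentence keeps its own image; the sketch below is purely illustrative and all names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

def incongruent_pairs(image_feats: Sequence, sources: Sequence,
                      seed: int = 0) -> List[Tuple]:
    """Pair every source sentence with another sentence's image features."""
    assert len(image_feats) == len(sources) > 1
    rng = random.Random(seed)
    idx = list(range(len(sources)))
    while True:
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):  # derangement: no fixed points
            break
    return [(image_feats[j], sources[i]) for i, j in enumerate(idx)]
```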
3.3 Machine Translation

First, we compare BERTGEN's performance to each task-specific FAIRSEQ system. According to Table 5, we observe that the translation quality of BERTGEN is generally superior to the strong FAIRSEQ systems, especially in METEOR, where BERTGEN leads in all pairs.

TASK            FAIRSEQ (BL / MT)   BERTGEN (BL / MT)
IWSLT EN→DE     27.4 / 47.1         27.8 / 48.4
IWSLT DE→EN     33.6 / 33.8         35.6 / 34.7
IWSLT EN→FR     41.0 / 59.8         40.2 / 60.5
IWSLT FR→EN     39.1 / 36.4         40.0 / 36.8
SETIMES EN→TR   14.1 / 18.9         13.5 / 19.1
SETIMES TR→EN   17.3 / 25.8         19.0 / 26.9

Table 5: Comparison of text-only MT performance of BERTGEN to each dedicated FAIRSEQ NMT system: BERTGEN outperforms the single models in most cases.

Second, we look at the learning efficiency by comparing the training curves of BERTGEN and each task-specific FAIRSEQ system (Figure 4). Here, the x axis represents how many times the specific task's training set has been seen by the models. BERTGEN is trained for 45 reference epochs (§ 3), and this corresponds to only a few complete passes over the training sets of the NMT tasks⁸. This is in contrast to the single-task systems, which usually require a large number of epochs for convergence. We notice a general trend and observe that BERTGEN tends to outperform single-task systems usually after only a few passes over the corresponding training set. Many factors could be contributing to this observation, such as sequence unrolling, multi-tasking, the shared input space or relevant inductive biases transferred from M-BERT. We partly address these in the ablation studies (§ 3.4) and leave further investigation to future work.

⁸ For example, only ∼3 passes over SETIMES EN→TR.

Figure 4: BERTGEN's learning efficiency on MT: validation scores are plotted against the number of full passes completed by BERTGEN and each FAIRSEQ model over the corresponding task's training set (panels: IWSLT DE↔EN, IWSLT EN↔FR, SETIMES EN↔TR). Best checkpoints' test set performances are given in Table 5.

Zero-shot performance. We use the DE↔FR test set from the WMT'19 shared task on news translation (Barrault et al., 2019) to assess the zero-shot translation capability of BERTGEN. This test set includes 1,701 sentences from news data regarding the European Elections. We compare our results to two shared task systems, namely TARTU (baseline) and MSRA (state-of-the-art) (Barrault et al., 2019), after re-tokenising them accordingly with M-BERT⁹. Although BERTGEN is expected to obtain lower scores than the dedicated WMT systems due to the domain mismatch of the test set, we consider both the quantitative (Table 6) and the qualitative results (Table 7) extremely encouraging.

⁹ TARTU is the baseline and MSRA is the best performing system for the shared task.

            DE→FR (BL / MT)   FR→DE (BL / MT)
BERTGEN     19.6 / 40.5       13.1 / 36.7
TARTU‡      39.5 / 59.0       26.3 / 47.3
MSRA‡       46.5 / 64.2       38.2 / 56.4

Table 6: Zero-shot BERTGEN performance on the WMT'19 test set: the TARTU and MSRA systems are not zero-shot as they are trained on DE↔FR corpora. Systems marked with ‡ are beam search outputs.

BERTGEN: la décision est tombée au 70ème anniversaire de ma femme. (the decision fell on my wife's 70th birthday.)
WMT REF: la décision est tombée le jour du 70ème anniversaire de ma femme. (the decision fell on my wife's 70th birthday.)

BERTGEN: en espagne, on s'est malheureusement habitué à une rôle double et passive. (in spain, we unfortunately got used to a double and passive role.)
WMT REF: en espagne, on s'est malheureusement habitué à un rôle secondaire, passif. (in spain, we unfortunately got used to a secondary, passive role.)

BERTGEN: pas parce que le président du fdp a dit quelque chose qu' ils ont défaillant leur vote. (not because the fdp president said something that they missed their vote.)
WMT REF: ce n' est pas parce que le président fédéral du fdp a dit quelque chose qu' ils ont refusé d' approuver. (it is not because the federal president of the fdp said something that they refused to approve.)

Table 7: Zero-shot DE→FR translations on the WMT'19 test set. The sentences in parentheses are Google Translate translations of the French sentences into English. Bold indicates important differences between BERTGEN and the references.

3.4 Ablation Studies

3.4.1 Impact of initialisation

We train single-task MMT systems on the MULTI30K EN→DE language pair. Specifically, we begin with a baseline system which is initialised with random weights. We then train a second baseline where only the visual processing layers are transferred from VL-BERT. Finally, we train a third baseline that is initialised similarly to BERTGEN, i.e. using the hybrid initialisation (§ 2.1). Figure 5 compares the validation BLEU scores of these three systems. We observe that the benefits of knowledge transfer from pre-trained models are incrementally positive; however, BERTGEN's hybrid initialisation outperforms the other two ablations.

Figure 5: Validation scores on MULTI30K EN→DE MMT for the initialisation ablation (Random, Visual Only, Hybrid), plotted against passes over the EN-DE MMT data: hybrid initialisation is the most beneficial strategy for BERTGEN.
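The hybrid strategy compared in this ablation merges the two pre-trained checkpoints described in § 2.1. Since the exact layer mapping is not reproduced here, the sketch below is only a schematic reading of it (VL-BERT weights for the visual and Transformer body, M-BERT weights for the multilingual token embeddings); the parameter-name patterns are assumptions, not the real checkpoint keys.

```python
from typing import Dict

def hybrid_init(random_init: Dict, vlbert_state: Dict, mbert_state: Dict) -> Dict:
    """Schematic merge: start from random weights, copy everything VL-BERT
    provides, then overwrite the token embedding table with M-BERT's weights
    so that the multilingual vocabulary is covered."""
    merged = dict(random_init)
    for name, tensor in vlbert_state.items():
        if name in merged:
            merged[name] = tensor          # visual layers + Transformer body
    for name, tensor in mbert_state.items():
        if name in merged and "word_embeddings" in name:
            merged[name] = tensor          # multilingual token embeddings
    return merged
```

Under this reading, the "Visual Only" baseline of Figure 5 corresponds to copying only VL-BERT's visual processing layers, and "Random" to skipping both copy steps.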
3.4.2 Impact of multi-task training

We now remove the multi-tasking aspect from BERTGEN to investigate the extent to which the performance improvements are related to the other tasks. Similar to § 3.4.1, we focus on the MULTI30K EN→DE MMT task and train a single-task, hybrid-initialised BERTGEN. Figure 6 compares the validation BLEU scores obtained by the default BERTGEN and the single-task variant. We observe that BERTGEN benefits from multi-task training and, more importantly, does not seem to exhibit patterns of catastrophic forgetting (French, 1999). Based on these observations, we expect similar model behaviour to hold for other tasks.

Figure 6: Validation scores on MULTI30K EN→DE MMT for the multi-tasking ablation (Single Task vs. Multi Task), plotted against passes over the EN-DE MMT data: the default multi-task BERTGEN outperforms the single-task one.

4 Related Work

4.1 Multimodal multilingual pre-training

Research in NLP and related fields has been increasingly focusing on transfer learning approaches where a model is first pre-trained on a data-rich task and then transferred to downstream tasks (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019). This framework presumably allows the model to capture useful inductive biases that generalise to a variety of NLP tasks, often after performing task-specific fine-tuning (Raffel et al., 2020). Of these, the most relevant studies to our work are BERT (Devlin et al., 2019) and its multilingual version M-BERT, which pre-train a Transformer (Vaswani et al., 2017) on large monolingual corpora using the masked language modelling (MLM) objective.
Recent research has also attempted to combine linguistic inputs with other modalities such as vision and speech, to achieve a grounded understanding of meaning. Successful approaches, including LXMERT (Tan and Bansal, 2019), VL-BERT (Su et al., 2020) and others (Lu et al., 2019; Li et al., 2020a,b), achieve this by combining BERT's MLM objective with auxiliary tasks such as masked region classification and image-sentence matching, and pre-train their models on large-scale image captioning corpora (Chen et al., 2015; Sharma et al., 2018). Similarly, SpeechBERT extends BERT by jointly training on speech and text data (Chuang et al., 2020). Although SoTA results are reported by these approaches, they focus on unimodal and multimodal natural language understanding (NLU) tasks, with a strong emphasis on English. The backbone of BERTGEN combines VL-BERT (Su et al., 2020) with M-BERT (Devlin et al., 2019) to realise a multilingual and multimodal generator that can be used for a diverse set of generative tasks and languages rather than NLU tasks.

4.2 Pre-training for generative tasks

Previous work has studied how to benefit from pre-trained BERT models in generative tasks such as NMT (Imamura and Sumita, 2019; Clinchant et al., 2019; Zhu et al., 2020). BERTGEN differs from these as it is not fine-tuned for a particular MT corpus, and it exhibits multi-lingual and multi-modal properties for general purpose generation. Another related branch of work explores pre-training strategies specific to sequence-to-sequence tasks. This includes MASS (Song et al., 2019), which exploits an encoder-decoder framework with the MLM objective for task-specific generative pre-training, and UniLM (Dong et al., 2019), which introduces uni-directional, bi-directional and sequence-to-sequence LM objectives by carefully adjusting the self-attention masks during training. Zhou et al. (2020) extend UniLM to vision & language pre-training using Conceptual Captions (Sharma et al., 2018) as the pre-training dataset. However, these models require a further fine-tuning step for generative tasks, unlike BERTGEN, which is trained only once.

4.3 Multi-task learning for generation

Several approaches exist for multi-task learning & generation (Dong et al., 2015; Luong et al., 2016) in NLP, especially in multilingual NMT, where tasks denote different language pairs (Zoph and Knight, 2016; Firat et al., 2016). The multi-task (and zero-shot) generation ability of BERTGEN is mostly inspired by Ha et al. (2016) and Johnson et al. (2017). Both of these introduced target language specifiers to select the output language when decoding translations from their models. Our multilingual & multimodal take on multi-task generation is most similar to Kaiser et al. (2017), where a single Transformer model is trained on different tasks including image captioning, object classification, machine translation, speech recognition and parsing. However, their architecture depends on particular structures such as encoders, decoders, modality-specific networks and I/O mixers, unlike BERTGEN, which does not require task-specific modules.

5 Conclusions

In this paper, we presented BERTGEN, a novel generative, decoder-only model which extends BERT by combining multimodal and multilingual pre-trained models. Our findings show that BERTGEN obtains strong performance on a variety of generative tasks and further generalises over unseen tasks. Importantly, our model demonstrates the potential for general-purpose (instead of task-specific) generation that goes above and beyond the traditional pre-training and fine-tuning practices. BERTGEN is also parameter efficient: it has 89.3M total parameters and is trained on thirteen tasks encompassing MT, multimodal MT and image captioning, whereas each of the single-task FAIRSEQ NMT baselines alone has 31.5M parameters.

Our ablation studies show that BERTGEN is able to efficiently transfer relevant inductive biases from the pre-trained models and benefits from multi-task learning without suffering from catastrophic forgetting. We hope that these findings will motivate future research in exploiting more sophisticated pre-trained models in place of M-BERT, VL-BERT and others.

Acknowledgments

This paper is a follow-up work to the MSc thesis of Faidon Mitzalis, co-supervised by Prof. Lucia Specia and Dr. Ozan Caglayan. Lucia Specia, Pranava Madhyastha and Ozan Caglayan received support from the MultiMT project (H2020 ERC Starting Grant No. 678017). Lucia Specia also received support from the Air Force Office of Scientific Research (under award number FA8655-20-1-7006).
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy.
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium.
Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, and Lucia Specia. 2020. Simultaneous machine translation with visual context. In Proceedings of EMNLP 2020, pages 2350–2361, Online.
Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 261–268, Trento, Italy.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data collection and evaluation server.
Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2020. Distilling knowledge learned in BERT for text generation. In Proceedings of ACL 2020, pages 7893–7905, Online.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP 2014, pages 1724–1734, Doha, Qatar.
Yung-Sung Chuang, Chi-Liang Liu, Hung-yi Lee, and Lin-shan Lee. 2020. SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering. In Proc. Interspeech 2020, pages 4168–4172.
Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 108–117, Hong Kong.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of ACL-IJCNLP 2015 (Volume 1: Long Papers), pages 1723–1732, Beijing, China.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, volume 32, pages 13063–13075.
Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of EMNLP 2018, pages 2974–2978, Brussels, Belgium.
Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL-HLT 2016, pages 866–875, San Diego, California.
Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Conference on Spoken Language Translation.
Kenji Imamura and Eiichiro Sumita. 2019. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 23–31, Hong Kong.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pages 11336–11344.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision – ECCV 2020, pages 121–137, Cham.
Jindřich Libovický. 2019. Multimodality in Machine Translation. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czechia.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3242–3250.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228.
Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pages 6297–6308, Red Hook, NY, USA.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019 (Demonstrations), pages 48–53, Minneapolis, Minnesota.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018 (Volume 1: Long Papers), pages 2227–2237, New Orleans, Louisiana.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147, Los Angeles.
Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL 2018 (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936.
Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.
Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, pages 1–51.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP 2019, pages 5100–5111, Hong Kong, China.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey.
Mesut Erhan Unal, Begum Citamak, Semih Yagcioglu, Aykut Erdem, Erkut Erdem, Nazli Ikizler Cinbis, and Ruket Cakici. 2016. TasvirEt: A benchmark dataset for automatic Turkish description generation from images. In 2016 24th Signal Processing and Communication Application Conference (SIU), pages 1977–1980.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP 2020: System Demonstrations, pages 38–45, Online.
Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9378–9385.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):13041–13049.
Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. In 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia.
Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of NAACL-HLT 2016, pages 30–34, San Diego, California.
A Qualitative Examples

BERTGEN: un groupe de jeunes sont réunis, et ils sont debout à l'extérieur d'un bâtiment. (a group of young people are gathered, and they are standing outside a building.)
MT REF: des gens debout devant un bâtiment. (people standing outside of a building.)

BERTGEN: une petite fille lit un livre. (a little girl reads a book.)
MT REF: un jeune enfant dormant dans son lit avec un livre ouvert sur sa poitrine. (a young child sleeping in her bed with an open book on her chest.)

BERTGEN: des manifestants avec des pancartes. (demonstrators with placards.)
MT REF: des familles de militaires défilent dans new york un jour de pluie. (military families are marching through new york on a rainy day.)

Table 8: Zero-shot French image captioning examples for the FLICKR30K test set.
EN: a group of people are riding on elephants through a river.
FR: un groupe de personnes sur des chevaux sur un bateau dans un ruisseau. (a group of people on horses on a boat in a stream.)
DE: eine gruppe reiter fährt auf einem fluss. (a group of riders is riding on a river.)
TR: bir grup insan bir derede duran dört tane at ile ilerliyorlar. (a group of people are moving with four horses standing in a stream.)

EN: a black dog is playing with a yellow toy in the grass.
FR: un chien avec une frisbee dans la pelouse. (a dog with a frisbee in the lawn.)
DE: ein schwarzer hund mit rotem halsband spielt mit einem gelben ball auf einer wiese. (a black dog with a red collar is playing with a yellow ball in a meadow.)
TR: yeşil bir topu ısırmaya çalışan siyah bir köpek. (a black dog trying to bite a green ball.)

EN: a boy in a red shirt and white shorts is playing tennis.
FR: un tennisteur frappe une balle. (a tennisteur hits a ball.)
DE: ein junge spielt tennis. (a boy is playing tennis.)
TR: tenis raketi ile topa vuran çocuk. (boy hitting the ball with a tennis racket.)

Table 9: COCO captioning examples: the captions in parentheses are Google translations into English for the non-English examples. Bold phrases highlight lexical variations or errors related to the salient visual concepts in the images. The last example shows a morphological error that BERTGEN makes when trying to generate "tennis player" in French.
BERTGEN: mehrere ngos, darunter die mozilla- und greenpeace-stiftung, schätzen, dass diese neuen werkzeuge unfähig sind und zu spät kommen. (several ngos, including the mozilla and greenpeace foundations, estimate that these new tools are incapable and come too late.)
WMT REF: mehrere ngos, unter denen die mozilla-stiftung und greenpeace, schätzen, dass diese neuen tools unzureichend sind und zu spät kommen. (several ngos, including the mozilla foundation and greenpeace, estimate that these new tools are inadequate and come too late.)

BERTGEN: immigration wird als ein großes problem für die ue betrachtet, für 45 prozent der deutschen und 40 prozent aller europäischen. (immigration is seen as a big problem for the ue, for 45 percent of germans and 40 percent of all european ones.)
WMT REF: die einwanderung halten 45 prozent der deutschen und 40 prozent aller europäer für das größte problem der eu. (45 percent of germans and 40 percent of all europeans consider immigration to be the biggest problem in the eu.)

BERTGEN: das ist der grund, warum er in seinem buch die frage erforscht, ob es alternativen zu wahlen gibt. (that is the reason why he explores the question of whether there are alternatives to choose from in his book.)
WMT REF: deshalb geht er in seinem buch der frage nach, ob es alternativen zu wahlen gibt. (therefore, in his book, he investigates the question of whether there are alternatives to choose from.)

Table 10: Zero-shot FR→DE translations of the WMT'19 test set. The sentences in parentheses are Google Translate translations of the German outputs into English.