BERTGEN: Multi-task Generation through BERT
Faidon Mitzalis¹, Ozan Caglayan¹, Pranava Madhyastha¹, Lucia Specia¹,²
¹ Department of Computing, Imperial College London, UK
² Department of Computer Science, University of Sheffield, UK
phaedonmit@gmail.com, {o.caglayan,pranava,lspecia}@ic.ac.uk

arXiv:2106.03484v1 [cs.CL] 7 Jun 2021

Abstract

We present BERTGEN, a novel generative, decoder-only model which extends BERT by fusing the multimodal and multilingual pre-trained models VL-BERT and M-BERT, respectively. BERTGEN is auto-regressively trained for language generation tasks, namely image captioning, machine translation and multimodal machine translation, under a multi-task setting. With a comprehensive set of evaluations, we show that BERTGEN outperforms many strong baselines across the tasks explored. We also show BERTGEN's ability for zero-shot language generation, where it exhibits competitive performance to supervised counterparts. Finally, we conduct ablation studies which demonstrate that BERTGEN substantially benefits from multi-tasking and effectively transfers relevant inductive biases from the pre-trained models.

1 Introduction

Recent work in unsupervised and self-supervised pre-training has revolutionised the field of natural language understanding (NLU), resulting in high performance ceilings across multiple tasks (Devlin et al., 2019; Yang et al., 2019; Dong et al., 2019). The recent success of language model pre-training with masked language modelling (MLM), such as BERT (Devlin et al., 2019), further paved the way for more complex approaches that combine language pre-training with images (Tan and Bansal, 2019; Su et al., 2020; Lu et al., 2020), video (Sun et al., 2019), and speech (Chuang et al., 2020). Most of these approaches follow a task-specific fine-tuning step after the model is pre-trained.

However, there has been little work on exploiting pre-trained MLMs for natural language generation (NLG) tasks. Previous work argues that the MLM objective is ill-suited for generation tasks such as machine translation (Yang et al., 2019; Rothe et al., 2020). Recent work in this direction has predominantly investigated the use of pre-trained models either to initialise Transformer-based encoder-decoder models (Imamura and Sumita, 2019; Clinchant et al., 2019; Yang et al., 2020; Rothe et al., 2020) or to distill knowledge for sequence generation tasks (Chen et al., 2020).

In this work, we present BERTGEN, which extends BERT in a generative setting (§ 2.1). This results in a single generator – without a separation between the encoder and the decoder – capable of consuming multiple input modalities and generating in multiple languages. The latter features are achieved by transferring knowledge from state-of-the-art pre-trained models, namely VL-BERT (Su et al., 2020) and multilingual BERT (M-BERT) (Devlin et al., 2019). We train BERTGEN on various tasks, including image captioning, machine translation and multimodal machine translation, with datasets in four different languages (§ 2.2).

Based on a number of experiments, our findings (§ 3) show that BERTGEN (i) is surprisingly versatile, as it is capable of describing images and performing translation in unimodal and multimodal settings, across all languages; (ii) generalises well across zero-shot image captioning, multimodal machine translation and out-of-domain news translation tasks; and finally (iii) is parameter efficient when compared to the state-of-the-art models for each of the tasks combined together.

2 Method

In this section, we describe BERTGEN and the tasks we explore. We then detail the baselines and SoTA systems that we compare against.

2.1 Model

This section details the main aspects of BERTGEN that distinguish it from existing work on vision & language pre-training.
[Figure 1 depicts the input stream for an MMT training sample: a target language specifier [DE], [CLS], source tokens x1 … x7, [SEP], the target prefix y1 y2, a [MASK] slot, [SEP], image positions [IMG] [IMG] [IMG] and [END], together with the associated RoI features, segment (A/B) embeddings and positional embeddings, all fed to the Transformer with an MLM head.]

Figure 1: A view of the BERTGEN model during the training of an MMT sample: solid and dashed borders around the images represent full-image features and regional features, respectively. At test time, the most likely token y3 = argmax P(y_t | x, v, y_<t) replaces the [MASK] slot and generation continues with a fresh [MASK].
[...] bounding box coordinates). When encoding the non-visual positions, the same RoI feature vector for the full image is repeated (see Figure 1). We note that we do not fine-tune the object detector during training.

Sequence unrolling. An important aspect of BERTGEN is that it does not explicitly distinguish between the encoder and decoder blocks usually seen in sequence-to-sequence models. This is accomplished by formalising both encoding and generation using the MLM framework. Formally, let us consider the MMT task and define the maximum log-likelihood objective for a given triplet {x^(i), v^(i), y^(i)} where the target y^(i) has n tokens:

    L^{(i)} = \sum_{t=1}^{n} \log P\big(y_t^{(i)} \mid x^{(i)}; v^{(i)}; y_{<t}^{(i)}\big)

[...] self-attentive representations of early positions are naturally re-computed, in contrast to typical Transformer decoders. Finally, due to how inputs and outputs are represented in a single stream and encoded through shared self-attention, BERTGEN enforces an inductive bias towards a truly multi-modal and multi-lingual representation space.

Figure 3: A look at BERTGEN's self-attention: the connections denote that self-attentive representations are re-computed in every step. The generation ends when STOP is predicted. The smileys refer to RoI features.

Target language specifiers. Finally, to select the language during generation, input sequences begin with special target language specifiers (Ha et al., 2016; Johnson et al., 2017) (Figure 1). The specifier is task-agnostic, i.e. the same specifier [DE] is used both when captioning into German and when translating into German.

Training & hyper-parameters. We extend the base configuration of VL-BERT, which is a Transformer with 12 self-attention layers and 12 heads. The model and feed-forward dimensions are 768 and 3072, respectively. On a single 32GB V100 GPU, one epoch (§ 3) takes approximately two days to complete, as we could only fit one example per task (i.e. a batch size of 13) into memory. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with the base learning rate set to 1.3×10⁻⁵. The learning rate is warmed up in the first 16K steps and then decays linearly. We set the weight decay to 10⁻⁴. During training, we let the model update [...]
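To make sequence unrolling concrete, the sketch below (our illustration, not the authors' released code) expands one source-target pair into per-token MLM examples. The special-token layout loosely follows Figure 1, the visual [IMG] positions are omitted for brevity, the final [STOP]-prediction step is an assumption based on Figure 3, and all names are illustrative.

```python
from typing import List, Tuple

def unroll_example(src_tokens: List[str], tgt_tokens: List[str],
                   lang_specifier: str = "[DE]") -> List[Tuple[List[str], str]]:
    """Expand one source/target pair into len(tgt_tokens) + 1 MLM examples.

    Each unrolled example ends in a single [MASK] whose gold label is the
    next target token (or [STOP] once the target is exhausted), mirroring
    the objective L = sum_t log P(y_t | x; v; y_<t).
    """
    labels = tgt_tokens + ["[STOP]"]
    examples = []
    for t, label in enumerate(labels):
        inputs = ([lang_specifier, "[CLS]"] + src_tokens + ["[SEP]"]
                  + tgt_tokens[:t] + ["[MASK]", "[SEP]"])
        examples.append((inputs, label))
    return examples

# A 4-token target yields 5 unrolled examples (incl. the stop-prediction step).
print(len(unroll_example(["a", "man", "with", "a", "hat"],
                         ["ein", "mann", "mit", "hut"])))
```

Under this reading, the "Augm." column of Table 1 counts the unrolled examples, i.e. roughly one example per target word piece.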
Name       Type  Task    Sents    Augm.
FLICKR8K   IC    IM→TR   13,828   263K
MULTI30K   MMT   EN→DE   29,000   582K
MULTI30K   MMT   DE→EN   29,000   464K
MULTI30K   MMT   EN→FR   29,000   560K
MULTI30K   MMT   FR→EN   29,000   464K
FLICKR30K  IC    IM→EN   145,000  2.39M
FLICKR30K  IC    IM→DE   145,000  2.48M
IWSLT      MT    DE→EN   158,388  3.85M
IWSLT      MT    EN→DE   158,388  4.43M
IWSLT      MT    FR→EN   163,328  4.01M
IWSLT      MT    EN→FR   163,328  4.78M
SETIMES    MT    TR→EN   185,318  6.01M
SETIMES    MT    EN→TR   185,318  8.40M

Table 1: Training statistics of BERTGEN: the last column is the number of samples after sequence unrolling.

2.2.1 Image Captioning

Image captioning (IC) involves describing images in a specified natural language. We train BERTGEN for English, German and Turkish captioning tasks. Specifically, we use the FLICKR30K dataset (Young et al., 2014), which provides 29K training images, each with five English captions collected through crowd-sourcing. The validation and test sets contain approximately 1K images each. We use the MULTI30K dataset (Elliott et al., 2016), which annotates FLICKR30K images with five German captions. Finally, we use the TASVIRET dataset (Unal et al., 2016), which provides two Turkish captions for each of the 8,092 images in the FLICKR8K dataset (Rashtchian et al., 2010). Since FLICKR8K is a subset of FLICKR30K, we create a new split of TASVIRET to avoid data leakage between training and test splits. The resulting training, validation and test splits contain 6,914, 543 and 543 images, respectively.

To evaluate BERTGEN's performance on IC, we compare it against previous work with strong performance on COCO (Chen et al., 2015) and FLICKR30K, namely ADAPTIVE ATTENTION (SENTINEL) (Lu et al., 2017), which uses a sentinel token to distinguish between visual and non-visual representations, and NEURAL BABY TALK (NBT) (Lu et al., 2018), which follows a slot-filling approach through explicit object region information.

2.2.2 Multimodal Machine Translation

Multimodal Machine Translation (MMT) attempts to improve MT quality by incorporating information from modalities other than language (Sulubacak et al., 2020). In our case, we train BERTGEN for EN↔DE and EN↔FR MMT tasks and use the MULTI30K dataset, the main dataset for image-informed translation, which provides caption translations for FLICKR30K images in German and French. To evaluate BERTGEN on MMT tasks, we use the original 2016 test set, which contains 1,000 examples.

For a comprehensive comparison with previous work, we train a SoTA recurrent MMT (Caglayan et al., 2020) solely on the MULTI30K dataset, which applies a secondary (visual) attention in the decoder over the RoI features, i.e. the same features that are also used by BERTGEN (§ 2.1). There are two GRU (Cho et al., 2014) layers in both the encoder and the decoder, and the embedding & hidden dimensions in the model are set to 200 and 320, respectively. Each model has ∼5.6M parameters excluding the word embeddings.

Besides the state-of-the-art constrained recurrent MMT model described above, we further compare BERTGEN – which is trained on various other MT and IC corpora – to an unconstrained Transformer-based MMT trained on ∼9M additional EN→DE sentences (Libovický, 2019)⁴ in addition to MULTI30K.

⁴ We obtained test set outputs from the author and pre-processed them with the M-BERT tokeniser to ensure comparability.

2.2.3 Text-only Machine Translation

We incorporate six text-only MT tasks into our training protocol. We use EN↔DE and EN↔FR MT datasets from IWSLT'14 (Cettolo et al., 2012), which consist of TED Talks' subtitles and their translations. We use the prepare-iwslt14 recipe from FAIRSEQ (Ott et al., 2019) to prepare the dev and test sets. This yields an EN↔DE test set of 6,750 sentences, which consists of dev2010, dev2012.TEDX, tst2010, tst2011 and tst2012. Similarly, the EN↔FR test set consists of dev2010, tst2010, tst2011 and tst2012, which amounts to 4,493 sentences.

For the EN↔TR directions, we use the SETIMES2 (Tiedemann, 2012) news dataset for training. For development and test sets, we take the official WMT test sets (Bojar et al., 2018), namely newstest2016 and newstest2017 as the development set (6,007 sentences), and newstest2018 (6,000 sentences) as the test set. Both the IWSLT and SETIMES2 corpora are medium-scale resources often used in the MT research community, and they have much harder test sets than the MMT and IC tasks, due to a significant domain shift.

Finally, for each translation direction, we train a Transformer NMT model (Vaswani et al., 2017) using the IWSLT-DE-EN recipe of the FAIRSEQ toolkit (Ott et al., 2019). This recipe has six encoder and six decoder layers, each equipped with 4-head self-attention. The model and feed-forward dimensions are set to 512 and 1024, respectively. Each model has ∼31.5M parameters excluding the word embeddings. Since BERTGEN is a general-purpose multilingual and multimodal generator, we expect it to perform in the same ballpark as these strong NMT baselines, but not necessarily to be SoTA compared to novel & sophisticated NMT models, which also make use of a lot more training data.
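As a rough illustration of how the thirteen tasks in Table 1 can be combined into the one-example-per-task batches mentioned in § 2.1, the sketch below cycles over per-task streams of unrolled examples; the task identifiers and the `loaders` mapping are assumptions made for illustration only.

```python
import itertools
from typing import Dict, Iterable, Iterator, List

TASKS: List[str] = [
    "flickr8k_im2tr",
    "multi30k_mmt_en2de", "multi30k_mmt_de2en",
    "multi30k_mmt_en2fr", "multi30k_mmt_fr2en",
    "flickr30k_ic_im2en", "flickr30k_ic_im2de",
    "iwslt_de2en", "iwslt_en2de", "iwslt_fr2en", "iwslt_en2fr",
    "setimes_tr2en", "setimes_en2tr",
]  # 13 tasks, matching the rows of Table 1

def multitask_batches(loaders: Dict[str, Iterable]) -> Iterator[List]:
    """Yield batches containing exactly one unrolled example per task."""
    streams = {name: itertools.cycle(loaders[name]) for name in TASKS}
    while True:
        yield [next(streams[name]) for name in TASKS]  # batch size == 13
```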
3 Results and Findings

We train BERTGEN on lowercased sentences for 45 epochs, after which the overall performance on the tasks reached a plateau. We define one BERTGEN epoch as a single pass over all of the training data for the MULTI30K EN→DE MMT task and denote this task as the reference task. We use greedy search for all systems that we trained and merge back the word pieces before evaluation. We compute tokenised⁵ BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) using coco-caption⁶. In what follows, we provide detailed quantitative and qualitative findings.

⁵ Since M-BERT is aggressive in splitting apostrophes and hyphens, our results may slightly differ from other work.
⁶ https://github.com/tylin/coco-caption

3.1 Image Captioning

TASK     APPROACH    BL    MT    CR
F30K EN  BERTGEN     27.0  23.2  0.587
         SENTINEL‡   25.1  20.4  0.531
         NBT‡        27.1  21.7  0.575
COCO EN  BERTGEN     15.9  20.4  0.487
         NBT‡        34.7  27.1  1.072
F30K FR  BERTGEN      5.2  18.1  0.397
F8K TR   BERTGEN      8.5  14.5  0.363
F30K DE  BERTGEN     17.8  34.2  0.500

Table 2: BLEU (BL), METEOR (MT) and CIDEr (CR) scores for image captioning: gray background indicates zero-shot generation, whereas ‡ denotes systems decoded with beam search.

Table 2 provides an overview of BERTGEN's image captioning performance on different test sets and languages. First of all, on English FLICKR30K, BERTGEN is clearly able to outperform the strong captioning models (§ 2.2.1) SENTINEL (Lu et al., 2017) and NBT (Lu et al., 2018), even though they use beam search for decoding. On COCO (Chen et al., 2015), an image captioning corpus much larger and more diverse than FLICKR30K, we evaluate BERTGEN on Karpathy's test split (Karpathy and Fei-Fei, 2015) and notice that the scores are reasonable given that BERTGEN is not trained on COCO: our model lags behind NBT (with beam search) by 6.7 METEOR.

For zero-shot French captioning (F30K FR), we resort to the reference MMT translations from the MULTI30K EN→FR task, as there are no human references for French. Although this is problematic, as the metrics will penalise captions that are not translations of the English captions, we provide the scores to show that the zero-shot outputs are valid descriptions. We note that the low range of scores reported here is also due to having one reference caption instead of five references⁷ as in FLICKR30K. Finally, we report results for our custom Turkish split (§ 2.2.1) (F8K TR) and German (F30K DE). Even though there are no comparable results in the literature for these three tasks, we demonstrate through some qualitative examples that BERTGEN produces sensible outputs.

⁷ As a reference, evaluating the English captions using one reference at a time yields 7.9 BLEU on average, compared to 27.0 BLEU in Table 2.

Qualitative examples. We now focus on a few examples to examine the multilingual image captioning ability of BERTGEN in action (Table 3). For the first image, all captions are almost the same, as the image has few salient points. For the second image, however, we observe much more variation across captions, in line with the complexity of the scene. We are particularly surprised by the zero-shot French captioning performance, a task that BERTGEN is not trained for at all. Upon manual inspection, we noticed that the captions are often short, objective gists of the images. These observations also hold for the captions generated for the COCO test set, as we can see in the third example. A set of additional examples in the Appendix shows that BERTGEN does not simply retrieve caption translations learned from the EN→FR task. Overall, both the quantitative and qualitative results provide evidence of the utility of multimodal and multilingual initialisation, as well as the efficacy of knowledge transfer across different tasks, for image captioning.
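All of the BERTGEN outputs above are produced with the greedy, mask-and-append procedure implied by Figures 1 and 3 and used for all systems we trained (§ 3). A minimal sketch of that loop is given below; `mlm_scores` stands in for a forward pass of the trained network and, like the other names, is an assumption rather than the released interface.

```python
from typing import Callable, Dict, List

def greedy_generate(src_tokens: List[str], lang_specifier: str,
                    mlm_scores: Callable[[List[str]], Dict[str, float]],
                    max_len: int = 60, stop_token: str = "[STOP]") -> List[str]:
    """Predict the token at the [MASK] slot, keep it, append a fresh [MASK],
    and re-encode the whole sequence until STOP (or max_len) is reached."""
    generated: List[str] = []
    for _ in range(max_len):
        inputs = ([lang_specifier, "[CLS]"] + src_tokens + ["[SEP]"]
                  + generated + ["[MASK]", "[SEP]"])
        scores = mlm_scores(inputs)               # token -> score at the [MASK] slot
        next_token = max(scores, key=scores.get)  # greedy choice
        if next_token == stop_token:
            break
        generated.append(next_token)
    return generated
```

Because the full sequence is re-encoded at every step, the self-attentive representations of earlier positions are recomputed each time, as noted in § 2.1.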
EN: a man wearing a hat and glasses.
DE: ein mann mit hut und brille. (a man with hat and glasses)
TR: şapkalı ve gözlüklü bir adam. (a man with a hat and glasses.)
FR: un homme avec un chapeau et des lunettes. (a man with a hat and glasses.)

EN: two men are on a rooftop working on something.
DE: zwei männer arbeiten auf einem dach. (two men working on a roof)
TR: iki binanın inşasında oturmuş, yanyana yerde duran iki kişi. (two people seated in the construction of two buildings, standing next to each other on the ground.)
FR: trois ouvriers du bâtiment construisent un toit. (three construction workers build a roof.)

EN: a man in a red shirt and helmet is riding a motorbike on a dirt road.
DE: ein mann fährt mit einem motorrad auf einem weg an einem fluß entlang. (a man rides a motorcycle on a path along a river.)
TR: çamurlu bir yolda motoruyla ilerlemekte olan kırmızı üstlü bir adam ve arkasındaki dağ manzarası. (a man in a red top riding his bike down a muddy road with a mountain landscape behind him.)
FR: un homme avec un casque fait du motocross. (a man with a helmet rides motocross.)

Table 3: Multilingual image captioning examples: the sentences in parentheses are Google Translate translations of the DE, TR and FR captions into English. Gray background indicates zero-shot outputs. The last example is from COCO while the others are from FLICKR30K.

3.2 Multimodal Machine Translation

MMT     APPROACH                 BL    MT
EN→DE   BERTGEN⋆                 42.2  61.6
        Libovický (2019)⋆‡       40.8  59.2
        Caglayan et al. (2020)   37.8  56.9
        FAIRSEQ NMT              37.5  56.1
EN→FR   BERTGEN⋆                 68.0  81.2
        Libovický (2019)⋆‡       63.4  77.3
        FAIRSEQ NMT              61.5  75.5
        Caglayan et al. (2020)   61.0  75.3
DE→FR   BERTGEN⋆                 44.8  64.1
        Caglayan et al. (2020)   43.8  62.1
        FAIRSEQ NMT              41.7  60.7
FR→DE   BERTGEN⋆                 35.1  56.9
        FAIRSEQ NMT              35.1  53.6
        Caglayan et al. (2020)   33.5  53.1

Table 4: BLEU (BL) and METEOR (MT) scores for MMT: zero-shot systems are highlighted with gray. Systems marked with ⋆ and ‡ denote the use of auxiliary resources (i.e. unconstrained) and beam-search decoding, respectively.

Table 4 summarises BERTGEN's performance on MMT. First of all, BERTGEN consistently outperforms the Transformer-based FAIRSEQ NMT models and the recurrent MMT (Caglayan et al., 2020) models on both the EN→DE and the EN→FR language pairs. Furthermore, BERTGEN is also substantially better than a state-of-the-art unconstrained MMT (Libovický, 2019) model trained on a ∼6x larger parallel corpus.

Adversarial evaluation. Following Elliott (2018), we probe BERTGEN's ability to integrate multiple modalities effectively. Specifically, we decode translations after shuffling the {image, source caption} mappings so that the images do not correspond to the sentences to be translated. The EN→DE results showed that the incongruence leads to drops of 1.1 BLEU and 0.9 METEOR. For EN→FR, the drops are much more prominent, with 3.1 BLEU and 2.3 METEOR. This indicates that the visual features are not ignored, unlike in Caglayan et al. (2019), who showed that sequence-to-sequence MMT models can learn to ignore the images when the linguistic signal is sufficient to perform the task.

Zero-shot performance. The results in Table 4 show the surprising ability of BERTGEN to perform MMT on directions unseen during training. Moreover, the zero-shot performance surpasses strong MMT and NMT systems by up to 2 and 3.3 METEOR for DE→FR and FR→DE, respectively. Similar to the image captioning results, this demonstrates the potential of BERTGEN to generalise over a variety of language pairs and tasks.
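One way to implement the incongruent decoding used in the adversarial evaluation above is to derange the image-to-sentence mapping before decoding, so that no source sentence keeps its own image; the sketch below is purely illustrative and all names are assumptions.

```python
import random
from typing import List, Sequence, Tuple

def incongruent_pairs(image_feats: Sequence, sources: Sequence,
                      seed: int = 0) -> List[Tuple]:
    """Pair every source sentence with another sentence's image features."""
    assert len(image_feats) == len(sources) > 1
    rng = random.Random(seed)
    idx = list(range(len(sources)))
    while True:
        rng.shuffle(idx)
        if all(i != j for i, j in enumerate(idx)):  # derangement: no fixed points
            break
    return [(image_feats[j], sources[i]) for i, j in enumerate(idx)]
```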
3.3 Machine Translation

First, we compare BERTGEN's performance to each task-specific FAIRSEQ system. According to Table 5, we observe that the translation quality of BERTGEN is generally superior to the strong FAIRSEQ systems, especially in METEOR, where BERTGEN leads in all pairs.

TASK            FAIRSEQ (BL / MT)   BERTGEN (BL / MT)
IWSLT EN→DE     27.4 / 47.1         27.8 / 48.4
IWSLT DE→EN     33.6 / 33.8         35.6 / 34.7
IWSLT EN→FR     41.0 / 59.8         40.2 / 60.5
IWSLT FR→EN     39.1 / 36.4         40.0 / 36.8
SETIMES EN→TR   14.1 / 18.9         13.5 / 19.1
SETIMES TR→EN   17.3 / 25.8         19.0 / 26.9

Table 5: Comparison of text-only MT performance of BERTGEN to each dedicated FAIRSEQ NMT system: BERTGEN outperforms the single models in most cases.

Second, we look at the learning efficiency by comparing the training curves of BERTGEN and each task-specific FAIRSEQ system (Figure 4). Here, the x axis represents how many times the specific task's training set has been seen by the models. BERTGEN is trained for 45 reference epochs (§ 3), and this corresponds to only a few complete passes over the training sets of the NMT tasks⁸. This is in contrast to the single-task systems, which usually require a large number of epochs for convergence. We notice a general trend and observe that BERTGEN tends to outperform single-task systems usually after only a few passes over the corresponding training set. Many factors could be contributing to this observation, such as sequence unrolling, multi-tasking, the shared input space or relevant inductive biases transferred from M-BERT. We partly address these in the ablation studies (§ 3.4) and leave further investigation to future work.

⁸ For example, only ∼3 passes over SETIMES EN→TR.

Figure 4: BERTGEN's learning efficiency on MT: validation scores are plotted against the number of full passes completed by BERTGEN and each FAIRSEQ model over the corresponding task's training set (panels: IWSLT DE↔EN, IWSLT EN↔FR, SETIMES EN↔TR). Best checkpoints' test set performances are given in Table 5.

Zero-shot performance. We use the DE↔FR test set from the WMT'19 shared task on news translation (Barrault et al., 2019) to assess the zero-shot translation capability of BERTGEN. This test set includes 1,701 sentences from news data regarding the European Elections. We compare our results to two shared task systems, namely TARTU (baseline) and MSRA (state-of-the-art) (Barrault et al., 2019), after re-tokenising them accordingly with M-BERT⁹. Although BERTGEN is expected to obtain lower scores than the dedicated WMT systems due to the domain mismatch of the test set, we consider both the quantitative (Table 6) and the qualitative results (Table 7) extremely encouraging.

⁹ TARTU is the baseline and MSRA is the best performing system for the shared task.

            DE→FR (BL / MT)   FR→DE (BL / MT)
BERTGEN     19.6 / 40.5       13.1 / 36.7
TARTU‡      39.5 / 59.0       26.3 / 47.3
MSRA‡       46.5 / 64.2       38.2 / 56.4

Table 6: Zero-shot BERTGEN performance on the WMT'19 test set: the TARTU and MSRA systems are not zero-shot as they are trained on DE↔FR corpora. Systems marked with ‡ are beam search outputs.

BERTGEN: la décision est tombée au 70ème anniversaire de ma femme. (the decision fell on my wife's 70th birthday.)
WMT REF: la décision est tombée le jour du 70ème anniversaire de ma femme. (the decision fell on my wife's 70th birthday.)

BERTGEN: en espagne, on s'est malheureusement habitué à une rôle double et passive. (in spain, we unfortunately got used to a double and passive role.)
WMT REF: en espagne, on s'est malheureusement habitué à un rôle secondaire, passif. (in spain, we unfortunately got used to a secondary, passive role.)

BERTGEN: pas parce que le président du fdp a dit quelque chose qu' ils ont défaillant leur vote. (not because the fdp president said something that they missed their vote.)
WMT REF: ce n' est pas parce que le président fédéral du fdp a dit quelque chose qu' ils ont refusé d' approuver. (it is not because the federal president of the fdp said something that they refused to approve.)

Table 7: Zero-shot DE→FR translations on the WMT'19 test set. The sentences in parentheses are Google Translate translations of the French sentences into English. Bold indicates important differences between BERTGEN and the references.

3.4 Ablation Studies

3.4.1 Impact of initialisation

We train single-task MMT systems on the MULTI30K EN→DE language pair. Specifically, we begin with a baseline system which is initialised with random weights. We then train a second baseline where only the visual processing layers are transferred from VL-BERT. Finally, we train a third baseline that is initialised similarly to BERTGEN, i.e. using the hybrid initialisation (§ 2.1). Figure 5 compares the validation BLEU scores of these three systems. We observe that the benefits of knowledge transfer from pre-trained models are incrementally positive; however, BERTGEN's hybrid initialisation outperforms the other two ablations.

Figure 5: Validation scores on MULTI30K EN→DE MMT for the initialisation ablation (Random, Visual Only, Hybrid), plotted against passes over the EN-DE MMT data: hybrid initialisation is the most beneficial strategy for BERTGEN.
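The hybrid strategy compared in this ablation merges the two pre-trained checkpoints described in § 2.1. Since the exact layer mapping is not reproduced here, the sketch below is only a schematic reading of it (VL-BERT weights for the visual and Transformer body, M-BERT weights for the multilingual token embeddings); the parameter-name patterns are assumptions, not the real checkpoint keys.

```python
from typing import Dict

def hybrid_init(random_init: Dict, vlbert_state: Dict, mbert_state: Dict) -> Dict:
    """Schematic merge: start from random weights, copy everything VL-BERT
    provides, then overwrite the token embedding table with M-BERT's weights
    so that the multilingual vocabulary is covered."""
    merged = dict(random_init)
    for name, tensor in vlbert_state.items():
        if name in merged:
            merged[name] = tensor          # visual layers + Transformer body
    for name, tensor in mbert_state.items():
        if name in merged and "word_embeddings" in name:
            merged[name] = tensor          # multilingual token embeddings
    return merged
```

Under this reading, the "Visual Only" baseline of Figure 5 corresponds to copying only VL-BERT's visual processing layers, and "Random" to skipping both copy steps.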
3.4.2 Impact of multi-task training

We now remove the multi-tasking aspect from BERTGEN to investigate the extent to which the performance improvements are related to the other tasks. Similar to § 3.4.1, we focus on the MULTI30K EN→DE MMT task and train a single-task, hybrid-initialised BERTGEN. Figure 6 compares the validation BLEU scores obtained by the default BERTGEN and the single-task variant. We observe that BERTGEN benefits from multi-task training and, more importantly, does not seem to exhibit patterns of catastrophic forgetting (French, 1999). Based on these observations, we expect similar model behaviour to hold for other tasks.

Figure 6: Validation scores on MULTI30K EN→DE MMT for the multi-tasking ablation (Single Task vs. Multi Task), plotted against passes over the EN-DE MMT data: the default multi-task BERTGEN outperforms the single-task one.

4 Related Work

4.1 Multimodal multilingual pre-training

Research in NLP and related fields has been increasingly focusing on transfer learning approaches where a model is first pre-trained on a data-rich task and then transferred to downstream tasks (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2019). This framework presumably allows the model to capture useful inductive biases that generalise to a variety of NLP tasks, often after performing task-specific fine-tuning (Raffel et al., 2020). Of these, the most relevant studies to our work are BERT (Devlin et al., 2019) and its multilingual version M-BERT, which pre-train a Transformer (Vaswani et al., 2017) on large monolingual corpora using the masked language modelling (MLM) objective.
Recent research has also attempted to combine linguistic inputs with other modalities such as vision and speech, to achieve a grounded understanding of meaning. Successful approaches, including LXMERT (Tan and Bansal, 2019), VL-BERT (Su et al., 2020) and others (Lu et al., 2019; Li et al., 2020a,b), achieve this by combining BERT's MLM objective with auxiliary tasks such as masked region classification and image-sentence matching, and pre-train their models on large-scale image captioning corpora (Chen et al., 2015; Sharma et al., 2018). Similarly, SpeechBERT extends BERT by jointly training on speech and text data (Chuang et al., 2020). Although SoTA results are reported by these approaches, they focus on unimodal and multimodal natural language understanding (NLU) tasks, with a strong emphasis on English. The backbone of BERTGEN combines VL-BERT (Su et al., 2020) with M-BERT (Devlin et al., 2019) to realise a multilingual and multimodal generator that can be used for a diverse set of generative tasks and languages rather than NLU tasks.

4.2 Pre-training for generative tasks

Previous work has studied how to benefit from pre-trained BERT models in generative tasks such as NMT (Imamura and Sumita, 2019; Clinchant et al., 2019; Zhu et al., 2020). BERTGEN differs from these as it is not fine-tuned for a particular MT corpus, and it exhibits multi-lingual and multi-modal properties for general purpose generation. Another related branch of work explores pre-training strategies specific to sequence-to-sequence tasks. This includes MASS (Song et al., 2019), which exploits an encoder-decoder framework with the MLM objective for task-specific generative pre-training, and UniLM (Dong et al., 2019), which introduces uni-directional, bi-directional and sequence-to-sequence LM objectives by carefully adjusting the self-attention masks during training. Zhou et al. (2020) extend UniLM to vision & language pre-training using Conceptual Captions (Sharma et al., 2018) as the pre-training dataset. However, these models require a further fine-tuning step for generative tasks, unlike BERTGEN, which is trained only once.

4.3 Multi-task learning for generation

Several approaches exist for multi-task learning & generation (Dong et al., 2015; Luong et al., 2016) in NLP, especially in multilingual NMT, where tasks denote different language pairs (Zoph and Knight, 2016; Firat et al., 2016). The multi-task (and zero-shot) generation ability of BERTGEN is mostly inspired by Ha et al. (2016) and Johnson et al. (2017). Both of these introduced target language specifiers to select the output language when decoding translations from their models. Our multilingual & multimodal take on multi-task generation is most similar to Kaiser et al. (2017), where a single Transformer model is trained on different tasks including image captioning, object classification, machine translation, speech recognition and parsing. However, their architecture depends on particular structures such as encoders, decoders, modality-specific networks and I/O mixers, unlike BERTGEN, which does not require task-specific modules.

5 Conclusions

In this paper, we presented BERTGEN, a novel generative, decoder-only model which extends BERT by combining multimodal and multilingual pre-trained models. Our findings show that BERTGEN obtains strong performance on a variety of generative tasks and further generalises over unseen tasks. Importantly, our model demonstrates the potential for general-purpose (instead of task-specific) generation that goes above and beyond the traditional pre-training and fine-tuning practices. BERTGEN is also parameter efficient: it has 89.3M total parameters and is trained on thirteen tasks encompassing MT, multimodal MT and image captioning, whereas each of the single-task FAIRSEQ NMT baselines alone has 31.5M parameters.

Our ablation studies show that BERTGEN is able to efficiently transfer relevant inductive biases from the pre-trained models and benefits from multi-task learning without suffering from catastrophic forgetting. We hope that these findings will motivate future research in exploiting more sophisticated pre-trained models in place of M-BERT, VL-BERT and others.

Acknowledgments

This paper is a follow-up work to the MSc thesis of Faidon Mitzalis, co-supervised by Prof. Lucia Specia and Dr. Ozan Caglayan. Lucia Specia, Pranava Madhyastha and Ozan Caglayan received support from the MultiMT project (H2020 ERC Starting Grant No. 678017). Lucia Specia also received support from the Air Force Office of Scientific Research (under award number FA8655-20-1-7006).
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy.
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Brussels, Belgium.
Ozan Caglayan, Julia Ive, Veneta Haralampieva, Pranava Madhyastha, Loïc Barrault, and Lucia Specia. 2020. Simultaneous machine translation with visual context. In Proceedings of EMNLP 2020, pages 2350–2361, Online.
Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 261–268, Trento, Italy.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data collection and evaluation server.
Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu. 2020. Distilling knowledge learned in BERT for text generation. In Proceedings of ACL 2020, pages 7893–7905, Online.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP 2014, pages 1724–1734, Doha, Qatar.
Yung-Sung Chuang, Chi-Liang Liu, Hung-yi Lee, and Lin-shan Lee. 2020. SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering. In Proc. Interspeech 2020, pages 4168–4172.
Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. 2019. On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 108–117, Hong Kong.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of ACL-IJCNLP 2015 (Volume 1: Long Papers), pages 1723–1732, Beijing, China.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, volume 32, pages 13063–13075.
Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of EMNLP 2018, pages 2974–2978, Brussels, Belgium.
Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL-HLT 2016, pages 866–875, San Diego, California.
Robert M. French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
Thanh-Le Ha, Jan Niehues, and Alex Waibel. 2016. Toward multilingual neural machine translation with universal encoder and decoder. In Proceedings of the 13th International Conference on Spoken Language Translation.
Kenji Imamura and Eiichiro Sumita. 2019. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 23–31, Hong Kong.
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020a. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pages 11336–11344.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020b. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision – ECCV 2020, pages 121–137, Cham.
Jindřich Libovický. 2019. Multimodality in Machine Translation. Ph.D. thesis, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czechia.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3242–3250.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2018. Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7219–7228.
Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In International Conference on Learning Representations.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pages 6297–6308, Red Hook, NY, USA.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019 (Demonstrations), pages 48–53, Minneapolis, Minnesota.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018 (Volume 1: Long Papers), pages 2227–2237, New Orleans, Louisiana.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139–147, Los Angeles.
Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics, 8:264–280.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL 2018 (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936.
Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.
Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. Multimodal machine translation through visuals and speech. Machine Translation, pages 1–51.
Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP-IJCNLP 2019, pages 5100–5111, Hong Kong, China.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey.
Mesut Erhan Unal, Begum Citamak, Semih Yagcioglu, Aykut Erdem, Erkut Erdem, Nazli Ikizler Cinbis, and Ruket Cakici. 2016. TasvirEt: A benchmark dataset for automatic Turkish description generation from images. In 2016 24th Signal Processing and Communication Application Conference (SIU), pages 1977–1980.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of EMNLP 2020: System Demonstrations, pages 38–45, Online.
Jiacheng Yang, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Weinan Zhang, Yong Yu, and Lei Li. 2020. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9378–9385.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):13041–13049.
Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. In 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia.
Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of NAACL-HLT 2016, pages 30–34, San Diego, California.
A Qualitative Examples

BERTGEN: un groupe de jeunes sont réunis, et ils sont debout à l'extérieur d'un bâtiment. (a group of young people are gathered, and they are standing outside a building.)
MT REF: des gens debout devant un bâtiment. (people standing outside of a building.)

BERTGEN: une petite fille lit un livre. (a little girl reads a book.)
MT REF: un jeune enfant dormant dans son lit avec un livre ouvert sur sa poitrine. (a young child sleeping in her bed with an open book on her chest.)

BERTGEN: des manifestants avec des pancartes. (demonstrators with placards.)
MT REF: des familles de militaires défilent dans new york un jour de pluie. (military families are marching through new york on a rainy day.)

Table 8: Zero-shot French image captioning examples for the FLICKR30K test set.
EN: a group of people are riding on elephants through a river.
FR: un groupe de personnes sur des chevaux sur un bateau dans un ruisseau. (a group of people on horses on a boat in a stream.)
DE: eine gruppe reiter fährt auf einem fluss. (a group of riders is riding on a river.)
TR: bir grup insan bir derede duran dört tane at ile ilerliyorlar. (a group of people are moving with four horses standing in a stream.)

EN: a black dog is playing with a yellow toy in the grass.
FR: un chien avec une frisbee dans la pelouse. (a dog with a frisbee in the lawn.)
DE: ein schwarzer hund mit rotem halsband spielt mit einem gelben ball auf einer wiese. (a black dog with a red collar is playing with a yellow ball in a meadow.)
TR: yeşil bir topu ısırmaya çalışan siyah bir köpek. (a black dog trying to bite a green ball.)

EN: a boy in a red shirt and white shorts is playing tennis.
FR: un tennisteur frappe une balle. (a tennisteur hits a ball.)
DE: ein junge spielt tennis. (a boy is playing tennis.)
TR: tenis raketi ile topa vuran çocuk. (boy hitting the ball with a tennis racket.)

Table 9: COCO captioning examples: the captions in parentheses are Google translations into English for the non-English examples. Bold phrases highlight lexical variations or errors related to the salient visual concepts in the images. The last example shows a morphological error that BERTGEN makes when trying to generate "tennis player" in French.
BERTGEN: mehrere ngos, darunter die mozilla- und greenpeace-stiftung, schätzen, dass diese neuen werkzeuge unfähig sind und zu spät kommen. (several ngos, including the mozilla and greenpeace foundations, estimate that these new tools are incapable and come too late.)
WMT REF: mehrere ngos, unter denen die mozilla-stiftung und greenpeace, schätzen, dass diese neuen tools unzureichend sind und zu spät kommen. (several ngos, including the mozilla foundation and greenpeace, estimate that these new tools are inadequate and come too late.)

BERTGEN: immigration wird als ein großes problem für die ue betrachtet, für 45 prozent der deutschen und 40 prozent aller europäischen. (immigration is seen as a big problem for the ue, for 45 percent of germans and 40 percent of all european ones.)
WMT REF: die einwanderung halten 45 prozent der deutschen und 40 prozent aller europäer für das größte problem der eu. (45 percent of germans and 40 percent of all europeans consider immigration to be the biggest problem in the eu.)

BERTGEN: das ist der grund, warum er in seinem buch die frage erforscht, ob es alternativen zu wahlen gibt. (that is the reason why he explores the question of whether there are alternatives to choose from in his book.)
WMT REF: deshalb geht er in seinem buch der frage nach, ob es alternativen zu wahlen gibt. (therefore, in his book, he investigates the question of whether there are alternatives to choose from.)

Table 10: Zero-shot FR→DE translations of the WMT'19 test set. The sentences in parentheses are Google Translate translations of the German outputs into English.