Probing the Need for Visual Context in Multimodal Machine Translation
Ozan Caglayan (LIUM, Le Mans University) ozan.caglayan@univ-lemans.fr
Pranava Madhyastha (Imperial College London) pranava@imperial.ac.uk
Lucia Specia (Imperial College London) l.specia@imperial.ac.uk
Loïc Barrault (LIUM, Le Mans University) loic.barrault@univ-lemans.fr

arXiv:1903.08678v1 [cs.CL] 20 Mar 2019

Abstract

Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models of source-side textual context. Our results show that, under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.

1 Introduction

Multimodal Machine Translation (MMT) aims at designing better translation systems that take into account auxiliary inputs such as images. Initially organized as a shared task within the First Conference on Machine Translation (WMT16) (Specia et al., 2016), MMT has so far been studied using the Multi30K dataset (Elliott et al., 2016), a multilingual extension of Flickr30K (Young et al., 2014) with translations of the English image descriptions into German, French and Czech (Elliott et al., 2017; Barrault et al., 2018). The three editions of the shared task have seen many exciting approaches that can be broadly categorized as follows: (i) multimodal attention using convolutional features (Caglayan et al., 2016; Calixto et al., 2016; Libovický and Helcl, 2017; Helcl et al., 2018), (ii) cross-modal interactions with spatially-unaware global features (Calixto and Liu, 2017; Ma et al., 2017; Caglayan et al., 2017a; Madhyastha et al., 2017), and (iii) the integration of regional features from object detection networks (Huang et al., 2016; Grönroos et al., 2018).

Nevertheless, the conclusion about the contribution of the visual modality is still unclear: Grönroos et al. (2018) consider their multimodal gains "modest" and attribute the largest gain to the usage of external parallel corpora. Lala et al. (2018) observe that their multimodal word-sense disambiguation approach is not significantly different from the monomodal counterpart. The organizers of the latest edition of the shared task concluded that the multimodal integration schemes explored so far resulted in marginal changes in terms of automatic metrics and human evaluation (Barrault et al., 2018). In a similar vein, Elliott (2018) demonstrated that MMT models can translate without significant performance losses even in the presence of features from unrelated images. These empirical findings seem to indicate that images are ignored by the models and hint at the fact that this is due to representation or modeling limitations. We conjecture that the most plausible reason for the linguistic dominance is that – at least in Multi30K – the source text is sufficient to perform the translation, eventually preventing the visual information from intervening in the learning process.

To investigate this hypothesis, we introduce several input degradation regimes (Section 2) and revisit state-of-the-art MMT models (Section 3) to assess their behavior under those regimes. We further probe the visual sensitivity by deliberately feeding features from unrelated images. Our results (Section 4) show that MMT models successfully exploit the visual modality when the linguistic context is scarce, but indeed tend to be less sensitive to this modality when exposed to complete sentences.
2 Input Degradation

In this section we propose several degradations to the input language modality to simulate conditions where sentences may miss crucial information. We denote a set of translation pairs by D and indicate degraded variants with subscripts. Both the training and the test sets are degraded.

Table 1: An example of the proposed input degradation schemes: D is the original sentence.

  D    a lady in a blue dress singing
  DC   a lady in a [v] dress singing
  DN   a [v] in a blue [v] singing
  D4   a lady in a [v] [v] [v]
  D2   a lady [v] [v] [v] [v] [v]
  D0   [v] [v] [v] [v] [v] [v] [v]

Color Deprivation. We consistently replace source words that refer to colors with a special token [v] (DC in Table 1). Our hypothesis is that a monomodal system will have to rely on source-side contextual information and biases, while a multimodal architecture could potentially capitalize on color information extracted from the image and thus obtain better performance. This affects 3.3% and 3.1% of the words in the training and the test set, respectively.

Entity Masking. The Flickr30K dataset, from which Multi30K is derived, has also been extended with coreference chains that tag mentions of visually depictable entities in image descriptions (Plummer et al., 2015). We use these to mask out the head nouns in the source sentences (DN in Table 1). This affects 26.2% of the words in both the training and the test set. We hypothesize that a multimodal system should rely heavily on the images to infer the missing parts.

Progressive Masking. A progressively degraded variant Dk replaces all but the first k tokens of the source sentences with [v]. Unlike color deprivation and entity masking, masking out suffixes does not guarantee systematic removal of visual context, but rather simulates an increasingly low-resource scenario. Overall, we form 16 degraded variants Dk (Table 1) where k ∈ {0, 2, ..., 30}. We stop at D30 since 99.8% of the sentences in Multi30K are shorter than 30 words, with an average sentence length of 12 words. D0 – where the only remaining information is the source sentence length – is an interesting case from two perspectives: a neural machine translation (NMT) model trained on it resembles a target language model, while an MMT model becomes an image captioner with access to "expected length information".

Visual Sensitivity. Inspired by Elliott (2018), we experiment with incongruent decoding in order to understand how sensitive the multimodal systems are to the visual modality. This is achieved by explicitly violating the test-time semantic congruence across modalities. Specifically, we feed the visual features in reverse sample order to break the image-sentence alignments. Consequently, a model capable of integrating the visual modality would likely deteriorate in terms of metrics.
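To make the degradation schemes concrete, the following is a minimal sketch of how color deprivation and progressive masking could be implemented on tokenized sentences. The function names and the color lexicon are illustrative assumptions; the paper does not publish its preprocessing code, and entity masking (which relies on the Flickr30K Entities annotations) is omitted here.

    # Illustrative sketch of the input degradation schemes (not the authors' code).
    # The color lexicon below is an assumption; the paper does not list the words used.
    COLORS = {"black", "white", "red", "blue", "green", "yellow", "brown",
              "pink", "orange", "purple", "gray", "grey"}
    MASK = "[v]"

    def color_deprivation(tokens):
        """D_C: replace every color word with the [v] token."""
        return [MASK if t.lower() in COLORS else t for t in tokens]

    def progressive_masking(tokens, k):
        """D_k: keep the first k tokens, replace the rest with [v]."""
        return tokens[:k] + [MASK] * max(0, len(tokens) - k)

    if __name__ == "__main__":
        sent = "a lady in a blue dress singing".split()
        print(" ".join(color_deprivation(sent)))       # a lady in a [v] dress singing
        print(" ".join(progressive_masking(sent, 4)))  # a lady in a [v] [v] [v]
        print(" ".join(progressive_masking(sent, 0)))  # [v] [v] [v] [v] [v] [v] [v]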
3 Experimental Setup

Dataset. We conduct experiments on the English→French part of Multi30K. The models are trained on the concatenation of the train and val sets (30K sentences), whereas test2016 (dev) and test2017 (test) are used for early stopping and model evaluation, respectively. For entity masking, we revert to the default Flickr30K splits and perform the model evaluation on test2016, since test2017 is not annotated for entities. We use word-level vocabularies of 9,951 English and 11,216 French words. We use Moses (Koehn et al., 2007) scripts to lowercase, normalize and tokenize the sentences with hyphen splitting. The hyphens are stitched back prior to evaluation.

Visual Features. We use a ResNet-50 CNN (He et al., 2016) trained on ImageNet (Deng et al., 2009) as image encoder. Prior to feature extraction, we center and standardize the images using ImageNet statistics, resize the shortest edge to 256 pixels and take a center crop of size 256x256. We extract spatial features of size 2048x8x8 from the final convolutional layer and apply L2 normalization along the depth dimension (Caglayan et al., 2018). For the non-attentive model, we use the 2048-dimensional global average pooled version (pool5) of the above convolutional features.
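The feature extraction just described can be sketched with PyTorch/torchvision as follows. This is an approximation under stated assumptions, not the authors' extraction code: the weight-loading argument depends on the installed torchvision version (older versions use pretrained=True), and whether L2 normalization is applied before the global average pooling is not specified in the paper.

    import torch
    import torchvision.transforms as T
    from torchvision.models import resnet50
    from PIL import Image

    # Centering/standardization with ImageNet statistics, shortest edge -> 256,
    # then a 256x256 center crop, as described in the Visual Features paragraph.
    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(256),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Keep everything up to the last convolutional block, dropping avgpool and fc.
    cnn = resnet50(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
    encoder = torch.nn.Sequential(*list(cnn.children())[:-2]).eval()

    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = encoder(img)                                   # (1, 2048, 8, 8) spatial features
        spatial = conv / conv.norm(p=2, dim=1, keepdim=True)  # L2-normalized along depth
        pool5 = conv.mean(dim=(2, 3))                         # (1, 2048) global average pooled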
Models. Our baseline NMT is an attentive model (Bahdanau et al., 2014) with a 2-layer bidirectional GRU encoder (Cho et al., 2014) and a 2-layer conditional GRU decoder (Sennrich et al., 2017). The second layer of the decoder receives the output of the attention layer as input.

For the MMT models, we explore the basic multimodal attention (DIRECT) (Caglayan et al., 2016) and its hierarchical (HIER) extension (Libovický and Helcl, 2017). The former linearly projects the concatenation of the textual and visual context vectors to obtain the multimodal context vector, while the latter replaces the concatenation with another attention layer. Finally, we also experiment with encoder-decoder initialization (INIT) (Calixto and Liu, 2017; Caglayan et al., 2017a), where we initialize both the encoder and the decoder using a non-linear transformation of the pool5 features.

Hyperparameters. The encoder and decoder GRUs have 400 hidden units and are initialized with 0, except for the multimodal INIT system. All embeddings are 200-dimensional and the decoder embeddings are tied (Press and Wolf, 2016). A dropout of 0.4 and 0.5 is applied on the source embeddings and the encoder/decoder outputs, respectively (Srivastava et al., 2014). The weights are decayed with a factor of 1e-5. We use ADAM (Kingma and Ba, 2014) with a learning rate of 4e-4 and mini-batches of 64 samples. The gradients are clipped if the total norm exceeds 1 (Pascanu et al., 2013). The training is early-stopped if the dev set METEOR (Denkowski and Lavie, 2014) does not improve for ten epochs. All experiments are conducted with nmtpytorch (github.com/lium-lst/nmtpytorch) (Caglayan et al., 2017b).
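As a rough PyTorch-style illustration of the DIRECT fusion described above (concatenate the textual and visual context vectors, then project them linearly), the sketch below shows only the fusion step. The attention layers that produce the two context vectors are omitted, and all dimensionalities are placeholders rather than the values used in nmtpytorch.

    import torch
    import torch.nn as nn

    class DirectFusion(nn.Module):
        """DIRECT-style multimodal context: concat(text ctx, image ctx) -> linear projection."""

        def __init__(self, txt_dim=800, img_dim=2048, out_dim=800):
            super().__init__()
            self.proj = nn.Linear(txt_dim + img_dim, out_dim)

        def forward(self, txt_ctx, img_ctx):
            # txt_ctx: (batch, txt_dim), attention over the bidirectional GRU states
            # img_ctx: (batch, img_dim), attention over the 8x8 spatial image features
            return self.proj(torch.cat([txt_ctx, img_ctx], dim=-1))

    # The HIER variant would replace the concatenation with a second attention layer
    # that weighs the two modality-specific context vectors.
    fusion = DirectFusion()
    ctx = fusion(torch.randn(2, 800), torch.randn(2, 2048))  # (2, 800) multimodal context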
4 Results

We train all systems three times, each with a different random initialization, in order to perform significance testing with multeval (Clark et al., 2011). Throughout this section we report the mean (and the standard deviation) over the three runs for the considered metrics. We decode the translations with a beam size of 12.

We first present test2017 METEOR scores for the baseline NMT and MMT systems when trained on the full dataset D (Table 2). The first column indicates that, although the MMT models perform slightly better on average, they are not significantly better than the baseline NMT. We now introduce and discuss the results obtained under the proposed degradation schemes. Please refer to Table 5 and the appendix for qualitative examples.

Table 2: Baseline and color-deprivation METEOR scores (* marks systems significantly different from the NMT system within the same column, p-value ≤ 0.03).

            D             DC
  NMT       70.6 ± 0.5    68.4 ± 0.1
  INIT      70.7 ± 0.2    68.9 ± 0.1*
  HIER      70.9 ± 0.3    69.0 ± 0.3*
  DIRECT    70.9 ± 0.2    68.8 ± 0.3*

4.1 Color Deprivation

Unlike the inconclusive results for D, we observe that all MMT models are significantly better than NMT when color deprivation is applied (DC in Table 2). If we further focus on the subset of the test set subjected to color deprivation (247 sentences), the gain increases to 1.6 METEOR for HIER. For the latter subset, we also computed the average color accuracy per sentence and found that the attentive models are 12% better than the NMT (32.5→44.5), whereas the INIT model only brings a 4% improvement (32.5→36.5). This suggests that the more complex MMT models are better at integrating visual information.

4.2 Entity Masking

The gains are much more prominent with entity masking, where the degradation occurs at a larger scale: attentive MMT models show up to 4.2 METEOR improvement over NMT (Figure 1). We observed a large performance drop with incongruent decoding, suggesting that the visual modality is now much more important than previously demonstrated (Elliott, 2018).

[Figure 1: Entity masking METEOR scores with congruent and incongruent decoding: all masked MMT models (INIT 53.9, HIER 54.4, DIRECT 54.7) are significantly better than the masked NMT (50.5), and incongruent decoding severely worsens all systems (INIT 47.4, HIER 45.4, DIRECT 45.0). The vanilla NMT baseline is 75.9. Since entity masking uses the Flickr30K splits (Section 3) rather than our splits, these scores are not comparable to those from other experiments in this paper.]

A comparison of the attention maps produced by the baseline and masked MMT models reveals that the attention weights are more consistent in the latter. An interesting example is given in Figure 2, where the masked MMT model attends to the correct region of the image and successfully translates a dropped word that was otherwise a spelling mistake ("son"→"song").

[Figure 2: The baseline MMT (top) translates the misspelled "son" ("young song" → "jeune chanson"), while the masked MMT (bottom) correctly produces "enfant" (child) for "young [v]" by focusing on the image.]

Czech and German. In order to understand whether the above observations are also consistent across different languages, we extend the entity masking experiments to the German and Czech parts of Multi30K. Table 3 shows the gain of each MMT system with respect to the NMT model and the subsequent drop caused by incongruent decoding. For example, the INIT system for French (Figure 1) surpasses the baseline (50.5) by reaching 53.9 (+3.4), and ends up at 47.4 (↓ 6.5) after incongruent decoding. First, we see that the multimodal benefits clearly hold for German and Czech, although the gains are lower than for French, probably because of the morphological richness of German and Czech, which is suboptimally handled by word-level MT. Second, when we compute the average drop caused by incongruent images across all languages, we see how conservative the INIT system is (↓ 4.7) compared to HIER (↓ 6.1) and DIRECT (↓ 6.8). This raises a follow-up question as to whether the hidden state initialization eventually loses its impact throughout the recurrence, so that, as a consequence, the only modality effectively processed is the text.

Table 3: Entity masking results across three languages, reported as the gain over the corresponding NMT with the incongruence drop in parentheses: all MMT models perform significantly better than their NMT counterparts (p-value ≤ 0.01). The incongruence drop applies on top of the MMT score.

            INIT            HIER            DIRECT
  Czech     +1.4 (↓ 2.9)    +1.7 (↓ 3.5)    +1.7 (↓ 4.1)
  German    +2.1 (↓ 4.7)    +2.5 (↓ 5.9)    +2.7 (↓ 6.5)
  French    +3.4 (↓ 6.5)    +3.9 (↓ 9.0)    +4.2 (↓ 9.7)

4.3 Progressive Masking

Finally, we discuss the results of the progressive masking experiments for French. Figure 3 clearly shows that, as the sentences are progressively degraded, all MMT systems are able to leverage the visual modality. At the point where the multimodal task becomes image captioning (D0), the MMT models improve over the language-model counterpart by ∼7 METEOR. Further qualitative examples show that the systems perform surprisingly well by producing visually plausible sentences (see Table 5 and the Appendix).

[Figure 3: Multimodal gain of INIT, HIER and DIRECT in absolute METEOR over NMT as a function of the context size k under progressive masking; the thick gray curve indicates the percentage of non-masked words in the training set.]

To get a sense of the visual sensitivity, we pick the DIRECT models trained on four degraded variants and perform incongruent decoding. We notice that, as the amount of linguistic information increases, the gap narrows down: the MMT system gradually becomes less perplexed by the incongruence or, put differently, less sensitive to the visual modality (Table 4).

We then conduct a contrastive "blinding" experiment where the DIRECT models are not only fed with incongruent features at decoding time but are also trained with them from scratch. The results suggest that the blinded models learn to ignore the visual modality; in fact, their performance is equivalent to that of the NMT models.

Table 4: The impact of incongruent decoding for progressive masking: all METEOR differences are against the DIRECT model. The blinded systems are both trained and decoded using incongruent features.

                      D4      D6      D12     D20     D
  DIRECT              32.3    42.2    64.5    70.1    70.9
  Incongruent Dec.    ↓ 6.4   ↓ 5.5   ↓ 1.4   ↓ 0.7   ↓ 0.7
  Blinding            ↓ 3.9   ↓ 2.9   ↓ 0.4   ↓ 0.5   ↓ 0.3
  NMT                 ↓ 3.7   ↓ 2.6   ↓ 0.6   ↓ 0.2   ↓ 0.3
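Incongruent decoding, introduced in Section 2 and used throughout this section, only requires breaking the image-sentence alignment at test time; the blinding variant applies the same mismatched pairing during training as well. A minimal sketch is given below, assuming the visual features are stored in a tensor whose first dimension follows the sentence order.

    import torch

    def incongruent(features: torch.Tensor) -> torch.Tensor:
        """Break image-sentence alignment by reversing the sample order of the
        visual features (dimension 0 is assumed to index sentences)."""
        return torch.flip(features, dims=[0])

    # Congruent decoding pairs sentence i with features[i]; incongruent decoding
    # instead pairs sentence i with the features of sample N-1-i.
    feats = torch.randn(5, 2048, 8, 8)   # illustrative: 5 samples of spatial features
    shuffled = incongruent(feats)
    assert torch.equal(shuffled[0], feats[-1])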
SRC: an older woman in [v][v][v][v][v][v][v][v][v][v][v]
NMT: une femme âgée avec un t-shirt blanc et des lunettes de soleil est assise sur un banc (an older woman with a white t-shirt and sunglasses is sitting on a bench)
MMT: une femme âgée en maillot de bain rose est assise sur un rocher au bord de l'eau (an older woman in a pink swimsuit is sitting on a rock at the water's edge)
REF: une femme âgée en bikini bronze sur un rocher au bord de l'océan (an older woman in a bikini is tanning on a rock at the edge of the ocean)

SRC: a young [v] in [v] holding a tennis [v]
NMT: un jeune garçon en bleu tenant une raquette de tennis (a young boy in blue holding a tennis racket)
MMT: une jeune femme en blanc tenant une raquette de tennis
REF: une jeune femme en blanc tenant une raquette de tennis (a young woman in white holding a tennis racket)

SRC: little girl covering her face with a [v] towel
NMT: une petite fille couvrant son visage avec une serviette blanche (a little girl covering her face with a white towel)
MMT: une petite fille couvrant son visage avec une serviette bleue
REF: une petite fille couvrant son visage avec une serviette bleue (a little girl covering her face with a blue towel)

Table 5: Qualitative examples from progressive masking, entity masking and color deprivation, respectively. Underlined and bold words highlight the bad and good lexical choices. MMT is an attentive system.

5 Discussion and Conclusions

We presented an in-depth study on the potential contribution of images to multimodal machine translation. Specifically, we analysed the behavior of state-of-the-art MMT models under several degradation schemes applied to the Multi30K dataset, in order to reveal and understand the impact of textual predominance. Our results show that the models explored are able to integrate the visual modality if the available modalities are complementary rather than redundant. In the latter case, the primary modality (text) is sufficient to accomplish the task. This dominance effect corroborates the seminal work of Colavita (1974) in psychophysics, where it was demonstrated that visual stimuli dominate over auditory stimuli when humans are asked to perform a simple audiovisual discrimination task.

Our investigation using source degradation also suggests that visual grounding can increase the robustness of machine translation systems by mitigating input noise such as errors in the source text. In the future, we would like to devise models that can learn when and how to integrate multiple modalities by taking care of their complementary and redundant aspects in an intelligent way.

Acknowledgments

This work is a follow-up on the research efforts conducted within the "Grounded sequence-to-sequence transduction" team of the JSALT 2018 Workshop. We would like to thank Jindřich Libovický for contributing the hierarchical attention to nmtpytorch during the workshop. We also thank the reviewers for their valuable comments. Ozan Caglayan and Loïc Barrault received funding from the French National Research Agency (ANR) through the CHIST-ERA M2CR project under the contract ANR-15-CHR2-0006-01. Lucia Specia and Pranava Madhyastha received funding from the MultiMT (H2020 ERC Starting Grant No. 678017) and MMVC (Newton Fund Institutional Links Grant, ID 352343575) projects.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Computing Research Repository, arXiv:1409.0473. Version 7.
Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 308–327, Brussels, Belgium. Association for Computational Linguistics.
Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 432–439, Copenhagen, Denmark. Association for Computational Linguistics.
Ozan Caglayan, Adrien Bardet, Fethi Bougares, Loïc Barrault, Kai Wang, Marc Masana, Luis Herranz, and Joost van de Weijer. 2018. LIUM-CVC submissions for WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 603–608, Brussels, Belgium. Association for Computational Linguistics.
Ozan Caglayan, Loïc Barrault, and Fethi Bougares. 2016. Multimodal attention for neural machine translation. Computing Research Repository, arXiv:1609.03976.
Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. NMTPY: A flexible toolkit for advanced neural machine translation systems. Prague Bulletin of Mathematical Linguistics, 109:15–28.
Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation, pages 634–638, Berlin, Germany. Association for Computational Linguistics.
Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.
Francis B. Colavita. 1974. Human sensory dominance. Perception & Psychophysics, 16(2):409–412.
J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380. Association for Computational Linguistics.
Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978, Brussels, Belgium. Association for Computational Linguistics.
Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.
Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.
Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, and Raúl Vázquez. 2018. The MeMAD submission to the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, pages 609–617, Brussels, Belgium. Association for Computational Linguistics.
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
Jindřich Helcl, Jindřich Libovický, and Dusan Varis. 2018. CUNI system for the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 622–629, Brussels, Belgium. Association for Computational Linguistics.
Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 639–645, Berlin, Germany. Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computing Research Repository, arXiv:1412.6980.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
Chiraag Lala, Pranava Swaroop Madhyastha, Carolina Scarton, and Lucia Specia. 2018. Sheffield submissions for WMT18 multimodal translation shared task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 630–637, Brussels, Belgium. Association for Computational Linguistics.
Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.
Mingbo Ma, Dapeng Li, Kai Zhao, and Liang Huang. 2017. OSU multimodal machine translation system report. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 465–469, Copenhagen, Denmark. Association for Computational Linguistics.
Pranava Swaroop Madhyastha, Josiah Wang, and Lucia Specia. 2017. Sheffield MultiMT: Using object posterior predictions for multimodal machine translation. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 470–476, Copenhagen, Denmark. Association for Computational Linguistics.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA. PMLR.
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2641–2649.
Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. Computing Research Repository, arXiv:1608.05859.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68. Association for Computational Linguistics.
Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pages 543–553, Berlin, Germany. Association for Computational Linguistics.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
A Qualitative Examples

SRC: a girl in [v] is sitting on a bench
  NMT: pink    INIT: pink    HIER: black    DIRECT: black

SRC: a man dressed in [v] talking to a girl
  NMT: black   INIT: black   HIER: white    DIRECT: white

SRC: a [v] dog sits under a [v] umbrella
  NMT: brown / blue    INIT: black / blue    HIER: black / blue    DIRECT: black / blue

SRC: a woman in a [v] top is dancing as a woman and boy in a [v] shirt watch
  NMT: blue / blue     INIT: blue / blue     HIER: red / red       DIRECT: red / red

SRC: three female dancers in [v] dresses are performing a dance routine
  NMT: white   INIT: white   HIER: white    DIRECT: blue

Table 6: Color deprivation examples from the English→French models: bold indicates correctly predicted cases. The colors generated by the models are shown in English for the sake of clarity.
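The per-sentence color accuracy reported in Section 4.1 is not formally defined in the paper; the sketch below is one plausible reading of it, under the assumption that the color words of the hypothesis are compared against those of the reference, position by position. The French color lexicon shown is illustrative and deliberately incomplete.

    # Hypothetical sketch of a per-sentence color accuracy (the exact metric used
    # in Section 4.1 is an assumption, not documented in the paper).
    COLOR_WORDS_FR = {"noir", "noire", "blanc", "blanche", "rouge", "bleu", "bleue",
                      "vert", "verte", "jaune", "marron", "rose", "violet", "violette"}

    def color_accuracy(hypothesis: str, reference: str) -> float:
        ref_colors = [t for t in reference.split() if t in COLOR_WORDS_FR]
        hyp_colors = [t for t in hypothesis.split() if t in COLOR_WORDS_FR]
        if not ref_colors:
            return 1.0
        hits = sum(h == r for h, r in zip(hyp_colors, ref_colors))
        return hits / len(ref_colors)

    print(color_accuracy("une serviette bleue", "une serviette bleue"))    # 1.0
    print(color_accuracy("une serviette blanche", "une serviette bleue"))  # 0.0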
SRC: a [v] in a red [v] plays in the [v]
NMT: un garçon en t-shirt rouge joue dans la neige (a boy in a red t-shirt plays in the snow)
MMT: un garçon en maillot de bain rouge joue dans l'eau
REF: un garçon en maillot de bain rouge joue dans l'eau (a boy in a red swimsuit plays in the water)

SRC: a [v] drinks [v] outside on the [v]
NMT: un homme boit du vin dehors sur le trottoir (a man drinks wine outside on the sidewalk)
MMT: un chien boit de l'eau dehors sur l'herbe
REF: un chien boit de l'eau dehors sur l'herbe (a dog drinks water outside on the grass)

SRC: two [v] are driving on a [v]
NMT: deux hommes font du vélo sur une route (two men riding bicycles on a road)
MMT: deux voitures roulent sur une piste (two cars driving on a track/circuit)
REF: deux voitures roulent sur un circuit

SRC: a [v] turns on the [v] to pursue a flying [v]
NMT: un homme tourne sur la plage pour attraper un frisbee volant (a man turns on the beach to catch a flying frisbee)
MMT: un chien tourne sur l'herbe pour attraper un frisbee volant (a dog turns on the grass to catch a flying frisbee)
REF: un chien tourne sur l'herbe pour poursuivre une balle en l'air (a dog turns on the grass to chase a ball in the air)

SRC: a [v] jumping [v] on a [v] near a parking [v]
NMT: un homme sautant à cheval sur une plage près d'un parking (a man jumping on a beach near a parking lot)
MMT: une fille sautant à la corde sur un trottoir près d'un parking
REF: une fille sautant à la corde sur un trottoir près d'un parking (a girl jumping rope on a sidewalk near a parking lot)

Table 7: Entity masking examples from the English→French models: underlined and bold words highlight bad and good lexical choices, respectively. English translations are provided in parentheses. MMT is an attentive model.
[Figure 4: Attention example from the entity masking experiments: (a) the baseline (non-masked) MMT translates the misspelled "son" (song → chanson), while (b) the entity-masked MMT achieves a correct translation ([v] → enfant) by exploiting the visual modality.]
[Figure 5: Attention example from the entity masking experiments where "terrier", "grass" and "fence" are dropped from the source sentence: (a) the baseline (non-masked) MMT is not able to shift its attention from the salient dog to the grass and the fence, whereas (b) the attention produced by the entity-masked MMT first shifts to the background area while translating "on lush green [v]" and then focuses on the fence.]
SRC: a child [v][v][v][v][v][v]
NMT: un enfant avec des lunettes de soleil en train de jouer au tennis (a child with sunglasses playing tennis)
MMT: un enfant est debout dans un champ de fleurs (a child is standing in a field of flowers)
REF: un enfant dans un champ de tulipes (a child in a field of tulips)

SRC: a jockey riding his [v][v]
NMT: un jockey sur son vélo (a jockey on his bike)
MMT: un jockey sur son cheval
REF: un jockey sur son cheval (a jockey on his horse)

SRC: girls are playing a [v][v][v]
NMT: des filles jouent à un jeu de cartes (girls are playing a card game)
MMT: des filles jouent un match de football
REF: des filles jouent un match de football (girls are playing a football match)

SRC: trees are in front [v][v][v][v][v]
NMT: des vélos sont devant un bâtiment en plein air (bicycles are in front of an outdoor building)
MMT: des arbres sont devant la montagne (trees are in front of the mountain)
REF: des arbres sont devant une grande montagne (trees are in front of a big mountain)

SRC: a fishing net on the deck of a [v][v]
NMT: un filet de pêche sur la terrasse d'un bâtiment (a fishing net on the terrace of a building)
MMT: un filet de pêche sur le pont d'un bateau (a fishing net on the deck of a boat)
REF: un filet de pêche sur le pont d'un bateau rouge (a fishing net on the deck of a red boat)

SRC: girls wave purple flags [v][v][v][v][v][v][v]
NMT: des filles en t-shirts violets sont assises sur des chaises dans une salle de classe (girls in purple t-shirts are sitting on chairs in a classroom)
MMT: des filles en costumes violets dansent dans une rue en ville (girls in purple costumes dance on a city street)
REF: des filles agitent des drapeaux violets tandis qu'elles défilent dans la rue (girls wave purple flags as they parade down the street)

Table 8: English→French progressive masking examples: underlined and bold words highlight bad and good lexical choices, respectively. English translations are provided in parentheses. MMT is an attentive model.