Probing the Need for Visual Context in Multimodal Machine Translation
Ozan Caglayan (LIUM, Le Mans University) ozan.caglayan@univ-lemans.fr
Pranava Madhyastha (Imperial College London) pranava@imperial.ac.uk
Lucia Specia (Imperial College London) l.specia@imperial.ac.uk
Loïc Barrault (LIUM, Le Mans University) loic.barrault@univ-lemans.fr

arXiv:1903.08678v1 [cs.CL] 20 Mar 2019

Abstract

Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models of source-side textual context. Our results show that, under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.

1 Introduction

Multimodal Machine Translation (MMT) aims at designing better translation systems that take into account auxiliary inputs such as images. Initially organized as a shared task within the First Conference on Machine Translation (WMT16) (Specia et al., 2016), MMT has so far been studied using the Multi30K dataset (Elliott et al., 2016), a multilingual extension of Flickr30K (Young et al., 2014) with translations of the English image descriptions into German, French and Czech (Elliott et al., 2017; Barrault et al., 2018). The three editions of the shared task have seen many exciting approaches that can be broadly categorized as follows: (i) multimodal attention using convolutional features (Caglayan et al., 2016; Calixto et al., 2016; Libovický and Helcl, 2017; Helcl et al., 2018), (ii) cross-modal interactions with spatially-unaware global features (Calixto and Liu, 2017; Ma et al., 2017; Caglayan et al., 2017a; Madhyastha et al., 2017), and (iii) the integration of regional features from object detection networks (Huang et al., 2016; Grönroos et al., 2018).

Nevertheless, the conclusion about the contribution of the visual modality is still unclear: Grönroos et al. (2018) consider their multimodal gains "modest" and attribute the largest gain to the usage of external parallel corpora. Lala et al. (2018) observe that their multimodal word-sense disambiguation approach is not significantly different from the monomodal counterpart. The organizers of the latest edition of the shared task concluded that the multimodal integration schemes explored so far resulted in marginal changes in terms of automatic metrics and human evaluation (Barrault et al., 2018). In a similar vein, Elliott (2018) demonstrated that MMT models can translate without significant performance losses even in the presence of features from unrelated images. These empirical findings seem to indicate that images are ignored by the models and hint at the fact that this is due to representation or modeling limitations. We conjecture that the most plausible reason for the linguistic dominance is that – at least in Multi30K – the source text is sufficient to perform the translation, eventually preventing the visual information from intervening in the learning process.

To investigate this hypothesis, we introduce several input degradation regimes (Section 2) and revisit state-of-the-art MMT models (Section 3) to assess their behavior under those regimes. We further probe the visual sensitivity by deliberately feeding features from unrelated images. Our results (Section 4) show that MMT models successfully exploit the visual modality when the linguistic context is scarce, but indeed tend to be less sensitive to this modality when exposed to complete sentences.
2 Input Degradation

In this section we propose several degradations to the input language modality to simulate conditions where sentences may miss crucial information. We denote a set of translation pairs by D and indicate degraded variants with subscripts. Both the training and the test sets are degraded.

Table 1: An example of the proposed input degradation schemes: D is the original sentence.

  D    a lady in a blue dress singing
  DC   a lady in a [v] dress singing
  DN   a [v] in a blue [v] singing
  D4   a lady in a [v] [v] [v]
  D2   a lady [v] [v] [v] [v] [v]
  D0   [v] [v] [v] [v] [v] [v] [v]

Color Deprivation. We consistently replace source words that refer to colors with a special token [v] (DC in Table 1). Our hypothesis is that a monomodal system will have to rely on source-side contextual information and biases, while a multimodal architecture could potentially capitalize on color information extracted from the image and thus obtain better performance. This affects 3.3% and 3.1% of the words in the training and the test set, respectively.

Entity Masking. The Flickr30K dataset, from which Multi30K is derived, has also been extended with coreference chains that tag mentions of visually depictable entities in image descriptions (Plummer et al., 2015). We use these to mask out the head nouns in the source sentences (DN in Table 1). This affects 26.2% of the words in both the training and the test set. We hypothesize that a multimodal system should rely heavily on the images to infer the missing parts.

Progressive Masking. A progressively degraded variant Dk replaces all but the first k tokens of the source sentences with [v]. Unlike color deprivation and entity masking, masking out suffixes does not guarantee systematic removal of visual context, but rather simulates an increasingly low-resource scenario. Overall, we form 16 degraded variants Dk (Table 1) where k ∈ {0, 2, ..., 30}. We stop at D30 since 99.8% of the sentences in Multi30K are shorter than 30 words, with an average sentence length of 12 words. D0 – where the only remaining information is the source sentence length – is an interesting case from two perspectives: a neural machine translation (NMT) model trained on it resembles a target language model, while an MMT model becomes an image captioner with access to "expected length information".

Visual Sensitivity. Inspired by Elliott (2018), we experiment with incongruent decoding in order to understand how sensitive the multimodal systems are to the visual modality. This is achieved by explicitly violating the test-time semantic congruence across modalities. Specifically, we feed the visual features in reverse sample order to break the image-sentence alignments. Consequently, a model capable of integrating the visual modality would likely deteriorate in terms of metrics.
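To make the degradation schemes concrete, the following is a minimal sketch of how color deprivation and progressive masking could be implemented on tokenized sentences. The function names and the color lexicon are illustrative assumptions; the paper does not publish its preprocessing code, and entity masking (which relies on the Flickr30K Entities annotations) is omitted here.

    # Illustrative sketch of the input degradation schemes (not the authors' code).
    # The color lexicon below is an assumption; the paper does not list the words used.
    COLORS = {"black", "white", "red", "blue", "green", "yellow", "brown",
              "pink", "orange", "purple", "gray", "grey"}
    MASK = "[v]"

    def color_deprivation(tokens):
        """D_C: replace every color word with the [v] token."""
        return [MASK if t.lower() in COLORS else t for t in tokens]

    def progressive_masking(tokens, k):
        """D_k: keep the first k tokens, replace the rest with [v]."""
        return tokens[:k] + [MASK] * max(0, len(tokens) - k)

    if __name__ == "__main__":
        sent = "a lady in a blue dress singing".split()
        print(" ".join(color_deprivation(sent)))       # a lady in a [v] dress singing
        print(" ".join(progressive_masking(sent, 4)))  # a lady in a [v] [v] [v]
        print(" ".join(progressive_masking(sent, 0)))  # [v] [v] [v] [v] [v] [v] [v]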
3 Experimental Setup

Dataset. We conduct experiments on the English→French part of Multi30K. The models are trained on the concatenation of the train and val sets (30K sentences), whereas test2016 (dev) and test2017 (test) are used for early stopping and model evaluation, respectively. For entity masking, we revert to the default Flickr30K splits and perform the model evaluation on test2016, since test2017 is not annotated for entities. We use word-level vocabularies of 9,951 English and 11,216 French words. We use Moses (Koehn et al., 2007) scripts to lowercase, normalize and tokenize the sentences with hyphen splitting. The hyphens are stitched back prior to evaluation.

Visual Features. We use a ResNet-50 CNN (He et al., 2016) trained on ImageNet (Deng et al., 2009) as image encoder. Prior to feature extraction, we center and standardize the images using ImageNet statistics, resize the shortest edge to 256 pixels and take a center crop of size 256x256. We extract spatial features of size 2048x8x8 from the final convolutional layer and apply L2 normalization along the depth dimension (Caglayan et al., 2018). For the non-attentive model, we use the 2048-dimensional global average pooled version (pool5) of the above convolutional features.
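The feature extraction just described can be sketched with PyTorch/torchvision as follows. This is an approximation under stated assumptions, not the authors' extraction code: the weight-loading argument depends on the installed torchvision version (older versions use pretrained=True), and whether L2 normalization is applied before the global average pooling is not specified in the paper.

    import torch
    import torchvision.transforms as T
    from torchvision.models import resnet50
    from PIL import Image

    # Centering/standardization with ImageNet statistics, shortest edge -> 256,
    # then a 256x256 center crop, as described in the Visual Features paragraph.
    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(256),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    # Keep everything up to the last convolutional block, dropping avgpool and fc.
    cnn = resnet50(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
    encoder = torch.nn.Sequential(*list(cnn.children())[:-2]).eval()

    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv = encoder(img)                                   # (1, 2048, 8, 8) spatial features
        spatial = conv / conv.norm(p=2, dim=1, keepdim=True)  # L2-normalized along depth
        pool5 = conv.mean(dim=(2, 3))                         # (1, 2048) global average pooled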
Models. Our baseline NMT is an attentive model (Bahdanau et al., 2014) with a 2-layer bidirectional GRU encoder (Cho et al., 2014) and a 2-layer conditional GRU decoder (Sennrich et al., 2017). The second layer of the decoder receives the output of the attention layer as input.

For the MMT models, we explore the basic multimodal attention (DIRECT) (Caglayan et al., 2016) and its hierarchical (HIER) extension (Libovický and Helcl, 2017). The former linearly projects the concatenation of the textual and visual context vectors to obtain the multimodal context vector, while the latter replaces the concatenation with another attention layer. Finally, we also experiment with encoder-decoder initialization (INIT) (Calixto and Liu, 2017; Caglayan et al., 2017a), where we initialize both the encoder and the decoder using a non-linear transformation of the pool5 features.

Hyperparameters. The encoder and decoder GRUs have 400 hidden units and are initialized with 0, except for the multimodal INIT system. All embeddings are 200-dimensional and the decoder embeddings are tied (Press and Wolf, 2016). A dropout of 0.4 and 0.5 is applied on the source embeddings and the encoder/decoder outputs, respectively (Srivastava et al., 2014). The weights are decayed with a factor of 1e-5. We use ADAM (Kingma and Ba, 2014) with a learning rate of 4e-4 and mini-batches of 64 samples. The gradients are clipped if the total norm exceeds 1 (Pascanu et al., 2013). The training is early-stopped if the dev set METEOR (Denkowski and Lavie, 2014) does not improve for ten epochs. All experiments are conducted with nmtpytorch (github.com/lium-lst/nmtpytorch) (Caglayan et al., 2017b).
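As a rough PyTorch-style illustration of the DIRECT fusion described above (concatenate the textual and visual context vectors, then project them linearly), the sketch below shows only the fusion step. The attention layers that produce the two context vectors are omitted, and all dimensionalities are placeholders rather than the values used in nmtpytorch.

    import torch
    import torch.nn as nn

    class DirectFusion(nn.Module):
        """DIRECT-style multimodal context: concat(text ctx, image ctx) -> linear projection."""

        def __init__(self, txt_dim=800, img_dim=2048, out_dim=800):
            super().__init__()
            self.proj = nn.Linear(txt_dim + img_dim, out_dim)

        def forward(self, txt_ctx, img_ctx):
            # txt_ctx: (batch, txt_dim), attention over the bidirectional GRU states
            # img_ctx: (batch, img_dim), attention over the 8x8 spatial image features
            return self.proj(torch.cat([txt_ctx, img_ctx], dim=-1))

    # The HIER variant would replace the concatenation with a second attention layer
    # that weighs the two modality-specific context vectors.
    fusion = DirectFusion()
    ctx = fusion(torch.randn(2, 800), torch.randn(2, 2048))  # (2, 800) multimodal context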
4 Results

We train all systems three times, each with a different random initialization, in order to perform significance testing with multeval (Clark et al., 2011). Throughout this section we report the mean (and the standard deviation) over the three runs for the considered metrics. We decode the translations with a beam size of 12.

We first present test2017 METEOR scores for the baseline NMT and MMT systems when trained on the full dataset D (Table 2). The first column indicates that, although the MMT models perform slightly better on average, they are not significantly better than the baseline NMT. We now introduce and discuss the results obtained under the proposed degradation schemes. Please refer to Table 5 and the appendix for qualitative examples.

Table 2: Baseline and color-deprivation METEOR scores (* marks systems significantly different from the NMT system within the same column, p-value ≤ 0.03).

            D             DC
  NMT       70.6 ± 0.5    68.4 ± 0.1
  INIT      70.7 ± 0.2    68.9 ± 0.1*
  HIER      70.9 ± 0.3    69.0 ± 0.3*
  DIRECT    70.9 ± 0.2    68.8 ± 0.3*

4.1 Color Deprivation

Unlike the inconclusive results for D, we observe that all MMT models are significantly better than NMT when color deprivation is applied (DC in Table 2). If we further focus on the subset of the test set subjected to color deprivation (247 sentences), the gain increases to 1.6 METEOR for HIER. For the latter subset, we also computed the average color accuracy per sentence and found that the attentive models are 12% better than the NMT (32.5→44.5), whereas the INIT model only brings a 4% improvement (32.5→36.5). This suggests that the more complex MMT models are better at integrating visual information.

4.2 Entity Masking

The gains are much more prominent with entity masking, where the degradation occurs at a larger scale: attentive MMT models show up to 4.2 METEOR improvement over NMT (Figure 1). We observed a large performance drop with incongruent decoding, suggesting that the visual modality is now much more important than previously demonstrated (Elliott, 2018).

[Figure 1: Entity masking METEOR scores with congruent and incongruent decoding: all masked MMT models (INIT 53.9, HIER 54.4, DIRECT 54.7) are significantly better than the masked NMT (50.5), and incongruent decoding severely worsens all systems (INIT 47.4, HIER 45.4, DIRECT 45.0). The vanilla NMT baseline is 75.9. Since entity masking uses the Flickr30K splits (Section 3) rather than our splits, these scores are not comparable to those from other experiments in this paper.]

A comparison of the attention maps produced by the baseline and masked MMT models reveals that the attention weights are more consistent in the latter. An interesting example is given in Figure 2, where the masked MMT model attends to the correct region of the image and successfully translates a dropped word that was otherwise a spelling mistake ("son"→"song").

[Figure 2: The baseline MMT (top) translates the misspelled "son" ("young song" → "jeune chanson"), while the masked MMT (bottom) correctly produces "enfant" (child) for "young [v]" by focusing on the image.]

Czech and German. In order to understand whether the above observations are also consistent across different languages, we extend the entity masking experiments to the German and Czech parts of Multi30K. Table 3 shows the gain of each MMT system with respect to the NMT model and the subsequent drop caused by incongruent decoding. For example, the INIT system for French (Figure 1) surpasses the baseline (50.5) by reaching 53.9 (+3.4), and ends up at 47.4 (↓ 6.5) after incongruent decoding. First, we see that the multimodal benefits clearly hold for German and Czech, although the gains are lower than for French, probably because of the morphological richness of German and Czech, which is suboptimally handled by word-level MT. Second, when we compute the average drop caused by incongruent images across all languages, we see how conservative the INIT system is (↓ 4.7) compared to HIER (↓ 6.1) and DIRECT (↓ 6.8). This raises a follow-up question as to whether the hidden state initialization eventually loses its impact throughout the recurrence, so that, as a consequence, the only modality effectively processed is the text.

Table 3: Entity masking results across three languages, reported as the gain over the corresponding NMT with the incongruence drop in parentheses: all MMT models perform significantly better than their NMT counterparts (p-value ≤ 0.01). The incongruence drop applies on top of the MMT score.

            INIT            HIER            DIRECT
  Czech     +1.4 (↓ 2.9)    +1.7 (↓ 3.5)    +1.7 (↓ 4.1)
  German    +2.1 (↓ 4.7)    +2.5 (↓ 5.9)    +2.7 (↓ 6.5)
  French    +3.4 (↓ 6.5)    +3.9 (↓ 9.0)    +4.2 (↓ 9.7)

4.3 Progressive Masking

Finally, we discuss the results of the progressive masking experiments for French. Figure 3 clearly shows that, as the sentences are progressively degraded, all MMT systems are able to leverage the visual modality. At the point where the multimodal task becomes image captioning (D0), the MMT models improve over the language-model counterpart by ∼7 METEOR. Further qualitative examples show that the systems perform surprisingly well by producing visually plausible sentences (see Table 5 and the Appendix).

[Figure 3: Multimodal gain of INIT, HIER and DIRECT in absolute METEOR over NMT as a function of the context size k under progressive masking; the thick gray curve indicates the percentage of non-masked words in the training set.]

To get a sense of the visual sensitivity, we pick the DIRECT models trained on four degraded variants and perform incongruent decoding. We notice that, as the amount of linguistic information increases, the gap narrows down: the MMT system gradually becomes less perplexed by the incongruence or, put differently, less sensitive to the visual modality (Table 4).

We then conduct a contrastive "blinding" experiment where the DIRECT models are not only fed with incongruent features at decoding time but are also trained with them from scratch. The results suggest that the blinded models learn to ignore the visual modality; in fact, their performance is equivalent to that of the NMT models.

Table 4: The impact of incongruent decoding for progressive masking: all METEOR differences are against the DIRECT model. The blinded systems are both trained and decoded using incongruent features.

                      D4      D6      D12     D20     D
  DIRECT              32.3    42.2    64.5    70.1    70.9
  Incongruent Dec.    ↓ 6.4   ↓ 5.5   ↓ 1.4   ↓ 0.7   ↓ 0.7
  Blinding            ↓ 3.9   ↓ 2.9   ↓ 0.4   ↓ 0.5   ↓ 0.3
  NMT                 ↓ 3.7   ↓ 2.6   ↓ 0.6   ↓ 0.2   ↓ 0.3
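Incongruent decoding, introduced in Section 2 and used throughout this section, only requires breaking the image-sentence alignment at test time; the blinding variant applies the same mismatched pairing during training as well. A minimal sketch is given below, assuming the visual features are stored in a tensor whose first dimension follows the sentence order.

    import torch

    def incongruent(features: torch.Tensor) -> torch.Tensor:
        """Break image-sentence alignment by reversing the sample order of the
        visual features (dimension 0 is assumed to index sentences)."""
        return torch.flip(features, dims=[0])

    # Congruent decoding pairs sentence i with features[i]; incongruent decoding
    # instead pairs sentence i with the features of sample N-1-i.
    feats = torch.randn(5, 2048, 8, 8)   # illustrative: 5 samples of spatial features
    shuffled = incongruent(feats)
    assert torch.equal(shuffled[0], feats[-1])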
SRC: an older woman in [v][v][v][v][v][v][v][v][v][v][v]
NMT: une femme âgée avec un t-shirt blanc et des lunettes de soleil est assise sur un banc (an older woman with a white t-shirt and sunglasses is sitting on a bench)
MMT: une femme âgée en maillot de bain rose est assise sur un rocher au bord de l'eau (an older woman in a pink swimsuit is sitting on a rock at the water's edge)
REF: une femme âgée en bikini bronze sur un rocher au bord de l'océan (an older woman in a bikini is tanning on a rock at the edge of the ocean)

SRC: a young [v] in [v] holding a tennis [v]
NMT: un jeune garçon en bleu tenant une raquette de tennis (a young boy in blue holding a tennis racket)
MMT: une jeune femme en blanc tenant une raquette de tennis
REF: une jeune femme en blanc tenant une raquette de tennis (a young woman in white holding a tennis racket)

SRC: little girl covering her face with a [v] towel
NMT: une petite fille couvrant son visage avec une serviette blanche (a little girl covering her face with a white towel)
MMT: une petite fille couvrant son visage avec une serviette bleue
REF: une petite fille couvrant son visage avec une serviette bleue (a little girl covering her face with a blue towel)

Table 5: Qualitative examples from progressive masking, entity masking and color deprivation, respectively. Underlined and bold words highlight the bad and good lexical choices. MMT is an attentive system.

5 Discussion and Conclusions

We presented an in-depth study on the potential contribution of images to multimodal machine translation. Specifically, we analysed the behavior of state-of-the-art MMT models under several degradation schemes applied to the Multi30K dataset, in order to reveal and understand the impact of textual predominance. Our results show that the models explored are able to integrate the visual modality if the available modalities are complementary rather than redundant. In the latter case, the primary modality (text) is sufficient to accomplish the task. This dominance effect corroborates the seminal work of Colavita (1974) in psychophysics, where it was demonstrated that visual stimuli dominate over auditory stimuli when humans are asked to perform a simple audiovisual discrimination task.

Our investigation using source degradation also suggests that visual grounding can increase the robustness of machine translation systems by mitigating input noise such as errors in the source text. In the future, we would like to devise models that can learn when and how to integrate multiple modalities by taking care of their complementary and redundant aspects in an intelligent way.

Acknowledgments

This work is a follow-up on the research efforts conducted within the "Grounded sequence-to-sequence transduction" team of the JSALT 2018 Workshop. We would like to thank Jindřich Libovický for contributing the hierarchical attention to nmtpytorch during the workshop. We also thank the reviewers for their valuable comments. Ozan Caglayan and Loïc Barrault received funding from the French National Research Agency (ANR) through the CHIST-ERA M2CR project under the contract ANR-15-CHR2-0006-01. Lucia Specia and Pranava Madhyastha received funding from the MultiMT (H2020 ERC Starting Grant No. 678017) and MMVC (Newton Fund Institutional Links Grant, ID 352343575) projects.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. Computing Research Repository, arXiv:1409.0473. Version 7.
Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 308–327, Brussels, Belgium. Association for Computational Linguistics.
Ozan Caglayan, Walid Aransa, Adrien Bardet, Mercedes García-Martínez, Fethi Bougares, Loïc Barrault, Marc Masana, Luis Herranz, and Joost van de Weijer. 2017a. LIUM-CVC submissions for WMT17 multimodal translation task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 432–439, Copenhagen, Denmark. Association for Computational Linguistics.
Ozan Caglayan, Adrien Bardet, Fethi Bougares, Loïc Barrault, Kai Wang, Marc Masana, Luis Herranz, and Joost van de Weijer. 2018. LIUM-CVC submissions for WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 603–608, Brussels, Belgium. Association for Computational Linguistics.
Ozan Caglayan, Loïc Barrault, and Fethi Bougares. 2016. Multimodal attention for neural machine translation. Computing Research Repository, arXiv:1609.03976.
Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, and Loïc Barrault. 2017b. NMTPY: A flexible toolkit for advanced neural machine translation systems. Prague Bulletin of Mathematical Linguistics, 109:15–28.
Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation, pages 634–638, Berlin, Germany. Association for Computational Linguistics.
Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.
Francis B. Colavita. 1974. Human sensory dominance. Perception & Psychophysics, 16(2):409–412.
J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380. Association for Computational Linguistics.
Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978, Brussels, Belgium. Association for Computational Linguistics.
Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.
Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.
Stig-Arne Grönroos, Benoit Huet, Mikko Kurimo, Jorma Laaksonen, Bernard Merialdo, Phu Pham, Mats Sjöberg, Umut Sulubacak, Jörg Tiedemann, Raphael Troncy, and Raúl Vázquez. 2018. The MeMAD submission to the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, pages 609–617, Brussels, Belgium. Association for Computational Linguistics.
K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
Jindřich Helcl, Jindřich Libovický, and Dusan Varis. 2018. CUNI system for the WMT18 multimodal translation task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 622–629, Brussels, Belgium. Association for Computational Linguistics.
Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Dyer. 2016. Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation, pages 639–645, Berlin, Germany. Association for Computational Linguistics.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computing Research Repository, arXiv:1412.6980.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
Chiraag Lala, Pranava Swaroop Madhyastha, Carolina Scarton, and Lucia Specia. 2018. Sheffield submissions for WMT18 multimodal translation shared task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 630–637, Brussels, Belgium. Association for Computational Linguistics.
Jindřich Libovický and Jindřich Helcl. 2017. Attention strategies for multi-source sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 196–202, Vancouver, Canada. Association for Computational Linguistics.
Mingbo Ma, Dapeng Li, Kai Zhao, and Liang Huang. 2017. OSU multimodal machine translation system report. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 465–469, Copenhagen, Denmark. Association for Computational Linguistics.
Pranava Swaroop Madhyastha, Josiah Wang, and Lucia Specia. 2017. Sheffield MultiMT: Using object posterior predictions for multimodal machine translation. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 470–476, Copenhagen, Denmark. Association for Computational Linguistics.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1310–1318, Atlanta, Georgia, USA. PMLR.
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2641–2649.
Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. Computing Research Repository, arXiv:1608.05859.
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68. Association for Computational Linguistics.
Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the First Conference on Machine Translation, pages 543–553, Berlin, Germany. Association for Computational Linguistics.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
A Qualitative Examples

SRC: a girl in [v] is sitting on a bench
  NMT: pink    INIT: pink    HIER: black    DIRECT: black

SRC: a man dressed in [v] talking to a girl
  NMT: black   INIT: black   HIER: white    DIRECT: white

SRC: a [v] dog sits under a [v] umbrella
  NMT: brown / blue    INIT: black / blue    HIER: black / blue    DIRECT: black / blue

SRC: a woman in a [v] top is dancing as a woman and boy in a [v] shirt watch
  NMT: blue / blue     INIT: blue / blue     HIER: red / red       DIRECT: red / red

SRC: three female dancers in [v] dresses are performing a dance routine
  NMT: white   INIT: white   HIER: white    DIRECT: blue

Table 6: Color deprivation examples from the English→French models: bold indicates correctly predicted cases. The colors generated by the models are shown in English for the sake of clarity.
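The per-sentence color accuracy reported in Section 4.1 is not formally defined in the paper; the sketch below is one plausible reading of it, under the assumption that the color words of the hypothesis are compared against those of the reference, position by position. The French color lexicon shown is illustrative and deliberately incomplete.

    # Hypothetical sketch of a per-sentence color accuracy (the exact metric used
    # in Section 4.1 is an assumption, not documented in the paper).
    COLOR_WORDS_FR = {"noir", "noire", "blanc", "blanche", "rouge", "bleu", "bleue",
                      "vert", "verte", "jaune", "marron", "rose", "violet", "violette"}

    def color_accuracy(hypothesis: str, reference: str) -> float:
        ref_colors = [t for t in reference.split() if t in COLOR_WORDS_FR]
        hyp_colors = [t for t in hypothesis.split() if t in COLOR_WORDS_FR]
        if not ref_colors:
            return 1.0
        hits = sum(h == r for h, r in zip(hyp_colors, ref_colors))
        return hits / len(ref_colors)

    print(color_accuracy("une serviette bleue", "une serviette bleue"))    # 1.0
    print(color_accuracy("une serviette blanche", "une serviette bleue"))  # 0.0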
SRC: a [v] in a red [v] plays in the [v]
NMT: un garçon en t-shirt rouge joue dans la neige (a boy in a red t-shirt plays in the snow)
MMT: un garçon en maillot de bain rouge joue dans l'eau
REF: un garçon en maillot de bain rouge joue dans l'eau (a boy in a red swimsuit plays in the water)

SRC: a [v] drinks [v] outside on the [v]
NMT: un homme boit du vin dehors sur le trottoir (a man drinks wine outside on the sidewalk)
MMT: un chien boit de l'eau dehors sur l'herbe
REF: un chien boit de l'eau dehors sur l'herbe (a dog drinks water outside on the grass)

SRC: two [v] are driving on a [v]
NMT: deux hommes font du vélo sur une route (two men riding bicycles on a road)
MMT: deux voitures roulent sur une piste (two cars driving on a track/circuit)
REF: deux voitures roulent sur un circuit

SRC: a [v] turns on the [v] to pursue a flying [v]
NMT: un homme tourne sur la plage pour attraper un frisbee volant (a man turns on the beach to catch a flying frisbee)
MMT: un chien tourne sur l'herbe pour attraper un frisbee volant (a dog turns on the grass to catch a flying frisbee)
REF: un chien tourne sur l'herbe pour poursuivre une balle en l'air (a dog turns on the grass to chase a ball in the air)

SRC: a [v] jumping [v] on a [v] near a parking [v]
NMT: un homme sautant à cheval sur une plage près d'un parking (a man jumping on a beach near a parking lot)
MMT: une fille sautant à la corde sur un trottoir près d'un parking
REF: une fille sautant à la corde sur un trottoir près d'un parking (a girl jumping rope on a sidewalk near a parking lot)

Table 7: Entity masking examples from the English→French models: underlined and bold words highlight bad and good lexical choices, respectively. English translations are provided in parentheses. MMT is an attentive model.
[Figure 4: Attention example from the entity masking experiments: (a) the baseline (non-masked) MMT translates the misspelled "son" (song → chanson), while (b) the entity-masked MMT achieves a correct translation ([v] → enfant) by exploiting the visual modality.]
[Figure 5: Attention example from the entity masking experiments where "terrier", "grass" and "fence" are dropped from the source sentence: (a) the baseline (non-masked) MMT is not able to shift its attention from the salient dog to the grass and the fence, whereas (b) the attention produced by the entity-masked MMT first shifts to the background area while translating "on lush green [v]" and then focuses on the fence.]
SRC: a child [v][v][v][v][v][v]
NMT: un enfant avec des lunettes de soleil en train de jouer au tennis (a child with sunglasses playing tennis)
MMT: un enfant est debout dans un champ de fleurs (a child is standing in a field of flowers)
REF: un enfant dans un champ de tulipes (a child in a field of tulips)

SRC: a jockey riding his [v][v]
NMT: un jockey sur son vélo (a jockey on his bike)
MMT: un jockey sur son cheval
REF: un jockey sur son cheval (a jockey on his horse)

SRC: girls are playing a [v][v][v]
NMT: des filles jouent à un jeu de cartes (girls are playing a card game)
MMT: des filles jouent un match de football
REF: des filles jouent un match de football (girls are playing a football match)

SRC: trees are in front [v][v][v][v][v]
NMT: des vélos sont devant un bâtiment en plein air (bicycles are in front of an outdoor building)
MMT: des arbres sont devant la montagne (trees are in front of the mountain)
REF: des arbres sont devant une grande montagne (trees are in front of a big mountain)

SRC: a fishing net on the deck of a [v][v]
NMT: un filet de pêche sur la terrasse d'un bâtiment (a fishing net on the terrace of a building)
MMT: un filet de pêche sur le pont d'un bateau (a fishing net on the deck of a boat)
REF: un filet de pêche sur le pont d'un bateau rouge (a fishing net on the deck of a red boat)

SRC: girls wave purple flags [v][v][v][v][v][v][v]
NMT: des filles en t-shirts violets sont assises sur des chaises dans une salle de classe (girls in purple t-shirts are sitting on chairs in a classroom)
MMT: des filles en costumes violets dansent dans une rue en ville (girls in purple costumes dance on a city street)
REF: des filles agitent des drapeaux violets tandis qu'elles défilent dans la rue (girls wave purple flags as they parade down the street)

Table 8: English→French progressive masking examples: underlined and bold words highlight bad and good lexical choices, respectively. English translations are provided in parentheses. MMT is an attentive model.