Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott, Stella Frank, Khalil Sima'an
ILLC, University of Amsterdam
d.elliott@uva.nl, s.c.frank@uva.nl, k.simaan@uva.nl

Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk

Abstract

We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) German descriptions crowdsourced independently of the original English descriptions. We describe the data and outline how it can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.

1 Introduction

Image description is one of the core challenges at the intersection of Natural Language Processing (NLP) and Computer Vision (CV) (Bernardi et al., 2016). This task has so far received attention almost exclusively in a monolingual English setting, helped by the availability of English datasets, e.g. Flickr8K (Hodosh et al., 2013), Flickr30K (Young et al., 2014), and MS COCO (Chen et al., 2015). However, the possible applications of image description are useful for all languages, such as searching for images using natural language, or providing alternative-description text for visually impaired Web users.

We introduce a large-scale dataset of images paired with sentences in English and German as an initial step towards studying the value and the characteristics of multilingual-multimodal data. The dataset is freely available under the Creative Commons Attribution NonCommercial ShareAlike 4.0 International license from http://www.statmt.org/wmt16/multimodal-task.html.

Multi30K is an extension of the Flickr30K dataset (Young et al., 2014) with 31,014 German translations of English descriptions and 155,070 independently collected German descriptions. The translations were collected from professionally contracted translators, whereas the descriptions were collected from untrained crowdworkers. The key difference between these corpora is the relationship between the sentences in different languages. In the translated corpus, we know there is a strong correspondence between the sentences in both languages. In the descriptions corpus, we only know that the sentences, regardless of the language, are supposed to describe the same image.

A dataset of images paired with sentences in multiple languages broadens the scope of multimodal NLP research. Image description with multilingual data can also be seen as machine translation in a multimodal context. This opens up new avenues for researchers in machine translation (Koehn et al., 2003; Chiang, 2005; Sutskever et al., 2014; Bahdanau et al., 2015, inter alia) to work with multilingual multimodal data. Image-sentence ranking using monolingual multimodal datasets (Hodosh et al., 2013, inter alia) is also a natural task for multilingual modelling.

The only existing datasets of images paired with multilingual sentences were created by professionally translating English into the target language: IAPR-TC12, with 20,000 English-German described images (Grubinger et al., 2006), and the Pascal Sentences Dataset, with 1,000 Japanese-English described images (Funaki and Nakayama, 2015). The Multi30K dataset is larger than both of these and contains both independent and translated sentences. We hope this dataset will be of broad interest across NLP and CV research, and we anticipate that these communities will put the data to use in a broader range of tasks than we can foresee.
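To make the relationship between the two German corpora and the English descriptions concrete, the following minimal sketch shows one possible in-memory representation of the data and a helper for reading a sentence-aligned corpus. The file names and the per-image layout are illustrative assumptions, not the official distribution format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Multi30KExample:
    """One Flickr30K image with its associated Multi30K sentences.

    Assumed structure (illustrative, not the official file layout):
    - five English descriptions inherited from Flickr30K,
    - one professional German translation of a selected English description,
    - five independently crowdsourced German descriptions.
    """
    image_id: str
    english_descriptions: List[str]   # 5 per image (Flickr30K)
    german_translation: str           # 1 per image (translated corpus)
    german_descriptions: List[str] = field(default_factory=list)  # 5 per image

def load_parallel(en_path: str, de_path: str):
    """Read a sentence-aligned corpus stored one sentence per line.

    The file names (e.g. 'train.en' / 'train.de') are hypothetical; only the
    translated corpus is sentence-aligned across the two languages.
    """
    with open(en_path, encoding="utf-8") as f_en, open(de_path, encoding="utf-8") as f_de:
        return [(en.strip(), de.strip()) for en, de in zip(f_en, f_de)]

# Hypothetical usage:
# pairs = load_parallel("train.en", "train.de")
```

The distinction drawn above carries over directly: only the translated sentences are aligned to a specific English source sentence, while the independent descriptions are linked to the English ones only through the shared image.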
(a) Translations
1. Brick layers constructing a wall.
2. Maurer bauen eine Wand.
1. Trendy girl talking on her cellphone while gliding slowly down the street
2. Ein schickes Mädchen spricht mit dem Handy während sie langsam die Straße entlangschwebt.

(b) Independent descriptions
1. The two men on the scaffolding are helping to build a red brick wall.
2. Zwei Mauerer mauern ein Haus zusammen.
1. There is a young girl on her cellphone while skating.
2. Eine Frau im blauen Shirt telefoniert beim Rollschuhfahren.

Figure 1: Multilingual examples in the Multi30K dataset. The independent sentences are all accurate descriptions of the image but do not contain the same details in both languages, such as the shirt colour or the scaffolding. In the second translation pair, the translator has translated "glide" as "schweben" ("to float"), probably due to not seeing the image context (see Section 2.1 for more details).

2 The Multi30K Dataset

The Flickr30K Dataset contains 31,014 images sourced from online photo-sharing websites (Young et al., 2014). Each image is paired with five English descriptions, which were collected from Amazon Mechanical Turk (http://www.mturk.com). The dataset contains 145,000 training, 5,070 development, and 5,000 test descriptions. The Multi30K dataset extends the Flickr30K dataset with translated and independent German sentences.

2.1 Translations

The translations were collected from professional English-German translators contracted via an established Language Service in Germany. Figure 1 presents an example of the differences between the two types of data. We collected one translated description per image, resulting in a total of 31,014 translations. To ensure an even distribution over description length, the English descriptions were chosen based on their relative length, with an equal number of longest, shortest, and median-length source descriptions. We paid a total of €23,000 to collect the data (€0.06 per word). Translators were shown an English-language sentence and asked to produce a correct and fluent German translation for it, without seeing the image. We decided against showing the images to translators to make this process as close as possible to a standard translation task, which also makes the data collected here distinct from the independent descriptions collected as described in Section 2.2.

2.2 Independent Descriptions

The descriptions were collected from crowdworkers via the Crowdflower platform (http://www.crowdflower.com). We collected five descriptions per image in the Flickr30K dataset, resulting in a total of 155,070 sentences. Workers were presented with a translated version of the data collection interface used by Hodosh et al. (2013), as shown in Figure 2. We translated the interface to make the task as similar as possible to the crowdsourcing of the English sentences. The instructions were translated by one of the authors and checked by a native German Ph.D. student.

Figure 2: The German instructions shown to crowdworkers were translated from the original instructions.

185 crowdworkers took part in the task over a period of 31 days. We split the task into 1,000 randomly selected images per day to control the quality of the data and to prevent worker fatigue. Workers were required to have a German-language skill certification and to be at least a Crowdflower Level 2 Worker: they have participated in at least 10 different Crowdflower jobs, have passed at least 100 quality-control questions, and have a job acceptance rate of at least 85%.

The descriptions were collected in batches of five images per job. Each image was randomly selected from the complete set of 1,000 images for that day, and workers were limited to writing at most 250 descriptions per day. We paid workers $0.05 per description, the same rate as Rashtchian et al. (2010) and Elliott and Keller (2013) paid to collect English sentences, and prevented them from submitting faster than 90 seconds per job to discourage low-quality work. This works out at a rate of 40 jobs per hour, i.e. 200 descriptions per hour. We configured Crowdflower to automatically ban users who worked faster than this rate; thus the theoretical maximum wage was $10 per hour. We paid a total of $9,591.24 towards collecting the data and paying the Crowdflower platform fees.

During the collection of the data, we assessed the quality both by manually checking a subset of the descriptions and with automated checks. We inspected the submissions of users who wrote sentences with fewer than five words, and users with high type-to-token ratios (to detect repetition). We also used a character-level 6-gram language model to flag descriptions with high perplexity, which was very effective at catching nonsensical sentences. In general we did not have to ban or reject many users, and overall description quality was high.
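The automated checks described above can be approximated with a small amount of code. The sketch below is an illustration of the general idea rather than the authors' implementation: it trains a simple add-one-smoothed character 6-gram model and flags descriptions that are very short or receive high per-character perplexity. The perplexity threshold and the smoothing scheme are assumptions made for illustration; a type-to-token ratio per worker can be computed analogously and inspected by hand.

```python
import math
from collections import Counter

def char_ngrams(text, n=6):
    """Yield character n-grams from a left-padded, lowercased sentence."""
    padded = " " * (n - 1) + text.lower()
    for i in range(len(padded) - n + 1):
        yield padded[i:i + n]

class CharNgramLM:
    """Add-one-smoothed character n-gram model (a minimal stand-in for the
    character-level 6-gram LM mentioned in Section 2.2; the authors' exact
    model and toolkit are not specified)."""

    def __init__(self, n=6):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, sentences):
        for sent in sentences:
            self.vocab.update(sent.lower())
            for gram in char_ngrams(sent, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def perplexity(self, sentence):
        log_prob, count = 0.0, 0
        v = len(self.vocab) + 1  # smoothing over the character vocabulary
        for gram in char_ngrams(sentence, self.n):
            num = self.ngram_counts[gram] + 1
            den = self.context_counts[gram[:-1]] + v
            log_prob += math.log(num / den)
            count += 1
        return math.exp(-log_prob / max(count, 1))

def flag_suspicious(descriptions, lm, min_words=5, ppl_threshold=50.0):
    """Return descriptions that are very short or have high LM perplexity.

    The threshold is an illustrative assumption; in practice it would be
    tuned against manually checked submissions.
    """
    return [d for d in descriptions
            if len(d.split()) < min_words or lm.perplexity(d) > ppl_threshold]
```

In this sketch, the model would be trained on already-accepted German descriptions, and flagged submissions would be routed to manual review rather than rejected automatically.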
2.3 Translated vs. Independent Descriptions

We now analyse the differences between the translated and the description corpora. For this analysis, all sentences were stripped of punctuation and truecased using the Moses truecase.perl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl) trained over the Europarl v7 and News Commentary v11 English-German parallel corpora.

Table 1 shows the differences between the corpora. The German translations are longer than the independent descriptions (11.1 vs. 9.6 words), while the English descriptions selected for translation are slightly shorter, on average, than the Flickr30K average (11.9 vs. 12.3). When we compare the German translation dataset against an equal number of sentences from the German descriptions dataset, we find that the translations also have more word types (19.3K vs. 17.6K) and more singletons, i.e. types occurring only once (11.3K vs. 10.2K; in both datasets singletons comprise 58% of the vocabulary). The translations thus have a wider vocabulary, despite being generated by a smaller number of authors. The English datasets (all descriptions vs. those selected for translation) show a similar trend, indicating that these differences may be a result of the decision to select similar numbers of short, medium, and long English sentences for translation.

2.4 English vs. German

The English image descriptions are generally longer than the German descriptions, both in terms of words and characters. Note that the difference is much smaller when measuring characters: German uses 22% fewer words but only 2.5% fewer characters. However, we observe a different pattern in the translation corpora: German uses 6.6% fewer words than English but 17.1% more characters. The vocabularies of the German description and translation corpora are more than twice as large as those of the English corpora. Additionally, the German corpora have two to three times as many singletons. This is likely due to richer morphological variation in German, as well as word compounding.
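The corpus-level statistics discussed here and reported in Table 1 below (sentences, tokens, types, characters, average length, singletons) are straightforward to recompute. The following sketch shows one way to derive them from a corpus stored with one sentence per line; the file name is a placeholder and whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter

def corpus_statistics(path):
    """Compute token, type, character, singleton, and average-length
    statistics for a corpus stored with one (punctuation-stripped,
    truecased) sentence per line. Whitespace tokenisation is assumed."""
    n_sentences = 0
    n_tokens = 0
    n_chars = 0
    type_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            n_sentences += 1
            n_tokens += len(tokens)
            n_chars += len(line.strip())
            type_counts.update(tokens)
    singletons = sum(1 for count in type_counts.values() if count == 1)
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "types": len(type_counts),
        "characters": n_chars,
        "avg_length": n_tokens / max(n_sentences, 1),
        "singletons": singletons,
    }

# Hypothetical usage:
# print(corpus_statistics("translations.de"))
```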
Translations (31,014 sentences):
  English: 357,172 tokens; 11,420 types; 1,472,251 characters; avg. length 11.9; 5,073 singletons
  German:  333,833 tokens; 19,397 types; 1,774,234 characters; avg. length 11.1; 11,285 singletons
Descriptions (155,070 sentences):
  English: 1,841,159 tokens; 22,815 types; 7,611,033 characters; avg. length 12.3; 9,230 singletons
  German:  1,434,998 tokens; 46,138 types; 7,418,572 characters; avg. length 9.6; 26,510 singletons

Table 1: Corpus-level statistics about the translation and the description data.

3 Discussion

The Multi30K dataset is immediately suitable for research on a wide range of tasks, including but not limited to automatic image description, image-sentence ranking, multimodal and multilingual semantics, and machine translation. In what follows we highlight two applications in which Multi30K could be directly used. For more examples of approaches targeting these applications, we refer the reader to the forthcoming report on the WMT16 shared task on Multimodal Machine Translation and Crosslingual Image Description (Specia et al., 2016).

3.1 Multi30K for Image Description

Deep neural networks for image description typically integrate visual features into a recurrent neural network language model (Vinyals et al., 2015; Xu et al., 2015, inter alia). Elliott et al. (2015) demonstrated how to build multilingual image description models that learn and transfer features between monolingual image description models. They performed a series of experiments on the IAPR-TC12 dataset (Grubinger et al., 2006) of images aligned with German translations, showing that both English and German image description could be improved by transferring features from a multimodal neural language model trained to generate descriptions in the other language. The Multi30K dataset will enable further research in this direction, allowing researchers to work with larger datasets with multiple references per image.

3.2 Multi30K for Machine Translation

Machine translation is typically performed using only textual data, for example news data, the Europarl corpora, or corpora harvested from the Web (CommonCrawl, Wikipedia, etc.). The Multi30K dataset makes it possible to further develop machine translation in a setting where multimodal data, such as images or video, are observed alongside text. The potential advantages of using multimodal information for machine translation include the ability to better deal with ambiguous source text and to avoid (untranslated) out-of-vocabulary words in the target language (Calixto et al., 2012). Hitschler and Riezler (2016) have demonstrated the potential of multimodal features in a target-side translation reranking model. Their approach is initially trained over large text-only translation corpora and then fine-tuned with a small amount of in-domain data, such as our dataset. We expect a variety of translation models can be adapted to take advantage of multimodal data, either as features in a translation model or as feature vectors in neural machine translation models.
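To make the second option concrete, the following sketch shows one simple way an image feature vector could enter a neural translation model: the decoder's initial state is computed from both the encoded source sentence and a projection of pre-extracted visual features. This is a hedged illustration of the general idea, not the architecture of Hitschler and Riezler (2016) or of any other specific system; the layer sizes, the single-layer GRUs, and the assumption of fixed CNN features are all illustrative.

```python
import torch
import torch.nn as nn

class MultimodalSeq2Seq(nn.Module):
    """Minimal sketch of conditioning a sequence-to-sequence translation
    model on an image: the decoder is initialised from the sum of the
    source-sentence encoding and a projection of image features."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512, img_dim=4096):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid_dim)  # image features -> hidden size
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens, image_features):
        # src_tokens: (batch, src_len) English token ids
        # tgt_tokens: (batch, tgt_len) German token ids (teacher forcing)
        # image_features: (batch, img_dim) pre-extracted CNN features
        _, enc_state = self.encoder(self.src_embed(src_tokens))
        img_state = torch.tanh(self.img_proj(image_features)).unsqueeze(0)
        dec_init = enc_state + img_state              # fuse textual and visual signals
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), dec_init)
        return self.out(dec_out)                      # (batch, tgt_len, tgt_vocab) logits
```

Training would proceed as for a standard sequence-to-sequence model, with a cross-entropy loss over the output logits; alternatively, visual features can be used only at test time, for example to rerank the n-best output of a text-only system as in the reranking approach described above.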
4 Conclusions

We introduced Multi30K, a large-scale multilingual multimodal dataset for interdisciplinary machine learning research. Our dataset is an extension of the popular Flickr30K dataset with descriptions and professional translations in German. The descriptions were collected from a crowdsourcing platform, while the translations were collected from professionally contracted translators. These differences are deliberate and part of the larger scope of studying multilingual multimodal data in different contexts. The descriptions were collected as similarly as possible to the original Flickr30K dataset, by translating the instructions used by Young et al. (2014) into German. The translations were collected without showing the images to the translators, to keep the process as close to a standard translation task as possible.

There are substantial differences between the translated and the description datasets. The translations contain approximately the same number of tokens and have sentences of approximately the same length in both languages. These properties make them well suited to machine translation models. The description datasets are very different in terms of average sentence lengths and the number of word types per language. This is likely to pose different engineering and scientific challenges, because the descriptions form independently collected corpora rather than a sentence-aligned corpus.

In the future, we want to study multilingual multimodality over a wider range of languages, for example beyond the Indo-European families. We call on the community to engage with us in creating massively multilingual multimodal datasets.

Acknowledgements

DE and KS were supported by the NWO Vici grant nr. 277-89-002. SF was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement nr. 645452. We are grateful to Philip Schulz for checking the German translation of the worker instructions, and to Joachim Daiber for providing the pre-trained truecasing models.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research, 55:409–442.

Iacer Calixto, Teo E. de Campos, and Lucia Specia. 2012. Images as context in statistical machine translation. In Second Annual Meeting of the EPSRC Network on Vision & Language.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In ACL, pages 263–270.

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP, pages 1292–1302.

Desmond Elliott, Stella Frank, and Eva Hasler. 2015. Multi-language image description with neural sequence models. CoRR, abs/1510.04709.

Ruka Funaki and Hideki Nakayama. 2015. Image-mediated learning for zero-shot cross-lingual document retrieval. In EMNLP, pages 585–590.

Michael Grubinger, Paul D. Clough, Henning Müller, and Thomas Deselaers. 2006. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In LREC.

Julian Hitschler and Stefan Riezler. 2016. Multimodal pivots for image caption translation. CoRR, abs/1601.03916.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL, pages 48–54.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. In NAACL-HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Lucia Specia, Stella Frank, Khalil Sima'an, and Desmond Elliott. 2016. A shared task on multimodal machine translation and crosslingual image description. In First Conference on Machine Translation.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics.