NLPHut's Participation at WAT2021

Shantipriya Parida*, Subhadarshi Panda†, Ketan Kotwal*, Amulya Ratna Dash♣, Satya Ranjan Dash♠, Yashvardhan Sharma♣, Petr Motlicek*, Ondřej Bojar♢
* Idiap Research Institute, Martigny, Switzerland ({firstname.lastname}@idiap.ch)
† Graduate Center, City University of New York, USA (spanda@gradcenter.cuny.edu)
♣ Birla Institute of Technology and Science, Pilani, India ({p20200105,yash}@pilani.bits-pilani.ac.in)
♠ KIIT University, Bhubaneswar, India (sdashfca@kiit.ac.in)
♢ Charles University, MFF, ÚFAL, Prague, Czech Republic (bojar@ufal.mff.cuni.cz)

Abstract

This paper describes the submissions of our team "NLPHut" to the WAT 2021 shared tasks. We participated in the English→Hindi Multimodal translation task, the English→Malayalam Multimodal translation task, and the Indic Multilingual translation task. For the translation tasks, we used the state-of-the-art Transformer model with language tags in different settings, and for Hindi and Malayalam image captioning we propose a novel "region-specific" caption generation approach that combines an image CNN with an LSTM. Our submission tops the English→Malayalam Multimodal translation task (text-only translation and Malayalam caption) and ranks second-best in the English→Hindi Multimodal translation task (text-only translation and Hindi caption). Our submissions also performed well in the Indic Multilingual translation task.

1 Introduction

Machine translation (MT) is considered one of the most successful applications of natural language processing (NLP).[1] It has evolved significantly, especially in terms of the accuracy of its output. Although MT performance has reached near-human level for several language pairs (see e.g. Popel et al., 2020), it remains challenging for low-resource languages or for translation that effectively utilizes other modalities (e.g. images, Parida et al., 2020).

The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages since 2013 (Nakazawa et al., 2020). In the WAT2021 (Nakazawa et al., 2021) Multimodal track, a new Indian language, Malayalam, was introduced for English→Malayalam text translation, multimodal translation, and Malayalam image captioning.[2] This year, the MultiIndic task[3] covers 10 Indic languages and English.

In this system description paper, we explain our approach for the tasks (including the subtasks) we participated in:

Task 1: English→Hindi (EN-HI) Multimodal Translation
• EN-HI text-only translation
• Hindi-only image captioning

Task 2: English→Malayalam (EN-ML) Multimodal Translation
• EN-ML text-only translation
• Malayalam-only image captioning

Task 3: Indic Multilingual translation task

Section 2 describes the datasets used in our experiments. Section 3 presents the model and experimental setups used in our approach. Section 4 provides the official WAT2021 evaluation results,[4] followed by the conclusion in Section 5.

[1] https://morioh.com/p/d596d2d4444d
[2] https://ufal.mff.cuni.cz/malayalam-visual-genome/wat2021-english-malayalam-multi
[3] http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/
[4] http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2021/index.html
2 Dataset

We used the official datasets provided by the WAT2021 organizers for all tasks.

Task 1: English→Hindi Multimodal Translation. For this task, the organizers provided the Hindi Visual Genome 1.1 dataset (HVG for short; Parida et al., 2019).[5] The training part consists of 29k English and Hindi short captions of rectangular regions in photos of various scenes, and it is complemented by three test sets: the development (D-Test), evaluation (E-Test), and challenge test sets (C-Test). Our WAT submissions were for E-Test (denoted "EV" in the WAT official tables) and C-Test (denoted "CH" in the WAT tables). Additionally, we used the IITB Corpus,[6] which is reportedly the largest publicly available English-Hindi parallel corpus (Kunchukuttan et al., 2017). This corpus contains 1.59 million parallel segments and has been found very effective for English-Hindi translation (Parida and Bojar, 2018). The statistics of the datasets are shown in Table 1.

Table 1: Statistics of our data used in the English→Hindi and English→Malayalam Multimodal tasks: the number of sentences and tokens.

Set | Sentences | Tokens (English) | Tokens (Hindi) | Tokens (Malayalam)
Train | 28930 | 143164 | 145448 | 107126
D-Test | 998 | 4922 | 4978 | 3619
E-Test | 1595 | 7853 | 7852 | 5689
C-Test | 1400 | 8186 | 8639 | 6044
IITB Train | 1.5 M | 20.6 M | 22.1 M | –

Task 2: English→Malayalam Multimodal Translation. For this task, the organizers provided the Malayalam Visual Genome 1.0 dataset[7] (MVG for short). MVG is an extension of the HVG dataset supporting Malayalam, which belongs to the Dravidian language family (Kumar et al., 2017). The dataset size and images are the same as in HVG. While HVG contains bilingual (English and Hindi) segments, MVG contains bilingual (English and Malayalam) segments, with the English shared across HVG and MVG, see Table 1.

Task 3: Indic Multilingual Translation. For this task, the organizers provided a training corpus that comprises in total 11 million sentence pairs collected from several corpora. The evaluation sets (dev and test) contain filtered data from the PMIndia dataset (Haddow and Kirefu, 2020).[8] We did not use any additional resources in this task. The statistics of the dataset are shown in Table 2.

[5] https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3267
[6] http://www.cfilt.iitb.ac.in/iitb_parallel/
[7] https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3533
[8] http://data.statmt.org/pmindia/

3 Experimental Details

This section describes the experimental details of the tasks we participated in.

3.1 EN-HI and EN-ML text-only translation

For the HVG text-only translation track, we train a Transformer model (Vaswani et al., 2017) on the concatenation of the IIT-B training data and the HVG training data (see Table 1). Similar to the two-phase approach outlined in Section 3.3, we then continue the training using only the HVG training data to obtain the final checkpoint. For the MVG text-only translation track, we train a Transformer model using only the MVG training data.

For both EN-HI and EN-ML translation, we trained SentencePiece subword units (Kudo and Richardson, 2018) with a maximum vocabulary size of 8k. The vocabulary was learned jointly on the source and target sentences of HVG and IIT-B for EN-HI, and of MVG for EN-ML. The number of encoder and decoder layers was set to 3 each, and the number of attention heads to 8. We set the hidden size to 128 and the dropout to 0.1. We initialized the model parameters with Xavier initialization (Glorot and Bengio, 2010) and optimized them with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 5e-4. Gradients with a norm greater than 1 were clipped. Training was stopped when the development loss did not improve for 5 consecutive epochs. For both the EN-HI training on the concatenated IIT-B + HVG data and the subsequent training on only the HVG data, we used the same HVG dev set for determining early stopping. For generating translations, we used greedy decoding and generated tokens autoregressively until the end-of-sentence token was produced or the maximum translation length (set to 100) was reached.
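The paper does not name the toolkit used for these experiments, so the following minimal PyTorch sketch only illustrates the reported configuration (3 encoder/decoder layers, 8 heads, hidden size 128, dropout 0.1, Xavier initialization, Adam with learning rate 5e-4, gradient clipping at 1, early stopping with patience 5). All class and variable names are our own; attention masks and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Small Transformer matching the hyperparameters reported in Section 3.1."""
    def __init__(self, vocab_size=8000, d_model=128, nhead=8, num_layers=3, dropout=0.1):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)
        # Xavier initialization of all weight matrices.
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)

def train(model, train_batches, dev_batches, patience=5, lr=5e-4, max_epochs=100):
    """Adam, gradient clipping at norm 1, early stopping on dev loss (patience 5)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_dev, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for src, tgt in train_batches:
            optimizer.zero_grad()
            logits = model(src, tgt[:, :-1])                      # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        model.eval()
        with torch.no_grad():
            dev_loss = sum(
                criterion(model(s, t[:, :-1]).reshape(-1, model.out.out_features),
                          t[:, 1:].reshape(-1)).item()
                for s, t in dev_batches) / len(dev_batches)
        if dev_loss < best_dev:
            best_dev, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no dev improvement for 5 consecutive epochs
    return model
```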
Table 2: Statistics of the data used for Indic multilingual translation.

Language pair | en-bn | en-hi | en-gu | en-ml | en-mr | en-ta | en-te | en-pa | en-or | en-kn
Train (ALL) | 1756197 | 3534387 | 518015 | 1204503 | 781872 | 1499441 | 686626 | 518508 | 252160 | 396865
Train (PMI) | 23306 | 50349 | 41578 | 26916 | 28974 | 32638 | 33380 | 28294 | 31966 | 28901
Dev | 1000
Test | 2390

We show the training and development perplexities for EN-HI and EN-ML translation in Figure 4a. The dev perplexity for EN-HI translation is lower at the beginning (after epoch 1) because the model is trained on more training samples (IIT-B + HVG) than for EN-ML. Overall, EN-HI training takes around twice as much time as EN-ML training, again due to the involvement of the bigger IIT-B training data. The drop in perplexity midway through EN-HI training is caused by the change of training data from IIT-B + HVG to only HVG, after the first phase of the training converges.

Upon evaluating the translations on the development set, we obtained the following scores for Hindi: a BLEU score of 46.7 when using the HVG + IIT-B training data, compared to 39.9 when using only the HVG training data (without IIT-B). For Malayalam, the BLEU score on the development set was 31.3. BLEU scores were computed using sacreBLEU (Post, 2018).
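For reference, corpus-level BLEU can be computed with the sacreBLEU Python API roughly as follows; the file names here are placeholders, not those used in our experiments.

```python
import sacrebleu

# hypotheses: one detokenized system translation per line; references: the dev references.
with open("dev.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("dev.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)  # e.g. 46.7 for EN-HI with HVG + IIT-B training data
```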
3.2 Image Caption Generation

This task in WAT 2021 is formulated as generating a caption in Hindi and in Malayalam for a specific region of a given image. Most existing research on image captioning generates a textual description for the entire image (Yang and Okazaki, 2020; Yang et al., 2017; Lindh et al., 2018; Staniūtė and Šešok, 2019; Miyazaki and Shimizu, 2016; Wu et al., 2017). However, a naive approach that uses only the specified region (as defined by the rectangular bounding box) as input to a generic image caption generation system often does not yield meaningful results. When a small image region with few objects is considered for captioning, it lacks the context (i.e., the overall understanding) around the region that can essentially be captured only from the entire image, as shown in Figure 1. It is challenging to generate the caption "snow" considering only the specific region (the red bounding box).

Figure 1: Sample image with a specific region and its description for caption generation. Image taken from Hindi Visual Genome (HVG) and Malayalam Visual Genome (MVG) (Parida et al., 2019). English text: "The snow is white." Hindi text: "बर्फ सफेद है". Malayalam text: "മഞ്ഞ് വെളുത്തതാണ്" (Gloss: Snow is white).

We propose a region-specific image captioning method based on the fusion of encoded features of the region and of the complete image. Our proposed model consists of three modules, an encoder, a fusion module, and a decoder, as shown in Figure 2.

Figure 2: Architecture of the proposed model for the region-specific image caption generator. The encoder module consists of a pre-trained image CNN used as a feature extractor, while an LSTM-based decoder generates captions. The two modules are connected by a fusion module.

Image Encoder: To textually describe an image or a region within it, it first needs to be encoded into high-level features that capture its visual attributes. Several image captioning works (Yang and Okazaki, 2020; Yang et al., 2017; Lindh et al., 2018; Staniūtė and Šešok, 2019; Miyazaki and Shimizu, 2016; Wu et al., 2017) have demonstrated that the outputs of the final or pre-final convolutional (conv) layers of deep CNNs are excellent features for this purpose. Along with the features of the entire image, we propose to extract the features of the subregion from the same set of conv-layer outputs. Let F ∈ R^{M×N×C} be the features of the final conv layer of a pre-trained image CNN, where C is the number of channels (feature maps) and M, N are the spatial dimensions of each feature map. From the dimensions of the input image and the values of M, N, we compute the spatial scaling factor. Through this factor and nominal interpolation, we obtain the corresponding location of the subregion in the conv layer, say with spatial dimensionality (m, n). This subset, F_s ∈ R^{m×n×C}, predominantly consists of features from the subregion. The subset F_s is obtained through region of interest (RoI) pooling (Girshick, 2015); we do not modify its channel dimension. The final features are linearized to form a single column vector, which we denote as the region-subset features S_feat. The features of the complete image are simply F. We apply spatial pooling on this feature set to reduce its dimensionality and obtain the linearized vector of full-image features, denoted I_feat.

Fusion Module: The region-level features capture details of the region (objects) to be described, whereas the image-level features provide the overall context. To generate meaningful captions for an image region, we consider the features of the region S_feat along with the features of the entire image I_feat. This combination of feature vectors is crucial for generating descriptions of the region. In this work, we propose to fuse them by concatenating weighted features from the region and from the entire image. The fused feature f can be written as f = [α S_feat ; (1 − α) I_feat], where α is a weighting parameter in [0.50, 1] indicating the relative importance given to the region features S_feat over the features of the whole image. For α = 0.66, the region-level features are weighted twice as high as the image-level features. The weighting of a feature vector scales its magnitude without altering its orientation. Unlike fusion mechanisms based on weighted addition, we do not modify the complex information captured by the features (except for scale); only their relative importance with respect to the other set of features is adjusted for better caption generation. The fused feature f, whose dimensionality is the sum of the dimensionalities of the two feature vectors, is then fed to the LSTM-based decoder.

LSTM Decoder: In the proposed approach, the encoder module is not trainable; it only extracts the image features. The LSTM decoder, in contrast, is trainable. We used an LSTM decoder over the fused image features with a greedy search approach for caption generation (Soh), and the cross-entropy loss when training the decoder (Yu et al., 2019).
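To make the encoder/fusion computation concrete, here is a minimal sketch: conv features F from a pre-trained CNN, RoI pooling over the box region to obtain S_feat, spatial pooling of the full map to obtain I_feat, and the weighted concatenation f = [α S_feat ; (1 − α) I_feat]. The ResNet-50 backbone, the pooled output size, and the use of torchvision's roi_pool are our own assumptions for illustration; only α (here 0.66, the example value above) comes from the paper.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Frozen pre-trained CNN used only as a feature extractor (ResNet-50 is an assumption;
# the paper just says "a pre-trained image CNN").
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

def fuse_region_and_image_features(image, box, alpha=0.66, pool_size=4):
    """image: (3, H, W) tensor; box: (x1, y1, x2, y2) in image coordinates.
    Returns f = [alpha * S_feat ; (1 - alpha) * I_feat]."""
    with torch.no_grad():
        F_map = feature_extractor(image.unsqueeze(0))            # (1, C, M, N)
    # Spatial scaling factor between the input image and the conv feature map
    # (ResNet downsamples both dimensions uniformly, so one scale suffices here).
    scale = F_map.size(-1) / image.size(-1)
    boxes = torch.tensor([[0.0, *box]])                          # (batch_idx, x1, y1, x2, y2)
    # RoI pooling (Girshick, 2015) over the subregion of the feature map, then linearize.
    S_feat = roi_pool(F_map, boxes, output_size=(pool_size, pool_size),
                      spatial_scale=scale).flatten(1)
    # Spatial pooling of the whole feature map to reduce dimensionality, then linearize.
    I_feat = torch.nn.functional.adaptive_avg_pool2d(F_map, pool_size).flatten(1)
    # Weighted concatenation; scaling changes magnitude only, not orientation.
    return torch.cat([alpha * S_feat, (1.0 - alpha) * I_feat], dim=1)
```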
3.3 Indic Multilingual Translation

Sharing parameters across multiple languages, particularly low-resource Indic languages, results in gains in translation performance (Dabre et al., 2020). Motivated by this finding, we train neural MT models with shared parameters across multiple languages for the Indic multilingual translation task. We additionally apply transfer learning, training a neural MT model in two phases (Kocmi and Bojar, 2018). The first phase consists of training a multilingual translation model on training pairs drawn from one of the following options: (a) any Indic language from the dataset as the source and the corresponding English target; (b) English as the source and any corresponding Indic language as the target; and (c) the combination of (a) and (b), that is, the model is trained to enable translation from any Indic language to English and also from English to any Indic language. The second phase involves fine-tuning the model obtained at the end of phase 1 using pairs from a single language pair. For phase 1, we used the PMI dataset for all the languages combined, whereas for phase 2 we used either only the PMI portion or all the bilingual data available for the desired language pair. In Table 2, the training data sizes used for phase 1 are denoted as Train (PMI).

To support multilinguality (i.e., going beyond a bilingual translation setup), we have to either fix the target language (many-to-one setup) or provide a language tag for controlling the generation process. We highlight below the four setups to achieve this.

Figure 3: Architecture for Indic Multilingual translation. We show here the setup in which both the source and the target language tags are used.

Many-to-one setup with no tag. In this setup, we use a Transformer model (Vaswani et al., 2017) without any architectural modification that would enable the model to explicitly distinguish between languages. In phase 1 of the training process, we concatenate across all Indic languages the pairs drawn from an Indic language as the source and the corresponding English target, and use the resulting data for training.

Many-to-one setup with source language tag. We use a Transformer model where the source language tag explicitly informs the model about the language of the source sentence, as in Lample and Conneau (2019). We provide the language information at every position by representing each source token as the sum of the token embedding, positional embedding, and language embedding, which is then fed to the encoder (see Figure 3 for the inputs to the encoder). The training data for phase 1 of the training process is the same as in the previous setup.

One-to-many setup with target language tag. This setup is based on a Transformer model where the target language embedding is injected into the decoder at every step and explicitly informs the model about the desired language of the target sentence (Lample and Conneau, 2019). In this setup, the source is always English. Similar to the previous setup, we represent each target token as the sum of the token embedding, positional embedding, and language embedding. Figure 3 shows the inputs to the decoder. In phase 1 of the training process, we concatenate across all Indic languages the pairs drawn from English as the source and the corresponding Indic language as the target, and use the resulting data for training.

Many-to-many setup with both the source and target language tags. In this setup, we use a Transformer model where both the encoder and the decoder are informed about the source and target languages explicitly through a language embedding at every token (Lample and Conneau, 2019). For instance, the same model can be used for hi-en translation and also for en-hi translation. As shown in the architecture in Figure 3, the source token representation is computed as the sum of the token embedding, positional embedding, and source language embedding. Similarly, the target token representation is computed as the sum of the token embedding, positional embedding, and target language embedding. The source and the target token representations are provided to the encoder and decoder, respectively. The rest of the modules in the Transformer architecture are the same as in Vaswani et al. (2017). The training data for phase 1 of the training process is the combination of the training datasets of the previous two setups.
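As an illustration of the tag-based input representation used in the setups above (token embedding + positional embedding + language embedding summed at every position), the following sketch shows one possible implementation; the class name, the learned positional embedding, and the LANG2ID mapping in the usage comment are our own assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TaggedEmbedding(nn.Module):
    """Token representation = token embedding + positional embedding + language embedding,
    applied to the source side (src tag), the target side (trg tag), or both."""
    def __init__(self, vocab_size, num_languages, d_model=128, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)       # learned positions (an assumption)
        self.lang = nn.Embedding(num_languages, d_model)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: index of the sentence's language
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids)
                + self.pos(positions)[None, :, :]
                + self.lang(torch.as_tensor(lang_id, device=token_ids.device)))

# Usage sketch, e.g. embedding a Hindi source sentence in the "source tag" setup:
# src_repr = TaggedEmbedding(32000, num_languages=11)(src_ids, lang_id=LANG2ID["hi"])
```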
In all four setups described above, the training data for phase 2 is the bilingual data corresponding to the desired language pair. The bilingual data is either the PMI training data or all the available bilingual training data; the sizes of both are provided in Table 2.

We now outline the training details common to all setups. We first trained SentencePiece BPE tokenization (Kudo and Richardson, 2018) with a maximum vocabulary size of 32k.[9] The vocabulary was learnt jointly on all the source and target sentence pairs. The number of encoder and decoder layers was set to 3 each, and the number of attention heads to 8. We used a hidden size of 128 and a dropout rate of 0.1. We initialized the model parameters using Xavier initialization (Glorot and Bengio, 2010) and optimized them with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 5e-4. Gradients with a norm greater than 1 were clipped. Training was stopped when the development loss did not improve for 5 consecutive epochs. The same early stopping criterion was followed for both phase 1 and phase 2 of the training process. For phase 1, we used the combination of the development data of all the language pairs in the training data, whereas for phase 2 we used only the development data of the desired language pair. For generating translations, we used greedy decoding, picking the most likely token at each generation step. Generation proceeded token by token until the end-of-sentence token was produced or the maximum translation length of 100 was reached.

Figure 4 (plots; axes: epoch vs. log PPL): Training and development perplexity for (a) EN-HI and EN-ML translation training, and (b) Indic multilingual translation training in the various setups (only phase 1 training curves are shown).

To compare the training under the various setups related to the usage of language tags, we show the perplexity of the training and the development data in Figure 4b. The best (lowest) perplexity is obtained by using the target language tag. However, using the target language tag requires more epochs to converge, where convergence is determined by the early stopping criterion described above.

We show the development BLEU scores, computed using sacreBLEU (Post, 2018), in Table 3 for each language pair. The results indicate that the usage of language tags produces better translations overall. It may also be noted that using both the source and target language tags resulted in the highest development BLEU scores for 8 out of 10 Indic languages when translating into English. For translation from English into Indic languages, the target-language-tag setup performed best overall, obtaining the highest development BLEU scores for 9 out of 10 languages. We selected the best systems (20 in total) based on the dev BLEU scores for each language pair and used them to generate translations of the test inputs.

The choices of the hyperparameters that determine the model size, and of the training data for phase 1 of the training process, were made such that the per-epoch training time stays below an hour on a single GPU.

[9] BPE-based tokenization performed better than word-level tokenization using Indic tokenizers (Kunchukuttan, 2020).
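A sketch of the subword model training described above (BPE, 32k joint vocabulary over all source and target sentences) using the SentencePiece Python API; the input file name and model prefix are placeholders.

```python
import sentencepiece as spm

# Train a joint BPE model on the concatenation of all source and target training sentences.
spm.SentencePieceTrainer.train(
    input="all_train_sentences.txt",   # placeholder: one sentence per line, all languages
    model_prefix="indic_bpe",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
pieces = sp.encode("यह एक उदाहरण वाक्य है", out_type=str)  # subword segmentation of a sentence
```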
Table 3: Development BLEU scores for Indic multilingual translation in the various setups after phase 1 and phase 2 of the training process. For each setup, the three values give the Phase 1 score and the Phase 2 scores when fine-tuning on the PMI data only and on all available bilingual data (ALL). Scores are shown for each language pair separately.

Language pair | No tag (P1 / P2-PMI / P2-ALL) | Src. tag (P1 / P2-PMI / P2-ALL) | Trg. tag (P1 / P2-PMI / P2-ALL) | Src. & trg. tags (P1 / P2-PMI / P2-ALL)
bn-en | 11.8 / 12.1 / 11.5 | 12.9 / 13.2 / 11.7 | - | 14.1 / 14.7 / 11.7
gu-en | 17.7 / 17.8 / 24.4 | 19.4 / 19.3 / 24.9 | - | 22.7 / 23.1 / 23.1
hi-en | 18.7 / 19.6 / 25.6 | 21.3 / 21.6 / 26.0 | - | 25.1 / 25.7 / 26.2
kn-en | 14.5 / 15.1 / 16.5 | 16.6 / 16.8 / 15.5 | - | 18.7 / 19.5 / 17.0
ml-en | 12.2 / 12.6 / 12.2 | 13.6 / 13.4 / 12.3 | - | 15.4 / 15.9 / 12.4
mr-en | 13.3 / 12.9 / 16.1 | 14.9 / 15.1 / 17.0 | - | 16.6 / 17.2 / 17.3
or-en | 14.0 / 14.1 / 16.9 | 15.5 / 15.6 / 18.7 | - | 17.5 / 17.8 / 20.3
pa-en | 17.4 / 17.8 / 27.0 | 18.9 / 19.0 / 26.3 | - | 22.2 / 22.8 / 26.4
ta-en | 13.2 / 13.2 / 15.0 | 14.7 / 14.3 / 14.6 | - | 15.8 / 16.4 / 15.9
te-en | 14.4 / 14.5 / 16.5 | 15.6 / 16.3 / 16.8 | - | 16.9 / 17.9 / 16.7
en-bn | - | - | 6.2 / 6.5 / 4.6 | 5.6 / 5.9 / 4.4
en-gu | - | - | 18.4 / 19.9 / 18.8 | 16.9 / 18.4 / 18.5
en-hi | - | - | 22.4 / 24.5 / 24.7 | 20.6 / 23.2 / 24.2
en-kn | - | - | 12.6 / 13.4 / 10.6 | 10.9 / 12.6 / 9.8
en-ml | - | - | 3.9 / 4.4 / 2.6 | 3.6 / 4.0 / 2.0
en-mr | - | - | 10.2 / 11.2 / 10.4 | 8.8 / 10.6 / 10.1
en-or | - | - | 12.4 / 13.2 / 14.0 | 11.4 / 12.3 / 14.2
en-pa | - | - | 18.8 / 19.7 / 20.9 | 16.5 / 18.8 / 20.5
en-ta | - | - | 8.5 / 9.6 / 8.4 | 7.8 / 8.3 / 8.0
en-te | - | - | 2.2 / 2.9 / 2.4 | 2.0 / 2.6 / 2.9

We note that there is room for improvement in our results: (a) the model size in any of the setups described above can be increased to match the Transformer "big" model (Vaswani et al., 2017), and (b) all the available training data can be used for phase 1 of the training process instead of just the PMI data.

4 Results

Table 4 and Table 5 report the official automatic evaluation results of our models for all the tasks we participated in. We also provide the automatic evaluation score (BLEU) for the image captioning task, although it is not well suited for evaluating the quality of the generated captions. We therefore also provide some sample outputs in Table 6.

Table 4: WAT2021 automatic evaluation results for English→Hindi and English→Malayalam. Rows containing "TEXT" in the task label denote the text-only translation track; the remaining rows represent the image captioning track. For each task, we show the BLEU score of our system (NLPHut) and the score of the best competitor. Scores marked with '*' indicate the best performance in the track among all competitors.

System and WAT Task Label | NLPHut | Best Comp
English→Hindi MM Task:
MMEVTEXT21en-hi | 42.11 | 44.61
MMEVHI21en-hi | 1.30 | -
MMCHTEXT21en-hi | 43.29 | 53.54
MMCHHI21en-hi | 1.69 | -
English→Malayalam MM Task:
MMEVTEXT21en-ml | 34.83* | 30.49
MMEVHI21en-ml | 0.97 | -
MMCHTEXT21en-ml | 12.15 | 12.98
MMCHHI21en-ml | 0.99 | -

Table 5: WAT2021 automatic evaluation results for the Indic Multilingual task. For each task, we show the BLEU score of our system (NLPHut) and the score of the best competitor ('Best Comp').

WAT Task | From English: NLPHut | From English: Best Comp | Into English: NLPHut | Into English: Best Comp
INDIC21en-bn | 8.13 | 15.97 | 13.88 | 31.87
INDIC21en-hi | 25.37 | 38.65 | 24.55 | 46.93
INDIC21en-gu | 17.76 | 27.80 | 23.10 | 43.98
INDIC21en-ml | 4.57 | 15.49 | 15.47 | 38.38
INDIC21en-mr | 10.41 | 20.42 | 17.07 | 36.64
INDIC21en-ta | 7.68 | 14.43 | 15.40 | 36.13
INDIC21en-te | 4.88 | 16.85 | 16.48 | 39.80
INDIC21en-pa | 22.60 | 33.43 | 24.35 | 46.39
INDIC21en-or | 12.81 | 20.15 | 18.92 | 37.06
INDIC21en-kn | 11.84 | 21.30 | 17.72 | 40.34

5 Conclusions

In this system description paper, we presented our systems for the three WAT 2021 tasks in which we participated: (a) the English→Hindi Multimodal task, (b) the English→Malayalam Multimodal task, and (c) the Indic Multilingual translation task. As next steps, we plan to explore the Indic Multilingual translation task further by utilizing all the given data and by using additional resources for training. We are also working on improving the region-specific image captioning by fine-tuning the object detection model.
Table 6: Sample captions generated for the evaluation test set using the proposed method: the top four examples are Hindi captions and the bottom four are Malayalam captions.

Gold: एक लड़की टेनिस खेल रही है (Gloss: A girl is playing tennis) | Output: एक टेनिस रैकेट पकड़े हुए आदमी (Gloss: A man holding a tennis racket)
Gold: आदमी समुद्र में सर्फिंग (Gloss: Man surfing in ocean) | Output: पानी में एक व्यक्ति (Gloss: A man in the water)
Gold: एक कुत्ता कूदता है (Gloss: A dog is jumping) | Output: कुत्ता भाग रहा है (Gloss: A dog is running)
Gold: हेलमेट पहनना (Gloss: Wearing helmet) | Output: एक आदमी के सिर पर एक काला हेलमेट (Gloss: A black helmet on the head of a person)
Gold: തിളക്കമുള്ള പച്ച കൈറ്റ് (Gloss: Bright green kite) | Output: ആകാശത്ത് പറക്കുന്ന കൈറ്റ് (Gloss: Kite flying in the sky)
Gold: ഒരു വത്തിലെ ട്രാഫിക് ലൈറ്റ് (Gloss: Traffic light at a pole) | Output: ട്രാഫിക് ലൈറ്റ് ചുവപ്പ് തിളങ്ങുന്നു (Gloss: The traffic light glows red)
Gold: തൂങ്ങി കിടക്കുന്ന ഒരു കൂട്ടം വാഴപ്പഴം (Gloss: A bunch of hanging bananas) | Output: ഒരു കൂട്ടം വാഴപ്പഴം (Gloss: A bunch of bananas)
Gold: ചുമരിൽ ഒരു ഘടികാരം (Gloss: A clock on the wall) | Output: ചുമരിൽ ഒരു ചിത്രം (Gloss: A picture on the wall)

Acknowledgments

The authors Shantipriya Parida and Petr Motlicek were supported by the European Union's Horizon 2020 research and innovation program under grant agreement No. 833635 (project ROXANNE: Real-time network, text, and speaker analytics for combating organized crime, 2019-2022). Ondřej Bojar would like to acknowledge the support of grant 19-26934X (NEUREM3) of the Czech Science Foundation.

The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.

References

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A survey of multilingual neural machine translation. ACM Computing Surveys, 53(5).

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440-1448.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy. PMLR.

Barry Haddow and Faheem Kirefu. 2020. PMIndia: A collection of parallel corpora of languages of India. arXiv preprint arXiv:2001.09907.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244-252, Brussels, Belgium. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71, Brussels, Belgium. Association for Computational Linguistics.

Arun Kumar, Ryan Cotterell, Lluís Padró, and Antoni Oliver. 2017. Morphological analysis of the Dravidian language family. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 217-222.

Anoop Kunchukuttan. 2020. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi Parallel Corpus. arXiv preprint arXiv:1710.02855.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, and John D. Kelleher. 2018. Generating diverse and meaningful captions. In International Conference on Artificial Neural Networks, pages 176-187. Springer.

Takashi Miyazaki and Nobuyuki Shimizu. 2016. Cross-lingual image caption generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1780-1790.

Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation, Bangkok, Thailand. Association for Computational Linguistics.

Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, et al. 2020. Overview of the 7th workshop on Asian translation. In Proceedings of the 7th Workshop on Asian Translation, pages 1-44.

Shantipriya Parida and Ondřej Bojar. 2018. Translating short segments with NMT: A case study in English-to-Hindi. In 21st Annual Conference of the European Association for Machine Translation, page 229.

Shantipriya Parida, Ondřej Bojar, and Satya Ranjan Dash. 2019. Hindi Visual Genome: A dataset for multi-modal English to Hindi machine translation. Computación y Sistemas, 23(4).

Shantipriya Parida, Petr Motlicek, Amulya Ratna Dash, Satya Ranjan Dash, Debasish Kumar Mallick, Satya Prakash Biswal, Priyanka Pattnaik, Biranchi Narayan Nayak, and Ondřej Bojar. 2020. OdiaNLP's participation in WAT2020. In Proceedings of the 7th Workshop on Asian Translation, pages 103-108.

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):1-15.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.

Moses Soh. Learning CNN-LSTM architectures for image caption generation.

Raimonda Staniūtė and Dmitrij Šešok. 2019. A systematic literature review on image captioning. Applied Sciences, 9(10):2024.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. 2017. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1367-1381.

Zhishen Yang and Naoaki Okazaki. 2020. Image caption generation for news articles. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1941-1951.

Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, and Yongfeng Huang. 2017. Image captioning with object detection and localization. In International Conference on Image and Graphics, pages 109-118. Springer.

Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12):4467-4480.