NLPHut's Participation at WAT2021

Shantipriya Parida*, Subhadarshi Panda†, Ketan Kotwal*, Amulya Ratna Dash♣, Satya Ranjan Dash♠, Yashvardhan Sharma♣, Petr Motlicek*, Ondřej Bojar♢
* Idiap Research Institute, Martigny, Switzerland ({firstname.lastname}@idiap.ch)
† Graduate Center, City University of New York, USA (spanda@gradcenter.cuny.edu)
♣ Birla Institute of Technology and Science, Pilani, India ({p20200105,yash}@pilani.bits-pilani.ac.in)
♠ KIIT University, Bhubaneswar, India (sdashfca@kiit.ac.in)
♢ Charles University, MFF, ÚFAL, Prague, Czech Republic (bojar@ufal.mff.cuni.cz)

Abstract

This paper describes the submissions of our team "NLPHut" to the WAT 2021 shared tasks. We participated in the English→Hindi Multimodal translation task, the English→Malayalam Multimodal translation task, and the Indic Multilingual translation task. For the translation tasks, we used the state-of-the-art Transformer model with language tags in different settings, and for Hindi and Malayalam image captioning we propose a novel "region-specific" caption generation approach that combines an image CNN with an LSTM. Our submission tops the English→Malayalam Multimodal translation task (text-only translation and Malayalam caption) and ranks second-best in the English→Hindi Multimodal translation task (text-only translation and Hindi caption). Our submissions also performed well in the Indic Multilingual translation task.

1 Introduction

Machine translation (MT) is considered one of the most successful applications of natural language processing (NLP).[1] It has evolved significantly, especially in terms of the accuracy of its output. Although MT performance has reached near-human level for several language pairs (see e.g. Popel et al., 2020), it remains challenging for low-resource languages or for translation that effectively utilizes other modalities (e.g. images, Parida et al., 2020).

The Workshop on Asian Translation (WAT) is an open evaluation campaign focusing on Asian languages since 2013 (Nakazawa et al., 2020). In the WAT2021 (Nakazawa et al., 2021) Multimodal track, a new Indian language, Malayalam, was introduced for English→Malayalam text translation, multimodal translation, and Malayalam image captioning.[2] This year, the MultiIndic task[3] covers 10 Indic languages and English.

In this system description paper, we explain our approach for the tasks (including the subtasks) we participated in:

Task 1: English→Hindi (EN-HI) Multimodal Translation
• EN-HI text-only translation
• Hindi-only image captioning

Task 2: English→Malayalam (EN-ML) Multimodal Translation
• EN-ML text-only translation
• Malayalam-only image captioning

Task 3: Indic Multilingual translation task

Section 2 describes the datasets used in our experiments. Section 3 presents the model and experimental setups used in our approach. Section 4 provides the official WAT2021 evaluation results,[4] followed by the conclusion in Section 5.

[1] https://morioh.com/p/d596d2d4444d
[2] https://ufal.mff.cuni.cz/malayalam-visual-genome/wat2021-english-malayalam-multi
[3] http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/
[4] http://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2021/index.html
2 Dataset

We used the official datasets provided by the WAT2021 organizers for all tasks.

Task 1: English→Hindi Multimodal Translation. For this task, the organizers provided the Hindi Visual Genome 1.1 dataset (HVG for short; Parida et al., 2019).[5] The training part consists of 29k English and Hindi short captions of rectangular regions in photos of various scenes, and it is complemented by three test sets: the development (D-Test), evaluation (E-Test), and challenge test sets (C-Test). Our WAT submissions were for E-Test (denoted "EV" in the WAT official tables) and C-Test (denoted "CH" in the WAT tables). Additionally, we used the IITB Corpus,[6] which is reportedly the largest publicly available English-Hindi parallel corpus (Kunchukuttan et al., 2017). This corpus contains 1.59 million parallel segments and has been found very effective for English-Hindi translation (Parida and Bojar, 2018). The statistics of the datasets are shown in Table 1.

Table 1: Statistics of our data used in the English→Hindi and English→Malayalam Multimodal tasks: the number of sentences and tokens.

Set | Sentences | Tokens (English) | Tokens (Hindi) | Tokens (Malayalam)
Train | 28930 | 143164 | 145448 | 107126
D-Test | 998 | 4922 | 4978 | 3619
E-Test | 1595 | 7853 | 7852 | 5689
C-Test | 1400 | 8186 | 8639 | 6044
IITB Train | 1.5 M | 20.6 M | 22.1 M | –

Task 2: English→Malayalam Multimodal Translation. For this task, the organizers provided the Malayalam Visual Genome 1.0 dataset[7] (MVG for short). MVG is an extension of the HVG dataset supporting Malayalam, which belongs to the Dravidian language family (Kumar et al., 2017). The dataset size and images are the same as in HVG. While HVG contains bilingual (English and Hindi) segments, MVG contains bilingual (English and Malayalam) segments, with the English shared across HVG and MVG, see Table 1.

Task 3: Indic Multilingual Translation. For this task, the organizers provided a training corpus that comprises in total 11 million sentence pairs collected from several corpora. The evaluation sets (dev and test) contain filtered data from the PMIndia dataset (Haddow and Kirefu, 2020).[8] We did not use any additional resources in this task. The statistics of the dataset are shown in Table 2.

[5] https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3267
[6] http://www.cfilt.iitb.ac.in/iitb_parallel/
[7] https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3533
[8] http://data.statmt.org/pmindia/

3 Experimental Details

This section describes the experimental details of the tasks we participated in.

3.1 EN-HI and EN-ML text-only translation

For the HVG text-only translation track, we train a Transformer model (Vaswani et al., 2017) on the concatenation of the IIT-B training data and the HVG training data (see Table 1). Similar to the two-phase approach outlined in Section 3.3, we then continue the training using only the HVG training data to obtain the final checkpoint. For the MVG text-only translation track, we train a Transformer model using only the MVG training data.

For both EN-HI and EN-ML translation, we trained SentencePiece subword units (Kudo and Richardson, 2018) with a maximum vocabulary size of 8k. The vocabulary was learned jointly on the source and target sentences of HVG and IIT-B for EN-HI, and of MVG for EN-ML. The number of encoder and decoder layers was set to 3 each, and the number of attention heads to 8. We set the hidden size to 128 and the dropout to 0.1. We initialized the model parameters with Xavier initialization (Glorot and Bengio, 2010) and optimized them with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 5e-4. Gradients with a norm greater than 1 were clipped. Training was stopped when the development loss did not improve for 5 consecutive epochs. For both the EN-HI training on the concatenated IIT-B + HVG data and the subsequent training on only the HVG data, we used the same HVG dev set for determining early stopping. For generating translations, we used greedy decoding and generated tokens autoregressively until the end-of-sentence token was produced or the maximum translation length (set to 100) was reached.
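The paper does not name the toolkit used for these experiments, so the following minimal PyTorch sketch only illustrates the reported configuration (3 encoder/decoder layers, 8 heads, hidden size 128, dropout 0.1, Xavier initialization, Adam with learning rate 5e-4, gradient clipping at 1, early stopping with patience 5). All class and variable names are our own; attention masks and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Small Transformer matching the hyperparameters reported in Section 3.1."""
    def __init__(self, vocab_size=8000, d_model=128, nhead=8, num_layers=3, dropout=0.1):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)
        # Xavier initialization of all weight matrices.
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, src, tgt):
        h = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(h)

def train(model, train_batches, dev_batches, patience=5, lr=5e-4, max_epochs=100):
    """Adam, gradient clipping at norm 1, early stopping on dev loss (patience 5)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_dev, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for src, tgt in train_batches:
            optimizer.zero_grad()
            logits = model(src, tgt[:, :-1])                      # teacher forcing
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        model.eval()
        with torch.no_grad():
            dev_loss = sum(
                criterion(model(s, t[:, :-1]).reshape(-1, model.out.out_features),
                          t[:, 1:].reshape(-1)).item()
                for s, t in dev_batches) / len(dev_batches)
        if dev_loss < best_dev:
            best_dev, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no dev improvement for 5 consecutive epochs
    return model
```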
Table 2: Statistics of the data used for Indic multilingual translation.

Language pair | en-bn | en-hi | en-gu | en-ml | en-mr | en-ta | en-te | en-pa | en-or | en-kn
Train (ALL) | 1756197 | 3534387 | 518015 | 1204503 | 781872 | 1499441 | 686626 | 518508 | 252160 | 396865
Train (PMI) | 23306 | 50349 | 41578 | 26916 | 28974 | 32638 | 33380 | 28294 | 31966 | 28901
Dev | 1000
Test | 2390

We show the training and development perplexities for EN-HI and EN-ML translation in Figure 4a. The dev perplexity for EN-HI translation is lower at the beginning (after epoch 1) because the model is trained on more training samples (IIT-B + HVG) than for EN-ML. Overall, EN-HI training takes around twice as much time as EN-ML training, again due to the involvement of the bigger IIT-B training data. The drop in perplexity midway through EN-HI training is caused by the change of training data from IIT-B + HVG to only HVG, after the first phase of the training converges.

Upon evaluating the translations on the development set, we obtained the following scores for Hindi: a BLEU score of 46.7 when using the HVG + IIT-B training data, compared to 39.9 when using only the HVG training data (without IIT-B). For Malayalam, the BLEU score on the development set was 31.3. BLEU scores were computed using sacreBLEU (Post, 2018).
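For reference, corpus-level BLEU can be computed with the sacreBLEU Python API roughly as follows; the file names here are placeholders, not those used in our experiments.

```python
import sacrebleu

# hypotheses: one detokenized system translation per line; references: the dev references.
with open("dev.hyp", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("dev.ref", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)  # e.g. 46.7 for EN-HI with HVG + IIT-B training data
```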
3.2 Image Caption Generation

This task in WAT 2021 is formulated as generating a caption in Hindi and in Malayalam for a specific region of a given image. Most existing research on image captioning generates a textual description for the entire image (Yang and Okazaki, 2020; Yang et al., 2017; Lindh et al., 2018; Staniūtė and Šešok, 2019; Miyazaki and Shimizu, 2016; Wu et al., 2017). However, a naive approach that uses only the specified region (as defined by the rectangular bounding box) as input to a generic image caption generation system often does not yield meaningful results. When a small image region with few objects is considered for captioning, it lacks the context (i.e., the overall understanding) around the region that can essentially be captured only from the entire image, as shown in Figure 1. It is challenging to generate the caption "snow" considering only the specific region (the red bounding box).

Figure 1: Sample image with a specific region and its description for caption generation. Image taken from Hindi Visual Genome (HVG) and Malayalam Visual Genome (MVG) (Parida et al., 2019). English text: "The snow is white." Hindi text: "बर्फ सफेद है". Malayalam text: "മഞ്ഞ് വെളുത്തതാണ്" (Gloss: Snow is white).

We propose a region-specific image captioning method based on the fusion of encoded features of the region and of the complete image. Our proposed model consists of three modules, an encoder, a fusion module, and a decoder, as shown in Figure 2.

Figure 2: Architecture of the proposed model for the region-specific image caption generator. The encoder module consists of a pre-trained image CNN used as a feature extractor, while an LSTM-based decoder generates captions. The two modules are connected by a fusion module.

Image Encoder: To textually describe an image or a region within it, it first needs to be encoded into high-level features that capture its visual attributes. Several image captioning works (Yang and Okazaki, 2020; Yang et al., 2017; Lindh et al., 2018; Staniūtė and Šešok, 2019; Miyazaki and Shimizu, 2016; Wu et al., 2017) have demonstrated that the outputs of the final or pre-final convolutional (conv) layers of deep CNNs are excellent features for this purpose. Along with the features of the entire image, we propose to extract the features of the subregion from the same set of conv-layer outputs. Let F ∈ R^{M×N×C} be the features of the final conv layer of a pre-trained image CNN, where C is the number of channels (feature maps) and M, N are the spatial dimensions of each feature map. From the dimensions of the input image and the values of M, N, we compute the spatial scaling factor. Through this factor and nominal interpolation, we obtain the corresponding location of the subregion in the conv layer, say with spatial dimensionality (m, n). This subset, F_s ∈ R^{m×n×C}, predominantly consists of features from the subregion. The subset F_s is obtained through region of interest (RoI) pooling (Girshick, 2015); we do not modify its channel dimension. The final features are linearized to form a single column vector, which we denote as the region-subset features S_feat. The features of the complete image are simply F. We apply spatial pooling on this feature set to reduce its dimensionality and obtain the linearized vector of full-image features, denoted I_feat.

Fusion Module: The region-level features capture details of the region (objects) to be described, whereas the image-level features provide the overall context. To generate meaningful captions for an image region, we consider the features of the region S_feat along with the features of the entire image I_feat. This combination of feature vectors is crucial for generating descriptions of the region. In this work, we propose to fuse them by concatenating weighted features from the region and from the entire image. The fused feature f can be written as f = [α S_feat ; (1 − α) I_feat], where α is a weighting parameter in [0.50, 1] indicating the relative importance given to the region features S_feat over the features of the whole image. For α = 0.66, the region-level features are weighted twice as high as the image-level features. The weighting of a feature vector scales its magnitude without altering its orientation. Unlike fusion mechanisms based on weighted addition, we do not modify the complex information captured by the features (except for scale); only their relative importance with respect to the other set of features is adjusted for better caption generation. The fused feature f, whose dimensionality is the sum of the dimensionalities of the two feature vectors, is then fed to the LSTM-based decoder.

LSTM Decoder: In the proposed approach, the encoder module is not trainable; it only extracts the image features. The LSTM decoder, in contrast, is trainable. We used an LSTM decoder over the fused image features with a greedy search approach for caption generation (Soh), and the cross-entropy loss when training the decoder (Yu et al., 2019).
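To make the encoder/fusion computation concrete, here is a minimal sketch: conv features F from a pre-trained CNN, RoI pooling over the box region to obtain S_feat, spatial pooling of the full map to obtain I_feat, and the weighted concatenation f = [α S_feat ; (1 − α) I_feat]. The ResNet-50 backbone, the pooled output size, and the use of torchvision's roi_pool are our own assumptions for illustration; only α (here 0.66, the example value above) comes from the paper.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Frozen pre-trained CNN used only as a feature extractor (ResNet-50 is an assumption;
# the paper just says "a pre-trained image CNN").
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

def fuse_region_and_image_features(image, box, alpha=0.66, pool_size=4):
    """image: (3, H, W) tensor; box: (x1, y1, x2, y2) in image coordinates.
    Returns f = [alpha * S_feat ; (1 - alpha) * I_feat]."""
    with torch.no_grad():
        F_map = feature_extractor(image.unsqueeze(0))            # (1, C, M, N)
    # Spatial scaling factor between the input image and the conv feature map
    # (ResNet downsamples both dimensions uniformly, so one scale suffices here).
    scale = F_map.size(-1) / image.size(-1)
    boxes = torch.tensor([[0.0, *box]])                          # (batch_idx, x1, y1, x2, y2)
    # RoI pooling (Girshick, 2015) over the subregion of the feature map, then linearize.
    S_feat = roi_pool(F_map, boxes, output_size=(pool_size, pool_size),
                      spatial_scale=scale).flatten(1)
    # Spatial pooling of the whole feature map to reduce dimensionality, then linearize.
    I_feat = torch.nn.functional.adaptive_avg_pool2d(F_map, pool_size).flatten(1)
    # Weighted concatenation; scaling changes magnitude only, not orientation.
    return torch.cat([alpha * S_feat, (1.0 - alpha) * I_feat], dim=1)
```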
3.3 Indic Multilingual Translation

Sharing parameters across multiple languages, particularly low-resource Indic languages, results in gains in translation performance (Dabre et al., 2020). Motivated by this finding, we train neural MT models with shared parameters across multiple languages for the Indic multilingual translation task. We additionally apply transfer learning, training a neural MT model in two phases (Kocmi and Bojar, 2018). The first phase consists of training a multilingual translation model on training pairs drawn from one of the following options: (a) any Indic language from the dataset as the source and the corresponding English target; (b) English as the source and any corresponding Indic language as the target; and (c) the combination of (a) and (b), that is, the model is trained to enable translation from any Indic language to English and also from English to any Indic language. The second phase involves fine-tuning the model obtained at the end of phase 1 using pairs from a single language pair. For phase 1, we used the PMI dataset for all the languages combined, whereas for phase 2 we used either only the PMI portion or all the bilingual data available for the desired language pair. In Table 2, the training data sizes used for phase 1 are denoted as Train (PMI).

To support multilinguality (i.e., going beyond a bilingual translation setup), we have to either fix the target language (many-to-one setup) or provide a language tag for controlling the generation process. We highlight below the four setups to achieve this.

Figure 3: Architecture for Indic Multilingual translation. We show here the setup in which both the source and the target language tags are used.

Many-to-one setup with no tag. In this setup, we use a Transformer model (Vaswani et al., 2017) without any architectural modification that would enable the model to explicitly distinguish between languages. In phase 1 of the training process, we concatenate across all Indic languages the pairs drawn from an Indic language as the source and the corresponding English target, and use the resulting data for training.

Many-to-one setup with source language tag. We use a Transformer model where the source language tag explicitly informs the model about the language of the source sentence, as in Lample and Conneau (2019). We provide the language information at every position by representing each source token as the sum of the token embedding, positional embedding, and language embedding, which is then fed to the encoder (see Figure 3 for the inputs to the encoder). The training data for phase 1 of the training process is the same as in the previous setup.

One-to-many setup with target language tag. This setup is based on a Transformer model where the target language embedding is injected into the decoder at every step and explicitly informs the model about the desired language of the target sentence (Lample and Conneau, 2019). In this setup, the source is always English. Similar to the previous setup, we represent each target token as the sum of the token embedding, positional embedding, and language embedding. Figure 3 shows the inputs to the decoder. In phase 1 of the training process, we concatenate across all Indic languages the pairs drawn from English as the source and the corresponding Indic language as the target, and use the resulting data for training.

Many-to-many setup with both the source and target language tags. In this setup, we use a Transformer model where both the encoder and the decoder are informed about the source and target languages explicitly through a language embedding at every token (Lample and Conneau, 2019). For instance, the same model can be used for hi-en translation and also for en-hi translation. As shown in the architecture in Figure 3, the source token representation is computed as the sum of the token embedding, positional embedding, and source language embedding. Similarly, the target token representation is computed as the sum of the token embedding, positional embedding, and target language embedding. The source and the target token representations are provided to the encoder and decoder, respectively. The rest of the modules in the Transformer architecture are the same as in Vaswani et al. (2017). The training data for phase 1 of the training process is the combination of the training datasets of the previous two setups.
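As an illustration of the tag-based input representation used in the setups above (token embedding + positional embedding + language embedding summed at every position), the following sketch shows one possible implementation; the class name, the learned positional embedding, and the LANG2ID mapping in the usage comment are our own assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TaggedEmbedding(nn.Module):
    """Token representation = token embedding + positional embedding + language embedding,
    applied to the source side (src tag), the target side (trg tag), or both."""
    def __init__(self, vocab_size, num_languages, d_model=128, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)       # learned positions (an assumption)
        self.lang = nn.Embedding(num_languages, d_model)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: index of the sentence's language
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok(token_ids)
                + self.pos(positions)[None, :, :]
                + self.lang(torch.as_tensor(lang_id, device=token_ids.device)))

# Usage sketch, e.g. embedding a Hindi source sentence in the "source tag" setup:
# src_repr = TaggedEmbedding(32000, num_languages=11)(src_ids, lang_id=LANG2ID["hi"])
```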
In all four setups described above, the training data for phase 2 is the bilingual data corresponding to the desired language pair. The bilingual data is either the PMI training data or all the available bilingual training data; the sizes of both are provided in Table 2.

We now outline the training details common to all setups. We first trained SentencePiece BPE tokenization (Kudo and Richardson, 2018) with a maximum vocabulary size of 32k.[9] The vocabulary was learnt jointly on all the source and target sentence pairs. The number of encoder and decoder layers was set to 3 each, and the number of attention heads to 8. We used a hidden size of 128 and a dropout rate of 0.1. We initialized the model parameters using Xavier initialization (Glorot and Bengio, 2010) and optimized them with the Adam optimizer (Kingma and Ba, 2014) using a learning rate of 5e-4. Gradients with a norm greater than 1 were clipped. Training was stopped when the development loss did not improve for 5 consecutive epochs. The same early stopping criterion was followed for both phase 1 and phase 2 of the training process. For phase 1, we used the combination of the development data of all the language pairs in the training data, whereas for phase 2 we used only the development data of the desired language pair. For generating translations, we used greedy decoding, picking the most likely token at each generation step. Generation proceeded token by token until the end-of-sentence token was produced or the maximum translation length of 100 was reached.

Figure 4 (plots; axes: epoch vs. log PPL): Training and development perplexity for (a) EN-HI and EN-ML translation training, and (b) Indic multilingual translation training in the various setups (only phase 1 training curves are shown).

To compare the training under the various setups related to the usage of language tags, we show the perplexity of the training and the development data in Figure 4b. The best (lowest) perplexity is obtained by using the target language tag. However, using the target language tag requires more epochs to converge, where convergence is determined by the early stopping criterion described above.

We show the development BLEU scores, computed using sacreBLEU (Post, 2018), in Table 3 for each language pair. The results indicate that the usage of language tags produces better translations overall. It may also be noted that using both the source and target language tags resulted in the highest development BLEU scores for 8 out of 10 Indic languages when translating into English. For translation from English into Indic languages, the target-language-tag setup performed best overall, obtaining the highest development BLEU scores for 9 out of 10 languages. We selected the best systems (20 in total) based on the dev BLEU scores for each language pair and used them to generate translations of the test inputs.

The choices of the hyperparameters that determine the model size, and of the training data for phase 1 of the training process, were made such that the per-epoch training time stays below an hour on a single GPU.

[9] BPE-based tokenization performed better than word-level tokenization using Indic tokenizers (Kunchukuttan, 2020).
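A sketch of the subword model training described above (BPE, 32k joint vocabulary over all source and target sentences) using the SentencePiece Python API; the input file name and model prefix are placeholders.

```python
import sentencepiece as spm

# Train a joint BPE model on the concatenation of all source and target training sentences.
spm.SentencePieceTrainer.train(
    input="all_train_sentences.txt",   # placeholder: one sentence per line, all languages
    model_prefix="indic_bpe",
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
pieces = sp.encode("यह एक उदाहरण वाक्य है", out_type=str)  # subword segmentation of a sentence
```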
Table 3: Development BLEU scores for Indic multilingual translation in the various setups after phase 1 and phase 2 of the training process. For each setup, the three values give the Phase 1 score and the Phase 2 scores when fine-tuning on the PMI data only and on all available bilingual data (ALL). Scores are shown for each language pair separately.

Language pair | No tag (P1 / P2-PMI / P2-ALL) | Src. tag (P1 / P2-PMI / P2-ALL) | Trg. tag (P1 / P2-PMI / P2-ALL) | Src. & trg. tags (P1 / P2-PMI / P2-ALL)
bn-en | 11.8 / 12.1 / 11.5 | 12.9 / 13.2 / 11.7 | - | 14.1 / 14.7 / 11.7
gu-en | 17.7 / 17.8 / 24.4 | 19.4 / 19.3 / 24.9 | - | 22.7 / 23.1 / 23.1
hi-en | 18.7 / 19.6 / 25.6 | 21.3 / 21.6 / 26.0 | - | 25.1 / 25.7 / 26.2
kn-en | 14.5 / 15.1 / 16.5 | 16.6 / 16.8 / 15.5 | - | 18.7 / 19.5 / 17.0
ml-en | 12.2 / 12.6 / 12.2 | 13.6 / 13.4 / 12.3 | - | 15.4 / 15.9 / 12.4
mr-en | 13.3 / 12.9 / 16.1 | 14.9 / 15.1 / 17.0 | - | 16.6 / 17.2 / 17.3
or-en | 14.0 / 14.1 / 16.9 | 15.5 / 15.6 / 18.7 | - | 17.5 / 17.8 / 20.3
pa-en | 17.4 / 17.8 / 27.0 | 18.9 / 19.0 / 26.3 | - | 22.2 / 22.8 / 26.4
ta-en | 13.2 / 13.2 / 15.0 | 14.7 / 14.3 / 14.6 | - | 15.8 / 16.4 / 15.9
te-en | 14.4 / 14.5 / 16.5 | 15.6 / 16.3 / 16.8 | - | 16.9 / 17.9 / 16.7
en-bn | - | - | 6.2 / 6.5 / 4.6 | 5.6 / 5.9 / 4.4
en-gu | - | - | 18.4 / 19.9 / 18.8 | 16.9 / 18.4 / 18.5
en-hi | - | - | 22.4 / 24.5 / 24.7 | 20.6 / 23.2 / 24.2
en-kn | - | - | 12.6 / 13.4 / 10.6 | 10.9 / 12.6 / 9.8
en-ml | - | - | 3.9 / 4.4 / 2.6 | 3.6 / 4.0 / 2.0
en-mr | - | - | 10.2 / 11.2 / 10.4 | 8.8 / 10.6 / 10.1
en-or | - | - | 12.4 / 13.2 / 14.0 | 11.4 / 12.3 / 14.2
en-pa | - | - | 18.8 / 19.7 / 20.9 | 16.5 / 18.8 / 20.5
en-ta | - | - | 8.5 / 9.6 / 8.4 | 7.8 / 8.3 / 8.0
en-te | - | - | 2.2 / 2.9 / 2.4 | 2.0 / 2.6 / 2.9

We note that there is room for improvement in our results: (a) the model size in any of the setups described above can be increased to match the Transformer "big" model (Vaswani et al., 2017), and (b) all the available training data can be used for phase 1 of the training process instead of just the PMI data.

4 Results

Table 4 and Table 5 report the official automatic evaluation results of our models for all the tasks we participated in. We also provide the automatic evaluation score (BLEU) for the image captioning task, although it is not well suited for evaluating the quality of the generated captions. We therefore also provide some sample outputs in Table 6.

Table 4: WAT2021 automatic evaluation results for English→Hindi and English→Malayalam. Rows containing "TEXT" in the task label denote the text-only translation track; the remaining rows represent the image captioning track. For each task, we show the BLEU score of our system (NLPHut) and the score of the best competitor. Scores marked with '*' indicate the best performance in the track among all competitors.

System and WAT Task Label | NLPHut | Best Comp
English→Hindi MM Task:
MMEVTEXT21en-hi | 42.11 | 44.61
MMEVHI21en-hi | 1.30 | -
MMCHTEXT21en-hi | 43.29 | 53.54
MMCHHI21en-hi | 1.69 | -
English→Malayalam MM Task:
MMEVTEXT21en-ml | 34.83* | 30.49
MMEVHI21en-ml | 0.97 | -
MMCHTEXT21en-ml | 12.15 | 12.98
MMCHHI21en-ml | 0.99 | -

Table 5: WAT2021 automatic evaluation results for the Indic Multilingual task. For each task, we show the BLEU score of our system (NLPHut) and the score of the best competitor ('Best Comp').

WAT Task | From English: NLPHut | From English: Best Comp | Into English: NLPHut | Into English: Best Comp
INDIC21en-bn | 8.13 | 15.97 | 13.88 | 31.87
INDIC21en-hi | 25.37 | 38.65 | 24.55 | 46.93
INDIC21en-gu | 17.76 | 27.80 | 23.10 | 43.98
INDIC21en-ml | 4.57 | 15.49 | 15.47 | 38.38
INDIC21en-mr | 10.41 | 20.42 | 17.07 | 36.64
INDIC21en-ta | 7.68 | 14.43 | 15.40 | 36.13
INDIC21en-te | 4.88 | 16.85 | 16.48 | 39.80
INDIC21en-pa | 22.60 | 33.43 | 24.35 | 46.39
INDIC21en-or | 12.81 | 20.15 | 18.92 | 37.06
INDIC21en-kn | 11.84 | 21.30 | 17.72 | 40.34

5 Conclusions

In this system description paper, we presented our systems for the three WAT 2021 tasks in which we participated: (a) the English→Hindi Multimodal task, (b) the English→Malayalam Multimodal task, and (c) the Indic Multilingual translation task. As next steps, we plan to explore the Indic Multilingual translation task further by utilizing all the given data and by using additional resources for training. We are also working on improving the region-specific image captioning by fine-tuning the object detection model.
Table 6: Sample captions generated for the evaluation test set using the proposed method: the top four examples are Hindi captions and the bottom four are Malayalam captions.

Gold: एक लड़की टेनिस खेल रही है (Gloss: A girl is playing tennis) | Output: एक टेनिस रैकेट पकड़े हुए आदमी (Gloss: A man holding a tennis racket)
Gold: आदमी समुद्र में सर्फिंग (Gloss: Man surfing in ocean) | Output: पानी में एक व्यक्ति (Gloss: A man in the water)
Gold: एक कुत्ता कूदता है (Gloss: A dog is jumping) | Output: कुत्ता भाग रहा है (Gloss: A dog is running)
Gold: हेलमेट पहनना (Gloss: Wearing helmet) | Output: एक आदमी के सिर पर एक काला हेलमेट (Gloss: A black helmet on the head of a person)
Gold: തിളക്കമുള്ള പച്ച കൈറ്റ് (Gloss: Bright green kite) | Output: ആകാശത്ത് പറക്കുന്ന കൈറ്റ് (Gloss: Kite flying in the sky)
Gold: ഒരു വത്തിലെ ട്രാഫിക് ലൈറ്റ് (Gloss: Traffic light at a pole) | Output: ട്രാഫിക് ലൈറ്റ് ചുവപ്പ് തിളങ്ങുന്നു (Gloss: The traffic light glows red)
Gold: തൂങ്ങി കിടക്കുന്ന ഒരു കൂട്ടം വാഴപ്പഴം (Gloss: A bunch of hanging bananas) | Output: ഒരു കൂട്ടം വാഴപ്പഴം (Gloss: A bunch of bananas)
Gold: ചുമരിൽ ഒരു ഘടികാരം (Gloss: A clock on the wall) | Output: ചുമരിൽ ഒരു ചിത്രം (Gloss: A picture on the wall)

Acknowledgments

The authors Shantipriya Parida and Petr Motlicek were supported by the European Union's Horizon 2020 research and innovation program under grant agreement No. 833635 (project ROXANNE: Real-time network, text, and speaker analytics for combating organized crime, 2019-2022). Ondřej Bojar would like to acknowledge the support of grant 19-26934X (NEUREM3) of the Czech Science Foundation.

The authors do not see any significant ethical or privacy concerns that would prevent the processing of the data used in the study. The datasets do contain personal data, and these are processed in compliance with the GDPR and national law.

References

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A survey of multilingual neural machine translation. ACM Computing Surveys, 53(5).

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440-1448.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249-256, Chia Laguna Resort, Sardinia, Italy. PMLR.

Barry Haddow and Faheem Kirefu. 2020. PMIndia: A collection of parallel corpora of languages of India. arXiv preprint arXiv:2001.09907.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 244-252, Brussels, Belgium. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71, Brussels, Belgium. Association for Computational Linguistics.

Arun Kumar, Ryan Cotterell, Lluís Padró, and Antoni Oliver. 2017. Morphological analysis of the Dravidian language family. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 217-222.

Anoop Kunchukuttan. 2020. The IndicNLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2017. The IIT Bombay English-Hindi Parallel Corpus. arXiv preprint arXiv:1710.02855.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.

Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, and John D. Kelleher. 2018. Generating diverse and meaningful captions. In International Conference on Artificial Neural Networks, pages 176-187. Springer.

Takashi Miyazaki and Nobuyuki Shimizu. 2016. Cross-lingual image caption generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1780-1790.

Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, and Sadao Kurohashi. 2021. Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation, Bangkok, Thailand. Association for Computational Linguistics.

Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, et al. 2020. Overview of the 7th workshop on Asian translation. In Proceedings of the 7th Workshop on Asian Translation, pages 1-44.

Shantipriya Parida and Ondřej Bojar. 2018. Translating short segments with NMT: A case study in English-to-Hindi. In 21st Annual Conference of the European Association for Machine Translation, page 229.

Shantipriya Parida, Ondřej Bojar, and Satya Ranjan Dash. 2019. Hindi Visual Genome: A dataset for multi-modal English to Hindi machine translation. Computación y Sistemas, 23(4).

Shantipriya Parida, Petr Motlicek, Amulya Ratna Dash, Satya Ranjan Dash, Debasish Kumar Mallick, Satya Prakash Biswal, Priyanka Pattnaik, Biranchi Narayan Nayak, and Ondřej Bojar. 2020. OdiaNLP's participation in WAT2020. In Proceedings of the 7th Workshop on Asian Translation, pages 103-108.

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):1-15.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186-191, Brussels, Belgium. Association for Computational Linguistics.

Moses Soh. Learning CNN-LSTM architectures for image caption generation.

Raimonda Staniūtė and Dmitrij Šešok. 2019. A systematic literature review on image captioning. Applied Sciences, 9(10):2024.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. 2017. Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1367-1381.

Zhishen Yang and Naoaki Okazaki. 2020. Image caption generation for news articles. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1941-1951.

Zhongliang Yang, Yu-Jin Zhang, Sadaqat ur Rehman, and Yongfeng Huang. 2017. Image captioning with object detection and localization. In International Conference on Image and Graphics, pages 109-118. Springer.

Jun Yu, Jing Li, Zhou Yu, and Qingming Huang. 2019. Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12):4467-4480.