"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks


Mohammad Sadegh Rasooli¹∗, Chris Callison-Burch², Derry Tanti Wijaya³
¹ Microsoft
² Department of Computer and Information Science, University of Pennsylvania
³ Department of Computer Science, Boston University
mrasooli@microsoft.com, ccb@seas.upenn.edu, wijaya@bu.edu

Abstract

We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our “wikily” supervised translation models to unsupervised image captioning, and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a translated version of the English captioning data, using our wikily-supervised translation models. Our captioning results on Arabic are slightly better than those of its supervised counterpart. In dependency parsing, we translate a large amount of monolingual text, and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.

1 Introduction

Developing machine translation models without using bilingual parallel text is an intriguing research problem with real applications: obtaining a large volume of parallel text for many languages is hard if not impossible. Moreover, translation models could be used in downstream cross-lingual tasks in which annotated data does not exist for some languages. There has recently been a great deal of interest in unsupervised neural machine translation (e.g. Artetxe et al. (2018a); Lample et al. (2018a,c); Conneau and Lample (2019); Song et al. (2019a); Kim et al. (2020); Tae et al. (2020)). Unsupervised neural machine translation models often perform nearly as well as supervised models when translating between similar languages, but they fail to perform well in low-resource or distant languages (Kim et al., 2020) or out-of-domain monolingual data (Marchisio et al., 2020). In practice, the highest need for unsupervised models is to expand beyond high-resource, similar European language pairs.

There are two key goals in this paper. Our first goal is developing accurate translation models for low-resource distant languages without any supervision from a supervised model or gold-standard parallel data. Our second goal is to show that our machine translation models can be directly tailored to downstream natural language processing tasks. In this paper, we showcase our claim in cross-lingual image captioning and cross-lingual transfer of dependency parsers, but this idea is applicable to a wide variety of tasks.

We present a fast and accurate approach for learning translation models using Wikipedia. Unlike unsupervised machine translation that solely relies on raw monolingual data, we believe that we should not neglect the availability of incidental supervision from online resources such as Wikipedia. Wikipedia contains articles in nearly 300 languages and more languages might be added in the future, including indigenous languages and dialects of different regions in the world. Different from similar recent work (Schwenk et al., 2019a), we do not rely on any supervision from supervised translation models. Instead, we leverage the fact that many first sentences in linked Wikipedia pages are rough

∗ Research was conducted at The University of Pennsylvania.
      Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1655–1670
                        November 7–11, 2021. ©2021 Association for Computational Linguistics
translations, and furthermore, many captions of the same images are similar sentences, sometimes translations. Figure 1 shows a real example of a pair of linked Wikipedia pages in Arabic and English in which the titles, first sentences, and also the image captions are rough translations of each other. Our method learns a seed bilingual dictionary from a small collection of first sentence pairs, titles and captions, and then learns cross-lingual word embeddings. We make use of cross-lingual word embeddings to extract parallel sentences from Wikipedia. Our experiments show that our approach improves over strong unsupervised translation models for low-resource languages: we improve the BLEU score of English→Gujarati from 0.6 to 15.2, and English→Kazakh from 0.8 to 12.1.

Figure 1: A pair of Wikipedia documents in Arabic and English, along with the same image with two captions.

In the realm of downstream tasks, we show that we can easily use our translation models to generate high-quality translations of the MS-COCO (Chen et al., 2015) and Flickr (Hodosh et al., 2013) datasets, and train a cross-lingual image captioning model in a multi-task pipeline paired with machine translation in which the model is initialized by the parameters from our translation model. Our results on Arabic captioning show a BLEU score of 5.72 that is slightly better than a supervised captioning model with a BLEU score of 5.22. As another task, in dependency parsing, we first translate a large amount of monolingual data using our translation models and [...] better in low-resource languages.

Our contributions are summarized as follows: 1) We propose a simple, fast and effective approach towards using the Wikipedia monolingual data for machine translation without any explicit supervision. Our mining algorithm easily scales on large comparable data using limited computational resources. We achieve very high BLEU scores for distant languages, especially those in which current unsupervised methods perform very poorly. 2) We propose novel methods for leveraging our current translation models in image captioning. We show how a combination of translating caption training data, and multi-task learning with English captioning as well as translation, improves the performance. Our results on Arabic captioning are slightly superior to those of a supervised captioning model trained on gold-standard datasets. 3) We propose a novel modification to the annotation projection method in order to be able to leverage our translation models. Our results on dependency parsing are better than previous work in most cases, and similar to those obtained using gold-standard parallel datasets.

Our translation and captioning code and models are publicly available online.¹

2 Background

In this section, we briefly describe the main concepts that we repeatedly use throughout the paper.

Supervised neural machine translation Supervised machine translation uses a parallel text $P = \{(s_i, t_i)\}_{i=1}^{n}$ in which each sentence $s_i \in l_1$ is a translation of $t_i \in l_2$. To obtain a high-quality translation model, we usually need a large amount of parallel text. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) for which the likelihood of training data is maximized by maximizing the log-likelihood of predicting each target word given its previous predicted words and source sequence:

$$\mathcal{L}(P) = \sum_{i=1}^{n} \sum_{j=1}^{|t_i|} \log p(t_{i,j} \mid t_{i,k<j}, s_i; \theta)$$
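As an illustration of the log-likelihood objective described above, the following minimal Python sketch sums the log-probability of every target word given its prefix and the source sentence. The `cond_prob` interface, the `uniform_model` stand-in and the toy sentence pairs are assumptions made for this example; a real system would obtain these probabilities from a trained sequence-to-sequence model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: returns p(word | target prefix, source sentence; theta).
CondProb = Callable[[Sequence[str], Sequence[str], str], float]
SentencePair = Tuple[List[str], List[str]]  # (s_i, t_i)


def log_likelihood(parallel_text: Sequence[SentencePair], cond_prob: CondProb) -> float:
    """Sum log p(t_ij | t_i,k<j, s_i) over every target word of every pair."""
    total = 0.0
    for source, target in parallel_text:
        for j, word in enumerate(target):
            # target[:j] is the prefix of previously predicted target words.
            total += math.log(cond_prob(source, target[:j], word))
    return total


def uniform_model(vocab_size: int) -> CondProb:
    """Toy stand-in for a trained model: every word gets probability 1/|V|."""
    def cond_prob(source: Sequence[str], prefix: Sequence[str], word: str) -> float:
        return 1.0 / vocab_size
    return cond_prob


# Toy parallel text (not from the paper): four target words in total.
toy_parallel_text = [(["a", "b"], ["x", "y", "z"]), (["c"], ["w"])]
score = log_likelihood(toy_parallel_text, uniform_model(vocab_size=10))
print(score)  # four target words, each contributing log(0.1)
```

Training maximizes this quantity with respect to the model parameters; the sketch only evaluates it for a fixed model.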
models usually mask parts of every input sentence,         Wikipedia documents are rough translations of each
and try to uncover the masked words (Devlin et al.,        other. Moreover, captions of images in different
2019). The monolingual language models are used            languages are usually similar but not necessarily
along with iterative back-translation (Hoang et al.,       direct translations of each other. We leverage this
2018) to learn unsupervised translation. An input          information to extract many parallel sentences from
sentence s is translated to t0 using current model         Wikipedia without using any external supervision.
θ, then the model assumes that (t0 , s) is a gold-         In this section, we describe our algorithm which is
standard translation, and uses the same training           briefly shown in Figure 3.
objective as of supervised translation.
                                                           3.1   Data Definitions
Dependency parsing Dependency parsing algo-                For languages e and f in which e is English and f
rithms capture the best scoring dependency trees           is a low-resource target language of interest, there
for sentences among an exponential number of pos-                                                   (e)
                                                           are Wikipedia documents we = {w1 . . . wn }
                                                                                                              (e)
sible dependency trees. A valid dependency tree                             (f )     (f )                 (l)
                                                           and wf = {w1 . . . wm }. We refer to w(i,j) as
for a sentence s = s1 , . . . , sn assigns heads hi for
                                                           the jth sentence in the ith document for language
each for word si where 1 ≤ i ≤ n, 0 ≤ hi ≤ n and
                                                           l. A subset of these documents are aligned (us-
hi 6= i. The zeroth word represents a dummy root
                                                           ing Wikipedia languages links). Thus we have an
token as an indicator for the root of the sentence.
                                                           aligned set of document pairs in which we can eas-
For more details about efficient parsing algorithms,
                                                           ily extract many sentence pairs that are potentially
we encourage the reader to see Kübler et al. (2009).
                                                           translations of each other. A smaller subset F is the
                                                                                                    (e)     (f )
Annotation projection Annotation projection                set of first sentences in Wikipedia (w(i,1) , w(i0 ,1) )
is an effective method for transferring super-             in which documents i and i0 are linked and their
vised annotation from a rich-resource language             first sentence lengths are in a similar range. In
to a low-resource language through translated              addition to text content, Wikipedia has a large set
text (Yarowsky et al., 2001). Having a parallel            of images. Each image comes along with one or
data P = {(si , ti )}ni=1 , and supervised source          more captions, sometimes in different languages.
annotations for source sentences si , we transfer          A small subset of these images have captions both
those annotations through word translation links           in English and the target language. We refer to this
       (j)                                     (j)
0 ≤ ai ≤ |ti | for 1 ≤ j ≤ |si | where ai = 0              set as C. We use the set of all caption pairs (C),
shows a null alignment. The alignment links                title pairs (T ), and first sentences (F) as the seed
are learned in an unsupervised fashion using un-           parallel data: S = F ∪ C ∪ T .
supervised word alignment algorithms (Och and
Ney, 2003a). In dependency parsing, if $h_i = j$, $a(j) = k$, and $a(i) = m$, we project a dependency $k \rightarrow m$ (i.e. $h_m = k$) to the target side. Previous work (Rasooli and Collins, 2017, 2019) has shown that annotation projection only works when a large amount of translation data exists. In the absence of parallel data, we create artificial parallel data using our translation models. Figure 2 shows an example of annotation projection using translated text.

3   Learning Translation from Wikipedia

The key component of our approach is to leverage the multilingual cues from linked Wikipedia pages across languages. Wikipedia is a rich source of comparable data: many of its pages describe entities in the world in different languages. In most cases, first sentences define or introduce the mentioned entity in that page (e.g. Figure 1). Therefore, we observe that many first sentence pairs in linked

3.2   Bilingual Dictionary Extraction and Cross-Lingual Word Embeddings

Having the seed parallel data S, we run unsupervised word alignment (Dyer et al., 2013) in both the English-to-target and target-to-English directions. We use the intersected alignments to extract highly confident word-to-word connections. Finally, for each English word, we pick its most frequently aligned word as its translation. This set serves as a bilingual dictionary D.

Given two monolingually trained word embeddings $v_e \in \mathbb{R}^{N_e \times d}$ and $v_f \in \mathbb{R}^{N_f \times d}$, and the extracted bilingual dictionary D, we use the method of Faruqui and Dyer (2014) to project these two embedding vectors to a shared cross-lingual space.2 This method uses a bilingual dictionary along with

2 There are more recent approaches such as (Lample et al., 2018b). Comparing different embedding methods is not the focus of this paper, so we leave further investigation to future work.
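The dictionary extraction step of Section 3.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the alignments in the two directions have already been computed by an unsupervised aligner and are given as sets of (English index, target index) links per sentence pair. The function name `extract_dictionary` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def extract_dictionary(sentence_pairs, forward_links, backward_links):
    """Build an English-to-target dictionary from intersected alignments.

    sentence_pairs: list of (english_tokens, target_tokens)
    forward_links / backward_links: per-pair sets of (en_idx, tgt_idx) links
    from the two alignment directions (assumed to be given).
    """
    counts = defaultdict(Counter)
    for (en, tgt), fwd, bwd in zip(sentence_pairs, forward_links, backward_links):
        # Keep only links found in both directions (high-precision links).
        for i, j in fwd & bwd:
            counts[en[i]][tgt[j]] += 1
    # For each English word, keep its most frequently aligned target word.
    return {e: c.most_common(1)[0][0] for e, c in counts.items()}

# Toy data (hypothetical): three sentence pairs with their alignment links.
pairs = [
    (["the", "house"], ["casa"]),
    (["a", "house"], ["o", "casa"]),
    (["red", "house"], ["casa", "rosie"]),
]
forward = [{(1, 0)}, {(0, 0), (1, 1)}, {(0, 1), (1, 0), (0, 0)}]
backward = [{(1, 0), (0, 0)}, {(0, 0), (1, 1)}, {(0, 1), (1, 0)}]
dictionary = extract_dictionary(pairs, forward, backward)
```

In this toy run, "the" never survives the intersection, so it receives no dictionary entry; only words with at least one bidirectional link are kept.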
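The projection rule from the annotation projection description (if $h_i = j$, $a(j) = k$, and $a(i) = m$, then $h_m = k$) can be sketched as below. This is an illustrative sketch under stated assumptions: alignments are one-to-one and given as a dictionary, indices are 1-based with 0 denoting the root, and projecting the root as root is an assumption not spelled out in the text. The function name `project_heads` is hypothetical.

```python
def project_heads(source_heads, alignment, target_length):
    """Project source dependency heads onto target words.

    source_heads: dict {i: j}, source word i has head j (0 = root).
    alignment: dict {source_idx: target_idx}, one-to-one links.
    Returns a list of target heads for positions 1..target_length;
    None marks a target word with no projected dependency.
    """
    target_heads = [None] * (target_length + 1)  # index 0 is unused
    for i, j in source_heads.items():
        if i not in alignment:
            continue  # source word i has no aligned target word
        m = alignment[i]
        if j == 0:
            target_heads[m] = 0  # assumption: root projects to root
        elif j in alignment:
            target_heads[m] = alignment[j]  # h_m = k
    return target_heads[1:]

# Toy example (hypothetical): "She reads books" with the third word unaligned.
heads = project_heads({1: 2, 2: 0, 3: 2}, {1: 1, 2: 2}, target_length=3)
```

The `None` entry corresponds to a target word with a missing dependency, which is why the resulting trees are only partial.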
[Figure 2: dependency trees for the sentence pair below.
English (source, top): The International Crisis Group recently suggested moving responsibility for pension to state level, to eliminate some of the problems.
Romanian (target, bottom): Grupul International de Criza a sugerat recent mutarea responsabilitatii pentru pensii la nivelul statului, pentru a elimina unele dintre probleme.]

Figure 2: An example of annotation projection for which the source (English, on top) is a translation of the target (Romanian) with our wikily translation model. The source side is parsed with supervised Stanza (Qi et al., 2020) and the parse tree is projected using Giza++ (Och and Ney, 2003) intersected alignments. As shown in the figure, some words have missing dependencies.

Supervised neural machine translation Supervised machine translation uses a parallel text $P = \{(s_i, t_i)\}_{i=1}^{n}$ in which each sentence $s_i \in l_1$ is a translation of $t_i \in l_2$. For having a high-quality translation model, we usually need a large amount of parallel text, e.g. the Arabic-English United Nations parallel text (Ziemski et al., 2016) contains ∼18M sentences. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) for which the likelihood of training data is maximized by maximizing the log-likelihood of predicting each target word given its previous predicted words and source sequence:

$$L(P) = \sum_{i=1}^{n} \sum_{j=1}^{|t_i|} \log p(t_{i,j} \mid t_{i,k} \ldots$$

and try to uncover the masked words (Devlin et al., 2019). In this work, we mainly use the MASS model (Song et al., 2019), in which a contiguous span of words are masked, and the decoder predicts the masked words. These monolingual language models are used along with iterative back-translation (Hoang et al., 2018) to learn unsupervised translation. In other words, an input sentence $s$ is translated to $t'$ using current model $\theta$. Then the model assumes that $(t', s)$ is a gold-standard translation, and uses the same training objective as of supervised neural translation. The main assumption here is that languages have distributional similarities and these similarities can be captured by pretrained multilingual language models (Conneau et al., 2020).

after filtering sentence pairs with different numerical values (e.g. sentences containing 2019 in the source and 1987 in the target), we use a modified version of cosine similarity between words:

$$\mathrm{sim}(s_i, t_j) = \begin{cases} 1.0, & \text{if } (s_i, t_j) \in D \\ \cos(s_i, t_j), & \text{otherwise} \end{cases}$$

Using the above definition of word similarity, we use the average-maximum similarity between pairs of sentences.

$$\mathrm{score}(s, t) = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} \mathrm{sim}(s_i, t_j)}{n}$$

From a pool of candidates, we pick those pairs that have the highest score in both directions.

3.4   Leveraging Similar Languages

In many low-resource scenarios, the number of

Definitions: 1) $e$ is English, $f$ is the foreign language, and $g$ is a language similar to $f$, 2) learn_dict($P$) extracts a bilingual dictionary from parallel data $P$, 3) t($x|m$) translates input $x$ given model $m$, 4) pretrain($x$) pretrains on monolingual data $x$ using MASS (Song et al., 2019a), 5) train($P|m$) trains on parallel data $P$ initialized by model $m$, 6) bt_train($x_1, x_2|m$) trains iterative back-translation on monolingual data $x_1 \in e$ and $x_2 \in f$ initialized by model $m$.

Inputs: 1) Wikipedia documents $w^{(e)}$, $w^{(f)}$, and $w^{(g)}$, 2) Monolingual word embedding vectors $v_e$ and $v_f$, 3) Set of linked pages from Wikipedia COMP, their aligned titles $T$, and their first sentence pairs $F$, 4) Set of paired image captions $C$, and 5) Gold-standard parallel data $P^{(e,g)}$.

Algorithm:
→ Learn bilingual dictionary and embeddings
    $S = F \cup C \cup T$
    $D^{(f,e)}$ = learn_dict($S$)
    $D^{(g,e)}$ = learn_dict($P^{(e,g)}$)    ▷ Related language
    Learn $v_e \rightarrow v'_e$ and $v_f \rightarrow v'_f$ using $D^{(f,e)} \cup D^{(g,e)}$
→ Mine parallel data
    Extract comparable sentences $Z$ from COMP
    Extract $P^{(f,e)}$ from $Z$    ▷ Mined Data
    $P^{(f,e)} = P^{(f,e)} \cup T$
→ Train MT with pretraining and back-translation
    $\theta_0$ = pretrain($w^{(e)} \cup w^{(f)} \cup w^{(g)}$)    ▷ MASS Training
    $\theta$ = train($P^{(f,e)} \cup P^{(e,g)} | \theta_0$)    ▷ NMT Training
    $P^{(e \rightarrow f)}$ = (t($w^{(f)} | \theta$), …
from HuggingFace (Wolf et al., 2019) and Pytorch (Paszke et al., 2019) with a shared SentencePiece (Kudo and Richardson, 2018) vocabulary. All input and output token embeddings are summed with the language ID embedding. The first token of every input and output sentence is the language ID. Our training pipeline assumes that the encoder and decoder are shared across different languages, except that we use a separate output layer for each language in order to prevent input copying (Artetxe et al., 2018b; Sen et al., 2019). We pretrain the model on a tuple of three Wikipedia datasets for the three languages g, f, and e using the MASS model (Song et al., 2019a). The MASS model masks a contiguous span of input tokens, and recovers that span in the output sequence.

To facilitate multi-task learning with image captioning, our model has an image encoder that is used in cases of image captioning (more details in §4.1). In other words, the decoder is shared between the translation and captioning tasks. We use the pretrained ResNet-152 model (He et al., 2016) from Pytorch to encode every input image. We extract the final layer as a 7 × 7 grid vector ($g \in \mathbb{R}^{7 \times 7 \times d_g}$), project it to a new space by a linear transformation ($g' \in \mathbb{R}^{49 \times d_t}$), and then add location embeddings ($l \in \mathbb{R}^{49 \times d_t}$) by entry-wise addition. Afterwards, we treat the 49 vectors as encoded text representations, as if they came from a sentence with 49 words. This is similar to but not exactly the same as the Virtex model (Desai and Johnson, 2021).

3.6   Back-Translation: One-shot and Iterative

Finally, we use the back-translation technique to improve the quality of our models. Back-translation is done by translating a large amount of monolingual text to and from the target language. The translated texts serve as noisy input text along with the monolingual data as the silver-standard translations. Previous work (Sennrich et al., 2016b; Edunov et al., 2018) has shown that back-translation is a very simple but effective technique to improve the quality of translation models. Henceforth, we refer to this method as one-shot back-translation. Another approach is to use iterative back-translation (Hoang et al., 2018), the most popular approach in unsupervised translation (Artetxe et al., 2018b; Conneau and Lample, 2019; Song et al., 2019a). The main difference from one-shot translation is that the model uses an online approach, and updates its parameters in every batch.

We empirically find one-shot back-translation faster to train but with much less potential to reach a high translation accuracy. A simple and effective way to have both a reliable and accurate model is to first initialize a model with one-shot back-translation, and then apply iterative back-translation. The model that is initialized with a more accurate model reaches a higher accuracy.

4   Cross-Lingual Tasks

In this section, we describe our approaches for tailoring our translation models to cross-lingual tasks. Note that henceforth we assume that our translation model training is finished, and we have access to trained translation models for cross-lingual tasks.

4.1   Cross-Lingual Image Captioning

Given gold-standard image captioning training data $I = \{(I_i, c_i)\}_{i=1}^{n}$, where $I_i$ is the image as pixel values and $c_i = c_i^{(1)}, \ldots, c_i^{(k_i)}$ is the textual description with $k_i$ words, our goal is to learn a captioning model that is able to describe new (unseen) images. As described in §3.5, we use a transformer decoder from our translation model and a ResNet image encoder (He et al., 2016) for our image captioning pipeline. Unfortunately, annotated image captioning datasets do not exist in many languages. Having our translation model parameter $\theta^*$, we can use its translation functionality to translate each caption $c_i$ to $c'_i = \mathrm{translate}(c_i | \theta^*)$. Afterwards, we will have a translated annotated dataset $I' = \{(I_i, c'_i)\}_{i=1}^{n}$ in which the textual descriptions are not gold-standard but translations from the English captions. Figure 4 shows a real example from MS-Coco (Chen et al., 2015) in which Arabic translations are provided by our translation model. Furthermore, to augment our learning capability, we initialize our decoder with the decoder parameters of $\theta^*$, and also continue training with both English captioning and translation.

4.2   Cross-Lingual Dependency Parsing

Assuming that we have a large body of monolingual text, we translate that monolingual text to create artificial parallel data. We run unsupervised word alignments on the artificial parallel text. Following previous work (Rasooli and Collins, 2015; Ma and Xia, 2014), we run Giza++ (Och and Ney,
2003b) alignments on both source-to-target and target-to-source directions, and extract intersected alignments to keep high-precision one-to-one alignments. We run a supervised dependency parser of English as our rich-resource language. Then, we project dependencies to the target language sentences via word alignment links. Inspired by previous work (Rasooli and Collins, 2015), to remove noisy projections, we keep those sentences in which at least 50% of the words, or 5 consecutive words, on the target side have projected dependencies.

[Figure 4: image with five English captions and their Arabic translations.
English: This is an open box containing four cucumbers.
Arabic: وهذا صندوق مفتوح يحتوي على أربعة خيار.
English: An open food container box with four unknown food items.
Arabic: صندوق حاوية طعام مفتوح مع أربعة مواد غذائية مجهولة.
English: A small box filled with four green vegetables.
Arabic: مربع صغير مليء بأربعة الخضروات الخضراء.
English: An opened box of four chocolate bananas.
Arabic: علبة مفتوحة من أربعة من الموز.
English: An open box contains an unknown, purple object
Arabic: مربع مفتوح يحتوي على كائن غير معروف الأرجوان]

Figure 4: An image from MS-Coco (Chen et al., 2015) with gold-standard English captions, and Arabic translations from our wikily translation model.

5   Experiments

In this section, we provide details about our experimental settings and results for translation, captioning, and dependency parsing. We put more details about our settings as well as thorough analysis of our results in the supplementary material.

5.1   Datasets and Settings

Languages We focus on four language pairs: Arabic-English, Gujarati-English, Kazakh-English, and Romanian-English. We choose these pairs to provide enough evidence that our model works in distant languages, morphologically-rich languages, as well as similar languages. As for similar languages, we use Persian for Arabic (written with very similar scripts and sharing many words), Hindi for Gujarati (similar languages), Russian for Kazakh (written with the same script), and Italian for Romanian (Romance languages).

Monolingual and Translation Datasets We use a shared SentencePiece vocabulary (Kudo and Richardson, 2018) with size 60K. Table 1 shows the sizes of Wikipedia data in different languages. For evaluation, we use the Arabic-English UN data (Ziemski et al., 2016), WMT 2019 data (Barrault et al., 2019) for Gujarati-English and Kazakh-English, and WMT 2016 shared task data (Bojar et al., 2016) for Romanian-English. Following previous work (Sennrich et al., 2016a), diacritics are removed from the Romanian data. For more details about other datasets and their sizes, we refer the reader to the supplementary material.

Direction           ar-en    gu-en    kk-en    ro-en
Foreign docs        1.0m     28k      230k     400k
Paired docs         745k     7.3k     80k      270k
First sents.        205k     3.2k     52k      78k
Captions            92k      2.2k     1.9k     35k
Comparable pairs    0.1b     14m      32m      64m
Mined sents.        1.7m     49k      183k     675k
BT                  2.1m     1.5m     2.2m     2.1m
Iterative BT        4.0m     3.8m     4.0m     6.1m

Table 1: Data sizes for different pairs. We use a sample of English sentences with similar sizes to each data.

Pretraining We pretrain four models on 3-tuples of languages via a single NVIDIA Geforce RTX 2080 TI with 11GB of memory. We create batches of 4K words, run pretraining for two million iterations where we alternate between language batches, and accumulate gradients for 8 steps. We use the apex library3 to use FP-16 tensors. This whole process takes four weeks on a single GPU. We use the Adam optimizer (Kingma and Ba, 2015) with an inverse square root schedule and a learning rate of $10^{-4}$, 4000 warm-up steps, and dropout probability of 0.1.

Translation Training Table 1 shows the sizes of different types of datasets in our experiments. We pick comparable candidates for sentence pairs whose lengths are within a range of half to twice of each other. As we see, the final size of mined datasets heavily depends on the number of paired English-target language Wikipedia documents. We train our translation models initialized by pretrained models. More details about our hyper-parameters are in the supplementary material. All of our evaluations are conducted using SacreBLEU (Post, 2018) except for en↔ro in which we use BLEU score (Papineni et al., 2002) from Moses decoder scripts (Koehn et al., 2007) for the sake of comparison to previous work.

Image Captioning We use the Flickr (Hodosh et al., 2013) and MS-Coco (Chen et al., 2015) datasets for English4, and the gold-standard Arabic Flickr dataset (ElJundi et al., 2020) for evaluation. The Arabic test set has 1000 images with 3 captions

3 https://github.com/NVIDIA/apex
4 We have also tried Conceptual Captions (Sharma et al., 2018) in our initial experiments but we have observed drops in performance. Previous work (Singh et al., 2020) has also observed a similar problem with Conceptual Captions as a noisy crawled caption dataset.
per image. We translate all the training datasets to Arabic to obtain translated caption data. The final training data contains 620K captions for about 125K unique images. Throughout experiments, we use the pretrained ResNet-152 model (He et al., 2016) from Pytorch (Paszke et al., 2019), and fine-tune it during our training pipeline. Each training batch contains 20 images. We accumulate gradients for 16 steps, and use a dropout of 0.1 for the projected image output representations. Other training parameters are the same as our translation training. To make our pipeline fully unsupervised, we use translated development sets to pick the best checkpoint during training.

Dependency Parsing We use the Universal Dependencies v2.7 collection (Zeman et al., 2020) for Arabic, Kazakh, and Romanian. We use the Stanza (Qi et al., 2020) pretrained supervised models for getting supervised parse trees for Arabic and Romanian, and use the UDPipe (Straka et al., 2016) pretrained model for Kazakh. We translate about 2 million sentences from each language to English, and also 2 million English sentences to Arabic. We use a simple modification to Stanza to facilitate training on partially projected trees by masking dependency and label assignments for words with missing dependencies. All of our training on projected dependencies is blindly conducted with 100k training steps with default parameters of Stanza (Qi et al., 2020). As for gold-standard parallel data, we use our supervised translation training data for Romanian-English and Kazakh-English, and use a sample of 2 million sentences from the UN Arabic-English data, because its large size significantly slows down word alignment. For Kazakh wikily projections, due to low supervised POS accuracy, we use the projected POS tags for projected words and supervised tags for unprojected words. We observe a two percent increase in performance by using projected tags.

5.2   Translation Results

Table 2 shows the results of different settings in addition to baseline and state-of-the-art results. Arabic is a clear exception that needs more rounds of training: we train our Arabic model once again on mined data by initializing it by our back-translation model.5 We have not seen further improvement by back-translation. To have a fair comparison, we list the best supervised models for all language pairs (to the best of our knowledge). In low-resource settings, we outperform strong supervised models that are boosted by back-translation. In high-resource settings, our Arabic models achieve very high performance, but given that the parallel data for Arabic has 18M sentences, it is nearly impossible to reach that level of accuracy.

Figure 5 shows a randomly chosen example from the Gujarati-English development data. As depicted, the model after back-translation captures the core meaning of the sentence to some extent, with a bit of divergence from exactly matching the reference. The final iterative back-translation output is almost a correct translation. The word “creative” also appears in the Google Translate output, a model that is most likely trained on much larger parallel data than what is currently available for public use. In general, unsupervised translation performs very poorly compared to our approach in all directions.

5.3   Captioning Results

Table 4 shows the final results on the Arabic test set using the SacreBLEU measure (Post, 2018). First, we should note that similar to ElJundi et al. (2020), we see lower scales of BLEU scores due to morphological richness in Arabic. We see that if we initialize our model with the translation model and multi-task it with translation and also English captioning, we achieve much higher performance. It is interesting to observe that translating the English output on the test data to Arabic achieves a much lower result. This is a strong indicator of the strength of our approach. We also see that supervised translation fails to perform well. This might be due to the UN translation training dataset, which has a different domain from the caption dataset. Furthermore, we see that our model outperforms Google Translate, which is a strong machine translation system and is actually what was used as seed data for manual revision in the Arabic dataset. Finally, it is interesting to see that our model outperforms supervised captioning. Multi-tasking makes translation performance slightly worse.

Figure 6 shows a randomly picked example with
                                                               is improving both translation and captioning, but our further
   5
     We have seen that during multi-tasking with image cap-    investigation shows that it is actually due to lack of training for
tioning, the translation BLEU score for Arabic-English sig-    Arabic. We have tried the same procedure for other languages
nificantly improves. We initially thought that multi-tasking   but have not observed any further gains.
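
The masking of unprojected words used when training on partially projected trees can be illustrated with a minimal sketch. This is not Stanza's code: the function name `masked_head_loss`, the `MISSING` marker, and the plain-list score format are assumptions made only for illustration. The point is that a word without a projected head is skipped when the loss is averaged, so it neither rewards nor penalizes the parser; the same skip would apply to dependency labels.

```python
import math

MISSING = -1  # hypothetical marker: no head was projected for this word


def masked_head_loss(head_scores, projected_heads):
    """Average negative log-likelihood over words that have a projected head.

    head_scores[i][j] is an unnormalized score for word i taking candidate j
    as its head (candidate 0 stands for the artificial root).
    projected_heads[i] is the projected head index of word i, or MISSING.
    """
    total, counted = 0.0, 0
    for scores, head in zip(head_scores, projected_heads):
        if head == MISSING:
            continue  # masked: this word contributes nothing to the loss
        # Numerically stable log-sum-exp over the candidate heads.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[head]
        counted += 1
    return total / counted if counted else 0.0


# Two words; only the first one received a head through projection.
example_scores = [[0.0, 0.0], [5.0, -3.0]]
example_loss = masked_head_loss(example_scores, [0, MISSING])
```

In this toy call the second word's scores have no effect on `example_loss`, which is the behaviour the masking is meant to provide for words whose dependencies are missing.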
Group          Model                                      ar→en   en→ar   gu→en   en→gu   kk→en   en→kk   ro→en   en→ro
UNMT           Conneau and Lample (2019)                    –       –       –       –       –       –     31.8    33.3
UNMT           Song et al. (2019a) (MASS; 8 GPUs)           –       –       –       –       –       –     33.1    35.2
UNMT           Best published results                     11.0*    9.4*    0.6¹    0.6¹    2.0¹    0.8¹   37.6⁴   36.3²
Wikily UNMT    First sentences + captions + titles         6.1     3.1     0.7     1.1     2.3     1.0     2.0     1.9
Wikily UNMT    Mined Corpora                              23.1    19.7     4.2     4.9     2.8     1.6    22.1    21.6
Wikily UNMT    + Related Language                           –       –      9.1     7.8     7.3     2.3    23.2    21.5
Wikily UNMT    + One-shot back-translation (bt-beam=4)    23.0    18.8    13.8    13.9     7.0    12.1    25.2    28.1
Wikily UNMT    + Iterative back-translation (bt-beam=1)   24.4    18.9    13.3    15.2     9.0    10.8    32.5    33.0
Wikily UNMT    + Retrain on mined data                    30.6    23.4      –       –       –       –       –       –
               (Semi-)Supervised                          48.9*   40.6*   14.2¹    4.0¹   12.5¹    3.1¹   39.9³   38.5³

Table 2: BLEU scores for different models. Reference results are from *: Our implementation, ¹: Kim et al. (2020),
²: Li et al. (2020), ³: Liu et al. (2020) (supervised), ⁴: Tran et al. (2020) (unsupervised with mined parallel data).

                                                                          Arabic                Kazakh                Romanian
Group        Method                        Version   Token and POS     UAS    LAS    BLEX    UAS    LAS    BLEX    UAS    LAS    BLEX
Previous     Rasooli and Collins (2019)    2.0       gold/supervised   61.2   48.8    –       –      –      –      76.3   64.3    –
Previous     Ahmad et al. (2019)           2.2       gold              38.1   28.0    –       –      –      –      65.1   54.1    –
Previous     Kurniawan et al. (2021)       2.2       gold              48.3   29.9    –       –      –      –       –      –      –
Projection   Wikily translation            2.7       gold              62.5   50.7   46.3    46.8   28.5   25.0    74.1   57.7   52.6
Projection   Wikily translation            2.7       supervised        60.2   48.7   42.1    46.2   27.8   14.1    73.6   57.4   50.9
Projection   Gold-standard parallel data   2.7       gold              61.5   47.3   42.4    22.2    9.3    7.9    75.9   62.4   57.3
Projection   Gold-standard parallel data   2.7       supervised        59.1   45.3   38.5    21.8    9.2    3.8    75.6   62.0   55.6
             Supervised                    2.7       supervised        84.2   79.8   72.7    48.0   29.8   13.7    90.8   86.0   80.0

Table 3: Dependency parsing results on the Universal Dependencies dataset (Zeman et al., 2020). Previous work
has used different sub-versions of the Universal Dependencies data in which slight differences are expected.
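
As a reminder of what the UAS and LAS columns of Table 3 measure, here is a minimal sketch of the two attachment scores. It assumes gold tokenization and one `(head, label)` pair per word, and the function name `attachment_scores` is hypothetical. The official Universal Dependencies evaluation additionally aligns system tokens with gold tokens and computes BLEX, both of which are omitted here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for one or more concatenated sentences.

    gold and predicted are equally long lists of (head_index, label) pairs,
    one pair per word, in the same word order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head is correct, whatever the label.
    correct_heads = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS: both the predicted head and the predicted label are correct.
    correct_both = sum(1 for g, p in zip(gold, predicted) if g == p)
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n


# Toy three-word sentence: word 1 has the right head but the wrong label,
# word 2 is fully correct, and word 3 has the wrong head.
gold_tree = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred_tree = [(2, "obj"), (0, "root"), (1, "obj")]
uas, las = attachment_scores(gold_tree, pred_tree)
```

Because LAS requires the label to match as well, it can never exceed UAS, which is consistent with every row of Table 3.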

                             Input                                 અથાત આપણે પહે લા તુલનાએ વધુ રચના મક બનવું પડશે.
                             Unsupervised                          Ut numerous ીit the mother, onwards, in theover અિધકાંશexualit theotherit theIN રોડ 19
                             First sentences + captions + titles   A view of the universe from the present to the present day.
                   Outputs

                             Mined Corpora                         For example, if the ghazal is more popular than ghazal.
                             + Related Language                     We need to become more creative than before.
                             + One-shot back-translation            For example, we must become more creative than before.
                             + Iterative back-translation          Meanwhile, we ’ll have to become more constructive than before.
                             Google Translate                      That means we have to be more creative than before.
                             Reference                             That means we have to be more constructive than before.

 Figure 5: An example of a Gujarati sentence and its outputs from different models, as well as Google Translate.

English gold                    A child on a red slide.
                                A little boy sits on a slide on the playground.
                                A little boy slides down a bright red corkscrew slide.
                                A little boy slides down a red slide.
                                a young boy wearing a blue outfit sliding down a red slide.
English supervised              A boy is sitting on a red slide.
En– supervised translate        - صبي صبي يجلس على شاحنة خفيفة.
En– unsupervised translate      الطفل يجلس على شريحة حمراء.
En– Google translate            صبي يجلس على شريحة حمراء.
Supervised MT                   صبي صبي على شظية
Unsupervised (mt + ar + en)     يجلس صبي صغير على شريحة برتقالية.
Unsupervised (mt + ar)          صبي صغير يجلس على شريحة حمراء.
Supervised                      صبي في قميص أزرق يقفز في الهواء
Arabic Gold                     طفل على منزلقة حمراء
                                صبي صغير يجلس على زلاجة في الملعب
                                ينزلق صبي صغير أسفل منزلقة حمراء

Figure 6: An example of different outputs in our captioning experiments both for English and Arabic, as well as
Arabic translations of English outputs on the Arabic Flickr dataset (ElJundi et al., 2020).

different model outputs. We see that the two outputs from our approach with multi-tasking are
roughly the same, but one of them has more syntactic-order overlap with the reference; both
orders are correct in Arabic, which is a free-word-order language. The word برتقالية means
“orange”, which is close to حمراء, which means “red”. The word شريحة means “slide”, which is
correct, but the references use other words with this meaning. In general, we observe
that although the BLEU scores for Arabic are superficially low, this is mostly due to its lexical
diversity, free word order, and morphological complexity.

Setting                Supervision   Pretrained   Multi-task EN   Multi-task MT   BLEU@1   BLEU@4
Translate train data   wikily        ✗            ✗               ✗               33.1     4.57
Translate train data   wikily        ✓            ✗               ✗               32.9     5.28
Translate train data   wikily        ✓            ✓               ✗               32.8     4.37
Translate train data   wikily        ✓            ✗               ✓               33.3     5.72
Translate train data   wikily        ✓            ✓               ✓               36.8     5.60
Translate train data   supervised    ✓            ✗               ✗               17.7     1.26
Translate test         English test performance →                                 68.7     20.42
Translate test         wikily        ✓            ✗               ✗               30.6     4.20
Translate test         supervised    ✓            ✗               ✗               15.8     0.92
Translate test         Google        ✓            ✗               ✗               31.8     5.56
–                      Gold          ✓            ✗               ✗               33.7     3.76
–                      Gold          ✓            ✓               ✗               37.9     5.22

Table 4: Image captioning results evaluated on the Arabic Flickr dataset (ElJundi et al., 2020) using
SacreBLEU (Post, 2018). “pretrained” indicates initializing our captioning model with our translation parameters.

5.4    Dependency Parsing Results

Table 3 shows the results for the dependency parsing experiments. Our model performs very well on
Romanian, with a UAS of 74, which is much higher than that of Ahmad et al. (2019) and slightly
lower than that of Rasooli and Collins (2019), who use a combination of multi-source annotation
projection and direct model transfer. Our work on Arabic outperforms all previous work and
performs even better than using gold-standard parallel data. One clear highlight is our result on
Kazakh. As mentioned before, by projecting the part-of-speech tags, we achieve roughly 2 percent
absolute improvement. Our final results on Kazakh are significantly higher than those obtained
using gold-standard parallel text (7K sentences).

6    Related Work

Kim et al. (2020) have shown that unsupervised translation models often fail to provide good
translation systems for distant languages. Our work addresses this problem by leveraging
Wikipedia data. Pivot languages have been used in previous work (Al-Shedivat and Parikh, 2019),
as have related languages (Zoph et al., 2016; Nguyen and Chiang, 2017). Our work only explores
the simple idea of adding one similar language pair. Most likely, adding more language pairs and
using ideas from recent work would improve the performance.
   Wikipedia is an interesting dataset for solving NLP problems including machine translation (Li
et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et
al., 2015; Wijaya et al., 2017; Ruiter et al., 2019; Srinivasan et al., 2021). The WikiMatrix
data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but
it relies on supervised translation models. Bitext mining has a longer history of research
(Resnik, 1998; Resnik and Smith, 2003), in which most efforts are spent on using a seed
supervised translation model (Guo et al., 2018; Schwenk et al., 2019b; Artetxe and Schwenk, 2019;
Schwenk et al., 2019a; Jones and Wijaya, 2021). Recently, a number of papers have focused on
unsupervised extraction of parallel data (Ruiter et al., 2019; Hangya and Fraser, 2019; Keung et
al., 2020; Tran et al., 2020; Kuwanto et al., 2021). Ruiter et al. (2019) focus on using vector
similarity of sentences to extract parallel text from Wikipedia. Their work does not leverage
structural signals from Wikipedia.
   Cross-lingual and unsupervised image captioning has been studied in previous work (Gu et al.,
2018; Feng et al., 2019; Song et al., 2019b; Gu et al., 2019; Gao et al., 2020; Burns et al.,
2020). Unlike previous work, we do not have a supervised translation model. Cross-lingual
transfer of dependency parsers has a long history. We encourage the reader to read a recent
survey on this topic (Das and Sarkar, 2020). Our work does not use gold-standard parallel data
or even supervised translation models to apply annotation projection.

7    Conclusion

We have described a fast and effective algorithm for learning translation systems using
Wikipedia. We show that by wisely choosing what to use as seed data, we can obtain very good seed
parallel data with which to mine more parallel text from Wikipedia. We have also shown that our
translation models can be used in downstream cross-lingual natural language processing tasks. In
the future, we plan to extend our approach beyond Wikipedia to other comparable datasets like the
BBC World Service. A clear extension of this work is to try our approach on other cross-lingual
tasks. Moreover, as many captions of the same images in Wikipedia are similar sentences and
sometimes translations, multimodal machine translation (Specia et al., 2016; Caglayan et al.,
2019; Hewitt et al., 2018; Yao and Wan, 2020) based on this data and analysis of the data, such
as whether more similar languages share more similar captions (Khani et al., 2021), are other
interesting avenues.
Acknowledgments

We would like to thank reviewers and the editor for their useful comments. We also would like to
thank Alireza Zareian, Daniel (Joongwon) Kim, Qing Sun, and Afra Feyza Akyurek for their help and
useful comments throughout this project. This work is supported in part by the DARPA
HR001118S0044 (the LwLL program), and the Department of the Air Force FA8750-19-2-3334
(Semi-supervised Learning of Multimodal Representations). The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes. The views and conclusions contained
in this publication are those of the authors and should not be interpreted as representing
official policies or endorsements of DARPA, the Air Force, and the U.S. Government.

References

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On
  difficulties of cross-lingual transfer with order differences: A case study on dependency
  parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association
  for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
  pages 2440–2452, Minneapolis, Minnesota. Association for Computational Linguistics.

Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine
  translation. In Proceedings of the 2019 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
  Short Papers), pages 1184–1197, Minneapolis, Minnesota. Association for Computational
  Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Unsupervised statistical machine
  translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
  Processing, pages 3632–3642, Brussels, Belgium. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine
  translation. In International Conference on Learning Representations.

Mikel Artetxe and Holger Schwenk. 2019. Margin-based parallel corpus mining with multilingual
  sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for
  Computational Linguistics, pages 3197–3203, Florence, Italy. Association for Computational
  Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly
  learning to align and translate. CoRR, abs/1409.0473.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette
  Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias
  Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on
  machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation
  (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for
  Computational Linguistics.

Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba, and Lluís Màrquez. 2015. A factory of
  comparable corpora from Wikipedia. In Proceedings of the Eighth Workshop on Building and Using
  Comparable Corpora, pages 3–13, Beijing, China. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck,
  Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie
  Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia,
  Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on
  machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2,
  Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondrej Bojar, Vojtech Diatka, Pavel Rychlỳ, Pavel Stranák, Vít Suchomel, Ales Tamchyna, and
  Daniel Zeman. 2014. HindEnCorp - Hindi-English and Hindi-only corpus for machine translation.
  In LREC, pages 3550–3555.

Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A Plummer. 2020. Learning to
  scale multilingual representations for vision-language tasks. In European Conference on
  Computer Vision, pages 197–213. Springer.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for
  visual context in multimodal machine translation. arXiv preprint arXiv:1903.08678.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C
  Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv
  preprint arXiv:1504.00325.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
  Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder
  for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods
  in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for
  Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances
  in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Ayan Das and Sudeshna Sarkar. 2020. A survey of the model transfer approaches to cross-lingual
  dependency parsing. ACM Transactions on Asian and Low-Resource Language Information Processing
  (TALLIP), 19(5):1–60.

Karan Desai and Justin Johnson. 2021. VirTex: Learning Visual Representations from Textual
  Annotations. In CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of
  deep bidirectional transformers for language understanding. In Proceedings of the 2019
  Conference of the North American Chapter of the Association for Computational Linguistics:
  Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis,
  Minnesota. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective
  reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American
  Chapter of the Association for Computational Linguistics: Human Language Technologies, pages
  644–648, Atlanta, Georgia. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation
  at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
  Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Obeida ElJundi, Mohamad Dhaybi, Kotaiba Mokadam, Hazem Hajj, and Daniel Asmar. 2020. Resources
  and end-to-end neural network models for Arabic image captioning. In Proceedings of the 15th
  International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and
  Applications - Volume 5: VISAPP, pages 233–241. INSTICC, SciTePress.

Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale
  parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII
  Volume 2: Translator, Project and User Tracks, pages 118–119, Dublin, Ireland. European
  Association for Machine Translation.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using
  multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the
  Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden. Association for
  Computational Linguistics.

Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. 2019. Unsupervised image captioning. In Proceedings of
  the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4125–4134.

Jiahui Gao, Yi Zhou, Philip LH Yu, and Jiuxiang Gu. 2020. Unsupervised cross-lingual image
  captioning. arXiv preprint arXiv:2010.01288.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning
  word vectors for 157 languages. In Proceedings of the Eleventh International Conference on
  Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources
  Association (ELRA).

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. 2018. Unpaired image captioning by language
  pivoting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 503–519.

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. 2019. Unpaired image
  captioning via scene graph alignments. In Proceedings of the IEEE/CVF International Conference
  on Computer Vision, pages 10323–10332.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith
  Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective
  parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third
  Conference on Machine Translation: Research Papers, pages 165–176, Brussels, Belgium.
  Association for Computational Linguistics.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel
  segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the
  Association for Computational Linguistics, pages 1224–1234, Florence, Italy. Association for
  Computational Linguistics.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016
  IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya, and Chris
  Callison-Burch. 2018. Learning translations via images with a massively multilingual image
  dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational
  Linguistics (Volume 1: Long Papers), pages 2566–2576.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative
  back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural
  Machine Translation and Generation, pages 18–24, Melbourne, Australia. Association for
  Computational Linguistics.
Micah Hodosh, Peter Young, and Julia Hockenmaier.
  2013. Framing image description as a ranking task:
  Data, models and evaluation metrics. Journal of
  Artificial Intelligence Research, 47:853–899.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara
  Cabezas, and Okan Kolak. 2005. Bootstrapping
  parsers via syntactic projection across parallel texts.
  Natural language engineering, 11(03):311–325.

Alex Jones and Derry Tanti Wijaya. 2021. Majority
  voting with bidirectional pre-translation for bitext
  retrieval.

Omid Kashefi. 2018. Mizan: a large Persian-English
  parallel corpus. arXiv preprint arXiv:1801.02107.

Phillip Keung, Julian Salazar, Yichao Lu, and Noah A
  Smith. 2020. Unsupervised bitext mining and
  translation via self-trained contextual embeddings. arXiv
  preprint arXiv:2010.07761.

Nikzad Khani, Isidora Tourni, Mohammad Sadegh
  Rasooli, Chris Callison-Burch, and Derry Tanti Wijaya.
  2021. Cultural and geographical influences on
  image translatability of words across languages. In
  Proceedings of the 2021 Conference of the North
  American Chapter of the Association for
  Computational Linguistics: Human Language Technologies,
  pages 198–209.

Yunsu Kim, Miguel Graça, and Hermann Ney. 2020.
  When and why is unsupervised neural machine
  translation useless? In Proceedings of the 22nd
  Annual Conference of the European Association for
  Machine Translation, pages 35–44, Lisboa, Portugal.
  European Association for Machine Translation.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
  method for stochastic optimization. In 3rd
  International Conference on Learning Representations,
  ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
  Conference Track Proceedings.

Philipp Koehn. 2005. Europarl: A parallel corpus for
  statistical machine translation. In MT summit,
  volume 5, pages 79–86. Citeseer.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
  Callison-Burch, Marcello Federico, Nicola Bertoldi,
  Brooke Cowan, Wade Shen, Christine Moran,
  Richard Zens, et al. 2007. Moses: Open source
  toolkit for statistical machine translation. In
  Proceedings of the 45th annual meeting of the ACL
  on interactive poster and demonstration sessions,
  pages 177–180. Association for Computational
  Linguistics.

Sandra Kübler, Ryan McDonald, and Joakim Nivre.
  2009. Dependency parsing. Synthesis lectures on
  human language technologies, 1(1):1–127.

Taku Kudo and John Richardson. 2018. SentencePiece:
  A simple and language independent subword
  tokenizer and detokenizer for neural text processing. In
  Proceedings of the 2018 Conference on Empirical
  Methods in Natural Language Processing: System
  Demonstrations, pages 66–71, Brussels, Belgium.
  Association for Computational Linguistics.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak
  Bhattacharyya. 2018. The IIT Bombay English-Hindi
  parallel corpus. In Proceedings of the Eleventh
  International Conference on Language Resources and
  Evaluation (LREC 2018), Miyazaki, Japan.
  European Language Resources Association (ELRA).

Kemal Kurniawan, Lea Frermann, Philip Schulz, and
  Trevor Cohn. 2021. Ppt: Parsimonious parser
  transfer for unsupervised cross-lingual adaptation. arXiv
  preprint arXiv:2101.11216.

Garry Kuwanto, Afra Feyza Akyürek, Isidora Chara
  Tourni, Siyang Li, and Derry Wijaya. 2021.
  Low-resource machine translation for low-resource
  languages: Leveraging comparable data,
  code-switching and compute resources.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer,
  and Marc’Aurelio Ranzato. 2018a. Unsupervised
  machine translation using monolingual corpora only.
  In International Conference on Learning
  Representations.

Guillaume Lample, Alexis Conneau, Marc’Aurelio
  Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b.
  Word translation without parallel data. In
  International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau,
  Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018c.
  Phrase-based & neural unsupervised machine
  translation. In Proceedings of the 2018 Conference on
  Empirical Methods in Natural Language Processing,
  pages 5039–5049, Brussels, Belgium. Association
  for Computational Linguistics.

Shen Li, Joao V Graça, and Ben Taskar. 2012. Wiki-ly
  supervised part-of-speech tagging. In Proceedings
  of the 2012 Joint Conference on Empirical Methods
  in Natural Language Processing and Computational
  Natural Language Learning, pages 1389–1398.
  Association for Computational Linguistics.

Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama,
  Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao.
  2020. Data-dependent gaussian prior objective for
  language generation. In International Conference
  on Learning Representations.

Wen-Pin Lin, Matthew Snover, and Heng Ji. 2011.
  Unsupervised language-independent name translation
  mining from wikipedia infoboxes. In Proceedings
  of the First workshop on Unsupervised Learning in
  NLP, pages 43–52.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey
  Edunov, Marjan Ghazvininejad, Mike Lewis, and
  Luke Zettlemoyer. 2020. Multilingual denoising
  pre-training for neural machine translation. arXiv
  preprint arXiv:2001.08210.