"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks


Mohammad Sadegh Rasooli (1), Chris Callison-Burch (1), Derry Tanti Wijaya (2)
(1) Department of Computer and Information Science, University of Pennsylvania
(2) Department of Computer Science, Boston University

{rasooli, ccb}@seas.upenn.edu, wijaya@bu.edu

arXiv:2104.08384v1 [cs.CL] 16 Apr 2021

Abstract

We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning, and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a wikily translation of the English captioning data. Our captioning results on Arabic are slightly better than those of its supervised model. In dependency parsing, we translate a large amount of monolingual text, and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.

1    Introduction

Developing machine translation models without using gold-standard parallel text is an intriguing research problem with real applications: obtaining a large volume of parallel text for many languages is hard if not impossible. Moreover, translation models could be used in downstream cross-lingual tasks in which annotated data does not exist for some languages. There has recently been a great deal of interest in unsupervised neural machine translation (e.g. Artetxe et al. (2018a); Lample et al. (2018a,c); Conneau and Lample (2019); Song et al. (2019a); Kim et al. (2020); Tae et al. (2020)). Unsupervised neural machine translation models often perform nearly as well as supervised models when translating between similar languages, but they fail to perform well in low-resource or distant languages (Kim et al., 2020) or out-of-domain monolingual data (Marchisio et al., 2020). In practice, the highest need for unsupervised models is to expand beyond high-resource, similar European language pairs.

There are two key goals in this paper: Our first goal is developing accurate translation models for low-resource distant languages without any supervision from a supervised model or gold-standard parallel data. Our second goal is to show that our machine translation models could be directly tailored to downstream natural language processing tasks. In this paper, we showcase our claim in cross-lingual image captioning and cross-lingual transfer of dependency parsers, but this idea is applicable to a wide variety of tasks.

We present a fast and accurate approach for learning translation models using Wikipedia. Unlike unsupervised machine translation that solely relies on raw monolingual data, we believe that we should not neglect the availability of incidental supervision from online resources such as Wikipedia. Wikipedia contains articles in nearly 300 languages and more languages might be added in the future, including indigenous languages and dialects of different regions in the world. Different from similar recent work (Schwenk et al., 2019a), we do not rely on any supervision from supervised translation models. Instead, we leverage the fact that many first sentences in linked Wikipedia pages are rough translations, and furthermore, many captions of the same images are similar sentences, sometimes
translations. Figure 1 shows a real example of a pair of linked Wikipedia pages in Arabic and English in which the titles, first sentences, and also the image captions are rough translations of each other. Our method learns a seed bilingual dictionary from a small collection of first sentence pairs, titles and captions, and then learns cross-lingual word embeddings. We make use of cross-lingual word embeddings to extract parallel sentences from Wikipedia. Our experiments show that our approach improves over strong unsupervised translation models for low-resource languages: we improve the BLEU score of English→Gujarati from 0.6 to 15.2, and English→Kazakh from 0.8 to 12.1.

Figure 1: A pair of Wikipedia documents in Arabic and English, along with the same image with two captions.

In the realm of downstream tasks, we show that we can easily use our translation models to generate high-quality translations of MS-COCO (Chen et al., 2015) and Flickr (Hodosh et al., 2013) datasets, and train a cross-lingual image captioning model in a multi-task pipeline paired with machine translation in which the model is initialized by the parameters from our translation model. Our results on Arabic captioning show a BLEU score of 5.72 that is slightly better than a supervised captioning model with a BLEU score of 5.22. As another task, in dependency parsing, we first translate a large amount of monolingual data using our translation models and then apply transfer using the annotation projection method (Yarowsky et al., 2001; Hwa et al., 2005). Our results show that our approach performs similarly to using gold-standard parallel text in high-resource scenarios, and significantly better in low-resource languages.

A summary of our contributions is as follows:

• We propose a simple, fast and effective approach towards using the Wikipedia monolingual data for machine translation without any explicit supervision. Our mining algorithm easily scales on large comparable data using limited computational resources. We achieve very high BLEU scores for distant languages, especially those in which current unsupervised methods perform very poorly.

• We propose novel methods for leveraging our current translation models in image captioning. We show how a combination of translating caption training data, and multi-task learning with English captioning as well as translation improves the performance. Our results on Arabic captioning are slightly superior to those of a supervised captioning model trained on gold-standard datasets.

• We propose a novel modification to the annotation projection method in order to be able to leverage our translation models. Our results on dependency parsing are better than previous work in most cases, and are similar to those obtained using gold-standard parallel datasets.

Our code is publicly available online.1

1 Our code: https://github.com/rasoolims/ImageTranslate, and our modification to Stanza for training on partially projected trees: https://github.com/rasoolims/stanza

2    Background

In this section, we briefly describe the main concepts that we repeatedly use throughout the paper.

Supervised neural machine translation Supervised machine translation uses a parallel text P = {(s_i, t_i)}_{i=1}^{n} in which each sentence s_i ∈ l_1 is a translation of t_i ∈ l_2. For having a high-quality translation model, we usually need a large amount of parallel text. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) for which the likelihood of training data is maximized by maximizing the log-likelihood of
[Figure 2: dependency trees over an English sentence and its Romanian counterpart; the arcs are not recoverable from the extracted text.]
English (source): The International Crisis Group recently suggested moving responsibility for pension to state level, to eliminate some of the problems.
Romanian (target): Grupul International de Criza a sugerat recent mutarea responsabilitatii pentru pensii la nivelul statului, pentru a elimina unele dintre probleme.

Figure 2: An example of annotation projection for which the source (English, on top) is a translation of the target (Romanian) with our wikily translation model. The source side is parsed with supervised Stanza (Qi et al., 2020) and the parse tree is projected using Giza++ (Och and Ney, 2003) intersected alignments. As shown in the figure, some words have missing dependencies.
predicting each target word given its previously predicted words and the source sequence:

    L(P) = \sum_{i=1}^{n} \sum_{j=1}^{|t_i|} \log p(t_{i,j} \mid t_{i,k<j}, s_i)

[...] to uncover the masked words (Devlin et al., 2019). In this [...] mainly use the MASS model (Song et al., 2019), in which a contiguous [...]

[...] to a low-resource language through translated text ([...] et al., 2001). Having parallel data P = {(s_i, t_i)}_{i=1}^{n}, and supervised source annotations for source sentences s_i, we transfer [...]
vision. In this section, we describe our algorithm, which is briefly shown in Figure 3.

3.1    Data Definitions

For languages e and f in which e is English and f is a low-resource target language of interest, there are Wikipedia documents w_e = {w_1^{(e)} ... w_n^{(e)}} and w_f = {w_1^{(f)} ... w_m^{(f)}}. We refer to w_{(i,j)}^{(l)} as the jth sentence in the ith document for language l. A subset of these documents are aligned (using Wikipedia language links). Thus we have an aligned set of document pairs from which we can easily extract many sentence pairs that are potentially translations of each other. A smaller subset F is the set of first sentences in Wikipedia (w_{(i,1)}^{(e)}, w_{(i',1)}^{(f)}) in which documents i and i' are linked and their first sentence lengths are in a similar range. In addition to text content, Wikipedia has a large set of images. Each image comes along with one or more captions, sometimes in different languages. A small subset of these images have captions both in English and the target language. We refer to this set as C. We use the set of all caption pairs (C), title pairs (T), and first sentences (F) as the seed parallel data: S = F ∪ C ∪ T.

3.2    Bilingual Dictionary Extraction and Cross-Lingual Word Embeddings

Having the seed parallel data S, we run unsupervised word alignment (Dyer et al., 2013) in both English-to-target, and target-to-English directions. We use the intersected alignments to extract highly confident word-to-word connections. Finally, we pick the most frequent aligned word for each word in English as its translation. This set serves as a bilingual dictionary D.

Given two monolingual trained word embeddings v_e ∈ R^{N_e×d} and v_f ∈ R^{N_f×d}, and the extracted bilingual dictionary D, we use the method of Faruqui and Dyer (2014) to project these two embedding vectors to a shared cross-lingual space.2 This method uses a bilingual dictionary along with canonical correlation analysis (CCA) to learn two projection matrices to map each embedding vector to a shared space v'_e ∈ R^{N_e×d'} and v'_f ∈ R^{N_f×d'} where d' ≤ d.

2 There are other approaches for extracting bilingual embeddings such as (Lample et al., 2018b). Comparing different cross-lingual embedding learning methods is not the focus of this paper, so we leave further investigation to future work.

Definitions: 1) e is English, f is the foreign language, and g is a language similar to f, 2) learn_dict(P) extracts a bilingual dictionary from parallel data P, 3) t(x|m) translates input x given model m, 4) pretrain(x) pretrains on monolingual data x using MASS (Song et al., 2019a), 5) train(P|m) trains on parallel data P initialized by model m, 6) bt_train(x1, x2|m) trains iterative back-translation on monolingual data x1 ∈ e and x2 ∈ f initialized by model m.
Inputs: 1) Wikipedia documents w(e), w(f), and w(g), 2) Monolingual word embedding vectors v_e and v_f, 3) Set of linked pages from Wikipedia COMP, their aligned titles T, and their first sentence pairs F, 4) Set of paired image captions C, and 5) Gold-standard parallel data P(e,g).

→ Learn bilingual dictionary and embeddings
  S = F ∪ C ∪ T
  D(f,e) = learn_dict(S)
  D(g,e) = learn_dict(P(e,g))                    ▷ Related language
  Learn v_e → v'_e and v_f → v'_f using D(f,e) ∪ D(g,e)
→ Mine parallel data
  Extract comparable sentences Z from COMP
  Extract P(f,e) from Z using Eq. 1.             ▷ Cosine sim.
  P(f,e) = P(f,e) ∪ T                            ▷ Mined Data
→ Train MT with pretraining and back-translation
  θ0 = pretrain(w(e) ∪ w(f) ∪ w(g))              ▷ MASS Training
  θ = train(P(f,e) ∪ P(g,e) | θ0)                ▷ NMT Training
  P(e→f) = (t(w(f) | θ), w(f))
  P(f→e) = (t(w(e) | θ), w(e))
  P'(f,e) = P(e→f) ∪ P(f→e) ∪ P(f,e)
  θ' = train(P'(f,e) | θ0)
  θ* = bt_train(w(e), w(f) | θ')
Output: θ*

Figure 3: A brief depiction of the training pipeline.

3.3    Mining Parallel Sentences

We use cross-lingual embedding vectors v'_e ∈ R^{N_e×d'} and v'_f ∈ R^{N_f×d'} for calculating the cosine similarity between pairs of words. Moreover, we use the extracted bilingual dictionary to boost the accuracy of the scoring function. For a pair of sentences (s, t) where s = s_1 ... s_n and t = t_1 ... t_m, after filtering sentence pairs with different numerical values (e.g. sentences containing 2019 in the source and 1987 in the target), we use a modified version of cosine similarity between words:

    sim(s_i, t_j) = \begin{cases} 1.0, & \text{if } (s_i, t_j) \in D \\ \cos(s_i, t_j), & \text{otherwise} \end{cases}
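As a minimal sketch of the mining step, the following Python code combines the dictionary extraction of §3.2, the modified word similarity above, and the averaged score and mutual-best selection rule that the rest of this subsection defines. It is an illustration under assumptions rather than our released implementation: the function names, the input formats (tokenized sentences as lists, embeddings as plain dictionaries of vectors), the zero similarity for out-of-vocabulary words, and the length-ratio threshold `max_ratio` used for candidate generation are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def learn_dict(links):
    """Keep, for each English word, its most frequently aligned foreign word.

    `links` holds (english_word, foreign_word) pairs read off intersected
    word alignments of the seed data (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for e_word, f_word in links:
        counts[e_word][f_word] += 1
    return {(e, c.most_common(1)[0][0]) for e, c in counts.items()}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim(s_word, t_word, dictionary, emb_s, emb_t):
    """Modified cosine: 1.0 for dictionary pairs, embedding cosine otherwise."""
    if (s_word, t_word) in dictionary:
        return 1.0
    if s_word not in emb_s or t_word not in emb_t:
        return 0.0  # assumption: out-of-vocabulary words contribute nothing
    return cosine(emb_s[s_word], emb_t[t_word])


def score(s, t, dictionary, emb_s, emb_t):
    """Average over source words of the best similarity to any target word."""
    if not s or not t:
        return 0.0
    best = (max(sim(si, tj, dictionary, emb_s, emb_t) for tj in t) for si in s)
    return sum(best) / len(s)


def same_numbers(s, t):
    """Numeric filter: both sentences must contain the same digit tokens."""
    return sorted(w for w in s if w.isdigit()) == sorted(w for w in t if w.isdigit())


def candidates(sentence, pool, max_ratio=1.5):
    """Candidate generation: similar length (assumed ratio) plus numeric filter."""
    return [c for c in pool
            if c and sentence
            and max(len(c), len(sentence)) <= max_ratio * min(len(c), len(sentence))
            and same_numbers(sentence, c)]


def mine(e_sents, f_sents, dictionary, emb_e, emb_f):
    """Return (s, t) pairs that are each other's best-scoring candidate."""
    def sc(s, t):
        return score(s, t, dictionary, emb_e, emb_f)

    mined = []
    for s in e_sents:
        g_s = candidates(s, f_sents)
        if not g_s:
            continue
        t_best = max(g_s, key=lambda t: sc(s, t))
        s_best = max(candidates(t_best, e_sents), key=lambda x: sc(x, t_best))
        if s_best == s:  # mutual best match
            mined.append((s, t_best))
    return mined
```

Here `e_sents` and `f_sents` stand for the sentences of one pair of linked pages. Requiring agreement in both directions trades recall for precision, which matters because the mined pairs are later used as training data.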
Using the above definition of word similarity, we use the average-maximum similarity between pairs of sentences:

    score(s, t) = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} sim(s_i, t_j)}{n}

From a pool of candidates, if the following conditions hold, we pick (s, t) as translations, and add this pair to the mined parallel data:

    t = \arg\max_{y \in G(s)} score(s, y)
    s = \arg\max_{x \in G(t)} score(x, t)

where G is the translation candidate generation function from linked Wikipedia pages based on sentences with similar lengths.

3.4   Leveraging Similar Languages

In many low-resource scenarios, the number of paired documents is very small, leading to a small and often noisy set of extracted parallel sentences. To alleviate this problem to some extent, we assume that there is another language g that has a large lexical overlap with the target language f (such as g=Russian and f=Kazakh). We assume that parallel data exists between language g and English, and we can use it both as auxiliary parallel data in training, and also for extracting extra lexical entries for the bilingual dictionaries: as shown in Figure 3, we supplement the extracted bilingual dictionary from seed parallel data with the bilingual dictionary extracted from related language parallel data.

3.5   Translation Model

We use a standard sequence-to-sequence transformer-based translation model (Vaswani et al., 2017) with a six-layer BERT-based (Devlin et al., 2019) encoder-decoder architecture from HuggingFace (Wolf et al., 2019) and Pytorch (Paszke et al., 2019) with a shared SentencePiece (Kudo and Richardson, 2018) vocabulary. All input and output token embeddings are summed up with the language ID embedding. The first token of every input and output sentence is the language ID. Our training pipeline assumes that the encoder and decoder are shared across different languages, except that we use a separate output layer for each language in order to prevent input copying (Artetxe et al., 2018b; Sen et al., 2019). We pretrain the model on a tuple of three Wikipedia datasets for the three languages g, f, and e using the MASS model (Song et al., 2019a). The MASS model masks a contiguous span of input tokens, and recovers that span in the output sequence.

For facilitating multi-task learning with image captioning, our model has an image encoder that is used in cases of image captioning (more details in §4.1). In other words, the decoder is shared between the translation and captioning tasks. We use the pretrained ResNet-152 model (He et al., 2016) from Pytorch to encode every input image. We extract the final layer as a 7 × 7 grid vector (g ∈ R^{7×7×d_g}), and project it to a new space by a linear transformation (g' ∈ R^{49×d_t}), and then add location embeddings (l ∈ R^{49×d_t}) by using entry-wise addition. Afterwards, we treat the 49 vectors as encoded text representations, as if they came from a sentence with 49 words. This is similar but not exactly the same as the Virtex model (Desai and Johnson, 2021).

3.6   Back-Translation: One-shot and Iterative

Finally, we use the back-translation technique to improve the quality of our models. Back-translation is done by translating a large amount of monolingual text to and from the target language. The translated texts serve as noisy input text along with the monolingual data as the silver-standard translations. Previous work (Sennrich et al., 2016b; Edunov et al., 2018) has shown that back-translation is a very simple but effective technique to improve the quality of translation models. Henceforth, we refer to this method as one-shot back-translation. Another approach is to use iterative back-translation (Hoang et al., 2018), the most popular approach in unsupervised translation (Artetxe et al., 2018b; Conneau and Lample, 2019; Song et al., 2019a). The main difference from one-shot back-translation is that the model uses an online approach, and updates its parameters in every batch.

We empirically find one-shot back-translation faster to train but with much less potential to reach a high translation accuracy. A simple and effective way to have both a reliable and accurate model is to first initialize a model with one-shot back-translation, and then apply iterative back-translation. The model that is initialized with a more accurate model reaches a higher accuracy.
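The data construction behind one-shot back-translation can be sketched as follows. This is a simplified illustration, not our training code: `e_to_f` and `f_to_e` are hypothetical callables standing in for translation with the current model, and the function name, the `mined` argument (mined English-foreign pairs), and the returned dictionary layout are assumptions made for the example. Each monolingual sentence becomes the silver-standard target, and its machine translation becomes the noisy source.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]                         # (source sentence, target sentence)
Translator = Callable[[List[str]], List[str]]  # stands in for translation with the current model


def one_shot_back_translation(
    mono_e: List[str],
    mono_f: List[str],
    e_to_f: Translator,
    f_to_e: Translator,
    mined: List[Pair],
) -> Dict[str, List[Pair]]:
    """Build training pairs for both directions from monolingual text.

    `mined` holds (english, foreign) pairs mined from Wikipedia; they are
    kept alongside the synthetic pairs.
    """
    synthetic_e = f_to_e(mono_f)  # noisy English for real foreign sentences
    synthetic_f = e_to_f(mono_e)  # noisy foreign for real English sentences

    # English-to-foreign data: noisy English source, real foreign target.
    e2f = list(zip(synthetic_e, mono_f)) + [(e, f) for e, f in mined]
    # Foreign-to-English data: noisy foreign source, real English target.
    f2e = list(zip(synthetic_f, mono_e)) + [(f, e) for e, f in mined]
    return {"e2f": e2f, "f2e": f2e}
```

In the one-shot setting this construction runs once before training. In the iterative setting the same pairing would be produced per batch with the current parameters, which is why it is slower but can keep improving.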
[Figure 4 image: MS-COCO photo with five English captions and their Arabic translations]
1. This is an open box containing four cucumbers.
   وهذا صندوق مفتوح يحتوي على أربعة خيار.
2. An open food container box with four unknown food items.
   صندوق حاوية طعام مفتوح مع أربعة مواد غذائية مجهولة.
3. A small box filled with four green [...]
   مربع صغير مليء بأربعة الخضروات الخضراء.
4. An opened box of four chocolate bananas.
   علبة مفتوحة من أربعة من الموز.
5. An open box contains an unknown, purple object
   مربع مفتوح يحتوي على كائن غير معروف الأرجوان

Figure 4: An image from MS-COCO (Chen et al., 2015) with gold-standard English captions, and Arabic translations from our wikily translation model.

[...] Ney, 2003) alignments on both source-to-target and target-to-source directions, and extract intersected alignments to keep high-precision one-to-one alignments. We run a supervised dependency parser of English as our rich-resource language. Then, we project dependencies to the target language sentences via word alignment links. Inspired by previous work (Rasooli and Collins, 2015), to remove noisy projections, we keep those sentences in which at least 50% of words or 5 consecutive words in the target side have projected dependencies.

4     Cross-Lingual Tasks                                                                   5     Experiments

In this section, we describe our approaches for tai-                                        In this section, we provide details about our experi-
loring our translation models to cross-lingual tasks.                                       mental settings and results for translation, caption-
Note that henceforth we assume that our transla-                                            ing, and dependency parsing.
tions model training is finished, and we have access                                        5.1    Datasets and Settings
to trained translation models for cross-lingual tasks.
                                                                                            Languages We focus on four language pairs:
4.1    Cross-Lingual Image Captioning                                                       Arabic-English, Gujarati-English, Kazakh-English,
                                                                                            and Romanian-English. We choose these pairs to
Having gold-standard image captioning training
                                                                                            provide enough evidence that our model works in
data I = {(Ii , ci )}ni=1 where Ii is the image as
                             (1)                                                            distant languages, morphologically-rich languages,
pixel values, and ci = ci , . . . , cki i as the textual
                                                                                            as well as similar languages. As for similar lan-
description with ki words, our goal is to learn a cap-
                                                                                            guages, we use Persian for Arabic (written with
tioning model that is able to describe new (unseen)
                                                                                            very similar scripts and have many words in com-
images. As described in §3.5, we use a transformer
                                                                                            mon), Hindi for Gujarati (similar languages), Rus-
decoder from our translation model and a Resent
                                                                                            sian for Kazakh (written with the same script), and
image encoder (He et al., 2016) for our image cap-
                                                                                            Italian for Romanian (Romance languages).
tioning pipeline. Unfortunately, annotated image
captioning datasets do not exist in many languages.                                         Monolingual and Translation Datasets We use
Having our translation model parameter θ         ∗ , we                                    regular expressions to tokenize sentences of
can use its translation functionality to translate                                          Wikipedia dump text. We use a shared Senten-
each caption ci to c0i = translate(ci |θ     ∗ ). After-                                   cePiece vocabulary (Kudo and Richardson, 2018)
wards, we will have a translated annotated dataset                                          with size 60K. Table 1 shows the sizes of Wikipedia
I 0 = {(Ii , c0i )}ni=1 in which the textual descrip-                                       data in different languages. We use an off-the-
tions are not gold-standard but translations from                                           shelf Indic-transliteration library3 to convert the
the English captions. Figure 4 shows a real exam-                                           Devanagari script to Hindi script to make the Hindi
ple from MS-Coco (Chen et al., 2015) in which                                               documents look like Gujarati. This is done by
Arabic translations are provided by our translation                                         removing the graphical vertical bars from Hindi
model. Furthermore, to augment our learning ca-                                             letters: this would make them look like Gujarati,
pability, we initialize our decoder with decoding                                           thus increasing the chance of capturing more words
parameters of θ   ∗ , and also continue training with                                      in common. For parallel data in similar lan-
both English captioning and translation.                                                    guages, we use the Mizan parallel data for Per-
                                                                                            sian (Kashefi, 2018) with one million sentences,
4.2    Cross-Lingual Dependency Parsing                                                     the IITB data (Kunchukuttan et al., 2018) and Hin-
Assuming that we have a large body of monolin-                                              diEnCorp 0.5 (Bojar et al., 2014) for Hindi with a
gual text, we translate that monolingual text to cre-                                       total of 367K sentences, ParaCrawl for Russian (Es-
ate artificial parallel data. We run unsupervised                                           plà et al., 2019) with 12M sentences, and Europarl
word alignments on the artificial parallel text. Fol-                                       for Italian (Koehn, 2005) with 2M sentences. We
lowing previous work (Rasooli and Collins, 2015;                                              3
Ma and Xia, 2014), we run Giza++ (Och and                                                   indic-transliteration/
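The projection steps of §4.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the representation of alignments as sets of index pairs, the use of -1 for the root and None for a missing head, and the exact reading of the two thresholds are all assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple


def intersect(src2tgt: Set[Tuple[int, int]],
              tgt2src: Set[Tuple[int, int]]) -> Dict[int, int]:
    """Intersect two directional alignments and keep one-to-one links.

    src2tgt holds (source, target) pairs; tgt2src holds (target, source) pairs.
    """
    both = src2tgt & {(s, t) for (t, s) in tgt2src}
    src_count: Dict[int, int] = {}
    tgt_count: Dict[int, int] = {}
    for s, t in both:
        src_count[s] = src_count.get(s, 0) + 1
        tgt_count[t] = tgt_count.get(t, 0) + 1
    # Drop any link whose source or target word takes part in several links.
    return {s: t for s, t in both if src_count[s] == 1 and tgt_count[t] == 1}


def project_heads(src_heads: List[int], links: Dict[int, int],
                  tgt_len: int) -> List[Optional[int]]:
    """Copy source heads (-1 = root) to aligned target words."""
    tgt_heads: List[Optional[int]] = [None] * tgt_len
    for s, t in links.items():
        head = src_heads[s]
        if head == -1:
            tgt_heads[t] = -1
        elif head in links:  # the head must be aligned as well
            tgt_heads[t] = links[head]
    return tgt_heads


def keep_sentence(tgt_heads: List[Optional[int]],
                  min_ratio: float = 0.5, min_run: int = 5) -> bool:
    """Keep a sentence with enough projected words or a long projected span."""
    covered = [h is not None for h in tgt_heads]
    if not covered:
        return False
    longest = run = 0
    for c in covered:
        run = run + 1 if c else 0
        longest = max(longest, run)
    return sum(covered) / len(covered) >= min_ratio or longest >= min_run
```

For a three-word source sentence whose second word is the root, `project_heads([1, -1, 1], links, 4)` with links `{0: 0, 1: 1, 2: 2}` leaves the fourth target word without a head, and the sentence is kept because three of its four words are covered.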
          Foreign docs
          Paired docs
          First sents.
          Comparable pairs
          Mined sents.
          Iterative BT        4.0m    3.8m    4.0m    6.1m

Table 1: Data sizes for different pairs. English has 6 million documents. We use a sample of English sentences with similar sizes to each language.

use the Arabic-English UN data (Ziemski et al., 2016), WMT 2019 data (Barrault et al., 2019) for Gujarati-English and Kazakh-English, and WMT 2016 shared task data (Bojar et al., 2016) for Romanian-English. Following previous work (Sennrich et al., 2016a), diacritics are removed from the Romanian data.

Cross-Lingual Embedding We use the off-the-shelf 300-dimensional FastText embeddings (Grave et al., 2018) as monolingual embedding vectors. We run FastAlign (Dyer et al., 2013) on the seed parallel text in both source-to-target and target-to-source directions, intersect the two alignments, and take the most frequent alignment of every word as its dictionary entry. We make use of the cross-lingual CCA tool (Faruqui and Dyer, 2014) to extract 150-dimensional vectors. This tool can be run on a single CPU within a few hours.

Pretraining We pretrain four models on 3-tuples of languages on a single NVIDIA GeForce RTX 2080 Ti with 11GB of memory. We boost the Romanian, Gujarati, and Kazakh monolingual data with the news text datasets from WMT in order to have enough monolingual data as well as in-domain text. We create batches of 4K words and run pretraining for two million iterations, alternating between language batches. We use the apex library4 to use 16-bit floating-point tensors and double the processing speed. To mimic the multi-GPU scenario, we accumulate gradients for 8 steps. We use the Adam optimizer (Kingma and Ba, 2015) with an inverse square root learning rate schedule, a learning rate of $10^{-4}$, 4000 warm-up steps, and a dropout probability of 0.1. Due to GPU memory limitations, this whole process takes about four weeks: in theory, with 8 high-memory GPUs, we could obtain higher-quality pretrained models in a few days.

Translation Training Table 1 shows the sizes of different types of datasets in our experiments. We pick comparable candidates for sentence pairs whose lengths are within a range of half to twice of each other. As we see, the final size of mined datasets heavily depends on the number of paired English-target language Wikipedia documents. We train our translation models initialized by pretrained models. Each batch has roughly 4K tokens. Except for Arabic, for which the size of mined data significantly outnumbers the size of Persian-English parallel data, we use the related language data before using iterative back-translation, in which we only use the source and target monolingual datasets. We use similar learning hyper-parameters to pretraining, except for iterative back-translation, in which we accumulate gradients for 100 steps and use a dropout probability of 0.2 and 10000 warm-up steps, since we find that smaller dropout and warm-up values make the model diverge. Our one-shot back-translation experiments use a beam size of 4, but we use a beam size of one for iterative back-translation since we have not seen much gain from beam-based iterative back-translation except in purely unsupervised settings. All of our translations are performed with a beam size of 4, max_len_a = 1.3, and max_len_b = 5. We alternate between supervised parallel data of a similar language paired with English and the mined data.

We train translation models for roughly 400K batches, except for Gujarati, which has smaller mined data and for which we train for 200K iterations. We have seen a quick divergence in Kazakh iterative back-translation, so we stopped it early after running it for one epoch of all monolingual data. Most likely, the mined data for Kazakh-English has lower quality (see §A for more details), and that leads to very noisy translations in back-translation outputs. All of our evaluations are conducted using SacreBLEU (Post, 2018) except for en↔ro, for which we use the BLEU score (Papineni et al., 2002) from the Moses decoder scripts (Koehn et al., 2007) for the sake of comparison to previous work.

Image Captioning We use the Flickr (Hodosh et al., 2013) and MS-Coco (Chen et al., 2015) datasets for English5, and the gold-standard Arabic

4 https://github.com/NVIDIA/apex
5 We have also tried Conceptual Captions (Sharma et al., 2018) in our initial experiments but we have observed drops in performance. Previous work (Singh et al., 2020) has also observed a similar problem with Conceptual Captions as a
noisy crawled caption dataset.

Flickr dataset (ElJundi et al., 2020) for evaluation. The Arabic test set has 1000 images with 3 captions per image. We translate all the training datasets to Arabic to obtain translated caption data. The final training data contains 620K captions for about 125K unique images. Throughout our experiments, we use the pretrained ResNet-152 model (He et al., 2016) from PyTorch (Paszke et al., 2019), and fine-tune it during our training pipeline. Each training batch contains 20 images. We accumulate gradients for 16 steps, and use a dropout of 0.1 for the projected image output representations. Other training parameters are the same as in our translation training. To make our pipeline fully unsupervised, we use translated development sets to pick the best model during training.

Dependency Parsing We use the Universal Dependencies v2.7 collection (Zeman et al., 2020) for Arabic, Kazakh, and Romanian. We use the Stanza (Qi et al., 2020) pretrained supervised models to obtain supervised parse trees for Arabic and Romanian, and use the UDPipe (Straka et al., 2016) pretrained model for Kazakh. We use a simple modification to Stanza to facilitate training on partially projected trees by masking dependency and label assignments for words with missing dependencies. All of our training on projected dependencies is blindly conducted with 100k training steps, in which we use the default training parameters of Stanza (Qi et al., 2020). As for gold-standard parallel data, we use our supervised translation training data for Romanian-English and Kazakh-English, and use a sample of 2 million sentences from the UN Arabic-English data, because its large size slows down word alignment significantly. For Kazakh wikily projections, we observe that the supervised POS taggers have very low accuracy due to the small size of the Kazakh gold-standard treebank (31 sentences for training). We project the POS tags for projected words. On average, we observe a two percent increase in performance by projecting part-of-speech tags as well as dependency parse relations.

5.2   Translation Results

Table 2 shows the results of different settings in addition to baseline and state-of-the-art results. We see that Arabic, as a clear exception, needs more rounds of training: we train our Arabic model once again on mined data by initializing it with our back-translation model.6 We have not seen further improvement by back-translation. To have a fair comparison, we list the best supervised models for all language pairs (to the best of our knowledge). In low-resource settings, we outperform strong supervised models that are boosted by back-translation. In high-resource settings, our Arabic models achieve very high performance, but given that the parallel data for Arabic has 18M sentences, matching the supervised level of accuracy is hardly possible: our Arabic Wikipedia data is much smaller than the UN parallel data.

5.3    Captioning Results

Table 3 shows the final results on the Arabic test set using the SacreBLEU measure (Post, 2018). First, we should note that, similar to ElJundi et al. (2020), we see lower scales of BLEU scores due to morphological richness in Arabic (see §A for details). We see that if we initialize our model with the translation model and multi-task it with translation and also English captioning, we achieve much higher performance. It is interesting to observe that by translating the English output on the test data to Arabic, we achieve a much lower result. This is a strong indicator of the strength of our approach. We also see that supervised translation fails to perform well. This might be due to the UN dataset, which has a different domain from the caption dataset. Furthermore, we see that our model outperforms using Google Translate, which is a strong machine translation system, and that is actually what is being used as seed data for manual revision in the Arabic dataset. Finally, it is interesting to see that our model outperforms supervised captioning. It is worth noting that our multi-tasking makes translation performance slightly worse.

5.4    Dependency Parsing

Table 4 shows the results for dependency parsing experiments. We see that our model performs very well in Romanian with a UAS of 74, which is much higher than that of Ahmad et al. (2019) and slightly lower than that of Rasooli and Collins (2019), which uses a combination of multi-source annotation projection and direct model transfer. Our work on Ara-

6 We have seen that during multi-tasking with image captioning, the translation BLEU score for Arabic-English significantly improves. We initially thought that multi-tasking is improving both translation and captioning, but our further investigation shows that it is actually due to lack of training for Arabic. We have tried the same procedure for other languages but have not observed any further gains.
Group         Model                                        ar→en    en→ar    gu→en    en→gu    kk→en    en→kk    ro→en    en→ro
UNMT          Conneau and Lample (2019)                      –        –        –        –        –        –      31.8     33.3
UNMT          Song et al. (2019a) (MASS; 8 GPUs)             –        –        –        –        –        –      33.1     35.2
UNMT          Best published results                       11.0*     9.4*     0.6^1    0.6^1    2.0^1    0.8^1   37.6^4   36.3^2
Wikily UNMT   First sentences + captions + titles           6.1      3.1      0.7      1.1      2.3      1.0      2.0      1.9
Wikily UNMT   Mined Corpora                                23.1     19.7      4.2      4.9      2.8      1.6     22.1     21.6
Wikily UNMT   + Related Language                             –        –       9.1      7.8      7.3      2.3     23.2     21.5
Wikily UNMT   + One-shot back-translation (bt-beam=4)      23.0     18.8     13.8     13.9      7.0     12.1     25.2     28.1
Wikily UNMT   + Iterative back-translation (bt-beam=1)     24.4     18.9     13.3     15.2      9.0     10.8     32.5     33.0
Wikily UNMT   + Retrain on mined data                      30.6     23.4       –        –        –        –        –        –
              (Semi-)Supervised                            48.9*    40.6*    14.2^1    4.0^1   12.5^1    3.1^1   39.9^3   38.5^3

Table 2: BLEU scores for different models. Our models are initialized by our pretrained MASS model. Reference results (marked with ^) are from *: Our implementation, 1: Kim et al. (2020), 2: Li et al. (2020), 3: Liu et al. (2020) (supervised with back-translation), 4: Tran et al. (2020) (unsupervised with mined parallel data).
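Two of the translation settings from §5.1 can be made concrete with a short Python sketch. It is not the authors' code. It assumes that the "half to twice" constraint on comparable candidates is measured in tokens, and that max_len_a and max_len_b bound the output length as a * source_length + b, as in fairseq-style decoders; both readings, and the function names, are assumptions.

```python
from typing import Sequence


def is_comparable(src_tokens: Sequence[str], tgt_tokens: Sequence[str],
                  max_ratio: float = 2.0) -> bool:
    """True if neither side is more than max_ratio times longer than the other."""
    ls, lt = len(src_tokens), len(tgt_tokens)
    if ls == 0 or lt == 0:
        return False
    return max(ls, lt) <= max_ratio * min(ls, lt)


def max_output_length(src_len: int, max_len_a: float = 1.3,
                      max_len_b: int = 5) -> int:
    """Upper bound on the number of generated tokens for a given source length."""
    return int(max_len_a * src_len + max_len_b)
```

Under these assumptions, a pair of 4 and 2 tokens passes the length check while a pair of 5 and 2 tokens does not.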

                           Supervision   Pretrained   Multi-task      BLEU
                                                      EN      MT      @1      @4
Translate EN train data    wikily        ✗            ✗       ✗       33.1    4.57
                           wikily        ✓            ✗       ✗       32.9    5.28
                           wikily        ✓            ✓       ✗       32.8    4.37
                           wikily        ✓            ✗       ✓       33.3    5.72
                           wikily        ✓            ✓       ✓       36.8    5.60
                           supervised    ✓            ✗       ✗       17.7    1.26
                           English test performance →                 68.7    20.42
Translate test             wikily        ✓            ✗       ✗       30.6    4.20
                           supervised    ✓            ✗       ✗       15.8    0.92
                           Google        ✓            ✗       ✗       31.8    5.56
                                         ✓            ✗       ✗       33.7    3.76
                                         ✓            ✓       ✗       37.9    5.22

Table 3: Image captioning results evaluated on the Arabic Flickr dataset (ElJundi et al., 2020) using SacreBLEU (Post, 2018). The column “pretrained” indicates initializing our captioning model with parameters from our translation model.

bic outperforms all previous work and performs even better than using gold-standard parallel data. One clear highlight is our result in Kazakh. As mentioned before, by projecting the part-of-speech tags, we achieve roughly 2 percent absolute improvement. Our final results on Kazakh are significantly higher than those obtained with gold-standard parallel text (7K sentences).

6     Related Work

Kim et al. (2020) have shown that unsupervised translation models often fail to provide good translation systems for distant languages. Our work solves this problem by wisely leveraging the Wikipedia data. Pivot languages have been used in zero-shot settings in previous work (Al-Shedivat and Parikh, 2019), as have related languages (Zoph et al., 2016; Nguyen and Chiang, 2017). Our work only explores a simple idea of adding one supervised similar language pair. Most likely, adding more language pairs and using ideas from recent work should help improve the accuracy of our models.

Wikipedia has always been an interesting dataset for solving NLP problems including machine translation (Li et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et al., 2015; Ruiter et al., 2019). The WikiMatrix data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but it relies on supervised translation models. There is a very recent collection of Wikipedia data (Srinivasan et al., 2021) for many languages, intended for use in multimodal machine learning tasks.

Bitext mining has a longer history of research (Resnik, 1998; Resnik and Smith, 2003), in which most efforts are spent on using a seed supervised translation model (Guo et al., 2018; Schwenk et al., 2019b; Artetxe and Schwenk, 2019; Schwenk et al., 2019a; Jones and Wijaya, 2021). Recently, a number of papers have focused on unsupervised extraction of parallel data (Ruiter et al., 2019; Hangya and Fraser, 2019; Keung et al., 2020; Tran et al., 2020; Kuwanto et al., 2021). Our work lies in the group of unsupervised mining approaches, with a focus on Wikipedia and fast retrieval of parallel text. Ruiter et al. (2019) focus on using vector similarity of sentences to extract high-quality parallel text from Wikipedia. Their work has not leveraged specific structural signals from Wikipedia. It is worth noting that recent work has considered using a large number of small Bible parallel datasets for translation (Mueller et al., 2020): we think this line of work can be combined with ours.

Cross-lingual and unsupervised image captioning has been studied in previous work (Gu et al., 2018; Feng et al., 2019; Song et al., 2019b; Gu et al., 2019; Gao et al., 2020). Unlike previous
                                                             Arabic              Kazakh              Romanian
Method                        Version   Token and POS     UAS   LAS   BLEX    UAS   LAS   BLEX    UAS   LAS   BLEX
Rasooli and Collins (2019)    2.0       gold/supervised   61.2  48.8   –       –     –     –      76.3  64.3   –
Ahmad et al. (2019)           2.2       gold              38.1  28.0   –       –     –     –      65.1  54.1   –
Kurniawan et al. (2021)       2.2       gold              48.3  29.9   –       –     –     –       –     –     –
Wikily translation            2.7       gold              62.5  50.7  46.3    46.8  28.5  25.0    74.1  57.7  52.6
Wikily translation            2.7       supervised        60.2  48.7  42.1    46.2  27.8  14.1    73.6  57.4  50.9
Gold-standard Parallel data   2.7       gold              61.5  47.3  42.4    22.2   9.3   7.9    75.9  62.4  57.3
Gold-standard Parallel data   2.7       supervised        59.1  45.3  38.5    21.8   9.2   3.8    75.6  62.0  55.6
Supervised                    2.7       supervised        84.2  79.8  72.7    48.0  29.8  13.7    90.8  86.0  80.0

Table 4: Dependency parsing results on the Universal Dependencies dataset (Zeman et al., 2020). Previous work has used different sub-versions of the Universal Dependencies data in which slight differences are expected.
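The wikily parsers in Table 4 are trained on partially projected trees, with words that lack a projected dependency masked out (§5.1). A minimal pure-Python sketch of this kind of masking is given below, under the assumption that the training objective is an average negative log-likelihood over words. It does not reproduce the Stanza modification itself; the function name and the use of None for a missing head are illustrative choices.

```python
import math
from typing import List, Optional


def masked_head_nll(head_probs: List[List[float]],
                    gold_heads: List[Optional[int]]) -> float:
    """Average negative log-likelihood over words that have a projected head.

    head_probs[i][j] is the model probability that word i attaches to
    candidate head j; gold_heads[i] is the projected head index, or None
    when no dependency was projected for word i.
    """
    total, count = 0.0, 0
    for probs, head in zip(head_probs, gold_heads):
        if head is None:
            continue  # masked: this word contributes nothing to the loss
        total += -math.log(probs[head])
        count += 1
    return total / count if count else 0.0
```

With this formulation, a sentence in which only some words received a projected head still yields a training signal, and a sentence with no projected heads contributes a loss of zero.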

work, we do not have a supervised translation model. Cross-lingual transfer of dependency parsers has a long history. We encourage the reader to read a recent survey on this topic (Das and Sarkar, 2020). Our work does not use gold-standard parallel data or even supervised translation models to apply annotation projection, but we still see that our models perform similarly to, or sometimes better than, models using gold-standard parallel text.

7     Conclusion

We have described a fast and effective algorithm for learning unsupervised machine translation systems using Wikipedia. We show that by wisely choosing what to use as seed data, we can have very good seed parallel data to mine more parallel text from Wikipedia. We have also shown that those translation models can be used in downstream cross-lingual natural language processing tasks. In the future, we plan to extend our approach beyond Wikipedia to other comparable datasets like the BBC World Service. Moreover, a clear extension of this work is to try our approach on other cross-lingual tasks.

Acknowledgments

We would like to thank Alireza Zareian, Daniel (Joongwon) Kim, Qing Sun, and Afra Feyza Akyurek for their help and useful comments throughout this project. This work is supported in part by the DARPA HR001118S0044 (the LwLL program), and the Department of the Air Force FA8750-19-2-3334 (Semi-supervised Learning of Multimodal Representations). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA, the Air Force, and the U.S. Government.

A     Analysis and Discussion

Manual Observation: Figure 5 shows a randomly chosen example from the Gujarati-English development data. As depicted, the model after back-translation captures roughly the core meaning of the sentence, with some divergence from the reference. The final iterative back-translation output is almost a correct translation. We also see that the word “creative” appears in the Google Translate output, a model that is most likely trained on much larger parallel data than what is currently available for public use. In general, unsupervised translation performs very poorly compared to our approach in all directions.

Quality of mined data: The quality of parallel data matters a lot for getting high accuracy. For example, we manually observe that the quality of mined data is very good for all languages except Kazakh. Our hypothesis is that the Kazakh Wikipedia data is less aligned with the English content. We compare our mined data to the supervised mined data from WikiMatrix (Schwenk et al., 2019a) as well as to gold-standard data. Figure 6 shows the difference between the three datasets for three language pairs (WikiMatrix does not contain Gujarati). As we see, our data has BLEU scores close to WikiMatrix in all languages, and in the case of Kazakh, the model trained on our data performs better than WikiMatrix. In other words, in the case of having very noisy comparable data, as is the case for Kazakh-English, our model even outperforms a contextualized supervised model. It is also interesting to see that our model outperforms the supervised model for Kazakh that has only 7.7K gold-standard training data. These are all strong evidence of the strength
Input                                   અથાત આપણે પહે લા તુલનાએ વધુ રચના મક બનવું પડશે.
Unsupervised                            Ut numerous ીit the mother, onwards, in theover અિધકાંશexualit theotherit theIN રોડ 19
First sentences + captions + titles     A view of the universe from the present to the present day.
Mined Corpora                           For example, if the ghazal is more popular than ghazal.
+ Related Language                      We need to become more creative than before.
+ One-shot back-translation             For example, we must become more creative than before.
+ Iterative back-translation            Meanwhile, we ’ll have to become more constructive than before.
Google Translate                        That means we have to be more creative than before.
Reference                               That means we have to be more constructive than before.

 Figure 5: An example of a Gujarati sentence and its outputs from different models, as well as Google Translate.
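
The outputs in Figure 5 also illustrate why n-gram overlap metrics such as BLEU can penalize acceptable paraphrases ("creative" versus "constructive"). The following is a minimal sketch of clipped n-gram precision, the building block of BLEU, applied to three of the outputs above. It is not the evaluation script used for the reported scores: the function names `ngrams` and `clipped_precision` are hypothetical, the tokenization is plain lowercased whitespace splitting (an assumption), and no brevity penalty or multi-order averaging is applied.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped.

    Assumption: lowercased whitespace tokenization, so punctuation stays
    attached to the neighboring word.
    """
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Sentences copied from Figure 5.
reference = "That means we have to be more constructive than before."
outputs = {
    "+ Related Language": "We need to become more creative than before.",
    "+ Iterative back-translation": "Meanwhile, we ’ll have to become more constructive than before.",
    "Google Translate": "That means we have to be more creative than before.",
}

for system, text in outputs.items():
    p1 = clipped_precision(text, reference, 1)
    p2 = clipped_precision(text, reference, 2)
    print(f"{system}: unigram precision={p1:.2f}, bigram precision={p2:.2f}")
```

Under these assumptions, a single word substitution lowers the bigram precision more than the unigram precision, because it breaks two adjacent bigrams. This is one reason why sentence-level overlap should be read together with a manual inspection of the outputs.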

[Plot for Figure 6. Legend: Ours, WikiMatrix, Supervised. Directions: ar→en, en→ar, kk→en, en→kk, ro→en, en→ro.]
Figure 6: Results using our mined data versus WikiMatrix (Schwenk et al., 2019a) and gold-standard data.

[Plot for Figure 7. Legend: No Pretraining, With Pretraining. Directions: ar→en, en→ar, gu→en, en→gu, kk→en, en→kk, ro→en, en→ro.]
Figure 7: Results using mined data (no back-translation) with and without pretraining.

[Plot for Figure 8. Legend: CRISS, Ours. Directions: gu→en, en→gu, kk→en, en→kk.]
Figure 8: Our best results (Table 2) versus the supervised model of Tran et al. (2020).

of our approach in truly low-resource settings.

Pretraining matters: It is a truth universally acknowledged, that a single model in possession of small training data and high learning capacity, must be in want of a pretrained model. To prove this, we run our translation experiments with and without pretraining. In this case, all models with the same training data and parameters are equal, but some models are more equal. Figure 7 shows the results on the mined data. Clearly, there is a significant gain from using pretrained models. For Gujarati, which is the lowest-resource language in our experiments, the gap is more notable: from a BLEU score of 2.9 to 9.0. If we had access to a cluster of high-memory GPUs, we could potentially obtain even higher results throughout all of our experiments. Therefore, we believe that part of the blame for our results in English-Romanian lies with pretraining. As we see in Figure 6, our supervised results without back-translation are also low for English-Romanian.

Comparing to CRISS: The recent work of Tran et al. (2020) shows impressive gains using high-quality pretrained models and iterative parallel data mining from a larger comparable dataset than Wikipedia. Their pretrained model is trained using 256 Nvidia V100 GPUs in approximately 2.5 weeks (Liu et al., 2020). Figure 8 shows that, even considering all these facts, our model still outperforms their supervised model in English-to-Kazakh by a big margin (4.3 vs 10.8) and gets close to their performance in other directions. We should emphasize that Tran et al. (2020) explore a much larger comparable dataset than ours. One clear addition to our work is exploring parallel data from other available comparable datasets. Due to limited computational resources, we skip this part, but we do believe that using our current unsupervised models can help extract even more high-quality parallel data from comparable datasets, and this might lead to further gains for low-resource

Image captioning quality: Figure 9 shows a randomly picked example with different model outputs. We see that the two outputs from our approach with multi-tasking are roughly the same, but one of them has more syntactic order overlap with the reference, while both orders are correct in Arabic as a free-word-order language. The word برتقالية means “orange”, which is close to حمراء, which means “red”. The word شريحة means “slide”, which is correct, but other meanings of this word exist in the reference. In general, we observe that although superficially the BLEU scores for Ara-
                                                English gold                  A child on a red slide.
                                                                              A little boy sits on a slide on the playground.
                                                                              A little boy slides down a bright red corkscrew slide.
                                                                              A little boy slides down a red slide.
                                                                              a young boy wearing a blue outfit sliding down a red slide.
                                                English supervised            A boy is sitting on a red slide.
                                                En– supervised translate                                    . ‫‐ ﺻﺒﻲ ﺻﺒﻲ ﻳﺠﻠﺲ ﻋﻠ ﺷﺎﺣﻨﺔ ﺧﻔﻴﻔﺔ‬
                                                En– unsupervised translate                                           .‫اﻟﻄﻔﻞ ﻳﺠﻠﺲ ﻋﻠ ﺷﺮﻳﺤﺔ ﺣﻤﺮاء‬
                                                En– Google translate                                                  .‫ﺻﺒﻲ ﻳﺠﻠﺲ ﻋﻠ ﺷﺮﻳﺤﺔ ﺣﻤﺮاء‬
                                                Supervised MT                                                                   ‫ﺻﺒﻲ ﺻﺒﻲ ﻋﻠ ﺷﻈﻴﺔ‬
                                                Unsupervised (mt + ar + en)                                   .‫ﻳﺠﻠﺲ ﺻﺒﻲ ﺻﻐﻴﺮ ﻋﻠ ﺷﺮﻳﺤﺔ ﺑﺮﺗﻘﺎﻟﻴﺔ‬
                                                Unsupervised (mt + ar)                                         .‫ﺻﺒﻲ ﺻﻐﻴﺮ ﻳﺠﻠﺲ ﻋﻠ ﺷﺮﻳﺤﺔ ﺣﻤﺮاء‬
                                                Supervised                                                        ‫ﺻﺒﻲ ﻓ ﻗﻤﻴﺺ أزرق ﻳﻘﻔﺰ ﻓ اﻟﻬﻮاء‬
                                                                                                                              ‫ﻃﻔﻞ ﻋﻠ ﻣﻨﺰﻟﻘﺔ ﺣﻤﺮاء‬
                                                Arabic Gold                                                 ‫ﺻﺒﻲ ﺻﻐﻴﺮ ﻳﺠﻠﺲ ﻋﻠ زﻻﺟﺔ ﻓ اﻟﻤﻠﻌﺐ‬
                                                                                                                 ‫ﻳﻨﺰﻟﻖ ﺻﺒﻲ ﺻﻐﻴﺮ أﺳﻔﻞ ﻣﻨﺰﻟﻘﺔ ﺣﻤﺮاء‬

Figure 9: An example of different outputs in our captioning experiments for both English and Arabic, as well as
Arabic translations of English outputs on the Arabic Flickr dataset (ElJundi et al., 2020).

