"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks
Mohammad Sadegh Rasooli1∗, Chris Callison-Burch2, Derry Tanti Wijaya3
1 Microsoft
2 Department of Computer and Information Science, University of Pennsylvania
3 Department of Computer Science, Boston University
mrasooli@microsoft.com, ccb@seas.upenn.edu, wijaya@bu.edu
∗ Research was conducted at The University of Pennsylvania.

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1655–1670, November 7–11, 2021. © 2021 Association for Computational Linguistics.

Abstract

We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for a seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our "wikily" supervised translation models to unsupervised image captioning, and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a translated version of the English captioning data, using our wikily-supervised translation models. Our captioning results on Arabic are slightly better than those of its supervised model. In dependency parsing, we translate a large amount of monolingual text, and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.

1 Introduction

Developing machine translation models without using bilingual parallel text is an intriguing research problem with real applications: obtaining a large volume of parallel text for many languages is hard if not impossible. Moreover, translation models could be used in downstream cross-lingual tasks in which annotated data does not exist for some languages. There has recently been a great deal of interest in unsupervised neural machine translation (e.g. Artetxe et al. (2018a); Lample et al. (2018a,c); Conneau and Lample (2019); Song et al. (2019a); Kim et al. (2020); Tae et al. (2020)). Unsupervised neural machine translation models often perform nearly as well as supervised models when translating between similar languages, but they fail to perform well in low-resource or distant languages (Kim et al., 2020) or on out-of-domain monolingual data (Marchisio et al., 2020). In practice, the highest need for unsupervised models is to expand beyond high-resource, similar European language pairs.

There are two key goals in this paper: our first goal is developing accurate translation models for low-resource distant languages without any supervision from a supervised model or gold-standard parallel data. Our second goal is to show that our machine translation models can be directly tailored to downstream natural language processing tasks. In this paper, we showcase our claim in cross-lingual image captioning and cross-lingual transfer of dependency parsers, but this idea is applicable to a wide variety of tasks.

We present a fast and accurate approach for learning translation models using Wikipedia. Unlike unsupervised machine translation that solely relies on raw monolingual data, we believe that we should not neglect the availability of incidental supervisions from online resources such as Wikipedia. Wikipedia contains articles in nearly 300 languages and more languages might be added in the future, including indigenous languages and dialects of different regions in the world. Different from similar recent work (Schwenk et al., 2019a), we do not rely on any supervision from supervised translation models. Instead, we leverage the fact that many first sentences in linked Wikipedia pages are rough translations, and furthermore, many captions of the same images are similar sentences, sometimes translations.
Figure 1 shows a real example of a pair of linked Wikipedia pages in Arabic and English in which the titles, first sentences, and also the image captions are rough translations of each other. Our method learns a seed bilingual dictionary from a small collection of first sentence pairs, titles and captions, and then learns cross-lingual word embeddings. We make use of the cross-lingual word embeddings to extract parallel sentences from Wikipedia. Our experiments show that our approach improves over strong unsupervised translation models for low-resource languages: we improve the BLEU score of English→Gujarati from 0.6 to 15.2, and English→Kazakh from 0.8 to 12.1.

Figure 1: A pair of Wikipedia documents in Arabic and English, along with the same image with two captions.

In the realm of downstream tasks, we show that we can easily use our translation models to generate high-quality translations of the MS-COCO (Chen et al., 2015) and Flickr (Hodosh et al., 2013) datasets, and train a cross-lingual image captioning model in a multi-task pipeline paired with machine translation, in which the model is initialized by the parameters from our translation model. Our results on Arabic captioning show a BLEU score of 5.72, slightly better than a supervised captioning model with a BLEU score of 5.22. As another task, in dependency parsing, we first translate a large amount of monolingual data using our translation models and use the translations in an annotation projection framework; our approach outperforms recent work on cross-lingual transfer of dependency parsers, and it performs particularly better in low-resource languages.

A summary of our contributions is as follows: 1) We propose a simple, fast and effective approach towards using the Wikipedia monolingual data for machine translation without any explicit supervision. Our mining algorithm easily scales to large comparable data using limited computational resources. We achieve very high BLEU scores for distant languages, especially those in which current unsupervised methods perform very poorly. 2) We propose novel methods for leveraging our translation models in image captioning. We show how a combination of translating the caption training data and multi-task learning with English captioning as well as translation improves performance. Our results on Arabic captioning are slightly superior to those of a supervised captioning model trained on gold-standard datasets. 3) We propose a novel modification to the annotation projection method in order to be able to leverage our translation models. Our results on dependency parsing are better than previous work in most cases, and perform similarly to using gold-standard parallel datasets. Our translation and captioning code and models are publicly available online.

2 Background

In this section, we briefly describe the main concepts that we repeatedly use throughout the paper.

Supervised neural machine translation  Supervised machine translation uses a parallel text P = {(s_i, t_i)}_{i=1}^n in which each sentence s_i ∈ l_1 is a translation of t_i ∈ l_2. For having a high-quality translation model, we usually need a large amount of parallel text; e.g. the Arabic-English United Nations parallel text (Ziemski et al., 2016) contains ~18M sentences. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) for which the likelihood of the training data is maximized by maximizing the log-likelihood of predicting each target word given its previous predicted words and the source sequence:

L(P) = \sum_{i=1}^{n} \sum_{j=1}^{|t_i|} \log p(t_{i,j} \mid t_{i,k<j}, s_i)
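To make the objective above concrete, here is a minimal PyTorch-style sketch of the token-level log-likelihood; it assumes a generic teacher-forced encoder-decoder (`model`) and a padding index, both of which are illustrative choices rather than details taken from the paper.

```python
import torch.nn.functional as F

PAD_ID = 0  # illustrative padding index

def nll_loss(model, src_batch, tgt_batch):
    """Negative of L(P) for one batch: sum over sentences and target
    positions of -log p(t_{i,j} | t_{i,<j}, s_i).

    `model(src, tgt_in)` is assumed to return logits of shape
    (batch, tgt_len, vocab) under teacher forcing."""
    tgt_in, tgt_out = tgt_batch[:, :-1], tgt_batch[:, 1:]
    logits = model(src_batch, tgt_in)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_out.reshape(-1),
        ignore_index=PAD_ID,   # padded positions do not contribute
        reduction="sum",       # matches the double sum in L(P)
    )
```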
Unsupervised machine translation  Unsupervised translation relies on monolingual data only. Pretrained language models usually mask parts of every input sentence and try to uncover the masked words (Devlin et al., 2019). The monolingual language models are used along with iterative back-translation (Hoang et al., 2018) to learn unsupervised translation. An input sentence s is translated to t′ using the current model θ; the model then assumes that (t′, s) is a gold-standard translation, and uses the same training objective as supervised translation. The main assumption here is that languages have distributional similarities, and these similarities can be captured by pretrained multilingual language models (Conneau et al., 2020).

Dependency parsing  Dependency parsing algorithms capture the best scoring dependency trees for sentences among an exponential number of possible dependency trees. A valid dependency tree for a sentence s = s_1, ..., s_n assigns a head h_i for each word s_i where 1 ≤ i ≤ n, 0 ≤ h_i ≤ n and h_i ≠ i. The zeroth word represents a dummy root token as an indicator for the root of the sentence. For more details about efficient parsing algorithms, we encourage the reader to see Kübler et al. (2009).

Annotation projection  Annotation projection is an effective method for transferring supervised annotation from a rich-resource language to a low-resource language through translated text (Yarowsky et al., 2001). Having a parallel data P = {(s_i, t_i)}_{i=1}^n, and supervised source annotations for the source sentences s_i, we transfer those annotations through word translation links 0 ≤ a_i^(j) ≤ |t_i| for 1 ≤ j ≤ |s_i|, where a_i^(j) = 0 indicates a null alignment. The alignment links are learned in an unsupervised fashion using unsupervised word alignment algorithms (Och and Ney, 2003a). In dependency parsing, if h_i = j and a^(j) = k and a^(i) = m, we project the dependency k → m (i.e. h_m = k) to the target side. Previous work (Rasooli and Collins, 2017, 2019) has shown that annotation projection only works when a large amount of translation data exists. In the absence of parallel data, we create artificial parallel data using our translation models. Figure 2 shows an example of annotation projection using translated text.
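The projection rule above (if h_i = j, a^(j) = k and a^(i) = m, then h_m = k) can be written as a short sketch. The representation here, 0-based indices with None for unaligned or headless positions, is a convention chosen for illustration, not the paper's data format.

```python
def project_heads(src_heads, src2tgt, tgt_len):
    """Project a source dependency tree onto the target sentence.

    src_heads: src_heads[i] is the head index of source word i (None for the root).
    src2tgt:   src2tgt[i] is the target word aligned to source word i, or None.
    Returns a list of projected head indices for the target side
    (None where nothing was projected)."""
    tgt_heads = [None] * tgt_len
    for i, j in enumerate(src_heads):     # dependency j -> i on the source side
        if j is None:
            continue
        k, m = src2tgt[j], src2tgt[i]     # aligned images of head and dependent
        if k is not None and m is not None:
            tgt_heads[m] = k              # project k -> m, i.e. h_m = k
    return tgt_heads
```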
In the absence of We use the intersected alignments to extract highly parallel data, we create artificial parallel data using confident word-to-word connections. Finally, we our translation models. Figure 2 shows an example pick the most frequently aligned word for each of annotation projection using translated text. word in English as translation. This set serves as a bilingual dictionary D. 3 Learning Translation from Wikipedia Given two monolingual trained word embed- dings ve ∈ RNe ×d and vf ∈ RNf ×d , and the ex- The key component of our approach is to leverage tracted bilingual dictionary D, we use the method the multilingual cues from linked Wikipedia pages of Faruqui and Dyer (2014) to project these two em- across languages. Wikipedia is a great comparable bedding vectors to a shared cross-lingual space.2 data in which many of its pages explain entities This method uses a bilingual dictionary along with in the world in different languages. In most cases, 2 first sentences define or introduce the mentioned There are more recent approaches such as (Lample et al., 2018b). Comparing different embedding methods is not the entity in that page (e.g. Figure 1). Therefore, we focus of this paper, thereby we leave further investigation to observe that many first sentence pairs in linked future work. 1657
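A minimal sketch of the dictionary-extraction step in §3.2, assuming the two alignment directions have already been computed (e.g., with an unsupervised aligner such as fast_align) and are given as per-sentence sets of (English index, foreign index) links; all names here are illustrative.

```python
from collections import Counter, defaultdict

def extract_dictionary(seed_pairs, fwd_alignments, bwd_alignments):
    """seed_pairs:          list of (english_tokens, foreign_tokens)
    fwd/bwd_alignments:     per-sentence sets of (en_idx, fg_idx) links
    Returns {english_word: most frequently aligned foreign word}."""
    counts = defaultdict(Counter)
    for (en, fg), fwd, bwd in zip(seed_pairs, fwd_alignments, bwd_alignments):
        for i, j in fwd & bwd:                  # keep only intersected links
            counts[en[i]][fg[j]] += 1
    return {en_word: fg_counts.most_common(1)[0][0]
            for en_word, fg_counts in counts.items()}
```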
Figure 2: An example of annotation projection for which the source (English, on top) is a translation of the target (Romanian) with our wikily translation model: "The International Crisis Group recently suggested moving responsibility for pension to state level, to eliminate some of the problems." / "Grupul International de Criza a sugerat recent mutarea responsabilitatii pentru pensii la nivelul statului, pentru a elimina unele dintre probleme." The source side is parsed with supervised Stanza (Qi et al., 2020) and the parse tree is projected using Giza++ (Och and Ney, 2003) after intersecting alignments. As shown in the figure, some words have missing dependencies.

Figure 3 (algorithm overview).
Definitions: 1) e is English, f is the foreign language, and g is a language similar to f; 2) learn_dict(P) extracts a bilingual dictionary from the parallel data P; 3) t(x|m) translates input x given model m; 4) pretrain(x) pretrains on monolingual data x using MASS (Song et al., 2019a); 5) train(P|m) trains on parallel data P initialized by model m; 6) bt_train(x1, x2|m) trains iterative back-translation on monolingual data x1 ∈ e and x2 ∈ f initialized by model m.
Inputs: 1) Wikipedia documents w(e), w(f), and w(g); 2) monolingual word embedding vectors ve and vf; 3) the set of linked pages from Wikipedia COMP, their aligned titles T, and their first sentence pairs F; 4) the set of paired image captions C; 5) gold-standard parallel data P(e,g).
Algorithm:
→ Learn bilingual dictionary and embeddings
  S = F ∪ C ∪ T
  D(f,e) = learn_dict(S)
  D(g,e) = learn_dict(P(e,g))          ▷ Related language data
  Learn ve → ve′ and vf → vf′ using D(f,e) ∪ D(g,e)
→ Mine parallel data
  Extract comparable sentences Z from COMP
  Extract P(f,e) from Z                 ▷ Mined data
  P(f,e) = P(f,e) ∪ T
→ Train MT with pretraining and back-translation
  θ0 = pretrain(w(e) ∪ w(f) ∪ w(g))     ▷ MASS training
  θ = train(P(f,e) ∪ P(g,e) | θ0)       ▷ NMT training
  P(e→f) = (t(w(f)|θ), w(f))

3.3 Mining Parallel Sentences

Besides filtering sentence pairs with different numerical values (e.g. sentences containing 2019 in the source and 1987 in the target), we use a modified version of cosine similarity between words:

sim(s_i, t_j) = 1.0 if (s_i, t_j) ∈ D, and cos(s_i, t_j) otherwise.

Using the above definition of word similarity, we use the average-maximum similarity between pairs of sentences:

score(s, t) = (1/n) \sum_{i=1}^{n} \max_{j=1..m} sim(s_i, t_j)

From a pool of candidates, we pick those pairs that have the highest score in both directions.
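A small sketch of the sentence-scoring function in §3.3, assuming pre-computed cross-lingual word vectors given as dictionaries from word to numpy array; out-of-vocabulary handling is omitted for brevity and the names are illustrative.

```python
import numpy as np

def word_sim(s_word, t_word, emb_e, emb_f, bilingual_dict):
    """Modified cosine: 1.0 for dictionary pairs, plain cosine otherwise."""
    if bilingual_dict.get(s_word) == t_word:
        return 1.0
    u, v = emb_e[s_word], emb_f[t_word]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def sentence_score(src_tokens, tgt_tokens, emb_e, emb_f, bilingual_dict):
    """Average over source words of the maximum similarity to any target word."""
    maxima = [max(word_sim(s, t, emb_e, emb_f, bilingual_dict) for t in tgt_tokens)
              for s in src_tokens]
    return sum(maxima) / len(maxima)
```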
3.4 Leveraging Similar Languages

In many low-resource scenarios, the number of linked Wikipedia pages, and therefore the amount of mined parallel text, is small. In such cases, we additionally leverage a similar language g for which gold-standard parallel data with English, P(e,g), is available (e.g. Hindi for Gujarati or Russian for Kazakh), as shown in Figure 3.
3.5 Translation Model and Image Encoder

Our model builds on the Transformer implementations from HuggingFace (Wolf et al., 2019) and Pytorch (Paszke et al., 2019) with a shared SentencePiece (Kudo and Richardson, 2018) vocabulary. All input and output token embeddings are summed with the language id embedding. The first token of every input and output sentence is the language ID. Our training pipeline assumes that the encoder and decoder are shared across different languages, except that we use a separate output layer for each language in order to prevent input copying (Artetxe et al., 2018b; Sen et al., 2019). We pretrain the model on a tuple of three Wikipedia datasets for the three languages g, f, and e using the MASS model (Song et al., 2019a). The MASS model masks a contiguous span of input tokens and recovers that span in the output sequence.

To facilitate multi-task learning with image captioning, our model has an image encoder that is used in the image captioning case (more details in §4.1). In other words, the decoder is shared between the translation and captioning tasks. We use the pretrained ResNet-152 model (He et al., 2016) from Pytorch to encode every input image. We extract the final layer as a 7 × 7 grid vector (g ∈ R^(7×7×d_g)), project it to a new space by a linear transformation (g′ ∈ R^(49×d_t)), and then add location embeddings (l ∈ R^(49×d_t)) by entry-wise addition. Afterwards, we treat the 49 vectors as encoded text representations, as if a sentence with 49 words had occurred. This is similar to, but not exactly the same as, the Virtex model (Desai and Johnson, 2021).
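A minimal PyTorch sketch of the image-to-sequence step described above (ResNet-152 grid features projected to the text dimension plus learned location embeddings). The module name, the use of nn.Embedding for locations, and the assumed 224×224 input resolution are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class GridImageEncoder(nn.Module):
    """Turns an image into 49 pseudo-token vectors for the shared decoder."""

    def __init__(self, d_text):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        # drop the average-pool and classifier, keep the 7x7x2048 feature grid
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, d_text)   # g -> g' in R^{49 x d_t}
        self.loc = nn.Embedding(49, d_text)   # location embeddings l

    def forward(self, images):                # images: (B, 3, 224, 224)
        g = self.backbone(images)             # (B, 2048, 7, 7)
        g = g.flatten(2).transpose(1, 2)      # (B, 49, 2048)
        pos = torch.arange(49, device=images.device)
        return self.proj(g) + self.loc(pos)   # treated like 49 encoded words
```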
3.6 Back-Translation: One-shot and Iterative

Finally, we use the back-translation technique to improve the quality of our models. Back-translation is done by translating a large amount of monolingual text to and from the target language. The translated texts serve as noisy input text, paired with the monolingual data as the silver-standard translations. Previous work (Sennrich et al., 2016b; Edunov et al., 2018) has shown that back-translation is a very simple but effective technique for improving the quality of translation models. Henceforth, we refer to this method as one-shot back-translation. Another approach is iterative back-translation (Hoang et al., 2018), the most popular approach in unsupervised translation (Artetxe et al., 2018b; Conneau and Lample, 2019; Song et al., 2019a). The main difference from one-shot translation is that the model uses an online approach and updates its parameters in every batch.

We empirically find one-shot back-translation faster to train but with much less potential to reach a high translation accuracy. A simple and effective way to have both a reliable and an accurate model is to first initialize a model with one-shot back-translation, and then apply iterative back-translation. The model that is initialized with a more accurate model reaches a higher accuracy.

4 Cross-Lingual Tasks

In this section, we describe our approaches for tailoring our translation models to cross-lingual tasks. Note that henceforth we assume that our translation model training is finished, and that we have access to trained translation models for the cross-lingual tasks.

4.1 Cross-Lingual Image Captioning

Having gold-standard image captioning training data I = {(I_i, c_i)}_{i=1}^n, where I_i is the image as pixel values and c_i = c_i^(1), ..., c_i^(k_i) is the textual description with k_i words, our goal is to learn a captioning model that is able to describe new (unseen) images. As described in §3.5, we use a transformer decoder from our translation model and a ResNet image encoder (He et al., 2016) for our image captioning pipeline. Unfortunately, annotated image captioning datasets do not exist in many languages. Having our translation model parameters θ*, we can use its translation functionality to translate each caption c_i to c_i′ = translate(c_i | θ*). Afterwards, we have a translated annotated dataset I′ = {(I_i, c_i′)}_{i=1}^n in which the textual descriptions are not gold-standard but translations of the English captions. Figure 4 shows a real example from MS-Coco (Chen et al., 2015) in which the Arabic translations are provided by our translation model. Furthermore, to augment our learning capability, we initialize our decoder with the decoding parameters of θ*, and also continue training with both English captioning and translation.
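A rough sketch of the multi-task training loop implied by §4.1: the decoder is initialized from the wikily translation checkpoint and is updated with both captioning batches (over the translated captions) and translation batches. The method names `caption_loss` and `translation_loss` are placeholders for whatever loss functions the shared model exposes; nothing here is the paper's actual code.

```python
import itertools

def multitask_training_loop(caption_loader, translation_loader, model, optimizer):
    """Alternate captioning and translation updates on the shared decoder.

    `model` is assumed to be initialized from the trained wikily translation
    checkpoint and to expose two hypothetical losses:
      model.caption_loss(images, captions)   # image -> caption decoding
      model.translation_loss(src, tgt)       # sentence -> sentence decoding
    """
    for (images, captions), (src, tgt) in zip(
            caption_loader, itertools.cycle(translation_loader)):
        optimizer.zero_grad()
        loss = model.caption_loss(images, captions) + model.translation_loss(src, tgt)
        loss.backward()
        optimizer.step()
```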
4.2 Cross-Lingual Dependency Parsing

Assuming that we have a large body of monolingual text, we translate that monolingual text to create artificial parallel data. We run unsupervised word alignments on the artificial parallel text. Following previous work (Rasooli and Collins, 2015; Ma and Xia, 2014), we run Giza++ (Och and Ney, 2003b) alignments in both the source-to-target and target-to-source directions, and extract intersected alignments to keep high-precision one-to-one alignments. We run a supervised dependency parser for English as our rich-resource language. Then, we project dependencies to the target language sentences via the word alignment links. Inspired by previous work (Rasooli and Collins, 2015), to remove noisy projections, we keep only those sentences in which at least 50% of the words, or 5 consecutive words, on the target side have projected dependencies.
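The density filter described above can be sketched as follows (here, None marks a target word that received no projected dependency; the representation is illustrative).

```python
def keep_projected_sentence(tgt_heads, min_ratio=0.5, min_run=5):
    """Keep a target sentence if at least half of its words, or at least
    five consecutive words, received a projected head."""
    covered = [h is not None for h in tgt_heads]
    if sum(covered) >= min_ratio * len(covered):
        return True
    run = best = 0
    for c in covered:
        run = run + 1 if c else 0
        best = max(best, run)
    return best >= min_run
```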
Figure 4: An image from MS-Coco (Chen et al., 2015) with gold-standard English captions and Arabic translations from our wikily translation model:
"This is an open box containing four cucumbers." / "وهذا صندوق مفتوح يحتوي على أربعة خيار."
"An open food container box with four unknown food items." / "صندوق حاوية طعام مفتوح مع أربعة مواد غذائية مجهولة."
"A small box filled with four green vegetables." / "مربع صغير مليء بأربعة الخضروات الخضراء."
"An opened box of four chocolate bananas." / "علبة مفتوحة من أربعة من الموز."
"An open box contains an unknown, purple object." / "مربع مفتوح يحتوي على كائن غير معروف أرجواني."

Table 1: Data sizes for different pairs. We use a sample of English sentences with similar sizes for each dataset.
Direction | ar-en | gu-en | kk-en | ro-en
Foreign docs | 1.0m | 28k | 230k | 400k
Paired docs | 745k | 7.3k | 80k | 270k
First sents. | 205k | 3.2k | 52k | 78k
Captions | 92k | 2.2k | 1.9k | 35k
Comparable pairs | 0.1b | 14m | 32m | 64m
Mined sents. | 1.7m | 49k | 183k | 675k
BT | 2.1m | 1.5m | 2.2m | 2.1m
Iterative BT | 4.0m | 3.8m | 4.0m | 6.1m

5 Experiments

In this section, we provide details about our experimental settings and results for translation, captioning, and dependency parsing. We put more details about our settings, as well as a thorough analysis of our results, in the supplementary material.

5.1 Datasets and Settings

Languages  We focus on four language pairs: Arabic-English, Gujarati-English, Kazakh-English, and Romanian-English. We choose these pairs to provide enough evidence that our model works for distant languages and morphologically rich languages, as well as for similar languages. As for similar languages, we use Persian for Arabic (written with very similar scripts and sharing many words), Hindi for Gujarati (similar languages), Russian for Kazakh (written with the same script), and Italian for Romanian (Romance languages).

Monolingual and Translation Datasets  We use a shared SentencePiece vocabulary (Kudo and Richardson, 2018) with size 60K. Table 1 shows the sizes of the Wikipedia data in the different languages. For evaluation, we use the Arabic-English UN data (Ziemski et al., 2016), WMT 2019 data (Barrault et al., 2019) for Gujarati-English and Kazakh-English, and WMT 2016 shared task data (Bojar et al., 2016) for Romanian-English. Following previous work (Sennrich et al., 2016a), diacritics are removed from the Romanian data. For more details about the other datasets and their sizes, we refer the reader to the supplementary material.

Pretraining  We pretrain four models on 3-tuples of languages using a single NVIDIA Geforce RTX 2080 TI with 11GB of memory. We create batches of 4K words, run pretraining for two million iterations where we alternate between language batches, and accumulate gradients for 8 steps. We use the apex library (https://github.com/NVIDIA/apex) for FP-16 tensors. This whole process takes four weeks on a single GPU. We use the Adam optimizer (Kingma and Ba, 2015) with an inverse square root schedule, a learning rate of 1e-4, 4000 warm-up steps, and a dropout probability of 0.1.

Translation Training  Table 1 shows the sizes of the different types of datasets in our experiments. We pick comparable candidates for sentence pairs whose lengths are within a range of half to twice of each other. As we see, the final size of the mined datasets heavily depends on the number of paired English-target language Wikipedia documents. We train our translation models initialized by the pretrained models. More details about our hyper-parameters are in the supplementary material. All of our evaluations are conducted using SacreBLEU (Post, 2018) except for en↔ro, for which we use the BLEU score (Papineni et al., 2002) from the Moses decoder scripts (Koehn et al., 2007) for the sake of comparison to previous work.

Image Captioning  We use the Flickr (Hodosh et al., 2013) and MS-Coco (Chen et al., 2015) datasets for English, and the gold-standard Arabic Flickr dataset (ElJundi et al., 2020) for evaluation. (We also tried Conceptual Captions (Sharma et al., 2018) in our initial experiments but observed drops in performance; previous work (Singh et al., 2020) has observed a similar problem with Conceptual Captions as a noisy crawled caption dataset.) The Arabic test set has 1000 images with 3 captions per image. We translate all the English training captions to Arabic to obtain translated caption data. The final training data contains 620K captions for about 125K unique images. Throughout the experiments, we use the pretrained Resnet-152 model (He et al., 2016) from Pytorch (Paszke et al., 2019), and let it fine-tune during our training pipeline. Each training batch contains 20 images. We accumulate gradients for 16 steps, and use a dropout of 0.1 on the projected image output representations. Other training parameters are the same as in our translation training. To keep our pipeline fully unsupervised, we use translated development sets to pick the best checkpoint during training.
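Returning to the Pretraining setup above, a small sketch of an inverse-square-root learning-rate schedule with warm-up; only the peak learning rate of 1e-4 and the 4000 warm-up steps come from the text, while the exact shape of the warm-up ramp is an assumption.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Learning rate at a given training step (step >= 1)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                 # linear warm-up
    return peak_lr * (warmup_steps ** 0.5) / (step ** 0.5)   # ~ 1/sqrt(step) decay
```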
Dependency Parsing  We use the Universal Dependencies v2.7 collection (Zeman et al., 2020) for Arabic, Kazakh, and Romanian. We use the Stanza (Qi et al., 2020) pretrained supervised models to obtain supervised parse trees for Arabic and Romanian, and the UDPipe (Straka et al., 2016) pretrained model for Kazakh. We translate about 2 million sentences from each language to English, and also 2 million English sentences to Arabic. We use a simple modification to Stanza to facilitate training on partially projected trees by masking dependency and label assignments for words with missing dependencies. All of our training on projected dependencies is conducted blindly with 100k training steps using the default parameters of Stanza (Qi et al., 2020). As for gold-standard parallel data, we use our supervised translation training data for Romanian-English and Kazakh-English, and a sample of 2 million sentences from the UN Arabic-English data, whose large size otherwise slows down word alignment significantly. For the Kazakh wikily projections, due to low supervised POS accuracy, we use the projected POS tags for projected words and supervised tags for unprojected words. We observe a two percent increase in performance from using the projected tags.
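One way to realize the masking modification described above is to exclude unprojected words from the parser's loss. The following sketch shows the idea with a generic biaffine-style parser; it is only an illustration, not the actual change made to Stanza.

```python
import torch.nn.functional as F

def partially_projected_loss(arc_scores, label_scores, gold_heads, gold_labels):
    """arc_scores:   (B, N, N+1) scores of each head candidate per word
    label_scores:    (B, N, L) scores of each dependency label per word
    gold_heads/labels: (B, N) with -1 marking words whose dependency was not
    projected; those positions contribute no gradient."""
    arc_loss = F.cross_entropy(arc_scores.reshape(-1, arc_scores.size(-1)),
                               gold_heads.reshape(-1), ignore_index=-1)
    label_loss = F.cross_entropy(label_scores.reshape(-1, label_scores.size(-1)),
                                 gold_labels.reshape(-1), ignore_index=-1)
    return arc_loss + label_loss
```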
5.2 Translation Results

Table 2 shows the results of different settings in addition to baseline and state-of-the-art results. We see that Arabic, as a clear exception, needs more rounds of training: we train our Arabic model once again on the mined data by initializing it with our back-translation model. (We have seen that during multi-tasking with image captioning, the translation BLEU score for Arabic-English significantly improves. We initially thought that multi-tasking improves both translation and captioning, but further investigation showed that it is actually due to a lack of training for Arabic. We tried the same procedure for the other languages but did not observe any further gains.) We have not seen further improvement from back-translation. To have a fair comparison, we list the best supervised models for all language pairs (to the best of our knowledge). In low-resource settings, we outperform strong supervised models that are boosted by back-translation. In high-resource settings, our Arabic models achieve very high performance, but given that the parallel data for Arabic has 18M sentences, it is quite impossible to reach that level of accuracy.

Figure 5 shows a randomly chosen example from the Gujarati-English development data. As depicted, the model after back-translation captures roughly the core meaning of the sentence, with some divergence from exactly matching the reference. The final iterative back-translation output almost catches a correct translation. We also see that the word "creative" appears in the Google Translate output, a model that is most likely trained on much larger parallel data than what is currently publicly available. In general, unsupervised translation performs very poorly compared to our approach in all directions.

5.3 Captioning Results

Table 4 shows the final results on the Arabic test set using the SacreBLEU measure (Post, 2018). First, we should note that, similar to ElJundi et al. (2020), we see lower scales of BLEU scores due to the morphological richness of Arabic. We see that if we initialize our model with the translation model and multi-task it with translation and also English captioning, we achieve much higher performance. It is interesting to observe that translating the English output on the test data to Arabic achieves a much lower result. This is a strong indicator of the strength of our approach. We also see that supervised translation fails to perform well. This might be due to the UN translation training dataset, which has a different domain from the caption dataset. Furthermore, we see that our model outperforms Google Translate, which is a strong machine translation system, and is actually what was used as seed data for manual revision in the Arabic dataset. Finally, it is interesting to see that our model outperforms supervised captioning. Multi-tasking makes translation performance slightly worse. Figure 6 shows a randomly picked example with different model outputs.
Table 2: BLEU scores for different models. Reference results are from *: our implementation, ¹: Kim et al. (2020), ²: Li et al. (2020), ³: Liu et al. (2020) (supervised), ⁴: Tran et al. (2020) (unsupervised with mined parallel data).
Model | ar→en | en→ar | gu→en | en→gu | kk→en | en→kk | ro→en | en→ro
UNMT: Conneau and Lample (2019) | – | – | – | – | – | – | 31.8 | 33.3
UNMT: Song et al. (2019a) (MASS; 8 GPUs) | – | – | – | – | – | – | 33.1 | 35.2
Best published results | 11.0* | 9.4* | 0.6¹ | 0.6¹ | 2.0¹ | 0.8¹ | 37.6⁴ | 36.3²
Wikily: First sentences + captions + titles | 6.1 | 3.1 | 0.7 | 1.1 | 2.3 | 1.0 | 2.0 | 1.9
Wikily: Mined corpora | 23.1 | 19.7 | 4.2 | 4.9 | 2.8 | 1.6 | 22.1 | 21.6
Wikily: + Related language | – | – | 9.1 | 7.8 | 7.3 | 2.3 | 23.2 | 21.5
Wikily: + One-shot back-translation (bt-beam=4) | 23.0 | 18.8 | 13.8 | 13.9 | 7.0 | 12.1 | 25.2 | 28.1
Wikily: + Iterative back-translation (bt-beam=1) | 24.4 | 18.9 | 13.3 | 15.2 | 9.0 | 10.8 | 32.5 | 33.0
Wikily: + Retrain on mined data | 30.6 | 23.4 | – | – | – | – | – | –
(Semi-)Supervised | 48.9* | 40.6* | 14.2¹ | 4.0¹ | 12.5¹ | 3.1¹ | 39.9³ | 38.5³

Table 3: Dependency parsing results on the Universal Dependencies dataset (Zeman et al., 2020). Previous work has used different sub-versions of the Universal Dependencies data, so slight differences are expected.
Method | UD version | Token and POS | Arabic UAS | LAS | BLEX | Kazakh UAS | LAS | BLEX | Romanian UAS | LAS | BLEX
Previous: Rasooli and Collins (2019) | 2.0 | gold/supervised | 61.2 | 48.8 | – | – | – | – | 76.3 | 64.3 | –
Previous: Ahmad et al. (2019) | 2.2 | gold | 38.1 | 28.0 | – | – | – | – | 65.1 | 54.1 | –
Previous: Kurniawan et al. (2021) | 2.2 | gold | 48.3 | 29.9 | – | – | – | – | – | – | –
Projection: Wikily translation | 2.7 | gold | 62.5 | 50.7 | 46.3 | 46.8 | 28.5 | 25.0 | 74.1 | 57.7 | 52.6
Projection: Wikily translation | 2.7 | supervised | 60.2 | 48.7 | 42.1 | 46.2 | 27.8 | 14.1 | 73.6 | 57.4 | 50.9
Projection: Gold-standard parallel data | 2.7 | gold | 61.5 | 47.3 | 42.4 | 22.2 | 9.3 | 7.9 | 75.9 | 62.4 | 57.3
Projection: Gold-standard parallel data | 2.7 | supervised | 59.1 | 45.3 | 38.5 | 21.8 | 9.2 | 3.8 | 75.6 | 62.0 | 55.6
Supervised | 2.7 | supervised | 84.2 | 79.8 | 72.7 | 48.0 | 29.8 | 13.7 | 90.8 | 86.0 | 80.0

Figure 5: An example of a Gujarati sentence and its outputs from different models, as well as Google Translate.
Input (Gujarati): અથાત આપણે પહે લા તુલનાએ વધુ રચના મક બનવું પડશે.
Unsupervised: Ut numerous ીit the mother, onwards, in theover અિધકાંશexualit theotherit theIN રોડ 19
First sentences + captions + titles: A view of the universe from the present to the present day.
Mined corpora: For example, if the ghazal is more popular than ghazal.
+ Related language: We need to become more creative than before.
+ One-shot back-translation: For example, we must become more creative than before.
+ Iterative back-translation: Meanwhile, we'll have to become more constructive than before.
Google Translate: That means we have to be more creative than before.
Reference: That means we have to be more constructive than before.

Figure 6: An example of different outputs in our captioning experiments, both for English and Arabic, as well as Arabic translations of English outputs, on the Arabic Flickr dataset (ElJundi et al., 2020).
English gold: "A child on a red slide." / "A little boy sits on a slide on the playground." / "A little boy slides down a bright red corkscrew slide." / "A little boy slides down a red slide." / "a young boy wearing a blue outfit sliding down a red slide."
English supervised: "A boy is sitting on a red slide."
En→Ar supervised translate: "صبي صبي يجلس على شاحنة خفيفة."
En→Ar unsupervised translate: "الطفل يجلس على شريحة حمراء."
En→Ar Google translate: "صبي يجلس على شريحة حمراء."
Supervised MT: "صبي صبي على شظية"
Unsupervised (mt + ar + en): "يجلس صبي صغير على شريحة برتقالية."
Unsupervised (mt + ar): "صبي صغير يجلس على شريحة حمراء."
Supervised: "صبي في قميص أزرق يقفز في الهواء"
Arabic gold: "طفل على منزلقة حمراء" / "صبي صغير يجلس على زلاجة في الملعب" / "ينزلق صبي صغير أسفل منزلقة حمراء"
We see that the two outputs from our approach with multi-tasking are roughly the same, but one of them has more syntactic order overlap with the reference, while both orders are correct in Arabic as a free-word-order language. The word برتقالية means "orange", which is close to حمراء, which means "red". The word شريحة means "slide", which is correct, but other meanings of this word exist in the reference. In general, we observe that although superficially the BLEU scores for Arabic are low, this is mostly due to its lexical diversity, free word order, and morphological complexity.
Table 4: Image captioning results evaluated on the Arabic Flickr dataset (ElJundi et al., 2020) using SacreBLEU (Post, 2018). "Pretrained" indicates initializing our captioning model with our translation parameters.
Setting | Supervision | Pretrained | Multi-task EN | Multi-task MT | BLEU@1 | BLEU@4
Translate train data | wikily | ✗ | ✗ | ✗ | 33.1 | 4.57
Translate train data | wikily | ✓ | ✗ | ✗ | 32.9 | 5.28
Translate train data | wikily | ✓ | ✓ | ✗ | 32.8 | 4.37
Translate train data | wikily | ✓ | ✗ | ✓ | 33.3 | 5.72
Translate train data | wikily | ✓ | ✓ | ✓ | 36.8 | 5.60
Translate train data | supervised | ✓ | ✗ | ✗ | 17.7 | 1.26
English test performance | | | | | 68.7 | 20.42
Translate test | wikily | ✓ | ✗ | ✗ | 30.6 | 4.20
Translate test | supervised | ✓ | ✗ | ✗ | 15.8 | 0.92
Translate test | Google | ✓ | ✗ | ✗ | 31.8 | 5.56
Gold | – | ✓ | ✗ | ✗ | 33.7 | 3.76
Gold | – | ✓ | ✓ | ✗ | 37.9 | 5.22

5.4 Dependency Parsing Results

Table 3 shows the results of the dependency parsing experiments. We see that our model performs very well on Romanian, with a UAS of 74, which is much higher than that of Ahmad et al. (2019) and slightly lower than that of Rasooli and Collins (2019), which uses a combination of multi-source annotation projection and direct model transfer. Our work on Arabic outperforms all previous work and performs even better than using gold-standard parallel data. One clear highlight is our result on Kazakh. As mentioned before, by projecting the part-of-speech tags, we achieve roughly 2 percent absolute improvement. Our final results on Kazakh are significantly higher than those obtained using gold-standard parallel text (7K sentences).

6 Related Work

Kim et al. (2020) have shown that unsupervised translation models often fail to provide good translation systems for distant languages. Our work addresses this problem by leveraging Wikipedia data. Pivot languages have been used in previous work (Al-Shedivat and Parikh, 2019), as has the use of related languages (Zoph et al., 2016; Nguyen and Chiang, 2017). Our work only explores a simple idea of adding one similar language pair. Most likely, adding more language pairs and using ideas from recent work might improve the performance.

Wikipedia is an interesting dataset for solving NLP problems including machine translation (Li et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et al., 2015; Wijaya et al., 2017; Ruiter et al., 2019; Srinivasan et al., 2021). The WikiMatrix data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but it uses supervised translation models. Bitext mining has a longer history of research (Resnik, 1998; Resnik and Smith, 2003), in which most efforts rely on a seed supervised translation model (Guo et al., 2018; Schwenk et al., 2019b; Artetxe and Schwenk, 2019; Schwenk et al., 2019a; Jones and Wijaya, 2021). Recently, a number of papers have focused on unsupervised extraction of parallel data (Ruiter et al., 2019; Hangya and Fraser, 2019; Keung et al., 2020; Tran et al., 2020; Kuwanto et al., 2021). Ruiter et al. (2019) focus on using vector similarity of sentences to extract parallel text from Wikipedia; their work does not leverage structural signals from Wikipedia.

Cross-lingual and unsupervised image captioning has been studied in previous work (Gu et al., 2018; Feng et al., 2019; Song et al., 2019b; Gu et al., 2019; Gao et al., 2020; Burns et al., 2020). Unlike previous work, we do not have a supervised translation model. Cross-lingual transfer of dependency parsers has a long history; we encourage the reader to read a recent survey on this topic (Das and Sarkar, 2020). Our work does not use gold-standard parallel data or even supervised translation models to apply annotation projection.

7 Conclusion

We have described a fast and effective algorithm for learning translation systems using Wikipedia. We show that by wisely choosing what to use as seed data, we can have very good seed parallel data for mining more parallel text from Wikipedia. We have also shown that our translation models can be used in downstream cross-lingual natural language processing tasks. In the future, we plan to extend our approach beyond Wikipedia to other comparable datasets like the BBC World Service. A clear extension of this work is to try our approach on other cross-lingual tasks. Moreover, as many captions of the same images in Wikipedia are similar sentences and sometimes translations, multimodal machine translation (Specia et al., 2016; Caglayan et al., 2019; Hewitt et al., 2018; Yao and Wan, 2020) based on this data, or analysis of the data, such as whether more similar languages share more similar captions (Khani et al., 2021), are other interesting avenues.
Acknowledgments

We would like to thank the reviewers and the editor for their useful comments. We also would like to thank Alireza Zareian, Daniel (Joongwon) Kim, Qing Sun, and Afra Feyza Akyurek for their help and useful comments throughout this project. This work is supported in part by DARPA HR001118S0044 (the LwLL program) and the Department of the Air Force FA8750-19-2-3334 (Semi-supervised Learning of Multimodal Representations). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA, the Air Force, or the U.S. Government.

References

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2440–2452, Minneapolis, Minnesota. Association for Computational Linguistics.

Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1184–1197, Minneapolis, Minnesota. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3632–3642, Brussels, Belgium. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In International Conference on Learning Representations.

Mikel Artetxe and Holger Schwenk. 2019. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3197–3203, Florence, Italy. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba, and Lluís Màrquez. 2015. A factory of comparable corpora from Wikipedia. In Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pages 3–13, Beijing, China. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. 2014. HindEnCorp – Hindi-English and Hindi-only corpus for machine translation. In LREC, pages 3550–3555.

Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, and Bryan A. Plummer. 2020. Learning to scale multilingual representations for vision-language tasks. In European Conference on Computer Vision, pages 197–213. Springer.

Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. arXiv preprint arXiv:1903.08678.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Ayan Das and Sudeshna Sarkar. 2020. A survey of the model transfer approaches to cross-lingual dependency parsing. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(5):1–60.

Karan Desai and Justin Johnson. 2021. VirTex: Learning visual representations from textual annotations. In CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Obeida ElJundi, Mohamad Dhaybi, Kotaiba Mokadam, Hazem Hajj, and Daniel Asmar. 2020. Resources and end-to-end neural network models for Arabic image captioning. In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 233–241. INSTICC, SciTePress.

Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, pages 118–119, Dublin, Ireland. European Association for Machine Translation.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden. Association for Computational Linguistics.

Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. 2019. Unsupervised image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4125–4134.

Jiahui Gao, Yi Zhou, Philip L. H. Yu, and Jiuxiang Gu. 2020. Unsupervised cross-lingual image captioning. arXiv preprint arXiv:2010.01288.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. 2018. Unpaired image captioning by language pivoting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 503–519.

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. 2019. Unpaired image captioning via scene graph alignments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10323–10332.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176, Brussels, Belgium. Association for Computational Linguistics.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224–1234, Florence, Italy. Association for Computational Linguistics.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

John Hewitt, Daphne Ippolito, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya, and Chris Callison-Burch. 2018. Learning translations via images with a massively multilingual image dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2566–2576.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24, Melbourne, Australia. Association for Computational Linguistics.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(03):311–325.

Alex Jones and Derry Tanti Wijaya. 2021. Majority voting with bidirectional pre-translation for bitext retrieval.

Omid Kashefi. 2018. Mizan: a large Persian-English parallel corpus. arXiv preprint arXiv:1801.02107.

Phillip Keung, Julian Salazar, Yichao Lu, and Noah A. Smith. 2020. Unsupervised bitext mining and translation via self-trained contextual embeddings. arXiv preprint arXiv:2010.07761.

Nikzad Khani, Isidora Tourni, Mohammad Sadegh Rasooli, Chris Callison-Burch, and Derry Tanti Wijaya. 2021. Cultural and geographical influences on image translatability of words across languages. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 198–209.

Yunsu Kim, Miguel Graça, and Hermann Ney. 2020. When and why is unsupervised neural machine translation useless? In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 35–44, Lisboa, Portugal. European Association for Machine Translation.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86. Citeseer.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi parallel corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Kemal Kurniawan, Lea Frermann, Philip Schulz, and Trevor Cohn. 2021. PPT: Parsimonious parser transfer for unsupervised cross-lingual adaptation. arXiv preprint arXiv:2101.11216.

Garry Kuwanto, Afra Feyza Akyürek, Isidora Chara Tourni, Siyang Li, and Derry Wijaya. 2021. Low-resource machine translation for low-resource languages: Leveraging comparable data, code-switching and compute resources.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018c. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

Shen Li, João V. Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1389–1398. Association for Computational Linguistics.

Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao. 2020. Data-dependent Gaussian prior objective for language generation. In International Conference on Learning Representations.

Wen-Pin Lin, Matthew Snover, and Heng Ji. 2011. Unsupervised language-independent name translation mining from Wikipedia infoboxes. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 43–52.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv cs.CL 2001.08210.