"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks - arXiv.org
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks

Mohammad Sadegh Rasooli¹, Chris Callison-Burch¹, Derry Tanti Wijaya²
¹ Department of Computer and Information Science, University of Pennsylvania
² Department of Computer Science, Boston University
{rasooli, ccb}@seas.upenn.edu, wijaya@bu.edu

arXiv:2104.08384v1 [cs.CL] 16 Apr 2021

Abstract

We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as the cross-lingual tasks of image captioning and dependency parsing, without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data from which to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. a supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English in which the Arabic training data is a wikily translation of the English captioning data. Our captioning results on Arabic are slightly better than those of its supervised model. In dependency parsing, we translate a large amount of monolingual text and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.

1 Introduction

Developing machine translation models without using gold-standard parallel text is an intriguing research problem with real applications: obtaining a large volume of parallel text for many languages is hard if not impossible. Moreover, translation models could be used in downstream cross-lingual tasks in which annotated data does not exist for some languages. There has recently been a great deal of interest in unsupervised neural machine translation (e.g. Artetxe et al. (2018a); Lample et al. (2018a,c); Conneau and Lample (2019); Song et al. (2019a); Kim et al. (2020); Tae et al. (2020)). Unsupervised neural machine translation models often perform nearly as well as supervised models when translating between similar languages, but they fail to perform well in low-resource or distant languages (Kim et al., 2020) or with out-of-domain monolingual data (Marchisio et al., 2020). In practice, the highest need for unsupervised models is to expand beyond high-resource, similar European language pairs.

There are two key goals in this paper. Our first goal is developing accurate translation models for low-resource distant languages without any supervision from a supervised model or gold-standard parallel data. Our second goal is to show that our machine translation models can be directly tailored to downstream natural language processing tasks. In this paper, we showcase our claim in cross-lingual image captioning and cross-lingual transfer of dependency parsers, but this idea is applicable to a wide variety of tasks.

We present a fast and accurate approach for learning translation models using Wikipedia. Unlike unsupervised machine translation that solely relies on raw monolingual data, we believe that we should not neglect the availability of incidental supervision from online resources such as Wikipedia. Wikipedia contains articles in nearly 300 languages, and more languages might be added in the future, including indigenous languages and dialects of different regions in the world. Different from similar recent work (Schwenk et al., 2019a), we do not rely on any supervision from supervised translation models. Instead, we leverage the fact that many first sentences in linked Wikipedia pages are rough translations, and furthermore, many captions of the same images are similar sentences, sometimes translations.
Figure 1 shows a real example of a pair of linked Wikipedia pages in Arabic and English in which the titles, first sentences, and also the image captions are rough translations of each other. Our method learns a seed bilingual dictionary from a small collection of first sentence pairs, titles and captions, and then learns cross-lingual word embeddings. We make use of the cross-lingual word embeddings to extract parallel sentences from Wikipedia. Our experiments show that our approach improves over strong unsupervised translation models for low-resource languages: we improve the BLEU score of English→Gujarati from 0.6 to 15.2, and English→Kazakh from 0.8 to 12.1.

[Figure 1: A pair of Wikipedia documents in Arabic and English, along with the same image with two captions.]

In the realm of downstream tasks, we show that we can easily use our translation models to generate high-quality translations of the MS-COCO (Chen et al., 2015) and Flickr (Hodosh et al., 2013) datasets, and train a cross-lingual image captioning model in a multi-task pipeline paired with machine translation, in which the model is initialized by the parameters of our translation model. Our results on Arabic captioning show a BLEU score of 5.72, which is slightly better than a supervised captioning model with a BLEU score of 5.22. As another task, in dependency parsing, we first translate a large amount of monolingual data using our translation models and then apply transfer using the annotation projection method (Yarowsky et al., 2001; Hwa et al., 2005). Our results show that our approach performs similarly to using gold-standard parallel text in high-resource scenarios, and significantly better in low-resource languages.

A summary of our contributions is as follows:

• We propose a simple, fast and effective approach towards using the Wikipedia monolingual data for machine translation without any explicit supervision. Our mining algorithm easily scales to large comparable data using limited computational resources. We achieve very high BLEU scores for distant languages, especially those for which current unsupervised methods perform very poorly.

• We propose novel methods for leveraging our translation models in image captioning. We show how a combination of translating the caption training data and multi-task learning with English captioning as well as translation improves performance. Our results on Arabic captioning are slightly superior to those of a supervised captioning model trained on gold-standard datasets.

• We propose a novel modification to the annotation projection method in order to be able to leverage our translation models. Our dependency parsing results are better than previous work in most cases, and similar to using gold-standard parallel datasets.

Our code is publicly available online.¹

¹ Our code: https://github.com/rasoolims/ImageTranslate, and our modification to Stanza for training on partially projected trees: https://github.com/rasoolims/stanza

2 Background

In this section, we briefly describe the main concepts that we repeatedly use throughout the paper.
Supervised neural machine translation. Supervised machine translation uses a parallel text P = {(s_i, t_i)}_{i=1}^{n} in which each sentence s_i ∈ l_1 is a translation of t_i ∈ l_2. For having a high-quality translation model, we usually need a large amount of parallel text. Neural machine translation uses sequence-to-sequence models with attention (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) for which the likelihood of the training data is maximized by maximizing the log-likelihood of predicting each target word given its previously predicted words and the source sequence:

L(P) = \sum_{i=1}^{n} \sum_{j=1}^{|t_i|} \log p(t_{i,j} \mid t_{i,<j}, s_i)

Pretrained models mask parts of the input text and try to uncover the masked words (Devlin et al., 2019). In this work, we mainly use the MASS model (Song et al., 2019a), in which a contiguous span of the input is masked and then recovered in the output sequence.

Annotation projection transfers supervised annotations from a rich-resource language to a low-resource language through translated text (Yarowsky et al., 2001). Having a parallel data P = {(s_i, t_i)}_{i=1}^{n} and supervised annotations for the source sentences s_i, we transfer those annotations to the target sentences through word alignments (see Figure 2 for an example).

[Figure 2: An example of annotation projection for which the source (English, on top) is a translation of the target (Romanian) with our wikily translation model, e.g. "The International Crisis Group recently suggested moving responsibility for pension to state level, to eliminate some of the problems." / "Grupul International de Criza a sugerat recent mutarea responsabilitatii pentru pensii la nivelul statului, pentru a elimina unele dintre probleme." The source side is parsed with supervised Stanza (Qi et al., 2020) and the parse tree is projected using Giza++ (Och and Ney, 2003) intersected alignments. As shown in the figure, some words have missing dependencies.]
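To make the projection step concrete, the following is a minimal sketch, under our own simplifications, of projecting dependency heads through intersected word alignments; it is an illustration rather than the authors' released code, and the input format and toy example are assumptions.

```python
from typing import Dict, List, Optional

def intersect_alignments(src2tgt: Dict[int, int], tgt2src: Dict[int, int]) -> Dict[int, int]:
    """Keep only alignment links agreed on by both directions (high-precision, one-to-one)."""
    return {s: t for s, t in src2tgt.items() if tgt2src.get(t) == s}

def project_heads(src_heads: List[int], align: Dict[int, int], tgt_len: int) -> List[Optional[int]]:
    """Project source dependency heads onto the target sentence.

    src_heads[i] is the head index of source token i (-1 for the root).
    align maps a source token index to a target token index.
    Returns projected head indices; None marks a missing dependency.
    """
    proj: List[Optional[int]] = [None] * tgt_len
    for s_dep, t_dep in align.items():
        s_head = src_heads[s_dep]
        if s_head == -1:            # the source root projects to the target root
            proj[t_dep] = -1
        elif s_head in align:       # both ends of the arc are aligned
            proj[t_dep] = align[s_head]
    return proj

# Toy example: a 3-token source whose root is token 1.
src_heads = [1, -1, 1]
align = intersect_alignments({0: 0, 1: 1, 2: 3}, {0: 0, 1: 1, 3: 2})
print(project_heads(src_heads, align, tgt_len=4))   # [1, -1, None, 1]
```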
... supervision. In this section, we describe our algorithm, which is briefly shown in Figure 3.

Figure 3: A brief depiction of the training pipeline.
Definitions: 1) e is English, f is the foreign language, and g is a language similar to f; 2) learn_dict(P) extracts a bilingual dictionary from parallel data P; 3) t(x|m) translates input x given model m; 4) pretrain(x) pretrains on monolingual data x using MASS (Song et al., 2019a); 5) train(P|m) trains on parallel data P initialized by model m; 6) bt_train(x1, x2|m) trains iterative back-translation on monolingual data x1 ∈ e and x2 ∈ f initialized by model m.
Inputs: 1) Wikipedia documents w(e), w(f), and w(g); 2) monolingual word embedding vectors ve and vf; 3) the set of linked pages from Wikipedia COMP, their aligned titles T, and their first sentence pairs F; 4) the set of paired image captions C; and 5) gold-standard parallel data P(e,g).
Algorithm:
→ Learn bilingual dictionary and embeddings
    S = F ∪ C ∪ T
    D(f,e) = learn_dict(S)
    D(g,e) = learn_dict(P(e,g))                ▷ Related language
    Learn ve → v′e and vf → v′f using D(f,e) ∪ D(g,e)
→ Mine parallel data
    Extract comparable sentences Z from COMP
    Extract P(f,e) from Z using Eq. 1.          ▷ Cosine sim.
    P(f,e) = P(f,e) ∪ T                         ▷ Mined data
→ Train MT with pretraining and back-translation
    θ0 = pretrain(w(e) ∪ w(f) ∪ w(g))           ▷ MASS training
    θ = train(P(f,e) ∪ P(g,e) | θ0)             ▷ NMT training
    P(e→f) = (t(w(f)|θ), w(f))
    P(f→e) = (t(w(e)|θ), w(e))
    P′(f,e) = P(e→f) ∪ P(f→e) ∪ P(f,e)
    θ′ = train(P′(f,e) | θ0)
    θ* = bt_train(w(e), w(f) | θ′)
Output: θ*

3.1 Data Definitions

For languages e and f in which e is English and f is a low-resource target language of interest, there are Wikipedia documents w_e = {w_1^(e) ... w_n^(e)} and w_f = {w_1^(f) ... w_m^(f)}. We refer to w_(i,j)^(l) as the jth sentence in the ith document for language l. A subset of these documents are aligned (using Wikipedia language links); thus we have an aligned set of document pairs from which we can easily extract many sentence pairs that are potentially translations of each other. A smaller subset F is the set of first sentences in Wikipedia (w_(i,1)^(e), w_(i′,1)^(f)) for which documents i and i′ are linked and their first sentence lengths are in a similar range. In addition to text content, Wikipedia has a large set of images. Each image comes along with one or more captions, sometimes in different languages. A small subset of these images have captions both in English and in the target language. We refer to this set as C. We use the set of all caption pairs (C), title pairs (T), and first sentences (F) as the seed parallel data: S = F ∪ C ∪ T.

3.2 Bilingual Dictionary Extraction and Cross-Lingual Word Embeddings

Having the seed parallel data S, we run unsupervised word alignment (Dyer et al., 2013) in both the English-to-target and target-to-English directions. We use the intersected alignments to extract highly confident word-to-word connections. Finally, we pick the most frequently aligned word for each word in English as its translation. This set serves as a bilingual dictionary D.

Given two monolingually trained word embeddings v_e ∈ R^{N_e×d} and v_f ∈ R^{N_f×d}, and the extracted bilingual dictionary D, we use the method of Faruqui and Dyer (2014) to project these two embedding spaces to a shared cross-lingual space.² This method uses a bilingual dictionary along with canonical correlation analysis (CCA) to learn two projection matrices that map each embedding vector to a shared space v′_e ∈ R^{N_e×d′} and v′_f ∈ R^{N_f×d′}, where d′ ≤ d.

² There are other approaches for extracting bilingual embeddings, such as Lample et al. (2018b). Comparing different cross-lingual embedding learning methods is not the focus of this paper, so we leave further investigation to future work.
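A minimal sketch of the dictionary-extraction step described above, assuming the intersected alignment links have already been produced (for example by an unsupervised aligner); the data layout is a hypothetical illustration, not the authors' implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def extract_dictionary(
    sent_pairs: List[Tuple[List[str], List[str]]],       # (English tokens, foreign tokens)
    intersected: List[List[Tuple[int, int]]],             # per sentence: (en_idx, f_idx) links
) -> Dict[str, str]:
    """Build a word-to-word dictionary from intersected alignments:
    for every English word, keep its most frequently aligned foreign word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for (en_toks, f_toks), links in zip(sent_pairs, intersected):
        for en_i, f_j in links:
            counts[en_toks[en_i]][f_toks[f_j]] += 1
    return {en: foreign.most_common(1)[0][0] for en, foreign in counts.items()}

# Toy usage with one aligned sentence pair.
pairs = [(["the", "box"], ["el", "caja"])]
links = [[(0, 0), (1, 1)]]
print(extract_dictionary(pairs, links))   # {'the': 'el', 'box': 'caja'}
```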
3.3 Mining Parallel Sentences

We use the cross-lingual embedding vectors v′_e ∈ R^{N_e×d′} and v′_f ∈ R^{N_f×d′} for calculating the cosine similarity between pairs of words. Moreover, we use the extracted bilingual dictionary to boost the accuracy of the scoring function. For a pair of sentences (s, t) where s = s_1 ... s_n and t = t_1 ... t_m, after filtering sentence pairs with different numerical values (e.g. sentences containing 2019 in the source and 1987 in the target), we use a modified version of cosine similarity between words:

sim(s_i, t_j) = 1.0            if (s_i, t_j) ∈ D
                cos(s_i, t_j)  otherwise
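A small illustration of this modified word similarity; the vector and dictionary representations are assumptions made for the sketch.

```python
import numpy as np

def sim(s_vec: np.ndarray, t_vec: np.ndarray,
        s_word: str, t_word: str, dictionary: set) -> float:
    """Modified word similarity: dictionary pairs get a perfect score,
    otherwise fall back to cosine similarity of cross-lingual embeddings."""
    if (s_word, t_word) in dictionary:
        return 1.0
    denom = np.linalg.norm(s_vec) * np.linalg.norm(t_vec)
    return float(np.dot(s_vec, t_vec) / denom) if denom > 0 else 0.0

# Toy usage: an out-of-dictionary pair scored by cosine similarity.
print(sim(np.array([1.0, 0.0]), np.array([1.0, 1.0]), "box", "caja", {("the", "el")}))
```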
Using the above definition of word similarity, we use the average-maximum similarity between a pair of sentences:

score(s, t) = (1/n) \sum_{i=1}^{n} \max_{j=1..m} sim(s_i, t_j)

From a pool of candidates, if the following conditions hold, we pick (s, t) as translations and add this pair to the mined parallel data:

t = arg max_{y ∈ G(s)} score(s, y)   and   s = arg max_{x ∈ G(t)} score(x, t)        (1)

where G is the translation candidate generation function from linked Wikipedia pages, based on sentences with similar lengths.

3.4 Leveraging Similar Languages

In many low-resource scenarios, the number of paired documents is very small, leading to a small number of, often noisy, extracted parallel sentences. To alleviate this problem to some extent, we assume access to another language g that has a large lexical overlap with the target language f (such as g = Russian and f = Kazakh). We assume that parallel data exists between language g and English, and we can use it both as auxiliary parallel data in training and for extracting extra lexical entries for the bilingual dictionaries: as shown in Figure 3, we supplement the bilingual dictionary extracted from the seed parallel data with the bilingual dictionary extracted from the related-language parallel data.
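The scoring and mutual-best selection of Eq. 1 can be sketched as follows; the candidate generation function G is simplified to a precomputed candidate list, and the toy word similarity is only for illustration, not the paper's embedding-based one.

```python
from typing import Callable, Dict, List, Tuple

def score(src: List[str], tgt: List[str], word_sim: Callable[[str, str], float]) -> float:
    """Average over source words of their maximum similarity to any target word."""
    if not src or not tgt:
        return 0.0
    return sum(max(word_sim(s, t) for t in tgt) for s in src) / len(src)

def mine_pairs(cands: List[Tuple[int, int]],
               src_sents: Dict[int, List[str]],
               tgt_sents: Dict[int, List[str]],
               word_sim: Callable[[str, str], float]) -> List[Tuple[int, int]]:
    """Keep (s, t) only when each side is the other's best-scoring candidate (Eq. 1)."""
    scores = {(s, t): score(src_sents[s], tgt_sents[t], word_sim) for s, t in cands}
    best_for_src: Dict[int, int] = {}
    best_for_tgt: Dict[int, int] = {}
    for (s, t), sc in scores.items():
        if s not in best_for_src or sc > scores[(s, best_for_src[s])]:
            best_for_src[s] = t
        if t not in best_for_tgt or sc > scores[(best_for_tgt[t], t)]:
            best_for_tgt[t] = s
    return [(s, t) for (s, t) in cands if best_for_src[s] == t and best_for_tgt[t] == s]

# Toy usage with an exact-match word similarity.
exact = lambda a, b: 1.0 if a == b else 0.0
src = {0: ["open", "box"], 1: ["red", "slide"]}
tgt = {0: ["caja", "box"], 1: ["slide", "rojo"]}
print(mine_pairs([(0, 0), (0, 1), (1, 0), (1, 1)], src, tgt, exact))  # [(0, 0), (1, 1)]
```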
3.5 Translation Model

We use a standard sequence-to-sequence transformer-based translation model (Vaswani et al., 2017) with a six-layer BERT-based (Devlin et al., 2019) encoder-decoder architecture from HuggingFace (Wolf et al., 2019) and PyTorch (Paszke et al., 2019), with a shared SentencePiece (Kudo and Richardson, 2018) vocabulary. All input and output token embeddings are summed with a language-id embedding, and the first token of every input and output sentence is the language ID. Our training pipeline assumes that the encoder and decoder are shared across different languages, except that we use a separate output layer for each language in order to prevent input copying (Artetxe et al., 2018b; Sen et al., 2019). We pretrain the model on a tuple of three Wikipedia datasets for the three languages g, f, and e using the MASS model (Song et al., 2019a). The MASS model masks a contiguous span of input tokens and recovers that span in the output sequence.

To facilitate multi-task learning with image captioning, our model has an image encoder that is used in the image captioning case (more details in §4.1). In other words, the decoder is shared between the translation and captioning tasks. We use the pretrained ResNet-152 model (He et al., 2016) from PyTorch to encode every input image. We extract the final layer as a 7 × 7 grid vector (g ∈ R^{7×7×d_g}), project it to a new space by a linear transformation (g′ ∈ R^{49×d_t}), and then add location embeddings (l ∈ R^{49×d_t}) by entry-wise addition. Afterwards, we treat the 49 vectors as encoded text representations, as if a sentence with 49 words had occurred. This is similar, but not identical, to the VirTex model (Desai and Johnson, 2021).

3.6 Back-Translation: One-shot and Iterative

Finally, we use the back-translation technique to improve the quality of our models. Back-translation is done by translating a large amount of monolingual text to and from the target language. The translated text serves as noisy input, with the original monolingual data as the silver-standard translations. Previous work (Sennrich et al., 2016b; Edunov et al., 2018) has shown that back-translation is a very simple but effective technique for improving the quality of translation models. Henceforth, we refer to this method as one-shot back-translation. Another approach is iterative back-translation (Hoang et al., 2018), the most popular approach in unsupervised translation (Artetxe et al., 2018b; Conneau and Lample, 2019; Song et al., 2019a). The main difference from one-shot back-translation is that the model uses an online approach and updates its parameters in every batch.

We empirically find one-shot back-translation faster to train but with much less potential to reach a high translation accuracy. A simple and effective way to have both a reliable and an accurate model is to first initialize a model with one-shot back-translation, and then apply iterative back-translation. The model that is initialized with a more accurate model reaches a higher accuracy.
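As an illustration of the image-encoder interface described in §3.5, the sketch below turns a torchvision ResNet-152 feature grid into 49 pseudo-token vectors with added location embeddings. The embedding size, weight handling, and module structure are assumptions for the sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class GridImageEncoder(nn.Module):
    """ResNet-152 7x7 grid -> linear projection -> + location embeddings = 49 'token' vectors."""

    def __init__(self, d_text: int = 512):
        super().__init__()
        backbone = models.resnet152(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
        # Drop the average-pooling and classification head to keep the 7x7x2048 grid.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, d_text)       # g (49 x 2048) -> g' (49 x d_text)
        self.loc = nn.Embedding(49, d_text)       # one location embedding per grid cell

    def forward(self, images: torch.Tensor) -> torch.Tensor:   # images: (B, 3, 224, 224)
        grid = self.cnn(images)                   # (B, 2048, 7, 7)
        grid = grid.flatten(2).transpose(1, 2)    # (B, 49, 2048)
        tokens = self.proj(grid)                  # (B, 49, d_text)
        positions = torch.arange(49, device=images.device)
        return tokens + self.loc(positions)       # treated like a 49-word "sentence"

enc = GridImageEncoder()
print(enc(torch.randn(2, 3, 224, 224)).shape)     # torch.Size([2, 49, 512])
```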
[Figure 4: An image from MS-COCO (Chen et al., 2015) with gold-standard English captions ("This is an open box containing four cucumbers."; "An open food container box with four unknown food items."; "A small box filled with four green vegetables."; "An opened box of four chocolate bananas."; "An open box contains an unknown, purple object."), and Arabic translations from our wikily translation model.]

4 Cross-Lingual Tasks

In this section, we describe our approaches for tailoring our translation models to cross-lingual tasks. Note that henceforth we assume that our translation model training is finished, and we have access to trained translation models for the cross-lingual tasks.

4.1 Cross-Lingual Image Captioning

Having gold-standard image captioning training data I = {(I_i, c_i)}_{i=1}^{n}, where I_i is the image as pixel values and c_i = c_i^(1), ..., c_i^(k_i) is the textual description with k_i words, our goal is to learn a captioning model that is able to describe new (unseen) images. As described in §3.5, we use a transformer decoder from our translation model and a ResNet image encoder (He et al., 2016) for our image captioning pipeline. Unfortunately, annotated image captioning datasets do not exist in many languages. Having our translation model parameters θ*, we can use its translation functionality to translate each caption c_i to c′_i = translate(c_i | θ*). Afterwards, we have a translated annotated dataset I′ = {(I_i, c′_i)}_{i=1}^{n} in which the textual descriptions are not gold-standard but translations of the English captions. Figure 4 shows a real example from MS-COCO (Chen et al., 2015) in which the Arabic translations are provided by our translation model. Furthermore, to augment our learning capability, we initialize our decoder with the decoder parameters of θ*, and also continue training with both English captioning and translation.
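A minimal sketch of the two ingredients described above: building the translated caption dataset I′ with the trained model θ*, and alternating captioning and translation batches so the shared decoder sees both tasks. The translate callable and the batch iterables are placeholders, not the authors' API.

```python
import itertools
from typing import Callable, Iterable, List, Tuple

def build_translated_captions(
    dataset: List[Tuple[str, str]],          # (image path, gold English caption)
    translate: Callable[[str], str],         # wraps the trained wikily model theta*
) -> List[Tuple[str, str]]:
    """I' = {(I_i, translate(c_i | theta*))}: keep the images, translate the captions."""
    return [(image, translate(caption)) for image, caption in dataset]

def multitask_batches(caption_batches: Iterable, translation_batches: Iterable):
    """Alternate captioning and translation batches during multi-task training."""
    for cap, mt in zip(caption_batches, itertools.cycle(translation_batches)):
        yield ("caption", cap)
        yield ("translate", mt)

# Usage sketch with a dummy translator standing in for theta*.
coco = [("coco/0001.jpg", "A small box filled with four green vegetables.")]
arabic_coco = build_translated_captions(coco, translate=lambda c: "<ar> " + c)
```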
4.2 Cross-Lingual Dependency Parsing

Assuming that we have a large body of monolingual text, we translate that monolingual text to create artificial parallel data. We run unsupervised word alignments on the artificial parallel text. Following previous work (Rasooli and Collins, 2015; Ma and Xia, 2014), we run Giza++ (Och and Ney, 2003) alignments in both the source-to-target and target-to-source directions, and extract intersected alignments to keep high-precision one-to-one alignments. We run a supervised dependency parser of English as our rich-resource language. Then, we project dependencies to the target-language sentences via the word alignment links. Inspired by previous work (Rasooli and Collins, 2015), to remove noisy projections, we keep those sentences in which at least 50% of the words, or 5 consecutive words, on the target side have projected dependencies.
5 Experiments

In this section, we provide details about our experimental settings and results for translation, captioning, and dependency parsing.

5.1 Datasets and Settings

Languages. We focus on four language pairs: Arabic-English, Gujarati-English, Kazakh-English, and Romanian-English. We choose these pairs to provide enough evidence that our model works for distant languages and morphologically rich languages, as well as similar languages. As the similar languages, we use Persian for Arabic (written with very similar scripts and with many words in common), Hindi for Gujarati (similar languages), Russian for Kazakh (written with the same script), and Italian for Romanian (Romance languages).

Monolingual and Translation Datasets. We use regular expressions to tokenize sentences of the Wikipedia dump text. We use a shared SentencePiece vocabulary (Kudo and Richardson, 2018) with size 60K. Table 1 shows the sizes of the Wikipedia data in the different languages. We use an off-the-shelf Indic-transliteration library³ to convert the Devanagari script to the Gujarati script so that the Hindi documents look like Gujarati. This is done by removing the graphical bars from Hindi letters: this makes them look like Gujarati, thus increasing the chance of capturing more words in common. For parallel data in the similar languages, we use the Mizan parallel data for Persian (Kashefi, 2018) with one million sentences, the IITB data (Kunchukuttan et al., 2018) and HindEnCorp 0.5 (Bojar et al., 2014) for Hindi with a total of 367K sentences, ParaCrawl for Russian (Esplà et al., 2019) with 12M sentences, and Europarl for Italian (Koehn, 2005) with 2M sentences. We use the Arabic-English UN data (Ziemski et al., 2016), WMT 2019 data (Barrault et al., 2019) for Gujarati-English and Kazakh-English, and WMT 2016 shared task data (Bojar et al., 2016) for Romanian-English. Following previous work (Sennrich et al., 2016a), diacritics are removed from the Romanian data.

³ https://pypi.org/project/indic-transliteration/

Table 1: Data sizes for the different pairs. English has 6 million documents. We use a sample of English sentences with similar sizes to each language.

Direction | Foreign docs | Paired docs | First sents. | Captions | Comparable pairs | Mined sents. | BT   | Iterative BT
ar-en     | 1.0m         | 745k        | 205k         | 92k      | 0.1b             | 1.7m         | 2.1m | 4.0m
gu-en     | 28k          | 7.3k        | 3.2k         | 2.2k     | 14m              | 49k          | 1.5m | 3.8m
kk-en     | 230k         | 80k         | 52k          | 1.9k     | 32m              | 183k         | 2.2m | 4.0m
ro-en     | 400k         | 270k        | 78k          | 35k      | 64m              | 675k         | 2.1m | 6.1m

Cross-Lingual Embedding. We use the off-the-shelf 300-dimensional FastText embeddings (Grave et al., 2018) as monolingual embedding vectors. We run FastAlign (Dyer et al., 2013) on the seed parallel text in both the source-to-target and target-to-source directions, run alignment intersection to get intersected alignments, and extract the most frequently occurring alignment for every word as its dictionary entry. We make use of the cross-lingual CCA tool (Faruqui and Dyer, 2014) to extract 150-dimensional vectors. This tool can be run on a single CPU within a few hours.
Pretraining. We pretrain four models on 3-tuples of languages using a single NVIDIA GeForce RTX 2080 Ti with 11GB of memory. We boost the Romanian, Gujarati, and Kazakh monolingual data with news text from WMT in order to have enough monolingual data as well as in-domain text. We create batches of 4K words and run pretraining for two million iterations, alternating between language batches. We use the apex library⁴ to use 16-bit floating-point tensors and double the processing speed. To mimic a multi-GPU scenario, we accumulate gradients for 8 steps. We use the Adam optimizer (Kingma and Ba, 2015) with an inverse-square-root schedule, a learning rate of 10⁻⁴, 4000 warm-up steps, and a dropout probability of 0.1. Due to the GPU memory limitation, this whole process takes about four weeks; in theory, with 8 high-memory GPUs, we could obtain higher-quality pretrained models in a few days.

⁴ https://github.com/NVIDIA/apex

Translation Training. Table 1 shows the sizes of the different types of datasets in our experiments. We pick comparable candidates for sentence pairs whose lengths are within a range of half to twice of each other. As we see, the final size of the mined datasets heavily depends on the number of paired English-target Wikipedia documents. We train our translation models initialized by the pretrained models. Each batch has roughly 4K tokens. Except for Arabic, for which the size of the mined data significantly outnumbers the size of the Persian-English parallel data, we use the related-language data before using iterative back-translation, in which we only use the source and target monolingual datasets. We use learning hyper-parameters similar to pretraining, except for iterative back-translation, in which we accumulate gradients for 100 steps and use a dropout probability of 0.2 and 10000 warmup steps, since we find that smaller dropout and warmup make the model diverge. Our one-shot back-translation experiments use a beam size of 4, but we use a beam size of one for iterative back-translation, since we have not seen much gain from beam-based iterative back-translation except in purely unsupervised settings. All of our translations are performed with a beam size of 4 and max_len_a = 1.3 and max_len_b = 5. We alternate between the supervised parallel data of a similar language paired with English and the mined data. We train translation models for roughly 400K batches, except for Gujarati, which has smaller mined data, for which we train for 200K iterations. We have seen a quick divergence in Kazakh iterative back-translation, so we stopped it early after running it for one epoch over all the monolingual data. Most likely, the mined data for Kazakh-English has lower quality (see §A for more details), and that leads to very noisy translations in the back-translation outputs. All of our evaluations are conducted using SacreBLEU (Post, 2018) except for en↔ro, for which we use the BLEU score (Papineni et al., 2002) from the Moses decoder scripts (Koehn et al., 2007) for the sake of comparison to previous work.
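A minimal sketch of the optimization recipe described above (Adam, an inverse-square-root schedule with 4,000 warmup steps, and gradient accumulation); the model and batch iterator are placeholders, and this is an illustration rather than the exact training code.

```python
import torch

def inverse_sqrt_lr(step: int, base_lr: float = 1e-4, warmup: int = 4000) -> float:
    """Linear warmup for `warmup` steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (warmup ** 0.5) / (step ** 0.5)

def train(model: torch.nn.Module, batches, accumulation: int = 8, max_steps: int = 2_000_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    optimizer.zero_grad()
    for step, batch in enumerate(batches, start=1):
        if step > max_steps:
            break
        loss = model(**batch)                    # assume the model returns its training loss
        (loss / accumulation).backward()         # accumulate gradients to mimic larger batches
        if step % accumulation == 0:
            for group in optimizer.param_groups:
                group["lr"] = inverse_sqrt_lr(step)
            optimizer.step()
            optimizer.zero_grad()
```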
Image Captioning. We use the Flickr (Hodosh et al., 2013) and MS-COCO (Chen et al., 2015) datasets for English⁵, and the gold-standard Arabic Flickr dataset (ElJundi et al., 2020) for evaluation. The Arabic test set has 1000 images with 3 captions per image. We translate all the training datasets to Arabic to obtain translated caption data. The final training data contains 620K captions for about 125K unique images. Throughout the experiments, we use the pretrained ResNet-152 model (He et al., 2016) from PyTorch (Paszke et al., 2019) and let it fine-tune during our training pipeline. Each training batch contains 20 images. We accumulate gradients for 16 steps, and use a dropout of 0.1 for the projected image output representations. Other training parameters are the same as in our translation training. To keep our pipeline fully unsupervised, we use translated development sets to pick the best model during training.

⁵ We have also tried Conceptual Captions (Sharma et al., 2018) in our initial experiments but we observed drops in performance. Previous work (Singh et al., 2020) has also observed a similar problem with Conceptual Captions as a noisy crawled caption dataset.

Dependency Parsing. We use the Universal Dependencies v2.7 collection (Zeman et al., 2020) for Arabic, Kazakh, and Romanian. We use the Stanza (Qi et al., 2020) pretrained supervised models to obtain supervised parse trees for Arabic and Romanian, and the UDPipe (Straka et al., 2016) pretrained model for Kazakh. We use a simple modification of Stanza to facilitate training on partially projected trees, by masking dependency and label assignments for words with missing dependencies. All of our training on projected dependencies is blindly conducted with 100k training steps, using the default training parameters of Stanza (Qi et al., 2020). As for gold-standard parallel data, we use our supervised translation training data for Romanian-English and Kazakh-English, and a sample of 2 million sentences from the UN Arabic-English data due to its large size, which significantly slows down word alignment. For the Kazakh wikily projections, we observe that the supervised POS taggers have very low accuracy due to the small size of the Kazakh gold-standard treebank (31 sentences for training). We therefore project the POS tags for the projected words. On average, we observe a two percent increase in performance from projecting part-of-speech tags as well as dependency relations.
5.2 Translation Results

Table 2 shows the results of different settings in addition to baseline and state-of-the-art results. We see that Arabic, as a clear exception, needs more rounds of training: we train our Arabic model once again on the mined data by initializing it with our back-translation model.⁶ We have not seen further improvement from back-translation. To have a fair comparison, we list the best supervised models for all language pairs (to the best of our knowledge). In low-resource settings, we outperform strong supervised models that are boosted by back-translation. In high-resource settings, our Arabic models achieve very high performance, but given that the supervised parallel data for Arabic has 18M sentences, it is quite impossible to reach that level of accuracy: our Arabic Wikipedia data is much smaller than the UN parallel data.

⁶ We have seen that during multi-tasking with image captioning, the translation BLEU score for Arabic-English significantly improves. We initially thought that multi-tasking is improving both translation and captioning, but our further investigation shows that it is actually due to a lack of training for Arabic. We have tried the same procedure for other languages but have not observed any further gains.

Table 2: BLEU scores for different models. Our models are initialized by our pretrained MASS model. Reference results are from *: our implementation, 1: Kim et al. (2020), 2: Li et al. (2020), 3: Liu et al. (2020) (supervised with back-translation), 4: Tran et al. (2020) (unsupervised with mined parallel data).

Model | ar→en | en→ar | gu→en | en→gu | kk→en | en→kk | ro→en | en→ro
UNMT: Conneau and Lample (2019) | – | – | – | – | – | – | 31.8 | 33.3
UNMT: Song et al. (2019a) (MASS; 8 GPUs) | – | – | – | – | – | – | 33.1 | 35.2
Best published results | 11.0* | 9.4* | 0.6^1 | 0.6^1 | 2.0^1 | 0.8^1 | 37.6^4 | 36.3^2
Wikily: First sentences + captions + titles | 6.1 | 3.1 | 0.7 | 1.1 | 2.3 | 1.0 | 2.0 | 1.9
Wikily: Mined corpora | 23.1 | 19.7 | 4.2 | 4.9 | 2.8 | 1.6 | 22.1 | 21.6
Wikily: + Related language | – | – | 9.1 | 7.8 | 7.3 | 2.3 | 23.2 | 21.5
Wikily: + One-shot back-translation (bt-beam=4) | 23.0 | 18.8 | 13.8 | 13.9 | 7.0 | 12.1 | 25.2 | 28.1
Wikily: + Iterative back-translation (bt-beam=1) | 24.4 | 18.9 | 13.3 | 15.2 | 9.0 | 10.8 | 32.5 | 33.0
Wikily: + Retrain on mined data | 30.6 | 23.4 | – | – | – | – | – | –
(Semi-)Supervised | 48.9* | 40.6* | 14.2^1 | 4.0^1 | 12.5^1 | 3.1^1 | 39.9^3 | 38.5^3

5.3 Captioning Results

Table 3 shows the final results on the Arabic test set using the SacreBLEU measure (Post, 2018). First, we should note that, similar to ElJundi et al. (2020), we see a lower scale of BLEU scores due to the morphological richness of Arabic (see §A for details). We see that if we initialize our model with the translation model and multi-task it with translation and also English captioning, we achieve much higher performance. It is interesting to observe that by translating the English output on the test data to Arabic, we achieve a much lower result. This is a strong indicator of the strength of our approach. We also see that supervised translation fails to perform well. This might be due to the UN dataset, which has a different domain from the caption dataset. Furthermore, we see that our model outperforms using Google Translate, which is a strong machine translation system and is actually what is being used as the seed data for manual revision in the Arabic dataset. Finally, it is interesting to see that our model outperforms supervised captioning. It is worth noting that multi-tasking makes translation performance slightly worse.

Table 3: Image captioning results evaluated on the Arabic Flickr dataset (ElJundi et al., 2020) using SacreBLEU (Post, 2018). The column "Pretrained" indicates initializing our captioning model with parameters from our translation model; ✓/✗ mark whether each component is used.

Setting | Supervision | Pretrained | Multi-task EN | Multi-task MT | BLEU@1 | BLEU@4
Translate EN train data | wikily | ✗ | ✗ | ✗ | 33.1 | 4.57
Translate EN train data | wikily | ✓ | ✗ | ✗ | 32.9 | 5.28
Translate EN train data | wikily | ✓ | ✓ | ✗ | 32.8 | 4.37
Translate EN train data | wikily | ✓ | ✗ | ✓ | 33.3 | 5.72
Translate EN train data | wikily | ✓ | ✓ | ✓ | 36.8 | 5.60
Translate EN train data | supervised | ✓ | ✗ | ✗ | 17.7 | 1.26
Translate test | wikily | ✓ | ✗ | ✗ | 30.6 | 4.20
Translate test | supervised | ✓ | ✗ | ✗ | 15.8 | 0.92
Translate test | Google | ✓ | ✗ | ✗ | 31.8 | 5.56
Gold | gold | ✓ | ✗ | ✗ | 33.7 | 3.76
Gold | gold | ✓ | ✓ | ✗ | 37.9 | 5.22
English test performance → | | | | | 68.7 | 20.42
5.4 Dependency Parsing

Table 4 shows the results of the dependency parsing experiments. We see that our model performs very well in Romanian, with a UAS of 74, which is much higher than that of Ahmad et al. (2019) and slightly lower than that of Rasooli and Collins (2019), which uses a combination of multi-source annotation projection and direct model transfer. Our work on Arabic outperforms all previous work and performs even better than using gold-standard parallel data. One clear highlight is our result on Kazakh. As mentioned before, by projecting the part-of-speech tags, we achieve roughly 2 percent absolute improvement. Our final results on Kazakh are significantly higher than those obtained using gold-standard parallel text (7K sentences).

Table 4: Dependency parsing results on the Universal Dependencies dataset (Zeman et al., 2020). Previous work has used different sub-versions of the Universal Dependencies data, for which slight differences are expected. Each language column reports UAS / LAS / BLEX.

Method | Version | Token and POS | Arabic | Kazakh | Romanian
Previous: Rasooli and Collins (2019) | 2.0 | gold/supervised | 61.2 / 48.8 / – | – / – / – | 76.3 / 64.3 / –
Previous: Ahmad et al. (2019) | 2.2 | gold | 38.1 / 28.0 / – | – / – / – | 65.1 / 54.1 / –
Previous: Kurniawan et al. (2021) | 2.2 | gold | 48.3 / 29.9 / – | – / – / – | – / – / –
Projection: Wikily translation | 2.7 | gold | 62.5 / 50.7 / 46.3 | 46.8 / 28.5 / 25.0 | 74.1 / 57.7 / 52.6
Projection: Wikily translation | 2.7 | supervised | 60.2 / 48.7 / 42.1 | 46.2 / 27.8 / 14.1 | 73.6 / 57.4 / 50.9
Projection: Gold-standard parallel data | 2.7 | gold | 61.5 / 47.3 / 42.4 | 22.2 / 9.3 / 7.9 | 75.9 / 62.4 / 57.3
Projection: Gold-standard parallel data | 2.7 | supervised | 59.1 / 45.3 / 38.5 | 21.8 / 9.2 / 3.8 | 75.6 / 62.0 / 55.6
Supervised | 2.7 | supervised | 84.2 / 79.8 / 72.7 | 48.0 / 29.8 / 13.7 | 90.8 / 86.0 / 80.0
6 Related Work

Kim et al. (2020) have shown that unsupervised translation models often fail to provide good translation systems for distant languages. Our work addresses this problem by wisely leveraging the Wikipedia data. Using pivot languages in zero-shot settings has been explored in previous work (Al-Shedivat and Parikh, 2019), as well as using related languages (Zoph et al., 2016; Nguyen and Chiang, 2017). Our work only explores the simple idea of adding one supervised similar language pair. Most likely, adding more language pairs and using ideas from recent work should help improve the accuracy of our models.

Wikipedia has always been an interesting dataset for solving NLP problems, including machine translation (Li et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et al., 2015; Ruiter et al., 2019). The WikiMatrix data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but it uses supervised translation models. There is also a very recent collection of Wikipedia data (Srinivasan et al., 2021) for many languages, aimed at multimodal machine learning tasks.

Bitext mining has a longer history of research (Resnik, 1998; Resnik and Smith, 2003), in which most efforts are spent on using a seed supervised translation model (Guo et al., 2018; Schwenk et al., 2019b; Artetxe and Schwenk, 2019; Schwenk et al., 2019a; Jones and Wijaya, 2021). Recently, a number of papers have focused on unsupervised extraction of parallel data (Ruiter et al., 2019; Hangya and Fraser, 2019; Keung et al., 2020; Tran et al., 2020; Kuwanto et al., 2021). Our work lies in the group of unsupervised mining approaches, with a focus on Wikipedia and fast retrieval of parallel text. Ruiter et al. (2019) focus on using vector similarity of sentences to extract high-quality parallel text from Wikipedia. Their work has not leveraged the specific structural signals from Wikipedia. It is worth noting that recent work has considered using a large number of small Bible parallel datasets for translation (Mueller et al., 2020): we think this line of work can be combined with ours.

Cross-lingual and unsupervised image captioning has been studied in previous work (Gu et al., 2018; Feng et al., 2019; Song et al., 2019b; Gu et al., 2019; Gao et al., 2020). Unlike previous work, we do not have a supervised translation model. Cross-lingual transfer of dependency parsers has a long history; we encourage the reader to consult a recent survey on this topic (Das and Sarkar, 2020). Our work does not use gold-standard parallel data or even supervised translation models to apply annotation projection, but we still see that our models perform similarly to, or sometimes better than, using gold-standard parallel text.
7 Conclusion

We have described a fast and effective algorithm for learning unsupervised machine translation systems using Wikipedia. We show that by wisely choosing what to use as seed data, we can obtain very good seed parallel data to mine more parallel text from Wikipedia. We have also shown that those translation models can be used in downstream cross-lingual natural language processing tasks. In the future, we plan to extend our approach beyond Wikipedia to other comparable datasets such as the BBC World Service. Moreover, a clear extension of this work is to try our approach on other cross-lingual tasks.

Acknowledgments

We would like to thank Alireza Zareian, Daniel (Joongwon) Kim, Qing Sun, and Afra Feyza Akyurek for their help and useful comments throughout this project. This work is supported in part by DARPA HR001118S0044 (the LwLL program) and the Department of the Air Force FA8750-19-2-3334 (Semi-supervised Learning of Multimodal Representations). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA, the Air Force, and the U.S. Government.

A Analysis and Discussion

Manual Observation: Figure 5 shows a randomly chosen example from the Gujarati-English development data. As depicted, we see that the model after back-translation roughly reaches the core meaning of the sentence, with a bit of divergence from exactly matching the reference. The final iterative back-translation output almost produces a correct translation. We also see the use of the word "creative" in the Google Translate output, a model that is most likely trained on much larger parallel data than what is currently available for public use. In general, unsupervised translation performs very poorly compared to our approach in all directions.

Figure 5: An example of a Gujarati sentence and its outputs from different models, as well as Google Translate.
Input: અથાત આપણે પહે લા તુલનાએ વધુ રચના મક બનવું પડશે.
Unsupervised: Ut numerous ીit the mother, onwards, in theover અિધકાંશexualit theotherit theIN રોડ 19
First sentences + captions + titles: A view of the universe from the present to the present day.
Mined corpora: For example, if the ghazal is more popular than ghazal.
+ Related language: We need to become more creative than before.
+ One-shot back-translation: For example, we must become more creative than before.
+ Iterative back-translation: Meanwhile, we'll have to become more constructive than before.
Google Translate: That means we have to be more creative than before.
Reference: That means we have to be more constructive than before.

Quality of mined data: The quality of parallel data matters a lot for obtaining high accuracy. For example, we manually observe that the quality of the mined data for all languages is very good except for Kazakh. Our hypothesis is that the Kazakh Wikipedia data is less aligned with the English content. We compare our mined data to the supervised mined data from WikiMatrix (Schwenk et al., 2019a) as well as gold-standard data. Figure 6 shows the difference between the three datasets for three language pairs (WikiMatrix does not contain Gujarati). As we see, our data yields BLEU scores close to WikiMatrix in all languages, and in the case of Kazakh, the model trained on our data performs better than WikiMatrix. In other words, in the case of very noisy comparable data, as for Kazakh-English, our model even outperforms a contextualized supervised model. It is also interesting to see that our model outperforms the supervised model for Kazakh, which has only 7.7K gold-standard training sentences. These are all strong evidence of the strength of our approach in truly low-resource settings.

[Figure 6: Results using our mined data versus WikiMatrix (Schwenk et al., 2019a) and gold-standard data (BLEU for ar→en, en→ar, kk→en, en→kk, ro→en, en→ro).]

Pretraining matters: It is a truth universally acknowledged, that a single model in possession of a small training data and high learning capacity, must be in want of a pretrained model. To prove this, we run our translation experiments with and without pretraining. In this case, all models with the same training data and parameters are equal, but some models are more equal. Figure 7 shows the results on the mined data. Clearly, there is a significant gain from using pretrained models. For Gujarati, the lowest-resource language in our experiments, the difference is more notable: from a BLEU score of 2.9 to 9.0. If we had access to a cluster of high-memory GPUs, we could potentially obtain even higher results throughout all of our experiments. Therefore, we believe that part of the blame for our results in English-Romanian is on pretraining. As we see in Figure 6, our supervised results without back-translation are also low for English-Romanian.

[Figure 7: Results using mined data (no back-translation) with and without pretraining.]

Comparing to CRISS: The recent work of Tran et al. (2020) shows impressive gains using high-quality pretrained models and iterative parallel data mining from a larger comparable dataset than Wikipedia. Their pretrained model is trained using 256 Nvidia V100 GPUs in approximately 2.5 weeks (Liu et al., 2020). Figure 8 shows that, considering all these facts, our model still outperforms their supervised model in English-to-Kazakh by a big margin (4.3 vs. 10.8) and gets close to their performance in the other directions. We should emphasize that Tran et al. (2020) explore a much bigger comparable dataset than ours. One clear addition to our work is exploring parallel data from other available comparable datasets. Due to limited computational resources, we skip this part, but we do believe that using our current unsupervised models can help extract even more high-quality parallel data from comparable datasets, and this might lead to further gains for low-resource languages.

[Figure 8: Our best results (Table 2) versus the supervised model of Tran et al. (2020) for gu→en, en→gu, kk→en, en→kk.]

Image captioning quality: Figure 9 shows a randomly picked example with different model outputs. We see that the two outputs from our approach with multi-tasking are roughly the same, but one of them has more syntactic order overlap with the reference, while both orders are correct in Arabic as a free-word-order language. The word برتقالية means "orange," which is close to حمراء, which means "red." The word شريحة means "slide," which is correct, but other meanings of this word exist in the reference. In general, we observe that although superficially the BLEU scores for Arabic are low, this is mostly due to its lexical diversity, free word order, and morphological complexity.
[Figure 9: An example of different outputs in our captioning experiments, both for English and Arabic, as well as Arabic translations of English outputs, on the Arabic Flickr dataset (ElJundi et al., 2020). English gold captions: "A child on a red slide."; "A little boy sits on a slide on the playground."; "A little boy slides down a bright red corkscrew slide."; "A little boy slides down a red slide."; "a young boy wearing a blue outfit sliding down a red slide." English supervised output: "A boy is sitting on a red slide."]

References

Wasi Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. 2019. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. In Proceedings of NAACL-HLT, pages 2440–2452.

Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In Proceedings of NAACL-HLT, pages 1184–1197.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Unsupervised statistical machine translation. In Proceedings of EMNLP, pages 3632–3642.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In International Conference on Learning Representations.

Mikel Artetxe and Holger Schwenk. 2019. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of ACL, pages 3197–3203.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61.

Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba, and Lluís Màrquez. 2015. A factory of comparable corpora from Wikipedia. In Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pages 3–13.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198.

Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. 2014. HindEnCorp: Hindi-English and Hindi-only corpus for machine translation. In Proceedings of LREC, pages 3550–3555.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069.

Ayan Das and Sudeshna Sarkar. 2020. A survey of the model transfer approaches to cross-lingual dependency parsing. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(5):1–60.

Karan Desai and Justin Johnson. 2021. VirTex: Learning visual representations from textual annotations. In CVPR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of NAACL-HLT, pages 644–648.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of EMNLP, pages 489–500.

Obeida ElJundi, Mohamad Dhaybi, Kotaiba Mokadam, Hazem Hajj, and Daniel Asmar. 2020. Resources and end-to-end neural network models for Arabic image captioning. In Proceedings of VISAPP, pages 233–241.

Miquel Esplà, Mikel Forcada, Gema Ramírez-Sánchez, and Hieu Hoang. 2019. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of Machine Translation Summit XVII, Volume 2: Translator, Project and User Tracks, pages 118–119.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, pages 462–471.

Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. 2019. Unsupervised image captioning. In Proceedings of CVPR, pages 4125–4134.

Jiahui Gao, Yi Zhou, Philip L. H. Yu, and Jiuxiang Gu. 2020. Unsupervised cross-lingual image captioning. arXiv preprint arXiv:2010.01288.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of LREC.

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, and Gang Wang. 2018. Unpaired image captioning by language pivoting. In Proceedings of ECCV, pages 503–519.

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. 2019. Unpaired image captioning via scene graph alignments. In Proceedings of ICCV, pages 10323–10332.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In Proceedings of ACL, pages 1224–1234.

K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325.

Alex Jones and Derry Tanti Wijaya. 2021. Majority voting with bidirectional pre-translation for bitext retrieval.

Omid Kashefi. 2018. Mizan: A large Persian-English parallel corpus. arXiv preprint arXiv:1801.02107.

Phillip Keung, Julian Salazar, Yichao Lu, and Noah A. Smith. 2020. Unsupervised bitext mining and translation via self-trained contextual embeddings. arXiv preprint arXiv:2010.07761.

Yunsu Kim, Miguel Graça, and Hermann Ney. 2020. When and why is unsupervised neural machine translation useless? In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 35–44.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pages 177–180.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP: System Demonstrations, pages 66–71.

Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi parallel corpus. In Proceedings of LREC.

Kemal Kurniawan, Lea Frermann, Philip Schulz, and Trevor Cohn. 2021. PPT: Parsimonious parser transfer for unsupervised cross-lingual adaptation. arXiv preprint arXiv:2101.11216.

Garry Kuwanto, Afra Feyza Akyürek, Isidora Chara Tourni, Siyang Li, and Derry Wijaya. 2021. Low-resource machine translation for low-resource languages: Leveraging comparable data, code-switching and compute resources.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018c. Phrase-based & neural unsupervised machine translation. In Proceedings of EMNLP, pages 5039–5049.

Shen Li, João V. Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL, pages 1389–1398.

Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, and Hai Zhao. 2020. Data-dependent Gaussian prior objective for language generation. In International Conference on Learning Representations.

Wen-Pin Lin, Matthew Snover, and Heng Ji. 2011. Unsupervised language-independent name translation mining from Wikipedia infoboxes. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 43–52.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of ACL, pages 1337–1348.

Kelly Marchisio, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? In Proceedings of the Fifth Conference on Machine Translation.