On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Wei Zhao†, Goran Glavaš‡, Maxime PeyrardΦ, Yang Gao⋆, Robert WestΦ, Steffen Eger†
†Technische Universität Darmstadt   ‡University of Mannheim, Germany
ΦEPFL, Switzerland   ⋆Royal Holloway University of London, UK
{zhao,eger}@aiphes.tu-darmstadt.de, goran@informatik.uni-mannheim.de,
yang.gao@rhul.ac.uk, {maxime.peyrard,robert.west}@epfl.ch

Abstract

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation, where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER models. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish "translationese", i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity-based metrics with target-side language modeling. In segment-level MT evaluation, our best combined metric surpasses the reference-based BLEU by 5.7 correlation points. We make our MT evaluation code available.¹

¹ https://github.com/AIPHES/ACL20-Reference-Free-MT-Evaluation

1 Introduction

A standard evaluation setup for supervised machine learning (ML) tasks assumes an evaluation metric that compares a gold label to a classifier prediction. This setup assumes that the task has clearly defined and unambiguous labels and, in most cases, that an instance can be assigned few labels. These assumptions, however, do not hold for natural language generation (NLG) tasks like machine translation (MT) (Bahdanau et al., 2015; Johnson et al., 2017) and text summarization (Rush et al., 2015; Tan et al., 2017), where we do not predict a single discrete label but generate natural language text. Thus, the set of labels for NLG is neither clearly defined nor finite. Yet, the standard evaluation protocols for NLG still predominantly follow the described default paradigm: (1) evaluation datasets come with human-created reference texts, and (2) evaluation metrics, e.g., BLEU (Papineni et al., 2002) or METEOR (Lavie and Agarwal, 2007) for MT and ROUGE (Lin and Hovy, 2003) for summarization, count the exact "label" (i.e., n-gram) matches between reference and system-generated text. In other words, established NLG evaluation compares semantically ambiguous labels from an unbounded set (i.e., natural language texts) via hard symbolic matching (i.e., string overlap).

The first remedy is to replace the hard symbolic comparison of natural language "labels" with a soft comparison of texts' meaning, using semantic vector space representations. Recently, a number of MT evaluation methods have appeared that focus on semantic comparison of reference and system translations (Shimanaka et al., 2018; Clark et al., 2019; Zhao et al., 2019). While these correlate better with human assessments than n-gram overlap metrics do, they do not address the inherent limitations that stem from the need for reference translations, namely: (1) references are expensive to obtain; (2) they assume a single correct solution and bias the evaluation, both automatic and human (Dreyer and Marcu, 2012; Fomicheva and Specia, 2016); and (3) MT evaluation is limited to language pairs with available parallel data.

Reliable reference-free evaluation metrics, which directly measure the (semantic) correspondence between the source-language text and the system translation, would remove the need for human references and allow for unlimited MT evaluations: any monolingual corpus could be used for evaluating MT systems.
However, the proposals of reference-free MT evaluation metrics have been few and far between, and they have required either non-negligible supervision (i.e., human translation quality labels) (Specia et al., 2010) or language-specific preprocessing such as semantic parsing (Lo et al., 2014; Lo, 2019), both of which hinder the wide applicability of the proposed metrics. Moreover, they have typically exhibited performance levels well below those of standard reference-based metrics (Ma et al., 2019).

In this work, we comparatively evaluate a number of reference-free MT evaluation metrics that build on the most recent developments in multilingual representation learning, namely cross-lingual contextualized embeddings (Devlin et al., 2019) and cross-lingual sentence encoders (Artetxe and Schwenk, 2019). We investigate two types of cross-lingual reference-free metrics: (1) soft token-level alignment metrics, which find the optimal soft alignment between the source sentence and the system translation using Word Mover's Distance (WMD) (Kusner et al., 2015); Zhao et al. (2019) recently demonstrated that WMD operating on BERT representations (Devlin et al., 2019) substantially outperforms baseline MT evaluation metrics in the reference-based setting, and we investigate whether WMD can yield comparable success in the reference-free (i.e., cross-lingual) setup; and (2) sentence-level similarity metrics, which measure the similarity between sentence representations of the source sentence and the system translation using cosine similarity.

Our analysis yields several interesting findings. (i) We show that, unlike in the monolingual reference-based setup, metrics that operate on contextualized representations generally do not outperform symbolic matching metrics like BLEU, which operate in the reference-based environment. (ii) We identify two reasons for this failure: (a) a cross-lingual semantic mismatch, especially for multilingual BERT (M-BERT), which constructs a shared multilingual space in an unsupervised fashion, without any direct bilingual signal; and (b) the inability of the state-of-the-art cross-lingual metrics based on multilingual encoders to adequately capture and punish "translationese", i.e., literal word-by-word translations of the source sentence; since translationese is an especially persistent property of MT systems, this problem is particularly troubling in our context of reference-free MT evaluation. (iii) We show that an additional weakly-supervised cross-lingual re-mapping step can, to some extent, alleviate both issues. (iv) Finally, we show that the combination of cross-lingual reference-free metrics with language modeling on the target side (which is able to detect translationese) surpasses the performance of reference-based baselines.

Beyond establishing a viable prospect of web-scale, domain-agnostic MT evaluation, our findings indicate that the challenging task of reference-free MT evaluation can expose an important limitation of current state-of-the-art multilingual encoders, namely the failure to properly represent corrupt input, which may go unnoticed in simpler evaluation setups such as zero-shot cross-lingual text classification or cross-lingual text similarity that do not involve such "adversarial" conditions. We believe this makes it a promising probe for improving cross-lingual representations, complementing the recent benchmark that focuses on zero-shot transfer scenarios (Hu et al., 2020).

2 Related Work

Manual human evaluations of MT systems undoubtedly yield the most reliable results, but they are expensive, tedious, and generally do not scale to a multitude of domains. A significant body of research is thus dedicated to the study of automatic evaluation metrics for machine translation. Here, we provide an overview of both reference-based MT evaluation metrics and recent research efforts towards reference-free MT evaluation, which leverage cross-lingual semantic representations and unsupervised MT techniques.

Reference-based MT evaluation. Most of the commonly used evaluation metrics in MT compare system and reference translations.
They are often based on surface forms such as n-gram overlap, e.g., BLEU (Papineni et al., 2002), SentBLEU, NIST (Doddington, 2002), chrF++ (Popović, 2017), or METEOR++ (Guo and Hu, 2019), and they have been extensively tested and compared in recent WMT metrics shared tasks (Bojar et al., 2017a; Ma et al., 2018a, 2019).

These metrics, however, operate at the surface level and by design fail to recognize semantic equivalence when lexical overlap is lacking. To overcome these limitations, some research efforts exploited static word embeddings (Mikolov et al., 2013b) and trained embedding-based supervised metrics on sufficiently large datasets with available human judgments of translation quality (Shimanaka et al., 2018).
With the development of contextualized word embeddings (Peters et al., 2018; Devlin et al., 2019), we have witnessed proposals of semantic metrics that account for word order. For example, Clark et al. (2019) introduce a semantic metric relying on sentence mover's similarity and contextualized ELMo embeddings (Peters et al., 2018). Similarly, Zhang et al. (2019) describe a reference-based semantic similarity metric based on contextualized BERT representations (Devlin et al., 2019). Zhao et al. (2019) generalize this line of work with their MoverScore metric, which computes the mover's distance, i.e., the optimal soft alignment between tokens of the two sentences, based on the similarities between their contextualized embeddings. Mathur et al. (2019) train a supervised BERT-based regressor for reference-based MT evaluation.

Reference-free MT evaluation. Recently, there has been growing interest in reference-free MT evaluation (Ma et al., 2019), also referred to as "quality estimation" (QE) in the MT community. In this setup, evaluation metrics semantically compare system translations directly to the source sentences. The attractiveness of automatic reference-free MT evaluation is obvious: it does not require any human effort or parallel data. To approach this task, Popović et al. (2011) exploit a bag-of-words translation model to estimate translation quality, summing over the likelihoods of aligned word pairs between source and translation texts. Specia et al. (2013) estimate translation quality using language-agnostic linguistic features extracted from source-language texts and system translations. Lo et al. (2014) introduce XMEANT as a cross-lingual reference-free variant of MEANT, a metric based on semantic frames. Lo (2019) extends this idea by leveraging M-BERT embeddings. The resulting metric, YiSi-2, evaluates system translations by summing similarity scores over word pairs that are best-aligned mutual translations. YiSi-2-SRL optionally adds a similarity score based on the alignment over semantic structures (e.g., semantic roles and frames). Both metrics are reference-free, but YiSi-2-SRL is not resource-lean, as it requires a semantic parser for both languages.

Recent progress in cross-lingual semantic similarity (Agirre et al., 2016; Cer et al., 2017) and unsupervised MT (Artetxe and Schwenk, 2019) has also led to novel reference-free metrics. For instance, Yankovskaya et al. (2019) propose to train a metric that combines multilingual embeddings extracted from M-BERT and LASER (Artetxe and Schwenk, 2019) with log-probability scores from neural machine translation. Our work differs from that of Yankovskaya et al. (2019) in one crucial aspect: the cross-lingual reference-free metrics that we investigate and benchmark do not require any human supervision.

Cross-lingual Representations. Cross-lingual text representations offer a prospect of modeling meaning across languages and supporting cross-lingual transfer for downstream tasks (Klementiev et al., 2012; Rücklé et al., 2018; Glavaš et al., 2019; Josifoski et al., 2019; Conneau et al., 2020). Most recently, (massively) multilingual encoders such as multilingual BERT (M-BERT) (Devlin et al., 2019), XLM-RoBERTa (Conneau et al., 2020), and the (sentence-based) LASER have profiled themselves as state-of-the-art solutions for (massively) multilingual semantic encoding of text. While LASER has been jointly trained on parallel data of 93 languages, M-BERT has been trained on the concatenation of monolingual data in more than 100 languages, without any cross-lingual mapping signal. There has been a recent vivid discussion of the cross-lingual abilities of M-BERT (Pires et al., 2019; K et al., 2020; Cao et al., 2020). In particular, Cao et al. (2020) show that M-BERT often yields disparate vector space representations for mutual translations and propose a multilingual re-mapping based on parallel corpora to remedy this issue. In this work, we introduce re-mapping solutions that are resource-leaner and require easy-to-obtain, limited-size word translation dictionaries rather than large parallel corpora.
3 Reference-Free MT Evaluation Metrics

In the following, we use x to denote a source sentence (i.e., a sequence of tokens in the source language), y to denote a system translation of x in the target language, and y* to denote the human reference translation for x.

3.1 Soft Token-Level Alignment

We start from MoverScore (Zhao et al., 2019), a recently proposed reference-based MT evaluation metric designed to measure the semantic similarity between system outputs (y) and human references (y*). It finds an optimal soft semantic alignment between the tokens of y and y* by minimizing the Word Mover's Distance (WMD) (Kusner et al., 2015).
In this work, we extend the MoverScore metric to operate in the cross-lingual setup, i.e., to measure the semantic similarity between n-grams (unigrams or bigrams) of the source text x and the system translation y, represented with embeddings originating from a cross-lingual semantic space.

First, we decompose the source text x into a sequence of n-grams, denoted by $x^n = (x^n_1, \dots, x^n_m)$, and do the same for the system translation y, denoting the resulting sequence of n-grams by $y^n$. Given $x^n$ and $y^n$, we define a distance matrix $C$ such that $C_{ij} = \lVert E(x^n_i) - E(y^n_j) \rVert_2$ is the distance between the i-th n-gram of x and the j-th n-gram of y, where E is a cross-lingual embedding function that maps text in different languages to a shared embedding space. For the function E, we experiment with cross-lingual representations induced (a) from static word embeddings with RCSLS (Joulin et al., 2018) and (b) with M-BERT (Devlin et al., 2019) as the multilingual encoder, with a focus on the latter.

The WMD between the two sequences of n-grams $x^n$ and $y^n$, with associated n-gram weights² $f_{x^n} \in \mathbb{R}^{|x^n|}$ and $f_{y^n} \in \mathbb{R}^{|y^n|}$, is defined as:

$$ m(x, y) := \mathrm{WMD}(x^n, y^n) = \min_{F} \sum_{ij} C_{ij} \cdot F_{ij}, \quad \text{s.t.} \;\; F \mathbf{1} = f_{x^n}, \;\; F^{\top} \mathbf{1} = f_{y^n}, $$

where $F \in \mathbb{R}^{|x^n| \times |y^n|}$ is a transportation matrix, with $F_{ij}$ denoting the amount of flow traveling from $x^n_i$ to $y^n_j$.

² We follow Zhao et al. (2019) in obtaining n-gram embeddings and their associated weights based on IDF.
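To make the cross-lingual WMD concrete, the following is a minimal sketch (ours, not the authors' released implementation), which solves the transportation problem above with SciPy's linear-programming solver. It assumes the caller supplies the n-gram embeddings produced by the cross-lingual encoder E and the IDF-based weights.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(src_emb, tgt_emb, src_weights, tgt_weights):
    """Word Mover's Distance between two sequences of n-gram embeddings.

    src_emb: (n, d) array of embeddings E(x^n_i) of the source n-grams.
    tgt_emb: (m, d) array of embeddings E(y^n_j) of the translation n-grams.
    src_weights, tgt_weights: IDF-based n-gram weights f_{x^n}, f_{y^n};
    normalized here so that both sides carry the same total mass.
    """
    fx = np.asarray(src_weights, dtype=float)
    fy = np.asarray(tgt_weights, dtype=float)
    fx, fy = fx / fx.sum(), fy / fy.sum()
    # Cost matrix C_ij: Euclidean distance between every source/target n-gram pair.
    C = np.linalg.norm(src_emb[:, None, :] - tgt_emb[None, :, :], axis=-1)
    n, m = C.shape
    # Linear program over the flattened flow matrix F: minimize <C, F>
    # subject to row sums F 1 = fx, column sums F^T 1 = fy, and F >= 0.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0       # sum_j F_ij = fx_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                # sum_i F_ij = fy_j
    b_eq = np.concatenate([fx, fy])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun  # lower values: source and translation are semantically closer
```

In practice a dedicated optimal-transport solver would typically replace the generic linear program, but the formulation above matches the constraints stated in the equation.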
3.2 Sentence-Level Semantic Similarity

In addition to measuring the semantic distance between x and y at the word level, one can also encode them into sentence representations with a multilingual sentence encoder such as LASER (Artetxe and Schwenk, 2019), and then measure their cosine distance:

$$ m(x, y) = 1 - \frac{E(x)^{\top} E(y)}{\lVert E(x) \rVert \cdot \lVert E(y) \rVert}. $$

3.3 Improving Cross-Lingual Alignments

Initial analysis indicated that, despite the multilingual pretraining of M-BERT (Devlin et al., 2019) and LASER (Artetxe and Schwenk, 2019), the monolingual subspaces of the multilingual spaces they induce are far from being semantically well-aligned, i.e., we obtain fairly distant vectors for mutual word or sentence translations.³ To remedy this, we apply two simple, weakly-supervised linear projection methods for post-hoc improvement of the cross-lingual alignments in these multilingual representation spaces.

³ LASER is jointly trained on parallel corpora of different languages, but for resource-lean language pairs the induced embeddings of mutual translations may still be far apart.

Notation. Let $D = \{(w^1_\ell, w^1_k), \dots, (w^n_\ell, w^n_k)\}$ be a set of matched word or sentence pairs from two different languages $\ell$ and $k$. We define a re-mapping function f such that any $f(E(w_\ell))$ and $E(w_k)$ are better aligned in the resulting shared vector space. We investigate two resource-lean choices for the re-mapping function f.

Linear Cross-lingual Projection (CLP). Following related work (Schuster et al., 2019), we re-map contextualized embedding spaces using a linear projection. Given $\ell$ and $k$, we stack the vectors of the source-language and target-language words for all pairs in D to form matrices $X_\ell$ and $X_k \in \mathbb{R}^{n \times d}$, with d the embedding dimension and n the number of word or sentence alignments. The word pairs used to calibrate M-BERT are extracted from EuroParl (Koehn, 2005) using FastAlign (Dyer et al., 2013), and the sentence pairs used to calibrate LASER are sampled directly from EuroParl.⁴ For M-BERT, we take the representations of the last transformer layer as the cross-lingual token representations. Mikolov et al. (2013a) propose to learn a projection matrix $W \in \mathbb{R}^{d \times d}$ by minimizing the Euclidean distance between the projected source-language vectors and their corresponding target-language vectors:

$$ \min_{W} \lVert W X_\ell - X_k \rVert_2 . $$

Xing et al. (2015) achieve further improvement on the task of bilingual lexicon induction (BLI) by constraining W to be an orthogonal matrix, i.e., $W^{\top} W = I$. This turns the optimization into the well-known Procrustes problem (Schönemann, 1966) with the closed-form solution:

$$ \hat{W} = U V^{\top}, \quad U \Sigma V^{\top} = \mathrm{SVD}(X_\ell X_k^{\top}). $$

⁴ While LASER requires large parallel corpora for pretraining, we believe that fine-tuning/calibrating the embeddings post-hoc requires far fewer data points.
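A minimal sketch of the CLP re-mapping follows (the function names are ours, not the released code). It is written for embeddings stacked row-wise, so the learned orthogonal map is applied as a right-multiplication; this matches the closed form above up to the transposition convention.

```python
import numpy as np

def fit_clp(X_src, X_tgt):
    """Orthogonal (Procrustes) projection aligning source to target embeddings.

    X_src, X_tgt: (n, d) matrices whose i-th rows are the embeddings of the
    i-th aligned word (or sentence) pair from the dictionary D.
    """
    u, _, vt = np.linalg.svd(X_src.T @ X_tgt)   # row-vector convention: SVD of X_src^T X_tgt
    return u @ vt                               # orthogonal W, i.e., W.T @ W = I

def apply_clp(X, W):
    """Re-map source-language embeddings into the target-language space."""
    return X @ W
```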
We note that the above CLP re-mapping has a known deficit: it requires the embedding spaces of the involved languages to be approximately isomorphic (Søgaard et al., 2018; Vulić et al., 2019). Recently, re-mapping methods that reportedly remedy this issue have been proposed (Cao et al., 2020; Glavaš and Vulić, 2020; Mohiuddin and Joty, 2020). We leave the investigation of these novel techniques for future work.

Universal Language Mismatch-Direction (UMD). Our second post-hoc linear alignment method is inspired by recent work on removing biases from distributional word vectors (Dev and Phillips, 2019; Lauscher et al., 2019). We adopt the same approach in order to quantify and remove the "language bias", i.e., the representation mismatch between mutual translations in the initial multilingual space. Formally, given $\ell$ and $k$, we create an individual misalignment vector $E(w^i_\ell) - E(w^i_k)$ for each bilingual pair in D. We then stack these individual vectors to form a matrix $Q \in \mathbb{R}^{n \times d}$ and obtain the global misalignment vector $v_B$ as the top left singular vector of Q. The global misalignment vector presumably captures the direction of the representational misalignment between the languages better than the individual (noisy) misalignment vectors $E(w^i_\ell) - E(w^i_k)$. Finally, we modify all vectors $E(w_\ell)$ and $E(w_k)$ by subtracting their projections onto the global misalignment direction $v_B$:

$$ f(E(w_\ell)) = E(w_\ell) - \cos(E(w_\ell), v_B)\, v_B . $$

Language Model. BLEU scores often fail to reflect the fluency of translated texts (Edunov et al., 2019). Hence, we use a language model (LM) of the target language to regularize the cross-lingual semantic similarity metrics, coupling our cross-lingual similarity scores with scores from a GPT language model of the target language (Radford et al., 2018). We expect the language model to penalize translationese, i.e., unnatural word-by-word translations, and thus to boost the performance of our metrics.⁵

⁵ We linearly combine the cross-lingual metrics with the LM scores using a coefficient of 0.1 for all setups. We choose this value based on initial experiments on one language pair.
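A sketch of the UMD correction under the same row-wise stacking assumption as above (the helper names are illustrative, not the authors' API). The text states v_B as the top left singular vector of Q; with the difference vectors stacked as rows, the corresponding d-dimensional direction is the first right singular vector, which is what this sketch extracts.

```python
import numpy as np

def fit_umd(X_src, X_tgt):
    """Estimate the global misalignment direction v_B from aligned embedding pairs."""
    Q = X_src - X_tgt                       # individual misalignment vectors, one per pair in D
    _, _, vt = np.linalg.svd(Q, full_matrices=False)
    return vt[0]                            # unit-norm dominant mismatch direction in R^d

def apply_umd(E, v_b):
    """f(E(w)) = E(w) - cos(E(w), v_B) * v_B, applied row-wise to embeddings E."""
    E = np.atleast_2d(np.asarray(E, dtype=float))
    v_b = v_b / np.linalg.norm(v_b)
    cos = (E @ v_b) / np.clip(np.linalg.norm(E, axis=1), 1e-12, None)
    return E - cos[:, None] * v_b
```

The "⊕ LM" variants used below linearly combine a similarity score with a target-side GPT score using the coefficient from footnote 5. The exact scaling of the LM score (e.g., a length-normalized log-probability) is not spelled out in the text, so the following only illustrates the assumed form of the combination.

```python
def combine_with_lm(similarity_score, lm_score, lam=0.1):
    # similarity_score: cross-lingual metric for (x, y); lm_score: target-side LM score for y.
    return similarity_score + lam * lm_score
```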
4 Experiments

In this section, we evaluate the quality of our reference-free MT metrics by comparing them with human judgments of translation quality.

Word-level metrics. We denote our word-level alignment metrics based on WMD as MoverScore-ngram + Align(Embedding), where Align is one of our two post-hoc cross-lingual alignment methods (CLP or UMD). For example, Mover-2 + UMD(M-BERT) denotes the metric that combines MoverScore based on bigram alignments with M-BERT embeddings and UMD as the post-hoc alignment method.

Sentence-level metrics. We denote our sentence-level metrics as Cosine + Align(Embedding). For example, Cosine + CLP(LASER) measures the cosine distance between the sentence embeddings obtained with LASER, post-hoc aligned with CLP.

4.1 Datasets

We collect the source-language sentences and their system and reference translations from the WMT17-19 news translation shared tasks (Bojar et al., 2017b; Ma et al., 2018b, 2019), which contain the predictions of 166 translation systems across 16 language pairs in WMT17, 149 translation systems across 14 language pairs in WMT18, and 233 translation systems across 18 language pairs in WMT19. We evaluate X-en language pairs, selecting X from a set of diverse languages: German (de), Chinese (zh), Czech (cs), Latvian (lv), Finnish (fi), Russian (ru), Turkish (tr), Gujarati (gu), Kazakh (kk), Lithuanian (lt), and Estonian (et). Each language pair in WMT17-19 has approximately 3,000 source sentences, each associated with one reference translation and with the automatic translations generated by the participating systems.

4.2 Baselines

We compare against a range of reference-free metrics: ibm1-morpheme and ibm1-pos4gram (Popović, 2012), LASIM (Yankovskaya et al., 2019), LP (Yankovskaya et al., 2019), YiSi-2 and YiSi-2-srl (Lo, 2019), as well as the reference-based baselines BLEU (Papineni et al., 2002), SentBLEU (Koehn et al., 2007), and chrF++ (Popović, 2017) for MT evaluation (see §2).⁶ The main results are reported on WMT17; the results obtained on WMT18 and WMT19 are given in the Appendix.

⁶ The code of these unsupervised metrics is not released; we therefore compare to their official results on WMT19 only.

4.3 Results

Figure 1 shows that our metric Mover-2 + CLP(M-BERT) ⊕ LM, which operates on M-BERT representations modified by the post-hoc re-mapping and is combined with a target-side LM, outperforms BLEU by 5.7 points in segment-level evaluation and achieves comparable performance in system-level evaluation. Figure 2 shows that the same metric obtains gains of 15 points (73.1 vs. 58.1), averaged over 7 languages, on system-level WMT19 compared to the state-of-the-art reference-free metric YiSi-2. Except for one language pair, gu-en, our metric performs on a par with the reference-based BLEU (see Table 8 in the Appendix).
[Figure 1: Averaged results of our best-performing metric and reference-based BLEU on WMT17; segment-level bars: 53.8 (this work) vs. 48.1 (BLEU); system-level bars: 93.3 and 90.1.]

[Figure 2: Averaged system-level results on WMT19 of our best-performing metric (73.1), together with the official results of reference-free metrics (YiSi-2: 58.1, LASIM: 56.2, ibm1-morpheme: 52.4, LP: 48.1, ibm1-pos4gram: 33.9) and reference-based BLEU (91.2).]

Metric                        cs-en  de-en  fi-en  lv-en  ru-en  tr-en  zh-en  Average
Reference-based, m(y*, y):
BLEU                          43.5   43.2   57.1   39.3   48.4   53.8   51.2   48.1
chrF++                        52.3   53.4   67.8   52.0   58.8   61.4   59.3   57.9
Reference-free, m(x, y), with original embeddings:
Mover-1 + M-BERT              22.7   37.1   34.8   26.0   26.7   42.5   48.2   34.0
Cosine + LASER                32.6   40.2   41.4   48.3   36.3   42.3   46.7   41.1
Cross-lingual alignment for sentence embeddings:
Cosine + CLP(LASER)           33.4   40.5   42.0   48.6   36.0   44.7   42.2   41.1
Cosine + UMD(LASER)           36.6   28.1   45.5   48.5   31.3   46.2   49.4   40.8
Cross-lingual alignment for word embeddings:
Mover-1 + RCSLS               18.9   26.4   31.9   33.1   25.7   31.1   34.3   28.8
Mover-1 + CLP(M-BERT)         33.4   38.6   50.8   48.0   33.9   51.6   53.2   44.2
Mover-2 + CLP(M-BERT)         33.7   38.8   52.2   50.3   35.4   51.0   53.3   45.0
Mover-1 + UMD(M-BERT)         22.3   38.1   34.5   30.5   31.2   43.5   48.6   35.5
Mover-2 + UMD(M-BERT)         23.1   38.9   37.1   34.7   33.0   44.8   48.9   37.2
Combining the language model:
Cosine + CLP(LASER) ⊕ LM      48.8   46.7   63.2   66.2   51.0   54.6   48.6   54.2
Cosine + UMD(LASER) ⊕ LM      49.4   46.2   64.7   66.4   51.1   56.0   52.8   55.2
Mover-2 + CLP(M-BERT) ⊕ LM    46.5   46.4   63.3   63.8   47.6   55.5   53.5   53.8
Mover-2 + UMD(M-BERT) ⊕ LM    41.8   46.8   60.4   59.8   46.1   53.8   52.4   51.6

Table 1: Pearson correlations with segment-level human judgments on the WMT17 dataset.

In Table 1, we exhaustively compare results for several of our metric variants, based either on M-BERT or on LASER. We note that re-mapping has a considerable effect for M-BERT (up to 10 points improvement), but much less so for LASER. We believe this is because the underlying embedding space of LASER is less 'misaligned', since it has been (pre-)trained on parallel data.⁷ While the re-mapping is thus effective for metrics based on M-BERT, we still require the target-side LM to outperform BLEU. We assume the LM can address challenges that the re-mapping apparently cannot handle properly; see the discussion in §5.1.

⁷ However, in the Appendix, we find that re-mapping LASER using 2k parallel sentences achieves considerable improvements for low-resource language pairs, e.g., kk-en (from -61.1 to 49.8) and lt-en (from 68.3 to 75.9); see Table 8.

Overall, we remark that none of our metric combinations performs consistently best. The reason may be that LASER and M-BERT are pretrained over hundreds of languages with substantial differences in corpus sizes, in addition to the different effects of the re-mapping. However, we observe that Mover-2 + CLP(M-BERT) performs best on average over all language pairs when the LM is not added.
When the LM is added, Mover-2 + CLP(M-BERT) ⊕ LM and Cosine + UMD(LASER) ⊕ LM perform comparably. This indicates that there may be a saturation effect when it comes to the LM, or that the LM coefficient should be tuned individually for each semantic similarity metric based on cross-lingual representations.

5 Analysis

We first analyze the preferences of our metrics based on M-BERT and LASER (§5.1), then examine how much parallel data we need for re-mapping the vector spaces (§5.2). Finally, we discuss whether it is legitimate to correlate our metric scores, which evaluate the similarity of system predictions and source texts, with human judgments based on system predictions and references (§5.3).

5.1 Metric Preferences

To analyze why our metrics based on M-BERT and LASER perform so badly in the task of reference-free MT evaluation, we query them for their preferences. In particular, for a fixed source sentence x, we consider two target sentences ỹ and ŷ and evaluate the following score difference:

$$ d(\tilde{y}, \hat{y}; x) := m(x, \tilde{y}) - m(x, \hat{y}) \qquad (1) $$

When d > 0, metric m prefers ỹ over ŷ given x; when d < 0, this relationship is reversed. In the following, we compare the preferences of our metrics for specifically modified target sentences ỹ over the human references y*. We choose ỹ to be (1) a random reordering of y*, to test that our metrics do not have the BOW (bag-of-words) property, and (2) a word-order-preserving translation of x (i.e., a human-written or automatic word-by-word translation, or the English y* reordered by a human to follow the source word order).
The latter tests for a preference for literal translations, a common property of MT systems.

Expert word-by-word translations and reordering. We had an expert (one of the co-authors) translate 50 German sentences word-by-word into English. Table 2 illustrates this scenario. We note how bad the word-by-word translations sometimes are, even for a closely related language pair such as German-English. For example, the word-by-word translations into English retain the original German verb-final positions, leading to quite ungrammatical English sentences.

x:            Dieser von Langsamkeit geprägte Lebensstil scheint aber ein Patentrezept für ein hohes Alter zu sein.
y*:           However, this slow pace of life seems to be the key to a long life.
y*-random:    To pace slow seems be the this life. life to a key however, of long
y*-reordered: This slow pace of life seems however the key to a long life to be.
x'-GT:        This from slowness embossed lifestyle seems but on nostrum for on high older to his.
x'-expert:    This of slow pace characterized life style seems however a patent recipe for a high age to be.

x:            Putin teilte aus und beschuldigte Ankara, Russland in den Rücken gefallen zu sein.
y*:           Mr Putin lashed out accusing Ankara of stabbing Moscow in the back.
y*-random:    Moscow accusing lashed Putin the in Ankara out, Mr of back. stabbing
y*-reordered: Mr Putin lashed out, accusing Ankara of Moscow in the back stabbing.
x'-GT:        Putin divided out and accused Ankara Russia in the move like to his.
x'-expert:    Putin lashed out and accused Ankara, Russia in the back fallen to be.

Table 2: Original German input sentence x, together with the English human reference y*, a randomly reordered (y*-random) and an expertly reordered (y*-reordered) English sentence, as well as word-by-word translations (x') of the German source, obtained either from a human expert or from Google Translate (GT).

Figure 3 shows histograms of the d statistic for the 50 selected sentences. It illustrates that both Mover + M-BERT and Cosine + LASER prefer the original human references over random reorderings, indicating that they are not BOW models, a reassuring finding. They are largely indifferent between correct English word order and the situation where the word order of the human reference is the same as in the German source. Finally, they strongly prefer the expert word-by-word translations over the human references.

[Figure 3: Histograms of d scores defined in Eq. (1) for metrics based on LASER and M-BERT. Left: metrics favor gold over randomly-shuffled human references. Middle: metrics are roughly indifferent between gold and reordered human references. Right: metrics favor expert word-by-word translations over gold human references.]

Taken together, this yields the following conclusions: (i) M-BERT and LASER appear to be mostly indifferent between correct target-language word order and the situation where source and target language have identical word order; (ii) this appears to make them prefer expert word-by-word translations the most: for a given source text, these have higher lexical overlap than human references and, in addition, they have a favorable target-language syntax, viz., one where the source and target word order are equal (confirmed by (i)). This explains why our metrics, by themselves and without a language model, do not perform well as reference-free MT evaluation metrics. More worryingly, it indicates that cross-lingual M-BERT and LASER are not robust to the 'adversarial inputs' produced by MT systems.

Automatic word-by-word translations. For a large-scale analysis across different language pairs, we resort to automatic word-by-word translations obtained from Google Translate (GT). To do so, we look up each word independently of context in GT to compile dictionaries. When a word has several translations, we keep the first one offered by GT. Due to this context-independence, the GT word-by-word translations are of much lower quality than the expert word-by-word translations, since they often pick the wrong word senses; e.g., the German word sein may either be a personal pronoun (his) or the infinitive to be, which would be selected correctly only by chance; cf. Table 2.

Instead of reporting histograms of d, we define a "W2W" statistic that counts the relative number of times that d(x', y*) is positive, where x' denotes the literal translation of x into the target language:

$$ \mathrm{W2W} := \frac{1}{N} \sum_{(x', y^{*})} \mathbb{I}\big( d(x', y^{*}) > 0 \big) \qquad (2) $$

Here, N normalizes W2W to lie in [0, 1], and a high W2W score indicates that the metric prefers translationese over human-written references.
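A small sketch of the two diagnostics in Eqs. (1) and (2). It assumes the metric is passed in as a function returning a similarity score (higher means preferred, as in the discussion above); for a distance-valued metric the sign of d would flip.

```python
import numpy as np

def d(metric, x, y_tilde, y_hat):
    """Eq. (1): positive if the metric scores y_tilde above y_hat for source x."""
    return metric(x, y_tilde) - metric(x, y_hat)

def w2w(metric, triples):
    """Eq. (2): fraction of cases in which the metric prefers the literal
    word-by-word translation x_lit over the human reference y_ref.

    triples: iterable of (x, x_lit, y_ref) with the source sentence x,
    its word-by-word translation x_lit, and the human reference y_ref.
    """
    wins = [d(metric, x, x_lit, y_ref) > 0 for x, x_lit, y_ref in triples]
    return float(np.mean(wins))
```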
Table 3 shows that reference-free metrics with the original embeddings (LASER and M-BERT) either still prefer literal over human translations (e.g., 70.2 for cs-en) or struggle to distinguish them. Re-mapping helps to a small degree. Only when combined with the LM scores do we obtain adequate W2W scores. Indeed, the LM is expected to capture unnatural word order in the target language and to penalize word-by-word translations by recognizing them as much less likely to appear in the language. Note that for expert word-by-word translations, we would expect the metrics to perform even worse.

Metric                       cs-en  de-en  fi-en
Cosine + LASER                70.2   65.7   53.9
Cosine + CLP(LASER)           70.7   64.8   53.7
Cosine + UMD(LASER)           67.5   59.5   52.9
Cosine + UMD(LASER) ⊕ LM       7.0    7.1    6.4
Mover-2 + M-BERT              61.8   50.2   45.9
Mover-2 + CLP(M-BERT)         44.6   44.5   32.0
Mover-2 + UMD(M-BERT)         54.5   44.3   39.6
Mover-2 + CLP(M-BERT) ⊕ LM     7.3   10.2    6.4

Table 3: W2W statistics for selected language pairs.

5.2 Size of Parallel Corpora

Figure 4 compares sentence- and word-level re-mapping trained with varying numbers of parallel sentences. Metrics based on M-BERT yield the highest correlations after re-mapping, even with a small amount of training data (1k). We observe that Cosine + CLP(LASER) and Mover-2 + CLP(M-BERT) show very similar trends: a sharp increase with increasing amounts of parallel data, which then quickly levels off.

[Figure 4: Pearson correlation (averaged over seven language pairs from WMT17) as a function of the number of parallel pairs used for re-mapping (x-axis, up to 10k), for Mover-2 + CLP/UMD(M-BERT) and Cosine + CLP/UMD(LASER), with the un-remapped Mover-2 + M-BERT and Cosine + LASER as baselines.]
However, the M-BERT-based Mover-2 reaches its peak and outperforms the original baseline with only 1k pairs, while LASER needs 2k pairs before beating the corresponding original baseline.

5.3 Human Judgments

The WMT datasets contain segment- and system-level human judgments that we use for evaluating the quality of our reference-free metrics. The segment-level judgments assign one direct assessment (DA) score to each pair of system and human translation, while system-level judgments associate each system with a single DA score averaged across all pairs in the dataset. We initially suspected the DA scores to be biased for our setup (which compares x with y), as they are based on comparing y* and y. Indeed, it is known that (especially) human professional translators "improve" y*, e.g., by making it more readable relative to the original x (Rabinovich et al., 2017). We investigated the validity of the DA scores by collecting human assessments in a cross-lingual setting (CLDA), where annotators directly compare source and translation pairs (x, y) from the WMT17 dataset. This small-scale manual analysis hints that DA scores are a valid proxy for CLDA. Therefore, we treat them as reliable scores for our setup and evaluate our proposed metrics by their correlation with DA scores.

6 Conclusion

Existing semantically-motivated metrics for reference-free evaluation of MT systems have so far displayed rather poor correlation with human estimates of translation quality. In this work, we investigate a range of reference-free metrics based on cutting-edge models for inducing cross-lingual semantic representations: cross-lingual (contextualized) word embeddings and cross-lingual sentence embeddings. We have identified some scenarios in which these metrics fail, prominently their inability to punish literal word-by-word translations (the so-called "translationese"). We have investigated two different mechanisms for mitigating this undesired phenomenon: (1) an additional (weakly-supervised) cross-lingual alignment step, reducing the mismatch between representations of mutual translations, and (2) language modeling (LM) on the target side, which is inherently equipped to punish "unnatural" sentences in the target language. We show that the reference-free coupling of cross-lingual similarity scores with the target-side language model surpasses the reference-based BLEU in segment-level MT evaluation.

We believe our results have two relevant implications. First, they portray the viability of reference-free MT evaluation and warrant wider research efforts in this direction. Second, they indicate that reference-free MT evaluation may be the challenging ("adversarial") evaluation task for multilingual text encoders, as it uncovers some of their shortcomings (prominently, the inability to capture semantically non-sensical word-by-word translations or paraphrases) which remain hidden in their common evaluation settings.

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions, which greatly improved the final version of the paper. This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) at the Technische Universität Darmstadt under grant No. GRK 1994/1. The contribution of Goran Glavaš is supported by the Eliteprogramm of the Baden-Württemberg-Stiftung, within the scope of the grant AGREE.

References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497-511, San Diego, California. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7:597-610.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017a. Results of the WMT17 metrics shared task. In Proceedings of the Conference on Machine Translation (WMT).
Ondřej Bojar, Yvette Graham, and Amir Kamran. Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2017b. Results of the WMT17 metrics shared 2013. A simple, fast, and effective reparameter- task. In Proceedings of the Second Conference on ization of IBM model 2. In Proceedings of the Machine Translation, pages 489–513, Copenhagen, 2013 Conference of the North American Chapter of Denmark. Association for Computational Linguis- the Association for Computational Linguistics: Hu- tics. man Language Technologies, pages 644–648, At- lanta, Georgia. Association for Computational Lin- Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Mul- guistics. tilingual alignment of contextual word representa- tions. In International Conference on Learning Rep- Sergey Edunov, Myle Ott, Marc’Aurelio Ranzato, and resentations. Michael Auli. 2019. On the evaluation of machine translation systems trained with back-translation. Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez- CoRR, abs/1908.05204. Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual Marina Fomicheva and Lucia Specia. 2016. Refer- and Crosslingual Focused Evaluation. In Proceed- ence bias in monolingual machine translation evalu- ings of the 11th International Workshop on Semantic ation. In 54th Annual Meeting of the Association for Evaluation (SemEval-2017), pages 1–14, Vancouver, Computational Linguistics, ACL 2016-Short Papers, Canada. Association for Computational Linguistics. pages 77–82. ACL Home Association for Computa- tional Linguistics. Elizabeth Clark, Asli Celikyilmaz, and Noah A Smith. 2019. Sentence Mover’s Similarity: Automatic Goran Glavaš, Robert Litschko, Sebastian Ruder, and Evaluation for Multi-Sentence Texts. In Proceed- Ivan Vulić. 2019. How to (properly) evaluate cross- ings of the 57th Annual Meeting of the Association lingual word embeddings: On strong baselines, com- for Computational Linguistics, pages 2748–2760, parative analyses, and some misconceptions. In Pro- Florence, Italy. Association for Computational Lin- ceedings of the 57th Annual Meeting of the Associa- guistics. tion for Computational Linguistics, pages 710–721. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Goran Glavaš and Ivan Vulić. 2020. Non-linear Guzmán, Edouard Grave, Myle Ott, Luke Zettle- instance-based cross-lingual mapping for moyer, and Veselin Stoyanov. 2020. Unsupervised non-isomorphic embedding spaces. In Proceedings cross-lingual representation learning at scale. In of ACL. Proceedings of ACL. Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Sunipa Dev and Jeff M. Phillips. 2019. Attenuating Adopt Syntactic Level Paraphrase Knowledge into bias in word vectors. In The 22nd International Machine Translation Evaluation. In Proceedings of Conference on Artificial Intelligence and Statistics, the Fourth Conference on Machine Translation (Vol- AISTATS 2019, 16-18 April 2019, Naha, Okinawa, ume 2: Shared Task Papers, Day 1), pages 501–506, Japan, pages 879–887. Florence, Italy. Association for Computational Lin- guistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra- deep bidirectional transformers for language under- ham Neubig, Orhan Firat, and Melvin Johnson. standing. In Proceedings of the 2019 Conference 2020. 
XTREME: A massively multilingual multi- of the North American Chapter of the Association task benchmark for evaluating cross-lingual general- for Computational Linguistics: Human Language ization. CoRR, abs/2003.11080. Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- Melvin Johnson, Mike Schuster, Quoc Le, Maxim ation for Computational Linguistics. Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Tho- rat, Fernand a Vià c gas, Martin Wattenberg, Greg George Doddington. 2002. Automatic Evaluation Corrado, Macduff Hughes, and Jeffrey Dean. 2017. of Machine Translation Quality Using N-gram Co- Google’s multilingual neural machine translation occurrence Statistics. In Proceedings of the Sec- system: Enabling zero-shot translation. Transac- ond International Conference on Human Language tions of the Association for Computational Linguis- Technology Research, HLT ’02, pages 138–145, San tics, 5:339–351. Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Martin Josifoski, Ivan S Paskov, Hristo S Paskov, Mar- tin Jaggi, and Robert West. 2019. Crosslingual doc- Markus Dreyer and Daniel Marcu. 2012. Hyter: ument embedding as reduced-rank ridge regression. Meaning-equivalent semantics for translation evalu- In Proceedings of the Twelfth ACM International ation. In Proceedings of the 2012 Conference of the Conference on Web Search and Data Mining, pages North American Chapter of the Association for Com- 744–752. putational Linguistics: Human Language Technolo- gies, pages 162–171. Association for Computational Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Linguistics. Hervé Jégou, and Edouard Grave. 2018. Loss in
translation: Learning bilingual word mapping with Chi-kiu Lo, Meriem Beloucif, Markus Saers, and a retrieval criterion. In Proceedings of the 2018 Dekai Wu. 2014. XMEANT: Better semantic MT Conference on Empirical Methods in Natural Lan- evaluation without reference translations. In Pro- guage Processing, pages 2979–2984, Brussels, Bel- ceedings of the 52nd Annual Meeting of the Associa- gium. Association for Computational Linguistics. tion for Computational Linguistics (Volume 2: Short Papers), pages 765–771, Baltimore, Maryland. As- Karthikeyan K, Zihan Wang, Stephen Mayhew, and sociation for Computational Linguistics. Dan Roth. 2020. Cross-lingual ability of multilin- gual bert: An empirical study. In International Con- Qingsong Ma, Ondrej Bojar, and Yvette Graham. ference on Learning Representations. 2018a. Results of the WMT18 metrics shared task. In Proceedings of the Third Conference on Machine Alexandre Klementiev, Ivan Titov, and Binod Bhat- Translation (WMT). tarai. 2012. Inducing crosslingual distributed rep- resentations of words. In Proceedings of COLING Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2012, pages 1459–1474, Mumbai, India. The COL- 2018b. Results of the WMT18 metrics shared task: ING 2012 Organizing Committee. Both characters and embeddings achieve good per- formance. In Proceedings of the Third Conference Philipp Koehn. 2005. Europarl: A parallel corpus for on Machine Translation: Shared Task Papers, pages statistical machine translation. In MT summit, vol- 671–688, Belgium, Brussels. Association for Com- ume 5, pages 79–86. Citeseer. putational Linguistics. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Callison-Burch, Marcello Federico, Nicola Bertoldi, Graham. 2019. Results of the WMT19 Metrics Brooke Cowan, Wade Shen, Christine Moran, Shared Task: Segment-Level and Strong MT Sys- Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra tems Pose Big Challenges. In Proceedings of the Constantin, and Evan Herbst. 2007. Moses: Open Fourth Conference on Machine Translation (Volume source toolkit for statistical machine translation. In 2: Shared Task Papers, Day 1), pages 62–90, Flo- Proceedings of the 45th Annual Meeting of the As- rence, Italy”. Association for Computational Lin- sociation for Computational Linguistics Companion guistics. Volume Proceedings of the Demo and Poster Ses- Nitika Mathur, Timothy Baldwin, and Trevor Cohn. sions, pages 177–180, Prague, Czech Republic. As- 2019. Putting Evaluation in Context: Contextual sociation for Computational Linguistics. Embeddings Improve Machine Translation Evalua- tion. In Proceedings of the 57th Annual Meeting Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian of the Association for Computational Linguistics, Weinberger. 2015. From word embeddings to doc- pages 2799–2808, Florence, Italy. Association for ument distances. In International conference on ma- Computational Linguistics. chine learning, pages 957–966. Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Anne Lauscher, Goran Glavaš, Simone Paolo Ponzetto, Exploiting similarities among languages for ma- and Ivan Vulić. 2019. A general framework for im- chine translation. CoRR, abs/1309.4168. plicit and explicit debiasing of distributional word vector spaces. arXiv preprint arXiv:1909.06092. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed rep- Alon Lavie and Abhaya Agarwal. 2007. 
Meteor: An resentations of words and phrases and their com- automatic metric for mt evaluation with high levels positionality. In Advances in Neural Information of correlation with human judgments. In Proceed- Processing Systems 26: 27th Annual Conference on ings of the Second Workshop on Statistical Machine Neural Information Processing Systems 2013. Pro- Translation, pages 228–231. Association for Compu- ceedings of a meeting held December 5-8, 2013, tational Linguistics. Lake Tahoe, Nevada, United States., pages 3111– 3119. Chin-Yew Lin and Eduard Hovy. 2003. Auto- matic evaluation of summaries using n-gram co- Bari Saiful M Mohiuddin, Tasnim and Shafiq Joty. occurrence statistics. In Proceedings of the 2003 Hu- 2020. Lnmap: Departures from isomorphic as- man Language Technology Conference of the North sumption in bilingual lexicon induction through American Chapter of the Association for Computa- non-linear mapping in latent space. CoRR, tional Linguistics, pages 150–157. abs/1309.4168. Chi-kiu Lo. 2019. YiSi - a Unified Semantic MT Qual- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- ity Evaluation and Estimation Metric for Languages Jing Zhu. 2002. BLEU: A Method for Automatic with Different Levels of Available Resources. In Evaluation of Machine Translation. In Proceedings Proceedings of the Fourth Conference on Machine of the 40th Annual Meeting on Association for Com- Translation (Volume 2: Shared Task Papers, Day putational Linguistics, ACL ’02, pages 311–318, 1), pages 507–513, Florence, Italy. Association for Stroudsburg, PA, USA. Association for Computa- Computational Linguistics. tional Linguistics.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Tal Schuster, Ori Ram, Regina Barzilay, and Amir Gardner, Christopher Clark, Kenton Lee, and Luke Globerson. 2019. Cross-lingual alignment of con- Zettlemoyer. 2018. Deep contextualized word rep- textual word embeddings, with applications to zero- resentations. In Proceedings of the North American shot dependency parsing. In Proceedings of the Chapter of the Association for Computational Lin- 2019 Conference of the North American Chapter of guistics (NAACL). the Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long and Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. Short Papers), pages 1599–1613, Minneapolis, Min- How multilingual is multilingual BERT? In Pro- nesota. Association for Computational Linguistics. ceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 4996– Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru 5001, Florence, Italy. Association for Computa- Komachi. 2018. RUSE: Regressor using sentence tional Linguistics. embeddings for automatic machine translation eval- uation. In Proceedings of the Third Conference on Maja Popović. 2012. Morpheme- and POS-based Machine Translation (WMT). IBM1 and language model scores for translation quality estimation. In Proceedings of the Seventh Anders Søgaard, Sebastian Ruder, and Ivan Vulić. Workshop on Statistical Machine Translation, pages 2018. On the limitations of unsupervised bilingual 133–137, Montréal, Canada. Association for Com- dictionary induction. In Proceedings of the 56th An- putational Linguistics. nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778– Maja Popović. 2017. chrF++: Words Helping Char- 788, Melbourne, Australia. Association for Compu- acter N-grams. In Proceedings of the Second Con- tational Linguistics. ference on Machine Translation, pages 612–618, Copenhagen, Denmark. Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Ma- chine translation evaluation versus quality estima- Maja Popović, David Vilar, Eleftherios Avramidis, and tion. Machine translation, 24(1):39–50. Aljoscha Burchardt. 2011. Evaluation without refer- ences: IBM1 scores as evaluation metrics. In Pro- Lucia Specia, Kashif Shah, Jose G.C. de Souza, and ceedings of the Sixth Workshop on Statistical Ma- Trevor Cohn. 2013. QuEst - a translation quality es- chine Translation, pages 99–103, Edinburgh, Scot- timation framework. In Proceedings of the 51st An- land. Association for Computational Linguistics. nual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 79–84, Sofia, Bulgaria. Association for Computational Lin- Ella Rabinovich, Noam Ordan, and Shuly Wintner. guistics. 2017. Found in translation: Reconstructing phylo- genetic language trees from translations. In Proceed- Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. ings of the 55th Annual Meeting of the Association Abstractive document summarization with a graph- for Computational Linguistics (Volume 1: Long Pa- based attentional neural model. In Proceedings pers), pages 530–540, Vancouver, Canada. Associa- of the 55th Annual Meeting of the Association for tion for Computational Linguistics. Computational Linguistics (Volume 1: Long Papers), pages 1171–1181. Association for Computational Alec Radford, Karthik Narasimhan, Tim Salimans, Linguistics. and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. 
URL Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Ko- https://s3-us-west-2. amazonaws. com/openai- rhonen. 2019. Do we really need fully unsuper- assets/researchcovers/languageunsupervised/language vised cross-lingual embeddings? In Proceedings of understanding paper. pdf. the 2019 Conference on Empirical Methods in Nat- ural Language Processing and the 9th International Andreas Rücklé, Steffen Eger, Maxime Peyrard, and Joint Conference on Natural Language Processing Iryna Gurevych. 2018. Concatenated power mean (EMNLP-IJCNLP), pages 4398–4409. word embeddings as universal cross-lingual sen- tence representations. arXiv. Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal trans- Alexander M. Rush, Sumit Chopra, and Jason Weston. form for bilingual word translation. In Proceedings 2015. A neural attention model for abstractive sen- of the 2015 Conference of the North American Chap- tence summarization. In Proceedings of the 2015 ter of the Association for Computational Linguistics: Conference on Empirical Methods in Natural Lan- Human Language Technologies, pages 1006–1011, guage Processing, pages 379–389. Association for Denver, Colorado. Association for Computational Computational Linguistics. Linguistics. Peter H Schönemann. 1966. A generalized solution of Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy the orthogonal procrustes problem. Psychometrika, Guo, Jax Law, Noah Constant, Gustavo Hernan- 31(1):1–10. dez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan
Sung, et al. 2019. Multilingual universal sen- tence encoder for semantic retrieval. arXiv preprint arXiv:1907.04307. Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality Estimation and Translation Metrics via Pre-trained Word and Sentence Embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 101–105, Florence, Italy. Association for Computational Linguistics. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with BERT. CoRR, abs/1904.09675. Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Chris- tian M. Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized em- beddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China. Association for Computational Linguistics.
7 Appendix

7.1 Zero-shot Transfer to Resource-lean Languages

Our metric allows for estimating translation quality on new domains. However, the evaluation is limited to those languages covered by the multilingual embeddings. This is a major drawback for low-resource languages; e.g., Gujarati is not included in LASER. To this end, we take multilingual USE (Yang et al., 2019) as an illustrating example, which covers only 16 languages (in our sample, Czech, Latvian, and Finnish are not included in USE). We re-align the corresponding embedding spaces with our re-mapping functions to induce evaluation metrics even for these languages, using only 2k translation pairs. Table 4 shows that our metric with a composition of re-mapping functions can raise the correlation from zero to 0.10 for cs-en and to 0.18 for lv-en. However, for one language pair, fi-en, the correlation only goes from negative to roughly zero, indicating that this approach does not always work. This observation warrants further investigation.

Metric                       cs-en    fi-en    lv-en
BLEU                         0.849    0.834    0.946
Cosine + LAS                -0.001   -0.149    0.019
Cosine + CLP(USE)            0.072   -0.068    0.109
Cosine + UMD(USE)            0.056   -0.061    0.113
Cosine + CLP ◦ UMD(USE)      0.089   -0.030    0.162
Cosine + UMD ◦ CLP(USE)      0.102   -0.007    0.180

Table 4: Pearson correlation of metrics on the segment-level WMT17 dataset. '◦' marks the composition of two re-mapping functions.
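For completeness, a tiny sketch of how a composed re-mapping such as UMD ◦ CLP in Table 4 could be realized with the helpers sketched in §3.3 (fit_clp, fit_umd, apply_umd are our illustrative functions, not the released code). The order of application implied by the '◦' notation is read here as standard function composition, which is an assumption.

```python
def fit_umd_after_clp(X_src, X_tgt):
    """UMD ∘ CLP: Procrustes-align first, then remove the residual global
    mismatch direction. Reuses the fit_clp, fit_umd, and apply_umd sketches
    from Section 3.3; target-side embeddings would only receive the UMD step.
    """
    W = fit_clp(X_src, X_tgt)
    v_b = fit_umd(X_src @ W, X_tgt)

    def remap(E):                  # applied to new source-language embeddings
        return apply_umd(E @ W, v_b)

    return remap
```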