Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation
Samuel Läubli1   Rico Sennrich1,2   Martin Volk1
1 Institute of Computational Linguistics, University of Zurich, {laeubli,volk}@cl.uzh.ch
2 School of Informatics, University of Edinburgh, rico.sennrich@ed.ac.uk

arXiv:1808.07048v1 [cs.CL] 21 Aug 2018

Abstract

Recent research suggests that neural machine translation achieves parity with professional human translation on the WMT Chinese–English news translation task. We empirically test this claim with alternative evaluation protocols, contrasting the evaluation of single sentences and entire documents. In a pairwise ranking experiment, human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents as compared to isolated sentences. Our findings emphasise the need to shift towards document-level evaluation as machine translation improves to the degree that errors which are hard or impossible to spot at the sentence level become decisive in discriminating the quality of different translation outputs.

1 Introduction

Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has become the de-facto standard in machine translation, outperforming earlier phrase-based approaches in many data settings and shared translation tasks (Luong and Manning, 2015; Sennrich et al., 2016; Cromieres et al., 2016). Some recent results suggest that neural machine translation "approaches the accuracy achieved by average bilingual human translators [on some test sets]" (Wu et al., 2016), or even that its "translation quality is at human parity when compared to professional human translators" (Hassan et al., 2018). Claims of human parity in machine translation are certainly extraordinary, and require extraordinary evidence.1 Laudably, Hassan et al. (2018) have released their data publicly to allow external validation of their claims. Their claims are further strengthened by the fact that they follow best practices in the human evaluation of machine translation, using evaluation protocols and tools that are also used at the yearly Conference on Machine Translation (WMT) (Bojar et al., 2017), and take great care in guarding against confounds such as test set selection and rater inconsistency.

However, the implications of a statistical tie between two machine translation systems in a shared translation task are less severe than those of a statistical tie between a machine translation system and a professional human translator, so we consider the results worthy of further scrutiny. We perform an independent evaluation of the professional translation and the best machine translation system that were found to be of equal quality by Hassan et al. (2018). Our main interest lies in the evaluation protocol, and we empirically investigate whether the lack of document-level context could explain the inability of human raters to find a quality difference between human and machine translations. We test the following hypothesis:

    A professional translator who is asked to rank the quality of two candidate translations on the document level will prefer a professional human translation over a machine translation.

Note that our hypothesis is slightly different from that tested by Hassan et al. (2018), which could be phrased as follows:

    A bilingual crowd worker who is asked to directly assess the quality of candidate translations on the sentence level will prefer a professional human translation over a machine translation.

1 The term "parity" may raise the expectation that there is evidence for equivalence, but the term is used by Hassan et al. (2018) in the sense of "there [being] no statistical significance between [two outputs] for a test set of candidate translations". Still, we consider this finding noteworthy given the strong evaluation setup.
As such, our evaluation is not a direct replication of that by Hassan et al. (2018), and a failure to reproduce their findings does not imply an error on either our or their part. Rather, we hope to indirectly assess the accuracy of different evaluation protocols. Our underlying assumption is that professional human translation is still superior to neural machine translation, but that the sensitivity of human raters to these quality differences depends on the evaluation protocol.

2 Human Evaluation of Machine Translation

Machine translation is typically evaluated by comparing system outputs to source texts, reference translations, other system outputs, or a combination thereof (for examples, see Bojar et al., 2016a). The scientific community concentrates on two aspects: adequacy, typically assessed by bilinguals, and target language fluency, typically assessed by monolinguals. Evaluation protocols have been subject to controversy for decades (e.g., Van Slype, 1979), and we identify three aspects with particular relevance to assessing human parity: granularity of measurement (ordinal vs. interval scales), raters (experts vs. crowd workers), and experimental unit (sentence vs. document).

2.1 Related Work

Granularity of Measurement  Callison-Burch et al. (2007) show that ranking (Which of these translations is better?) leads to better inter-rater agreement than absolute judgement on 5-point Likert scales (How good is this translation?), but gives no insight into how much a candidate translation differs from a (presumably perfect) reference. To this end, Graham et al. (2013) suggest the use of continuous scales for direct assessment of translation quality. Implemented as a slider between 0 (Not at all) and 100 (Perfectly), their method yields scores on a 100-point interval scale in practice (Bojar et al., 2016b, 2017), with each rater's ratings being standardised to increase homogeneity. Hassan et al. (2018) use source-based direct assessment to avoid bias towards reference translations. In the shared task evaluation by Cettolo et al. (2017), raters are shown the source and a candidate text and asked: How accurately does the above candidate text convey the semantics of the source text? In doing so, they have translations produced by humans and machines rated independently, and parity is assumed if the mean score of the former does not differ significantly from the mean score of the latter.

Raters  To optimise cost, machine translation quality is typically assessed by means of crowdsourcing. Combined ratings of bilingual crowd workers have been shown to be more reliable than automatic metrics and "very similar" to ratings produced by "experts"2 (Callison-Burch, 2009). Graham et al. (2017) compare crowdsourced to "expert" ratings of machine translations from WMT 2012, concluding that, with proper quality control, "machine translation systems can indeed be evaluated by the crowd alone." However, it is unclear whether this finding carries over to translations produced by NMT systems where, due to increased fluency, errors are more difficult to identify (Castilho et al., 2017a), and concurrent work by Toral et al. (2018) highlights the importance of expert translators for MT evaluation.

Experimental Unit  Machine translation evaluation is predominantly performed on single sentences, presented to raters in random order (e.g., Bojar et al., 2017; Cettolo et al., 2017). There are two main reasons for this. The first is cost: if raters assess entire documents, obtaining the same number of data points in an evaluation campaign multiplies the cost by the average number of sentences per document. The second is experimental validity: when comparing systems that produce sentences without considering document-level context, the perceived suprasentential cohesion of a system output is likely due to randomness and thus a confounding factor. While incorporating document-level context into machine translation systems is an active field of research (Webber et al., 2017), state-of-the-art systems still operate at the level of single sentences (Sennrich et al., 2017; Vaswani et al., 2017; Hassan et al., 2018). In contrast, human translators can and do take document-level context into account (Krings, 1986). The same holds for raters in evaluation campaigns. In the discussion of their results, Wu et al. (2016) note that their raters "[did] not necessarily fully understand each randomly sampled sentence sufficiently" because it was provided with no context. In such setups, raters cannot reward textual cohesion and coherence.

2 "Experts" here are computational linguists who develop MT systems, who may not be expert translators.
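The per-rater standardisation mentioned above is typically a z-score transformation of each rater's raw 0–100 direct-assessment scores, so that systematically strict or lenient raters become comparable. The sketch below is our own illustration of that idea (the column names and the use of pandas are assumptions, not part of the WMT tooling):

```python
import pandas as pd

def standardise_per_rater(ratings: pd.DataFrame) -> pd.DataFrame:
    """Z-standardise raw direct-assessment scores separately for each rater.

    Expects columns 'rater' and 'score'; adds a 'z' column in which every
    rater's scores have mean 0 and standard deviation 1.
    """
    by_rater = ratings.groupby("rater")["score"]
    out = ratings.copy()
    out["z"] = (out["score"] - by_rater.transform("mean")) / by_rater.transform("std")
    return out

# Toy example: one strict and one lenient rater judging the same two items.
scores = pd.DataFrame({
    "rater": ["r1", "r1", "r2", "r2"],
    "item":  ["s1", "s2", "s1", "s2"],
    "score": [60, 40, 90, 70],
})
print(standardise_per_rater(scores))  # both raters yield z = +/-0.71 for s1/s2
```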
2.2 Our Evaluation Protocol

We conduct a quality evaluation experiment with a 2 × 2 mixed factorial design, testing the effect of source text availability (adequacy, fluency) and experimental unit (sentence, document) on ratings by professional translators.

Granularity of Measurement  We elicit judgements by means of pairwise ranking. Raters choose the better (with ties allowed) of two translations for each item: one produced by a professional translator (HUMAN), the other by machine translation (MT). Since our evaluation includes human translation, it is reference-free. We evaluate in two conditions: adequacy, where raters see source texts and translations (Which translation expresses the meaning of the source text more adequately?), and fluency, where raters only see translations (Which text is better English?).

Raters  We recruit professional translators, only considering individuals with at least three years of professional experience and positive client reviews.

Experimental Unit  To test the effect of context on perceived translation quality, raters evaluate entire documents as well as single sentences in random order (i.e., context is a within-subjects factor). They are shown both translations (HUMAN and MT) for each unit; the source text is only shown in the adequacy condition.

Quality Control  To hedge against random ratings, we convert 5 documents and 16 sentences per set into spam items (Kittur et al., 2008): we render one of the two options nonsensical by shuffling its words randomly, except for 10 % at the beginning and end.

Statistical Analysis  We test for statistically significant preference of HUMAN over MT, or vice versa, by means of two-sided Sign Tests. Let a be the number of ratings in favour of MT, b the number of ratings in favour of HUMAN, and t the number of ties. We report the number of successes x and the number of trials n for each test, such that x = b and n = a + b.3

3 Emerson and Simon (1979) suggest the inclusion of ties such that x = b + 0.5t and n = a + b + t. This modification has no effect on the significance levels reported in this paper.

2.3 Data Collection

We use the experimental protocol described in the previous section for a quality assessment of Chinese to English translations of news articles. To this end, we randomly sampled 55 documents and 2 × 120 sentences from the WMT 2017 test set. We only considered the 123 articles (documents) which are native Chinese,4 containing 8.13 sentences on average. Human and machine translations (REFERENCE-HT as HUMAN, and COMBO-6 as MT) were obtained from data released by Hassan et al. (2018).5

The sampled documents and sentences were rated by professional translators we recruited from ProZ:6 4 raters native in Chinese (2), English (1), or both (1) to rate adequacy, and 4 raters native in English to rate fluency. On average, translators had 13.7 years of experience and 8.8 positive client reviews on ProZ, and received US$ 188.75 for rating 55 documents and 120 sentences.

The averages reported above include an additional translator we recruited when one rater showed poor performance on document-level spam items in the fluency condition; we exclude that rater's judgements from analysis. We also exclude sentence-level results from 4 raters because of overlap with the documents they annotated, which means that we cannot rule out that their sentence-level decisions were informed by access to the full document. To allow for external validation and further experimentation, we make all experimental data publicly available.7

4 While it is common practice in machine translation to use the same test set in both translation directions, we consider a direct comparison between human "translation" and machine translation hard to interpret if one is in fact the original English text, and the other an automatic translation into English of a human translation into Chinese. In concurrent work, Toral et al. (2018) expand on the confounding effect of evaluating text where the target side is actually the original document.
5 http://aka.ms/Translator-HumanParityData
6 https://www.proz.com
7 https://github.com/laeubli/parity
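The spam items described under Quality Control in Section 2.2 can be generated with a few lines of code: shuffle the words of one candidate translation while leaving roughly 10 % of the tokens at the beginning and end untouched. The sketch below is our own reconstruction under the assumption of simple whitespace tokenisation; the authors' actual implementation may differ in details such as tokenisation and rounding.

```python
import math
import random

def make_spam_item(translation: str, keep_ratio: float = 0.1, seed: int = 0) -> str:
    """Render a candidate translation nonsensical by shuffling its words,
    keeping `keep_ratio` of the tokens at the beginning and at the end in place."""
    tokens = translation.split()
    k = math.ceil(len(tokens) * keep_ratio)  # tokens preserved on each side
    head, middle, tail = tokens[:k], tokens[k:len(tokens) - k], tokens[len(tokens) - k:]
    random.Random(seed).shuffle(middle)      # only the middle of the segment is scrambled
    return " ".join(head + middle + tail)

print(make_spam_item("The new app lets drivers ask the owner of a blocking car to move it"))
```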
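The two-sided Sign Test from the Statistical Analysis paragraph reduces to an exact binomial test on the non-tied ratings. A minimal sketch with SciPy (`binomtest` requires SciPy ≥ 1.7; the function and variable names are ours):

```python
from scipy.stats import binomtest

def sign_test(a: int, b: int) -> float:
    """Two-sided Sign Test for paired preference ratings.

    a: ratings in favour of MT, b: ratings in favour of HUMAN.
    Ties are dropped, i.e. x = b successes out of n = a + b trials under the
    null hypothesis of no preference (p = 0.5). Emerson and Simon (1979)
    alternatively count ties as half successes (x = b + 0.5t, n = a + b + t);
    footnote 3 notes that this does not change the reported significance levels.
    """
    return binomtest(k=b, n=a + b, p=0.5, alternative="two-sided").pvalue

# Sentence-level adequacy totals from Table 1 in the appendix: a = 103, b = 86.
print(sign_test(a=103, b=86))  # ~ .244, the value reported in Section 3
```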
3 Results

In the adequacy condition, MT and HUMAN are not statistically significantly different on the sentence level (x = 86, n = 189, p = .244). This is consistent with the results Hassan et al. (2018) obtained with an alternative evaluation protocol (crowdsourcing and direct assessment; see Section 2.1). However, when evaluating entire documents, raters show a statistically significant preference for HUMAN (x = 104, n = 178, p < .05). While the number of ties is similar in sentence- and document-level evaluation, preference for MT drops from 50 to 37 % in the latter (Figure 1a).

In the fluency condition, raters prefer HUMAN on both the sentence (x = 106, n = 172, p < .01) and document level (x = 99, n = 143, p < .001). In contrast to adequacy, fluency ratings in favour of HUMAN are similar in sentence- and document-level evaluation, but raters find more ties with document-level context as preference for MT drops from 32 to 22 % (Figure 1b).

We note that these large effect sizes lead to statistical significance despite modest sample size. Inter-annotator agreement (Cohen's κ) ranges from 0.13 to 0.32 (see Appendix for full results and discussion).

Figure 1 (bar charts of preference in % for MT, Tie, and HUMAN; panel (a) adequacy, panel (b) fluency; sentence-level N=208, document-level N=200): Raters prefer human translation more strongly in entire documents. When evaluating isolated sentences in terms of adequacy, there is no statistically significant difference between HUMAN and MT; in all other settings, raters show a statistically significant preference for HUMAN.

4 Discussion

Our results emphasise the need for suprasentential context in human evaluation of machine translation. Starting with Hassan et al.'s (2018) finding of no statistically significant difference in translation quality between HUMAN and MT for their Chinese–English test set, we set out to test this result with an alternative evaluation protocol which we expected to strengthen the ability of raters to judge translation quality. We employed professional translators instead of crowd workers, and pairwise ranking instead of direct assessment, but in a sentence-level evaluation of adequacy, raters still found it hard to discriminate between HUMAN and MT: they did not show a statistically significant preference for either of them.

Conversely, we observe a tendency to rate HUMAN more favourably on the document level than on the sentence level, even within single raters. Adequacy raters show a statistically significant preference for HUMAN when evaluating entire documents. We hypothesise that document-level evaluation unveils errors such as mistranslation of an ambiguous word, or errors related to textual cohesion and coherence, which remain hard or impossible to spot in a sentence-level evaluation. For a subset of articles, we elicited both sentence-level and document-level judgements, and inspected articles for which sentence-level judgements were mixed, but where HUMAN was strongly preferred in document-level evaluation. In these articles, we do indeed observe the hypothesised phenomena. We find an example of lexical coherence in a 6-sentence article about a new app, "微信挪车", which HUMAN consistently translates into "WeChat Move the Car". In MT, we find three different translations in the same article: "Twitter Move Car", "WeChat mobile", and "WeChat Move". Other observations include the use of more appropriate discourse connectives in HUMAN, a more detailed investigation of which we leave to future work.

To our surprise, fluency raters show a stronger preference for HUMAN than adequacy raters (Figure 1). The main strength of neural machine translation in comparison to previous statistical approaches was found to be increased fluency, while adequacy improvements were less clear (Bojar et al., 2016b; Castilho et al., 2017b), and we expected a similar pattern in our evaluation. Does this indicate that adequacy is in fact a strength of MT, not fluency?
We are wary of jumping to this conclusion. An alternative interpretation is that MT, which tends to be more literal than HUMAN, is judged more favourably by raters in the bilingual condition, where the majority of raters are native speakers of the source language, because of L1 interference. We note that the availability of document-level context still has a strong impact in the fluency condition (Section 3).

5 Conclusions

In response to recent claims of parity between human and machine translation, we have empirically tested the impact of sentence- and document-level context on human assessment of machine translation. Raters showed a markedly stronger preference for human translations when evaluating at the level of documents, as compared to an evaluation of single, isolated sentences.

We believe that our findings have several implications for machine translation research. Most importantly, if we accept our interpretation that human translation is indeed of higher quality in the dataset we tested, this points to a failure of current best practices in machine translation evaluation. As machine translation quality improves, translations will become harder to discriminate in terms of quality, and it may be time to shift towards document-level evaluation, which gives raters more context to understand the original text and its translation, and also exposes translation errors related to discourse phenomena which remain invisible in a sentence-level evaluation.

Our evaluation protocol was designed with the aim of providing maximal validity, which is why we chose to use professional translators and pairwise ranking. For future work, it would be of high practical relevance to test whether we can also elicit accurate quality judgements on the document level via crowdsourcing and direct assessment, or via alternative evaluation protocols. The data released by Hassan et al. (2018) could serve as a test bed to this end.

One reason why document-level evaluation widens the quality gap between machine translation and human translation is that the machine translation system we tested still operates on the sentence level, ignoring wider context. It will be interesting to explore to what extent existing and future techniques for document-level machine translation can narrow this gap. We expect that this will require further efforts in creating document-level training data, designing appropriate models, and supporting research with discourse-aware automatic metrics.

Acknowledgements

We thank Xin Sennrich for her help with the analysis of translation errors. We also thank Antonio Toral and the anonymous reviewers for their helpful comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR, San Diego, CA.

Ondřej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016a. Ten years of WMT evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop "Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem", pages 27–34, Portorož, Slovenia.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of WMT, pages 169–214, Copenhagen, Denmark.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016b. Findings of the 2016 Conference on Machine Translation (WMT16). In Proceedings of WMT, pages 131–198, Berlin, Germany.

Chris Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of EMNLP, pages 286–295, Singapore.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of WMT, pages 136–158, Prague, Czech Republic.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017a. Is Neural Machine Translation the New State of the Art? The Prague Bulletin of Mathematical Linguistics, 108:109–120.
Sheila Castilho, Joss Moorkens, Federico Gaspari, Rico Sennrich, Vilelmini Sosoni, Yota Georgakopoulou, Pintu Lohar, Andy Way, Antonio Valerio Miceli Barone, and Maria Gialama. 2017b. A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators. In Proceedings of MT Summit, Nagoya, Japan.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of IWSLT, pages 2–14, Tokyo, Japan.

Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University participation to WAT 2016. In Proceedings of WAT, pages 166–174, Osaka, Japan.

John D. Emerson and Gary A. Simon. 1979. Another Look at the Sign Test When Ties Are Present: The Problem of Confidence Intervals. The American Statistician, 33(3):140–142.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, pages 33–41, Sofia, Bulgaria.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. Computing Research Repository, arXiv:1803.05567.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of EMNLP, pages 1700–1709, Seattle, WA.

Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing User Studies with Mechanical Turk. In Proceedings of CHI, pages 453–456, Florence, Italy.

Hans P. Krings. 1986. Was in den Köpfen von Übersetzern vorgeht. Gunter Narr, Tübingen, Germany.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of IWSLT, Da Nang, Vietnam.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's Neural MT Systems for WMT17. In Proceedings of WMT, pages 389–399, Copenhagen, Denmark.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of WMT, pages 368–373, Berlin, Germany.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS, pages 3104–3112, Montreal, Canada.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation. In Proceedings of WMT, Brussels, Belgium.

Georges Van Slype. 1979. Critical study of methods for evaluating the quality of machine translation. Research report BR 19142, prepared for the Commission of the European Communities, Bureau Marcel van Dijk, Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NIPS, pages 5998–6008, Long Beach, CA.

Bonnie Webber, Andrei Popescu-Belis, and Jörg Tiedemann, editors. 2017. Proceedings of the Third Workshop on Discourse in Machine Translation. Copenhagen, Denmark.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Computing Research Repository, arXiv:1609.08144.
A Further Statistical Analysis

Table 1 shows detailed results, including those of individual raters, for all four experimental conditions. Raters choose between three labels for each item: MT is better than HUMAN (a), HUMAN is better than MT (b), or tie (t). Table 3 lists inter-rater agreement. Besides percent agreement (same label), we calculate Cohen's kappa coefficient

    κ = (P(A) − P(E)) / (1 − P(E)),    (1)

where P(A) is the proportion of times that two raters agree, and P(E) the likelihood of agreement by chance. We calculate Cohen's kappa, and specifically P(E), as in WMT (Bojar et al., 2016b, Section 3.3), on the basis of all pairwise ratings across all raters.

In pairwise rankings of machine translation outputs, κ coefficients typically centre around 0.3 (Bojar et al., 2016b). We observe lower inter-rater agreement in three out of four conditions, and attribute this to two reasons. Firstly, the quality of the machine translations produced by Hassan et al. (2018) is high, making it difficult to discriminate from professional translation, particularly at the sentence level. Secondly, we do not provide guidelines detailing error severity and thus assume that raters have differing interpretations of what constitutes a "better" or "worse" translation. Confusion matrices in Table 4 indicate that raters handle ties very differently: in document-level adequacy, for example, rater E assigns no ties at all, while rater F rates 15 out of 50 items as ties (Table 4g). The assignment of ties is more uniform in documents assessed for fluency (Tables 1, 4a–4f), leading to higher κ in this condition (Table 3).

Despite low inter-annotator agreement, the quality control we apply shows that raters assess items carefully: they only miss 1 out of 40 and 5 out of 128 spam items in the document- and sentence-level conditions overall, respectively, a very low number compared to crowdsourced work (Kittur et al., 2008). All of these misses are ties (i.e., not marking spam items as "better", but rather as equally bad as their counterpart), and 5 out of 9 raters (A, B1, B2, D, F) do not miss a single spam item.

A common procedure in situations where inter-rater agreement is low is to aggregate ratings of different annotators (Graham et al., 2017). As shown in Table 2, majority voting leads to clearer discrimination between MT and HUMAN in all conditions except sentence-level adequacy.

Table 1: Ratings by rater and condition. Fields marked * (greyed out in the original) indicate that raters had access to full documents for which we elicited sentence-level judgements; these are not considered for total results.

                 Document                 Sentence
Rater        MT   Tie  Human          MT   Tie  Human
Fluency
A            13     8     29          30    32     42
B1            –     –      –          36     4     64
B2            8    18     24           –     –      –
C            12    14     24          40*   14*    50*
D            11    17     22          32*   30*    42*
total        44    57     99          66    36    106
Adequacy
E            26     0     24          59     3     42
F            10    15     25          44    16     44
G            18     4     28          38*   23*    43*
H            20     3     27          38*   11*    55*
total        74    22    104         103    19     86

Table 2: Aggregation of ratings (%).

                 Document                 Sentence
Aggregation  MT   Tie  Human          MT   Tie  Human
Fluency
Average      22    29     50          32    17     51
Majority     24    10     66          26    23     51
Adequacy
Average      37    11     52          50     9     41
Majority     32    18     50          38    32     31

Table 3: Inter-rater agreement.

             Document   Sentence
Fluency
Same label     55 %       45 %
Cohen's κ      0.32       0.13
Adequacy
Same label     49 %       50 %
Cohen's κ      0.13       0.14
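To make Equation 1 concrete, the sketch below computes κ with P(E) estimated from the pooled label distribution over pairwise ratings, in the spirit of the WMT computation cited above. It is our own illustration (not the authors' code), and it is applied here to a single rater pair, whereas the paper pools ratings across all raters.

```python
from collections import Counter
from itertools import chain

def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_rater1, label_rater2) tuples.

    Labels: 'a' (MT better), 't' (tie), 'b' (HUMAN better).
    P(A) is the observed agreement; P(E) is chance agreement derived from
    the pooled label distribution (cf. Bojar et al., 2016b, Section 3.3).
    """
    p_a = sum(x == y for x, y in pairs) / len(pairs)
    label_counts = Counter(chain.from_iterable(pairs))
    total = sum(label_counts.values())
    p_e = sum((c / total) ** 2 for c in label_counts.values())
    return (p_a - p_e) / (1 - p_e)

# Confusion matrix (a) from Table 4, raters A (rows) and B2 (columns):
matrix = {("a", "a"): 5, ("a", "t"): 4, ("a", "b"): 4,
          ("t", "a"): 1, ("t", "t"): 5, ("t", "b"): 2,
          ("b", "a"): 2, ("b", "t"): 9, ("b", "b"): 18}
pairs = [labels for labels, n in matrix.items() for _ in range(n)]
print(round(cohen_kappa(pairs), 2))  # ~ 0.28 for this single pair of raters
```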
Table 4: Confusion matrices: MT is better than HUMAN (a), HUMAN is better than MT (b), or tie (t). Participant IDs (A–H) are the same as in Table 1. In each matrix, rows hold the labels of the first-named rater and columns those of the second, in the order a, t, b.

(a) fluency, document, N=50, A × B2:   a: 5 4 4   | t: 1 5 2   | b: 2 9 18
(b) fluency, document, N=50, A × C:    a: 7 2 4   | t: 2 4 2   | b: 3 8 18
(c) fluency, document, N=50, A × D:    a: 6 3 4   | t: 2 6 0   | b: 3 8 18
(d) fluency, document, N=50, B2 × C:   a: 5 1 2   | t: 4 5 9   | b: 3 8 13
(e) fluency, document, N=50, B2 × D:   a: 6 1 1   | t: 3 7 8   | b: 2 9 13
(f) fluency, document, N=50, C × D:    a: 7 3 2   | t: 1 7 6   | b: 3 7 14
(g) adequacy, document, N=50, E × F:   a: 4 9 13  | t: 0 0 0   | b: 6 6 12
(h) adequacy, document, N=50, E × G:   a: 9 4 13  | t: 0 0 0   | b: 9 0 15
(i) adequacy, document, N=50, E × H:   a: 11 1 14 | t: 0 0 0   | b: 9 2 13
(j) adequacy, document, N=50, F × G:   a: 7 1 2   | t: 7 1 7   | b: 4 2 19
(k) adequacy, document, N=50, F × H:   a: 6 1 3   | t: 8 0 7   | b: 6 2 17
(l) adequacy, document, N=50, G × H:   a: 11 2 5  | t: 1 1 2   | b: 8 0 20
(m) fluency, sentence, N=104, A × B1:  a: 16 1 13 | t: 10 1 21 | b: 10 2 30
(n) adequacy, sentence, N=104, E × F:  a: 31 6 22 | t: 2 0 1   | b: 11 10 21