Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation
Samuel Läubli1   Rico Sennrich1,2   Martin Volk1
1 Institute of Computational Linguistics, University of Zurich, {laeubli,volk}@cl.uzh.ch
2 School of Informatics, University of Edinburgh, rico.sennrich@ed.ac.uk

arXiv:1808.07048v1 [cs.CL] 21 Aug 2018

Abstract

Recent research suggests that neural machine translation achieves parity with professional human translation on the WMT Chinese–English news translation task. We empirically test this claim with alternative evaluation protocols, contrasting the evaluation of single sentences and entire documents. In a pairwise ranking experiment, human raters assessing adequacy and fluency show a stronger preference for human over machine translation when evaluating documents as compared to isolated sentences. Our findings emphasise the need to shift towards document-level evaluation as machine translation improves to the degree that errors which are hard or impossible to spot at the sentence level become decisive in discriminating the quality of different translation outputs.

1 Introduction

Neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015) has become the de-facto standard in machine translation, outperforming earlier phrase-based approaches in many data settings and shared translation tasks (Luong and Manning, 2015; Sennrich et al., 2016; Cromieres et al., 2016). Some recent results suggest that neural machine translation "approaches the accuracy achieved by average bilingual human translators [on some test sets]" (Wu et al., 2016), or even that its "translation quality is at human parity when compared to professional human translators" (Hassan et al., 2018). Claims of human parity in machine translation are certainly extraordinary, and require extraordinary evidence.1 Laudably, Hassan et al. (2018) have released their data publicly to allow external validation of their claims. Their claims are further strengthened by the fact that they follow best practices in the human evaluation of machine translation, using evaluation protocols and tools that are also used at the yearly Conference on Machine Translation (WMT) (Bojar et al., 2017), and take great care in guarding against confounds such as test set selection and rater inconsistency.

However, the implications of a statistical tie between two machine translation systems in a shared translation task are less severe than those of a statistical tie between a machine translation system and a professional human translator, so we consider the results worthy of further scrutiny. We perform an independent evaluation of the professional translation and the best machine translation system that were found to be of equal quality by Hassan et al. (2018). Our main interest lies in the evaluation protocol, and we empirically investigate whether the lack of document-level context could explain the inability of human raters to find a quality difference between human and machine translations. We test the following hypothesis:

    A professional translator who is asked to rank the quality of two candidate translations on the document level will prefer a professional human translation over a machine translation.

Note that our hypothesis is slightly different from that tested by Hassan et al. (2018), which could be phrased as follows:

    A bilingual crowd worker who is asked to directly assess the quality of candidate translations on the sentence level will prefer a professional human translation over a machine translation.

1 The term "parity" may raise the expectation that there is evidence for equivalence, but the term is used by Hassan et al. (2018) in the sense of "there [being] no statistical significance between [two outputs] for a test set of candidate translations". Still, we consider this finding noteworthy given the strong evaluation setup.
As such, our evaluation is not a direct replication of that by Hassan et al. (2018), and a failure to reproduce their findings does not imply an error on either our or their part. Rather, we hope to indirectly assess the accuracy of different evaluation protocols. Our underlying assumption is that professional human translation is still superior to neural machine translation, but that the sensitivity of human raters to these quality differences depends on the evaluation protocol.

2 Human Evaluation of Machine Translation

Machine translation is typically evaluated by comparing system outputs to source texts, reference translations, other system outputs, or a combination thereof (for examples, see Bojar et al., 2016a). The scientific community concentrates on two aspects: adequacy, typically assessed by bilinguals, and target language fluency, typically assessed by monolinguals. Evaluation protocols have been subject to controversy for decades (e.g., Van Slype, 1979), and we identify three aspects with particular relevance to assessing human parity: granularity of measurement (ordinal vs. interval scales), raters (experts vs. crowd workers), and experimental unit (sentence vs. document).

2.1 Related Work

Granularity of Measurement  Callison-Burch et al. (2007) show that ranking (Which of these translations is better?) leads to better inter-rater agreement than absolute judgement on 5-point Likert scales (How good is this translation?), but gives no insight into how much a candidate translation differs from a (presumably perfect) reference. To this end, Graham et al. (2013) suggest the use of continuous scales for direct assessment of translation quality. Implemented as a slider between 0 (Not at all) and 100 (Perfectly), their method yields scores on a 100-point interval scale in practice (Bojar et al., 2016b, 2017), with each rater's ratings being standardised to increase homogeneity. Hassan et al. (2018) use source-based direct assessment to avoid bias towards reference translations. In the shared task evaluation by Cettolo et al. (2017), raters are shown the source and a candidate text and asked: How accurately does the above candidate text convey the semantics of the source text? In doing so, they have translations produced by humans and machines rated independently, and parity is assumed if the mean score of the former does not differ significantly from the mean score of the latter.

Raters  To optimise cost, machine translation quality is typically assessed by means of crowdsourcing. Combined ratings of bilingual crowd workers have been shown to be more reliable than automatic metrics and "very similar" to ratings produced by "experts"2 (Callison-Burch, 2009). Graham et al. (2017) compare crowdsourced to "expert" ratings of machine translations from WMT 2012, concluding that, with proper quality control, "machine translation systems can indeed be evaluated by the crowd alone." However, it is unclear whether this finding carries over to translations produced by NMT systems where, due to increased fluency, errors are more difficult to identify (Castilho et al., 2017a), and concurrent work by Toral et al. (2018) highlights the importance of expert translators for MT evaluation.

Experimental Unit  Machine translation evaluation is predominantly performed on single sentences, presented to raters in random order (e.g., Bojar et al., 2017; Cettolo et al., 2017). There are two main reasons for this. The first is cost: if raters assess entire documents, obtaining the same number of data points in an evaluation campaign multiplies the cost by the average number of sentences per document. The second is experimental validity: when comparing systems that produce sentences without considering document-level context, the perceived suprasentential cohesion of a system output is likely due to randomness and thus a confounding factor. While incorporating document-level context into machine translation systems is an active field of research (Webber et al., 2017), state-of-the-art systems still operate at the level of single sentences (Sennrich et al., 2017; Vaswani et al., 2017; Hassan et al., 2018). In contrast, human translators can and do take document-level context into account (Krings, 1986). The same holds for raters in evaluation campaigns. In the discussion of their results, Wu et al. (2016) note that their raters "[did] not necessarily fully understand each randomly sampled sentence sufficiently" because it was provided with no context. In such setups, raters cannot reward textual cohesion and coherence.

2 "Experts" here are computational linguists who develop MT systems, who may not be expert translators.
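The per-rater standardisation mentioned above is typically a z-score transformation of each rater's raw 0–100 direct-assessment scores, so that systematically strict or lenient raters become comparable. The sketch below is our own illustration of that idea (the column names and the use of pandas are assumptions, not part of the WMT tooling):

```python
import pandas as pd

def standardise_per_rater(ratings: pd.DataFrame) -> pd.DataFrame:
    """Z-standardise raw direct-assessment scores separately for each rater.

    Expects columns 'rater' and 'score'; adds a 'z' column in which every
    rater's scores have mean 0 and standard deviation 1.
    """
    by_rater = ratings.groupby("rater")["score"]
    out = ratings.copy()
    out["z"] = (out["score"] - by_rater.transform("mean")) / by_rater.transform("std")
    return out

# Toy example: one strict and one lenient rater judging the same two items.
scores = pd.DataFrame({
    "rater": ["r1", "r1", "r2", "r2"],
    "item":  ["s1", "s2", "s1", "s2"],
    "score": [60, 40, 90, 70],
})
print(standardise_per_rater(scores))  # both raters yield z = +/-0.71 for s1/s2
```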
2.2 Our Evaluation Protocol

We conduct a quality evaluation experiment with a 2 × 2 mixed factorial design, testing the effect of source text availability (adequacy, fluency) and experimental unit (sentence, document) on ratings by professional translators.

Granularity of Measurement  We elicit judgements by means of pairwise ranking. Raters choose the better (with ties allowed) of two translations for each item: one produced by a professional translator (HUMAN), the other by machine translation (MT). Since our evaluation includes human translation, it is reference-free. We evaluate in two conditions: adequacy, where raters see source texts and translations (Which translation expresses the meaning of the source text more adequately?), and fluency, where raters only see translations (Which text is better English?).

Raters  We recruit professional translators, only considering individuals with at least three years of professional experience and positive client reviews.

Experimental Unit  To test the effect of context on perceived translation quality, raters evaluate entire documents as well as single sentences in random order (i.e., context is a within-subjects factor). They are shown both translations (HUMAN and MT) for each unit; the source text is only shown in the adequacy condition.

Quality Control  To hedge against random ratings, we convert 5 documents and 16 sentences per set into spam items (Kittur et al., 2008): we render one of the two options nonsensical by shuffling its words randomly, except for 10 % at the beginning and end.

Statistical Analysis  We test for statistically significant preference of HUMAN over MT, or vice versa, by means of two-sided Sign Tests. Let a be the number of ratings in favour of MT, b the number of ratings in favour of HUMAN, and t the number of ties. We report the number of successes x and the number of trials n for each test, such that x = b and n = a + b.3

3 Emerson and Simon (1979) suggest the inclusion of ties such that x = b + 0.5t and n = a + b + t. This modification has no effect on the significance levels reported in this paper.

2.3 Data Collection

We use the experimental protocol described in the previous section for a quality assessment of Chinese to English translations of news articles. To this end, we randomly sampled 55 documents and 2 × 120 sentences from the WMT 2017 test set. We only considered the 123 articles (documents) which are native Chinese,4 containing 8.13 sentences on average. Human and machine translations (REFERENCE-HT as HUMAN, and COMBO-6 as MT) were obtained from data released by Hassan et al. (2018).5

The sampled documents and sentences were rated by professional translators we recruited from ProZ:6 4 raters native in Chinese (2), English (1), or both (1) to rate adequacy, and 4 raters native in English to rate fluency. On average, translators had 13.7 years of experience and 8.8 positive client reviews on ProZ, and received US$ 188.75 for rating 55 documents and 120 sentences.

The averages reported above include an additional translator we recruited when one rater showed poor performance on document-level spam items in the fluency condition; we exclude that rater's judgements from analysis. We also exclude sentence-level results from 4 raters because of overlap with the documents they annotated, which means that we cannot rule out that their sentence-level decisions were informed by access to the full document. To allow for external validation and further experimentation, we make all experimental data publicly available.7

4 While it is common practice in machine translation to use the same test set in both translation directions, we consider a direct comparison between human "translation" and machine translation hard to interpret if one is in fact the original English text, and the other an automatic translation into English of a human translation into Chinese. In concurrent work, Toral et al. (2018) expand on the confounding effect of evaluating text where the target side is actually the original document.
5 http://aka.ms/Translator-HumanParityData
6 https://www.proz.com
7 https://github.com/laeubli/parity
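The spam items described under Quality Control in Section 2.2 can be generated with a few lines of code: shuffle the words of one candidate translation while leaving roughly 10 % of the tokens at the beginning and end untouched. The sketch below is our own reconstruction under the assumption of simple whitespace tokenisation; the authors' actual implementation may differ in details such as tokenisation and rounding.

```python
import math
import random

def make_spam_item(translation: str, keep_ratio: float = 0.1, seed: int = 0) -> str:
    """Render a candidate translation nonsensical by shuffling its words,
    keeping `keep_ratio` of the tokens at the beginning and at the end in place."""
    tokens = translation.split()
    k = math.ceil(len(tokens) * keep_ratio)  # tokens preserved on each side
    head, middle, tail = tokens[:k], tokens[k:len(tokens) - k], tokens[len(tokens) - k:]
    random.Random(seed).shuffle(middle)      # only the middle of the segment is scrambled
    return " ".join(head + middle + tail)

print(make_spam_item("The new app lets drivers ask the owner of a blocking car to move it"))
```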
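The two-sided Sign Test from the Statistical Analysis paragraph reduces to an exact binomial test on the non-tied ratings. A minimal sketch with SciPy (`binomtest` requires SciPy ≥ 1.7; the function and variable names are ours):

```python
from scipy.stats import binomtest

def sign_test(a: int, b: int) -> float:
    """Two-sided Sign Test for paired preference ratings.

    a: ratings in favour of MT, b: ratings in favour of HUMAN.
    Ties are dropped, i.e. x = b successes out of n = a + b trials under the
    null hypothesis of no preference (p = 0.5). Emerson and Simon (1979)
    alternatively count ties as half successes (x = b + 0.5t, n = a + b + t);
    footnote 3 notes that this does not change the reported significance levels.
    """
    return binomtest(k=b, n=a + b, p=0.5, alternative="two-sided").pvalue

# Sentence-level adequacy totals from Table 1 in the appendix: a = 103, b = 86.
print(sign_test(a=103, b=86))  # ~ .244, the value reported in Section 3
```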
3 Results

In the adequacy condition, MT and HUMAN are not statistically significantly different on the sentence level (x = 86, n = 189, p = .244). This is consistent with the results Hassan et al. (2018) obtained with an alternative evaluation protocol (crowdsourcing and direct assessment; see Section 2.1). However, when evaluating entire documents, raters show a statistically significant preference for HUMAN (x = 104, n = 178, p < .05). While the number of ties is similar in sentence- and document-level evaluation, preference for MT drops from 50 to 37 % in the latter (Figure 1a).

In the fluency condition, raters prefer HUMAN on both the sentence (x = 106, n = 172, p < .01) and document level (x = 99, n = 143, p < .001). In contrast to adequacy, fluency ratings in favour of HUMAN are similar in sentence- and document-level evaluation, but raters find more ties with document-level context as preference for MT drops from 32 to 22 % (Figure 1b).

We note that these large effect sizes lead to statistical significance despite modest sample size. Inter-annotator agreement (Cohen's κ) ranges from 0.13 to 0.32 (see Appendix for full results and discussion).

Figure 1 (bar charts of preference in % for MT, Tie, and HUMAN; panel (a) adequacy, panel (b) fluency; sentence-level N=208, document-level N=200): Raters prefer human translation more strongly in entire documents. When evaluating isolated sentences in terms of adequacy, there is no statistically significant difference between HUMAN and MT; in all other settings, raters show a statistically significant preference for HUMAN.

4 Discussion

Our results emphasise the need for suprasentential context in human evaluation of machine translation. Starting with Hassan et al.'s (2018) finding of no statistically significant difference in translation quality between HUMAN and MT for their Chinese–English test set, we set out to test this result with an alternative evaluation protocol which we expected to strengthen the ability of raters to judge translation quality. We employed professional translators instead of crowd workers, and pairwise ranking instead of direct assessment, but in a sentence-level evaluation of adequacy, raters still found it hard to discriminate between HUMAN and MT: they did not show a statistically significant preference for either of them.

Conversely, we observe a tendency to rate HUMAN more favourably on the document level than on the sentence level, even within single raters. Adequacy raters show a statistically significant preference for HUMAN when evaluating entire documents. We hypothesise that document-level evaluation unveils errors such as mistranslation of an ambiguous word, or errors related to textual cohesion and coherence, which remain hard or impossible to spot in a sentence-level evaluation. For a subset of articles, we elicited both sentence-level and document-level judgements, and inspected articles for which sentence-level judgements were mixed, but where HUMAN was strongly preferred in document-level evaluation. In these articles, we do indeed observe the hypothesised phenomena. We find an example of lexical coherence in a 6-sentence article about a new app, "微信挪车", which HUMAN consistently translates into "WeChat Move the Car". In MT, we find three different translations in the same article: "Twitter Move Car", "WeChat mobile", and "WeChat Move". Other observations include the use of more appropriate discourse connectives in HUMAN, a more detailed investigation of which we leave to future work.

To our surprise, fluency raters show a stronger preference for HUMAN than adequacy raters (Figure 1). The main strength of neural machine translation in comparison to previous statistical approaches was found to be increased fluency, while adequacy improvements were less clear (Bojar et al., 2016b; Castilho et al., 2017b), and we expected a similar pattern in our evaluation. Does this indicate that adequacy is in fact a strength of MT, not fluency?
We are wary of jumping to this conclusion. An alternative interpretation is that MT, which tends to be more literal than HUMAN, is judged more favourably by raters in the bilingual condition, where the majority of raters are native speakers of the source language, because of L1 interference. We note that the availability of document-level context still has a strong impact in the fluency condition (Section 3).

5 Conclusions

In response to recent claims of parity between human and machine translation, we have empirically tested the impact of sentence- and document-level context on human assessment of machine translation. Raters showed a markedly stronger preference for human translations when evaluating at the level of documents, as compared to an evaluation of single, isolated sentences.

We believe that our findings have several implications for machine translation research. Most importantly, if we accept our interpretation that human translation is indeed of higher quality in the dataset we tested, this points to a failure of current best practices in machine translation evaluation. As machine translation quality improves, translations will become harder to discriminate in terms of quality, and it may be time to shift towards document-level evaluation, which gives raters more context to understand the original text and its translation, and also exposes translation errors related to discourse phenomena which remain invisible in a sentence-level evaluation.

Our evaluation protocol was designed with the aim of providing maximal validity, which is why we chose to use professional translators and pairwise ranking. For future work, it would be of high practical relevance to test whether we can also elicit accurate quality judgements on the document level via crowdsourcing and direct assessment, or via alternative evaluation protocols. The data released by Hassan et al. (2018) could serve as a test bed to this end.

One reason why document-level evaluation widens the quality gap between machine translation and human translation is that the machine translation system we tested still operates on the sentence level, ignoring wider context. It will be interesting to explore to what extent existing and future techniques for document-level machine translation can narrow this gap. We expect that this will require further efforts in creating document-level training data, designing appropriate models, and supporting research with discourse-aware automatic metrics.

Acknowledgements

We thank Xin Sennrich for her help with the analysis of translation errors. We also thank Antonio Toral and the anonymous reviewers for their helpful comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR, San Diego, CA.

Ondřej Bojar, Christian Federmann, Barry Haddow, Philipp Koehn, Matt Post, and Lucia Specia. 2016a. Ten years of WMT evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop "Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem", pages 27–34, Portorož, Slovenia.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of WMT, pages 169–214, Copenhagen, Denmark.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016b. Findings of the 2016 Conference on Machine Translation (WMT16). In Proceedings of WMT, pages 131–198, Berlin, Germany.

Chris Callison-Burch. 2009. Fast, cheap, and creative: Evaluating translation quality using Amazon's Mechanical Turk. In Proceedings of EMNLP, pages 286–295, Singapore.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of WMT, pages 136–158, Prague, Czech Republic.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017a. Is Neural Machine Translation the New State of the Art? The Prague Bulletin of Mathematical Linguistics, 108:109–120.
Sheila Castilho, Joss Moorkens, Federico Gaspari, Rico Sennrich, Vilelmini Sosoni, Yota Georgakopoulou, Pintu Lohar, Andy Way, Antonio Valerio Miceli Barone, and Maria Gialama. 2017b. A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators. In Proceedings of MT Summit, Nagoya, Japan.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 Evaluation Campaign. In Proceedings of IWSLT, pages 2–14, Tokyo, Japan.

Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto University participation to WAT 2016. In Proceedings of WAT, pages 166–174, Osaka, Japan.

John D. Emerson and Gary A. Simon. 1979. Another Look at the Sign Test When Ties Are Present: The Problem of Confidence Intervals. The American Statistician, 33(3):140–142.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, pages 33–41, Sofia, Bulgaria.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. Computing Research Repository, arXiv:1803.05567.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of EMNLP, pages 1700–1709, Seattle, WA.

Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing User Studies with Mechanical Turk. In Proceedings of CHI, pages 453–456, Florence, Italy.

Hans P. Krings. 1986. Was in den Köpfen von Übersetzern vorgeht. Gunter Narr, Tübingen, Germany.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of IWSLT, Da Nang, Vietnam.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's Neural MT Systems for WMT17. In Proceedings of WMT, pages 389–399, Copenhagen, Denmark.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of WMT, pages 368–373, Berlin, Germany.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS, pages 3104–3112, Montreal, Canada.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation. In Proceedings of WMT, Brussels, Belgium.

Georges Van Slype. 1979. Critical study of methods for evaluating the quality of machine translation. Research report BR 19142, prepared for the Commission of the European Communities, Bureau Marcel van Dijk, Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NIPS, pages 5998–6008, Long Beach, CA.

Bonnie Webber, Andrei Popescu-Belis, and Jörg Tiedemann, editors. 2017. Proceedings of the Third Workshop on Discourse in Machine Translation. Copenhagen, Denmark.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Computing Research Repository, arXiv:1609.08144.
A Further Statistical Analysis

Table 1 shows detailed results, including those of individual raters, for all four experimental conditions. Raters choose between three labels for each item: MT is better than HUMAN (a), HUMAN is better than MT (b), or tie (t). Table 3 lists inter-rater agreement. Besides percent agreement (same label), we calculate Cohen's kappa coefficient

    κ = (P(A) − P(E)) / (1 − P(E)),    (1)

where P(A) is the proportion of times that two raters agree, and P(E) the likelihood of agreement by chance. We calculate Cohen's kappa, and specifically P(E), as in WMT (Bojar et al., 2016b, Section 3.3), on the basis of all pairwise ratings across all raters.

In pairwise rankings of machine translation outputs, κ coefficients typically centre around 0.3 (Bojar et al., 2016b). We observe lower inter-rater agreement in three out of four conditions, and attribute this to two reasons. Firstly, the quality of the machine translations produced by Hassan et al. (2018) is high, making it difficult to discriminate from professional translation, particularly at the sentence level. Secondly, we do not provide guidelines detailing error severity and thus assume that raters have differing interpretations of what constitutes a "better" or "worse" translation. Confusion matrices in Table 4 indicate that raters handle ties very differently: in document-level adequacy, for example, rater E assigns no ties at all, while rater F rates 15 out of 50 items as ties (Table 4g). The assignment of ties is more uniform in documents assessed for fluency (Tables 1, 4a–4f), leading to higher κ in this condition (Table 3).

Despite low inter-annotator agreement, the quality control we apply shows that raters assess items carefully: they only miss 1 out of 40 and 5 out of 128 spam items in the document- and sentence-level conditions overall, respectively, a very low number compared to crowdsourced work (Kittur et al., 2008). All of these misses are ties (i.e., not marking spam items as "better", but rather as equally bad as their counterpart), and 5 out of 9 raters (A, B1, B2, D, F) do not miss a single spam item.

A common procedure in situations where inter-rater agreement is low is to aggregate ratings of different annotators (Graham et al., 2017). As shown in Table 2, majority voting leads to clearer discrimination between MT and HUMAN in all conditions except sentence-level adequacy.

Table 1: Ratings by rater and condition. Fields marked * (greyed out in the original) indicate that raters had access to full documents for which we elicited sentence-level judgements; these are not considered for total results.

                 Document                 Sentence
Rater        MT   Tie  Human          MT   Tie  Human
Fluency
A            13     8     29          30    32     42
B1            –     –      –          36     4     64
B2            8    18     24           –     –      –
C            12    14     24          40*   14*    50*
D            11    17     22          32*   30*    42*
total        44    57     99          66    36    106
Adequacy
E            26     0     24          59     3     42
F            10    15     25          44    16     44
G            18     4     28          38*   23*    43*
H            20     3     27          38*   11*    55*
total        74    22    104         103    19     86

Table 2: Aggregation of ratings (%).

                 Document                 Sentence
Aggregation  MT   Tie  Human          MT   Tie  Human
Fluency
Average      22    29     50          32    17     51
Majority     24    10     66          26    23     51
Adequacy
Average      37    11     52          50     9     41
Majority     32    18     50          38    32     31

Table 3: Inter-rater agreement.

             Document   Sentence
Fluency
Same label     55 %       45 %
Cohen's κ      0.32       0.13
Adequacy
Same label     49 %       50 %
Cohen's κ      0.13       0.14
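To make Equation 1 concrete, the sketch below computes κ with P(E) estimated from the pooled label distribution over pairwise ratings, in the spirit of the WMT computation cited above. It is our own illustration (not the authors' code), and it is applied here to a single rater pair, whereas the paper pools ratings across all raters.

```python
from collections import Counter
from itertools import chain

def cohen_kappa(pairs):
    """Cohen's kappa for a list of (label_rater1, label_rater2) tuples.

    Labels: 'a' (MT better), 't' (tie), 'b' (HUMAN better).
    P(A) is the observed agreement; P(E) is chance agreement derived from
    the pooled label distribution (cf. Bojar et al., 2016b, Section 3.3).
    """
    p_a = sum(x == y for x, y in pairs) / len(pairs)
    label_counts = Counter(chain.from_iterable(pairs))
    total = sum(label_counts.values())
    p_e = sum((c / total) ** 2 for c in label_counts.values())
    return (p_a - p_e) / (1 - p_e)

# Confusion matrix (a) from Table 4, raters A (rows) and B2 (columns):
matrix = {("a", "a"): 5, ("a", "t"): 4, ("a", "b"): 4,
          ("t", "a"): 1, ("t", "t"): 5, ("t", "b"): 2,
          ("b", "a"): 2, ("b", "t"): 9, ("b", "b"): 18}
pairs = [labels for labels, n in matrix.items() for _ in range(n)]
print(round(cohen_kappa(pairs), 2))  # ~ 0.28 for this single pair of raters
```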
Table 4: Confusion matrices: MT is better than HUMAN (a), HUMAN is better than MT (b), or tie (t). Participant IDs (A–H) are the same as in Table 1. In each matrix, rows hold the labels of the first-named rater and columns those of the second, in the order a, t, b.

(a) fluency, document, N=50, A × B2:   a: 5 4 4   | t: 1 5 2   | b: 2 9 18
(b) fluency, document, N=50, A × C:    a: 7 2 4   | t: 2 4 2   | b: 3 8 18
(c) fluency, document, N=50, A × D:    a: 6 3 4   | t: 2 6 0   | b: 3 8 18
(d) fluency, document, N=50, B2 × C:   a: 5 1 2   | t: 4 5 9   | b: 3 8 13
(e) fluency, document, N=50, B2 × D:   a: 6 1 1   | t: 3 7 8   | b: 2 9 13
(f) fluency, document, N=50, C × D:    a: 7 3 2   | t: 1 7 6   | b: 3 7 14
(g) adequacy, document, N=50, E × F:   a: 4 9 13  | t: 0 0 0   | b: 6 6 12
(h) adequacy, document, N=50, E × G:   a: 9 4 13  | t: 0 0 0   | b: 9 0 15
(i) adequacy, document, N=50, E × H:   a: 11 1 14 | t: 0 0 0   | b: 9 2 13
(j) adequacy, document, N=50, F × G:   a: 7 1 2   | t: 7 1 7   | b: 4 2 19
(k) adequacy, document, N=50, F × H:   a: 6 1 3   | t: 8 0 7   | b: 6 2 17
(l) adequacy, document, N=50, G × H:   a: 11 2 5  | t: 1 1 2   | b: 8 0 20
(m) fluency, sentence, N=104, A × B1:  a: 16 1 13 | t: 10 1 21 | b: 10 2 30
(n) adequacy, sentence, N=104, E × F:  a: 31 6 22 | t: 2 0 1   | b: 11 10 21