DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages

 

Gabriele Sarti¹, Arianna Bisazza¹, Ana Guerberof-Arenas¹ ² and Antonio Toral¹
¹ Center for Language and Cognition (CLCG), University of Groningen
² Centre for Translation Studies, University of Surrey
{g.sarti, a.bisazza, a.guerberof.arenas, a.toral.ruiz}@rug.nl

arXiv:2205.12215v1 [cs.CL] 24 May 2022

Abstract

We introduce DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times, pauses, and perceived effort were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and its post-editing process. Using this new dataset, we assess the impact on translation productivity of two state-of-the-art NMT systems, namely Google Translate and the open-source multilingual model mBART50. We find that, while post-editing is consistently faster than translation from scratch, the magnitude of its contribution varies largely across systems and languages, ranging from doubled productivity in Dutch and Italian to marginal gains in Arabic, Turkish and Ukrainian, for some of the evaluated modalities. Moreover, the observed cross-language variability appears to partly reflect source-target relatedness and type of target morphology, while remaining hard to predict even based on state-of-the-art automatic MT quality metrics. We publicly release the complete dataset,¹ including all collected behavioural data, to foster new research on the ability of state-of-the-art NMT systems to generate text in typologically diverse languages.

1 Introduction

Advances in neural language modeling and multilingual training have enabled the delivery of Machine Translation (MT) technologies across an unprecedented range of world languages. While the benefits of state-of-the-art MT for cross-lingual information access are undisputed (Lommel and Pielmeier, 2021), its usefulness as an aid to professional translators varies considerably across domains, subjects and language combinations (Zouhar et al., 2021).

In the last decade, the MT community has been including an increasing number of languages in its automatic and human evaluation efforts (Bojar et al., 2013; Barrault et al., 2021). However, the results of these evaluations are typically not directly comparable across different language pairs for various reasons. First, reference-based automatic quality metrics are hardly comparable across different target languages (Bugliarello et al., 2020). Secondly, human judgements are collected independently for different language pairs, making their cross-lingual comparison vulnerable to confounding factors such as tested domains and training data sizes. Similarly, recent work on NMT post-editing efficiency has focused on specific language pairs such as English-Czech (Zouhar et al., 2021), German-Italian and German-French (Läubli et al., 2019), and English-Hindi (Ahsan et al., 2021), but a controlled comparison across a broad set of typologically different languages is still lacking.

In this work, we assess the usefulness of state-of-the-art NMT for professional translation with a strictly controlled setup (Figure 1). Specifically, experienced professionals were asked to translate the same English documents into six typologically different languages (Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese) using the same platform and guidelines. Three translation modalities were adopted: human translation from scratch (HT), post-editing of Google Translate's translation (PE1), and post-editing of mBART50's translation (PE2), the latter being a state-of-the-art open-source, multilingual NMT system. In addition to post-editing results, subjects' fine-grained editing behaviour, including keystrokes and time information, is logged and later leveraged to measure productivity and different effort typologies across languages, systems and translation modalities. Finally, translators are asked to complete a qualitative assessment regarding their perceptions of MT quality and post-editing effort.

Using this dataset, we measure the impact of the two NMT systems on professional translators' productivity, and study how this impact varies across languages. DivEMT is publicly released alongside this paper as a unique resource to study the language- and system-dependent nature of NMT advances in real-world translation scenarios.

¹ https://github.com/gsarti/divemt and https://huggingface.co/GroNLP/divemt

[Figure 1: flow diagram. Source-language documents are either translated from scratch (HT) or machine-translated by Google Translate (MT1) and mBART 1-to-50 (MT2) and then professionally post-edited (PE1, PE2). Behavioral data (edits, keystrokes, time measurements) and qualitative assessments (perceived MT quality, perceived PE effort) are collected along the way.]

Figure 1: Overview of the DIVEMT data collection process. The same process is repeated for each of the six target languages (Arabic, Dutch, Italian, Turkish, Ukrainian, Vietnamese), starting from the same English document sample taken from FLORES-101. For every document in every language, three professional translators are matched randomly with the three translation modalities (HT/PE1/PE2), leading to a complete set of translations for all documents and modalities in each target language. Behavioral data and qualitative assessments are collected during and after the process, respectively.

2 Previous Work

Before the advent of NMT, Birch et al. (2008) found the amount of reordering, target morphological complexity, and source-target language historical relatedness to be strong predictors of MT quality. These findings, however, were solely based on BLEU, which is in fact not comparable across different target languages (Bugliarello et al., 2020). More recent work on neural models has proposed more principled ways to measure the intrinsic difficulty of language-modeling (Gerz et al., 2018; Cotterell et al., 2018; Mielke et al., 2019) and machine-translating (Bugliarello et al., 2020; Bisazza et al., 2021) different languages, but achieving this reliably without any human evaluation remains an open research question. Moreover, no language properties were found to be robustly correlated with MT quality apart from type-token ratio.

Human evaluations of MT quality are routinely conducted during campaigns such as WMT (Koehn and Monz, 2006; Akhbardeh et al., 2021) and IWSLT (Cettolo et al., 2016, 2017), but their focus is on language- and domain-specific ranking of MT systems, often leveraging non-professional quality assessments (Freitag et al., 2021), rather than cross-lingual quality comparisons.

Measuring post-editing effort, and in particular its temporal, cognitive, and technical dimensions (Krings, 2001), is a well-established way to assess the effectiveness and efficiency of MT as a component of specialized translation workflows. Seminal work on post-editing highlighted an increase in translators' productivity following MT adoption (Guerberof, 2009; Green et al., 2013; Läubli et al., 2013; Plitt and Masselot, 2010; Parra Escartín and Arcedillo, 2015). However, the same research also struggled to identify generalizable findings due to confounding factors such as output quality, content domains, and high variance across different language combinations and human subjects. With the advent of NMT, the productivity gains of statistical MT — the highly-customized dominant paradigm at the time — and new neural network-based methods were extensively compared
(Castilho et al., 2017; Bentivogli et al., 2016; Toral et al., 2018; Läubli et al., 2019). Initial results were promising for NMT due to its better fluency and overall results, even in language combinations traditionally deemed difficult, like English-German. Moreover, translators were shown to prefer NMT over SMT for post-editing, although a pronounced productivity increase was not always present. More recent work highlighted the productivity gains driven by NMT post-editing in a wider array of languages that were previously challenging for MT, such as English-Dutch (Daems et al., 2017), English-Hindi (Ahsan et al., 2021), English-Greek (Stasimioti and Sosoni, 2020), and English-Finnish and English-Swedish (Koponen et al., 2020), all showing considerable variance among language pairs and subjects. The English-Czech results reported by Zouhar et al. (2021), however, show a different picture: NMT post-editing speed was found to be comparable to translating from scratch, and marginal increases in automatic MT quality metrics were found to be unrelated to better post-editing productivity. In sum, research on post-editing NMT generally reports increased fluency and output quality, but productivity gains are hardly generalizable across language pairs and domains. Importantly, to our knowledge, no previous work studied NMT post-editing over a set of typologically different languages in a unified setup where the effects of content type and domain, NMT engine, and translation interface are strictly controlled for.

3 The DivEMT Dataset

DivEMT's main purpose is to assess the usefulness of state-of-the-art NMT for professional translators, and to study how this usefulness varies across target languages that have very different typological properties. With our data collection setup, we aimed at striking a balance between simulating a realistic professional translation workflow and maximizing comparability of the results across languages. In what follows, we explain our design choices in detail.

3.1 Subjects and Task Scheduling

To control for the effect of individual translators' preferences and style, we involve a total of 18 subjects (three per target language). During the experiment, each subject is presented with a series of documents — each containing between 3 and 5 sentences — where the source text is presented in isolation (HT), or alongside a translation proposal produced by one of the NMT systems (PE1, PE2). We divide the experiment into two phases. During the warm-up phase, a set of 5 documents is translated by all subjects in their respective language following the same, randomly sampled sequence of modalities (HT, PE1 or PE2). This phase allows the subjects to get used to the setup, and allows us to spot possible issues by inspecting the logged behavioral data before moving forward.² In the main collection phase, each subject is asked to translate documents in a pseudo-random sequence of modalities. This time, however, the sequence of modalities is different for each translator and chosen so that each document gets translated in all three modalities. This allows us to measure translation productivity independently from subject-specific productivity and from document-specific difficulties. A graphical overview of this process is shown in Figure 1, and additional details are available in Appendix C. Since productivity and other behavioral metrics can only be estimated with a sizable sample, we prioritize document set size over number of translators per language during budget allocation.

All subjects are professional translators with at least 3 years of professional experience, at least one year of post-editing experience, and professional proficiency with CAT tools.³ Translators were asked to produce translations of publishable quality. We informed them of the machine-generated nature of the translation suggestions, but the presence of two different MT engines in the background was never explicitly mentioned. Moreover, translators were provided with links to the original source articles to facilitate contextualization, and were instructed not to use any external MT engine to produce their translations. Assessing the final quality of the translated material is out of the scope of the current study, although we realize that this is an important consideration to assess usability in a professional context. A summary of the translation guidelines is provided in Appendix B.

² Warm-up data are excluded from the analyses in Section 4.
³ Individual subjects' details are available in Appendix A.
⁴ https://github.com/facebookresearch/flores

3.2 Choice of Source Texts

The selected documents represent a subset of the FLORES-101 evaluation benchmark (Goyal et al., 2022),⁴ which consists of sentences taken from English Wikipedia, covering a mix of topics and
Language       Genus            Family       Morphology      # Cases   Dominant Order    Order Freedom   Script
English        Indo-European    Germanic     Fusional        (3)       SVO               low             Latin
Arabic (MSA)   Afro-Asiatic     Semitic      Introflexive    3         VSO / SVO         low             Arabic
Dutch          Indo-European    Germanic     Fusional        –         SVO / SOV / V2    low             Latin
Italian        Indo-European    Romance      Fusional        –         SVO               low             Latin
Turkish        Altaic           Turkic       Agglutinative   6         SOV               high            Latin*
Ukrainian      Indo-European    Slavic       Fusional        7         SVO               high            Cyrillic
Vietnamese     Austro-Asiatic   Viet-Muong   Isolating       –         SVO               low             Latin*

Table 1: The typological diversity of the selected language sample across various linguistic dimensions. Note that English has 3 cases, but these are only marked for pronouns. V2 indicates verb-second word order. Latin* indicates language-specific extensions of the Latin alphabet. Sources include the World Atlas of Language Structures (Dryer and Haspelmath, 2013) and Futrell et al. (2015).
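The modality rotation described in Section 3.1 — every document translated in all three modalities, with a different modality order per translator — can be sketched as a simple Latin-square assignment. This is a minimal illustration using a deterministic rotation; the study itself used pseudo-random per-translator sequences:

```python
MODALITIES = ["HT", "PE1", "PE2"]

def schedule(n_docs: int, n_translators: int = 3):
    """Assign one modality per (document, translator) pair so that every
    document is covered in all three modalities and each translator's
    sequence of modalities differs from the others (Latin-square style)."""
    plan = []
    for doc in range(n_docs):
        # shifting the rotation by both document and translator index
        # guarantees the three translators get three distinct modalities
        row = {f"T{t + 1}": MODALITIES[(doc + t) % len(MODALITIES)]
               for t in range(n_translators)}
        plan.append(row)
    return plan
```

With this scheme, each translator also cycles through all modalities across consecutive documents, which balances modality exposure per subject as well as per document.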

domains.⁵ While professional translators are typically specialized in one or a few domains, we opt for a mixed-domain dataset to minimize domain adaptation efforts by the subjects and maximize generalizability of the results.

Importantly, FLORES-101 includes high-quality human translations into 101 languages, which makes it possible to automatically estimate NMT quality and discard excessively low-scoring models or language pairs before the start of our experiment. Another important reason to choose this dataset is the availability of metadata such as the source URL, which allows us to check for the existence of public translations in our languages of interest, and the presence of cross-sentence context. The documents used for our study are fragments of contiguous sentences extracted from the Wikipedia articles that compose the original FLORES-101 corpus. Even if small, the context provided by document structure allows us to simulate a more realistic translation workflow compared to out-of-context sentences.

Based on our available budget, we select 112 English documents from the devtest portion of FLORES-101, corresponding to 450 sentences and 9,626 words. More details on the data selection process are provided in Appendix D.

3.3 Choice of Languages

Besides its architecture, training data has a very important impact on the quality of an NMT system. Unfortunately, using strictly comparable or multi-parallel datasets – like the Europarl (Koehn, 2005) or Bible corpus (Christodouloupoulos and Steedman, 2015) – would dramatically restrict the set of languages available to our study, or imply a prohibitively low translation quality on general-domain text. In order to minimize the effect of training data disparity while maximizing language diversity, we choose representatives of six different language families for which comparable amounts of training data are available in our open-source model, namely Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. While the number of parallel sentence pairs used for fine-tuning mBART50 on 50 language pairs varied widely, from 45M down to only 4K, all our selected language pairs fall within the range of 100K-250K sentence pairs (mid-resourced, see Table 2). As shown in Table 1, our language sample represents a wide diversity in terms of language family, type of morphology, degree of inflection, word order and script.

3.4 Choice of MT Systems

While most of the best-performing general-domain NMT systems are commercial, experiments based on such systems are not replicable, as their back ends get silently updated. Moreover, without knowing the exact training setup and data specifics, we cannot attribute differences in the cross-lingual results to the languages themselves. We balance these observations by including two NMT systems in our study: Google Translate⁶ as a representative of commercial quality, and mBART50 one-to-many⁷ (Tang et al., 2021) as a representative of state-of-the-art open-source multilingual NMT technology. The original multilingual BART

⁵ We use a balanced sample of articles sourced from WikiNews, WikiVoyage and WikiBooks.
⁶ Evaluation performed in October 2021.
⁷ facebook/mbart-large-50-one-to-many
                    Google Translate (PE1)       mBART-50 (PE2)
Target Language     BLEU    CHRF    COMET        BLEU    CHRF    COMET     # Pairs   Source
Arabic              34.1    65.6    .737         17.0    48.5    .452      226K      IWSLT'17
Dutch               29.1    60.0    .667         22.6    53.9    .532      226K      IWSLT'17.mlt
Italian             32.8    61.4    .781         24.4    54.7    .648      233K      IWSLT'17.mlt
Turkish             35.0    65.5    1.00         18.8    52.7    .755      204K      WMT'17
Ukrainian           31.1    59.8    .758         21.9    50.7    .587      104K      TED58
Vietnamese          45.1    61.9    .724         34.7    54.0    .608      127K      IWSLT'15

Table 2: Automatic MT quality of the two selected NMT systems for English-to-target translation on the full FLORES-101 devtest split. Best scores are highlighted in bold. We report the number of sentence pairs used for the multilingual fine-tuning of mBART50 and their respective sources (Tang et al. 2021, Table 6).

model (Liu et al., 2020) is an encoder-decoder transformer model pre-trained on monolingual documents in 25 languages, using the random span masking and order permutation tasks. Tang et al. (2021) extend mBART by further pre-training it on an additional set of 25 languages and performing multilingual fine-tuning on translation data for 50 languages, producing three configurations of multilingual NMT models: many-to-one, one-to-many, and many-to-many. Our choice of the mBART50 model is largely motivated by its manageable size, its good performance across the set of evaluated languages (see Table 2), and its adoption in other NMT studies (Liu et al., 2021) and post-editing resources (Fomicheva et al., 2020). While mBART50's performance is usually comparable to, or slightly worse than, that of the tested bilingual NMT models,⁸ the choice of a multilingual model allows us to evaluate the downstream effectiveness of a single system trained on pairs evenly distributed across the tested languages, as mentioned in Section 3.3. Finally, adopting two systems with a marked difference in automatic MT evaluation performance allows us to estimate how a significant increase in standard translation metrics such as BLEU, CHRF and COMET (Papineni et al., 2002; Popović, 2015; Rei et al., 2020) affects downstream productivity when the NMT systems are adopted in a realistic post-editing scenario.

3.5 Translation Platform and Collected Data

All the translations were performed with PET (Aziz et al., 2012), a computer-assisted translation tool that supports both translating from scratch and post-editing. This tool was chosen because (i) it logs information about the post-editing process, which we use to assess effort (see Section 4); and (ii) it is a mature research-oriented tool that has been successfully used in a number of previous studies, e.g. Koponen et al. (2012); Toral et al. (2018). Compared to commercial tools, PET has fewer functionalities. While this would clearly limit its application in daily translation activities, it proves useful for our controlled setup to avoid confounding effects stemming from such additional capabilities. In addition, the tool is expected to be unfamiliar to all the translators taking part in our study, thereby avoiding differences in performance across translators due to different degrees of proficiency with the software. We also observe that, due to the varied and generic nature of the source documents, functionalities such as concordance and translation memory matches would have proven much less useful in our setup. We collect three main types of data:

• The resulting translations, i.e. the end product of the translators' task in either HT or PE modes, forming a multilingual corpus with one source text and 18 translations (one per language-modality combination), exemplified in Table 3.

• Behavioral data for each translated sentence, including: editing time, number of keystrokes and keystroke type (content, navigation, erase, etc.), number of revisions, as well as number and duration of pauses longer than 300 and 1000 milliseconds (Lacruz et al., 2014).

• Pre-task and post-task questionnaires. The former focuses on demographics, education, and work experience with translation and post-editing. The latter elicits subjective assessments on the quality, effort and enjoyability of

⁸ See Table 9 for automatic MT quality results of five different models over a larger set of 10 target languages.
ENGLISH
        Inland waterways can be a good theme to base a holiday around.
  ARABIC
  HT    [Arabic right-to-left text not recovered in extraction]
  PE1   MT: [not recovered]
        PE: [not recovered]
  PE2   MT: [not recovered]
        PE: [not recovered]
  DUTCH
  HT    Binnenlandse waterwegen kunnen een goed thema zijn voor een vakantie.
        MT: De binnenwateren kunnen een goed thema zijn om een vakantie omheen te baseren .
  PE1
        PE: Binnenwateren kunnen een goede vakantiebestemming zijn .

        MT: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te zetten .
  PE2
        PE: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te organiseren .
  ITALIAN
  HT    I corsi d’acqua dell’entroterra possono essere un ottimo punto di partenza da cui organizzare una vacanza.
        MT: Trasporto fluviale può essere un buon tema per basare una vacanza in giro .
  PE1
        PE: I canali di navigazione interna possono essere un ottimo motivo per cui intraprendere una vacanza .

        MT: I corsi d’acqua interni possono essere un buon tema per fondare una vacanza.
  PE2
        PE: I corsi d’acqua interni possono essere un buon tema su cui basare una vacanza.
  TURKISH
  HT    İç bölgelerdeki su yolları, tatil planı için iyi bir tema olabilir.
        MT: İç su yolları, bir tatili temel almak için iyi bir tema olabilir.
  PE1
        PE: İç su yolları, bir tatil planı yapmak için iyi bir tema olabilir.

        MT: İç suyolları, tatil için uygun bir tema olabilir.
  PE2
        PE: İç sular tatil için uygun bir tema olabilir.
  UKRAINIAN
  HT    Можна спланувати вихiднi, взявши за основу подорож внутрiшнiми водними шляхами.
        MT: Внутрiшнi воднi шляхи можуть стати гарною темою для вiдпочинку навколо .
  PE1
        PE: Внутрiшнi воднi шляхи можуть стати гарною темою для проведення вихiдних .

        MT: Воднi шляхи можуть                 бути     хорошим об ’єктом          для    базування вiдпочинку навколо .
  PE2
        PE:     Мiсцевiсть         навколо внутрiшнiх водних шляхiв може                    бути    гарним вибором    для
          органiзацiї вiдпочинку.
  VIETNAMESE
  HT    Du lịch trên sông có thể là một lựa chọn phù hợp cho kỳ nghỉ.

        MT: Đường thủy nội địa có thể là một chủ đề hay để tạo cơ sở cho một kỳ nghỉ xung quanh .
  PE1
        PE: Đường thủy nội địa có thể là một ý tưởng hay để lập kế hoạch cho kỳ nghỉ.

        MT: Các tuyến nước nội địa có thể là một chủ đề tốt để xây dựng một kì nghỉ.
  PE2
        PE: Du lịch bằng đường thủy nội địa là một ý tưởng nghỉ dưỡng không tồi.

Table 3: An example sentence (81.1) from the DivEMT corpus, with the English source and all output modali-
ties for all target languages, including intermediate machine translations (MT) and subsequent post-editings (PE).
Colors denote insertions , deletions , substitutions and shifts computed with Tercom (Snover et al., 2006).
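The Tercom edit categories referenced in the caption can be approximated programmatically. The sketch below (with hypothetical helper names, not the paper's actual tooling) recovers word-level substitutions, insertions and deletions between an MT output and its post-edit using difflib's sequence alignment, and derives an HTER-like rate from them; Tercom's shift operations, which require a dedicated block-movement search, are omitted.

```python
import difflib

def edit_ops(mt: str, pe: str):
    """Word-level edits turning an MT output into its post-edited version.

    Rough stand-in for Tercom's substitution/insertion/deletion categories;
    Tercom additionally models block shifts, which difflib cannot detect.
    """
    mt_tok, pe_tok = mt.split(), pe.split()
    ops = []
    matcher = difflib.SequenceMatcher(a=mt_tok, b=pe_tok, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            ops.append(("substitute", mt_tok[i1:i2], pe_tok[j1:j2]))
        elif tag == "delete":
            ops.append(("delete", mt_tok[i1:i2], []))
        elif tag == "insert":
            ops.append(("insert", [], pe_tok[j1:j2]))
    return ops

def approx_hter(mt: str, pe: str) -> float:
    """Edited words divided by post-edit length (HTER-like, shifts ignored)."""
    edits = sum(max(len(a), len(b)) for _, a, b in edit_ops(mt, pe))
    return edits / len(pe.split())

# Dutch PE2 example from Table 3: a single word substitution.
mt = "Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te zetten ."
pe = "Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te organiseren ."
print(edit_ops(mt, pe))     # [('substitute', ['zetten'], ['organiseren'])]
print(approx_hter(mt, pe))  # 1 edited word over 13 post-edit tokens
```

The HTER numbers reported later in the paper come from Tercom itself; this sketch only illustrates the edit categories on a Table 3 example.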
the post-editing modality compared to translating from scratch.

4 Post-Editing Effort Across Languages

In this section, we use the DivEMT dataset to quantify the effort of professional translators post-editing NMT outputs across our diverse set of target languages. We consider two main objective indicators of effort, namely (i) task time (and related productivity gains); and (ii) post-editing rate, measured by Human-targeted Translation Edit Rate (Snover et al., 2006). We reiterate that all scores in this section are computed on the same set of source sentences for all languages, resulting in a faithful cross-lingual comparison of post-editing effort that would not be feasible without the DivEMT dataset.

4.1 Temporal Effort and Productivity Gains

We start by comparing task time (seconds per processed source word) across languages and modalities. For this analysis, data is aggregated over the three subjects in each language and editing time is computed for each document. As shown in Figure 2, translation time varies considerably across languages even when no MT system is involved, suggesting an intrinsic variability in translation complexity for different subjects and language pairs. The ‘slowest’ language pairs are En-Italian and En-Ukrainian, followed by En-Vietnamese, then by En-Arabic and En-Dutch, whereas the ‘fastest’ pair is En-Turkish. This pattern cannot be easily explained, and is in contrast with traditional factors associated with MT difficulty, such as target morphological richness and source-target language relatedness (Birch et al., 2008).

Figure 2: Temporal effort across languages and translation modalities (From Scratch (HT), Google Translate (PE1), mBART-50 (PE2)), measured in seconds per processed source word. Each point represents a document, with higher scores representing slower editing. Arrows (↑) denote the number of data points not included in the plot for each language.

On the other hand, when comparing the three modalities, we find HT to be the slowest and PE1 the fastest, with PE2 in between, for all the evaluated languages. This confirms once again the overall positive impact of NMT post-editing on translation productivity. That said, we note that the magnitude of this impact is highly variable across systems and languages.

For a measure of productivity gains that is easier to interpret and more in line with translation industry practices, we turn to productivity expressed in source words processed per minute, and compute the speed-up induced by the two post-editing modalities over translating from scratch (∆HT). Table 4 presents our results.

        PROD ↑    ∆HT ↑
ARA HT    13.1        -
    PE1   21.7     +84%
    PE2   16.3     +10%
NLD HT    13.6        -
    PE1   28.7    +119%
    PE2   21.7     +61%
ITA HT     8.8        -
    PE1   18.6     +96%
    PE2   15.6     +95%
TUR HT    17.9        -
    PE1   25.5     +34%
    PE2   21.0     +12%
UKR HT     8.0        -
    PE1   12.3     +71%
    PE2    9.8     +14%
VIE HT    10.2        -
    PE1   13.0     +32%
    PE2   11.1     +23%

Table 4: Median productivity (PROD, # processed source words per minute) and median % post-editing speedup (∆HT) for all analyzed languages and modalities. Arrows denote the direction of improvement.

Across systems, we find that large differences in automatic MT quality metrics are indeed reflected in post-editing effort, suggesting a nuanced picture that is complementary to previous findings: while Zouhar et al. (2021) noted how post-editing time gains quickly saturate when MT quality is already high, we observe that moving from medium-quality to high-quality MT yields meaningful productivity improvements across most of the evaluated languages. Across languages, too, the magnitude of productivity gains ranges widely, from doubling in some languages (PE1 in Dutch, PE1 and PE2 in Italian) to only about 10% (PE2 in Arabic, Turkish and Ukrainian). When only considering the better-performing system (PE1), post-editing remains clearly beneficial in all languages, despite a large variation between +119% (Dutch) and +32% (Vietnamese). As for the open-source system (PE2), three out of six languages display only marginal gains (i.e. less than 15% in Arabic, Turkish and Ukrainian). Recall that mBART50 (PE2), despite its overall lower performance, is the only one of the two NMT systems that enables a fair comparison across languages (from the point of view of training data size and architecture, see Section 3.4). Interestingly, if we focus on the gains induced by this system, explanatory factors like language relatedness and morphological richness come back into play. Specifically, Italian (+95%), Dutch (+61%) and Ukrainian (+14%) are all related to English, but Ukrainian has a much richer morphology. On the other hand, Vietnamese (+23%), Turkish (+12%) and Arabic (+10%) all belong to different families. From the point of view of morphology, Vietnamese is isolating (little to no morphology), while Turkish and Arabic have very rich morphological systems of different types (agglutinative and introflexive, respectively), the latter being especially problematic for subword segmentation (Amrhein and Sennrich, 2021). Some differences are however harder to explain. For instance, Dutch is genetically related to English and has a poorer morphology than Italian, but its productivity gain with mBART50 is lower (61% vs 95%). This finding is accompanied by an important gap in the BLEU and COMET scores achieved by mBART50 on the two languages (22.6 BLEU for Dutch versus 24.4 for Italian; 0.532 COMET for Dutch versus 0.648 for Italian), which cannot be easily explained in terms of training data size.

We note the important role of inter-subject variability,9 in line with previous studies (Koponen et al., 2020). Moreover, the small size of our language sample does not allow us to draw direct causal links between specific typological properties and post-editing efficiency. Still, we believe these results have important implications for the claimed ‘universality’ of current state-of-the-art MT and NLP systems, mostly based on the Transformer architecture (Vaswani et al., 2017) and on BPE-style subword segmentation techniques (Sennrich et al., 2016).

9 See Figure 6 for a per-subject productivity overview.

4.2 Post-Editing Rate

Figure 3 presents the results in terms of Human-targeted Translation Edit Rate, HTER (Snover et al., 2006), which is computed as the sum of substitution, insertion, deletion and shift operations performed by a translator on the MT output, normalized by its length. First, we can see that the commercial system needs less editing than the open-source one for all languages. Recall that the translators were not informed about the presence of two systems in the background, so we can exclude the possibility of an over-reliance on or distrust towards a specific MT system from past experience. Here too, we observe a high variation both across subjects and across languages. The cross-language pattern is similar, but not identical, for the two MT systems: for Google Translate, Ukrainian shows by far the heaviest edit rate, followed by Vietnamese, whereas Arabic, Dutch, Italian and Turkish all show
Figure 3: Human-targeted Translation Edit Rate (HTER) for Google Translate (PE1) and mBART-50 (PE2) post-editing across available languages.

Figure 4: Distribution of error-less machine translation sentence outputs (no edits performed during post-editing) for each translator and every language.

relatively low amounts of edits. Focusing again on mBART50 for a fairer cross-lingual comparison, Ukrainian is by far the most heavily edited language, followed by a medium-tier group composed of Vietnamese, Arabic and Turkish, and finally by Dutch and Italian as low-edit languages. From these results, we can conclude that several of our observations on linguistic relatedness and morphological type carry over to edit rates, with languages more closely related to English requiring fewer post-edits on average.

           ρ HTER ↓         ρ PROD ↑
           PE1    PE2       PE1    PE2
  BLEU     .12   -.07      -.30   -.42
  CHRF    -.49   -.49       .45    .34
  COMET   -.17    .03       .65    .27

Table 5: Correlation of automatic metrics with the HTER and productivity (PROD) scores obtained after post-editing, after averaging over language pairs. Arrows denote the direction of expected correlation; moderate/high scores are marked in bold in the original table.
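As an illustration of the procedure behind Table 5, the sketch below assumes ρ is Spearman's rank correlation computed over one averaged point per language pair; all numbers are invented placeholders, not DivEMT data.

```python
# Hedged sketch: rank correlation between a corpus-level MT metric and a
# post-editing effort score, with one data point per language pair.

def _ranks(xs):
    """1-based ranks, averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return _pearson(_ranks(x), _ranks(y))

# Hypothetical per-language-pair averages (six pairs, as in DivEMT):
chrf = [49.3, 51.7, 53.8, 55.2, 58.1, 60.4]  # placeholder corpus CHRF
hter = [0.41, 0.35, 0.29, 0.32, 0.21, 0.18]  # placeholder mean HTER
print(f"rho(CHRF, HTER) = {spearman(chrf, hter):.2f}")  # negative by construction
```

With only six language pairs, such correlations are fragile; a rank-based coefficient at least limits the influence of outlier pairs.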
Figure 4 visualizes the large gap in edit rates across languages and subjects by presenting the proportion of “errorless” MT sentences that were accepted directly, without any further post-editing. Once again, we observe that this proportion is heavily influenced by the NMT system, but nonetheless the majority of translators for the En-Dutch and En-Italian language pairs enjoy lower editing rates compared to their colleagues translating into Ukrainian and Vietnamese. Surprisingly, the En-Turkish setting fares well, despite the lower relatedness of the two languages. In particular, in the context of Google Translate outputs, the average proportion of error-less sentences is ∼25% for the former languages, while for the latter it corresponds only to ∼3%.

To conclude our analysis, we consider the inter-language correlation between automatic quality metrics computed for the two NMT systems on the existing reference translations and the HTER and productivity scores derived from their adoption in the post-editing setting.10 Given the availability of FLORES reference translations for 101 languages, a metric correlating well with post-editing effort across target languages would make a valuable tool to predict the usefulness of MT post-editing in a new language pair where MT has not been deployed yet. The results in Table 5 show that BLEU (Papineni et al., 2002) does not correlate at all with post-editing effort, giving empirical support to the claim that BLEU should not be compared across target languages (Bugliarello et al., 2020). The character ngram-based CHRF (Popović, 2015), despite its simplicity, shows much better results, with positive correlations across the board. Finally, we look at the neural COMET, a metric based on large pre-trained multilingual language models, which boasts state-of-the-art (within-language-pair) correlations with human judgements of MT quality (Rei et al., 2020). Despite its apparent language-agnostic nature, COMET does not appear to correlate reliably with post-editing effort, which might point to an uneven quality of the neural sentence encoder, or to a lack of diversity in the human judgement dataset used for fine-tuning.

10 Scores are averaged for every language pair and then correlated. The automatic metrics are computed on the full FLORES devtest (cf. Table 2) as proxies of general MT quality on the given language pair.

5 Conclusions

We have presented the results of a post-editing study spanning two state-of-the-art NMT systems and six target languages, under a unified setup. The resulting dataset, DivEMT, allows for a controlled comparison of NMT quality and post-editing effort across typologically diverse languages.

We adopted DivEMT to perform a novel cross-language analysis of post-editing effort, along its temporal and editing rate dimensions. Our results show that the use of NMT drives improvements in productivity across all the evaluated languages, but the magnitude of these improvements depends heavily on the language and the choice of the underlying NMT system. Moreover, some of the reported cross-language variation seems to reflect language properties traditionally known to affect MT quality, like relatedness to the source language and type of morphology (Birch et al., 2008). Given that the underlying open-source system used the exact same architecture and a comparable training data size for all evaluated languages, our results suggest that the current state-of-the-art approach to MT, based on the Transformer and BPE-based input tokens, is not universally suitable to model human languages.

Finally, we investigated whether language-pair-level MT quality, as measured by various automatic metrics, could be used as an indicator of the usefulness of MT for post-editing in a new language pair. Our results show consistently moderate correlations for CHRF, but not for BLEU, and also not for COMET despite its supposed language-agnostic nature.

In future work, a more fine-grained analysis of the types of edits conducted by the translators, and of their differences across languages, could shed more light on our current findings.

Acknowledgements

Data collection was fully funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.021.646. GS is supported by the NWO project InDeep (NWA.1292.19.399).

References

Arafat Ahsan, Vandan Mujadia, and Dipti Misra Sharma. 2021. Assessing post-editing effort in the English-Hindi direction. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussà, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Chantal Amrhein and Rico Sennrich. 2021. How suitable are subword segmentation strategies for translating non-concatenative morphology? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 689–705, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Wilker Aziz, Sheila Castilho, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 3982–3987, Istanbul, Turkey. European Language Resources Association (ELRA).

Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, André Martins, Makoto Morishita, and Christof Monz, editors. 2021. Proceedings of the Sixth Conference on Machine Translation. Association for Computational Linguistics, Online.
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267, Austin, Texas. Association for Computational Linguistics.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics.

Arianna Bisazza, Ahmet Üstün, and Stephan Sportel. 2021. On the difficulty of translating free-order case-marking languages. Transactions of the Association for Computational Linguistics, 9:1233–1248.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It’s easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017. Is Neural Machine Translation the New State of the Art? The Prague Bulletin of Mathematical Linguistics, 108(1):109–120.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 2–14, Tokyo, Japan. International Workshop on Spoken Language Translation.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Rolando Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In Proceedings of the 13th International Conference on Spoken Language Translation, Seattle, Washington D.C. International Workshop on Spoken Language Translation.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Joke Daems, Sonia Vandepitte, Robert Hartsuiker, and Lieve Macken. 2017. Translation Methods and Experience: A Comparative Analysis of Human Translation and Post-editing with Students and Professional Translators. Meta: journal des traducteurs / Meta: Translators’ Journal, 62(2):245–270.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Çelebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2021. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.

Marina Fomicheva, Shuo Sun, Erick Rocha Fonseca, F. Blain, Vishrav Chaudhary, Francisco Guzmán, Nina Lopatina, Lucia Specia, and André F. T. Martins. 2020. MLQE-PE: A multilingual quality estimation and post-editing dataset. ArXiv, abs/2010.04480.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’13, pages 439–448, New York, NY, USA. Association for Computing Machinery.

Ana Guerberof. 2009. Productivity and quality in MT post-editing. In Beyond Translation Memories: New Tools for Translators Workshop, Ottawa, Canada.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2706–2718, Online. Association for Computational Linguistics.

Arle Lommel and Hélène Pielmeier. 2021. Machine Translation Use at LSPs: Insights on How Language Service Providers’ Use of MT Is Evolving. Technical report, Common Sense Advisory Research.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of
  86, Phuket, Thailand.                                       the 57th Annual Meeting of the Association for Com-
                                                              putational Linguistics, pages 4975–4989, Florence,
Philipp Koehn and Christof Monz, editors. 2006. Pro-          Italy. Association for Computational Linguistics.
  ceedings on the Workshop on Statistical Machine
                                                            Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
  Translation. Association for Computational Linguis-
                                                              Jing Zhu. 2002. Bleu: a method for automatic eval-
  tics, New York City.
                                                              uation of machine translation. In Proceedings of the
Maarit Koponen, Wilker Aziz, Luciana Ramos, and               40th Annual Meeting of the Association for Compu-
 Lucia Specia. 2012. Post-editing time as a measure           tational Linguistics, pages 311–318, Philadelphia,
 of cognitive effort. In Workshop on Post-Editing             Pennsylvania, USA. Association for Computational
 Technology and Practice.                                     Linguistics.
Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen,           Carla Parra Escartín and Manuel Arcedillo. 2015. Ma-
 and Jörg Tiedemann. 2020. MT for subtitling: User           chine translation evaluation made fuzzier: a study
 evaluation of post-editing productivity. In Proceed-         on post-editing productivity and evaluation metrics
 ings of the 22nd Annual Conference of the Euro-              in commercial settings. In Proceedings of Machine
 pean Association for Machine Translation, pages              Translation Summit XV: Papers, Miami, USA.
 115–124, Lisboa, Portugal. European Association            Mirko Plitt and François Masselot. 2010. A Produc-
 for Machine Translation.                                     tivity Test of Statistical Machine Translation Post-
Hans P. Krings. 2001. Repairing Texts: Empirical              Editing in a Typical Localisation Context. The
  Investigations of Machine Translation Post-editing          Prague Bulletin of Mathematical Linguistics, 93(1).
  Processes. Kent State University Press. Google-           Maja Popović. 2015. chrF: character n-gram F-score
  Books-ID: vsdPsIXCiWAC.                                     for automatic MT evaluation. In Proceedings of the
Isabel Lacruz, Michael Denkowski, and Alon Lavie.             Tenth Workshop on Statistical Machine Translation,
   2014. Cognitive demand and cognitive effort in             pages 392–395, Lisbon, Portugal. Association for
   post-editing. In Proceedings of the 11th Conference        Computational Linguistics.
   of the Association for Machine Translation in the        Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon
   Americas, pages 73–84, Vancouver, Canada. Asso-            Lavie. 2020. COMET: A neural framework for MT
   ciation for Machine Translation in the Americas.           evaluation. In Proceedings of the 2020 Conference
                                                              on Empirical Methods in Natural Language Pro-
Samuel Läubli, Chantal Amrhein, Patrick Düggelin,
                                                              cessing (EMNLP), pages 2685–2702, Online. Asso-
  Beatriz Gonzalez, Alena Zwahlen, and Martin Volk.
                                                              ciation for Computational Linguistics.
  2019. Post-editing productivity with neural machine
  translation: An empirical assessment of speed and         Rico Sennrich, Barry Haddow, and Alexandra Birch.
  quality in the banking and finance domain. In Pro-          2016. Neural machine translation of rare words
  ceedings of Machine Translation Summit XVII: Re-            with subword units. In Proceedings of the 54th An-
  search Track, pages 267–272, Dublin, Ireland. Euro-         nual Meeting of the Association for Computational
  pean Association for Machine Translation.                   Linguistics (Volume 1: Long Papers), pages 1715–
                                                              1725, Berlin, Germany. Association for Computa-
Samuel Läubli, Mark Fishel, Gary Massey, Maureen             tional Linguistics.
  Ehrensberger-Dow, and Martin Volk. 2013. As-
  sessing post-editing efficiency in a realistic transla-   Matthew Snover, Bonnie Dorr, Rich Schwartz, Lin-
  tion environment. In Proceedings of the 2nd Work-           nea Micciulla, and John Makhoul. 2006. A study
  shop on Post-editing Technology and Practice, Nice,         of translation edit rate with targeted human annota-
  France.                                                     tion. In Proceedings of the 7th Conference of the As-
                                                              sociation for Machine Translation in the Americas:
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey           Technical Papers, pages 223–231, Cambridge, Mas-
  Edunov, Marjan Ghazvininejad, Mike Lewis, and               sachusetts, USA. Association for Machine Transla-
  Luke Zettlemoyer. 2020. Multilingual denoising              tion in the Americas.
  pre-training for neural machine translation. Trans-       Maria Stasimioti and Vilelmini Sosoni. 2020. Trans-
  actions of the Association for Computational Lin-           lation vs post-editing of NMT output: Insights from
  guistics, 8:726–742.                                        the English-Greek language pair. In Proceedings of
Zihan Liu, Genta Indra Winata, and Pascale Fung.              1st Workshop on Post-Editing in Modern-Day Trans-
  2021. Continual mixed-language pre-training for ex-         lation, pages 109–124, Virtual. Association for Ma-
  tremely low-resource neural machine translation. In         chine Translation in the Americas.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Na-
  man Goyal, Vishrav Chaudhary, Jiatao Gu, and An-
  gela Fan. 2021. Multilingual translation from de-
  noising pre-training. In Findings of the Association
  for Computational Linguistics: ACL-IJCNLP 2021,
  pages 3450–3466, Online. Association for Compu-
  tational Linguistics.
Jörg Tiedemann and Santhosh Thottingal. 2020.
    OPUS-MT – building open translation services for
    the world. In Proceedings of the 22nd Annual Con-
    ference of the European Association for Machine
    Translation, pages 479–480, Lisboa, Portugal. Eu-
    ropean Association for Machine Translation.
Antonio Toral, Martijn Wieling, and Andy Way. 2018.
  Post-editing Effort of a Novel With Statistical and
  Neural Machine Translation. Frontiers in Digital
  Humanities, 5:1–11. Publisher: Frontiers.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
  Uszkoreit, Llion Jones, Aidan N Gomez, ukasz
  Kaiser, and Illia Polosukhin. 2017. Attention is all
  you need. In Advances in Neural Information Pro-
  cessing Systems, volume 30. Curran Associates, Inc.
Vilém Zouhar, Martin Popel, Ondřej Bojar, and Aleš
  Tamchyna. 2021. Neural machine translation quality
  and post-editing performance. In Proceedings of the
  2021 Conference on Empirical Methods in Natural
  Language Processing, pages 10204–10214, Online
  and Punta Cana, Dominican Republic. Association
  for Computational Linguistics.
A    Subject Information

During the setup of our experiment, one translator refused to carry out the main task after the warm-up phase, and a second one was replaced at our discretion. Both translators were working on the English-Italian direction and were found to make heavy use of copy-pasting during the warm-up stage, suggesting an incorrect use of the platform in light of our guidelines. The two translators, whom we identified as T2 and T3 for Italian, were replaced by T5 and T4 respectively. Table 7 reports the final translator selection for all languages, together with the information collected by means of the pre-task questionnaire.

                     T1     T2     T3
  warm-up   docW1    HT     HT     HT
            docW2    PE1    PE1    PE1
             ...
            docWN    PE2    PE2    PE2

  main      docM1    HT     PE1    PE2
            docM2    PE2    HT     PE1
            docM3    HT     PE2    PE1
             ...
            docMN    PE2    PE1    HT

Table 6: Modality scheduling overview. For each language, each subject (Ti) works with a pseudo-random sequence of modalities (HT, PE1, PE2). For the warm-up task (N = 5), all translators are provided with the same documents in the same modalities. For the main task (N = 107), each translator is assigned a modality at random. Each document is translated once for every modality. The same procedure is repeated independently for all the languages.

B    Translation Guidelines

An extract of the translation guidelines provided to the translators follows.
    • Fill in the pre-task questionnaire before starting the project.

    • In this experiment, your goal is to complete the translation of multiple files in one of two possible translation settings.

    • Please complete the tasks on your own, even if you know another translator that might be working on this project.

    • The translation setting alternates between texts, with each text requiring a single translation in the assigned setting. The two translation settings are:

        1. Translation from scratch. Only the source sentence is provided; you are to write the translation from scratch.

        2. Post-editing. The source sentence is provided alongside a translation produced by an MT system. You are to post-edit this MT output so that you are satisfied with the final translation (the required quality is publishable quality). If the MT output is too time-consuming to fix, you can delete it and start from scratch. However, please do not systematically delete the provided MT output to give your own translation.

    • Important: All editing MUST happen in the provided PET interface: working in other editors and copy-pasting the text back to PET is NOT ALLOWED, because it invalidates the experiment. This is easy to spot in the log data, so please avoid doing this.

    • Complete the translation of all files sequentially, i.e. in the order presented in the tool. DO NOT SKIP files at your own convenience.

    • The aim is to produce publishable professional-quality translations in both translation settings. Thus, please translate to the best of your abilities. You can return to the files and self-review as many times as you think necessary.

    • Important: The time invested to translate is recorded while the active unit (sentence) is in editing mode (yellow background). Therefore:

        – Only start to translate when you are in editing mode (yellow background). In other words, do not start thinking about how you will translate a sentence when the active unit is not yet in editing mode (green or red background).

        – Do not leave a unit in editing mode (yellow background) while you do something else. If you need to do something unrelated in the middle of a translation, go out of editing mode and come back to editing mode when you are ready to resume translating.

        – First you will be translating a warm-up task, and then the main task. When you are translating each file, you can consult the source text (ST) by looking up the URL in the Excel files that we have sent for reference.

        – In order to find the correct terminology for the translation, you can consult any source on the Internet. Important: However, it is NOT ALLOWED to use any MT engine to find terms or alternative translations (such as Google Translate, DeepL, MS Translator or any MT engine available in your language). Using MT engines invalidates the experiment, and will be detected in the log data.

        – Please fill in the post-task questionnaire ONLY ONCE, after completing all the translation tasks (both warm-up and main tasks).

C    Modality scheduling

Table 6 visualizes an example of the adopted modality scheduling for every translator. The modality of document docMi for translator Tj in the main task is picked randomly among the two
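The per-document randomization used in the main task (each document is translated exactly once in each modality, with the modality-to-translator assignment drawn at random per document) can be sketched as follows. This is an illustrative sketch, not the authors' actual implementation; the function name `schedule` and the fixed seed are assumptions introduced here for reproducibility of the example.

```python
import random

# The three modalities described in Table 6: translation from scratch
# and post-editing the output of two different NMT systems.
MODALITIES = ["HT", "PE1", "PE2"]

def schedule(num_docs, seed=0):
    """Build a main-task schedule: for each document, assign a random
    permutation of the three modalities to translators T1-T3, so that
    every document is translated exactly once in every modality."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    rows = []
    for i in range(1, num_docs + 1):
        perm = MODALITIES[:]   # copy before shuffling in place
        rng.shuffle(perm)
        rows.append({"doc": f"docM{i}",
                     "T1": perm[0], "T2": perm[1], "T3": perm[2]})
    return rows
```

With `num_docs = 107` (the size of the main task), every row of the resulting schedule contains HT, PE1 and PE2 exactly once, mirroring the docM rows of Table 6; repeating the call with a different seed per language corresponds to running the procedure independently for each language.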
               Gender   Age     Degree        Position    En Level   YoE     YoE w/ PE   % PE
 Arabic   T1   M        35-44   BA            Freelancer  C2         > 15    2-5         20%-40%
          T2   M        25-34   BA            Employed    C2         5-10    2-5         60%-80%
          T3   M        25-34   MA            Freelancer  C1         5-10                80%
 Turkish  T1   F        25-34   BA            Freelancer  C2         5-10    2-5         < 20%
          T2   F        25-34   BA            Freelancer  C1         5-10    5-10        < 20%
          T3   M        25-34   High school   Freelancer  C2         10-15