DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages
Gabriele Sarti¹, Arianna Bisazza¹, Ana Guerberof-Arenas¹,² and Antonio Toral¹
¹ Center for Language and Cognition (CLCG), University of Groningen
² Centre for Translation Studies, University of Surrey
{g.sarti, a.bisazza, a.guerberof.arenas, a.toral.ruiz}@rug.nl

arXiv:2205.12215v1 [cs.CL] 24 May 2022

Abstract

We introduce DivEMT, the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages. Using a strictly controlled setup, 18 professional translators were instructed to translate or post-edit the same set of English documents into Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. During the process, their edits, keystrokes, editing times, pauses, and perceived effort were recorded, enabling an in-depth, cross-lingual evaluation of NMT quality and its post-editing process. Using this new dataset, we assess the impact on translation productivity of two state-of-the-art NMT systems, namely Google Translate and the open-source multilingual model mBART50. We find that, while post-editing is consistently faster than translation from scratch, the magnitude of its contribution varies largely across systems and languages, ranging from doubled productivity in Dutch and Italian to marginal gains in Arabic, Turkish and Ukrainian, for some of the evaluated modalities. Moreover, the observed cross-language variability appears to partly reflect source-target relatedness and type of target morphology, while remaining hard to predict even based on state-of-the-art automatic MT quality metrics. We publicly release the complete dataset,¹ including all collected behavioural data, to foster new research on the ability of state-of-the-art NMT systems to generate text in typologically diverse languages.

1 Introduction

Advances in neural language modeling and multilingual training have enabled the delivery of Machine Translation (MT) technologies across an unprecedented range of world languages. While the benefits of state-of-the-art MT for cross-lingual information access are undisputed (Lommel and Pielmeier, 2021), its usefulness as an aid to professional translators varies considerably across domains, subjects and language combinations (Zouhar et al., 2021).

In the last decade, the MT community has been including an increasing number of languages in its automatic and human evaluation efforts (Bojar et al., 2013; Barrault et al., 2021). However, the results of these evaluations are typically not directly comparable across different language pairs for various reasons. First, reference-based automatic quality metrics are hardly comparable across different target languages (Bugliarello et al., 2020). Secondly, human judgements are collected independently for different language pairs, making their cross-lingual comparison vulnerable to confounding factors such as tested domains and training data sizes. Similarly, recent work on NMT post-editing efficiency has focused on specific language pairs such as English-Czech (Zouhar et al., 2021), German-Italian, German-French (Läubli et al., 2019) and English-Hindi (Ahsan et al., 2021), but a controlled comparison across a broad set of typologically different languages is still lacking.
In this work, we assess the usefulness of state-of-the-art NMT for professional translation with a strictly controlled setup (Figure 1). Specifically, experienced professionals were asked to translate the same English documents into six typologically different languages (Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese) using the same platform and guidelines. Three translation modalities were adopted: human translation from scratch (HT), post-editing of Google Translate's translation (PE1), and post-editing of mBART50's translation (PE2), the latter being a state-of-the-art open-source, multilingual NMT system. In addition to the post-editing results, the subjects' fine-grained editing behaviour — including keystrokes and time information — is logged and later leveraged to measure productivity and different effort typologies across languages, systems and translation modalities. Finally, translators are asked to complete a qualitative assessment regarding their perceptions of MT quality and post-editing effort.

Using this dataset, we measure the impact of the two NMT systems on professional translators' productivity, and study how this impact varies across languages. DivEMT is publicly released alongside this paper as a unique resource to study the language- and system-dependent nature of NMT advances in real-world translation scenarios.

¹ https://github.com/gsarti/divemt and https://huggingface.co/GroNLP/divemt
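The released resource can be explored programmatically. Below is a minimal sketch assuming the Hugging Face Hub ID from footnote 1; the configuration name and the exact column layout are illustrative assumptions, not documented here:

```python
# Minimal sketch: loading DivEMT with the `datasets` library.
# The hub ID comes from footnote 1; the "main" configuration name and the
# column layout are assumptions for illustration.
from datasets import load_dataset

divemt = load_dataset("GroNLP/divemt", "main")
record = divemt["train"][0]
print(record)  # one sentence-level record with texts and behavioral logs
```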
Figure 1: Overview of the DivEMT data collection process. The same process is repeated for each of the six target languages (Arabic, Dutch, Italian, Turkish, Ukrainian, Vietnamese), starting from the same English document sample taken from FLORES-101. For every document in every language, three professional translators are matched randomly with the three translation modalities (HT/PE1/PE2), leading to a complete set of translations for all documents and modalities in each target language. Behavioral data and qualitative assessments are collected during and after the process, respectively.

2 Previous Work

Before the advent of NMT, Birch et al. (2008) found the amount of reordering, target morphological complexity, and source-target language historical relatedness to be strong predictors of MT quality. These findings, however, were solely based on BLEU, which is in fact not comparable across different target languages (Bugliarello et al., 2020). More recent work on neural models has proposed more principled ways to measure the intrinsic difficulty of language-modeling (Gerz et al., 2018; Cotterell et al., 2018; Mielke et al., 2019) and machine-translating (Bugliarello et al., 2020; Bisazza et al., 2021) different languages, but achieving this reliably without any human evaluation remains an open research question. Moreover, no language properties were found to be robustly correlated to MT quality apart from type-token ratio.

Human evaluations of MT quality are routinely conducted during campaigns such as WMT (Koehn and Monz, 2006; Akhbardeh et al., 2021) and IWSLT (Cettolo et al., 2016, 2017), but their focus is on language- and domain-specific ranking of MT systems, often leveraging non-professional quality assessments (Freitag et al., 2021), rather than cross-lingual quality comparisons.

Measuring post-editing effort, and in particular its temporal, cognitive, and technical dimensions (Krings, 2001), is a well-established way to assess the effectiveness and efficiency of MT as a component of specialized translation workflows. Seminal work on post-editing highlighted an increase in translators' productivity following MT adoption (Guerberof, 2009; Green et al., 2013; Läubli et al., 2013; Plitt and Masselot, 2010; Parra Escartín and Arcedillo, 2015). However, the same research also struggled to identify generalizable findings due to confounding factors such as output quality, content domains, and high variance across different language combinations and human subjects.
With the advent of NMT, productivity gains of statistical MT — the highly-customized, dominant paradigm at the time — and of the new neural network-based methods were extensively compared
(Castilho et al., 2017; Bentivogli et al., 2016; Toral et al., 2018; Läubli et al., 2019). Initial results were promising for NMT due to its better fluency and overall results, even in language combinations traditionally deemed difficult, like English-German. Moreover, translators were shown to prefer NMT over SMT for post-editing, although a pronounced productivity increase was not always present. More recent work highlighted the productivity gains driven by NMT post-editing in a wider array of languages that were previously challenging for MT, such as English-Dutch (Daems et al., 2017), English-Hindi (Ahsan et al., 2021), English-Greek (Stasimioti and Sosoni, 2020), and English-Finnish and English-Swedish (Koponen et al., 2020), all showing a considerable variance among language pairs and subjects. The English-Czech results reported by Zouhar et al. (2021), however, show a different picture: NMT post-editing speed was found to be comparable to translating from scratch, and marginal increases in automatic MT quality metrics were found to be unrelated to better post-editing productivity. In sum, research on post-editing NMT generally reports increased fluency and output quality, but productivity gains are hardly generalizable across language pairs and domains. Importantly, to our knowledge, no previous work studied NMT post-editing over a set of typologically different languages in a unified setup where the effects of content type and domain, NMT engine, and translation interface are strictly controlled for.

3 The DivEMT Dataset

DivEMT's main purpose is to assess the usefulness of state-of-the-art NMT for professional translators, and to study how this usefulness varies across target languages that have very different typological properties. With our data collection setup, we aimed at striking a balance between simulating a realistic professional translation workflow and maximizing the comparability of results across languages. In what follows, we explain our design choices in detail.

3.1 Subjects and Task Scheduling

To control for the effect of individual translators' preferences and style, we involve a total of 18 subjects (three per target language). During the experiment, each subject is presented with a series of documents — each containing between 3 and 5 sentences — where the source text is presented in isolation (HT) or alongside a translation proposal produced by one of the NMT systems (PE1, PE2). We divide the experiment in two phases. During the warm-up phase, a set of 5 documents is translated by all subjects in their respective language following the same, randomly sampled sequence of modalities (HT, PE1 or PE2). This phase allows the subjects to get used to the setup, and allows us to spot possible issues by inspecting the logged behavioral data before moving forward.² In the main collection phase, each subject is asked to translate documents in a pseudo-random sequence of modalities. This time, however, the sequence of modalities is different for each translator and chosen so that each document gets translated in all three modalities. This allows us to measure translation productivity independently from subjects' productivity and from document-specific difficulties. A graphical overview of this process is shown in Figure 1, and additional details are available in Appendix C. Since productivity and other behavioral metrics can only be estimated with a sizable sample, we prioritize document set size over the number of translators per language during budget allocation.

All subjects are professional translators with at least 3 years of professional experience, at least one year of post-editing experience, and professional proficiency with CAT tools.³ Translators were asked to produce translations of publishable quality. We informed them of the machine-generated nature of the translation suggestions, but the presence of two different MT engines in the background was never explicitly mentioned.

² Warm-up data are excluded from the analysis of Section 4.
³ Individual subjects' details are available in Appendix A.
Moreover, translators were provided with links to the original source articles to facilitate contextualization, and were instructed not to use any external MT engine to produce their translations. Assessing the final quality of the translated material is out of the scope of the current study, although we realize that this is an important consideration to assess usability in a professional context. A summary of the translation guidelines is provided in Appendix B.

3.2 Choice of Source Texts

The selected documents represent a subset of the FLORES-101 evaluation benchmark (Goyal et al., 2022),⁴ which consists of sentences taken from English Wikipedia, covering a mix of topics and domains.⁵ While professional translators are typically specialized in one or a few domains, we opt for a mixed-domain dataset to minimize domain adaptation efforts by the subjects and maximize the generalizability of the results.

Importantly, FLORES-101 includes high-quality human translations into 101 languages, which make it possible to automatically estimate NMT quality and discard excessively low-scoring models or language pairs before the start of our experiment. Another important reason to choose this dataset is the availability of metadata such as the source URL, which allows us to check for the existence of public translations in our languages of interest, and the presence of cross-sentence context. The documents used for our study are fragments of contiguous sentences extracted from the Wikipedia articles that compose the original FLORES-101 corpus. Even if small, the context provided by document structure allows us to simulate a more realistic translation workflow compared to out-of-context sentences.

Based on our available budget, we select 112 English documents from the devtest portion of FLORES-101, corresponding to 450 sentences and 9626 words. More details on the data selection process are provided in Appendix D.

⁴ https://github.com/facebookresearch/flores
⁵ We use a balanced sample of articles sourced from WikiNews, WikiVoyage and WikiBooks.
Language      Family          Genus        Morphology type  # Cases  Dominant order  Order freedom  Script
English       Indo-European   Germanic     Fusional         (3)      SVO             low            Latin
Arabic (MSA)  Afro-Asiatic    Semitic      Introflexive     3        VSO / SVO       low            Arabic
Dutch         Indo-European   Germanic     Fusional         –        SVO / SOV / V2  low            Latin
Italian       Indo-European   Romance      Fusional         –        SVO             low            Latin
Turkish       Altaic          Turkic       Agglutinative    6        SOV             high           Latin*
Ukrainian     Indo-European   Slavic       Fusional         7        SVO             high           Cyrillic
Vietnamese    Austro-Asiatic  Viet-Muong   Isolating        –        SVO             low            Latin*

Table 1: The typological diversity of the selected language sample across various linguistic dimensions. Note that English has 3 cases, but these are only marked for pronouns. V2 indicates verb-second word order. Latin* indicates language-specific extensions of the Latin alphabet. Sources include the World Atlas of Language Structures (Dryer and Haspelmath, 2013) and Futrell et al. (2015).

3.3 Choice of Languages

Besides its architecture, training data has a very important impact on the quality of an NMT system. Unfortunately, using strictly comparable or multi-parallel datasets — like the Europarl (Koehn, 2005) or Bible corpus (Christodouloupoulos and Steedman, 2015) — would dramatically restrict the set of languages available to our study, or imply a prohibitively low translation quality on general-domain text. In order to minimize the effect of training data disparity while maximizing language diversity, we choose representatives of six different language families for which comparable amounts of training data are available in our open-source model, namely Arabic, Dutch, Italian, Turkish, Ukrainian, and Vietnamese. While the number of parallel sentence pairs used for fine-tuning mBART50 on 50 language pairs varied widely, between 45M and only 4K, all our selected language pairs fall within the range of 100K-250K sentence pairs (mid-resourced, see Table 2). As shown in Table 1, our language sample represents a wide diversity in terms of language family, type of morphology, degree of inflection, word order and script.
We balance 3.3 Choice of Languages these observations by including two NMT sys- Besides its architecture, training data has a very tems in our study: Google Translate6 as a rep- important impact on the quality of a NMT sys- resentative of commercial quality, and mBART50 tem. Unfortunately, using strictly comparable or one-to-Many7 (Tang et al., 2021) as a represen- multi-parallel datasets – like the Europarl (Koehn, tative of state-of-the-art open-source multilingual 2005) or Bible corpus (Christodouloupoulos and NMT technology. The original multilingual BART 5 6 We use a balanced sample of articles sourced from Evaluation performed in October 2021 7 WikiNews, WikiVoyage and WikiBooks. facebook/mbart-large-50-one-to-many
Target      Google Translate (PE1)   mBART-50 (PE2)
Language    BLEU   CHRF   COMET      BLEU   CHRF   COMET    # Pairs   Source
Arabic      34.1   65.6   .737       17.0   48.5   .452     226K      IWSLT'17
Dutch       29.1   60.0   .667       22.6   53.9   .532     226K      IWSLT'17.mlt
Italian     32.8   61.4   .781       24.4   54.7   .648     233K      IWSLT'17.mlt
Turkish     35.0   65.5   1.00       18.8   52.7   .755     204K      WMT'17
Ukrainian   31.1   59.8   .758       21.9   50.7   .587     104K      TED58
Vietnamese  45.1   61.9   .724       34.7   54.0   .608     127K      IWSLT'15

Table 2: Automatic MT quality of the two selected NMT systems for English-to-target translation on the full FLORES-101 devtest split. Best scores are highlighted in bold. We report the number of sentence pairs used for the multilingual fine-tuning of mBART50 and their respective sources (Tang et al. 2021, Table 6).

The original multilingual BART model (Liu et al., 2020) is an encoder-decoder Transformer model pre-trained on monolingual documents in 25 languages, using the random span masking and order permutation tasks. Tang et al. (2021) extend mBART by further pre-training it on an additional set of 25 languages and performing multilingual fine-tuning on translation data for 50 languages, producing three configurations of multilingual NMT models: many-to-one, one-to-many, and many-to-many. Our choice of the mBART50 model is largely motivated by its manageable size, its good performance across the set of evaluated languages (see Table 2), and its adoption in other NMT studies (Liu et al., 2021) and post-editing resources (Fomicheva et al., 2020). While mBART50's performance is usually comparable to or slightly worse than that of the tested bilingual NMT models,⁸ the choice of a multilingual model allows us to evaluate the downstream effectiveness of a single system trained on pairs evenly distributed across the tested languages, as mentioned in Section 3.3. Finally, adopting two systems with a marked difference in automatic MT evaluation performance allows us to estimate how a significant increase in standard translation metrics such as BLEU, CHRF and COMET (Papineni et al., 2002; Popović, 2015; Rei et al., 2020) can affect downstream productivity when the NMT systems are adopted in a realistic post-editing scenario.

⁸ See Table 9 for automatic MT quality results of five different models over a larger set of 10 target languages.
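Scores like the BLEU and CHRF columns of Table 2 can be reproduced with the sacrebleu library once system outputs and the FLORES-101 devtest references are available. A sketch with placeholder strings follows (COMET requires the separate unbabel-comet package and is omitted here):

```python
# Sketch: corpus-level BLEU and CHRF with sacrebleu (placeholder data).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["Binnenwateren kunnen een goed thema zijn voor een vakantie."]
references = [["Binnenlandse waterwegen kunnen een goed thema zijn voor een vakantie."]]

print(BLEU().corpus_score(hypotheses, references))
print(CHRF().corpus_score(hypotheses, references))
```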
3.5 Translation Platform and Collected Data

All the translations were performed with PET (Aziz et al., 2012), a computer-assisted translation tool that supports both translating from scratch and post-editing. This tool was chosen because (i) it logs information about the post-editing process, which we use to assess effort (see Section 4); and (ii) it is a mature research-oriented tool that has been successfully used in a number of previous studies, e.g. Koponen et al. (2012); Toral et al. (2018). Compared to commercial tools, PET has fewer functionalities. While this would clearly limit its application in daily translation activities, it proves useful for our controlled setup to avoid confounding effects stemming from such additional capabilities. In addition, the tool is expected to be unfamiliar to all the translators taking part in our study, thereby avoiding differences in performance across translators due to different degrees of proficiency with the software. We also observe that, due to the varied and generic nature of the source documents, functionalities such as concordance and translation memory matches would have proven much less useful in our setup. We collect three main types of data:

• The resulting translations, i.e. the end product of the translators' task in either HT or PE modes, forming a multilingual corpus with one source text and 18 translations (one per language-modality combination), exemplified in Table 3.

• Behavioral data for each translated sentence, including: editing time, number of keystrokes and keystroke type (content, navigation, erase, etc.), number of revisions, as well as number and duration of pauses longer than 300 and 1000 milliseconds (Lacruz et al., 2014); a sketch of the pause computation follows this list.

• Pre-task and post-task questionnaires. The former focuses on demographics, education, and work experience with translation and post-editing. The latter elicits subjective assessments on the quality, effort and enjoyability of the post-editing modality compared to translating from scratch.
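As anticipated in the behavioral-data item above, pause statistics can be derived from logged keystroke timestamps. Below is a sketch under the 300/1000 ms thresholds of Lacruz et al. (2014), with a hypothetical timestamp list:

```python
# Sketch: pause statistics from per-sentence keystroke timestamps (ms).
def pause_stats(keystroke_times_ms, threshold_ms):
    """Return (number of pauses, total pause duration) for gaps above threshold."""
    gaps = [b - a for a, b in zip(keystroke_times_ms, keystroke_times_ms[1:])]
    pauses = [g for g in gaps if g > threshold_ms]
    return len(pauses), sum(pauses)

times = [0, 120, 600, 650, 2100, 2250]   # toy keystroke log
print(pause_stats(times, 300))    # -> (2, 1930): gaps of 480 and 1450 ms
print(pause_stats(times, 1000))   # -> (1, 1450)
```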
ENGLISH          Inland waterways can be a good theme to base a holiday around.

ARABIC     HT:   [Arabic-script text not legible in this transcription]
           PE1 / PE2: [Arabic-script text not legible in this transcription]

DUTCH      HT:   Binnenlandse waterwegen kunnen een goed thema zijn voor een vakantie.
           PE1 MT: De binnenwateren kunnen een goed thema zijn om een vakantie omheen te baseren.
           PE1 PE: Binnenwateren kunnen een goede vakantiebestemming zijn.
           PE2 MT: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te zetten.
           PE2 PE: Binnenwaterwegen kunnen een goed thema zijn om een vakantie rond te organiseren.

ITALIAN    HT:   I corsi d'acqua dell'entroterra possono essere un ottimo punto di partenza da cui organizzare una vacanza.
           PE1 MT: Trasporto fluviale può essere un buon tema per basare una vacanza in giro.
           PE1 PE: I canali di navigazione interna possono essere un ottimo motivo per cui intraprendere una vacanza.
           PE2 MT: I corsi d'acqua interni possono essere un buon tema per fondare una vacanza.
           PE2 PE: I corsi d'acqua interni possono essere un buon tema su cui basare una vacanza.

TURKISH    HT:   İç bölgelerdeki su yolları, tatil planı için iyi bir tema olabilir.
           PE1 MT: İç su yolları, bir tatili temel almak için iyi bir tema olabilir.
           PE1 PE: İç su yolları, bir tatil planı yapmak için iyi bir tema olabilir.
           PE2 MT: İç suyolları, tatil için uygun bir tema olabilir.
           PE2 PE: İç sular tatil için uygun bir tema olabilir.

UKRAINIAN  HT:   Можна спланувати вихiднi, взявши за основу подорож внутрiшнiми водними шляхами.
           PE1 MT: Внутрiшнi воднi шляхи можуть стати гарною темою для вiдпочинку навколо.
           PE1 PE: Внутрiшнi воднi шляхи можуть стати гарною темою для проведення вихiдних.
           PE2 MT: Воднi шляхи можуть бути хорошим об'єктом для базування вiдпочинку навколо.
           PE2 PE: Мiсцевiсть навколо внутрiшнiх водних шляхiв може бути гарним вибором для органiзацiї вiдпочинку.

VIETNAMESE HT:   Du lịch trên sông có thể là một lựa chọn phù hợp cho kỳ nghỉ.
           PE1 MT: Đường thủy nội địa có thể là một chủ đề hay để tạo cơ sở cho một kỳ nghỉ xung quanh.
           PE1 PE: Đường thủy nội địa có thể là một ý tưởng hay để lập kế hoạch cho kỳ nghỉ.
           PE2 MT: Các tuyến nước nội địa có thể là một chủ đề tốt để xây dựng một kì nghỉ.
           PE2 PE: Du lịch bằng đường thủy nội địa là một ý tưởng nghỉ dưỡng không tồi.

Table 3: An example sentence (81.1) from the DivEMT corpus, with the English source and all output modalities for all target languages, including intermediate machine translations (MT) and subsequent post-edits (PE). Colors denote insertions, deletions, substitutions and shifts computed with Tercom (Snover et al., 2006).
4 Post-Editing Effort Across Languages

In this section, we use the DivEMT dataset to quantify the effort of professional translators post-editing NMT outputs across our diverse set of target languages. We consider two main objective indicators of effort, namely (i) task time (and related productivity gains); and (ii) post-editing rate, measured by Human-targeted Translation Edit Rate (Snover et al., 2006). We reiterate that all scores in this section are computed on the same set of source sentences for all languages, resulting in a faithful cross-lingual comparison of post-editing effort that would not be feasible without the DivEMT dataset.

4.1 Temporal Effort and Productivity Gains

We start by comparing task time (seconds per processed source word) across languages and modalities. For this analysis, data is aggregated over the three subjects in each language and editing time is computed for each document. As shown in Figure 2, translation time varies considerably across languages even when no MT system is involved, suggesting an intrinsic variability in translation complexity for different subjects and language pairs. The 'slowest' language pairs are En-Italian and En-Ukrainian, followed by En-Vietnamese, then by En-Arabic and En-Dutch, whereas the 'fastest' pair is En-Turkish. This pattern cannot be easily explained, and is in contrast with traditional factors associated with MT difficulty, such as target morphological richness and source-target language relatedness (Birch et al., 2008).

On the other hand, when comparing the three different modalities, we find HT to be the slowest modality, PE1 the fastest, and PE2 in between the two for all the evaluated languages. This confirms once again the overall positive impact of NMT post-editing on translation productivity. That said, we note how the magnitude of this impact is highly variable across systems and languages.

For a measure of productivity gains that is easier to interpret and more in line with translation industry practices, we turn to productivity expressed in source words processed per minute, and compute the speed-up induced by the two post-editing modalities over translating from scratch (∆HT).

          PROD ↑   ∆HT ↑
ARA  HT    13.1     -
     PE1   21.7     +84%
     PE2   16.3     +10%
NLD  HT    13.6     -
     PE1   28.7     +119%
     PE2   21.7     +61%
ITA  HT     8.8     -
     PE1   18.6     +96%
     PE2   15.6     +95%
TUR  HT    17.9     -
     PE1   25.5     +34%
     PE2   21.0     +12%
UKR  HT     8.0     -
     PE1   12.3     +71%
     PE2    9.8     +14%
VIE  HT    10.2     -
     PE1   13.0     +32%
     PE2   11.1     +23%

Table 4: Median productivity (PROD, # processed source words per minute) and median % post-editing speedup (∆HT) for all analyzed languages and modalities. Arrows denote the direction of improvement.

Table 4 presents our results. Across systems, we find that large differences in automatic MT quality metrics indeed reflect on post-editing effort, suggesting a nuanced picture that is complementary to previous findings: while Zouhar et al. (2021) noted how post-editing time gains quickly saturate when MT quality is already high, we observe that moving from medium-quality to high-quality MT yields meaningful productivity improvements across most of the evaluated languages. Across languages, too, the magnitude of productivity gains ranges widely, from doubling in some languages (PE1 in Dutch, PE1 and PE2 in Italian) to only about 10% (PE2 in Arabic, Turkish and Ukrainian). When only considering the better-performing system (PE1), despite a large variation between +119% (Dutch) and +32% (Vietnamese), post-editing remains clearly beneficial in all languages. As for the open-source system (PE2), three out of six languages display only marginal gains (i.e. less than 15% in Arabic, Turkish and Ukrainian).
Recall that mBART50 (PE2), despite its overall lower performance, is the only one of the two NMT systems that enables a fair comparison across languages (from the point of view of training data size and architecture, see Section 3.4). Interestingly, if we focus on the gains induced by this system, explaining factors like language relatedness and morphological richness come back into play. Specifically, Italian (+95%), Dutch (+61%) and Ukrainian (+14%) are all related to English, but Ukrainian has a much richer morphology. On the other hand, Vietnamese (+23%), Turkish (+12%) and Arabic (+10%) all belong to different families. From the point of view of morphology, Vietnamese is isolating (little to no morphology), while Turkish and Arabic have very rich morphological systems of different types, respectively agglutinative and introflexive, the latter of which is especially problematic for subword segmentation (Amrhein and Sennrich, 2021). Some differences are however harder to explain. For instance, Dutch is genetically related to English and has a morphology that is poorer than Italian's, but its productivity gain with mBART50 is lower (61% vs 95%). This finding is accompanied by an important gap in the BLEU and COMET scores achieved by mBART50 on the two languages (i.e. 22.6 BLEU for Dutch versus 24.4 for Italian, 0.532 COMET for Dutch versus 0.648 for Italian), which cannot be easily explained in terms of training data size.

We note the important role of inter-subject variability,⁹ in line with previous studies (Koponen et al., 2020). Moreover, the small size of our language sample does not allow us to draw direct causal links between specific typological properties and post-editing efficiency. Still, we believe these results have important implications for the claimed 'universality' of current state-of-the-art MT and NLP systems, mostly based on the Transformer architecture (Vaswani et al., 2017) and on BPE-style subword segmentation techniques (Sennrich et al., 2016).

⁹ See Figure 6 for a per-subject productivity overview.

Figure 2: Temporal effort across languages and translation modalities, measured in seconds per processed source word. Each point represents a document, with higher scores representing slower editing. ↑ denotes the number of data points not included in the plot for every language.
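To make the productivity computation concrete, the sketch below derives per-modality productivity (source words per minute) and the ∆HT speed-up from hypothetical (modality, source-word count, editing seconds) records; note that Table 4 reports medians over documents, so the paper's exact aggregation may differ:

```python
# Sketch: median productivity and post-editing speed-up over HT.
import statistics

records = [  # (modality, source words, editing seconds) -- toy data
    ("HT", 40, 180), ("HT", 55, 260),
    ("PE1", 40, 95), ("PE1", 55, 130),
    ("PE2", 40, 140), ("PE2", 55, 200),
]

prod = {
    m: statistics.median(w / (s / 60) for mod, w, s in records if mod == m)
    for m in ("HT", "PE1", "PE2")
}
for m in ("PE1", "PE2"):
    delta = 100 * (prod[m] / prod["HT"] - 1)
    print(f"{m}: {prod[m]:.1f} words/min ({delta:+.0f}% vs HT)")
```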
4.2 Post-Editing Rate

Figure 3 presents the results in terms of Human-targeted Translation Edit Rate, HTER (Snover et al., 2006), which is computed as the sum of substitutions, insertions, deletions and shift operations performed by a translator on the MT output, normalized by its length.
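Concretely, HTER can be obtained by scoring the raw MT output against its post-edited version with a TER implementation. The sketch below uses sacrebleu's TER metric as a stand-in for the original Tercom tool, reusing the Italian PE2 example from Table 3:

```python
# Sketch: HTER as TER(MT output, post-edited reference) via sacrebleu.
from sacrebleu.metrics import TER

mt_outputs = ["I corsi d'acqua interni possono essere un buon tema per fondare una vacanza."]
post_edits = [["I corsi d'acqua interni possono essere un buon tema su cui basare una vacanza."]]

print(TER().corpus_score(mt_outputs, post_edits))  # edit operations per reference word (x100)
```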
First, we can see that the commercial system needs less editing than the open-source one for all languages. Recall that the translators were not informed about the presence of two systems in the background, so we can exclude the possibility of an over-reliance on or distrust towards a specific MT system from past experience. Here too, we observe a high variation both across subjects and across languages. The cross-language pattern is similar, but not identical, for the two MT systems: for Google Translate, Ukrainian shows by far the heaviest edit rate, followed by Vietnamese, whereas Arabic, Dutch, Italian and Turkish all show relatively low amounts of edits. Focusing again on mBART50 for a fairer cross-lingual comparison, Ukrainian is by far the most heavily edited language, followed by a medium-tier group composed of Vietnamese, Arabic and Turkish, and finally by Dutch and Italian as low-edit languages. From these results, we can conclude that several of our observations on linguistic relatedness and morphology type transfer to edit rates, with languages more closely related to English requiring fewer post-edits on average.

Figure 3: Human-targeted Translation Edit Rate (HTER) for Google Translate and mBART50 post-editing across available languages.

Figure 4 visualizes the large gap in edit rates across languages and subjects by presenting the amount of 'errorless' MT sentences that were accepted directly, without any further post-editing. Once again, we observe how the proportion of these sentences is heavily influenced by the NMT system, but nonetheless the majority of translators for the En-Dutch and En-Italian language pairs enjoy lower edit rates compared to their colleagues translating into Ukrainian and Vietnamese. Surprisingly, the En-Turkish setting fares well, despite the lower relatedness of the two languages. In particular, in the context of Google Translate outputs, the average proportion of errorless sentences is ∼25% for the former languages, while for the latter it corresponds to only ∼3%.

Figure 4: Distribution of errorless machine translation sentence outputs (no edits performed during post-editing) for each translator and every language.

To conclude our analysis, we consider the inter-language correlation between the automatic quality metrics computed for the two NMT systems on the existing reference translations and the HTER and productivity scores derived from their adoption in the post-editing setting.¹⁰ Given the availability of FLORES reference translations for 101 languages, a metric correlating well with post-editing effort across target languages would make a valuable tool to predict the usefulness of MT post-editing in a new language pair where MT has not been deployed yet.

          ρ HTER ↓         ρ PROD ↑
          PE1     PE2      PE1     PE2
BLEU      .12     -.07     -.30    -.42
CHRF      -.49    -.49     .45     .34
COMET     -.17    .03      .65     .27

Table 5: Correlation of automatic metrics with the HTER and productivity scores obtained after post-editing, after averaging on language pairs. Arrows denote the direction of expected correlation, and moderate/high scores are in bold.

The results in Table 5 show that BLEU (Papineni et al., 2002) does not correlate at all with post-editing effort, giving empirical support to the claim that BLEU should not be compared across target languages (Bugliarello et al., 2020).

¹⁰ Scores are averaged for every language pair and then correlated. The automatic metrics are computed on the full FLORES devtest (cf. Table 2) as proxies of general MT quality on the given language pair.
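Assuming that ρ in Table 5 denotes a rank correlation over the six per-language averages (the exact coefficient is not restated in this section), the computation reduces to a few lines; the HTER values below are placeholders, not the paper's data:

```python
# Sketch: rank correlation between language-level CHRF and HTER averages.
from scipy.stats import spearmanr

chrf_pe2 = [48.5, 53.9, 54.7, 52.7, 50.7, 54.0]  # Table 2, mBART50 CHRF
hter_pe2 = [0.35, 0.22, 0.20, 0.33, 0.45, 0.34]  # hypothetical HTER averages
rho, pvalue = spearmanr(chrf_pe2, hter_pe2)
print(rho, pvalue)  # a negative rho means higher CHRF <-> less editing
```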
The character n-gram-based CHRF (Popović, 2015), despite its simplicity, shows much better results, with positive correlations across the board. Finally, we look at the neural COMET, a metric based on large pre-trained multilingual language models, which boasts state-of-the-art (within-language-pair) correlations with human judgements of MT quality (Rei et al., 2020). Despite its apparent language-agnostic nature, COMET does not appear to correlate reliably with post-editing effort, which might point to an uneven quality of the neural sentence encoder, or to a lack of diversity in the human judgement dataset used for fine-tuning.

5 Conclusions

We have presented the results of a post-editing study spanning two state-of-the-art NMT systems and six target languages, under a unified setup. The resulting dataset, DivEMT, allows for a controlled comparison of NMT quality and post-editing effort across typologically diverse languages.

We adopted DivEMT to perform a novel cross-language analysis of post-editing effort, along its temporal and edit-rate dimensions. Our results show that the use of NMT drives improvements in productivity across all the evaluated languages, but the magnitude of these improvements depends heavily on the language and on the choice of the underlying NMT system. Moreover, some of the reported cross-language variation seems to reflect language properties traditionally known to affect MT quality, like relatedness to the source and type of morphology (Birch et al., 2008). Given that the underlying open-source system used the exact same architecture and a comparable training data size in all evaluated languages, our results suggest that the current state-of-the-art approach to MT, based on Transformer and BPE-based input tokens, is not universally suitable to model human languages.

Finally, we investigated whether language-pair-level MT quality, as measured by various automatic metrics, could be used as an indicator of the usefulness of MT for post-editing in a new language pair. Our results show consistently moderate correlations for CHRF, but not for BLEU, and also not for COMET, despite its supposed language-agnostic nature.

In future work, a more fine-grained analysis of the types of edits conducted by the translators, and of their differences across languages, could shed more light on our current findings.

Acknowledgements

Data collection was fully funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.021.646. GS is supported by the NWO project InDeep (NWA.1292.19.399).

References

Arafat Ahsan, Vandan Mujadia, and Dipti Misra Sharma. 2021. Assessing post-editing effort in the English-Hindi direction. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).

Farhad Akhbardeh, Arkady Arkhangorodsky, Magdalena Biesialska, Ondřej Bojar, Rajen Chatterjee, Vishrav Chaudhary, Marta R. Costa-jussa, Cristina España-Bonet, Angela Fan, Christian Federmann, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Leonie Harter, Kenneth Heafield, Christopher Homan, Matthias Huck, Kwabena Amponsah-Kaakyire, Jungo Kasai, Daniel Khashabi, Kevin Knight, Tom Kocmi, Philipp Koehn, Nicholas Lourie, Christof Monz, Makoto Morishita, Masaaki Nagata, Ajay Nagesh, Toshiaki Nakazawa, Matteo Negri, Santanu Pal, Allahsera Auguste Tapo, Marco Turchi, Valentin Vydrin, and Marcos Zampieri. 2021. Findings of the 2021 conference on machine translation (WMT21). In Proceedings of the Sixth Conference on Machine Translation, pages 1–88, Online. Association for Computational Linguistics.

Chantal Amrhein and Rico Sennrich. 2021. How suitable are subword segmentation strategies for translating non-concatenative morphology? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 689–705, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Wilker Aziz, Sheila Castilho, and Lucia Specia. 2012. PET: a tool for post-editing and assessing machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3982–3987, Istanbul, Turkey. European Language Resources Association (ELRA).

Loic Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, Andre Martins, Makoto Morishita, and Christof Monz, editors. 2021. Proceedings of the Sixth Conference on Machine Translation. Association for Computational Linguistics, Online.
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267, Austin, Texas. Association for Computational Linguistics.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics.

Arianna Bisazza, Ahmet Üstün, and Stephan Sportel. 2021. On the difficulty of translating free-order case-marking languages. Transactions of the Association for Computational Linguistics, 9:1233–1248.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It's easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics.

Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. 2017. Is Neural Machine Translation the New State of the Art? The Prague Bulletin of Mathematical Linguistics, 108(1):109–120.

Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 2–14, Tokyo, Japan. International Workshop on Spoken Language Translation.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, Rolando Cattoni, and Marcello Federico. 2016. The IWSLT 2016 evaluation campaign. In Proceedings of the 13th International Conference on Spoken Language Translation, Seattle, Washington D.C. International Workshop on Spoken Language Translation.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: the bible in 100 languages. Language Resources and Evaluation, 49(2):375–395.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Joke Daems, Sonia Vandepitte, Robert Hartsuiker, and Lieve Macken. 2017. Translation Methods and Experience: A Comparative Analysis of Human Translation and Post-editing with Students and Professional Translators. Meta: journal des traducteurs / Meta: Translators' Journal, 62(2):245–270.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Çelebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, and Armand Joulin. 2021. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.

Marina Fomicheva, Shuo Sun, Erick Rocha Fonseca, F. Blain, Vishrav Chaudhary, Francisco Guzmán, Nina Lopatina, Lucia Specia, and André F. T. Martins. 2020. MLQE-PE: A multilingual quality estimation and post-editing dataset. ArXiv, abs/2010.04480.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for
language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '13, pages 439–448, New York, NY, USA. Association for Computing Machinery.

Ana Guerberof. 2009. Productivity and quality in MT post-editing. In Beyond Translation Memories: New Tools for Translators Workshop, Ottawa, Canada.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand.

Philipp Koehn and Christof Monz, editors. 2006. Proceedings on the Workshop on Statistical Machine Translation. Association for Computational Linguistics, New York City.

Maarit Koponen, Wilker Aziz, Luciana Ramos, and Lucia Specia. 2012. Post-editing time as a measure of cognitive effort. In Workshop on Post-Editing Technology and Practice.

Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen, and Jörg Tiedemann. 2020. MT for subtitling: User evaluation of post-editing productivity. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 115–124, Lisboa, Portugal. European Association for Machine Translation.

Hans P. Krings. 2001. Repairing Texts: Empirical Investigations of Machine Translation Post-editing Processes. Kent State University Press.

Isabel Lacruz, Michael Denkowski, and Alon Lavie. 2014. Cognitive demand and cognitive effort in post-editing. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas, pages 73–84, Vancouver, Canada. Association for Machine Translation in the Americas.

Samuel Läubli, Chantal Amrhein, Patrick Düggelin, Beatriz Gonzalez, Alena Zwahlen, and Martin Volk. 2019. Post-editing productivity with neural machine translation: An empirical assessment of speed and quality in the banking and finance domain. In Proceedings of Machine Translation Summit XVII: Research Track, pages 267–272, Dublin, Ireland. European Association for Machine Translation.

Samuel Läubli, Mark Fishel, Gary Massey, Maureen Ehrensberger-Dow, and Martin Volk. 2013. Assessing post-editing efficiency in a realistic translation environment. In Proceedings of the 2nd Workshop on Post-editing Technology and Practice, Nice, France.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Zihan Liu, Genta Indra Winata, and Pascale Fung. 2021. Continual mixed-language pre-training for extremely low-resource neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2706–2718, Online. Association for Computational Linguistics.

Arle Lommel and Hélène Pielmeier. 2021. Machine Translation Use at LSPs: Insights on How Language Service Providers' Use of MT Is Evolving. Technical report, Common Sense Advisory Research.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Carla Parra Escartín and Manuel Arcedillo. 2015. Machine translation evaluation made fuzzier: a study on post-editing productivity and evaluation metrics in commercial settings. In Proceedings of Machine Translation Summit XV: Papers, Miami, USA.

Mirko Plitt and François Masselot. 2010. A Productivity Test of Statistical Machine Translation Post-Editing in a Typical Localisation Context. The Prague Bulletin of Mathematical Linguistics, 93(1).

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 223–231, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.

Maria Stasimioti and Vilelmini Sosoni. 2020. Translation vs post-editing of NMT output: Insights from the English-Greek language pair. In Proceedings of the 1st Workshop on Post-Editing in Modern-Day Translation, pages 109–124, Virtual. Association for Machine Translation in the Americas.
Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466, Online. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.

Antonio Toral, Martijn Wieling, and Andy Way. 2018. Post-editing Effort of a Novel With Statistical and Neural Machine Translation. Frontiers in Digital Humanities, 5:1–11.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Vilém Zouhar, Martin Popel, Ondřej Bojar, and Aleš Tamchyna. 2021. Neural machine translation quality and post-editing performance. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10204–10214, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
A Subject Information

During the setup of our experiment, one translator refused to carry out the main task after the warm-up phase, and another was substituted by our choice. Both translators were working in the English-Italian direction and were found to make heavy use of copy-pasting during the warm-up stage, suggesting an incorrect utilization of the platform in light of our guidelines. These translators, whom we identified as T2 and T3 for Italian, were replaced by T5 and T4 respectively. Table 7 reflects the final translator selection for all languages, with the information collected by means of the pre-task questionnaire.

          T1    T2    T3
docW1     HT    HT    HT     \
docW2     PE1   PE1   PE1     | warm-up
...                           |
docWN     PE2   PE2   PE2    /
docM1     HT    PE1   PE2    \
docM2     PE2   HT    PE1     | main
docM3     HT    PE2   PE1     |
...                           |
docMN     PE2   PE1   HT     /

Table 6: Modality scheduling overview. For each language, each subject (Ti) works with a pseudo-random sequence of modalities (HT, PE1, PE2). For the warm-up task (N = 5), all translators are provided with the same documents in the same modalities. For the main task (N = 107), each translator is assigned a modality at random. Each document is translated once for every modality. The same procedure is repeated independently for all the languages.

B Translation Guidelines

An extract of the translation guidelines provided to the translators follows.

• Fill in the pre-task questionnaire before starting the project.
• In this experiment, your goal is to complete the translation of multiple files in one of two possible translation settings.
• Please complete the tasks on your own, even if you know another translator that might be working on this project.
• The translation setting alternates between texts, with each text requiring a single translation in the assigned setting. The two translation settings are:
  1. Translation from scratch. Only the source sentence is provided; you are to write the translation from scratch.
  2. Post-editing. The source sentence is provided alongside a translation produced by an MT system. You are to post-edit this MT output. Post-edit the text so you are satisfied with the final translation (the required quality is publishable quality). If the MT output is too time-consuming to fix, you can delete it and start from scratch. However, please do not systematically delete the provided MT output to give your own translation.
• Important: All editing MUST happen in the provided PET interface: that is, working in other editors and copy-pasting the text back to PET is NOT ALLOWED, because it invalidates the experiment. This is easy to spot in the log data, so please avoid doing this.
• Complete the translation of all files sequentially, i.e. in the order presented in the tool. DO NOT SKIP files at your own convenience.
• The aim is to produce publishable professional-quality translations for both translation settings. Thus, please translate to the best of your abilities. You can return to the files and self-review as many times as you think is necessary.
• Important: The time invested to translate is recorded while the active unit (sentence) is in editing mode (yellow background). Therefore:
  – Only start to translate when you are in editing mode (yellow background). In other words, do not start thinking about how you will translate a sentence when the active unit is not yet in editing mode (green or red background).
  – Do not leave a unit in editing mode (yellow background) while you do something else. If you need to do something unrelated in the middle of a translation, then go out of editing mode and come back to editing mode when you are ready to resume translating.
• First you will be translating a warm-up task, and then the main task. When you are translating each file, you can consult the source text (ST) by looking up the URL in the Excel files that we have sent for reference.
• In order to find the correct terminology for the translation, you can consult any source on the Internet. Important: however, it is NOT ALLOWED to use any MT engine to find terms or alternatives to translations (such as Google Translate, DeepL, MS Translator or any MT engine available in your language). Using MT engines invalidates the experiment, and will be detected in the log data.
• Please fill in the post-task questionnaire ONLY ONCE, after completing all the translation tasks (both warm-up and main tasks).

C Modality Scheduling

Table 6 visualizes an example of the adopted modality scheduling for every translator. The modality of document docMi for translator Tj in the main task is picked randomly among the two modalities not yet assigned to that document, ensuring that each document is translated exactly once per modality.
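One way to realize such a schedule is sketched below: each main-task document receives a random permutation of the three modalities across the three translators, which guarantees full modality coverage per document. This is an illustration of the design, not the authors' actual scheduling script:

```python
# Sketch: pseudo-random main-task modality schedule (one row per document).
import random

MODALITIES = ["HT", "PE1", "PE2"]
N_MAIN_DOCS = 107

random.seed(0)
schedule = [random.sample(MODALITIES, k=3) for _ in range(N_MAIN_DOCS)]
# schedule[i][j] = modality of document docM(i+1) for translator T(j+1);
# each row is a permutation, so every document covers all three modalities.
for doc_id, row in enumerate(schedule[:3], start=1):
    print(f"docM{doc_id}: T1={row[0]}  T2={row[1]}  T3={row[2]}")
```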
            Gender  Age    Degree       Position    En Level  YoE    YoE w/ PE  % PE
Arabic  T1  M       35-44  BA           Freelancer  C2        >15    2-5        20%-40%
        T2  M       25-34  BA           Employed    C2        5-10   2-5        60%-80%
        T3  M       25-34  MA           Freelancer  C1        5-10              80%
Turkish T1  F       25-34  BA           Freelancer  C2        5-10   2-5        <20%
        T2  F       25-34  BA           Freelancer  C1        5-10   5-10       <20%
        T3  M       25-34  High school  Freelancer  C2        10-15

Table 7 (excerpt): Translator information collected via the pre-task questionnaire; the transcription is truncated at this point.