On the Difficulty of Translating Free-Order Case-Marking Languages
Arianna Bisazza, Ahmet Üstün, Stephan Sportel
Center for Language and Cognition, University of Groningen
{a.bisazza, a.ustun}@rug.nl, research@spor.tel

arXiv:2107.06055v1 [cs.CL] 13 Jul 2021

Abstract

Identifying factors that make certain languages harder to model than others is essential to reach language equality in future Natural Language Processing technologies. Free-order case-marking languages, such as Russian, Latin or Tamil, have proved more challenging than fixed-order languages for the tasks of syntactic parsing and subject-verb agreement prediction. In this work, we investigate whether this class of languages is also more difficult to translate by state-of-the-art Neural Machine Translation models (NMT). Using a variety of synthetic languages and a newly introduced translation challenge set, we find that word order flexibility in the source language only leads to a very small loss of NMT quality, even though the core verb arguments become impossible to disambiguate in sentences without semantic cues. The latter issue is indeed solved by the addition of case marking. However, in medium- and low-resource settings, the overall NMT quality of fixed-order languages remains unmatched.

1 Introduction

Despite the tremendous advances achieved in less than a decade, Natural Language Processing remains a field where language equality is far from being reached (Joshi et al., 2020). In the field of Machine Translation, modern neural models have attained remarkable quality for high-resource language pairs like German-English, Chinese-English or English-Czech, with a number of studies claiming even human parity (Hassan et al., 2018; Bojar et al., 2018; Barrault et al., 2019; Popel et al., 2020). These results may lead to the unfounded belief that NMT methods will perform equally well in any language pair, provided similar amounts of training data. In fact, several studies suggest the opposite (Platanios et al., 2018; Ataman and Federico, 2018; Bugliarello et al., 2020).

Why, then, do some language pairs have lower translation accuracy? And, more specifically: Are certain typological profiles more challenging for current state-of-the-art NMT models? Every language has its own combination of typological properties, including word order, morphosyntactic features and more (Dryer and Haspelmath, 2013). Identifying language properties (or combinations thereof) that pose major problems to the current modeling paradigms is essential to reach language equality in future MT (and other NLP) technologies (Joshi et al., 2020), in a way that is orthogonal to data collection efforts. Among others, natural languages adopt different mechanisms to disambiguate the role of their constituents: Flexible order typically correlates with the presence of case marking and, vice versa, fixed order is observed in languages with little or no case marking (Comrie, 1981; Sinnemäki, 2008; Futrell et al., 2015b). Morphologically rich languages in general are known to be challenging for MT at least since the times of phrase-based statistical MT (Birch et al., 2008) due to their larger and sparser vocabularies, and remain challenging even for modern neural architectures (Ataman and Federico, 2018; Belinkov et al., 2017). By contrast, the relation between word order flexibility and MT quality has not been directly studied to our knowledge.

In this paper, we study this relationship using strictly controlled experimental setups. Specifically, we ask:

• Are current state-of-the-art NMT systems biased towards fixed-order languages?
• To what extent does case marking compensate for the lack of a fixed order in the source language?

Unfortunately parallel data is scarce in most of the world languages (Guzmán et al., 2019), and corpora in different languages are drawn from different domains. Exceptions exist, like the widely
used Europarl (Koehn, 2005), but represent a small fraction of the large variety of typological feature combinations attested in the world. This makes it very difficult to run a large-scale comparative study and isolate the factors of interest from, e.g., domain mismatch effects. As a solution, we propose to evaluate NMT on synthetic languages (Gulordava and Merlo, 2016; Wang and Eisner, 2016; Ravfogel et al., 2019) that differ from each other only by specific properties, namely: the order of main constituents, or the presence and nature of case markers (see example in Table 1).

  Fixed        VSO   follows the little cat the friendly dog
               VOS   follows the friendly dog the little cat
  Free+Case          follows the little cat#S the friendly dog#O
               OR    follows the friendly dog#O the little cat#S
  Translation        de kleine kat volgt de vriendelijke hond

Table 1: Example sentence in different fixed/flexible-order English-based synthetic languages and their SVO Dutch translation. The subject in each sentence is underlined. Artificial case markers start with #.

We use this approach to isolate the impact of various source-language typological features on MT quality and to remove the typical confounders of corpus size and domain. Using a variety of synthetic languages and a newly introduced challenge set, we find that state-of-the-art NMT has little to no bias towards fixed-order languages, but only when a sizeable training set is available.

2 Free-order Case-marking Languages

The word order profile of a language is usually represented by the canonical order of its main constituents, (S)ubject, (O)bject, (V)erb. For instance, English and French are SVO languages, while Turkish and Hindi are SOV. Other, less commonly attested, word orders are VSO and VOS, while OSV and OVS are extremely rare (Dryer, 2013). While many other word order features exist (e.g., noun/adjective), they often correlate with the order of main constituents (Greenberg, 1963). A different, but likewise important dimension is that of word order freedom (or flexibility). Languages that primarily rely on the position of a word to encode grammatical roles typically display rigid orders (like English or Mandarin Chinese), while languages that rely on case marking can be more flexible allowing word order to express discourse-related factors like topicalization. Examples of highly flexible-order languages include languages as diverse as Russian, Hungarian, Latin, Tamil and Turkish.¹

¹ See Futrell et al. (2015b) for detailed figures of word order freedom (measured by the entropy of subject and object dependency relation order) in a diverse sample of 34 languages.

In the field of psycholinguistics, due to the historical influence of English-centered studies, word order has long been considered the primary and most natural device through which children learn to infer syntactic relationships in their language (Slobin, 1966). However, cross-linguistic studies have later revealed that children are equally prepared to acquire both fixed-order and inflectional languages (Slobin and Bever, 1982).

Coming to computational linguistics, data-driven MT and other NLP approaches were also historically developed around languages with remarkably fixed order and very simple to moderately simple morphological systems, like English or French. Luckily, our community has been giving increasing attention to more and more languages with diverse typologies, especially in the last decade. So far, previous work has found that free-order languages are more challenging for parsing (Gulordava and Merlo, 2015, 2016) and subject-verb agreement prediction (Ravfogel et al., 2019) than their fixed-order counterparts. This raises the question of whether word order flexibility also negatively affects MT quality.

Before the advent of modern NMT, Birch et al. (2008) used the Europarl corpus to study how various language properties affected the quality of phrase-based Statistical MT. Amount of reordering, target morphological complexity, and historical relatedness of source and target languages were identified as strong predictors of MT quality. Recent work by Bugliarello et al. (2020), however, has failed to show a correlation between NMT difficulty (measured by a novel information-theoretic metric) and several linguistic properties of source and target language, including Morphological Counting Complexity (Sagot, 2013) and Average Dependency Length (Futrell et al., 2015a). While that work specifically aimed at ensuring cross-linguistic comparability, the sample on which the linguistic properties could be computed (Europarl) was rather small and not very typologically diverse, leaving our research questions
open to further investigation. In this paper, we therefore opt for a different methodology: namely, synthetic languages.

3 Methodology

Synthetic languages  This paper presents two sets of experiments: In the first (§4), we create parallel corpora using very simple and predictable artificial grammars and small vocabularies (Lupyan and Christiansen, 2002). See example in Table 1. By varying the position of subject/verb/object and introducing case markers to the source language, we study the biases of two NMT architectures in optimal training data conditions and a fully controlled setup, i.e. without any other linguistic cues that may disambiguate constituent roles. In the second set of experiments (§5), we move to a more realistic setup using synthetic versions of the English language that differ from it in only one or few selected typological features (Ravfogel et al., 2019). For instance, the original sentence's order (SVO) is transformed to different orders, like SOV or VSO, based on its syntactic parse tree.

In both cases, typological variations are introduced in the source side of the parallel corpora, while the target language remains fixed. In this way, we avoid the issue of non-comparable BLEU scores across different target languages. Lastly, we make the simplifying assumption that, when verb-argument order varies from the canonical order in a flexible-order language, it does so in a totally arbitrary way. While this is rarely true in practice, as word order may be predictable given pragmatics or other factors, we focus here on "the extent to which word order is conditioned on the syntactic and compositional semantic properties of an utterance" (Futrell et al., 2015b).

Translation models  We consider two widely used NMT architectures that crucially differ in their encoding of positional information: (i) The recurrent sequence-to-sequence BiLSTM with attention (Bahdanau et al., 2015; Luong et al., 2015) processes the input symbols sequentially and has each hidden state directly conditioned on that of the previous (or following, for the backward LSTM) timestep (Elman, 1990; Hochreiter and Schmidhuber, 1997). (ii) The non-recurrent, fully attention-based Transformer (Vaswani et al., 2017) processes all input symbols in parallel, relying on dedicated embeddings to encode each input's position.² Transformer has nowadays surpassed recurrent encoder-decoder models in terms of generic MT quality. Moreover, Choshen and Abend (2019) have recently shown that Transformer-based NMT models are indifferent to the absolute order of source words, at least when equipped with learned positional embeddings. On the other hand, the lack of recurrence in Transformers has been linked to a limited ability to capture hierarchical structure (Tran et al., 2018; Hahn, 2020). To our knowledge, no previous work has studied the biases of either architecture towards fixed-order languages in a systematic manner.

² We use sinusoidal embeddings (Vaswani et al., 2017). All our models are built using OpenNMT: https://github.com/OpenNMT/OpenNMT-py
4 Toy Parallel Grammar

We start by evaluating our models on a pair of toy languages inspired by the English-Dutch pair and created using a Synchronous Context-Free Grammar (Chiang and Knight, 2006). Each sentence consists of a simple clause with a transitive verb, subject and object. Both arguments are singular and optionally modified by an adjective. The source vocabulary contains 6 nouns, 6 verbs, 6 adjectives, and the complete corpus contains 10k generated sentence pairs. Working with such a small, finite grammar allows us to simulate an otherwise impossible situation where the NMT model can be trained on (almost) the totality of a language's utterances, canceling out data sparsity effects.³

³ Data and code to replicate the toy grammar experiments in this section are available at https://github.com/573phn/cm-vs-wo

Source Language Variants  We consider three source language variants, illustrated in Table 1:

• fixed-order VSO;
• fixed-order VOS;
• mixed-order (randomly chosen between VSO or VOS) with nominal case marking.

We choose these word orders so that, in the flexible-order corpus, the only way to disambiguate argument roles is case marking, realized by simple unambiguous suffixes (#S and #O). The target language is always fixed SVO. The same random split (80/10/10% training/validation/test) is applied to the three corpora.
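To make the setup concrete, the sketch below shows one way such a toy parallel corpus could be generated. It is not the authors' released generator (see footnote 3 for that); the vocabulary items and the helper names (NOUNS, generate_pair) are illustrative assumptions that merely mirror the VSO/VOS/mixed-order variants and the #S/#O suffixes described above.

    import random

    # Hypothetical toy vocabulary; the real grammar uses 6 nouns, 6 verbs, 6 adjectives.
    NOUNS = [("cat", "kat"), ("dog", "hond"), ("bird", "vogel")]
    ADJS  = [("little", "kleine"), ("friendly", "vriendelijke"), ("old", "oude")]
    VERBS = [("follows", "volgt"), ("sees", "ziet"), ("calls", "roept")]

    def noun_phrase():
        """Return (source_tokens, target_tokens) for 'the (ADJ) NOUN'."""
        n_src, n_tgt = random.choice(NOUNS)
        if random.random() < 0.5:                      # optional adjective
            a_src, a_tgt = random.choice(ADJS)
            return ["the", a_src, n_src], ["de", a_tgt, n_tgt]
        return ["the", n_src], ["de", n_tgt]

    def generate_pair(order):
        """order: 'vso', 'vos', or 'mixed' (random VSO/VOS plus #S/#O case suffixes)."""
        v_src, v_tgt = random.choice(VERBS)
        subj_src, subj_tgt = noun_phrase()
        obj_src, obj_tgt = noun_phrase()
        target = subj_tgt + [v_tgt] + obj_tgt          # target side is always fixed SVO
        if order == "mixed":                           # case marking on the head noun only
            subj_src = subj_src[:-1] + [subj_src[-1] + "#S"]
            obj_src = obj_src[:-1] + [obj_src[-1] + "#O"]
            order = random.choice(["vso", "vos"])
        if order == "vso":
            source = [v_src] + subj_src + obj_src
        else:                                          # 'vos'
            source = [v_src] + obj_src + subj_src
        return " ".join(source), " ".join(target)

    if __name__ == "__main__":
        for _ in range(3):
            print(generate_pair("mixed"))

In the mixed-order variant produced this way, only the #S/#O suffixes identify the arguments, exactly the situation the toy experiments are designed to test.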
[Figure 1: plots omitted in this transcription. Panels: (a) BiLSTM with attention, (b) Large Transformer, (c) Small Transformer; x-axis 200-1000, y-axis accuracy.]

Figure 1: Toy language NMT sentence-level accuracy on validation set by number of training epochs. Source languages: fixed-order VSO, fixed-order VOS, and mixed-order (VSO/VOS) with case marking. Target language: always fixed SVO. Each experiment is repeated five times, and averaged results are shown.

NMT Setup  As recurrent model, we trained a 2-layer BiLSTM with attention (Luong et al., 2015) with hidden layer size 500. As Transformer models, we trained one using the standard 6-layer configuration (Vaswani et al., 2017) and a smaller one with only 2 layers given the simplicity of the languages. All models are trained at the word level using the complete vocabulary. More hyper-parameters are provided in Appendix A.1. Note that our goal is not to compare LSTM and Transformer accuracy to each other, but rather to observe the different trends across fixed- and flexible-order language variants. Given the small vocabulary, we use sentence-level accuracy instead of BLEU for evaluation.

Results  As shown in Figure 1, all models achieve perfect accuracy on all language pairs after 1000 training steps, except for the Large Transformer on the free-order language, likely due to overparametrization (Sankararaman et al., 2020). These results demonstrate that our NMT architectures are equally capable of modeling translation of both types of language, when all other factors of variation are controlled for. Nonetheless, a pattern emerges when looking at the learning curves within each plot: While the two fixed-order languages have very similar learning curves, the free-order language with case markers always requires slightly more training steps to converge. This is also the case, albeit to a lesser extent, when the mixed-order corpus is pre-processed by splitting all case suffixes from the nouns (extra experiment not shown in the plot). This trend is noteworthy, given the simplicity of our grammars and the transparency of the case system. As our training sets cover a large majority of the languages, this result might suggest that free-order natural languages need larger training datasets to reach a translation quality similar to that of their fixed-order counterparts. In §5 we validate this hypothesis on more naturalistic language data.
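For reference, the sentence-level accuracy reported above is most naturally read as exact match against the single reference translation (the paper does not spell out the definition, so this is an assumption, and the file names below are placeholders):

    def sentence_accuracy(hyp_path, ref_path):
        """Fraction of hypothesis lines that exactly match the corresponding reference line."""
        with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
            pairs = list(zip(h, r))
        matches = sum(hyp.strip() == ref.strip() for hyp, ref in pairs)
        return matches / len(pairs)

    # Example: print(sentence_accuracy("valid.hyp.txt", "valid.ref.txt"))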
5 Synthetic English Variants

Experimenting with toy languages has its shortcomings, like the small vocabulary size and non-realistic distribution of words and structures. In this section, we follow the approach of Ravfogel et al. (2019) to validate our findings in a less controlled but more realistic setup. Specifically, we create several variants of the Europarl English-French parallel corpus where the source sentences are modified by changing word order and adding artificial case markers. We choose French as target language because of its fixed order, SVO, and its relatively simple morphology.⁴ As Indo-European languages, English and French are moderately related in terms of syntax and vocabulary while being sufficiently distant to avoid a word-by-word translation strategy in many cases.

⁴ According to the Morphological Counting Complexity (Sagot, 2013) values reported by Cotterell et al. (2018), English scores 6 (least complex), Dutch 26, French 30, Spanish 71, Czech 195, and Finnish 198 (most complex).

Source language variants are obtained by transforming the syntactic tree of the original sentences. While Ravfogel et al. (2019) could rely on the Penn Treebank (Marcus et al., 1993) for their monolingual task of agreement prediction, we instead need parallel data. For this reason, we parse the English side of the Europarl v.7 corpus (Koehn, 2005) using the Stanza dependency parser (Qi et al., 2020; Manning et al., 2014). After parsing, we adopt a modified version of the synthetic language generator by Ravfogel et al. (2019) to create the following English variants:⁵
• fixed-order: either SVO, SOV, VSO or VOS;⁶

• free-order: for each sentence in the corpus, one of the six possible orders of (Subject, Object, Verb) is chosen randomly;

• shuffled words: all source words are shuffled regardless of their syntactic role. This is our lower bound, measuring the reordering ability of a model in the total absence of source-side order cues (akin to bag-of-words input).

⁵ Our revised language generator is available at https://github.com/573phn/rnn_typology
⁶ To keep the number of experiments manageable, we omit object-initial languages which are significantly less attested among world languages (Dryer, 2013).

To allow for a fair comparison with the artificial case-marking languages, we remove number agreement features from verbs in all the above variants (cf. says → say in Table 2).

To answer our second research question, we experiment with two artificial case systems proposed by Ravfogel et al. (2019) and illustrated in Table 2 (overt suffixes):

• unambiguous case system: suffixes indicating argument role (subject/object/indirect object) and number (singular/plural) are added to the heads of noun and verb phrases;

• syncretic case system: suffixes indicating number but not grammatical function are added to the heads of main arguments, providing only partial disambiguation of argument roles. This system is inspired from subject/object syncretism in Russian.

Syncretic case systems were found to be roughly as common as non-syncretic ones in a large sample of almost 200 world languages (Baerman and Brown, 2013). Case marking is always combined with the fully flexible order of main constituents. As in (Ravfogel et al., 2019), English number marking is removed from verbs and their arguments before adding the artificial suffixes.

Original (no case):
  The woman says her sisters often invited her for dinner.
SOV (no case):
  The woman her sisters her often invited for dinner say.
SOV, syncretic case marking (overt):
  The woman.arg.sg her sisters.arg.pl she.arg.sg often invited.arg.pl for dinner say.arg.sg.
SOV, unambiguous case marking (overt):
  The woman.nsubj.sg her sisters.nsubj.pl she.dobj.sg often invited.dobj.sg.nsubj.pl for dinner say.nsubj.sg.
SOV, unambiguous case (implicit):
  The womankar her sisterskon shekin often invitedkinkon for dinner saykar.
SOV, unambiguous case (implicit with declensions):
  The womankar her sisterspon shekit often invitedkitpon for dinner saykar.
French translation:
  La femme dit que ses soeurs l'invitaient souvent à dîner.

Table 2: Examples of synthetic English variants and their (common) French translation. The full list of suffixes is provided in Appendix A.3.
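As a rough illustration of how such variants can be derived from a Stanza dependency parse, consider the sketch below. It is not the authors' modified generator (footnote 5 points to that): it only permutes the root verb's subject and object subtrees of the main clause, and the case suffixes follow the overt scheme of Table 2 with number features omitted.

    import random
    import stanza

    # stanza.download("en")  # first run only
    nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

    def subtree(sentence, head_id):
        """Ids of all words dominated by head_id (inclusive), in surface order."""
        ids = {head_id}
        changed = True
        while changed:
            changed = False
            for w in sentence.words:
                if w.head in ids and w.id not in ids:
                    ids.add(w.id)
                    changed = True
        return sorted(ids)

    def reorder(sentence, order="SOV", case=False):
        """Rebuild one parsed sentence with the chosen S/O/V order of the main clause."""
        words = {w.id: w.text for w in sentence.words}
        root = next(w for w in sentence.words if w.head == 0)
        subj = next((w for w in sentence.words if w.head == root.id and w.deprel == "nsubj"), None)
        obj = next((w for w in sentence.words if w.head == root.id and w.deprel == "obj"), None)
        if subj is None or obj is None:
            return " ".join(words[i] for i in sorted(words))   # leave sentence unchanged
        if case:   # overt unambiguous suffixes on the argument heads (number ignored here)
            words[subj.id] += ".nsubj"
            words[obj.id] += ".dobj"
        blocks = {"S": subtree(sentence, subj.id), "O": subtree(sentence, obj.id), "V": [root.id]}
        moved = set(blocks["S"]) | set(blocks["O"]) | set(blocks["V"])
        out = []
        for i in sorted(words):
            if i == min(moved):            # splice the permuted clause at its first position
                for role in order:
                    out.extend(words[j] for j in blocks[role])
            if i not in moved:
                out.append(words[i])
        return " ".join(out)

    doc = nlp("The president thanks the minister.")
    free_order = random.choice(["SVO", "SOV", "VSO", "VOS", "OSV", "OVS"])
    print(reorder(doc.sentences[0], order=free_order, case=True))

A full generator would of course recurse into subordinate clauses and handle indirect objects and verb suffixes as well; the point here is only to show the subtree-level permutation that keeps each argument phrase intact.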
5.1 NMT Setup

Models  As recurrent model, we used a 3-layer BiLSTM with hidden size of 512 and MLP attention (Bahdanau et al., 2015). The Transformer model has the standard 6-layer configuration with hidden size of 512, 8 attention heads, and sinusoidal positional encoding (Vaswani et al., 2017). All models use subword representation based on 32k BPE merge operations (Sennrich et al., 2016), except in the low-resource setup where this is reduced to 10k operations. More hyper-parameters are provided in Appendix A.1.

Data and Evaluation  We train our models on various subsets of the English-French Europarl corpus: 1.9M sentence pairs (high-resource), 100K (medium-resource), 10K (low-resource). For evaluation, we use 5K sentences randomly held-out from the same corpus. Given the importance of word order to assess the correct translation of verb arguments into French, we compute the reordering-focused RIBES⁷ metric (Isozaki et al., 2010) in addition to the more commonly used BLEU (Papineni et al., 2002). In each experiment, the source side of training and test data is transformed using the same procedure whereas the target side remains unchanged. We repeat each experiment 3 times (or 4 for languages with random order choice) and report the averaged results.

⁷ BLEU captures local word-order errors only indirectly (lower precision of higher-order n-grams) and does not capture long-range word-order errors at all. By contrast, RIBES directly measures correlation between the word ranks in the reference and those in the MT output.
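Corpus-level BLEU of the kind reported below can be reproduced with the sacrebleu package; whether the authors used this particular tool is not stated, the file names are placeholders, and RIBES is not part of sacrebleu (it is typically computed with the original script of Isozaki et al. (2010)).

    import sacrebleu

    def corpus_bleu_score(hyp_path, ref_path):
        """Corpus-level BLEU of a hypothesis file against a single reference file."""
        with open(hyp_path, encoding="utf-8") as h:
            hyps = [line.strip() for line in h]
        with open(ref_path, encoding="utf-8") as r:
            refs = [line.strip() for line in r]
        return sacrebleu.corpus_bleu(hyps, [refs]).score

    # Example: print(corpus_bleu_score("test.hyp.fr", "test.ref.fr"))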
5.2 Challenge Set

Besides syntactic structure, natural language often contains semantic and collocational cues that help disambiguate the role of an argument. Small BLEU/RIBES differences between our language variants may indicate actual robustness of a model to word order flexibility, but may also indicate that a model relies on those cues rather than on syntactic structure (Gulordava et al., 2018). To discern these two hypotheses, we create a challenge set of 7,200 simple affirmative and negative sentences where swapping subject and object leads to another plausible sentence.⁸ Each English sentence and its reverse are included in the test set together with the respective translations, as for example:

(1) a. The president thanks the minister. / Le président remercie le ministre.
    b. The minister thanks the president. / Le ministre remercie le président.

The source side is then processed as explained in §5 and translated by the NMT model trained on the corresponding language variant. Thus, translation quality on this set reflects the extent to which NMT models have robustly learnt to detect verb arguments and their roles independently from other cues, which we consider an important sign of linguistic generalization ability. For space constraints we only present RIBES scores on the challenge set.⁹

⁸ More details can be found in Appendix A.2. We release the challenge set at https://github.com/arianna-bis/freeorder-mt
⁹ We also computed BLEU scores: they strongly correlate with RIBES but fluctuate more due to the larger effect of lexical choice.
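A set with this structure can be generated exhaustively from small word lists. The sketch below only illustrates the subject/object swap design; the noun and verb lists and the French templates are invented here and are not taken from the released set (see footnote 8 and Appendix A.2 for the real grammar).

    from itertools import permutations

    # Hypothetical vocabulary; the released set uses Europarl-frequent, low-ambiguity words.
    NOUNS = [("president", "président"), ("minister", "ministre"), ("journalist", "journaliste")]
    VERBS = [("thanks", "remercie"), ("supports", "soutient")]

    def challenge_pairs():
        """Yield (English, French) pairs such that the S/O-swapped sentence is also in the set."""
        for (s_en, s_fr), (o_en, o_fr) in permutations(NOUNS, 2):
            for v_en, v_fr in VERBS:
                # affirmative
                yield (f"The {s_en} {v_en} the {o_en}.",
                       f"Le {s_fr} {v_fr} le {o_fr}.")
                # negative counterpart ("does not" / "ne ... pas")
                v_en_base = v_en[:-1]                 # crude lemmatization: thanks -> thank
                yield (f"The {s_en} does not {v_en_base} the {o_en}.",
                       f"Le {s_fr} ne {v_fr} pas le {o_fr}.")

    for en, fr in list(challenge_pairs())[:4]:
        print(en, "/", fr)

Because permutations() enumerates ordered noun pairs, every sentence and its subject/object reverse are both included, which is exactly the property that makes the set diagnostic for argument-role disambiguation.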
5.3 High-Resource Results

Table 3 reports the high-resource setting results. The first row (original English to French) is given only for reference and shows the overall highest results. The BLEU drop observed when moving to any of the fixed-order variants (including SVO) is likely due to parsing flaws resulting in awkward reorderings. As this issue affects all our synthetic variants, it does not undermine the validity of our findings. For clarity, we center our main discussion on the Transformer results and comment on the BiLSTM results at the end of this section.

                               BI-LSTM                       TRANSFORMER
  English*→French              Europarl-Test     Challenge   Europarl-Test     Challenge
  Large Training (1.9M)        BLEU     RIBES    RIBES       BLEU     RIBES    RIBES
  Original English             39.4     85.0     98.0        38.3     84.9     97.7
  Fixed Order:
    S-V-O                      38.3     84.5     98.1        37.7     84.6     98.0
    S-O-V                      37.6     84.2     97.7        37.9     84.5     97.2
    V-S-O                      38.0     84.2     97.8        37.8     84.6     98.0
    V-O-S                      37.8     84.0     98.0        37.6     84.3     97.2
    Average (fixed orders)     37.9±0.4 84.2±0.3 97.9±0.2    37.8±0.1 84.5±0.1 97.6±0.4
  Flexible Order:
    Random, no case            37.1     83.7     75.1        37.5     84.2     74.1
    Random + syncretic case    36.9     83.6     75.4        37.3     84.2     84.4
    Random + unambig. case     37.3     83.9     97.7        37.3     84.4     98.1
  Shuffle all words            18.5     65.2     79.4        25.8     71.2     83.2

Table 3: Translation quality from various English-based synthetic languages into standard French, using the largest training data (1.9M sentences). NMT architectures: 3-layer BiLSTM seq-to-seq with attention; 6-layer Transformer. Europarl-Test: 5K held-out Europarl sentences; Challenge set: see §5.2. All scores are averaged over three training runs.

Fixed-Order Variants  All four tested fixed-order variants obtain very similar BLEU/RIBES scores on the Europarl-test. This is in line with previous work in NMT showing that linguistically motivated pre-ordering leads to small gains (Zhao et al., 2018) or none at all (Du and Way, 2017), and that Transformer-based models are not biased towards monotonic translation (Choshen and Abend, 2019). On the challenge set, scores are slightly more variable but a manual inspection reveals that this is due to different lexical choices, while word order is always correct for this group of languages. To sum up, in the high-resource setup, our Transformer models are perfectly able to disambiguate the core argument roles when these are consistently encoded by word order.

Fixed-Order vs Random-Order  Somewhat surprisingly, the Transformer results are only marginally affected by the random ordering of verb and core arguments. Recall that in the 'Random' language all six possible permutations of (S,V,O) are equally likely. Thus, Transformer shows an excellent ability to reconstruct the correct constituent order in the general-purpose test set. The picture is very different on the challenge set, where RIBES drops severely from 97.6 to 74.1. These low results were to be expected given the challenge set design (it is impossible even for a human to recognize subject from object in the 'Random, no case' challenge set). Nonetheless, they demonstrate that the general-purpose set cannot tell us whether an NMT model has learnt to reliably exploit syntactic structure of the source language, because of the abundant non-syntactic cues. In fact, even when all source words are shuffled, Transformer still achieves a respectable 25.8/71.2 BLEU/RIBES on the Europarl-test.

Case Marking  The key comparison in our study lies between fixed-order and free-order case-marking languages. Here, we find that case marking can indeed restore near-perfect accuracy on the challenge set (98.1 RIBES). However, this only happens when the marking system is completely unambiguous, which, as already mentioned, is true for only about a half of the real case-marking languages (Baerman and Brown, 2013). Indeed the syncretic system visibly improves quality on the challenge set (74.1 to 84.4 RIBES) but remains far behind the fixed-order score (97.6). In terms of overall NMT quality (Europarl-test), fixed-order
languages score only marginally higher than the free-order case-marking ones, regardless of the unambiguous/syncretic distinction. Thus our finding that Transformer NMT systems are equally capable of modeling the two types of languages (§4) is also confirmed with more naturalistic language data. That said, we will show in Sect. 5.4 that this positive finding is conditional on the availability of large amounts of training samples.

BiLSTM vs Transformer  The LSTM-based results generally correlate with the Transformer results discussed above, however our recurrent models appear to be slightly more sensitive to changes in the source-side order, in line with previous findings (Choshen and Abend, 2019). Specifically, translation quality on Europarl-test fluctuates slightly more than Transformer among different fixed orders, with the most monotonic order (SVO) leading to the best results. When all words are randomly shuffled, BiLSTM scores drop much more than Transformer. However, when comparing the fixed-order variants to the ones with free order of main constituents, BiLSTM shows only a slightly stronger preference for fixed-order, compared to Transformer. This suggests that, by experimenting with arbitrary permutations, Choshen and Abend (2019) might have overestimated the bias of recurrent NMT towards more monotonic translation, whereas the more realistic combination of constituent-level reordering with case marking used in our study is not so problematic for this type of model.

Interestingly, on the challenge set, BiLSTM and Transformer perform on par, with the notable exception that syncretic case is much more difficult for the BiLSTM model. Our results agree with the large drop of subject-verb agreement prediction accuracy observed by Ravfogel et al. (2019) when experimenting with the random order of main constituents. However, their scores were also low for SOV and VOS, which is not the case in our NMT experiments. Besides the fact that our challenge set only contains short sentences (hence no long dependencies and few agreement attractors), our task is considerably different in that agreement only needs to be predicted in the target language, which is fixed-order SVO.

Summary  Our results so far suggest that state-of-the-art NMT models, especially if Transformer-based, have little or no bias towards fixed-order languages. In what follows, we study whether this finding is robust to differences in data size, type of morphology, and target language.
5.4 Effect of Data Size and Morphological Features

Data Size  The results shown in Table 3 represent a high-resource setting (almost 2M training sentences). While recent successes in cross-lingual transfer learning alleviate the need for labeled data (Liu et al., 2020), their success still depends on the availability of large unlabeled data as well as other, yet to be explained, language properties (Joshi et al., 2020). We then ask: Do free-order case-marking languages need more data than fixed-order non-case-marking ones to reach similar NMT quality? We simulate a medium- and low-resource scenario by sampling 100K and 10K training sentences, respectively, from the full Europarl data. To reduce the number of experiments, we only consider Transformer with one fixed-order language variant (SOV)¹⁰ and exclude syncretic case marking. To disentangle the effect of word order from that of case marking on low-resource translation quality, we also experiment with a language variant combining fixed-order (SOV) and case marking. Results are shown in Figure 2 and discussed below.

¹⁰ We choose SOV because it is a commonly attested word order and is different from that of the target language, thereby requiring some non-trivial reorderings during translation.

Morphological Features  The artificial case systems used so far included easily separable suffixes with a 1:1 mapping between grammatical categories and morphemes (e.g. .nsubj.sg, .dobj.pl) reminiscent of agglutinative morphologies. Many world languages, however, do not comply with this 1:1 mapping principle but display flexivity (multiple categories conveyed by one morpheme) and/or exponence (the same category expressed by various, lexically determined, morphemes). Well-studied examples of languages with case+number exponence include Russian and Finnish, while flexive languages include, again, Russian and Latin. Motivated by previous findings on the impact of fine-grained morphological features on language modeling difficulty (Gerz et al., 2018), we experiment with three types of suffixes (see examples in Table 2):

• overt: number and case are denoted by easily separable suffixes (e.g. .nsubj.sg, .dobj.pl) similar to agglutinative languages (1:1);

• implicit: the combination of number and case is expressed by unique suffixes without internal structure (e.g. kar for .nsubj.sg, ker for .dobj.pl) similar to fusional languages. This system displays exponence (many:1);

• implicit with declensions: like the previous, but with three different paradigms each arbitrarily assigned to a different subset of the lexicon. This system displays exponence and flexivity (many:many).

A complete overview of our morphological paradigms is provided in Appendix A.3. All our languages have moderate inflectional synthesis and, in terms of fusion, are exclusively concatenative. Despite this, the effect on vocabulary size is substantial: 180% increase by overt and implicit case marking, 250% by implicit marking with declensions (in the full data setting).
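The three suffix systems can be thought of as different mappings from the feature bundle (case, number, declension class) to a surface string. A minimal sketch of such mappings is shown below; the forms kar/kon/kin/ker and pon/kit are taken from Table 2 and the bullet above, while the remaining forms and the class-assignment function are placeholders, not the authors' full paradigm from Appendix A.3.

    # Overt (agglutinative-style): one separable morpheme per category.
    def overt_suffix(case, number):
        return f".{case}.{number}"                    # e.g. ('nsubj', 'sg') -> '.nsubj.sg'

    # Implicit (fusional-style): one unanalyzable suffix per (case, number) pair.
    IMPLICIT = {
        ("nsubj", "sg"): "kar", ("nsubj", "pl"): "kon",   # attested in Table 2
        ("dobj", "sg"): "kin",  ("dobj", "pl"): "ker",    # 'ker' from the bullet above
    }

    # Implicit with declensions: three arbitrary paradigms (many:many mapping).
    DECLENSIONS = [
        IMPLICIT,
        {("nsubj", "sg"): "kar", ("nsubj", "pl"): "pon",  # 'pon'/'kit' attested in Table 2
         ("dobj", "sg"): "kit",  ("dobj", "pl"): "pel"},  # 'pel' invented
        {("nsubj", "sg"): "sar", ("nsubj", "pl"): "son",  # paradigm fully invented
         ("dobj", "sg"): "sin",  ("dobj", "pl"): "sel"},
    ]

    def declension_class(lemma):
        """Deterministic stand-in for the arbitrary, but fixed, lexical class assignment."""
        return sum(ord(c) for c in lemma) % len(DECLENSIONS)

    def mark(word, case, number, system="overt"):
        if system == "overt":
            return word + overt_suffix(case, number)
        if system == "implicit":
            return word + IMPLICIT[(case, number)]
        return word + DECLENSIONS[declension_class(word)][(case, number)]

    print(mark("woman", "nsubj", "sg", "overt"))       # woman.nsubj.sg
    print(mark("woman", "nsubj", "sg", "implicit"))    # womankar
    print(mark("sisters", "nsubj", "pl", "implicit"))  # sisterskon

The implicit system collapses case and number into a single opaque form, and the declension system additionally makes that form depend on the lexical item, which is what drives the vocabulary-size increases reported above.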
[Figure 2: plots omitted in this transcription. Panels: (a) Europarl-Test BLEU, (b) Europarl-Test RIBES, (c) Challenge RIBES; x-axis: training size (1.9M, 100K, 10K); curves: sov, sov+overt, random, r+overt, r+implicit, r+declens.]

Figure 2: EN*-FR Transformer NMT quality versus training data size (x-axis). Source language variants: Fixed-order (SOV) and free-order (random) with different case systems (r+overt/implicit/declens). Scores averaged over three training runs. Detailed numerical results are provided in Appendix A.4.

Results  Results are shown in the plots of Figure 2 (detailed numerical scores are given in Appendix A.4). We find that reducing training size has, not surprisingly, a major effect on translation quality. Among source language variants, fixed-order obtains the highest quality across all setups. In terms of BLEU (2(a)), the spread among variants increases somewhat with less data; however, differences are small. A clearer picture emerges from RIBES (2(b)), whereby less data clearly leads to more disparity. This is already visible in the 100k setup, with the fixed SOV language dominating the others. Case marking, despite being necessary to disambiguate argument roles in the absence of semantic cues, does not improve translation quality and even degrades it in the low-resource setup. Looking at the challenge set results (2(c)) we see that the free-order case-marking languages are clearly disadvantaged: In the mid-resource setup, case marking improves substantially over the underspecified 'Random, no case' language but remains far behind fixed-order. In low-resource, case marking notably hurts quality even in comparison with the underspecified language. These results thus demonstrate that free-order case-marking languages require more data than their fixed-order counterparts to be accurately translated by state-of-the-art NMT.¹¹ Our experiments also show that this greater learning difficulty is not only due to case marking (and subsequent data sparsity), but also to word order flexibility (compare sov+overt to r+overt in Figure 2).

¹¹ In the light of this finding, it would be interesting to revisit the evaluation of Bugliarello et al. (2020) in relation to varying data sizes.
Regarding different morphology types, we do not observe a consistent trend in terms of overall translation quality (Europarl-test): in some cases, the richest morphology (with declensions) slightly outperforms the one without declensions, a result that would deserve further exploration. On the other hand, results on the challenge set, where most words are case-marked, show that morphological richness inversely correlates with translation quality when data is scarce. We postulate that our artificial morphologies may be too limited in scope (only 3-way case and number marking) to impact overall translation quality and leave the investigation of richer inflectional synthesis to future work.

5.5 Effect of Target Language

All results so far involved translation into a fixed-order (SVO) language without case marking. To verify the generality of our findings, we repeat a subset of experiments with the same synthetic English variants, but using Czech or Dutch as target languages. Czech has rich fusional morphology including case marking, and very flexible order. Dutch has simple morphology (no case marking) and moderately flexible, syntactically determined order.¹²

¹² Dutch word order is very similar to German, with the position of S, V, and O depending on the type of clause.

Figure 3 shows the results with 100k training sentences. In terms of BLEU, differences are even smaller than in English-French. In terms of RIBES, trends are similar across target languages, with the fixed SOV source language obtaining best results and the case-marked source language obtaining worst results. This suggests that the major findings of our study are not due to the specific choice of French as the target language.

[Figure 3: plots omitted in this transcription. Panels: (a) Europarl BLEU, (b) Europarl RIBES; target pairs: en*-FR, en*-CS, en*-NL; curves: sov, random, r+overt.]

Figure 3: Transformer results for more target languages (100k training size). Scores averaged over 2 runs.
6 Related Work

The effect of word order flexibility on NLP model performance has been mostly studied in the field of syntactic parsing, for instance using Average Dependency Length (Gildea and Temperley, 2010; Futrell et al., 2015a) or head-dependent order entropy (Futrell et al., 2015b; Gulordava and Merlo, 2016) as syntactic correlates of word order freedom. Related work in language modeling has shown that certain languages are intrinsically more difficult to model than others (Cotterell et al., 2018; Mielke et al., 2019) and has furthermore studied the impact of fine-grained morphology features (Gerz et al., 2018) on LM perplexity.

Regarding the word order biases of seq-to-seq models, Chaabouni et al. (2019) use miniature languages similar to those of Sect. 4 to study the evolution of LSTM-based agents in a simulated iterated learning setup. Their results in a standard "individual learning" setup show, like ours, that a free-order case-marking toy language can be learned just as well as a fixed-order one, confirming earlier results obtained by simple Elman networks trained for grammatical role classification (Lupyan and Christiansen, 2002). Transformer was not included in these studies. Choshen and Abend (2019) measure the ability of LSTM- and Transformer-based NMT to model a language pair where the same arbitrary (non syntactically motivated) permutation is applied to all source sentences. They find that Transformer is largely indifferent to the order of source words (provided this is fixed and consistent across training and test set) but nonetheless struggles to translate long dependencies actually occurring in natural data. They do not directly study the effect of order flexibility.

The idea of permuting dependency trees to generate synthetic languages was introduced independently by Gulordava and Merlo (2016) (discussed above) and by Wang and Eisner (2016), the latter with the aim of diversifying the set of treebanks currently available for language adaptation.

7 Conclusions

We have presented an in-depth analysis of how Neural Machine Translation difficulty is affected by word order flexibility and case marking in the source language. Although these common language properties were previously shown to negatively affect parsing and agreement prediction accuracy, our main results show that state-of-the-art NMT models, especially Transformer-based ones, have little or no bias towards fixed-order languages. Our simulated low-resource experiments, however, reveal a different picture, that is: free-order case-marking languages require more data to be translated as accurately as their fixed-order counterparts. Since parallel data (like labeled data in general) is scarce for most of the world languages (Guzmán et al., 2019; Joshi et al., 2020), we believe this should be considered as a further obstacle to language equality in future NLP technologies.

In future work, our analysis should be extended to target language variants using principled alternatives to BLEU (Bugliarello et al., 2020), and to other typological features that are likely to affect MT performance, such as inflectional synthesis and degree of fusion (Gerz et al., 2018). Finally, the synthetic languages and challenge set proposed in this paper could be used to evaluate syntax-aware NMT models (Eriguchi et al., 2016; Bisk and Tran, 2018; Currey and Heafield, 2019), which promise to better capture linguistic structure, especially in low-resource scenarios.

Acknowledgements

Arianna Bisazza was partly funded by the Netherlands Organization for Scientific Research (NWO) under project number 639.021.646. We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine HPC cluster, and the anonymous reviewers for their helpful comments.

References

Duygu Ataman and Marcello Federico. 2018. An evaluation of two vocabulary reduction methods for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 97–110.

Matthew Baerman and Dunstan Brown. 2013. Case syncretism. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2008. Predicting success in machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 745–754, Honolulu, Hawaii. Association for Computational Linguistics.

Yonatan Bisk and Ke Tran. 2018. Inducing grammars with and for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 25–35, Melbourne, Australia. Association for Computational Linguistics.

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303, Belgium, Brussels. Association for Computational Linguistics.

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell, and Naoaki Okazaki. 2020. It's easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online. Association for Computational Linguistics.

Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. 2019. Word-order biases in deep-agent emergent communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5166–5175, Florence, Italy. Association for Computational Linguistics.

David Chiang and Kevin Knight. 2006. An introduction to synchronous grammars. Tutorial available at http://www.isi.edu/~chiang/papers/synchtut.pdf.

Leshem Choshen and Omri Abend. 2019. Automatically extracting challenge sets for non-local phenomena in neural machine translation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 291–303, Hong Kong, China. Association for Computational Linguistics.

Bernard Comrie. 1981. Language Universals and Linguistic Typology. Blackwell.

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana. Association for Computational Linguistics.

Anna Currey and Kenneth Heafield. 2019. Incorporating source syntax into transformer-based neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 24–33, Florence, Italy. Association for Computational Linguistics.

Matthew S. Dryer. 2013. Order of subject, object and verb. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Jinhua Du and Andy Way. 2017. Pre-reordering for neural machine translation: Helpful or harmful? The Prague Bulletin of Mathematical Linguistics, 108(1):171–182.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015a. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336–10341.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015b. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala, Sweden. Uppsala University, Uppsala, Sweden.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.

Daniel Gildea and David Temperley. 2010. Do grammars minimize dependency length? Cognitive Science, 34(2):286–310.

Joseph H. Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg, editor, Universals of Human Language, pages 73–113. MIT Press, Cambridge, Mass.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

Kristina Gulordava and Paola Merlo. 2015. Diachronic trends in word order freedom and dependency length in dependency-annotated corpora of Latin and ancient Greek. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 121–130, Uppsala, Sweden. Uppsala University, Uppsala, Sweden.

Kristina Gulordava and Paola Merlo. 2016. Multilingual dependency parsing evaluation: a large-scale analysis of word order properties using artificial data. Transactions of the Association for Computational Linguistics, 4:343–356.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6100–6113.

Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Automatic evaluation of translation quality for distant language pairs. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 944–952, Cambridge, MA. Association for Computational Linguistics.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In The Tenth Machine Translation Summit Proceedings of Conference, pages 79–86. International Association for Machine Translation.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Gary Lupyan and Morten H. Christiansen. 2002. Case, word order, and language learnability: Insights from connectionist modeling. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4975–4989, Florence, Italy. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 425–435.

Martin Popel, Marketa Tomkova, Jakub Tomek, Łukasz Kaiser, Jakob Uszkoreit, Ondřej Bojar, and Zdeněk Žabokrtský. 2020. Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(1):4381.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3532–3542, Minneapolis, Minnesota. Association for Computational Linguistics.
Benoît Sagot. 2013. Comparing Complexity Measures. In Computational approaches to morphological complexity, Paris, France. Surrey Morphology Group.

Karthik Abinav Sankararaman, Soham De, Zheng Xu, W. Ronny Huang, and Tom Goldstein. 2020. Analyzing the effect of neural network architecture on training performance. In Proceedings of Machine Learning and Systems 2020, pages 9834–9845.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Kaius Sinnemäki. 2008. Complexity trade-offs in core argument marking. In Language Complexity, pages 67–88. John Benjamins.

Dan I. Slobin. 1966. The acquisition of Russian as a native language. The genesis of language: A psycholinguistic approach, pages 129–148.

Dan I. Slobin and Thomas G. Bever. 1982. Children use canonical sentence schemas: A crosslinguistic study of word order and inflections. Cognition, 12(3):229–265.

Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505.

Adina Williams, Tiago Pimentel, Hagen Blix, Arya D. McCarthy, Eleanor Chodroff, and Ryan Cotterell. 2020. Predicting declension class from form and meaning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6682–6695, Online. Association for Computational Linguistics.

Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).
All nouns in nual Conference on Neural Information Pro- the grammar can plausibly act as both subject and cessing Systems 2017, 4-9 December 2017, object of the verbs, so that an MT system must Long Beach, CA, USA, pages 5998–6008. rely on sentence structure to get perfect translation accuracy. The sentences are from a general do- Dingquan Wang and Jason Eisner. 2016. The main, but we specifically choose nouns and verbs galactic dependencies treebanks: Getting more with little translation ambiguity that are well rep- data by synthesizing new languages. Transac- resented in the Europarl corpus: most have thou- tions of the Association for Computational Lin- sands of occurrences, while the rarest word has guistics, 4:491–505. about 80. Sentence example (English side): ‘The