The paradox of the compositionality of natural language: a neural machine translation case study
Verna Dankers (ILCC, University of Edinburgh) — vernadankers@gmail.com
Elia Bruni (University of Osnabrück) — elia.bruni@gmail.com
Dieuwke Hupkes (Facebook AI Research) — dieuwkehupkes@fb.com

arXiv:2108.05885v1 [cs.CL] 12 Aug 2021

Abstract

Moving towards human-like linguistic performance is often argued to require compositional generalisation. Whether neural networks exhibit this ability is typically studied using artificial languages, for which the compositionality of input fragments can be guaranteed and their meanings algebraically composed. However, compositionality in natural language is vastly more complex than this rigid, arithmetics-like version of compositionality, and as such artificial compositionality tests do not allow us to draw conclusions about how neural models deal with compositionality in more realistic scenarios. In this work, we re-instantiate three compositionality tests from the literature and reformulate them for neural machine translation (NMT). The results highlight two main issues: the inconsistent behaviour of NMT models and their inability to (correctly) modulate between local and global processing. Aside from an empirical study, our work is a call to action: we should rethink the evaluation of compositionality in neural networks of natural language, where composing meaning is not as straightforward as doing the math.

1 Introduction

Although the successes of deep neural networks in natural language processing (NLP) are astounding and undeniable, they are still regularly criticised for lacking the powerful generalisation capacities that characterise human intelligence. A frequently mentioned concept in such critiques is compositionality: the ability to build up the meaning of a complex expression by combining the meanings of its parts (e.g. Partee, 1984). Compositionality is assumed to play an essential role in how humans understand language, but whether modern neural networks also exhibit this property has since long been a topic of vivid debate (e.g. Fodor and Pylyshyn, 1988; Smolensky, 1990; Marcus, 2003; Nefdt, 2020).

Studies about the compositional abilities of neural networks consider almost exclusively models trained on artificial datasets, in which compositionality can be ensured and isolated (e.g. Lake and Baroni, 2018; Hupkes et al., 2020).¹ In such tests, the interpretation of expressions is computed completely locally: every subpart is evaluated independently – without taking into account any external context – and the meaning of the whole expression is then formed by combining the meaning of its parts in a bottom-up fashion. This protocol matches the type of compositionality observed in arithmetics: the meaning of (3 + 5) is always 8, independent from the context it occurs in.

However, as exemplified by the sub-par performance of symbolic models that allow only such strict protocols, compositionality in natural domains is far more intricate than this rigid, arithmetics-like variant of compositionality. Natural language seems very compositional, but at the same time, it is riddled with cases that are difficult to interpret with a strictly local interpretation of compositionality. Sometimes, the meaning of an expression does not derive from its parts (e.g. idioms), but the parts themselves are used compositionally in other contexts. Other times, the meaning of an expression does depend on its parts in a compositional way, but arriving at this meaning requires a more global approach, in which the meaning of a part is disambiguated using information from somewhere else in the sentence (e.g. homonyms, scope ambiguities). Successfully modelling language requires balancing such local and global forms of (non-)compositionality, which makes evaluating compositionality in state-of-the-art models 'in the wild' a complicated endeavour.

In this work, we face this challenge head-on. We concentrate on the domain of neural machine translation (NMT), which is paradigmatically close to

¹ With the exception of Raunak et al. (2019), work on compositional generalisation in the context of language considers highly structured subsets of natural language (e.g. Kim and Linzen, 2020; Keysers et al., 2019) or focuses on tendencies of neural networks to learn shortcuts (e.g. McCoy et al., 2019).
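The strictly local, bottom-up protocol can be made concrete with a toy evaluator (an illustrative sketch of ours, not code from the paper): the meaning of every node is computed from its parts alone, so a subexpression such as (3 + 5) denotes 8 in every context.

```python
# Illustrative sketch of strictly local, bottom-up composition:
# the meaning of each node is a function of the meanings of its
# children only; no surrounding context is ever consulted.
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def evaluate(tree):
    """Evaluate a nested tuple such as ("+", 3, 5) bottom-up."""
    if isinstance(tree, int):   # an atomic part: its meaning is itself
        return tree
    op, left, right = tree
    # Combine the independently computed meanings of the parts.
    return OPS[op](evaluate(left), evaluate(right))

# (3 + 5) is always 8, wherever the subexpression occurs.
assert evaluate(("+", 3, 5)) == 8
assert evaluate(("*", ("+", 3, 5), 2)) == 16
```

An idiom like "raining cats and dogs" is precisely what such an evaluator cannot capture: its meaning is not a function of the meanings of "cats" and "dogs".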
the sequence-to-sequence tasks typically considered for compositionality tests, where the target sequences are assumed to represent the 'meaning' of the input sequences.² Furthermore, MT is an important domain of NLP, for which compositional generalisation is suggested to be important to produce more robust translations and train adequate models for low-resource languages (see, e.g. Chaabouni et al., 2021). As an added advantage, compositionality is traditionally well studied and motivated for MT (Janssen and Partee, 1997; Janssen, 1998).

We reformulate three theoretically grounded tests from Hupkes et al. (2020): systematicity, substitutivity and overgeneralisation. As accuracy – commonly used in artificial compositionality tests – is not a suitable evaluation metric for MT, we base our evaluations on the extent to which models behave consistently, rather than correctly. In our tests for systematicity and substitutivity, we consider whether processing is maximally local; in our overgeneralisation test, we consider how models treat idioms that are assumed to require global processing. Our results indicate that consistent behaviour is currently not achieved and that many inconsistencies are due to models not achieving the right level of processing (neither local nor global). The actual level of processing, however, is captured by neither consistency nor accuracy measures.

With our study, we contribute to ongoing questions about the compositional abilities of neural networks, and we provide nuance to the nature of this question where natural language is concerned: how local should the compositionality of models for natural language be, and how does the type of compositionality required for MT, where translations are used as proxies to meaning, relate to the compositionality of natural language at large? Aside from an empirical study, our work is also a call to action: we should rethink the evaluation of compositionality in neural networks trained on natural language, where composing meaning is not as straightforward as doing the math.

2 Local and global compositionality

Tests for compositional generalisation in neural networks typically assume an arithmetic-like version of compositionality, in which meanings can be computed in a completely bottom-up fashion. The compositions thus require only local information – they are context independent and unambiguous: (2 + 1) × (4 − 5) is evaluated in a manner similar to walk twice after jump thrice (a fragment from SCAN by Lake and Baroni, 2018). In MT, this type of compositionality would imply that a change in a word or phrase should affect only the translation of that particular word or phrase, or at most the smallest constituent it is a part of. For instance, the translation of the phrase the girl should not change depending on the verb phrase that follows it, and in the translation of a conjunction of two sentences, making a change in the first conjunct should not change the translation of the second. While translating in such a local fashion seems robust and productive, it is not always realistic. Consider, for instance, the translation of the polysemous word 'dates' in "She hated bananas and she liked dates".

In linguistics and philosophy of language, the level of compositionality is a widely discussed topic, which has resulted in a wide variety of compositionality definitions. One of the most well-known definitions is the one from Partee (1984):

"The meaning of a compound expression is a function of the meanings of its parts and of the way they are syntactically combined."³

This definition places almost no restrictions on the relationship between an expression and its parts. The type of function that relates them is unspecified and could take into account the global syntactic structure or even external arguments, and the meanings of the parts, too, can depend on global information. Partee's version of compositionality is thus also called weak, global, or open compositionality (Szabó, 2012; García-Ramírez, 2019). When, instead, the meaning of a compound can depend only on the meaning or translation of its largest parts, regardless of their internal structure (the arithmetics-like variant of compositionality), this is called strong, local or closed compositionality (Szabó, 2012; Jacobson, 2002).

This stricter interpretation of compositionality underlies the generalisation aimed for in previous compositionality tests (see also §5). Yet, it is not suitable for modelling natural language phenomena traditionally considered problematic for compositionality, such as quotation, belief sentences, ambiguities, idioms and noun-noun compounds, to name a few (Pagin and Westerståhl, 2010; Pavlick and Callison-Burch, 2016). Here, we aim to open up the discussion about what it means for computational models of natural language to be compositional, and, to that end, discuss properties that require composing meaning locally, and look at global compositionality through idioms.

² In SCAN (Lake and Baroni, 2018), for instance, the input is an instruction (walk twice) and the intended output represents its execution (walk walk).
³ The principle can straightforwardly be extended to translation, by replacing the word meaning with the word translation (Janssen and Partee, 1997; Janssen, 1998).

3 Setup

First, we describe the models analysed and the data that form the basis of our tests.

3.1 Model and training

We focus on English-Dutch translation, for which we can ensure good command of both languages. We train Transformer-base models (Vaswani et al., 2017) using Fairseq (Ott et al., 2019).⁴ Our training data consists of a collection of MT corpora bundled in OPUS (Tiedemann and Thottingal, 2020), of which we use the English-Dutch subset provided by Tiedemann (2020), containing 69M source-target pairs.⁵ To examine the impact of the amount of training data – a dimension that is relevant because compositionality is hypothesised to be important particularly in low-resource settings – we train one setup using the full dataset, one using 1/8 of the data (medium), and one using one million source-target pairs (small). For each setup, we train models with five seeds and average the results.

To ensure the quality of our trained models, we adopt the FLORES-101 corpus (Goyal et al., 2021), which contains 3001 sentences from Wikinews, Wikijunior and WikiVoyage, translated by professional translators and split across three subsets. We train the models until convergence on the 'dev' set. Afterwards, we compute BLEU scores on the 'devtest' set, using beam search (beam size = 5), yielding scores of 20.5±.4, 24.3±.3 and 25.7±.1 for the small, medium and full datasets, respectively.

⁴ Training details for Transformer base are available here.
⁵ Visit the Tatoeba challenge for the OPUS training data.

3.2 Evaluation data

For our compositionality tests, we use three different types of data – synthetic, semi-natural and natural – with varying degrees of control.

Synthetic data. For our synthetic test data, we take inspiration from the literature on probing hierarchical structure in language models: we consider the synthetic data generated by Lakretz et al. (2019), which contains a large number of sentences with a fixed syntactic structure and diverse lexical material. Lakretz et al. (2019), and other authors afterwards (e.g. Jumelet et al., 2019; Lu et al., 2020), successfully used this data to assess number agreement in language models, validating the data as a reasonable resource to test neural models. We extend the set of templates in the dataset and the vocabulary used. For each of the resulting ten templates (see Table 1a), we generate 3000 sentences.

Semi-natural data. In the synthetic data, we have full control over the sentence structure and lexical items, but the sentences are shorter (9 tokens vs. 16 in OPUS) and simpler than is typical in NMT data. To obtain more complex yet plausible test sentences, we employ a data-driven approach to generate semi-natural data. Using the tree substitution grammar Double DOP (Van Cranenburgh et al., 2016), we obtain noun and verb phrases (NP, VP) whose structures frequently occur in OPUS. We then embed these NPs and VPs in ten synthetic templates with 3000 samples each (see Table 1b). See Appendix A for details on the data generation.

Natural data. Lastly, we use natural data that we extract directly from OPUS. The extraction procedures are test-specific and are provided in the subsections of the individual tests (§4).

4 Experiments and results

In our experiments, we consider the two properties of systematicity (§4.1) and substitutivity (§4.2), which require local meaning composition, as well as the translation of idioms, which requires a different (more global) type of processing (§4.3).

4.1 Systematicity

One of the most commonly tested properties of compositional generalisation is systematicity – the ability to understand novel combinations made up from known components (Lake and Baroni, 2018; Raunak et al., 2019; Hupkes et al., 2020). A classic example of systematicity comes from Szabó (2012): someone who understands "brown dog" and "black cat" also understands "brown cat".

4.1.1 Experiments

In natural data, the number of potential recombinations to consider is infinite. We chose to focus on recombinations in two sentence-level context-free rules: S → NP VP and S → S CONJ S.
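The two rules can be operationalised as minimal-pair generators: hold one part of the sentence fixed, swap the other, and compare the translations of the fixed part. The sketch below is our own simplification under assumed whitespace-tokenised string representations (the function names are hypothetical; the actual templates and consistency rules are those described in §4.1 and the appendices):

```python
def swap_noun(sentence, old, new):
    """NP -> NP' (or VP -> VP'): replace one noun, keeping the rest fixed."""
    return sentence.replace(old, new, 1)

def conjoin(s1, s2, conj="and"):
    """S -> S CONJ S: join two template sentences with a conjunction."""
    return f"{s1.rstrip(' .')} {conj} {s2}"

def n_token_diffs(t1, t2):
    """Number of differing tokens between two whitespace-tokenised strings."""
    a, b = t1.split(), t2.split()
    if len(a) != len(b):
        return max(len(a), len(b))   # crude: a length change counts fully
    return sum(x != y for x, y in zip(a, b))

# A minimal pair for S -> NP VP: only the subject noun differs, so a
# maximally local model should change at most one word in its translation.
base = "the poet admires the king ."
variant = swap_noun(base, "poet", "farmer")
# A hypothetical check, given some translate() function:
# consistent = n_token_diffs(translate(base), translate(variant)) <= 1
```

The same difference count can be applied to the second conjunct alone in the S → S CONJ S setup, after stripping the first conjunct from the translation.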
Table 1: The synthetic and semi-natural templates, with the POS tags of the varied lexical items shown in blue, with the plurality as superscript and the subcategory as subscript. The OPUS-extracted NP and VP fragments are red.

(a) Synthetic templates:
1. The N_people V the N^sg_elite .
2. The N_people Adv V the N^sg_elite .
3. The N_people P the N^sg_vehicle V the N^sg_elite .
4. The N_people and the N_people V the N^sg_elite .
5. The N^sg_quantity of N^pl_people P the N^sg_vehicle V the N^sg_elite .
6. The N_people V that the N^pl_people V .
7. The N_people Adv V that the N^pl_people V .
8. The N_people V that the N^pl_people V Adv .
9. The N_people that V V the N^sg_elite .
10. The N_people that V Pro V the N^sg_elite .

(b) Semi-natural templates, with example instantiations:
1-3. The N_people VP_{1,2,3} . ("The men are gon na have to move off-camera .")
4-5. The N_people read(s) an article about NP_{1,2} . ("The man reads an article about the development of ascites in rats with liver cirrhosis .")
6-7. An article about NP_{3,4} is read by the N_people . ("An article about the criterion on price stability , which was 27 % , is read by the child .")
8-10. Did the N_people hear about NP_{5,6,7} ? ("Did the teacher hear about the march on Employment which happened here on Sunday ?")

Test design. In our first setup, S → NP VP, we consider recombinations of noun and verb phrases. We extract translations for all input sentences from the templates from §3.2, as well as versions of them in which we adapt either (1) the noun phrase (NP → NP') or (2) the verb phrase (VP → VP'). In (1), a noun from the NP in the subject position is replaced with a different noun while preserving number agreement with the VP. In (2), a noun in the VP is replaced. NP → NP' is applied to both synthetic and semi-natural data; VP → VP' only to synthetic data. We use 500 samples per template per condition per data type.

In our second setup, S → S CONJ S, we concatenate phrases using 'and', and we test whether the translation of the second sentence is dependent on the (independently sampled) first sentence. We concatenate two sentences (S1 and S2) from different templates, and we again consider two different conditions. First, in condition S1 → S1', we make a minimal change to S1, yielding S1', by changing the noun in its verb phrase. In S1 → S3, instead, we replace S1 with a sentence S3 that is sampled from a template different from S1. We compare the translation of S2 in all conditions. In preparing the data, the first conjunct is sampled from the synthetic data templates. The second conjunct is sampled from synthetic data, semi-natural data, or from natural sentences sampled from OPUS with similar lengths and word frequencies as the semi-natural inputs. We use 500 samples per template per condition per data type.

Evaluation. In artificial domains, systematicity is evaluated by leaving out combinations of 'known components' from the training data and using them for testing purposes. The necessary familiarity of the components (the fact that they are 'known') is ensured by high training accuracies, and systematicity is quantified by measuring the test set accuracy. If the training data is a natural corpus and the model is evaluated with a measure like BLEU, as in MT, this strategy is not available. We observe that being systematic requires being consistent in the interpretation assigned to a (sub)expression across contexts, both in artificial and natural domains. Here, we therefore focus on consistency rather than accuracy, allowing us to employ a model-driven approach that evaluates the model's systematicity as the consistency of the translations when presenting words or phrases in multiple contexts.

We measure consistency as the equality of two translations after accounting for anticipated changes. For instance, in the S → NP VP setup, two translations are consistent if they differ in one word only, after accounting for determiner changes in Dutch ('de' vs. 'het'). In the evaluation of S → S CONJ S, we remove the translation of the first conjunct based on the position of the conjunction in Dutch, and measure the consistency of the translations of the second conjunct.

4.1.2 Results

In Figure 1, we show the performance for the S → NP VP and S → S CONJ S setups, respectively, distinguishing between training dataset sizes, evaluation data types and templates.⁶

[Figure 1: Systematicity results for setup S → S CONJ S (a: S1 → S1', b: S1 → S3) and S → NP VP (c: NP → NP', d: VP → VP'). Consistency scores are shown per evaluation data type (x-axis), per training dataset size (colours: small, medium, full). Data points represent templates (◦) and means over templates (•).]

Firstly, we observe quite some variation across templates, which is not simply explained by sentence length – i.e. the shortest template is not necessarily the best performing one. The synthetic data uses the same lexical items across multiple templates, suggesting that the grammatical structure contributes to more or less compositional behaviour. The average performance for the natural data in S → S CONJ S closely resembles the performance on semi-natural data, suggesting that the increased degree of control did not severely impact the results obtained using this generated data.

Secondly, the different changes have varying effects. For S → NP VP, changing the NP has a larger impact than changing the VP. For S → S CONJ S, replacing the entire first conjunct with S3 has a larger impact than merely substituting a word in S1. Increasing the training dataset size results in increased consistency scores. This could be because the larger training set provides the model with more confident translations. Yet, increasing dataset sizes is a somewhat paradoxical solution to compositional generalisation: after all, in humans, compositionality is assumed to underlie the ability to generalise usage from very few examples (Lake et al., 2019).

Lastly, the consistency scores are quite low overall, suggesting that the model is prone to emitting a different translation of a (sub)sentence following small (unrelated) adaptations to the input. Moreover, it hardly seems to matter whether that change occurs in the sentence itself (S → NP VP) or in the other conjunct (S → S CONJ S), suggesting a lack of local processing.

⁶ Appendix B presents the results through tables.

4.2 Substitutivity

Under a local interpretation of the principle of compositionality, synonym substitutions should be meaning-preserving: substituting a constituent in a complex expression with a synonym should not alter the complex expression's meaning, or, in the case of MT, its translation. Even if one argued that the systematic recombinations in §4.1 warranted some alterations to the translation – e.g. due to the agreement that exists between nouns and verbs – surely synonyms should be given equal treatment by a compositional model. This test addresses that by performing the substitutivity test from Hupkes et al. (2020), which measures whether the outputs remain consistent after synonym substitution.

4.2.1 Experiments

In natural data, true synonyms are hard to find – some might even argue they do not exist. Here, we consider two source terms synonymous if they consistently translate into the same target term. To find such synonyms, we exploit the fact that OPUS contains texts both in British and American English. Therefore, it contains synonymous terms that are spelt differently – e.g. doughnut / donut – and synonymous terms with a very different form – e.g. aubergine / eggplant. We use 20 synonym pairs in total (see Figure 2b).

Test design. Per synonym pair, we select natural data from OPUS in which the terms appear and then perform synonym substitutions. Thus, each sample has two sentences, one with the British English term and one with the American English term. We also insert the synonyms into the synthetic and semi-natural data, using 500 samples per synonym pair per template, through subordinate clauses that modify a noun – e.g. "the king that eats the doughnut". In Appendix C, Table 6, we list all clauses used.

Evaluation. Like systematicity, we evaluate substitutivity using the consistency score, expressing whether the model translations for a sample are identical. We report both the full-sentence consistency and the consistency of the synonyms' translations only, excluding the context. Cases in which the model omits the synonym from the translation
are labelled as consistent if the rest of the translation is the same for both input sequences.

[Figure 2: (a) Average consistency scores of synonyms (•) for substitutivity per evaluation data type, for models trained on three training set sizes. Individual data points (◦) represent synonyms; the synonyms with the highest and lowest scores are annotated. (b) Consistency detailed per synonym, measured using full sentences (in dark blue) or the synonym's translation only (in green), averaged over training set sizes and data types. The 20 synonym pairs are: a(e|i)r(o)plane, alumin(i)um, aubergine / eggplant, do(ugh)nut, fl(a)utist, f(o)etus, football / soccer, holiday / vacation, ladybird / ladybug, m(o)ustache, postcode / zip code, p(y|a)jamas, sail(ing )boat, shopping trolley / shopping cart, sul(ph|f)ate, theat(re|er), tumo(u)r, veterinarian / veterinary surgeon, whisk(e)y, yog(h)urt.]

4.2.2 Results

In Figure 2a, we summarise consistency scores across synonyms, data types and training set sizes.⁷ We observe trends similar to the systematicity results, in that models trained on larger training sets perform better and the synthetic data yields more consistent translations than the (semi-)natural data.

The results are characterised by large variations across synonyms, for which we further detail the performance aggregated across experimental setups in Figure 2b. The three synonyms with noteworthily low scores – flautist, aubergine and ladybug – are among the least frequent synonyms (see Appendix C), which stresses the importance of frequency for the model to pick up on synonymy. Across the board, it is quite remarkable that for synonyms that have the same meaning (and, for some, are even spelt nearly identically) the translations are so inconsistent.

Can we attribute this to the translations of the synonyms? To investigate this, Figure 2b presents both the regular consistency and the consistency of the translation of the synonym ('synonym consistency'). The fact that the latter is much higher indicates that a substantial part of the inconsistencies is due to varying translations of the context rather than the synonym, stressing again the non-local processing of inputs by the models.

⁷ Appendix C provides the same results in tables.

4.3 Global compositionality

In our final test, we focus on exceptions to compositional rules. In natural language, typical exceptions that constitute a challenge for local compositionality are idioms. For instance, the idiom "raining cats and dogs" should be treated globally to arrive at its meaning of heavy rainfall. A local approach would yield an overly literal, nonsensical translation ("het regent katten en honden"). When a model's translation is too local, we follow Hupkes et al. (2020) in saying that it overgeneralises, or, in other words, applies a general rule to an expression that is an exception to this rule. Overgeneralisation indicates that a language learner has internalised the general rule (e.g. Penke, 2012).

4.3.1 Experiments

To find idioms in our corpus, we exploit the MAGPIE corpus (Haagsma et al., 2020). We select 20 English idioms for which an accurate Dutch translation differs from the literal translation. As the acquisition of idioms is dependent on their frequency in the corpus, we use idioms with at least 200 occurrences in OPUS based on exact matches, for which over 80% of the target translations do not contain a literal translation.

Test design. Per idiom, we extract natural sentences containing the idiom from OPUS. For the synthetic and semi-natural data types, we insert the idiom in 500 samples per idiom per template, by attaching a subordinate clause to a noun – e.g. "the king that said 'I knew the formula by heart'". The clauses used can be found in Appendix D, Table 7.

Evaluation. Per idiom, we assess how often a model overgeneralises and how often it translates
1.0 small medium full overgeneralisation – translation of the idiom. 3) Eventually, the model 0.8 0.6 starts to memorise the idiom’s translation. This is 0.4 in line with the results of Hupkes et al. (2020), who 0.2 created the artificial counterpart of exceptions to 0.0 rules, as well as earlier results presented in the past 1 40 80 120 160 1 10 20 30 40 50 1 10 20 30 epoch epoch epoch tense debate by – among others – Rumelhart and (a) Synthetic McClelland (1986). 1.0 Although the height of the overgeneralisation overgeneralisation 0.8 0.6 peak is similar across evaluation data types and 0.4 training set sizes, overgeneralisation is more promi- 0.2 nent in converged models trained on smaller 0.0 datasets than it is in models trained on the full 1 40 80 120 160 1 10 20 30 40 50 1 10 20 30 epoch epoch epoch corpus.9 In addition to training set size, the type (b) Semi-Natural of evaluation data used also matters, since there 1.0 is more overgeneralisation for synthetic and semi- overgeneralisation 0.8 natural data compared to natural data, stressing the 0.6 0.4 impact of the context in which an idiom is embed- 0.2 ded. The extreme case of a context unsupportive of 0.0 an idiomatic interpretation is a sequence of random 1 40 80 120 160 1 10 20 30 40 50 1 10 20 30 epoch epoch epoch words; to evaluate the hypothesis that this yields (c) Natural local translations, we surround the idioms with ten random words. The results (see Appendix D, Figure 3: Visualisation of overgeneralisation for id- Table 7) indicate that, indeed, when the context pro- ioms throughout training, for five model seeds and their means. Overgeneralisation occurs early on in training vides no support at all for a global interpretation, and precedes memorisation of idioms’ translations. the model provides a local translation for nearly all idioms. the idiom globally. 
To do so, we identify keywords In addition to variations between setups, there that indicate that a translation is translated locally is quite some variation in the overgeneralisation (literal) instead of globally (idiomatic). If the key- observed for individual idioms. The variation is words are copied to the model output or their literal partly explained by the fact that some idioms are translations are present, the translation is labelled less frequent compared to others since the number as an overgeneralised translation. For instance, for of exact matches in the training corpus correlates “by heart”, a literal translation is identified through significantly with the difference in overgenerali- the presence of “hart” (“heart”), whereas an ade- sation between the peak and overgeneralisation at quate paraphrase would say “uit het hoofd” (“from convergence,10 suggesting that frequent idioms are the head”). Visit Appendix D, Table 7, for the full more likely to be memorised by the models. list of keywords. We evaluate overgeneralisation 5 Related Work for ten checkpoints throughout training. In this work, we considered compositional general- 4.3.2 Results isation in neural network models. In previous work, In Figure 3, we visualise the results per data type, a variety of artificial tasks have been proposed to again for three training dataset sizes.8 For all eval- evaluate compositional generalisation using non- uation data types and all training set sizes three i.i.d. test sets that are designed to assess a specific phases can be identified: 1) Initially the transla- characteristic of compositional behaviour. Exam- tions do not contain the idiom’s keyword; not be- ples are systematicity (Hupkes et al., 2020; Lake cause the idiom’s meaning is paraphrased in the and Baroni, 2018), substitutivity (Mul and Zuidema, translation, but because the translations consist of 9 high-frequency words in the target language only. 
Convergence is based on the BLEU scores for validation data. When training models for longer, this could further 2) Afterwards, overgeneralisation peaks: the model change the overgeneralisation observed. emits a very literal – and potentially, compositional 10 Pearson’s r of .56 for synthetic data, Pearson’s r of .56 for semi-natural data and Pearson’s r of .53 for natural data, 8 Appendix D further details numerical results per idiom. p < .0001.
2019; Hupkes et al., 2020; Saphra and Lopez, 2020), productivity (Lake and Baroni, 2018) or overgeneralisation (Hupkes et al., 2020; Korrel et al., 2019). Generally, neural models struggle to generalise in such evaluation setups, although data augmentation (Andreas, 2020) and modelling techniques (Lake, 2019) have been shown to improve performance.

There are also studies that consider compositional generalisation on more natural data for the tasks of semantic parsing and MT, although the corpora used still represent a small and controlled subset of natural language. Finegan-Dollak et al. (2018) release eight text-to-SQL datasets, along with non-i.i.d. test sets. Keysers et al. (2019) apply automated rule-based dataset generation, yielding their CFQ dataset, for which test sets with maximum compound divergence are used to measure compositional generalisation. Kim and Linzen (2020) present a PCFG-generated semantic parsing task with test sets for specific kinds of lexical and structural generalisation for fragments of English. Lake and Baroni (2018) measure few-shot generalisation for a toy NMT task for English-French that uses sentence pairs generated according to templates. Raunak et al. (2019) conducted additional experiments using that toy task to test for the systematicity of NMT models. More recently, Li et al. (2021) presented CoGnition, a benchmark dataset for testing compositional generalisation in English-Chinese MT, in which models are trained on data that only includes short sentences from a small vocabulary, excluding any problematic constructions that contribute to the complexity of natural language for compositional generalisation, such as polysemous words or metaphorical language.

To the best of our knowledge, the only attempt to explicitly measure compositional generalisation of NMT models trained on large natural MT corpora is the study presented by Raunak et al. (2019). They measure productivity – generalisation to longer sentence lengths – of an LSTM-based NMT model trained on a full-size, natural MT dataset.

6 Discussion

Whether neural networks can generalise compositionally is often studied using artificial tasks that assume strictly local interpretations of compositionality. In this paper, we argued that such interpretations exclude large parts of language and that, to move towards human-like productive usage of language, tests are needed that can assess how compositional models trained on natural data are. We laid out reformulations of three compositional generalisation tests – systematicity, substitutivity and overgeneralisation – for NMT models trained on natural corpora, and we focus on the level of compositionality that models exhibit. Aside from providing an empirical contribution, our work also highlights vital hurdles to overcome when considering what it means for models of natural language to be compositional. Below, we reflect on these hurdles as well as our results.

6.1 The proxy-to-meaning problem

Compositionality is a property of the mapping between the form and meaning of an expression. As translation is a meaning-preserving mapping from form in one language to form in another, it is an attractive task for evaluating compositionality: the translation of a sentence can be seen as a proxy to its meaning. However, while expressions are assumed to have only one meaning, translation is a many-to-many mapping: the same sentence can have multiple correct translations. This does not only complicate evaluation – MT systems are typically evaluated with BLEU, because accuracy is not a suitable option – it also raises questions about how compositional the desired behaviour of an MT model should be. On the one hand, one could argue that for optimal generalisation, robustness, and accountability we would like models to behave systematically and consistently: we expect the translation of an expression not to depend on unrelated contextual changes that do not affect its meaning (e.g. swapping out a synonym in a nearby sentence). In fact, consistency is the main metric that we use in most of our tests. However, this does imply that if a model changes its translation in a way that does not alter the correctness of the translation, this is considered incorrect. Does changing a translation from "atleet" ("athlete") to "sporter" ("sportsman") really matter, even if this change happened after changing an unrelated word somewhat far away? The answer to that question might depend on the domain, the quantity of data available for training, and how important the faithfulness of the translation is.

6.2 The locality problem

Almost inextricably linked to the proxy-to-meaning problem is the locality problem. In virtually all of our tests we see that small, local source changes
elicit global changes in translations. For instance, in our systematicity tests, changing one noun in a sentence elicited changes in the translation of an independently sampled sentence that it was conjoined with. In our substitutivity test, even synonyms that merely differed in spelling (e.g. "doughnut" and "donut") elicited changes to the remainder of the sentence. This counters the idea of compositionality as a means of productively reusing language: if the translation of a phrase depends on (unrelated) context that is not even in its direct vicinity, this suggests that more evidence is required to acquire the translation of this phrase.

In our tests using synthetic data, we explicitly consider sentences in which maximally local behaviour is possible and argue that it is therefore also desirable. Our experiments show that models are currently not capable of translating in such a local fashion and indicate that for some words or phrases, the translation constantly changes, as if a coin is tossed every time we slightly adapt the input. On the one hand, this volatility (see also Fadaee and Monz, 2020) might be essential for coping with ambiguities for which meanings are context dependent. For instance, when substituting "doughnut" with "donut" in "doughnut and chips", the "chips" are likely to be crisps instead of fries. On the other hand, this erratic behaviour highlights a lack of default reasoning, which can be problematic or even harmful in several different ways, especially if faithfulness (Parthasarathi et al., 2021) or consistency is important (cf., in a different domain: an insignificant change in a prompt to a large language model like GPT-3 can elicit a large change in response).

In linguistics, many compositionality definitions have been proposed that allow considering 'problem cases' such as word sense ambiguities and idiomatic expressions to be part of a compositional language (Pagin and Westerståhl, 2010). However, in such formalisations, non-local behaviour is used to deal with non-local phenomena, while other parts of language are still treated locally; in our models, global behaviour appears in situations where a local treatment would be perfectly suitable and where there is no clear evidence for ambiguity. We follow Baggio (2021) in suggesting that we should learn from strategies employed by humans, who can assign compositional interpretations to expressions, treating them as novel language, but can for some inputs also derive non-compositional meanings. We believe that it is important to continue to investigate how models can represent both these types of inputs, providing a locally compositional treatment when possible but deviating from that when necessary.

6.3 Conclusion

In conclusion, with this work we contribute to the question of how compositional models trained on natural data are, and we argue that MT is a suitable and relevant testing ground for this question. Focusing on the balance between local and global forms of compositionality, we formulate three different compositionality tests and discuss the issues and considerations that come up when considering compositionality in the context of natural data. Our tests indicate a lack of local processing for most of our models, stressing the need for new evaluation measures that capture the level of compositionality beyond mere consistency.

Acknowledgements

We thank Sebastian Riedel, Douwe Kiela, Thomas Wolf, Khalil Sima'an, Marzieh Fadaee, Marco Baroni and in particular Brenden Lake and Adina Williams for providing feedback on this draft and on our work in several different stages. We thank Michiel van der Meer for contributing to the initial experiments that led to this paper. VD is supported by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh.

References

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566.
Giosuè Baggio. 2021. Compositionality in a parallel architecture for language processing. Cognitive Science, 45(5):e12949.
Rahma Chaabouni, Roberto Dessì, and Eugene Kharitonov. 2021. Can transformers jump around right in natural language? Assessing performance transfer from SCAN. CoRR, abs/2107.01366.
Marzieh Fadaee and Christof Monz. 2020. The unreasonable volatility of neural machine translation models. In Proceedings of the Fourth Workshop on Neural Generation and Translation, NGT@ACL 2020, pages 88–96. Association for Computational Linguistics.
Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360.
Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.
Eduardo García-Ramírez. 2019. Open Compositionality: Toward a New Methodology of Language. Rowman & Littlefield.
Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzman, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR, abs/2106.03193.
Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A large corpus of potentially idiomatic expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287.
Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2020. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795.
Pauline Jacobson. 2002. The (dis)organization of the grammar: 25 years. Linguistics and Philosophy, 25(5/6):601–626.
Theo M.V. Janssen. 1998. Algebraic translations, correctness and algebraic compiler construction. Theoretical Computer Science, 199(1-2):25–56.
Theo M.V. Janssen and Barbara H. Partee. 1997. Compositionality. In Handbook of Logic and Language, pages 417–473. Elsevier.
Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing neural language models: Contextual decomposition reveals default reasoning in number and gender assignment. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 1–11.
Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, et al. 2019. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.
Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105.
Kris Korrel, Dieuwke Hupkes, Verna Dankers, and Elia Bruni. 2019. Transcoding compositionally: Using attention to find more generalizable solutions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 1–11.
Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882. PMLR.
Brenden M. Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9791–9801.
Brenden M. Lake, Tal Linzen, and Marco Baroni. 2019. Human few-shot learning of compositional instructions. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society, CogSci 2019, pages 611–617. cognitivesciencesociety.org.
Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20.
Yafu Li, Yongjing Yin, Yulong Chen, and Yue Zhang. 2021. On compositional generalization of neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4767–4780.
Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, and Anupam Datta. 2020. Influence paths for characterizing subject-verb number agreement in LSTM language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4748–4757.
Gary F. Marcus. 2003. The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press.
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.
Mathijs Mul and Willem Zuidema. 2019. Siamese recurrent networks learn first-order logic reasoning and exhibit zero-shot compositional generalization. CoRR, abs/1906.00180.
Ryan M. Nefdt. 2020. A puzzle concerning compositionality in machines. Minds & Machines, 30(1).
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.
Peter Pagin and Dag Westerståhl. 2010. Compositionality II: Arguments and problems. Philosophy Compass, 5(3):265–282.
Barbara Partee. 1984. Compositionality. Varieties of Formal Semantics, 3:281–311.
Prasanna Parthasarathi, Koustuv Sinha, Joelle Pineau, and Adina Williams. 2021. Sometimes we want translationese. CoRR, abs/2104.07623.
Ellie Pavlick and Chris Callison-Burch. 2016. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173.
Martina Penke. 2012. The dual-mechanism debate. In The Oxford Handbook of Compositionality.
Vikas Raunak, Vaibhav Kumar, Florian Metze, and Jaimie Callan. 2019. On compositionality in neural machine translation. In NeurIPS 2019 Context and Compositionality in Biological and Artificial Neural Systems Workshop.
D. E. Rumelhart and J. McClelland. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 216–271. MIT Press, Cambridge, MA.
Naomi Saphra and Adam Lopez. 2020. LSTMs compose – and learn – bottom-up. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2797–2809, Online.
Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216.
Zoltan Szabó. 2012. The case for compositionality. The Oxford Handbook of Compositionality, 64:80.
Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online. Association for Computational Linguistics.
Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480.
Andreas van Cranenburgh, Remko Scha, and Rens Bod. 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling, 4(1):57–111.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Appendix A  Semi-Natural Templates

The semi-natural data that we use in our test sets is generated with the library DiscoDOP,11 developed for data-oriented parsing (Van Cranenburgh et al., 2016). We generate the data with the following seven-step process:

Step 1. Sample 100k English OPUS sentences.
Step 2. Generate a treebank using the disco-dop library and the discodop parser en ptb command. The library was developed for discontinuous data-oriented parsing; use the library's --fmt bracket option to turn off discontinuous parsing.
Step 3. Compute tree fragments from the resulting treebank (discodop fragments). These tree fragments are the building blocks of a Tree-Substitution Grammar.
Step 4. We assume the most frequent fragments to be common syntactic structures in English. To construct complex test sentences, we collect the 100 most frequent fragments containing at least 15 non-terminal nodes for NPs and VPs.
Step 5. Select three VP and five NP fragments to be used in our final semi-natural templates. These structures are selected through qualitative analysis for their diversity.
Step 6. Extract sentences matching the eight fragments (discodop treesearch).
Step 7. Create semi-natural sentences by varying one lexical item and varying the matching NPs and VPs retrieved in Step 6.

In Table 2, we provide examples for each of the ten templates used, along with the internal structure of the complex NP or VP that is varied in the template. In Table 3, we provide some additional examples for our ten synthetic templates.

n   Template
1   The N_people (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))
    E.g. The woman wants to use the Internet as a means of communication .
2   The N_people (VP (VBP ) (VP (VBG ) (S (VP (TO ) (VP (VB ) (S (VP (TO ) (VP )))))))))
    E.g. The men are gon na have to move off-camera .
3   The N_people (VP (VB ) (NP (NP ) (PP (IN ) (NP ))) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))
    E.g. The doctors retain 10 % of these amounts by way of collection costs .
4   The N_people reads an article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))
    E.g. The friend reads an article about the development of ascites in rats with liver cirrhosis .
5   The N_people reads an article about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) .
    E.g. The teachers read an article about the degree of progress that can be achieved by the industry .
6   An article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) is read by the N_people .
    E.g. An article about the inland transport of dangerous goods from a variety of Member States is read by the lawyer .
7   An article about (NP (NP ) (PP (IN ) (NP (NP ) (, ,) (SBAR (S (WHNP (WDT )) (VP )))))) , is read by the N_people .
    E.g. An article about the criterion on price stability , which was 27 % , is read by the child .
8   Did the N_people hear about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) ?
    E.g. Did the friend hear about an inhospitable fringe of land on the shores of the Dead Sea ?
9   Did the N_people hear about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) ?
    E.g. Did the teacher hear about the march on Employment which happened here on Sunday ?
10  Did the N_people hear about (NP (NP ) (SBAR (S (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP )))))))) ?
    E.g. Did the lawyers hear about a qualification procedure to examine the suitability of the applicants ?

Table 2: Semi-natural data templates along with their identifiers (n). The syntactic structures for noun and verb phrases (shown in purple in the original paper) are instantiated with data from the OPUS collection. Generated data from every template contains varying sentence structures and varying tokens, but the predefined tokens remain the same.

11 github.com/andreasvc/disco-dop
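Step 7's combination logic can be illustrated with a short sketch. This is not the authors' code: the frame, lexical items, and extracted phrases below are hypothetical stand-ins (the actual templates are those in Table 2); only the way a varied lexical item and retrieved NPs are combined is shown.

```python
# Illustrative sketch of Step 7: combine a fixed template frame with a
# varied lexical item and the NPs retrieved in Step 6. All data here is
# a toy stand-in for the paper's OPUS-derived material.
from itertools import product

people_nouns = ["woman", "teacher", "lawyer"]  # the varied lexical item
extracted_nps = [  # stand-ins for NPs matched by discodop treesearch
    "the development of ascites in rats with liver cirrhosis",
    "the inland transport of dangerous goods",
]

def instantiate(frame, nouns, nps):
    """Fill every (noun, NP) combination into a frame such as
    'The {noun} reads an article about {np} .' (cf. template 4)."""
    return [frame.format(noun=n, np=np) for n, np in product(nouns, nps)]

sentences = instantiate("The {noun} reads an article about {np} .",
                        people_nouns, extracted_nps)
# 3 nouns x 2 NPs = 6 semi-natural sentences
```

Every template thus yields a cross-product of lexical variants over a fixed syntactic frame, which is what makes the minimal source changes used in the tests possible.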
n   Template
1   The N_people V_transitive the N^sl_elite .
    E.g. The poet criticises the king .
2   The N_people Adv V_transitive the N^sl_elite .
    E.g. The victim carefully observes the queen .
3   The N_people P the N^sl_vehicle V_transitive the N^sl_elite .
    E.g. The athlete near the bike observes the leader .
4   The N_people and the N_people V^pl_transitive the N^sl_elite .
    E.g. The poet and the child understand the mayor .
5   The N^sl_quantity of N^pl_people P the N^sl_vehicle V^sl_transitive the N^sl_elite .
    E.g. The group of friends beside the bike forgets the queen .
6   The N_people V_transitive that the N^pl_people V^pl_intransitive .
    E.g. The farmer sees that the lawyers cry .
7   The N_people Adv V_transitive that the N^pl_people V^pl_intransitive .
    E.g. The mother probably thinks that the fathers scream .
8   The N_people V_transitive that the N^pl_people V^pl_intransitive Adv .
    E.g. The mother thinks that the fathers scream carefully .
9   The N_people that V_intransitive V_transitive the N^sl_elite .
    E.g. The poets that sleep understand the queen .
10  The N_people that V_transitive Pro V^sl_transitive the N^sl_elite .
    E.g. The mother that criticises him recognises the queen .

Table 3: Artificial sentence templates similar to Lakretz et al. (2019), along with their identifiers (n).

Appendix B  Systematicity

Table 4 provides the numerical counterparts of the results visualised in Figure 1.
(a) Per models' training set size:

Data           Condition   small   medium   full
S → NP VP
synthetic      NP          .73     .84      .84
synthetic      VP          .76     .87      .88
semi-natural   NP          .63     .66      .64
S → S CONJ S
synthetic      S01         .81     .90      .92
synthetic      S3          .53     .76      .82
semi-natural   S01         .63     .71      .73
semi-natural   S3          .28     .46      .47
natural        S01         .58     .67      .72
natural        S3          .25     .39      .47

(b) Per template (1-10):

Data           Condition   1    2    3    4    5    6    7    8    9    10
S → NP VP
synthetic      NP          .86  .74  .85  .87  .75  .89  .85  .85  .70  .68
synthetic      VP          .92  .73  .90  .91  .84  .88  .85  .82  .77  .74
semi-natural   NP          .66  .63  .65  .70  .64  .69  .63  .63  .60  .58
S → S CONJ S
synthetic      S01         .91  .82  .88  .88  .86  .95  .90  .91  .84  .79
synthetic      S3          .75  .54  .72  .66  .73  .88  .74  .81  .66  .55
semi-natural   S01         .73  .75  .75  .80  .75  .73  .66  .60  .59  .56
semi-natural   S3          .50  .50  .51  .58  .52  .43  .35  .23  .23  .21
natural        S01         .67  .74  .65  .64  .63  .64  .62  .66  .63  .66
natural        S3          .39  .49  .35  .35  .34  .37  .33  .38  .34  .38

Table 4: Consistency scores for the systematicity experiments, detailed per experimental setup and evaluation data type. We provide scores (a) per models' training set size, and (b) per template of our generated evaluation data. For natural data, the template number is meaningless, apart from the fact that it determines sentence length and word frequency.

Appendix C  Substitutivity

Synonyms employed  In Table 5, we provide some information about the synonymous word pairs used in the substitutivity test, including their frequency in OPUS and their most common Dutch translation. The last column of the table contains the subordinate clauses that we used to include the synonyms in the synthetic and semi-natural data. We include them as a relative clause behind nouns representing a human, such as "The poet criticises the king that eats the doughnut".

Results  In the main paper, Figures 2a and 2b provided the consistency scores for the substitutivity tests. Here, Table 6 further details the results from the figure, by presenting the average consistency per evaluation data type and training set size, and per evaluation data type and synonym pair.
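The consistency scores reported in these tables can be sketched as the fraction of minimally different source pairs whose translations agree everywhere except at the edited position. This is a hypothetical simplification, not the paper's exact implementation: the `translate` argument stands in for any NMT model, and the one-word-difference criterion is a crude proxy for "only the edited fragment changed".

```python
# Hypothetical sketch of a consistency metric: a source pair counts as
# consistent when its two translations differ in at most one word
# position (i.e. only where the source itself was edited).

def consistency(translate, sentence_pairs):
    """sentence_pairs: list of (original, minimally changed) sources."""
    consistent = 0
    for original, changed in sentence_pairs:
        t1 = translate(original).split()
        t2 = translate(changed).split()
        # Count word positions where the translations disagree, counting
        # any length difference as additional disagreements.
        diffs = sum(a != b for a, b in zip(t1, t2)) + abs(len(t1) - len(t2))
        if diffs <= 1:
            consistent += 1
    return consistent / len(sentence_pairs)
```

With a toy lookup translator, a pair like "the cat sleeps" / "the dog sleeps" mapping to "de kat slaapt" / "de hond slaapt" counts as consistent, because only the word corresponding to the source edit changes.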
British             Freq.    American        Freq.    Dutch transl.       Subordinate clause
aeroplane           6728     airplane        5403     vliegtuig           that travels by . . .
aluminium           17982    aluminum        5700     aluminium           that sells . . .
doughnut            2014     donut           1889     donut               that eats the . . .
foetus              1943     fetus           1878     foetus              that researches the . . .
flautist            112      flutist         101      fluitist            that knows the . . .
moustache           1132     mustache        1639     snor                that has a . . .
tumour              7338     tumor           6348     tumor               that has a . . .
pyjamas             808      pajamas         1106     pyjama              that wears . . .
sulphate            3776     sulfate         1143     zwavel              that sells . . .
yoghurt             1467     yogurt          2070     yoghurt             that eats the . . .
aubergine           765      eggplant        762      aubergine           that eats the . . .
shopping trolley    217      shopping cart   13366    winkelwagen         that uses a . . .
veterinary surgeon  941      veterinarian    6995     dierenarts          that knows the . . .
sailing boat        5097     sailboat        1977     zeilboot            that owns a . . .
football            33125    soccer          6841     voetbal             that plays . . .
holiday             125430   vacation        23532    vakantie            that enjoys the . . .
ladybird            235      ladybug         303      lieveheersbeestje   that caught a . . .
theatre             19451    theater         13508    theater             that loves . . .
postcode            479      zip code        1392     postcode            with the same . . .
whisky              3604     whiskey         4313     whisky              that drinks . . .

Table 5: Synonyms for the substitutivity test, along with their OPUS frequency, Dutch translation, and the subordinate clause used to insert them in the data.

(a) Per models' training set size:

Data           Metric                 small   medium   full
synthetic      consistency            .49     .67      .76
               synonym consistency    .67     .83      .92
semi-natural   consistency            .34     .55      .62
               synonym consistency    .63     .84      .93
natural        consistency            .36     .51      .62
               synonym consistency    .65     .77      .87

(b) Per synonym (columns give consistency / synonym consistency per evaluation data type):

Synonym pair         synthetic     semi-natural   natural
veterinary surgeon   .54 / 1.0     .43 / 1.0      .5  / .94
shopping trolley     .87 / 1.0     .59 / 1.0      .52 / .86
sailing boat         .74 / .87     .58 / .83      .53 / .74
aluminium            .82 / 1.0     .54 / 1.0      .56 / .99
moustache            .1  / .1      .08 / .1       .09 / .12
aubergine            .92 / 1.0     .85 / 1.0      .75 / .88
aeroplane            .78 / 1.0     .52 / .99      .5  / .95
doughnut             .64 / .71     .55 / .67      .6  / .77
postcode             .79 / .95     .56 / .93      .47 / .89
sulphate             .55 / 1.0     .42 / .98      .57 / .88
ladybird             .25 / .38     .24 / .4       .23 / .33
pyjamas              .4  / .59     .31 / .57      .7  / .92
yoghurt              .64 / .84     .33 / .76      .29 / .71
football             .73 / 1.0     .73 / 1.0      .64 / .92
holiday              .68 / .75     .66 / .9       .47 / .77
tumour               .81 / 1.0     .71 / 1.0      .62 / .89
whisky               .27 / .4      .2  / .38      .17 / .27
flautist             .85 / 1.0     .62 / 1.0      .59 / .81
theatre              .48 / .53     .43 / .58      .61 / .85
foetus               .88 / 1.0     .75 / .99      .58 / .79

Table 6: Consistency scores for the substitutivity experiments, detailed per evaluation data type. We present scores (a) per models' training set size and (b) per synonym.
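The two metrics in Table 6 can be sketched as follows. This is an illustrative stand-in with a toy translator, not the paper's implementation: "consistency" requires the two full translations to match exactly, while "synonym consistency" only requires the synonym's Dutch translation (as in Table 5) to appear in both outputs.

```python
# Sketch of the substitutivity comparison: swap a British spelling for
# its American variant and compare the two translations. All names and
# data are illustrative assumptions.

def substitutivity_scores(translate, items):
    """items: (source sentence, British term, American term, Dutch term)."""
    full = syn = 0
    for src, brit, amer, dutch in items:
        t1 = translate(src)
        t2 = translate(src.replace(brit, amer))
        full += (t1 == t2)                      # full-sentence consistency
        syn += (dutch in t1) and (dutch in t2)  # synonym consistency
    n = len(items)
    return full / n, syn / n
```

This makes the gap between the two rows of Table 6 concrete: substituting "doughnut" with "donut" may perturb the rest of the translation (lowering consistency) while the synonym itself is still translated identically in both outputs (keeping synonym consistency high).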