The paradox of the compositionality of natural language: a neural machine translation case study


Verna Dankers (ILCC, University of Edinburgh), vernadankers@gmail.com
Elia Bruni (University of Osnabrück), elia.bruni@gmail.com
Dieuwke Hupkes (Facebook AI Research), dieuwkehupkes@fb.com

arXiv:2108.05885v1 [cs.CL] 12 Aug 2021

Abstract

Moving towards human-like linguistic performance is often argued to require compositional generalisation. Whether neural networks exhibit this ability is typically studied using artificial languages, for which the compositionality of input fragments can be guaranteed and their meanings algebraically composed. However, compositionality in natural language is vastly more complex than this rigid, arithmetics-like version of compositionality, and as such artificial compositionality tests do not allow us to draw conclusions about how neural models deal with compositionality in more realistic scenarios. In this work, we re-instantiate three compositionality tests from the literature and reformulate them for neural machine translation (NMT). The results highlight two main issues: the inconsistent behaviour of NMT models and their inability to (correctly) modulate between local and global processing. Aside from an empirical study, our work is a call to action: we should rethink the evaluation of compositionality in neural networks of natural language, where composing meaning is not as straightforward as doing the math.

1 Introduction

Although the successes of deep neural networks in natural language processing (NLP) are astounding and undeniable, they are still regularly criticised for lacking the powerful generalisation capacities that characterise human intelligence. A frequently mentioned concept in such critiques is compositionality: the ability to build up the meaning of a complex expression by combining the meanings of its parts (e.g. Partee, 1984). Compositionality is assumed to play an essential role in how humans understand language, but whether modern neural networks also exhibit this property has since long been a topic of vivid debate (e.g. Fodor and Pylyshyn, 1988; Smolensky, 1990; Marcus, 2003; Nefdt, 2020).

Studies about the compositional abilities of neural networks consider almost exclusively models trained on artificial datasets, in which compositionality can be ensured and isolated (e.g. Lake and Baroni, 2018; Hupkes et al., 2020).[1] In such tests, the interpretation of expressions is computed completely locally: every subpart is evaluated independently – without taking into account any external context – and the meaning of the whole expression is then formed by combining the meaning of its parts in a bottom-up fashion. This protocol matches the type of compositionality observed in arithmetics: the meaning of (3 + 5) is always 8, independent from the context it occurs in.

[1] With the exception of Raunak et al. (2019), work on compositional generalisation in the context of language considers highly structured subsets of natural language (e.g. Kim and Linzen, 2020; Keysers et al., 2019) or focuses on tendencies of neural networks to learn shortcuts (e.g. McCoy et al., 2019).

However, as exemplified by the sub-par performance of symbolic models that allow only such strict protocols, compositionality in natural domains is far more intricate than this rigid, arithmetics-like variant of compositionality. Natural language seems very compositional, but at the same time, it is riddled with cases that are difficult to interpret with a strictly local interpretation of compositionality. Sometimes, the meaning of an expression does not derive from its parts (e.g. idioms), but the parts themselves are used compositionally in other contexts. Other times, the meaning of an expression does depend on its parts in a compositional way, but arriving at this meaning requires a more global approach, in which the meaning of a part is disambiguated using information from somewhere else in the sentence (e.g. homonyms, scope ambiguities). Successfully modelling language requires balancing such local and global forms of (non-)compositionality, which makes evaluating compositionality in state-of-the-art models 'in the wild' a complicated endeavour.

In this work, we face this challenge head-on. We concentrate on the domain of neural machine translation (NMT), which is paradigmatically close to
the sequence-to-sequence tasks typically considered for compositionality tests, where the target sequences are assumed to represent the 'meaning' of the input sequences.[2] Furthermore, MT is an important domain of NLP, for which compositional generalisation is suggested to be important to produce more robust translations and train adequate models for low-resource languages (see, e.g. Chaabouni et al., 2021). As an added advantage, compositionality is traditionally well studied and motivated for MT (Janssen and Partee, 1997; Janssen, 1998).

[2] In SCAN (Lake and Baroni, 2018), for instance, the input is an instruction (walk twice) and the intended output represents its execution (walk walk).

We reformulate three theoretically grounded tests from Hupkes et al. (2020): systematicity, substitutivity and overgeneralisation. As accuracy – commonly used in artificial compositionality tests – is not a suitable evaluation metric for MT, we base our evaluations on the extent to which models behave consistently, rather than correctly. In our tests for systematicity and substitutivity, we consider whether processing is maximally local; in our overgeneralisation test, we consider how models treat idioms that are assumed to require global processing. Our results indicate that consistent behaviour is currently not achieved and that many inconsistencies are due to models not achieving the right level of processing (neither local nor global). The actual level of processing, however, is captured by neither consistency nor accuracy measures.

With our study, we contribute to ongoing questions about the compositional abilities of neural networks, and we provide nuance to the nature of this question when natural language is concerned: how local should the compositionality of models for natural language be, and how does the type of compositionality required for MT, where translations are used as proxies to meaning, relate to the compositionality of natural language at large? Aside from an empirical study, our work is also a call to action: we should rethink the evaluation of compositionality in neural networks trained on natural language, where composing meaning is not as straightforward as doing the math.

2 Local and global compositionality

Tests for compositional generalisation in neural networks typically assume an arithmetic-like version of compositionality, in which meanings can be computed in a completely bottom-up fashion. The compositions thus require only local information – they are context independent and unambiguous: (2 + 1) × (4 − 5) is evaluated in a manner similar to walk twice after jump thrice (a fragment from SCAN by Lake and Baroni, 2018). In MT, this type of compositionality would imply that a change in a word or phrase should affect only the translation of that particular word or phrase, or at most the smallest constituent it is a part of. For instance, the translation of the phrase the girl should not change depending on the verb phrase that follows it, and in the translation of a conjunction of two sentences, making a change in the first conjunct should not change the translation of the second. While translating in such a local fashion seems robust and productive, it is not always realistic. Consider, for instance, the translation of the polysemous word 'dates' in "She hated bananas and she liked dates".

In linguistics and philosophy of language, the level of compositionality is a widely discussed topic, which has resulted in a wide variety of compositionality definitions. One of the most well-known definitions is the one from Partee (1984):

    "The meaning of a compound expression is a function of the meanings of its parts and of the way they are syntactically combined."[3]

[3] The principle can straightforwardly be extended to translation, by replacing the word meaning with the word translation (Janssen and Partee, 1997; Janssen, 1998).

This definition places almost no restrictions on the relationship between an expression and its parts. The type of function that relates them is unspecified and could take into account the global syntactic structure or even external arguments, and also the meanings of the parts can depend on global information. Partee's version of compositionality is thus also called weak, global, or open compositionality (Szabó, 2012; García-Ramírez, 2019). When, instead, the meaning of a compound can depend only on the meaning or translation of its largest parts, regardless of their internal structure (the arithmetics-like variant of compositionality), this is called strong, local or closed compositionality (Szabó, 2012; Jacobson, 2002).
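To make the strong, local reading concrete: it corresponds to bottom-up evaluation over a tree, in which every node is interpreted from its children alone, exactly as in the arithmetic example above. The following minimal sketch is ours and purely illustrative; it is not part of the paper's experiments.

```python
# Sketch (ours) of strictly local, bottom-up composition, as in arithmetic:
# each node is interpreted from its children alone, never from outside context.

def evaluate(node):
    """Evaluate an expression tree such as (2 + 1) * (4 - 5) bottom-up."""
    if isinstance(node, int):            # leaf: its meaning is intrinsic
        return node
    op, left, right = node               # internal node: combine child meanings
    combine = {"+": lambda a, b: a + b,
               "-": lambda a, b: a - b,
               "*": lambda a, b: a * b}
    return combine[op](evaluate(left), evaluate(right))

# (2 + 1) * (4 - 5) always evaluates to -3, wherever the expression occurs;
# a SCAN-style command like 'walk twice after jump thrice' is interpreted
# analogously from the interpretations of its parts.
print(evaluate(("*", ("+", 2, 1), ("-", 4, 5))))   # -3
```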
This stricter interpretation of compositionality underlies the generalisation aimed for in previous compositionality tests (see also §5). Yet, it is not suitable for modelling natural language phenomena traditionally considered problematic for compositionality, such as quotation, belief sentences, ambiguities, idioms, noun-noun compounds, to name a few (Pagin and Westerståhl, 2010; Pavlick and Callison-Burch, 2016). Here, we aim to open up the discussion about what it means for computational models of natural language to be compositional, and, to that end, discuss properties that require composing meaning locally, and look at global compositionality through idioms.

3 Setup

First, we describe the models analysed and the data that form the basis of our tests.

3.1 Model and training

We focus on English-Dutch translation, for which we can ensure good command for both languages. We train Transformer-base models (Vaswani et al., 2017) using Fairseq (Ott et al., 2019).[4] Our training data consists of a collection of MT corpora bundled in OPUS (Tiedemann and Thottingal, 2020), of which we use the English-Dutch subset provided by Tiedemann (2020), containing 69M source-target pairs.[5] To examine the impact of the amount of training data – a dimension that is relevant because compositionality is hypothesised to be important particularly in low-resource settings – we train one setup using the full dataset, one using 1/8 of the data (medium), and one using one million source-target pairs in the small setup. For each setup, we train models with five seeds and average the results.

[4] Training details for Transformer base are available here.
[5] Visit the Tatoeba challenge for the OPUS training data.

To ensure the quality of our trained models, we adopt the FLORES-101 corpus (Goyal et al., 2021), which contains 3001 sentences from Wikinews, Wikijunior and WikiVoyage, translated by professional translators and split across three subsets. We train the models until convergence on the 'dev' set. Afterwards, we compute BLEU scores on the 'devtest' set, using beam search (beam size = 5), yielding scores of 20.5±.4, 24.3±.3 and 25.7±.1 for the small, medium and full datasets, respectively.

3.2 Evaluation data

For our compositionality tests, we use three different types of data – synthetic, semi-natural and natural – with varying degrees of control.

Synthetic data  For our synthetic test data, we take inspiration from literature on probing hierarchical structure in language models: we consider the synthetic data generated by Lakretz et al. (2019), which contains a large number of sentences with a fixed syntactic structure and diverse lexical material. Lakretz et al. (2019), and other authors afterwards (e.g. Jumelet et al., 2019; Lu et al., 2020), successfully used this data to assess number agreement in language models, validating the data as a reasonable resource to test neural models. We extend the set of templates in the dataset and the vocabulary used. For each of the resulting ten templates (see Table 1a), we generate 3000 sentences.

Semi-natural data  In the synthetic data, we have full control over the sentence structure and lexical items, but the sentences are shorter (9 tokens vs. 16 in OPUS) and simpler than typical in NMT data. To obtain more complex yet plausible test sentences, we employ a data-driven approach to generate semi-natural data. Using the tree substitution grammar Double DOP (Van Cranenburgh et al., 2016), we obtain noun and verb phrases (NP, VP) whose structures frequently occur in OPUS. We then embed these NPs and VPs in ten synthetic templates with 3000 samples each (see Table 1b). See Appendix A for details on the data generation.

Natural data  Lastly, we use natural data that we extract directly from OPUS. The extraction procedures are test-specific, and are provided in the subsections of the individual tests (§4).

4 Experiments and results

In our experiments, we consider the two properties of systematicity (§4.1) and substitutivity (§4.2), which require local meaning composition, as well as the translation of idioms, which requires a different (more global) type of processing (§4.3).

4.1 Systematicity

One of the most commonly tested properties of compositional generalisation is systematicity – the ability to understand novel combinations made up from known components (Lake and Baroni, 2018; Raunak et al., 2019; Hupkes et al., 2020). A classic example of systematicity comes from Szabó (2012): someone who understands "brown dog" and "black cat" also understands "brown cat".

4.1.1 Experiments

In natural data, the number of potential recombinations to consider is infinite. We chose to focus on recombinations in two sentence-level context-free rules: S → NP VP and S → S CONJ S.
(a) Synthetic templates

 n    Template
 1    The N_people V the N^sl_elite .
 2    The N_people Adv V the N^sl_elite .
 3    The N_people P the N^sl_vehicle V the N^sl_elite .
 4    The N_people and the N_people V the N^sl_elite .
 5    The N^sl_quantity of N^pl_people P the N^sl_vehicle V the N^sl_elite .
 6    The N_people V that the N_people V .
 7    The N_people Adv V that the N^pl_people V .
 8    The N_people V that the N^pl_people V Adv .
 9    The N_people that V V the N^sl_elite .
 10   The N_people that V Pro V the N^sl_elite .

(b) Semi-natural templates

 n        Template, with an example instantiation
 1,2,3    The N_people VP_{1,2,3} .
          "The men are gon na have to move off-camera ."
 4,5      The N_people read(s) an article about NP_{1,2} .
          "The man reads an article about the development of ascites in rats with liver cirrhosis ."
 6,7      An article about NP_{3,4} is read by N_people .
          "An article about the criterion on price stability , which was 27 % , is read by the child ."
 8,9,10   Did the N_people hear about NP_{5,6,7} ?
          "Did the teacher hear about the march on Employment which happened here on Sunday ?"

Table 1: The synthetic and semi-natural templates. POS tags of the lexical items that are varied are shown with the plurality as superscript and the subcategory as subscript; the OPUS-extracted NP and VP fragments appear in the quoted example instantiations.
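As an illustration of how such template data can be produced, the sketch below fills the lexical slots of a Table 1a-style template by sampling words per slot. The slot names mirror the table, but the toy lexicon is ours; the actual vocabulary and generation procedure are described in Appendix A.

```python
import random

# Toy lexicon (ours): each template slot maps to candidate words; the real
# vocabulary and slot inventory are described in Appendix A of the paper.
LEXICON = {
    "N_people": ["farmers", "women", "brothers"],
    "N_elite": ["senator", "manager", "author"],
    "V": ["avoid", "greet", "admire"],
}

def instantiate(template, n_samples, seed=0):
    """Fill every lexical slot of a template (e.g. 'The N_people V the N_elite .')
    with a randomly sampled word; non-slot tokens are copied unchanged."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        tokens = [rng.choice(LEXICON.get(tok, [tok])) for tok in template.split()]
        samples.append(" ".join(tokens))
    return samples

print(instantiate("The N_people V the N_elite .", n_samples=3))
```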

Test design  In our first setup, S → NP VP, we consider recombinations of noun and verb phrases. We extract translations for all input sentences from the templates from §3.2, as well as versions of them in which we adapt either (1) the noun (NP → NP') or (2) the verb phrase (VP → VP'). In (1), a noun from the NP in the subject position is replaced with a different noun while preserving number agreement with the VP. In (2), a noun in the VP is replaced. NP → NP' is applied to both synthetic and semi-natural data; VP → VP' only to synthetic data. We use 500 samples per template per condition per data type.

In our second setup, S → S CONJ S, we concatenate phrases using 'and', and we test whether the translation of the second sentence is dependent on the (independently sampled) first sentence. We concatenate two sentences (S1 and S2) from different templates, and we consider again two different conditions. First, in condition S1 → S1', we make a minimal change to S1, yielding S1', by changing the noun in its verb phrase. In S1 → S3, instead, we replace S1 with a sentence S3 that is sampled from a template different from S1. We compare the translation of S2 in all conditions. In preparing the data, the first conjunct is sampled from the synthetic data templates. The second conjunct is sampled from synthetic data, semi-natural data, or from natural sentences sampled from OPUS with similar lengths and word-frequencies as the semi-natural inputs. We use 500 samples per template per condition per data type.

Evaluation  In artificial domains, systematicity is evaluated by leaving out combinations of 'known components' from the training data and using them for testing purposes. The necessary familiarity of the components (the fact that they are 'known') is ensured by high training accuracies, and systematicity is quantified by measuring the test set accuracy. If the training data is a natural corpus and the model is evaluated with a measure like BLEU in MT, this strategy is not available. We observe that being systematic requires being consistent in the interpretation assigned to a (sub)expression across contexts, both in artificial and natural domains. Here, we therefore focus on consistency rather than accuracy, allowing us to employ a model-driven approach that evaluates the model's systematicity as the consistency of the translations when presenting words or phrases in multiple contexts.

We measure consistency as the equality of two translations after accounting for anticipated changes. For instance, in the S → NP VP setup, two translations are consistent if they differ in one word only, after accounting for determiner changes in Dutch ('de' vs. 'het'). In the evaluation of S → S CONJ S, we remove the translation of the first conjunct based on the position of the conjunction in Dutch, and measure the consistency of the translations of the second conjunct.

4.1.2 Results

In Figure 1, we show the performance for the S → NP VP and S → S CONJ S setups, respectively, distinguishing between training dataset sizes, evaluation data types and templates.[6]

[6] Appendix B presents the results through tables.

Firstly, we observe quite some variation across templates, which is not simply explained by sentence length – i.e. the shortest template is not necessarily the best performing one. The synthetic data uses the same lexical items across multiple templates, suggesting that the grammatical structure contributes to more or less compositional behaviour.
[Figure 1: consistency (y-axis) per evaluation data type (x-axis) and training set size, in four panels: (a) S1 → S1', (b) S1 → S3, (c) NP → NP', (d) VP → VP'.]

Figure 1: Systematicity results for setup S → S CONJ S (a and b) and S → NP VP (c and d). Consistency scores are shown per evaluation data type (x-axis), per training dataset size (colours). Data points represent templates (◦) and means over templates.
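The consistency measure underlying these plots can be sketched roughly as follows, assuming whitespace-tokenised model outputs; the determiner normalisation and conjunct splitting are simplified stand-ins for the procedure described in §4.1.1, and the example Dutch outputs are made up.

```python
def np_vp_consistent(translation_a, translation_b):
    """S -> NP VP setup: consistent if the two translations differ in at most one
    word, after neutralising the Dutch determiner alternation ('de' vs. 'het')."""
    normalise = lambda s: ["DET" if w in {"de", "het"} else w for w in s.split()]
    a, b = normalise(translation_a), normalise(translation_b)
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def conj_consistent(translation_a, translation_b, conj="en"):
    """S -> S CONJ S setup: strip everything up to and including the conjunction
    and require the translations of the second conjunct to be identical."""
    second = lambda s: s.split(f" {conj} ", 1)[-1]
    return second(translation_a) == second(translation_b)

# Made-up Dutch outputs for illustration:
print(np_vp_consistent("de boeren groeten de senator",
                       "de boeren groeten de auteur"))      # True
print(conj_consistent("de vrouw loopt en de koning eet",
                      "de man zingt en de koning eet"))     # True
```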

The average performance for the natural data in S → S CONJ S closely resembles the performance on semi-natural data, suggesting that the increased degree of control did not severely impact the results obtained using this generated data.

Secondly, the different changes have varying effects. For S → NP VP, changing the NP has a larger impact than changing the VP. For S → S CONJ S, replacing the entire first conjunct with S3 has a larger impact than merely substituting a word in S1. Increasing the training dataset size results in increased consistency scores. This could be because the larger training set provides the model with more confident translations. Yet, increasing dataset sizes is a somewhat paradoxical solution to compositional generalisation: after all, in humans, compositionality is assumed to underlie their ability to generalise usage from very few examples (Lake et al., 2019).

Lastly, the consistency scores are quite low overall, suggesting that the model is prone to emitting a different translation of a (sub)sentence following small (unrelated) adaptations to the input. Moreover, it hardly seems to matter whether that change occurs in the sentence itself (S → NP VP), or whether it occurs in the other conjunct (S → S CONJ S), suggesting a lack of local processing.

4.2 Substitutivity

Under a local interpretation of the principle of compositionality, synonym substitutions should be meaning-preserving: substituting a constituent in a complex expression with a synonym should not alter the complex expression's meaning, or, in the case of MT, its translation. Even if one argued that the systematic recombinations in §4.1 warranted some alterations to the translation – e.g. due to the agreement that exists between nouns and verbs – surely synonyms should be given equal treatment by a compositional model. This test addresses that by performing the substitutivity test from Hupkes et al. (2020), which measures whether the outputs remain consistent after synonym substitution.

4.2.1 Experiments

In natural data, true synonyms are hard to find – some might even argue they do not exist. Here, we consider two source terms synonymous if they consistently translate into the same target term. To find such synonyms, we exploit the fact that OPUS contains texts both in British and American English. Therefore, it contains synonymous terms that are spelt differently – e.g. doughnut / donut – and synonymous terms with a very different form – e.g. aubergine / eggplant. We use 20 synonym pairs in total (see Figure 2b).

Test design  Per synonym pair, we select natural data from OPUS in which the terms appear and then perform synonym substitutions. Thus, each sample has two sentences, one with the British English term and one with the American English term. We also insert the synonyms into the synthetic and semi-natural data, using 500 samples per synonym pair per template, through subordinate clauses that modify a noun – e.g. "the king that eats the doughnut". In Appendix C, Table 6, we list all clauses used.

Evaluation  Like systematicity, we evaluate substitutivity using the consistency score, expressing whether the model translations for a sample are identical. We report both the full-sentence consistency and the consistency of the synonyms' translations only, excluding the context. Cases in which the model omits the synonym from the translation are labelled as consistent if the rest of the translation is the same for both input sequences.
[Figure 2: consistency scores for the substitutivity test. Panel (a) plots consistency per evaluation data type (synthetic, semi-natural, natural) for the three training set sizes, with synonyms as data points; panel (b) details consistency per synonym. The 20 synonym pairs are: a(e|i)r(o)plane, alumin(i)um, aubergine / eggplant, do(ugh)nut, fl(a)utist, f(o)etus, football / soccer, holiday / vacation, ladybird / ladybug, m(o)ustache, postcode / zip code, p(y|a)jamas, sail(ing )boat, shopping trolley / shopping cart, sul(ph|f)ate, theat(re|er), tumo(u)r, veterinarian / veterinary surgeon, whisk(e)y, yog(h)urt.]

Figure 2: (a) Average consistency scores of synonyms for substitutivity per evaluation data type, for models trained on three training set sizes. Individual data points (◦) represent synonyms. We annotate synonyms with the highest and lowest scores. (b) Consistency detailed per synonym, measured using full sentences (in dark blue) or the synonym's translation only (in green), averaged over training set sizes and data types.
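The two consistency variants reported in Figure 2 can be sketched as follows; this is a simplified illustration, and the expected Dutch target term and the example outputs are our own assumptions rather than values from the paper.

```python
def full_consistency(trans_british, trans_american):
    """Full-sentence consistency: both model outputs are identical."""
    return trans_british == trans_american

def synonym_consistency(trans_british, trans_american, target_terms):
    """Synonym-only consistency: both outputs realise the synonym with the same
    target-language term (or both omit it), ignoring the surrounding context."""
    realise = lambda s: next((t for t in target_terms if t in s.split()), None)
    return realise(trans_british) == realise(trans_american)

# Illustrative outputs for doughnut / donut, assuming 'donut' as the Dutch term:
hyp_uk = "de koning die de donut eet is vertrokken"
hyp_us = "de koning die de donut eet is weggegaan"
print(full_consistency(hyp_uk, hyp_us))                 # False
print(synonym_consistency(hyp_uk, hyp_us, {"donut"}))   # True
```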

4.2.2 Results

In Figure 2a, we summarise consistency scores across synonyms, data types and training set sizes.[7] We observe trends similar to the systematicity results, considering that models trained on larger training sets perform better and that the synthetic data yields more consistent translations compared to (semi-)natural data.

[7] Appendix C provides the same results in tables.

The results are characterised by large variations across synonyms, for which we further detail the performance aggregated across experimental setups in Figure 2b. The three synonyms with noteworthily low scores – flautist, aubergine and ladybug – are among the least frequent synonyms (see Appendix C), which stresses the importance of frequency for the model to pick up on synonymy. Across the board, it is quite remarkable that for synonyms that have the same meaning (and, for some, are even spelt nearly identically) the translations are so inconsistent.

Can we attribute this to the translations of the synonyms? To investigate this, Figure 2b presents both the regular consistency and the consistency of the translation of the synonym ('synonym consistency'). The fact that the latter is much higher indicates that a substantial part of the inconsistencies are due to varying translations of the context rather than the synonym, stressing again the non-local processing of inputs by the models.

4.3 Global compositionality

In our final test, we focus on exceptions to compositional rules. In natural language, typical exceptions that constitute a challenge for local compositionality are idioms. For instance, the idiom "raining cats and dogs" should be treated globally to arrive at its meaning of heavy rainfall. A local approach would yield an overly literal, non-sensical translation ("het regent katten en honden"). When a model's translation is too local, we follow Hupkes et al. (2020) in saying that it overgeneralises, or, in other words, it applies a general rule to an expression that is an exception to this rule. Overgeneralisation indicates that a language learner has internalised the general rule (e.g. Penke, 2012).

4.3.1 Experiments

To find idioms in our corpus, we exploit the MAGPIE corpus (Haagsma et al., 2020). We select 20 English idioms for which an accurate Dutch translation differs from the literal translation. As acquisition of idioms is dependent on their frequency in the corpus, we use idioms with at least 200 occurrences in OPUS based on exact matches, for which over 80% of the target translations do not contain a literal translation.

Test design  Per idiom, we extract natural sentences containing the idiom from OPUS. For the synthetic and semi-natural data types, we insert the idiom in 500 samples per idiom per template, by attaching a subordinate clause to a noun – e.g. "the king that said 'I knew the formula by heart'". The clauses used can be found in Appendix D, Table 7.

Evaluation  Per idiom, we assess how often a model overgeneralises and how often it translates the idiom globally. To do so, we identify keywords that indicate that a translation is translated locally (literal) instead of globally (idiomatic). If the keywords are copied to the model output or their literal translations are present, the translation is labelled as an overgeneralised translation. For instance, for "by heart", a literal translation is identified through the presence of "hart" ("heart"), whereas an adequate paraphrase would say "uit het hoofd" ("from the head"). See Appendix D, Table 7, for the full list of keywords. We evaluate overgeneralisation for ten checkpoints throughout training.
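A minimal sketch of this keyword-based labelling is given below; the token matching is our simplification of the procedure, and the keyword set follows the "by heart" example above (the full per-idiom keyword lists are in Appendix D, Table 7).

```python
def is_overgeneralised(translation, literal_keywords):
    """Label a translation as overgeneralised (too literal) if it contains a
    copied source keyword or its literal Dutch translation."""
    tokens = translation.lower().split()
    return any(keyword in tokens for keyword in literal_keywords)

# For 'by heart': a literal rendering contains 'hart' (or a copied 'heart'),
# whereas the idiomatic paraphrase 'uit het hoofd' does not.
keywords = {"hart", "heart"}
print(is_overgeneralised("ik kende de formule uit mijn hart", keywords))  # True
print(is_overgeneralised("ik kende de formule uit het hoofd", keywords))  # False
```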
4.3.2 Results

[Figure 3: overgeneralisation (y-axis) across training epochs (x-axis), with one column per training set size (small, medium, full), shown for (a) synthetic, (b) semi-natural and (c) natural evaluation data.]

Figure 3: Visualisation of overgeneralisation for idioms throughout training, for five model seeds and their means. Overgeneralisation occurs early on in training and precedes memorisation of idioms' translations.

In Figure 3, we visualise the results per data type, again for three training dataset sizes.[8] For all evaluation data types and all training set sizes, three phases can be identified: 1) Initially, the translations do not contain the idiom's keyword; not because the idiom's meaning is paraphrased in the translation, but because the translations consist of high-frequency words in the target language only. 2) Afterwards, overgeneralisation peaks: the model emits a very literal – and potentially compositional – translation of the idiom. 3) Eventually, the model starts to memorise the idiom's translation. This is in line with the results of Hupkes et al. (2020), who created the artificial counterpart of exceptions to rules, as well as earlier results presented in the past tense debate by – among others – Rumelhart and McClelland (1986).

[8] Appendix D further details numerical results per idiom.

Although the height of the overgeneralisation peak is similar across evaluation data types and training set sizes, overgeneralisation is more prominent in converged models trained on smaller datasets than it is in models trained on the full corpus.[9] In addition to training set size, the type of evaluation data used also matters, since there is more overgeneralisation for synthetic and semi-natural data compared to natural data, stressing the impact of the context in which an idiom is embedded. The extreme case of a context unsupportive of an idiomatic interpretation is a sequence of random words; to evaluate the hypothesis that this yields local translations, we surround the idioms with ten random words. The results (see Appendix D, Table 7) indicate that, indeed, when the context provides no support at all for a global interpretation, the model provides a local translation for nearly all idioms.

[9] Convergence is based on the BLEU scores for validation data. When training models for longer, this could further change the overgeneralisation observed.

In addition to variations between setups, there is quite some variation in the overgeneralisation observed for individual idioms. The variation is partly explained by the fact that some idioms are less frequent compared to others, since the number of exact matches in the training corpus correlates significantly with the difference in overgeneralisation between the peak and overgeneralisation at convergence,[10] suggesting that frequent idioms are more likely to be memorised by the models.

[10] Pearson's r of .56 for synthetic data, Pearson's r of .56 for semi-natural data and Pearson's r of .53 for natural data, p < .0001.
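For reference, the correlation reported in footnote 10 could be computed along the following lines, assuming per-idiom corpus counts and overgeneralisation values are available; scipy's pearsonr is one standard choice, not necessarily the paper's own implementation.

```python
from scipy.stats import pearsonr

def frequency_memorisation_correlation(idiom_counts, peak_og, converged_og):
    """Correlate each idiom's number of exact matches in the training corpus with
    the drop in overgeneralisation between its peak and convergence."""
    drops = [peak - conv for peak, conv in zip(peak_og, converged_og)]
    return pearsonr(idiom_counts, drops)   # returns (r, p-value)
```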
2019; Hupkes et al., 2020), localism (Hupkes et al., 2020; Saphra and Lopez, 2020), productivity (Lake and Baroni, 2018) or overgeneralisation (Hupkes et al., 2020; Korrel et al., 2019). Generally, neural models struggle to generalise in such evaluation setups, although data augmentation (Andreas, 2020) and modelling techniques (Lake, 2019) have been shown to improve performance.

There are also studies that consider compositional generalisation on more natural data for the tasks of semantic parsing and MT, although the corpora used still represent a small and controlled subset of natural language. Finegan-Dollak et al. (2018) release eight text-to-SQL datasets, along with non-i.i.d. test sets. Keysers et al. (2019) apply automated rule-based dataset generation yielding their CFQ dataset, for which test sets with maximum compound divergence are used to measure compositional generalisation. Kim and Linzen (2020) present a PCFG-generated semantic parsing task with test sets for specific kinds of lexical and structural generalisation for fragments of English. Lake and Baroni (2018) measure few-shot generalisation for a toy NMT task for English-French that uses sentence pairs generated according to templates. Raunak et al. (2019) conducted additional experiments using that toy task to test for the systematicity of NMT models. More recently, Li et al. (2021) presented CoGnition, a benchmark for compositional generalisation in English-Chinese MT that trains models on data containing only short sentences from a small vocabulary, excluding constructions that contribute to the complexity of natural language for compositional generalisation, such as polysemous words or metaphorical language.

To the best of our knowledge, the only attempt to explicitly measure compositional generalisation of NMT models trained on large natural MT corpora is the study presented by Raunak et al. (2019). They measure productivity – generalisation to longer sentence lengths – of an LSTM-based NMT model trained on a full-size, natural MT dataset.

6   Discussion

Whether neural networks can generalise compositionally is often studied using artificial tasks that assume strictly local interpretations of compositionality. In this paper, we argued that such interpretations exclude large parts of language and that to move towards human-like productive usage of language, tests are needed that can assess how compositional models trained on natural data are. We laid out reformulations of three compositional generalisation tests – systematicity, substitutivity and overgeneralisation – for NMT models trained on natural corpora, and we focused on the level of compositionality that models exhibit. Aside from providing an empirical contribution, our work also highlights vital hurdles to overcome when considering what it means for models of natural language to be compositional. Below, we reflect on these hurdles as well as our results.

6.1   The proxy to meaning problem

Compositionality is a property of the mapping between the form and meaning of an expression. As translation is a meaning-preserving mapping from form in one language to form in another, it is an attractive task for evaluating compositionality: the translation of a sentence can be seen as a proxy to its meaning. However, while expressions are assumed to have only one meaning, translation is a many-to-many mapping: the same sentence can have multiple correct translations. This does not only complicate evaluation – MT systems are typically evaluated with BLEU, because accuracy is not a suitable option – it also raises questions about how compositional the desired behaviour of an MT model should be. On the one hand, one could argue that for optimal generalisation, robustness and accountability, we would like models to behave systematically and consistently: we expect the translation of an expression not to depend on unrelated contextual changes that do not affect its meaning (e.g. swapping out a synonym in a nearby sentence). In fact, consistency is the main metric that we use in most of our tests. However, this does imply that if a model changes its translation in a way that does not alter the correctness of the translation, this is considered incorrect. Does changing a translation from “atleet” (“athlete”) to “sporter” (“sportsman”) really matter, even if this change happened after changing an unrelated word somewhat far away? The answer to that question might depend on the domain, the quantity of data available for training, and how important the faithfulness of the translation is.
6.2   The locality problem

Almost inextricably linked to the proxy-to-meaning problem is the locality problem. In virtually all of our tests we see that small, local source changes elicit global changes in translations. For instance, in our systematicity tests, changing one noun in a sentence elicited changes in the translation of an independently sampled sentence that it was conjoined with. In our substitutivity test, even synonyms that merely differed in spelling (e.g. “doughnut” and “donut”) elicited changes to the remainder of the sentence. This counters the idea of compositionality as a means of productively reusing language: if the translation of a phrase is dependent on (unrelated) context that is not even in its direct vicinity, this suggests that more evidence is required to acquire the translation of this phrase.

In our tests using synthetic data, we explicitly consider sentences in which maximally local behaviour is possible and argue that it is therefore also desirable. Our experiments show that models are currently not capable of translating in such a local fashion and indicate that for some words or phrases, the translation constantly changes as if a coin is tossed every time we slightly adapt the input. On the one hand, this volatility (see also Fadaee and Monz, 2020) might be essential for coping with ambiguities for which meanings are context dependent. For instance, when substituting “doughnut” with “donut” in “doughnut and chips”, the “chips” are likely to be crisps instead of fries. On the other hand, this erratic behaviour highlights a lack of default reasoning, which can be problematic or even harmful in several different ways, especially if faithfulness (Parthasarathi et al., 2021) or consistency is important (cf., in a different domain, how an insignificant change in a prompt to a large language model like GPT-3 can elicit a large change in response).

In linguistics, many compositionality definitions have been proposed that allow considering ‘problem cases’ such as word sense ambiguities and idiomatic expressions to be part of a compositional language (Pagin and Westerståhl, 2010). However, in such formalisations, non-local behaviour is used to deal with non-local phenomena, while other parts of language are still treated locally; in our models, global behaviour appears in situations where a local treatment would be perfectly suitable and where there is no clear evidence for ambiguity. We follow Baggio (2021) in suggesting that we should learn from strategies employed by humans, who can assign compositional interpretations to expressions, treating them as novel language, but can for some inputs also derive non-compositional meanings. We believe that it is important to continue to investigate how models can represent both these types of inputs, providing a locally compositional treatment when possible but deviating from that when necessary as well.

6.3   Conclusion

In conclusion, with this work we contribute to the question of how compositional models trained on natural data are, and we argue that MT is a suitable and relevant testing ground to ask this question. Focusing on the balance between local and global forms of compositionality, we formulate three different compositionality tests and discuss the issues and considerations that come up when considering compositionality in the context of natural data. Our tests indicate a lack of local processing for most of our models, stressing the need for new evaluation measures that capture the level of compositionality beyond mere consistency.

Acknowledgements

We thank Sebastian Riedel, Douwe Kiela, Thomas Wolf, Khalil Sima’an, Marzieh Fadaee, Marco Baroni and in particular Brenden Lake and Adina Williams for providing feedback on this draft and our work in several different stages of it. We thank Michiel van der Meer for contributing to the initial experiments that led to this paper. VD is supported by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh.

References

Jacob Andreas. 2020. Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566.

Giosuè Baggio. 2021. Compositionality in a parallel architecture for language processing. Cognitive Science, 45(5):e12949.

Rahma Chaabouni, Roberto Dessì, and Eugene Kharitonov. 2021. Can transformers jump around right in natural language? Assessing performance transfer from SCAN. CoRR, abs/2107.01366.

Marzieh Fadaee and Christof Monz. 2020. The unreasonable volatility of neural machine translation models. In Proceedings of the Fourth Workshop on Neural Generation and Translation, NGT@ACL 2020, Online, July 5-10, 2020, pages 88–96. Association for Computational Linguistics.
Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-SQL evaluation methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360.

Jerry A Fodor and Zenon W Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.

Eduardo García-Ramírez. 2019. Open Compositionality: Toward a New Methodology of Language. Rowman & Littlefield.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzman, and Angela Fan. 2021. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. CoRR, abs/2106.03193.

Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. Magpie: A large corpus of potentially idiomatic expressions. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 279–287.

Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2020. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795.

Pauline Jacobson. 2002. The (dis)organization of the grammar: 25 years. Linguistics and Philosophy, 25(5/6):601–626.

Theo MV Janssen. 1998. Algebraic translations, correctness and algebraic compiler construction. Theoretical Computer Science, 199(1-2):25–56.

Theo MV Janssen and Barbara H Partee. 1997. Compositionality. In Handbook of logic and language, pages 417–473. Elsevier.

Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing neural language models: Contextual decomposition reveals default reasoning in number and gender assignment. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 1–11.

Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, et al. 2019. Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.

Najoung Kim and Tal Linzen. 2020. COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105.

Kris Korrel, Dieuwke Hupkes, Verna Dankers, and Elia Bruni. 2019. Transcoding compositionally: Using attention to find more generalizable solutions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 1–11.

Brenden Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882. PMLR.

Brenden M Lake. 2019. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9791–9801.

Brenden M. Lake, Tal Linzen, and Marco Baroni. 2019. Human few-shot learning of compositional instructions. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society, CogSci 2019: Creativity + Cognition + Computation, Montreal, Canada, July 24-27, 2019, pages 611–617. cognitivesciencesociety.org.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20.

Yafu Li, Yongjing Yin, Yulong Chen, and Yue Zhang. 2021. On compositional generalization of neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4767–4780.

Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, and Anupam Datta. 2020. Influence paths for characterizing subject-verb number agreement in LSTM language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4748–4757.

Gary F Marcus. 2003. The algebraic mind: Integrating connectionism and cognitive science. MIT Press.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.

Mathijs Mul and Willem Zuidema. 2019. Siamese recurrent networks learn first-order logic reasoning and exhibit zero-shot compositional generalization. CoRR, abs/1906.00180.

Ryan M Nefdt. 2020. A puzzle concerning compositionality in machines. Minds & Machines, 30(1).
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Peter Pagin and Dag Westerståhl. 2010. Compositionality II: Arguments and problems. Philosophy Compass, 5(3):265–282.

Barbara Partee. 1984. Compositionality. Varieties of formal semantics, 3:281–311.

Prasanna Parthasarathi, Koustuv Sinha, Joelle Pineau, and Adina Williams. 2021. Sometimes we want translationese. CoRR, abs/2104.07623.

Ellie Pavlick and Chris Callison-Burch. 2016. Most “babies” are “little” and most “problems” are “huge”: Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173.

Martina Penke. 2012. The dual-mechanism debate. In The Oxford handbook of compositionality.

Vikas Raunak, Vaibhav Kumar, Florian Metze, and Jaimie Callan. 2019. On compositionality in neural machine translation. In NeurIPS 2019 Context and Compositionality in Biological and Artificial Neural Systems Workshop.

D E Rumelhart and J McClelland. 1986. On Learning the Past Tenses of English Verbs. In Parallel distributed processing: Explorations in the microstructure of cognition, pages 216–271. MIT Press, Cambridge, MA.

Naomi Saphra and Adam Lopez. 2020. LSTMs compose—and Learn—Bottom-up. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2797–2809, Online.

Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216.

Zoltan Szabó. 2012. The case for compositionality. The Oxford handbook of compositionality, 64:80.

Jörg Tiedemann. 2020. The Tatoeba Translation Challenge – Realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182, Online. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480.

Andreas Van Cranenburgh, Remko Scha, and Rens Bod. 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling, 4(1):57–111.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Appendix A           Semi-Natural Templates
The semi-natural data that we use in our test sets is generated with the library DiscoDOP,¹¹ developed for data-oriented parsing (Van Cranenburgh et al., 2016). We generate the data with the following seven-step process (a sketch of the later steps in Python follows the list):

Step 1. Sample 100k English OPUS sentences.
Step 2. Generate a treebank using the disco-dop library and the discodop parser en ptb command. The library was developed for discontinuous data-oriented parsing; use the library's --fmt bracket option to turn off discontinuous parsing.
Step 3. Compute tree fragments from the resulting treebank (discodop fragments). These tree
fragments are the building blocks of a Tree-Substitution Grammar.
Step 4. We assume the most frequent fragments to be common syntactic structures in English. To
construct complex test sentences, we collect the 100 most frequent fragments containing at least 15
non-terminal nodes for NPs and VPs.
Step 5. Select three VP and five NP fragments to be used in our final semi-natural templates. These structures are selected through qualitative analysis for their diversity.
Step 6. Extract sentences matching the eight fragments (discodop treesearch).
Step 7. Create semi-natural sentences by varying one lexical item and varying the matching NPs and VPs
retrieved in Step 6.
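
As a minimal sketch of how Steps 4–7 could be implemented, the Python fragment below filters frequent NP/VP fragments and fills a template with retrieved noun phrases. The file name fragments.txt, its tab-separated (fragment, frequency) format and the example template are illustrative assumptions rather than the exact setup used for our test sets; Steps 2, 3 and 6 are run with the disco-dop commands named above.

# Illustrative sketch of Steps 4-7; file names and formats are assumptions.
from collections import Counter

def count_nonterminals(fragment: str) -> int:
    # Assumes bracketed fragments such as "(NP (DT ) (NN ))"; every opening
    # bracket introduces one non-terminal node.
    return fragment.count("(")

def select_fragments(path="fragments.txt", top_k=100, min_nodes=15):
    # Steps 4-5: keep the most frequent NP/VP fragments with at least
    # `min_nodes` non-terminals. The tab-separated (fragment, frequency)
    # format is an assumption about the output of `discodop fragments`.
    freqs = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fragment, freq = line.rstrip("\n").split("\t")
            if fragment.startswith(("(NP", "(VP")) and count_nonterminals(fragment) >= min_nodes:
                freqs[fragment] = int(freq)
    return [fragment for fragment, _ in freqs.most_common(top_k)]

def fill_template(nouns, noun_phrases, template="The {noun} reads an article about {np} ."):
    # Step 7: vary one lexical item (the noun) and the matched NP retrieved
    # in Step 6 to create semi-natural sentences.
    return [template.format(noun=n, np=np) for n in nouns for np in noun_phrases]

# Toy usage; the real noun phrases come from `discodop treesearch` (Step 6).
print(fill_template(
    nouns=["friend", "teacher"],
    noun_phrases=["the development of ascites in rats with liver cirrhosis"],
))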

   In Table 2, we provide examples for each of the ten templates used, along with the internal structure of
the complex NP or VP that is varied in the template. In Table 3, we provide some additional examples for
our ten synthetic templates.

 n       Template

 1       The Npeople (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))
         E.g. The woman wants to use the Internet as a means of communication .
 2       The Npeople (VP (VBP ) (VP (VBG ) (S (VP (TO ) (VP (VB ) (S (VP (TO ) (VP )))))))))
         E.g. The men are gon na have to move off-camera .
 3       The Npeople (VP (VB ) (NP (NP ) (PP (IN ) (NP ))) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))
         E.g. The doctors retain 10 % of these amounts by way of collection costs .
 4       The Npeople reads an article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))
         E.g. The friend reads an article about the development of ascites in rats with liver cirrhosis .
 5       The Npeople reads an article about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) .
         E.g. The teachers read an article about the degree of progress that can be achieved by the industry .
 6       An article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) is read by the Npeople .
         E.g. An article about the inland transport of dangerous goods from a variety of Member States is read by the lawyer .
 7       An article about (NP (NP ) (PP (IN ) (NP (NP ) (, ,) (SBAR (S (WHNP (WDT )) (VP )))))) , is read by the Npeople .
         E.g. An article about the criterion on price stability , which was 27 % , is read by the child .
 8       Did the Npeople hear about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) .
         E.g. Did the friend hear about an inhospitable fringe of land on the shores of the Dead Sea ?
 9       Did the Npeople hear about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) ?
         E.g. Did the teacher hear about the march on Employment which happened here on Sunday ?
 10      Did the Npeople hear about (NP (NP ) (SBAR (S (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP )))))))) ?
         E.g. Did the lawyers hear about a qualification procedure to examine the suitability of the applicants ?

Table 2: Semi-natural data templates along with their identifiers (n). The syntactical structures for noun and verb
phrases in purple are instantiated with data from the OPUS collection. Generated data from every template contains
varying sentence structures and varying tokens, but the predefined tokens in black remain the same.

11 github.com/andreasvc/disco-dop
 n    Template

 1    The N_people V_transitive the N^sl_elite .
      E.g. The poet criticises the king .
 2    The N_people Adv V_transitive the N^sl_elite .
      E.g. The victim carefully observes the queen .
 3    The N_people P the N^sl_vehicle V_transitive the N^sl_elite .
      E.g. The athlete near the bike observes the leader .
 4    The N_people and the N_people V^pl_transitive the N^sl_elite .
      E.g. The poet and the child understand the mayor .
 5    The N^sl_quantity of N^pl_people P the N^sl_vehicle V^sl_transitive the N^sl_elite .
      E.g. The group of friends beside the bike forgets the queen .
 6    The N_people V_transitive that the N^pl_people V^pl_intransitive .
      E.g. The farmer sees that the lawyers cry .
 7    The N_people Adv V_transitive that the N^pl_people V^pl_intransitive .
      E.g. The mother probably thinks that the fathers scream .
 8    The N_people V_transitive that the N^pl_people V^pl_intransitive Adv .
      E.g. The mother thinks that the fathers scream carefully .
 9    The N_people that V_intransitive V_transitive the N^sl_elite .
      E.g. The poets that sleep understand the queen .
 10   The N_people that V_transitive Pro V^sl_transitive the N^sl_elite .
      E.g. The mother that criticises him recognises the queen .

Table 3: Artificial sentence templates similar to Lakretz et al. (2019), along with their identifiers (n).
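
To make the template notation concrete, the sketch below enumerates sentences for template 1 of Table 3. The small lexicons are placeholder assumptions for illustration, not the word lists used to build the actual test sets.

# Minimal sketch of generating sentences from Table 3, template 1:
# "The N_people V_transitive the N^sl_elite ." Lexicons are placeholders.
import itertools

N_PEOPLE = ["poet", "victim", "athlete"]      # N_people
V_TRANSITIVE = ["criticises", "observes"]     # V_transitive (singular agreement)
N_ELITE_SG = ["king", "queen", "mayor"]       # N^sl_elite

def template_1():
    # Enumerate all combinations of the three slots for template 1.
    for noun, verb, obj in itertools.product(N_PEOPLE, V_TRANSITIVE, N_ELITE_SG):
        yield f"The {noun} {verb} the {obj} ."

for sentence in itertools.islice(template_1(), 3):
    print(sentence)  # e.g. "The poet criticises the king ."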

Appendix B          Systematicity
Table 4 provides the numerical counterparts of the results visualised in Figure 1.

 Data           Condition    (a) Training set size       (b) Template
                             small   medium   full       1     2     3     4     5     6     7     8     9     10
 S → NP VP
 synthetic      NP           .73     .84      .84        .86   .74   .85   .87   .75   .89   .85   .85   .70   .68
 synthetic      VP           .76     .87      .88        .92   .73   .90   .91   .84   .88   .85   .82   .77   .74
 semi-natural   NP           .63     .66      .64        .66   .63   .65   .70   .64   .69   .63   .63   .60   .58
 S → S CONJ S
 synthetic      S01          .81     .90      .92        .91   .82   .88   .88   .86   .95   .90   .91   .84   .79
 synthetic      S3           .53     .76      .82        .75   .54   .72   .66   .73   .88   .74   .81   .66   .55
 semi-natural   S01          .63     .71      .73        .73   .75   .75   .80   .75   .73   .66   .60   .59   .56
 semi-natural   S3           .28     .46      .47        .50   .50   .51   .58   .52   .43   .35   .23   .23   .21
 natural        S01          .58     .67      .72        .67   .74   .65   .64   .63   .64   .62   .66   .63   .66
 natural        S3           .25     .39      .47        .39   .49   .35   .35   .34   .37   .33   .38   .34   .38

Table 4: Consistency scores for the systematicity experiments, detailed per experimental setup and evaluation data
type. We provide scores (a) per models’ training set size, and (b) per template of our generated evaluation data.
For natural data, the template number is meaningless, apart from the fact that it determines sentence length and
word frequency.
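
As a rough illustration of how a consistency score for the S → S CONJ S condition could be computed, the sketch below counts a sentence pair as consistent when the second conjunct receives an identical translation in both inputs. This is a simplification with a dummy stand-in translator, not the evaluation code used for the numbers reported in Table 4.

# Hedged sketch of a consistency-style score for the S -> S CONJ S setup.
def translate(sentence: str) -> str:
    # Stand-in for an NMT system; here a dummy identity "translation".
    return sentence.lower()

def second_conjunct(translation: str, conj: str = "and") -> str:
    # Assumes the conjunction is translated predictably; real evaluation
    # would need alignment or a more robust split.
    return translation.split(f" {conj} ", 1)[-1]

def consistency(pairs) -> float:
    # pairs: (source_a, source_b) differing only in the first conjunct;
    # returns the fraction whose second conjunct is translated identically.
    hits = [
        second_conjunct(translate(a)) == second_conjunct(translate(b))
        for a, b in pairs
    ]
    return sum(hits) / len(hits) if hits else 0.0

pairs = [
    ("The poet criticises the king and the farmer sees that the lawyers cry .",
     "The victim criticises the king and the farmer sees that the lawyers cry ."),
]
print(consistency(pairs))  # 1.0 for the dummy translator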

Appendix C          Substitutivity
Synonyms employed In Table 5, we provide some information about the synonymous word pairs used
in the substitutivity test, including their frequency in OPUS and their most common Dutch translation.
The last column of the table contains the subordinate clauses that we used to include the synonyms in the
synthetic and semi-natural data. We include them as a relative clause attached to nouns representing a human, such as in “The poet criticises the king that eats the doughnut”.
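
The sketch below illustrates how such a substitutivity pair could be constructed from Table 5, under the assumption that the relative clause is simply appended after the human noun; the sentences and helper function are illustrative rather than the actual data-generation code.

# Illustrative construction of one British/American substitutivity pair.
SYNONYM_PAIRS = {
    "doughnut": ("doughnut", "donut", "that eats the"),
    "aeroplane": ("aeroplane", "airplane", "that travels by"),
}

def substitutivity_pair(base: str, noun: str, key: str):
    # Return the two sentence variants that differ only in the synonym.
    british, american, clause = SYNONYM_PAIRS[key]
    return (
        base.replace(noun, f"{noun} {clause} {british}"),
        base.replace(noun, f"{noun} {clause} {american}"),
    )

brit, amer = substitutivity_pair("The poet criticises the king .", "king", "doughnut")
print(brit)   # The poet criticises the king that eats the doughnut .
print(amer)   # The poet criticises the king that eats the donut .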
Results In the main paper, Figures 2a and 2b provide the consistency scores for the substitutivity tests. Here, Table 6 further details the results from those figures by presenting the average consistency per evaluation data type and training set size, and per evaluation data type and synonym pair.
           British                    Freq.                        American                             Freq.                   Dutch translation                                                        Subordinate clause

          aeroplane                  6728                         airplane                             5403                vliegtuig                                                                that travels by . . .
          aluminium                  17982                        aluminum                             5700                aluminium                                                                that sells . . .
          doughnut                   2014                         donut                                1889                donut                                                                    that eats the . . .
          foetus                     1943                         fetus                                1878                foetus                                                                   that researches the . . .
          flautist                   112                          flutist                              101                 fluitist                                                                 that knows the . . .
          moustache                  1132                         mustache                             1639                snor                                                                     that has a . . .
          tumour                     7338                         tumor                                6348                tumor                                                                    that has a . . .
          pyjamas                    808                          pajamas                              1106                pyjama                                                                   that wears . . .
          sulphate                   3776                         sulfate                              1143                zwavel                                                                   that sells . . .
          yoghurt                    1467                         yogurt                               2070                yoghurt                                                                  that eats the . . .
          aubergine                  765                          eggplant                             762                 aubergine                                                                that eats the . . .
          shopping trolley           217                          shopping cart                        13366               winkelwagen                                                              that uses a . . .
          veterinary surgeon         941                          veterinarian                         6995                dierenarts                                                               that knows the . . .
          sailing boat               5097                         sailboat                             1977                zeilboot                                                                 that owns a . . .
          football                   33125                        soccer                               6841                voetbal                                                                  that plays . . .
          holiday                    125430                       vacation                             23532               vakantie                                                                 that enjoys the . . .
          ladybird                   235                          ladybug                              303                 lieveheersbeestje                                                        that caught a . . .
          theatre                    19451                        theater                              13508               theater                                                                  that loves . . .
          postcode                   479                          zip code                             1392                postcode                                                                 with the same . . .
          whisky                     3604                         whiskey                              4313                whisky                                                                   that drinks . . .

Table 5: Synonyms for the substitutivity test, along with their OPUS frequency, Dutch translation, and the subor-
dinate clause used to insert them in the data.

                                       Data                                  Metric                                                   Model
                                                                                                                                small medium full

                                       synthetic    consistency                                                                     .49                    .67                           .76
                                                    synonym consistency                                                             .67                    .83                           .92
                                       semi-natural consistency                                                                     .34                    .55                           .62
                                                    synonym consistency                                                             .63                    .84                           .93
                                       natural      consistency                                                                     .36                    .51                           .62
                                                    synonym consistency                                                             .65                    .77                           .87

                                                                        (a) Per models’ training set size
 Synonym              synthetic              semi-natural           natural
                      cons.   syn. cons.     cons.   syn. cons.     cons.   syn. cons.
 aeroplane            .54     1.0            .43     1.0            .5      .94
 aluminium            .87     1.0            .59     1.0            .52     .86
 doughnut             .74     .87            .58     .83            .53     .74
 foetus               .82     1.0            .54     1.0            .56     .99
 flautist             .1      .1             .08     .1             .09     .12
 moustache            .92     1.0            .85     1.0            .75     .88
 tumour               .78     1.0            .52     .99            .5      .95
 pyjamas              .64     .71            .55     .67            .6      .77
 sulphate             .79     .95            .56     .93            .47     .89
 yoghurt              .55     1.0            .42     .98            .57     .88
 aubergine            .25     .38            .24     .4             .23     .33
 shopping trolley     .4      .59            .31     .57            .7      .92
 veterinary surgeon   .64     .84            .33     .76            .29     .71
 sailing boat         .73     1.0            .73     1.0            .64     .92
 football             .68     .75            .66     .9             .47     .77
 holiday              .81     1.0            .71     1.0            .62     .89
 ladybird             .27     .4             .2      .38            .17     .27
 theatre              .85     1.0            .62     1.0            .59     .81
 postcode             .48     .53            .43     .58            .61     .85
 whisky               .88     1.0            .75     .99            .58     .79

                                            (b) Per synonym

Table 6: Consistency scores for the substitutivity experiments, detailed per evaluation data type. We present scores
(a) per models’ training set size and (b) per synonym.