Word Interdependence Exposes How LSTMs Compose Representations
Naomi Saphra and Adam Lopez
University of Edinburgh
n.saphra@ed.ac.uk, alopez@inf.ed.ac.uk

arXiv:2004.13195v1 [cs.CL] 27 Apr 2020

Abstract

Recent work in NLP shows that LSTM language models capture compositional structure in language data. For a closer look at how these representations are composed hierarchically, we present a novel measure of interdependence between word meanings in an LSTM, based on their interactions at the internal gates. To explore how compositional representations arise over training, we conduct simple experiments on synthetic data, which illustrate our measure by showing how high interdependence can hurt generalization. These synthetic experiments also illustrate a specific hypothesis about how hierarchical structures are discovered over the course of training: that parent constituents rely on effective representations of their children, rather than learning long-range relations independently. We further support this measure with experiments on English language data, where interdependence is higher for more closely syntactically linked word pairs.

1 Introduction

Recent work in NLP has seen a flurry of interest in the question: are the representations learned by neural networks compositional? That is, are representations of longer phrases built recursively from representations of shorter phrases, as they are in many linguistic theories? If so, how and when do they learn to do this?

Computationally, sequence models like LSTMs scan a sentence from left to right, accumulating meaning into a hidden representation at each time step. Yet we have extensive evidence that fully trained LSTMs are sensitive to syntactic structure, suggesting that they learn something about recursive composition of meaning. For example, they can recall more history in natural language data than in similarly Zipfian-distributed n-gram data, implying that they exploit long-distance dependencies (Liu et al., 2018). Their internal representations seem to be hierarchical in nature (Blevins et al., 2018; Hupkes et al., 2017). They seemingly encode knowledge of part of speech (Belinkov et al., 2017), morphological productivity (Vania and Lopez, 2017), and verb agreement (Lakretz et al., 2019). How does this apparently compositional behavior arise in learning?

Concretely, we are interested in an aspect of compositionality sometimes called localism (Hupkes et al., 2019), in which the meanings of long sequences are recursively composed from the meanings of shorter child sequences, without regard for how the child meaning is itself constructed; that is, the computation of the composed meaning relies only on the local properties of the child meanings. By contrast, a global composition would be constructed from all words. A local composition operation leads to hierarchical structure (like a classic syntax tree), whereas a global operation leads to flat structure. If meaning is composed locally, then in a sentence like "The chimney sweep has sick lungs", the unknown composition function f might reflect syntactic structure, computing the full meaning as f(f(The, chimney, sweep), has, f(sick, lungs)). Local composition assumes low interdependence between the meanings of "chimney" and "has", or indeed between any pair of words not local to the same invocation of f.

To analyze compositionality, we propose a measure of word interdependence that directly measures the composition of meaning in LSTMs through the interactions between words (Section 2). Our method builds on Contextual Decomposition (CD; Murdoch et al., 2018), a tool for analyzing the representations produced by LSTMs. We conduct experiments on a synthetic corpus (Section 3), which illustrate interdependence and find that highly familiar constituents make nearby vocabulary statistically dependent on them, leaving them vulnerable to domain shift. We then relate word interdependence in an English language corpus (Section 4) to syntax, finding that word pairs with close syntactic links have higher interdependence than more distantly linked words, even when stratifying by sequential distance and part of speech. This pattern offers a potential structural probe that can be computed directly from an LSTM, without learning the additional parameters required by other methods (Hewitt and Manning, 2019).
2 Methods

We now introduce our interdependence measure, a natural extension of Contextual Decomposition (CD; Murdoch et al., 2018), a tool for analyzing the representations produced by LSTMs. To conform with Murdoch et al. (2018), all experiments use a one-layer LSTM, with inputs taken from an embedding layer and outputs processed by a softmax layer.

2.1 Contextual Decomposition

Let us say that we need to determine when our language model has learned that "either" implies an appearance of "or" later in the sequence. We consider an example sentence, "Either Socrates is mortal or not". Because many nonlinear functions are applied in the intervening span "Socrates is mortal", it is difficult to directly measure the influence of "either" on the later occurrence of "or". To dissect the sequence and understand the impact of its individual elements, we can employ CD.

CD is a method for measuring the individual influence that words and phrases in a sequence have on the output of a recurrent model. Illustrated in Figure 1, CD decomposes the activation vector produced by an LSTM layer into a sum of relevant and irrelevant parts. The relevant part is the contribution of the phrase or set of words in focus, i.e., a set of words whose impact we want to measure. We denote this contribution as β. The irrelevant part includes the contribution of all words not in that set (denoted β̄), as well as the interactions between the relevant and irrelevant words. For an output hidden state vector h at any particular timestep, CD decomposes it into two vectors, the relevant h_β and the irrelevant h_β̄ (which absorbs the interactions between β and β̄), such that:

    h ≈ h_β + h_β̄

[Figure 1: CD uses linear approximations of gate operations to linearize the sequential application of the LSTM module.]

Because the individual contributions of the items in a sequence interact in nonlinear ways, this decomposition is only an approximation and cannot exactly compute the impact of a specific word or words on the predicted label. However, the dynamics of LSTMs are roughly linear in natural settings, as found by Morcos et al. (2018), who observed close linear projections between the activations at each timestep in a repeating sequence and the activations at the end of the sequence. This observation allows CD to linearize hidden states with low approximation error, but the slight nonlinearity that remains forms the basis for our measure of interdependence later on.

The decomposed form is achieved by linearizing the contribution of the words in focus. This is necessarily approximate, because the internal gating mechanisms of an LSTM each employ a nonlinear activation function, either σ or tanh. Murdoch et al. (2018) use a linearized approximation L_σ for σ and a linearized approximation L_tanh for tanh, such that for an arbitrary sum of inputs y_1, ..., y_N:

    σ(y_1 + ... + y_N) = L_σ(y_1) + ... + L_σ(y_N)    (1)

These approximations are then used to split each gate into components contributed by the previous hidden state h_{t−1} and by the current input x_t, for example the input gate i_t:

    i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
        ≈ L_σ(W_i x_t) + L_σ(V_i h_{t−1}) + L_σ(b_i)

The linear form L_σ is achieved by computing the Shapley value (Shapley, 1953) of each summand, defined as the average difference resulting from excluding that summand, taken over all possible orderings of the input summands. Applying Formula 1 to σ(y_1 + y_2), the linear approximation of the isolated effect of the summand y_1 is:

    L_σ(y_1) = 1/2 [ (σ(y_1) − σ(0)) + (σ(y_1 + y_2) − σ(y_2)) ]

With this function, we can take a hidden state from the previous timestep, decomposed as h_{t−1} ≈ h^β_{t−1} + h^β̄_{t−1}, and add x_t to the appropriate component. For example, if x_t is in focus, we count it among the relevant inputs when computing the input gate:

    i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
        ≈ σ(W_i x_t + V_i (h^β_{t−1} + h^β̄_{t−1}) + b_i)
        ≈ [ L_σ(W_i x_t + V_i h^β_{t−1}) + L_σ(b_i) ] + L_σ(V_i h^β̄_{t−1})
        = i^β_t + i^β̄_t
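To make the linearization concrete, here is a minimal NumPy sketch of the Shapley-style split (a sketch of our own; the function names are not from the Murdoch et al. code release). It averages each summand's marginal effect on σ over all orderings and checks the two-summand closed form above.

    import itertools
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def linearized_sigmoid(summands):
        # Shapley-style linearization (cf. Formula 1): assign each summand the
        # average marginal change it causes in sigmoid(partial sum), taken over
        # every ordering of the summands.
        n = len(summands)
        shares = [0.0] * n
        perms = list(itertools.permutations(range(n)))
        for perm in perms:
            running = 0.0
            for idx in perm:
                shares[idx] += sigmoid(running + summands[idx]) - sigmoid(running)
                running += summands[idx]
        return [s / len(perms) for s in shares]

    # Two-summand case from the text.
    y1, y2 = 1.5, -0.3
    L_y1, _ = linearized_sigmoid([y1, y2])
    closed_form = 0.5 * ((sigmoid(y1) - sigmoid(0.0)) + (sigmoid(y1 + y2) - sigmoid(y2)))
    print(L_y1, closed_form)  # identical: the permutation average recovers the closed form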
We can use softmax to convert the relevant logits v_β into a probability distribution, P(Y | x_β) = softmax(v_β). This allows us to analyze the effect of the input x_β on the prediction at a later element while controlling for the influence of the rest of the sequence.

In our analyses, CD yielded an approximation error ‖(v_β + v_β̄) − v‖ / ‖v‖ < 10^-5 at the logits. However, this measurement misses another source of approximation error: the allocation of credit between β and the interactions between β and β̄. Changing the out-of-focus sequence β̄ might influence v_β, for example, even though the contribution of the out-of-focus words should be mostly confined to the irrelevant vector component. This approximation error is crucial because the component attributed to the interactions is central to our measure of interdependence.

2.2 Interdependence

We frame compositionality in terms of whether the meanings of a pair of words or word subsets can be treated independently. For example, a "slice of cake" can be broken into the individual meanings of "slice", "of", and "cake", but an idiomatic expression such as "piece of cake", meaning a simple task, cannot be broken into the individual meanings of "piece", "of", and "cake". The words in the idiom have higher interdependence, or reliance on their interactions to build meaning. Another influence on interdependence should be syntactic relation; if you "happily eat a slice of cake", the meaning of "cake" does not depend on "happily", which modifies "eat" and is far from "cake" on the syntactic tree. We will use the nonlinear interactions in Contextual Decomposition to analyze the interdependence between words alternately considered in focus.

Generally, CD considers all nonlinear interactions between the relevant and irrelevant sets of words to fall under the irrelevant contribution, although other allocations of the interactions have been proposed (Jumelet et al., 2019). A fully flat structure for building meaning would lead to a contextual representation that breaks into a linear sum of each word's meaning, which is the simplifying assumption at the heart of CD.

Given two interacting sets of words to potentially designate as the β in focus, A and B with A ∩ B = ∅, we use a measure of interdependence to quantify the degree to which the meaning of A ∪ B can be broken into their individual meanings. With v_A and v_B denoting the relevant contributions of A and B according to CD, and v_{A∪B} the relevant contribution of A ∪ B, we compute the magnitude of the nonlinear interactions, rescaled to control for the magnitude of the representation:

    interdependence(A, B) = ‖v_{A∪B} − (v_A + v_B)‖_2 / ‖v_{A∪B}‖_2

This quantity is related to probabilistic independence. We would say that events X and Y are independent if their joint probability P(X, Y) = P(X)P(Y). Likewise, the meanings of A and B can be called independent if v_{A∪B} = v_A + v_B.
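As a sketch of how this measure can be computed, the snippet below assumes a helper cd_relevant(model, tokens, focus_positions) that returns the CD-relevant contribution to the output logits (v_β) for a given set of in-focus positions; that helper is a hypothetical stand-in for any Contextual Decomposition implementation, not a specific library call.

    import numpy as np

    def interdependence(model, tokens, positions_a, positions_b, cd_relevant):
        # positions_a and positions_b are disjoint sets of token positions.
        # cd_relevant is the (hypothetical) CD routine described above.
        v_a = cd_relevant(model, tokens, positions_a)
        v_b = cd_relevant(model, tokens, positions_b)
        v_ab = cd_relevant(model, tokens, positions_a | positions_b)
        # Magnitude of the nonlinear A-B interaction, rescaled by the joint magnitude.
        return np.linalg.norm(v_ab - (v_a + v_b)) / np.linalg.norm(v_ab)

    # Example call for two adjacent words in a tokenized sentence:
    # interdependence(model, sentence, {1}, {2}, cd_relevant)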
3 Synthetic Experiments

Our first experiments use synthetic data to understand the role of compositionality in LSTM learning dynamics. These dynamics see long-range connections discovered after short-range connections; in particular, document-level content topic information is preserved much later in training than local information like part of speech (Saphra and Lopez, 2019). There are several explanations for this phenomenon.

First, long-range connections are less consistent (particularly in a right-branching language like English). For example, the pattern of a determiner followed by a noun will appear very frequently, as in "the man". However, we will less frequently see long-range connections like the either/or in "Either Socrates is mortal or not". Rarer patterns are learned slowly (Appendix A).
[Figure 2: Example sequences from (a) the unfamiliar-conduit training set, (b) the familiar-conduit training set, (c) the out-of-domain test set, and (d) the in-domain test set. Rule boundaries α and ω are highlighted in red, and the conduit q ∈ Q_k in green.]

The following experiments are designed to explore a third possibility: that the training process is inherently compositional. That is, the shorter sequences must be learned first in order to form the basis for longer relations learned around them. The compositional view of training is not a given and must be verified. In fact, simple rules learned early on might inhibit the learning of more complex rules through the phenomenon of gradient starvation (Combes et al., 2018), in which more frequent features dominate the gradient directed at rarer features. Shorter familiar patterns could slow down the learning of longer-range patterns by degrading the gradient passed through them, or by trapping the model in a local minimum that makes the long-distance rule harder to reach. However, if the training process builds syntactic patterns hierarchically, it can lead to representations that are built hierarchically at inference time, reflecting linguistic structure. To test the idea of a compositional training process, we use synthetic data that controls for the consistency and frequency of longer-range relations.

3.1 The dataset

We may expect representations of long-range connections to be built in a way that depends strongly on the subtrees they span, but not all such connections rely on subtrees. If connections that do not rely on shorter constituents are nonetheless built from those constituents, then the observation over training that increasingly long connections are gradually constructed from constituents cannot be dismissed as purely a data effect. For example, consider "either/or". "Either" should determine that "or" will later occur, regardless of the phrase that intercedes them. To learn this rule, a language model must backpropagate information from the occurrence of "or" through the intervening sequence of words, which we will call a conduit. Perhaps the model encounters a training example whose conduit is predictable because it is structured in familiar ways: "Either Socrates is mortal or not". But what if the conduit is unfamiliar and its structure cannot be interpreted by the model, for example if it includes unknown tokens: "Either slithy toves gyre or not"? How will the gradient carried from "or" to "either" be shaped by the conduit, and how will the representation of that long-range connection change accordingly? A familiar conduit could be used by a compositional training process as a short constituent on which to build longer-range representations, so the meaning of "Either" in context would depend on the conduit. Conversely, if training is not biased to be compositional, the connection will be made regardless, so the rule will generalize to test data. To investigate whether long-range dependencies are built from short constituents in this way, we train models on synthetic data which varies the predictability of short sequences.

We generate data uniformly at random from a vocabulary Σ. We insert n instances of the long-distance rule αΣ^k ω, with conduit Σ^k of length k, open symbol α, and close symbol ω, where α, ω ∉ Σ. Relating to our running example, α stands for "either" and ω stands for "or". We use a corpus of 1m tokens with |Σ| = 1k types, which leaves a low probability that any conduit sequence longer than one token appears elsewhere by chance. We train with a learning rate of 1 throughout and gradients clipped at 0.25. We found momentum and weight decay to slow rule learning in this setting, so they are not used.
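A minimal sketch of how such a corpus could be generated is shown below. The corpus size, vocabulary size, rule count, and phrase-bank size follow the numbers in the text; sampling details that the text leaves open (such as how rule positions are drawn, or how collisions are handled) are our own choices.

    import random

    def make_corpus(n_rules=1000, k=4, vocab_size=1000, corpus_len=1_000_000,
                    familiar=False, n_phrases=100, extra_per_phrase=1000, seed=0):
        rng = random.Random(seed)
        sigma = [f"w{i}" for i in range(vocab_size)]   # background vocabulary Sigma
        alpha, omega = "<alpha>", "<omega>"            # rule symbols, not in Sigma

        phrases = [[rng.choice(sigma) for _ in range(k)] for _ in range(n_phrases)]
        def conduit():
            # Familiar setting: draw from a fixed bank of phrases; otherwise fresh.
            return list(rng.choice(phrases)) if familiar else [rng.choice(sigma) for _ in range(k)]

        tokens = [rng.choice(sigma) for _ in range(corpus_len)]
        for pos in rng.sample(range(corpus_len - (k + 2)), n_rules):
            tokens[pos:pos + k + 2] = [alpha] + conduit() + [omega]   # insert "alpha q omega"

        if familiar:
            # Scatter each conduit phrase outside the rule so it can be memorized.
            for phrase in phrases:
                for pos in rng.sample(range(corpus_len - k), extra_per_phrase):
                    span = tokens[pos:pos + k]
                    if alpha not in span and omega not in span:       # crude overlap guard
                        tokens[pos:pos + k] = phrase
        return tokens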
[Figure 3: Mean marginal target probability of the close symbol in a rule, in (a) the in-domain conduit test setting and (b) the random conduit test setting. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set. The y-axis scale is matched between graphs.]

3.2 The Effect of Conduit Familiarity

To understand the effect of conduit predictability on longer-range connections, we modify the original synthetic data (Figure 2a) so that each conduit appears frequently outside of the α/ω rule (Figure 2b). The conduits are sampled from a randomly generated vocabulary of 100 phrases of length k, so each unique conduit q appears in the training set 10 times in the context αqω. This repetition is necessary in order to fit 1000 occurrences of the rule in all settings. In the familiar-conduit setting, we randomly distribute 1000 occurrences of each conduit throughout the corpus, outside of the rule patterns. Therefore each conduit is seen often enough to be memorized (see Appendix B). In the original unfamiliar-conduit setting, q appears only in this context as a conduit, so the conduit is not memorized.

We also use two distinct test sets. Our in-domain test set (Figure 2d) uses the same set of conduits as the training set. In Figure 3a, the model learns when to predict the close symbol faster if the conduits are familiar (as predicted in Appendix B). However, the maximum performance reached is lower than in the unfamiliar setting, and it is brittle to overtraining.

If the test set conduits are instead sampled uniformly at random (Figure 2c), Figure 3b shows that the familiar-conduit training setting never teaches the model to generalize the α/ω rule. For a model trained on the familiar domain, a familiar conduit is required to predict the close symbol.

[Figure 4: Mean target probability of ω using CD with α in focus, on the out-of-domain test set. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

3.2.1 Isolating the Effect of the Open Symbol

Raw predictions in the out-of-domain test setting appear to suggest that the familiar-conduit training setting fails to teach the model to associate α and ω. However, the changing domain makes this an unfair assertion: the poor performance may be attributed to interactions between the open symbol and the conduit. In order to control for the role that memorization of the conduits plays in the prediction of the close symbol, we use CD to isolate the contributions of the open symbol in the random-conduit test setting.
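The measurement behind Figures 4 and 5 can be sketched as follows, reusing the hypothetical cd_relevant helper from Section 2; this reflects our reading of the setup rather than the authors' released code.

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def close_symbol_prob(model, tokens, omega_id, cd_relevant, prefix_len=0):
        # tokens: the sequence up to (but not including) the position where omega
        # should be predicted, i.e. [alpha, q_1, ..., q_k].
        # Focus = the open symbol plus the first `prefix_len` conduit tokens;
        # prefix_len = 0 isolates alpha alone (Figure 4), and sweeping prefix_len
        # traces out curves like those in Figure 5.
        focus = set(range(prefix_len + 1))
        v_rel = cd_relevant(model, tokens, focus)   # relevant logits at the final step
        return softmax(v_rel)[omega_id]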
[Figure 5: The predicted P(x_t = ω | x_{t−k} ... x_{t−k+i}) according to CD, varying i on the x-axis, with x_{t−k} = α and k = 8. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

[Figure 6: Mean interdependence between the open symbol and the conduit on the in-domain test set. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

Figure 4 shows that even in the out-of-domain test setting, the presence of α predicts ω at the appropriate time step. Furthermore, we confirm that the familiar-conduit training setting enables earlier acquisition of this rule.

To what, then, can we attribute the failure to generalize to the random-conduit domain? Figure 5 illustrates how the unfamiliar-conduit model predicts the close symbol ω with high probability based only on the contributions of the open symbol α. Meanwhile, the familiar-conduit model's probability increases substantially with each symbol consumed until the end of the conduit, indicating that the model relies on interactions between the open symbol and the conduit rather than registering only the effect of the open symbol. Note that this effect cannot be because the conduit is more predictive of ω: because each conduit appears frequently outside of the specific context of the rule in the familiar-conduit setting, the conduit is less predictive of ω on distributional grounds alone.

These results indicate that predictable patterns play a vital role in shaping the representations of symbols around them, by composing in a way that cannot be easily linearized as a sum of the component parts. In particular, as seen in Figure 6, the interdependence between the open symbol and the conduit is substantially higher for the familiar-setting model and increases throughout training. Long-range connections are not learned independently of conduit representations, but are built compositionally, using already-familiar shorter subsequences as scaffolding.

4 English Language Experiments

We now apply our measure of interdependence to a natural language setting. In natural language, disentangling the meaning of individual words requires contextual information which is hierarchically composed. For example, in the sentence "The chimney sweep has sick lungs", "chimney sweep" has a clear definition and strong connotations that are less evident in each word individually. However, knowing that "sweep" and "sick" co-occur is not sufficient to clarify the meaning and connotations of either word, or to compose a shared meaning. Does interdependence effectively express this syntactic link?

These experiments use language models trained on WikiText-2 (Merity et al., 2016), run on the Universal Dependencies corpus English-EWT (Silveira et al., 2014).

[Figure 7: A dependency-parsed sentence: "The chimney sweep has sick lungs", with arcs labeled ROOT, DET, AMOD, NMOD, NSUBJ, and OBJ.]

4.1 Interdependence and Syntax

To assess the connection between interdependence and syntax, we consider the interdependence of word pairs at different syntactic distances. For example, in Figure 7, "chimney" is one edge away from "sweep", two from "has", and four from "sick".
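Syntactic distance here is simply the number of dependency edges between two tokens. A small sketch (our own breadth-first search over head indices, with head assignments chosen to match the distances quoted above) is:

    from collections import defaultdict, deque

    def syntactic_distance(heads, i, j):
        # heads[t] is the 0-based index of token t's head (None for the root),
        # e.g. converted from a CoNLL-U parse such as English-EWT.
        graph = defaultdict(set)
        for child, head in enumerate(heads):
            if head is not None:
                graph[child].add(head)
                graph[head].add(child)
        queue, seen = deque([(i, 0)]), {i}
        while queue:
            node, dist = queue.popleft()
            if node == j:
                return dist
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None  # disconnected; should not happen for a well-formed parse

    # "The chimney sweep has sick lungs", heads as in Figure 7.
    heads = [2, 2, 3, None, 5, 3]
    print(syntactic_distance(heads, 1, 2))  # chimney -> sweep: 1
    print(syntactic_distance(heads, 1, 4))  # chimney -> sick: 4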
[Figure 8: Average interdependence between word pairs x_l, x_r at different sequential distances r − l.]

In Figure 8, we see that, in general, the closer two words occur in sequence, the more they influence each other, leading to correspondingly high interdependence. Because proximity is a dominant factor in interdependence, we control for the sequential distance of words when we investigate syntactic distance.

The synthetic data experiments show that phrase frequency and predictability play a critical role in determining interdependence (although raw word frequency shows no clear correlation with interdependence in English). We control for these properties through POS tags, as open and closed tag classes vary in their predictability in context; for example, determiners (a closed POS class) are almost always soon followed by a noun, but adjectives (an open POS class) appear in many constructions, like "Socrates is mortal", where they are not. We stratify the data in Figure 9 according to whether the POS tags are in closed or open classes, which serves as a proxy for predictability. Irrespective of both sequential distance and part of speech, we see broadly decreasing trends in interdependence as the syntactic distance between words increases, consistent with the prediction that syntactic proximity drives interdependence. This pattern is clearer as words become further apart in the sequence, implicating non-syntactic influences, such as priming effects, that are stronger immediately following a word.
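The stratified aggregation behind Figure 9 can be sketched with pandas, assuming pairwise records have been collected under hypothetical column names:

    import pandas as pd

    def stratified_means(pairs: pd.DataFrame, min_cases: int = 100) -> pd.Series:
        # `pairs` holds one row per word pair, with (assumed) columns:
        # "interdependence", "syntactic_distance", "sequential_distance",
        # and "pos_class" ("open" or "closed").
        grouped = pairs.groupby(["sequential_distance", "pos_class", "syntactic_distance"])
        stats = grouped["interdependence"].agg(["mean", "count"])
        # Mirror Figure 9: report a mean only when at least `min_cases` pairs support it.
        return stats.loc[stats["count"] >= min_cases, "mean"]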
5 Discussion & Related Work

Humans learn by memorizing short rote phrases and later mastering the ability to construct deep syntactic trees from them (Lieven and Tomasello, 2008). LSTM models learn by backpropagation through time, which is unlikely to lead to the same inductive biases, the assumptions that define how the model generalizes from its training data. We might not expect an LSTM to exhibit similarly compositional learning behavior, building longer constituents out of shorter ones during training, but we present evidence in favor of such learning dynamics.

LSTMs have the theoretical capacity to encode a wide range of context-sensitive languages, but in practice their ability to learn such rules from data is limited (Weiss et al., 2018). Empirically, LSTMs encode the most recent noun as the subject of a verb by default, but they are still capable of learning to encode grammatical inflection from the first word in a sequence rather than the most recent (Ravfogel et al., 2019a). Therefore, while inductive biases inherent to the model play a critical role in the ability of an LSTM to learn effectively, they are neither necessary nor sufficient in determining what the model can learn. Hierarchical linguistic structure may be learned from data alone, or it may be a natural product of the training process, with neither hypothesis a foregone conclusion. We provide a more precise lens on how LSTM training is itself compositional.

While Saphra and Lopez (2019) illustrate LSTM learning dynamics expanding from representing short-range properties to long-range ones, Voita et al. (2019) present evidence of transformer models building long-range connections after shorter ones. However, it is not clear whether this compositionality is an inherent property of the model or an effect of hierarchical structure in the data.

There is a limited literature on compositionality as an inductive bias of neural networks. Saxe et al. (2018) explored how hierarchical ontologies are learned by following their tree structure in two-layer feedforward networks. Liu et al. (2018) showed that LSTMs take advantage of some inherent trait of language. The compositional training we have explored may be the mechanism behind this biased representational power.

Synthetic data, meanwhile, has formed the basis for analyzing the inductive biases of neural networks and their capacity to learn compositional rules. Common synthetic datasets include the Dyck languages (Suzgun et al., 2019; Skachkova et al., 2018), SPk languages (Mahalunkar and Kelleher, 2019), synthetic variants of natural language (Ravfogel et al., 2019b; Liu et al., 2018), and others (Mul and Zuidema, 2019; Liška et al., 2018; Korrel et al., 2019). Unlike these works, our synthetic task is not designed primarily to test the biases of the neural network or to improve its performance in a restricted setting, but to investigate the internal behavior of an LSTM in response to memorization.
[Figure 9: Mean interdependence (y-axis) between word pairs at varying syntactic distances (x-axis), stratified by whether the POS tags are closed or open class (line color) and by sequential distance (plot title). The y-axis ranges differ, but the scale is the same for all plots. Each mean is plotted only if there are at least 100 cases to average.]

Our results may offer insight into selecting training curricula. For feedforward language models, a curriculum-based approach has been shown to improve performance on small training data (Bengio et al., 2009). Because LSTMs seem to naturally learn short-range dependencies before long-range dependencies, it may be tempting to enhance this natural tendency with a curriculum. However, curricula that move from short sequences to long apparently fail to support more modern recurrent language models. Although these data schedules may help the model converge faster and improve performance early on (Zhang et al., 2017), after further training the model underperforms shuffled baselines (Zhang et al., 2018). Why? We propose the following explanation.

The application of a curriculum rests on the often unspoken assumption that the representation of a complex pattern can be reached more easily from a simpler pattern. However, we find that effectively representing shorter conduits actually makes a language model less effective at generalizing a long-range rule. Yet this less generalizable representation is still learned faster, which may be why Zhang et al. (2017) found higher performance after one epoch. Our work suggests that measures of length, including syntactic depth, may be inappropriate bases for curriculum learning.

6 Future Work

While we hope to isolate the role of long-range dependencies through synthetic data, we must consider the possibility that the natural predictability of language data differs in relevant ways from the synthetic data, in which the conduits are predictable only through pure memorization. Because LSTM models take advantage of linguistic structure, we cannot be confident that predictable natural language exhibits the same cell state dynamics that make a memorized, uniformly sampled conduit promote or inhibit long-range rule learning. Future work could test these findings through carefully selected natural language data rather than synthetic data.

Our natural language results suggest that interdependence could serve as a probe for testing syntax. Similar analyses of word interaction and nonlinearity may also be used to probe transformer models.

Some effects in our natural language experiments may be due to the predictable nature of English syntax, which favors right-branching behavior. Future work could apply a similar analysis to other languages with different grammatical word orders.

7 Conclusions

With synthetic experiments, we confirm that the longer the span of a rule, the more examples are required for an LSTM model to learn the rule effectively. We then find that a more predictable conduit between rule symbols promotes early learning of the rule but fails to generalize to new domains, implying that these memorized patterns lead to representations that depend heavily on interactions with the conduit instead of learning the long-distance rule in isolation. We develop a measure of interdependence to quantify this reliance on interactions. In natural language experiments, we find that higher interdependence is associated with words that are closer in the syntax tree, even when stratifying by sequential distance and part of speech.
References

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs Encode Soft Hierarchical Syntax. arXiv:1805.04218 [cs].

Remi Tachet des Combes, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and Yoshua Bengio. 2018. On the Learning Dynamics of Deep Neural Networks. arXiv:1809.06848 [cs, stat].

John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In NAACL.

Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2019. The compositionality of neural networks: integrating symbolism and connectionism. arXiv:1908.08351 [cs, stat].

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2017. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. arXiv:1711.10203 [cs].

Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment. arXiv:1909.08975.

Bhargav Kanuparthi, Devansh Arpit, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, and Yoshua Bengio. 2018. h-detach: Modifying the LSTM Gradient Towards Better Optimization. In ICLR.

Kris Korrel, Dieuwke Hupkes, Verna Dankers, and Elia Bruni. 2019. Transcoding compositionally: Using attention to find more generalizable solutions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 1–11, Florence, Italy. Association for Computational Linguistics.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. arXiv:1903.07435 [cs].

Elena Lieven and Michael Tomasello. 2008. Children's first language acquisition from a usage-based perspective. Routledge.

Adam Liška, Germán Kruszewski, and Marco Baroni. 2018. Memorize or generalize? Searching for a compositional RNN in a haystack. arXiv:1802.06467.

Nelson F. Liu, Omer Levy, Roy Schwartz, Chenhao Tan, and Noah A. Smith. 2018. LSTMs Exploit Linguistic Attributes of Data. arXiv:1805.11653 [cs].

Abhijit Mahalunkar and John Kelleher. 2019. Multi-element long distance dependencies: Using SPk languages to explore the characteristics of long-distance dependencies. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 34–43, Florence. Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv:1609.07843.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. arXiv:1806.05759 [cs, stat].

Mathijs Mul and Willem H. Zuidema. 2019. Siamese recurrent networks learn first-order logic reasoning and exhibit zero-shot compositional generalization. CoRR, abs/1906.00180.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. In ICLR.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019a. Studying the Inductive Biases of RNNs with Synthetic Variations of Natural Languages. arXiv:1903.06400 [cs].

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019b. Studying the inductive biases of RNNs with synthetic variations of natural languages. CoRR, abs/1903.06400.

Naomi Saphra and Adam Lopez. 2019. Understanding Learning Dynamics Of Language Models with SVCCA. In NAACL.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2018. A mathematical theory of semantic development in deep neural networks. arXiv:1810.10531 [cs, q-bio, stat].

Lloyd S. Shapley. 1953. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Natalia Skachkova, Thomas Trost, and Dietrich Klakow. 2018. Closing brackets with recurrent neural networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 232–239, Brussels, Belgium. Association for Computational Linguistics.

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In ACL.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. arXiv:1909.01380 [cs].

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. arXiv:1805.04908 [cs, stat].

Dakun Zhang, Jungi Kim, Josep Crego, and Jean Senellart. 2017. Boosting Neural Machine Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 271–276, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An Empirical Exploration of Curriculum Learning for Neural Machine Translation. arXiv:1811.00739 [cs].
A The Effect of Rule Frequency and Length

We investigate how the frequency of a rule affects the ability of the model to learn that rule. We vary the number of rule occurrences n and the rule length k. The results in Figure 10 illustrate how a longer conduit length requires more examples before the model can learn the corresponding rule. We consider the probability assigned to the close symbol according to the contributions of the open symbol, excluding interaction with any other token in the sequence. For contrast, we also show the extremely low probability assigned to the close symbol according to the contributions of the conduit taken as an entire phrase. In particular, note the pattern when the rule is extremely rare: the probability of the close symbol as determined by the open symbol is low but steady, while the probability as determined by the conduit declines with conduit length, due to the accumulated low probabilities from each element in the sequence.

B Smaller conduit gradient, faster rule learning

Figure 11 confirms that a predictable conduit is associated with a smaller error gradient. Because of the mechanics of backpropagation through time, described next, this setting will teach the α/ω rule faster.

Formally, in a simple RNN, the gradient of the error e_t at timestep t, backpropagated k timesteps through the hidden state h, is:

    ∂e_t/∂h_{t−k} = (∂e_t/∂h_t) Π_{i=1}^{k} ∂h_{t−i+1}/∂h_{t−i}

The backpropagated message is multiplied repeatedly by the gradient at each timestep in the conduit. If the recurrence derivatives ∂h_{t−i+1}/∂h_{t−i} are large at some weight, the correspondingly larger backpropagated gradient ∂e_t/∂h_{t−k} will accelerate descent at that parameter. In other words, an unpredictable conduit, associated with a high error, will dominate the gradient's sum over recurrences, delaying the acquisition of the symbol-matching rule. In the case of an LSTM, Kanuparthi et al. (2018) expressed the backpropagated gradient as an iterated addition of the error from each timestep, leading to a similar effect.
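To make the product concrete, consider a vanilla RNN of the form h_t = tanh(W x_t + U h_{t−1} + b); this is a simplifying assumption on our part, since the experiments above use an LSTM, for which Kanuparthi et al. (2018) derive an additive form instead. Each factor then expands as

    ∂h_{t−i+1}/∂h_{t−i} = diag(1 − h_{t−i+1}²) U
    ∂e_t/∂h_{t−k} = (∂e_t/∂h_t) Π_{i=1}^{k} diag(1 − h_{t−i+1}²) U

so the saturation of the conduit's hidden units at each step directly scales the message that reaches the open symbol.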
[Figure 10: The predicted probability P(x_t = ω), according to the contributions of the open symbol x_{t−k} = α and of the conduit sequence x_{t−k+1} ... x_{t−1}, for various rule occurrence counts n. Shown at 40 epochs.]

[Figure 11: Average gradient magnitude at timestep t−k+d, varying d up to the length of the conduit. Solid lines are the unpredictable-conduit setting, dashed lines the predictable-conduit setting.]