Word Interdependence Exposes How LSTMs Compose Representations
Naomi Saphra and Adam Lopez
University of Edinburgh
n.saphra@ed.ac.uk, alopez@inf.ed.ac.uk

arXiv:2004.13195v1 [cs.CL] 27 Apr 2020

Abstract

Recent work in NLP shows that LSTM language models capture compositional structure in language data. For a closer look at how these representations are composed hierarchically, we present a novel measure of interdependence between word meanings in an LSTM, based on their interactions at the internal gates. To explore how compositional representations arise over training, we conduct simple experiments on synthetic data, which illustrate our measure by showing how high interdependence can hurt generalization. These synthetic experiments also illustrate a specific hypothesis about how hierarchical structures are discovered over the course of training: that parent constituents rely on effective representations of their children, rather than learning long-range relations independently. We further support this measure with experiments on English language data, where interdependence is higher for more closely syntactically linked word pairs.

1 Introduction

Recent work in NLP has seen a flurry of interest in the question: are the representations learned by neural networks compositional? That is, are representations of longer phrases built recursively from representations of shorter phrases, as they are in many linguistic theories? If so, how and when do they learn to do this?

Computationally, sequence models like LSTMs scan a sentence from left to right, accumulating meaning into a hidden representation at each time step. Yet we have extensive evidence that fully trained LSTMs are sensitive to syntactic structure, suggesting that they learn something about recursive composition of meaning. For example, they can recall more history in natural language data than in similarly Zipfian-distributed n-gram data, implying that they exploit long-distance dependencies (Liu et al., 2018). Their internal representations seem to be hierarchical in nature (Blevins et al., 2018; Hupkes et al., 2017). They seemingly encode knowledge of part of speech (Belinkov et al., 2017), morphological productivity (Vania and Lopez, 2017), and verb agreement (Lakretz et al., 2019). How does this apparently compositional behavior arise in learning?

Concretely, we are interested in an aspect of compositionality sometimes called localism (Hupkes et al., 2019), in which the meanings of long sequences are recursively composed from the meanings of shorter child sequences, without regard for how the child meaning is itself constructed; that is, the computation of the composed meaning relies only on the local properties of the child meanings. By contrast, a global composition would be constructed from all words. A local composition operation leads to hierarchical structure (like a classic syntax tree), whereas a global operation leads to flat structure. If meaning is composed locally, then in a sentence like "The chimney sweep has sick lungs", the unknown composition function f might reflect syntactic structure, computing the full meaning as f(f(The, chimney, sweep), has, f(sick, lungs)). Local composition assumes low interdependence between the meanings of "chimney" and "has", or indeed between any pair of words not local to the same invocation of f.

To analyze compositionality, we propose a measure of word interdependence that directly measures the composition of meaning in LSTMs through the interactions between words (Section 2). Our method builds on Contextual Decomposition (CD; Murdoch et al., 2018), a tool for analyzing the representations produced by LSTMs. We conduct experiments on a synthetic corpus (Section 3), which illustrate interdependence and find that highly familiar constituents make nearby vocabulary statistically dependent on them, leaving them vulnerable to domain shift. We then relate word interdependence in an English language corpus (Section 4) to syntax, finding that word pairs with close syntactic links have higher interdependence than more distantly linked words, even when stratifying by sequential distance and part of speech. This pattern offers a potential structural probe that can be computed directly from an LSTM, without learning the additional parameters required by other methods (Hewitt and Manning, 2019).
2 Methods

We now introduce our interdependence measure, a natural extension of Contextual Decomposition (CD; Murdoch et al., 2018), a tool for analyzing the representations produced by LSTMs. To conform with Murdoch et al. (2018), all experiments use a one-layer LSTM, with inputs taken from an embedding layer and outputs processed by a softmax layer.

2.1 Contextual Decomposition

Let us say that we need to determine when our language model has learned that "either" implies an appearance of "or" later in the sequence. We consider an example sentence, "Either Socrates is mortal or not". Because many nonlinear functions are applied in the intervening span "Socrates is mortal", it is difficult to directly measure the influence of "either" on the later occurrence of "or". To dissect the sequence and understand the impact of its individual elements, we can employ CD.

CD is a method for measuring the individual influence that words and phrases in a sequence have on the output of a recurrent model. Illustrated in Figure 1, CD decomposes the activation vector produced by an LSTM layer into a sum of relevant and irrelevant parts. The relevant part is the contribution of the phrase or set of words in focus, i.e., a set of words whose impact we want to measure. We denote this contribution as β. The irrelevant part includes the contribution of all words not in that set (denoted β̄), as well as the interactions between the relevant and irrelevant words. For an output hidden state vector h at any particular timestep, CD decomposes it into two vectors, the relevant h_β and the irrelevant h_β̄ (which absorbs the interactions between β and β̄), such that:

    h ≈ h_β + h_β̄

[Figure 1: CD uses linear approximations of gate operations to linearize the sequential application of the LSTM module.]

Because the individual contributions of the items in a sequence interact in nonlinear ways, this decomposition is only an approximation and cannot exactly compute the impact of a specific word or words on the predicted label. However, the dynamics of LSTMs are roughly linear in natural settings, as found by Morcos et al. (2018), who observed close linear projections between the activations at each timestep in a repeating sequence and the activations at the end of the sequence. This observation allows CD to linearize hidden states with low approximation error, but the slight nonlinearity that remains forms the basis for our measure of interdependence later on.

The decomposed form is achieved by linearizing the contribution of the words in focus. This is necessarily approximate, because the internal gating mechanisms of an LSTM each employ a nonlinear activation function, either σ or tanh. Murdoch et al. (2018) use a linearized approximation L_σ for σ and a linearized approximation L_tanh for tanh, such that for an arbitrary sum of inputs y_1, ..., y_N:

    σ(y_1 + ... + y_N) = L_σ(y_1) + ... + L_σ(y_N)    (1)

These approximations are then used to split each gate into components contributed by the previous hidden state h_{t−1} and by the current input x_t, for example the input gate i_t:

    i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
        ≈ L_σ(W_i x_t) + L_σ(V_i h_{t−1}) + L_σ(b_i)

The linear form L_σ is achieved by computing the Shapley value (Shapley, 1953) of each summand, defined as the average difference resulting from excluding that summand, taken over all possible orderings of the input summands. Applying Formula 1 to σ(y_1 + y_2), the linear approximation of the isolated effect of the summand y_1 is:

    L_σ(y_1) = 1/2 [ (σ(y_1) − σ(0)) + (σ(y_1 + y_2) − σ(y_2)) ]

With this function, we can take a hidden state from the previous timestep, decomposed as h_{t−1} ≈ h^β_{t−1} + h^β̄_{t−1}, and add x_t to the appropriate component. For example, if x_t is in focus, we count it among the relevant inputs when computing the input gate:

    i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
        ≈ σ(W_i x_t + V_i (h^β_{t−1} + h^β̄_{t−1}) + b_i)
        ≈ [ L_σ(W_i x_t + V_i h^β_{t−1}) + L_σ(b_i) ] + L_σ(V_i h^β̄_{t−1})
        = i^β_t + i^β̄_t
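To make the linearization concrete, here is a minimal NumPy sketch of the Shapley-style split (a sketch of our own; the function names are not from the Murdoch et al. code release). It averages each summand's marginal effect on σ over all orderings and checks the two-summand closed form above.

    import itertools
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def linearized_sigmoid(summands):
        # Shapley-style linearization (cf. Formula 1): assign each summand the
        # average marginal change it causes in sigmoid(partial sum), taken over
        # every ordering of the summands.
        n = len(summands)
        shares = [0.0] * n
        perms = list(itertools.permutations(range(n)))
        for perm in perms:
            running = 0.0
            for idx in perm:
                shares[idx] += sigmoid(running + summands[idx]) - sigmoid(running)
                running += summands[idx]
        return [s / len(perms) for s in shares]

    # Two-summand case from the text.
    y1, y2 = 1.5, -0.3
    L_y1, _ = linearized_sigmoid([y1, y2])
    closed_form = 0.5 * ((sigmoid(y1) - sigmoid(0.0)) + (sigmoid(y1 + y2) - sigmoid(y2)))
    print(L_y1, closed_form)  # identical: the permutation average recovers the closed form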
We can use softmax to convert the relevant logits v_β into a probability distribution, P(Y | x_β) = softmax(v_β). This allows us to analyze the effect of the input x_β on the prediction at a later element while controlling for the influence of the rest of the sequence.

In our analyses, CD yielded an approximation error ‖(v_β + v_β̄) − v‖ / ‖v‖ < 10^-5 at the logits. However, this measurement misses another source of approximation error: the allocation of credit between β and the interactions between β and β̄. Changing the out-of-focus sequence β̄ might influence v_β, for example, even though the contribution of the out-of-focus words should be mostly confined to the irrelevant vector component. This approximation error is crucial because the component attributed to the interactions is central to our measure of interdependence.

2.2 Interdependence

We frame compositionality in terms of whether the meanings of a pair of words or word subsets can be treated independently. For example, a "slice of cake" can be broken into the individual meanings of "slice", "of", and "cake", but an idiomatic expression such as "piece of cake", meaning a simple task, cannot be broken into the individual meanings of "piece", "of", and "cake". The words in the idiom have higher interdependence, or reliance on their interactions to build meaning. Another influence on interdependence should be syntactic relation; if you "happily eat a slice of cake", the meaning of "cake" does not depend on "happily", which modifies "eat" and is far from "cake" on the syntactic tree. We will use the nonlinear interactions in Contextual Decomposition to analyze the interdependence between words alternately considered in focus.

Generally, CD considers all nonlinear interactions between the relevant and irrelevant sets of words to fall under the irrelevant contribution, although other allocations of the interactions have been proposed (Jumelet et al., 2019). A fully flat structure for building meaning would lead to a contextual representation that breaks into a linear sum of each word's meaning, which is the simplifying assumption at the heart of CD.

Given two interacting sets of words to potentially designate as the β in focus, A and B with A ∩ B = ∅, we use a measure of interdependence to quantify the degree to which the meaning of A ∪ B can be broken into their individual meanings. With v_A and v_B denoting the relevant contributions of A and B according to CD, and v_{A∪B} the relevant contribution of A ∪ B, we compute the magnitude of the nonlinear interactions, rescaled to control for the magnitude of the representation:

    interdependence(A, B) = ‖v_{A∪B} − (v_A + v_B)‖_2 / ‖v_{A∪B}‖_2

This quantity is related to probabilistic independence. We would say that events X and Y are independent if their joint probability P(X, Y) = P(X)P(Y). Likewise, the meanings of A and B can be called independent if v_{A∪B} = v_A + v_B.
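As a sketch of how this measure can be computed, the snippet below assumes a helper cd_relevant(model, tokens, focus_positions) that returns the CD-relevant contribution to the output logits (v_β) for a given set of in-focus positions; that helper is a hypothetical stand-in for any Contextual Decomposition implementation, not a specific library call.

    import numpy as np

    def interdependence(model, tokens, positions_a, positions_b, cd_relevant):
        # positions_a and positions_b are disjoint sets of token positions.
        # cd_relevant is the (hypothetical) CD routine described above.
        v_a = cd_relevant(model, tokens, positions_a)
        v_b = cd_relevant(model, tokens, positions_b)
        v_ab = cd_relevant(model, tokens, positions_a | positions_b)
        # Magnitude of the nonlinear A-B interaction, rescaled by the joint magnitude.
        return np.linalg.norm(v_ab - (v_a + v_b)) / np.linalg.norm(v_ab)

    # Example call for two adjacent words in a tokenized sentence:
    # interdependence(model, sentence, {1}, {2}, cd_relevant)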
3 Synthetic Experiments

Our first experiments use synthetic data to understand the role of compositionality in LSTM learning dynamics. These dynamics see long-range connections discovered after short-range connections; in particular, document-level content topic information is preserved much later in training than local information like part of speech (Saphra and Lopez, 2019). There are several explanations for this phenomenon.

First, long-range connections are less consistent (particularly in a right-branching language like English). For example, the pattern of a determiner followed by a noun will appear very frequently, as in "the man". However, we will less frequently see long-range connections like the either/or in "Either Socrates is mortal or not". Rarer patterns are learned slowly (Appendix A).
[Figure 2: Example sequences from (a) the unfamiliar-conduit training set, (b) the familiar-conduit training set, (c) the out-of-domain test set, and (d) the in-domain test set. Rule boundaries α and ω are highlighted in red, and the conduit q ∈ Q_k in green.]

The following experiments are designed to explore a third possibility: that the training process is inherently compositional. That is, the shorter sequences must be learned first in order to form the basis for longer relations learned around them. The compositional view of training is not a given and must be verified. In fact, simple rules learned early on might inhibit the learning of more complex rules through the phenomenon of gradient starvation (Combes et al., 2018), in which more frequent features dominate the gradient directed at rarer features. Shorter familiar patterns could slow down the learning of longer-range patterns by degrading the gradient passed through them, or by trapping the model in a local minimum that makes the long-distance rule harder to reach. However, if the training process builds syntactic patterns hierarchically, it can lead to representations that are built hierarchically at inference time, reflecting linguistic structure. To test the idea of a compositional training process, we use synthetic data that controls for the consistency and frequency of longer-range relations.

3.1 The dataset

We may expect representations of long-range connections to be built in a way that depends strongly on the subtrees they span, but not all such connections rely on subtrees. If connections that do not rely on shorter constituents are nonetheless built from those constituents, then the observation over training that increasingly long connections are gradually constructed from constituents cannot be dismissed as purely a data effect. For example, consider "either/or". "Either" should determine that "or" will later occur, regardless of the phrase that intercedes them. To learn this rule, a language model must backpropagate information from the occurrence of "or" through the intervening sequence of words, which we will call a conduit. Perhaps the model encounters a training example whose conduit is predictable because it is structured in familiar ways: "Either Socrates is mortal or not". But what if the conduit is unfamiliar and its structure cannot be interpreted by the model, for example if it includes unknown tokens: "Either slithy toves gyre or not"? How will the gradient carried from "or" to "either" be shaped by the conduit, and how will the representation of that long-range connection change accordingly? A familiar conduit could be used by a compositional training process as a short constituent on which to build longer-range representations, so the meaning of "Either" in context would depend on the conduit. Conversely, if training is not biased to be compositional, the connection will be made regardless, so the rule will generalize to test data. To investigate whether long-range dependencies are built from short constituents in this way, we train models on synthetic data which varies the predictability of short sequences.

We generate data uniformly at random from a vocabulary Σ. We insert n instances of the long-distance rule αΣ^k ω, with conduit Σ^k of length k, open symbol α, and close symbol ω, where α, ω ∉ Σ. Relating to our running example, α stands for "either" and ω stands for "or". We use a corpus of 1m tokens with |Σ| = 1k types, which leaves a low probability that any conduit sequence longer than one token appears elsewhere by chance. We train with a learning rate of 1 throughout and gradients clipped at 0.25. We found momentum and weight decay to slow rule learning in this setting, so they are not used.
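A minimal sketch of how such a corpus could be generated is shown below. The corpus size, vocabulary size, rule count, and phrase-bank size follow the numbers in the text; sampling details that the text leaves open (such as how rule positions are drawn, or how collisions are handled) are our own choices.

    import random

    def make_corpus(n_rules=1000, k=4, vocab_size=1000, corpus_len=1_000_000,
                    familiar=False, n_phrases=100, extra_per_phrase=1000, seed=0):
        rng = random.Random(seed)
        sigma = [f"w{i}" for i in range(vocab_size)]   # background vocabulary Sigma
        alpha, omega = "<alpha>", "<omega>"            # rule symbols, not in Sigma

        phrases = [[rng.choice(sigma) for _ in range(k)] for _ in range(n_phrases)]
        def conduit():
            # Familiar setting: draw from a fixed bank of phrases; otherwise fresh.
            return list(rng.choice(phrases)) if familiar else [rng.choice(sigma) for _ in range(k)]

        tokens = [rng.choice(sigma) for _ in range(corpus_len)]
        for pos in rng.sample(range(corpus_len - (k + 2)), n_rules):
            tokens[pos:pos + k + 2] = [alpha] + conduit() + [omega]   # insert "alpha q omega"

        if familiar:
            # Scatter each conduit phrase outside the rule so it can be memorized.
            for phrase in phrases:
                for pos in rng.sample(range(corpus_len - k), extra_per_phrase):
                    span = tokens[pos:pos + k]
                    if alpha not in span and omega not in span:       # crude overlap guard
                        tokens[pos:pos + k] = phrase
        return tokens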
[Figure 3: Mean marginal target probability of the close symbol in a rule, in (a) the in-domain conduit test setting and (b) the random conduit test setting. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set. The y-axis scale is matched between graphs.]

3.2 The Effect of Conduit Familiarity

To understand the effect of conduit predictability on longer-range connections, we modify the original synthetic data (Figure 2a) so that each conduit appears frequently outside of the α/ω rule (Figure 2b). The conduits are sampled from a randomly generated vocabulary of 100 phrases of length k, so each unique conduit q appears in the training set 10 times in the context αqω. This repetition is necessary in order to fit 1000 occurrences of the rule in all settings. In the familiar-conduit setting, we randomly distribute 1000 occurrences of each conduit throughout the corpus, outside of the rule patterns. Therefore each conduit is seen often enough to be memorized (see Appendix B). In the original unfamiliar-conduit setting, q appears only in this context as a conduit, so the conduit is not memorized.

We also use two distinct test sets. Our in-domain test set (Figure 2d) uses the same set of conduits as the training set. In Figure 3a, the model learns when to predict the close symbol faster if the conduits are familiar (as predicted in Appendix B). However, the maximum performance reached is lower than in the unfamiliar setting, and it is brittle to overtraining.

If the test set conduits are instead sampled uniformly at random (Figure 2c), Figure 3b shows that the familiar-conduit training setting never teaches the model to generalize the α/ω rule. For a model trained on the familiar domain, a familiar conduit is required to predict the close symbol.

[Figure 4: Mean target probability of ω using CD with α in focus, on the out-of-domain test set. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

3.2.1 Isolating the Effect of the Open Symbol

Raw predictions in the out-of-domain test setting appear to suggest that the familiar-conduit training setting fails to teach the model to associate α and ω. However, the changing domain makes this an unfair assertion: the poor performance may be attributed to interactions between the open symbol and the conduit. In order to control for the role that memorization of the conduits plays in the prediction of the close symbol, we use CD to isolate the contributions of the open symbol in the random-conduit test setting.
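The measurement behind Figures 4 and 5 can be sketched as follows, reusing the hypothetical cd_relevant helper from Section 2; this reflects our reading of the setup rather than the authors' released code.

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def close_symbol_prob(model, tokens, omega_id, cd_relevant, prefix_len=0):
        # tokens: the sequence up to (but not including) the position where omega
        # should be predicted, i.e. [alpha, q_1, ..., q_k].
        # Focus = the open symbol plus the first `prefix_len` conduit tokens;
        # prefix_len = 0 isolates alpha alone (Figure 4), and sweeping prefix_len
        # traces out curves like those in Figure 5.
        focus = set(range(prefix_len + 1))
        v_rel = cd_relevant(model, tokens, focus)   # relevant logits at the final step
        return softmax(v_rel)[omega_id]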
[Figure 5: The predicted P(x_t = ω | x_{t−k} ... x_{t−k+i}) according to CD, varying i on the x-axis, with x_{t−k} = α and k = 8. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

[Figure 6: Mean interdependence between the open symbol and the conduit on the in-domain test set. Solid lines are trained on the unfamiliar-conduit set, dashed lines on the familiar-conduit set.]

Figure 4 shows that even in the out-of-domain test setting, the presence of α predicts ω at the appropriate time step. Furthermore, we confirm that the familiar-conduit training setting enables earlier acquisition of this rule.

To what, then, can we attribute the failure to generalize to the random-conduit domain? Figure 5 illustrates how the unfamiliar-conduit model predicts the close symbol ω with high probability based only on the contributions of the open symbol α. Meanwhile, the familiar-conduit model's probability increases substantially with each symbol consumed until the end of the conduit, indicating that the model relies on interactions between the open symbol and the conduit rather than registering only the effect of the open symbol. Note that this effect cannot be because the conduit is more predictive of ω: because each conduit appears frequently outside of the specific context of the rule in the familiar-conduit setting, the conduit is less predictive of ω on distributional grounds alone.

These results indicate that predictable patterns play a vital role in shaping the representations of symbols around them, by composing in a way that cannot be easily linearized as a sum of the component parts. In particular, as seen in Figure 6, the interdependence between the open symbol and the conduit is substantially higher for the familiar-setting model and increases throughout training. Long-range connections are not learned independently of conduit representations, but are built compositionally, using already-familiar shorter subsequences as scaffolding.

4 English Language Experiments

We now apply our measure of interdependence to a natural language setting. In natural language, disentangling the meaning of individual words requires contextual information which is hierarchically composed. For example, in the sentence "The chimney sweep has sick lungs", "chimney sweep" has a clear definition and strong connotations that are less evident in each word individually. However, knowing that "sweep" and "sick" co-occur is not sufficient to clarify the meaning and connotations of either word, or to compose a shared meaning. Does interdependence effectively express this syntactic link?

These experiments use language models trained on WikiText-2 (Merity et al., 2016), run on the Universal Dependencies corpus English-EWT (Silveira et al., 2014).

[Figure 7: A dependency-parsed sentence: "The chimney sweep has sick lungs", with arcs labeled ROOT, DET, AMOD, NMOD, NSUBJ, and OBJ.]

4.1 Interdependence and Syntax

To assess the connection between interdependence and syntax, we consider the interdependence of word pairs at different syntactic distances. For example, in Figure 7, "chimney" is one edge away from "sweep", two from "has", and four from "sick".
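Syntactic distance here is simply the number of dependency edges between two tokens. A small sketch (our own breadth-first search over head indices, with head assignments chosen to match the distances quoted above) is:

    from collections import defaultdict, deque

    def syntactic_distance(heads, i, j):
        # heads[t] is the 0-based index of token t's head (None for the root),
        # e.g. converted from a CoNLL-U parse such as English-EWT.
        graph = defaultdict(set)
        for child, head in enumerate(heads):
            if head is not None:
                graph[child].add(head)
                graph[head].add(child)
        queue, seen = deque([(i, 0)]), {i}
        while queue:
            node, dist = queue.popleft()
            if node == j:
                return dist
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None  # disconnected; should not happen for a well-formed parse

    # "The chimney sweep has sick lungs", heads as in Figure 7.
    heads = [2, 2, 3, None, 5, 3]
    print(syntactic_distance(heads, 1, 2))  # chimney -> sweep: 1
    print(syntactic_distance(heads, 1, 4))  # chimney -> sick: 4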
[Figure 8: Average interdependence between word pairs x_l, x_r at different sequential distances r − l.]

In Figure 8, we see that, in general, the closer two words occur in sequence, the more they influence each other, leading to correspondingly high interdependence. Because proximity is a dominant factor in interdependence, we control for the sequential distance of words when we investigate syntactic distance.

The synthetic data experiments show that phrase frequency and predictability play a critical role in determining interdependence (although raw word frequency shows no clear correlation with interdependence in English). We control for these properties through POS tags, as open and closed tag classes vary in their predictability in context; for example, determiners (a closed POS class) are almost always soon followed by a noun, but adjectives (an open POS class) appear in many constructions, like "Socrates is mortal", where they are not. We stratify the data in Figure 9 according to whether the POS tags are in closed or open classes, which serves as a proxy for predictability. Irrespective of both sequential distance and part of speech, we see broadly decreasing trends in interdependence as the syntactic distance between words increases, consistent with the prediction that syntactic proximity drives interdependence. This pattern is clearer as words become further apart in the sequence, implicating non-syntactic influences, such as priming effects, that are stronger immediately following a word.
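The stratified aggregation behind Figure 9 can be sketched with pandas, assuming pairwise records have been collected under hypothetical column names:

    import pandas as pd

    def stratified_means(pairs: pd.DataFrame, min_cases: int = 100) -> pd.Series:
        # `pairs` holds one row per word pair, with (assumed) columns:
        # "interdependence", "syntactic_distance", "sequential_distance",
        # and "pos_class" ("open" or "closed").
        grouped = pairs.groupby(["sequential_distance", "pos_class", "syntactic_distance"])
        stats = grouped["interdependence"].agg(["mean", "count"])
        # Mirror Figure 9: report a mean only when at least `min_cases` pairs support it.
        return stats.loc[stats["count"] >= min_cases, "mean"]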
5 Discussion & Related Work

Humans learn by memorizing short rote phrases and later mastering the ability to construct deep syntactic trees from them (Lieven and Tomasello, 2008). LSTM models learn by backpropagation through time, which is unlikely to lead to the same inductive biases, the assumptions that define how the model generalizes from its training data. We might not expect an LSTM to exhibit similarly compositional learning behavior, building longer constituents out of shorter ones during training, but we present evidence in favor of such learning dynamics.

LSTMs have the theoretical capacity to encode a wide range of context-sensitive languages, but in practice their ability to learn such rules from data is limited (Weiss et al., 2018). Empirically, LSTMs encode the most recent noun as the subject of a verb by default, but they are still capable of learning to encode grammatical inflection from the first word in a sequence rather than the most recent (Ravfogel et al., 2019a). Therefore, while inductive biases inherent to the model play a critical role in the ability of an LSTM to learn effectively, they are neither necessary nor sufficient in determining what the model can learn. Hierarchical linguistic structure may be learned from data alone, or it may be a natural product of the training process, with neither hypothesis a foregone conclusion. We provide a more precise lens on how LSTM training is itself compositional.

While Saphra and Lopez (2019) illustrate LSTM learning dynamics expanding from representing short-range properties to long-range ones, Voita et al. (2019) present evidence of transformer models building long-range connections after shorter ones. However, it is not clear whether this compositionality is an inherent property of the model or an effect of hierarchical structure in the data.

There is a limited literature on compositionality as an inductive bias of neural networks. Saxe et al. (2018) explored how hierarchical ontologies are learned by following their tree structure in two-layer feedforward networks. Liu et al. (2018) showed that LSTMs take advantage of some inherent trait of language. The compositional training we have explored may be the mechanism behind this biased representational power.

Synthetic data, meanwhile, has formed the basis for analyzing the inductive biases of neural networks and their capacity to learn compositional rules. Common synthetic datasets include the Dyck languages (Suzgun et al., 2019; Skachkova et al., 2018), SPk languages (Mahalunkar and Kelleher, 2019), synthetic variants of natural language (Ravfogel et al., 2019b; Liu et al., 2018), and others (Mul and Zuidema, 2019; Liška et al., 2018; Korrel et al., 2019). Unlike these works, our synthetic task is not designed primarily to test the biases of the neural network or to improve its performance in a restricted setting, but to investigate the internal behavior of an LSTM in response to memorization.
[Figure 9: Mean interdependence (y-axis) between word pairs at varying syntactic distances (x-axis), stratified by whether the POS tags are closed or open class (line color) and by sequential distance (plot title). The y-axis ranges differ, but the scale is the same for all plots. Each mean is plotted only if there are at least 100 cases to average.]

Our results may offer insight into selecting training curricula. For feedforward language models, a curriculum-based approach has been shown to improve performance on small training data (Bengio et al., 2009). Because LSTMs seem to naturally learn short-range dependencies before long-range dependencies, it may be tempting to enhance this natural tendency with a curriculum. However, curricula that move from short sequences to long apparently fail to support more modern recurrent language models. Although these data schedules may help the model converge faster and improve performance early on (Zhang et al., 2017), after further training the model underperforms shuffled baselines (Zhang et al., 2018). Why? We propose the following explanation.

The application of a curriculum rests on the often unspoken assumption that the representation of a complex pattern can be reached more easily from a simpler pattern. However, we find that effectively representing shorter conduits actually makes a language model less effective at generalizing a long-range rule. Yet this less generalizable representation is still learned faster, which may be why Zhang et al. (2017) found higher performance after one epoch. Our work suggests that measures of length, including syntactic depth, may be inappropriate bases for curriculum learning.

6 Future Work

While we hope to isolate the role of long-range dependencies through synthetic data, we must consider the possibility that the natural predictability of language data differs in relevant ways from the synthetic data, in which the conduits are predictable only through pure memorization. Because LSTM models take advantage of linguistic structure, we cannot be confident that predictable natural language exhibits the same cell state dynamics that make a memorized, uniformly sampled conduit promote or inhibit long-range rule learning. Future work could test these findings through carefully selected natural language data rather than synthetic data.

Our natural language results suggest that interdependence could serve as a probe for testing syntax. Similar analyses of word interaction and nonlinearity may also be used to probe transformer models.

Some effects in our natural language experiments may be due to the predictable nature of English syntax, which favors right-branching behavior. Future work could apply a similar analysis to other languages with different grammatical word orders.

7 Conclusions

With synthetic experiments, we confirm that the longer the span of a rule, the more examples are required for an LSTM model to learn the rule effectively. We then find that a more predictable conduit between rule symbols promotes early learning of the rule but fails to generalize to new domains, implying that these memorized patterns lead to representations that depend heavily on interactions with the conduit instead of learning the long-distance rule in isolation. We develop a measure of interdependence to quantify this reliance on interactions. In natural language experiments, we find that higher interdependence is associated with words that are closer in the syntax tree, even when stratifying by sequential distance and part of speech.
References

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada. Association for Computational Linguistics.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs Encode Soft Hierarchical Syntax. arXiv:1805.04218 [cs].

Remi Tachet des Combes, Mohammad Pezeshki, Samira Shabanian, Aaron Courville, and Yoshua Bengio. 2018. On the Learning Dynamics of Deep Neural Networks. arXiv:1809.06848 [cs, stat].

John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In NAACL.

Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. 2019. The compositionality of neural networks: integrating symbolism and connectionism. arXiv:1908.08351 [cs, stat].

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2017. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. arXiv:1711.10203 [cs].

Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing Neural Language Models: Contextual Decomposition Reveals Default Reasoning in Number and Gender Assignment. arXiv:1909.08975.

Bhargav Kanuparthi, Devansh Arpit, Giancarlo Kerg, Nan Rosemary Ke, Ioannis Mitliagkas, and Yoshua Bengio. 2018. h-detach: Modifying the LSTM Gradient Towards Better Optimization. In ICLR.

Kris Korrel, Dieuwke Hupkes, Verna Dankers, and Elia Bruni. 2019. Transcoding compositionally: Using attention to find more generalizable solutions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 1–11, Florence, Italy. Association for Computational Linguistics.

Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. 2019. The emergence of number and syntax units in LSTM language models. arXiv:1903.07435 [cs].

Elena Lieven and Michael Tomasello. 2008. Children's first language acquisition from a usage-based perspective. Routledge.

Adam Liška, Germán Kruszewski, and Marco Baroni. 2018. Memorize or generalize? Searching for a compositional RNN in a haystack. arXiv:1802.06467.

Nelson F. Liu, Omer Levy, Roy Schwartz, Chenhao Tan, and Noah A. Smith. 2018. LSTMs Exploit Linguistic Attributes of Data. arXiv:1805.11653 [cs].

Abhijit Mahalunkar and John Kelleher. 2019. Multi-element long distance dependencies: Using SPk languages to explore the characteristics of long-distance dependencies. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 34–43, Florence. Association for Computational Linguistics.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv:1609.07843.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. arXiv:1806.05759 [cs, stat].

Mathijs Mul and Willem H. Zuidema. 2019. Siamese recurrent networks learn first-order logic reasoning and exhibit zero-shot compositional generalization. CoRR, abs/1906.00180.

W. James Murdoch, Peter J. Liu, and Bin Yu. 2018. Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs. In ICLR.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019a. Studying the Inductive Biases of RNNs with Synthetic Variations of Natural Languages. arXiv:1903.06400 [cs].

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019b. Studying the inductive biases of RNNs with synthetic variations of natural languages. CoRR, abs/1903.06400.

Naomi Saphra and Adam Lopez. 2019. Understanding Learning Dynamics Of Language Models with SVCCA. In NAACL.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2018. A mathematical theory of semantic development in deep neural networks. arXiv:1810.10531 [cs, q-bio, stat].

Lloyd S. Shapley. 1953. A value for n-person games. Contributions to the Theory of Games, 2(28):307–317.

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

Natalia Skachkova, Thomas Trost, and Dietrich Klakow. 2018. Closing brackets with recurrent neural networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 232–239, Brussels, Belgium. Association for Computational Linguistics.

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.

Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In ACL.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives. arXiv:1909.01380 [cs].

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. arXiv:1805.04908 [cs, stat].

Dakun Zhang, Jungi Kim, Josep Crego, and Jean Senellart. 2017. Boosting Neural Machine Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 271–276, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup, Marianna J. Martindale, Paul McNamee, Kevin Duh, and Marine Carpuat. 2018. An Empirical Exploration of Curriculum Learning for Neural Machine Translation. arXiv:1811.00739 [cs].
A The Effect of Rule Frequency and Length

We investigate how the frequency of a rule affects the ability of the model to learn that rule. We vary the number of rule occurrences n and the rule length k. The results in Figure 10 illustrate how a longer conduit length requires more examples before the model can learn the corresponding rule. We consider the probability assigned to the close symbol according to the contributions of the open symbol, excluding interaction with any other token in the sequence. For contrast, we also show the extremely low probability assigned to the close symbol according to the contributions of the conduit taken as an entire phrase. In particular, note the pattern when the rule is extremely rare: the probability of the close symbol as determined by the open symbol is low but steady, while the probability as determined by the conduit declines with conduit length, due to the accumulated low probabilities from each element in the sequence.

B Smaller conduit gradient, faster rule learning

Figure 11 confirms that a predictable conduit is associated with a smaller error gradient. Because of the mechanics of backpropagation through time, described next, this setting will teach the α/ω rule faster.

Formally, in a simple RNN, the gradient of the error e_t at timestep t, backpropagated k timesteps through the hidden state h, is:

    ∂e_t/∂h_{t−k} = (∂e_t/∂h_t) Π_{i=1}^{k} ∂h_{t−i+1}/∂h_{t−i}

The backpropagated message is multiplied repeatedly by the gradient at each timestep in the conduit. If the recurrence derivatives ∂h_{t−i+1}/∂h_{t−i} are large at some weight, the correspondingly larger backpropagated gradient ∂e_t/∂h_{t−k} will accelerate descent at that parameter. In other words, an unpredictable conduit, associated with a high error, will dominate the gradient's sum over recurrences, delaying the acquisition of the symbol-matching rule. In the case of an LSTM, Kanuparthi et al. (2018) expressed the backpropagated gradient as an iterated addition of the error from each timestep, leading to a similar effect.
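To make the product concrete, consider a vanilla RNN of the form h_t = tanh(W x_t + U h_{t−1} + b); this is a simplifying assumption on our part, since the experiments above use an LSTM, for which Kanuparthi et al. (2018) derive an additive form instead. Each factor then expands as

    ∂h_{t−i+1}/∂h_{t−i} = diag(1 − h_{t−i+1}²) U
    ∂e_t/∂h_{t−k} = (∂e_t/∂h_t) Π_{i=1}^{k} diag(1 − h_{t−i+1}²) U

so the saturation of the conduit's hidden units at each step directly scales the message that reaches the open symbol.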
[Figure 10: The predicted probability P(x_t = ω), according to the contributions of the open symbol x_{t−k} = α and of the conduit sequence x_{t−k+1} ... x_{t−1}, for various rule occurrence counts n. Shown at 40 epochs.]

[Figure 11: Average gradient magnitude at timestep t−k+d, varying d up to the length of the conduit. Solid lines are the unpredictable-conduit setting, dashed lines the predictable-conduit setting.]