The Unstoppable Rise of Computational Linguistics in Deep Learning
James Henderson
Idiap Research Institute, Switzerland
james.henderson@idiap.ch

arXiv:2005.06420v3 [cs.CL] 11 Jun 2020. Accepted for publication at ACL 2020, in the theme track.

Abstract

In this paper, we trace the history of neural networks applied to natural language understanding tasks, and identify key contributions which the nature of language has made to the development of neural network architectures. We focus on the importance of variable binding and its instantiation in attention-based models, and argue that Transformer is not a sequence model but an induced-structure model. This perspective leads to predictions of the challenges facing research in deep learning architectures for natural language understanding.

1 Introduction

When neural networks first started being applied to natural language in the 1980s and 90s, they represented a radical departure from standard practice in computational linguistics. Connectionists had vector representations and learning algorithms, and they didn't see any need for anything else. Everything was a point in a vector space, and everything about the nature of language could be learned from data. On the other hand, most computational linguists had linguistic theories and the poverty-of-the-stimulus argument. Obviously some things were learned from data, but all the interesting things about the nature of language had to be innate.

A quarter century later, we can say two things with certainty: they were both wrong. Vector-space representations and machine learning algorithms are much more powerful than was thought. Much of the linguistic knowledge which computational linguists assumed needed to be innate can in fact be learned from data. But the unbounded discrete structured representations they used have not been replaced by vector-space representations. Instead, the successful uses of neural networks in computational linguistics have replaced specific pieces of computational-linguistic models with new neural network architectures which bring together continuous vector spaces with structured representations in ways which are novel for both machine learning and computational linguistics.

Thus, the great progress which we have made through the application of neural networks to natural language processing should not be viewed as a conquest, but as a compromise. As well as the unquestionable impact of machine learning research on NLP, the nature of language has had a profound impact on progress in machine learning. In this paper we trace this impact, and speculate on future progress and its limits.

We start with a sketch of the insights from grammar formalisms about the nature of language, with their multiple levels, structured representations and rules. The rules were soon learned with statistical methods, followed by the use of neural networks to replace symbols with induced vectors, but the most effective models still kept structured representations, such as syntactic trees. More recently, attention-based models have replaced hand-coded structures with induced structures. The resulting models represent language with multiple levels of structured representations, much as has always been done. Given this perspective, we identify remaining challenges in learning language from data, and its possible limitations.

2 Grammar Formalisms versus Connectionism

2.1 Grammar Formalisms

Our modern understanding of the computational properties of language started with the introduction of grammar formalisms. Context Free Grammars (Chomsky, 1959) illustrated how a formal system could model the infinite generative capacity of language with a bounded grammar. This formalism soon proved inadequate to account for the diversity of phenomena in human languages, and a number of linguistically-motivated grammar formalisms were proposed (e.g. HPSG (Pollard and Sag, 1987), TAG (Joshi, 1987), CCG (Steedman, 2000)).
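To make this point concrete, consider the following toy sketch (the grammar, vocabulary and code are invented purely for illustration): a handful of context-free rules generates an unbounded set of sentences, because a nonterminal can be rewritten in terms of itself.

```python
import random

# A toy context-free grammar: finitely many rules, but the recursion
# through NP and PP allows sentences of unbounded length.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N"], ["the", "N", "PP"]],   # NP can recurse via PP
    "PP": [["near", "NP"]],
    "VP": [["sleeps"], ["sees", "NP"]],
    "N":  [["dog"], ["cat"], ["house"]],
}

def generate(symbol="S"):
    """Expand a symbol by recursively choosing one rule per nonterminal."""
    if symbol not in GRAMMAR:          # terminal symbol: a word
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])
    return [word for part in rule for word in generate(part)]

for _ in range(3):
    print(" ".join(generate()))
# e.g. "the dog near the cat near the house sleeps" -- the same bounded
# rule set licenses arbitrarily deep embeddings, hence an unbounded language.
```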
All these grammar formalisms shared certain properties, motivated by the understanding of the nature of languages in Linguistics. They all postulate representations which decompose an utterance into a set of sub-parts, with labels of the parts and a structure of inter-dependence between them. And they all assume that this decomposition happens at multiple levels of representation. For example, spoken utterances can be decomposed into sentences, sentences can be decomposed into words, words can be decomposed into morphemes, and morphemes can be decomposed into phonemes, before we reach the observable sound signal. In the interests of uniformity, we will refer to the sub-parts in each level of representation as its entities, their labels as their properties, and their structure of inter-dependence as their relations. The structure of inter-dependence between entities at different levels will also be referred to as relations.

In addition to these representations, grammar formalisms include specifications of the allowable structures. These may take the form of hard constraints or soft objectives, or of deterministic rules or stochastic processes. In all cases, the purpose of these specifications is to account for the regularities found in natural languages. In the interests of uniformity, we will refer to all these different kinds of specifications of allowable structures as rules. These rules may apply within or between levels of representation.

In addition to explicit rules, computational linguistic formalisms implicitly make claims about the regularities found in natural languages through their expressive power. Certain types of rules simply cannot be specified, thus claiming that such rules are not necessary to capture the regularities found in any natural language. These claims differ across formalisms, but the study of the expressive power of grammar formalisms has identified certain key principles (Joshi et al., 1990). Firstly, that the set of rules in a given grammar is bounded. This in turn implies that the set of properties and relations in a given grammar is also bounded.

But language is unbounded in nature, since sentences and texts can be arbitrarily long. [Footnote: A set of things (e.g. the sentences of a language) has unbounded size if for any finite size there is always some element in the set which is larger than that.] Grammar formalisms capture this unboundedness by allowing an unbounded number of entities in a representation, and thus an unbounded number of rule applications. It is generally accepted that the number of entities grows linearly with the length of the sentence (Joshi et al., 1990), so each level can have at most a number of entities which is linear in the number of entities at the level(s) below.

Computational linguistic grammar formalisms also typically assume that the properties and relations are discrete, called symbolic representations. These may be atomic categories, as in CFGs, TAGs, CCG and dependency grammar, or they may be feature structures, as in HPSG.

2.2 Connectionism

Other researchers who were more interested in the computational properties of neurological systems found this reliance on discrete categorical representations untenable. Processing in the brain uses real-valued representations distributed across many neurons. Based on successes following the development of multi-layered perceptrons (MLPs) (Rumelhart et al., 1986b), an approach to modelling cognitive phenomena was developed called connectionism. Connectionism uses vector-space representations to reflect the distributed continuous nature of representations in the brain. Similarly, its rules are specified with vectors of continuous parameters. MLPs are so powerful that they are arbitrary function approximators (Hornik et al., 1989). And thanks to backpropagation learning (Rumelhart et al., 1986a) in neural network models, such as MLPs and Simple Recurrent Networks (SRNs) (Elman, 1990), these vector-space representations and rules could be learned from data.

The ability to learn powerful vector-space representations from data led many connectionists to argue that the complex discrete structured representations of computational linguistics were neither necessary nor desirable (e.g. Smolensky (1988, 1990); Elman (1991); Miikkulainen (1993); Seidenberg (2007)). Distributed vector-space representations were thought to be so powerful that there was no need for anything else. Learning from data made linguistic theories irrelevant. (See also (Collobert and Weston, 2008; Collobert et al., 2011; Sutskever et al., 2014) for more recent incarnations.)

The idea that vector-space representations are adequate for natural language and other cognitive phenomena was questioned from several directions.
From neuroscience, researchers questioned how a simple vector could encode features of more than one thing at a time. If we see a red square together with a blue triangle, how do we represent the difference between that and a red triangle with a blue square, since the vector elements for red, blue, square and triangle would all be active at the same time? This is known as the variable binding problem, so called because variables are used to do this binding in symbolic representations, as in red(x) ∧ triangle(x) ∧ blue(y) ∧ square(y). One proposal has been that the precise timing of neuron activation spikes could be used to encode variable binding, called Temporal Synchrony Variable Binding (von der Malsburg, 1981; Shastri and Ajjanagadde, 1993). Neural spike trains have both a phase and a period, so the phase could be used to encode variable binding while still allowing the period to be used for sequential computation. This work indicated how entities could be represented in a neurally-inspired computational architecture.
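The following small numerical sketch (our own illustration, with made-up feature vectors) shows the problem: collapsing a scene into a single vector of active features loses the bindings, whereas keeping one vector per entity preserves them.

```python
import numpy as np

# Feature indices: [red, blue, square, triangle]
red, blue, square, triangle = np.eye(4)

# Scene 1: a red square and a blue triangle.
# Scene 2: a red triangle and a blue square.
# A single "bag of features" vector sums everything into one point:
scene1_flat = (red + square) + (blue + triangle)
scene2_flat = (red + triangle) + (blue + square)
print(np.array_equal(scene1_flat, scene2_flat))   # True: the binding is lost

# Keeping one vector per entity (a simple form of variable binding)
# preserves which colour goes with which shape:
scene1_entities = {tuple(red + square), tuple(blue + triangle)}
scene2_entities = {tuple(red + triangle), tuple(blue + square)}
print(scene1_entities == scene2_entities)         # False: the scenes differ
```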
The adequacy of vector-space representations was also questioned based on the regularities found in natural language. In particular, Fodor and Pylyshyn (1988) argued that connectionist architectures were not adequate to account for regularities which they characterised as systematicity (see also (Smolensky, 1990; Fodor and McLaughlin, 1990)). In essence, systematicity requires that learned rules generalise in a way that respects structured representations. Here again the issue is representing multiple entities at the same time, but with the additional requirement of representing the structural relationships between these entities. Only rules which are parameterised in terms of such representations can generalise in a way which accounts for the generalisations found in language.

Early work on neural networks for natural language recognised the significance of variable binding for solving the issues with systematicity (Henderson, 1996, 2000). Henderson (1994, 2000) argued that extending neural networks with temporal synchrony variable binding made them powerful enough to account for the regularities found in language. Using time to encode variable bindings means that learning could generalise in a linguistically appropriate way (Henderson, 1996), since rules (neuronal synapses) learned for one variable (time) would systematically generalise to other variables. Although relations were not stored explicitly, it was claimed that for language understanding it is adequate to recover them from the features of the entities (Henderson, 1994, 2000). But these arguments were largely theoretical, and it was not clear how they could be incorporated in learning-based architectures.

2.3 Statistical Models

Although researchers in computational linguistics did not want to abandon their representations, they did recognise the importance of learning from data. The first successes in this direction came from learning rules with statistical methods, such as part-of-speech tagging with hidden Markov models. For syntactic parsing, the development of the Penn Treebank led to many statistical models which learned the rules of grammar (Collins, 1997, 1999; Charniak, 1997; Ratnaparkhi, 1999).

These statistical models were very successful at learning from the distributions of linguistic representations which had been annotated in the corpus they were trained on. But they still required linguistically-motivated designs to work well. In particular, feature engineering is necessary to make sure that these statistical machine-learning methods can search a space of rules which is sufficiently broad to include good models but sufficiently narrow to allow learning from limited data.

3 Inducing Features of Entities

Early work on neural networks for natural language recognised the potential of neural networks for learning the features as well, replacing feature engineering. But empirically successful neural network models for NLP were only achieved with approaches where the neural network was used to model one component within an otherwise traditional symbolic NLP model.

The first work to achieve empirical success in comparison to non-neural statistical models was work on language modelling. Bengio et al. (2001, 2003) used an MLP to estimate the parameters of an n-gram language model, and showed improvements when interpolated with a statistical n-gram language model. A crucial innovation of this model was the introduction of word embeddings. The idea that the properties of a word could be represented by a vector reflecting the distribution of the word in text was introduced earlier in non-neural statistical models (e.g. (Deerwester et al., 1990; Schütze, 1993; Burgess, 1998; Padó and Lapata, 2007; Erk, 2010)). This work showed that similarity in the
Table 1: PTB Constituents

  model                            LP     LR     F1
  Costa et al. (2001) PoS          57.8   64.9   61.1
  Henderson (2003) PoS             83.3   84.3   83.8
  Henderson (2003)                 88.8   89.5   89.1
  Henderson (2004)                 89.8   90.4   90.1
  Vinyals et al. (2015) seq2seq

ble parses. These models have also been applied to syntactic dependency parsing (Titov and Henderson, 2007b; Yazdani and Henderson, 2015) and joint syntactic-semantic dependency parsing (Henderson et al., 2013).
In contrast to seq2seq models, there have also been neural network models of parsing which directly represent linguistic structure, rather than just derivation structure, giving them induced vector representations which map one-to-one with the entities in the linguistic representation. Typically, a recursive neural network is used to compute embeddings of syntactic constituents bottom-up. Dyer et al. (2015) showed improvements by adding these representations to a model of the derivation structure. Socher et al. (2013a) only modelled the linguistic structure, making it difficult to do decoding efficiently. But the resulting induced constituent embeddings have a clear linguistic interpretation, making it easier to use them within other tasks, such as sentiment analysis (Socher et al., 2013b). Similarly, models based on Graph Convolutional Networks have induced embeddings with clear linguistic interpretations within pre-defined model structures (e.g. (Marcheggiani and Titov, 2017; Marcheggiani et al., 2018)).

All these results demonstrate the incredible effectiveness of inducing vector-space representations with neural networks, relieving us from the need to do feature engineering. But neural networks do not relieve us of the need to understand the nature of language when designing our models. Instead of feature engineering, these results show that the best accuracy is achieved by engineering the inductive bias of deep learning models through their model structure. By designing a hand-coded model structure which reflects the linguistic structure, locality in the model structure can reflect locality in the linguistic structure. The neural network then induces features of the entities in this model structure.

4 Inducing Relations between Entities

With the introduction of attention-based models, the model structure can now be learned. By choosing the nodes to be linguistically-motivated entities, learning the model structure in effect learns the statistical inter-dependencies between entities, which is what we have been referring to as relations.

4.1 Attention-Based Models and Variable Binding

The first proposal of an attention-based neural model learned a soft alignment between the target and source words in neural machine translation (NMT) (Bahdanau et al., 2015). The model structure of the source sentence encoder and the model structure of the target sentence decoder are both flat sequences, but when each target word is generated, it computes attention weights over all source words. These attention weights directly express how target words are correlated with source words, and in this sense can be seen as a soft version of the alignment structure. In traditional statistical machine translation, this alignment structure is determined with a separate alignment algorithm, and then frozen while training the model. In contrast, the attention-based NMT model learns the alignment structure jointly with learning the encoder and decoder, inside the deep learning architecture (Bahdanau et al., 2015).
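A minimal sketch of this idea is given below (a simplified additive scoring function loosely in the spirit of Bahdanau et al. (2015); the dimensions, parameter matrices and random states are illustrative stand-ins for learned quantities): given one decoder state, the attention weights form a distribution over source positions, i.e. a soft alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Encoder states for a 5-word source sentence, and one decoder state.
d = 8
source_states = rng.normal(size=(5, d))
decoder_state = rng.normal(size=d)

# Additive attention scores e_i = v . tanh(W_s h_i + W_t s), a rough
# stand-in for a learned alignment model; W_s, W_t, v are illustrative.
W_s, W_t = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
scores = np.array([v @ np.tanh(W_s @ h + W_t @ decoder_state)
                   for h in source_states])

alignment = softmax(scores)          # soft alignment over source words
context = alignment @ source_states  # weighted sum fed to the decoder
print(alignment.round(2), alignment.sum())  # a distribution, sums to 1.0
```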
This attention-based approach to NMT was also applied to mapping a sentence to its syntactic parse (Vinyals et al., 2015). The attention function learns the structure of the relationship between the sentence and its syntactic derivation sequence, but does not have any representation of the structure of the syntactic derivation itself. Empirical results are much better than their seq2seq model (Vinyals et al., 2015), but not as good as models which explicitly model both structures (see Table 1).

The change from the sequential LSTM decoders of previous NMT models to LSTM decoders with attention seems like a simple addition, but it fundamentally changes the kinds of generalisations which the model is able to learn. At each step in decoding, the state of a sequential LSTM model is a single vector, whereas adding attention means that the state needs to include the unboundedly large set of vectors being attended to. This use of an unbounded state is more similar to the above models with predefined model structure, where an unboundedly large stack is needed to specify the parser state. This change in representation leads to a profound change in the generalisations which can be learned. Parameterised rules which are learned when paying attention to one of these vectors (in the set or in the stack) automatically generalise to the other vectors. In other words, attention-based models have variable binding, which sequential LSTMs do not. Each vector represents the features for one entity, multiple entities can be kept in memory at the same time, and rules generalise across these entities. In this sense it is wrong to refer to attention-based models as sequence models; they are in fact induced-structure models. We will expand on this perspective in the rest of this section.
4.2 Transformer and Systematicity

The generality of attention as a structure-induction method soon became apparent, culminating in the development of the Transformer architecture (Vaswani et al., 2017). Transformer has multiple stacked layers of self-attention (attention to the other words in the same sequence), interleaved with nonlinear functions applied to individual vectors. Each attention layer has multiple attention heads, allowing each head to learn a different type of relation. A Transformer-encoder has one column of stacked vectors for each position in the input sequence, and the model parameters are shared across positions. A Transformer-decoder adds attention over an encoded text, and predicts words one at a time after encoding the prefix of previously generated words.

Although it was developed for encoding and generating sequences, in Transformer the sequential structure is not hard-coded into the model structure, unlike previous models of deep learning for sequences (e.g. LSTMs (Hochreiter and Schmidhuber, 1997) and CNNs (LeCun and Bengio, 1995)). Instead, the sequential structure is input in the form of position embeddings. In our formulation, position embeddings are just properties of individual entities (typically words or subwords). As such, these inputs facilitate learning about absolute positions. But they are also designed to allow the model to easily calculate relative position between entities. This allows the model's attention functions to learn to discover the relative position structure of the underlying sequence. In fact, explicitly inputting relative position relations as embeddings into the attention functions works even better (Shaw et al., 2018) (discussed further below). Whether input as properties or as relations, these inputs are just features, not hard-coded model structure. The attention weight functions can then learn to use these features to induce their own structure.
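A minimal single-head sketch of this computation is given below (a simplification of scaled dot-product self-attention in the sense of Vaswani et al. (2017); multiple heads, layer normalisation and the feed-forward sublayer are omitted, and the random matrices stand in for learned parameters). Note that the same projection matrices are applied at every position, and the sequential order enters only through the position embeddings added to the token embeddings.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """One head of scaled dot-product self-attention over a bag of vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # one distribution per position
    return weights @ V

d = 16
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
tokens = rng.normal(size=(6, d))       # 6 token embeddings
positions = rng.normal(size=(6, d))    # position embeddings as entity properties

out = self_attention(tokens + positions, W_q, W_k, W_v)
print(out.shape)  # (6, 16): one updated vector per entity; any length works
```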
The appropriateness and generality for natural language of the Transformer architecture became even more apparent with the development of pretrained Transformer models like BERT (Devlin et al., 2019). BERT models are large Transformer models trained mostly on a masked language model objective, as well as a next-sentence prediction objective. After training on a very large amount of unlabelled text, the resulting pretrained model can be fine-tuned for various tasks, with very impressive improvements in accuracy across a wide variety of tasks. The success of BERT has led to various analyses of what it has learned, including the structural relations learned by the attention functions. Although there is no exact mapping from these structures to the structures posited by linguistics, there are clear indications that the attention functions are learning to extract linguistic relations (Voita et al., 2019; Tenney et al., 2019; Reif et al., 2019).

With variable binding for the properties of entities and attention functions for relations between entities, Transformer can represent the kinds of structured representations argued for above. With parameters shared across entities and sensitive to these properties and relations, learned rules are parameterised in terms of these structures. Thus Transformer is a deep learning architecture with the kind of generalisation ability required to exhibit systematicity, as in (Fodor and Pylyshyn, 1988).

Interestingly, the relations are not stored explicitly. Instead they are extracted from pairs of vectors by the attention functions, as with the use of position embeddings to compute relative position relations. For the model to induce its own structure, lower levels must learn to embed their relations in pairs of token embeddings, which higher levels of attention then extract.

That Transformer learns to embed relations in pairs of token embeddings is apparent from recent work on dependency parsing (Kondratyuk and Straka, 2019; Mohammadshahi and Henderson, 2019, 2020). Earlier models of dependency parsing successfully use BiLSTMs to embed syntactic dependencies in pairs of token embeddings (e.g. (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016)), which are then extracted to predict the dependency tree. Mohammadshahi and Henderson (2019, 2020) use their proposed Graph-to-Graph Transformer to encode dependencies in pairs of token embeddings, for transition-based and graph-based dependency parsing respectively. Graph-to-Graph Transformer also inputs previously predicted dependency relations into its attention functions (like relative position encoding (Shaw et al., 2018)). These parsers achieve state-of-the-art accuracies, indicating that Transformer finds it easy to input and predict syntactic dependency relations via pairs of token embeddings. Interestingly, initialising the model with pretrained BERT results in large improvements, indicating that BERT representations also encode syntactically-relevant relations in pairs of token embeddings.
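As an illustration of what extracting a relation from a pair of token embeddings can look like, the sketch below scores candidate head-dependent arcs with a bilinear function of the two token vectors, loosely in the style of the biaffine scorers cited above; the dimensions, the random parameter matrix and the greedy head selection are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 5, 16
tokens = rng.normal(size=(n, d))        # contextualised token embeddings
W_arc = rng.normal(size=(d, d))         # stand-in for a learned scoring matrix

# score[i, j]: how plausible it is that token j is the syntactic head of token i.
scores = tokens @ W_arc @ tokens.T

# A simple (non-tree-constrained) prediction: each token picks its best head.
predicted_heads = scores.argmax(axis=1)
print(predicted_heads)  # the relation is read off pairs of vectors, not stored separately
```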
4.3 Nonparametric Representations

As we have seen, the problem with vector-space models is not simply about representations, but about the way learned rules generalise. In work on grammar formalisms, generalisation is analysed by looking at the unbounded case, since any bounded case can simply be memorised. But the use of continuous representations does not fit well with the theory of grammar formalisms, which assumes a bounded vocabulary of atomic categories. Instead we propose an analysis of the generalisation abilities of Transformer in terms of theory from machine learning, Bayesian nonparametric learning (Jordan, 2010). We argue that the representations of Transformer are the minimal nonparametric extension of a vector space.

To connect Transformer to Bayesian probabilities, we assume that a Transformer representation can be thought of as the parameters of a probability distribution. This is natural, since a model's state represents a belief about the input, and in Bayesian approaches beliefs are probability distributions. From this perspective, computing a representation is inferring the parameters of a probability distribution from the observed input. This is analogous to Bayesian learning, where we infer the parameters of a distribution over models from observed training data. In this section, we outline how theory from Bayesian learning helps us understand how the representations of Transformer lead to better generalisation.

We do not make any specific assumptions about what probability distributions are specified by a Transformer representation, but it is useful to keep in mind an example. One possibility is a mixture model, where each vector specifies the parameters of a multi-dimensional distribution, and the total distribution is the weighted sum across the vectors of these distributions. For example, we can interpret the vectors x = x_1, ..., x_n in a Transformer's representation as specifying a belief about the queries q that will be received from a downstream attention function, as in:

  P(q|x) = Σ_i P(i|x) P(q|x_i)
  P(i|x) = exp(½ ||x_i||²) / Σ_i' exp(½ ||x_i'||²)
  P(q|x_i) = N(q; μ = x_i, σ = 1)

With this interpretation of x, we can use the fact that P(i|x, q) ∝ P(i|x) P(q|x_i) ∝ exp(q·x_i) (ignoring factors independent of i) to reinterpret a standard attention function.
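This equivalence is easy to check numerically. In the sketch below (our own illustration with random vectors), the posterior responsibilities P(i|x, q) of the mixture components coincide with the standard attention weights softmax(q·x_i), and permuting the bag leaves the attended summary unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n = 8, 5
x = rng.normal(size=(n, d))   # the bag of vectors (one per entity)
q = rng.normal(size=d)        # a query from a downstream attention function

# Mixture interpretation: P(i|x) prop. to exp(0.5 ||x_i||^2), P(q|x_i) = N(q; x_i, I).
log_prior = 0.5 * (x ** 2).sum(axis=1)
log_lik = -0.5 * ((q - x) ** 2).sum(axis=1)
posterior = softmax(log_prior + log_lik)   # P(i | x, q)

attention = softmax(x @ q)                 # standard attention weights exp(q . x_i)
print(np.allclose(posterior, attention))   # True: the two coincide

# The vectors are also exchangeable: permuting the bag permutes the weights
# but does not change the attended summary sum_i w_i x_i.
perm = rng.permutation(n)
print(np.allclose(attention @ x, softmax(x[perm] @ q) @ x[perm]))  # True
```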
Since Transformer has a discrete segmentation of its representation into positions (which we call entities), but no explicit representation of structure, we can think of this representation as a bag of vectors (BoV, i.e. a set of instances of vectors). Each layer has a BoV representation, which is aligned with the BoV representation below it. The final output only becomes a sequence if the downstream task imposes explicit sequential structure on it, which attention alone does not.

These bag-of-vector representations have two very interesting properties for natural language. First, the number of vectors in the bag can grow arbitrarily large, which captures the unbounded nature of language. Secondly, the vectors in the bag are exchangeable, in the sense of Jordan (2010). In other words, renumbering the indices used to refer to the different vectors will not change the interpretation of the representation. [Footnote: These indices should not be confused with position embeddings. In fact, position embeddings are needed precisely because the indices are meaningless to the model.] This is because the learned parameters in Transformer are shared across all positions. These two properties are clearly related; exchangeability allows learning to generalise to unbounded representations, since there is no need to learn about indices which are not in the training data.

These properties mean that BoV representations are nonparametric representations. In other words, the specification of a BoV representation cannot be done just by choosing values for a fixed set of parameters. The number of parameters you need grows with the size of the bag. This is crucial for language because the amount of information conveyed by a text grows with the length of the text, so we need nonparametric representations.

To illustrate the usefulness of this view of BoVs as nonparametric representations, we propose to use methods from Bayesian learning to define a prior distribution over BoVs where the size of the bag is not known. Such a prior would be needed for learning the number of entities in a Transformer representation, discussed below, using variational Bayesian approaches. For this example, we will use the above interpretation of a BoV x = {x_i | 1 ≤ i ≤ k} as specifying a distribution over queries, P(q|x) = Σ_i P(i|x) P(q|x_i). A prior distribution over these P(q|x) distributions can be specified, for example, with a Dirichlet Process, DP(α, G_0). The concentration parameter α controls the generation of a sequence of probabilities ρ_1, ρ_2, ..., which correspond to the P(i|x) distribution (parameterised by the ||x_i||). The base distribution G_0 controls the generation of the P(q|x_i) distributions (parameterised by the x_i).
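As a sketch of what such a prior looks like, the standard stick-breaking construction of the Dirichlet Process is shown below (an illustration of the roles of α and G_0, not part of any proposed model; G_0 is taken to be a standard normal for concreteness): α controls how quickly the probabilities ρ_i decay, and hence how many entities receive non-negligible weight.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_dp_mixture(alpha, d=8, max_components=50):
    """Truncated stick-breaking sample from DP(alpha, G0), with G0 = standard normal."""
    betas = rng.beta(1.0, alpha, size=max_components)
    sticks = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    rho = betas * sticks                      # mixing probabilities rho_1, rho_2, ...
    x = rng.normal(size=(max_components, d))  # component parameters x_i drawn from G0
    return rho, x

for alpha in (0.5, 5.0):
    rho, _ = sample_dp_mixture(alpha)
    print(alpha, (rho > 0.01).sum())  # larger alpha -> weight spread over more entities
```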
The use of exchangeability to support generalisation to unbounded representations implies a third interesting property, discrete segmentation into entities. In other words, the information in a BoV is spread across an integer number of vectors. A vector cannot be half included in a BoV; it is either included or not. In changing from a vector space to a bag-of-vector space, the only change is this discrete segmentation into entities. In particular, no discrete representation of structure is added to the representation. Thus, the BoV representation of Transformer is the minimal nonparametric extension of a vector space.

With this minimal nonparametric extension, Transformer is able to explicitly represent entities and their properties, and implicitly represent a structure of relations between these entities. The continuing astounding success of Transformer in natural language understanding tasks suggests that this is an adequate deep learning architecture for the kinds of structured representations needed to account for the nature of language.

5 Looking Forward: Inducing Levels and their Entities

As argued above, the great success of neural networks in NLP has not been because they are radically different from pre-neural computational theories of language, but because they have succeeded in replacing hand-coded components of those models with learned components which are specifically designed to capture the same generalisations. We predict that there is at least one more hand-coded aspect of these models which can be learned from data, but question whether they all can be.

Transformer can learn representations of entities and their relations, but current work (to the best of our knowledge) all assumes that the set of entities is a predefined function of the text. Given a sentence, a Transformer does not learn how many vectors it should use to represent it. The number of positions in the input sequence is given, and the number of token embeddings is the same as the number of input positions. When a Transformer decoder generates a sentence, the number of positions is chosen by the model, but it is simply trying to guess the number of positions that would have been given if this was a training example. These Transformer models never try to induce the number of token embeddings they use in an unsupervised way. [Footnote: Recent work on inducing sparsity in attention weights (Correia et al., 2019) effectively learns to reduce the number of entities used by individual attention heads, but not by the model as a whole.]

Given that current models hard-code different token definitions for different tasks (e.g. character embeddings versus word embeddings versus sentence embeddings), it is natural to ask whether a specification of the set of entities at a given level of representation can be learned. There are models which induce the set of entities in an input text, but these are (to the best of our knowledge) not learned jointly with a downstream deep learning model. Common examples include BPE (Sennrich et al., 2016) and the unigram language model (Kudo, 2018), which use statistics of character n-grams to decide how to split words into subwords. The resulting subwords then become the entities for a deep learning model, such as Transformer (e.g. BERT), but they do not explicitly optimise the performance of this downstream model. In a more linguistically-informed approach to the same problem, statistical models have been proposed for morphology induction (e.g. (Elsner et al., 2013)). Also, Semi-Markov CRF models (Sarawagi and Cohen, 2005) can learn segmentations of an input string, which have been used in the output layers of neural models (e.g. (Kong et al., 2015)). The success of these models in finding useful segmentations of characters into subwords suggests that learning the set of entities can be integrated into a deep learning model. But this task is complicated by the inherently discrete nature of the segmentation into entities. It remains to find effective neural architectures for learning the set of entities jointly with the rest of the neural model, and for generalising such methods from the level of character strings to higher levels of representation.
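As a concrete picture of this kind of entity induction, the sketch below implements a bare-bones BPE-style merge loop over a toy word-frequency table (our own simplification of the procedure of Sennrich et al. (2016)): the most frequent adjacent pair of symbols is repeatedly merged into a new symbol, and the resulting merges define the subword entities.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE-style merges from a word-frequency dictionary."""
    # Each word starts as a sequence of characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

corpus = {"lower": 4, "lowest": 3, "newer": 5, "newest": 2}
merges, vocab = learn_bpe(corpus, num_merges=6)
print(merges)              # frequent character pairs become new symbols
print(list(vocab.keys()))  # words re-segmented into the induced subword entities
```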
The other remaining hand-coded component of computational linguistic models is levels of representation. Neural network models of language typically only represent a few levels, such as the character sequence plus the word sequence, the word sequence plus the syntax tree, or the word sequence plus the syntax tree plus the predicate-argument structure (Henderson et al., 2013; Swayamdipta et al., 2016). And these levels and their entities are defined before training starts, either in pre-processing or in annotated data. If we had methods for inducing the set of entities at a given level (discussed above), then we could begin to ask whether we can induce the levels themselves.

One common approach to inducing levels of representation in neural models is to deny it is a problem. Seq2seq and end2end models typically take this approach. These models only include representations at a lower level, both for input and output, and try to achieve equivalent performance to models which postulate some higher level of representation (e.g. (Collobert and Weston, 2008; Collobert et al., 2011; Sutskever et al., 2014; Vinyals et al., 2015)). The most successful example of this approach has been neural machine translation. The ability of neural networks to learn such models is impressive, but the challenge of general natural language understanding is much greater than machine translation. Nonetheless, models which do not explicitly model levels of representation can show that they have learned about different levels implicitly (Peters et al., 2018; Tenney et al., 2019).

We think that it is far more likely that we will be able to design neural architectures which induce multiple levels of representation than it is that we can ignore this problem entirely. However, it is not at all clear that even this will be possible. Unlike the components previously learned, no linguistic theory postulates different levels of representation for different languages. Generally speaking, there is a consensus that the levels minimally include phonology, morphology, syntactic structure, predicate-argument structure, and discourse structure. This language-universal nature of levels of representation suggests that in humans the levels of linguistic representation are innate. This draws into question whether levels of representation can be learned at all. Perhaps they are innate because human brains are not able to learn them from data. If so, perhaps it is the same for neural networks, and so attempts to induce levels of representation are doomed to failure. Or perhaps we can find new neural network architectures which are even more powerful than what is now thought possible. It wouldn't be the first time!

6 Conclusions

We conclude that the nature of language has influenced the design of deep learning architectures in fundamental ways. Vector-space representations (as in MLPs) are not adequate, nor are vector spaces which evolve over time (as in LSTMs). Attention-based models are fundamentally different because they use bag-of-vector representations. BoV representations are nonparametric representations, in that the number of vectors in the bag can grow arbitrarily large, and these vectors are exchangeable.

With BoV representations, attention-based neural network models like Transformer can model the kinds of unbounded structured representations that computational linguists have found to be necessary to capture the generalisations in natural language. And deep learning allows many aspects of these structured representations to be learned from data.

However, successful deep learning architectures for natural language currently still have many hand-coded aspects. The levels of representation are hand-coded, based on linguistic theory or available resources. Often deep learning models only address one level at a time, whereas a full model would involve levels ranging from the perceptual input to logical reasoning. Even within a given level, the set of entities is a pre-defined function of the text.

This analysis suggests that an important next step in deep learning architectures for natural language understanding will be the induction of entities. It is not clear what advances in deep learning methods will be necessary to improve over our current fixed entity definitions, nor whether the resulting entities will be any different from the ones postulated by linguistic theory. If we can induce the entities at a given level, a more challenging task will be the induction of the levels themselves. The presumably-innate nature of linguistic levels suggests that this might not even be possible.

But of one thing we can be certain: the immense success of adapting deep learning architectures to fit with our computational-linguistic understanding of the nature of language will doubtless continue, with greater insights for both natural language processing and machine learning.

Acknowledgements

We would like to thank Paola Merlo, Suzanne Stevenson, Ivan Titov, members of the Idiap NLU group, and the anonymous reviewers for their comments and suggestions.
References

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of ACL 2016 (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2001. A neural probabilistic language model. In Advances in Neural Information Processing Systems 13, pages 932–938. MIT Press.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Curt Burgess. 1998. From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments, & Computers, 30(2):188–198.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the 14th National Conference on Artificial Intelligence, Providence, RI. AAAI Press/MIT Press.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014, pages 740–750, Doha, Qatar.

Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2:137–167.

Michael Collins. 1997. Three generative, lexicalized models for statistical parsing. In Proceedings of the 35th Meeting of the ACL and 8th Conference of the EACL, pages 16–23, Somerset, New Jersey.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML 2008, pages 160–167, Helsinki, Finland.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. In Proceedings of EMNLP-IJCNLP 2019, pages 2174–2184, Hong Kong, China.

Fabrizio Costa, Vincenzo Lombardo, Paolo Frasconi, and Giovanni Soda. 2001. Wide coverage incremental parsing by learning attachment preferences. Pages 297–307.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR, abs/1611.01734. ICLR 2017.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL-IJCNLP 2015 (Volume 1: Long Papers), pages 334–343, Beijing, China.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–212.

Jeffrey L. Elman. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–225.

Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proceedings of EMNLP 2013, pages 42–54, Seattle, Washington, USA.

Katrin Erk. 2010. What is word meaning, really? (And how can distributional models help us describe it?) In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 17–26, Uppsala, Sweden.

Jerry A. Fodor and B. McLaughlin. 1990. Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work. Cognition, 35:183–204.

Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71.

James Henderson. 1994. Description Based Parsing in a Connectionist Network. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. Technical Report MS-CIS-94-46.

James Henderson. 1996. A connectionist architecture with inherent systematicity. In Proceedings of the Eighteenth Conference of the Cognitive Science Society, pages 574–579, La Jolla, CA.

James Henderson. 2000. Constituency, context, and connectionism in syntactic parsing. In Matthew Crocker, Martin Pickering, and Charles Clifton, editors, Architectures and Mechanisms for Language Processing, pages 189–209. Cambridge University Press, Cambridge, UK.

James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proceedings of NAACL/HLT 2003, pages 103–110, Edmonton, Canada.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In Proceedings of ACL 2004, pages 95–102, Barcelona, Spain.

James Henderson, Paola Merlo, Ivan Titov, and Gabriele Musillo. 2013. Multilingual joint parsing of syntactic and semantic dependencies with a latent variable model. Computational Linguistics, 39(4):949–998.

E. K. S. Ho and L. W. Chan. 1999. How to design a connectionist holistic parser. Neural Computation, 11(8):1995–2016.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

K. Hornik, M. Stinchcombe, and H. White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366.

Ajay N. Jain. 1991. PARSEC: A Connectionist Learning Architecture for Parsing Spoken Language. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA.

M. I. Jordan. 2010. Bayesian nonparametric learning: Expressive priors for intelligent systems. In R. Dechter, H. Geffner, and J. Halpern, editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 10. College Publications.

Aravind K. Joshi. 1987. An introduction to tree adjoining grammars. In Alexis Manaster-Ramer, editor, Mathematics of Language. John Benjamins, Amsterdam.

Aravind K. Joshi, K. Vijay-Shanker, and David Weir. 1990. The convergence of mildly context-sensitive grammatical formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing. MIT Press, Cambridge, MA.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of EMNLP-IJCNLP 2019, pages 2779–2795, Hong Kong, China.

Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2015. Segmental recurrent neural networks.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of ACL 2018 (Volume 1: Long Papers), pages 66–75, Melbourne, Australia.

Yann LeCun and Yoshua Bengio. 1995. Convolutional networks for images, speech, and time-series. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks (Second ed.), pages 276–278. MIT Press.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

C. von der Malsburg. 1981. The correlation theory of brain function. Technical Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen.

Diego Marcheggiani, Joost Bastings, and Ivan Titov. 2018. Exploiting semantics in neural machine translation with graph convolutional networks. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 486–492, New Orleans, Louisiana.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP 2017, pages 1506–1515, Copenhagen, Denmark.

Risto Miikkulainen. 1993. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Alireza Mohammadshahi and James Henderson. 2019. Graph-to-graph transformer for transition-based dependency parsing.

Alireza Mohammadshahi and James Henderson. 2020. Recursive non-autoregressive graph-to-graph transformer for dependency parsing with iterative refinement.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543, Doha, Qatar.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.

Carl Pollard and Ivan A. Sag. 1987. Information-Based Syntax and Semantics. Vol 1: Fundamentals. Center for the Study of Language and Information, Stanford, CA.

Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34:151–175.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603. Curran Associates, Inc.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986a. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, Vol 1, pages 318–362. MIT Press, Cambridge, MA.

D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. 1986b. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol 1. MIT Press, Cambridge, MA.

Sunita Sarawagi and William W. Cohen. 2005. Semi-Markov conditional random fields for information extraction. In Advances in Neural Information Processing Systems 17, pages 1185–1192. MIT Press.

Hinrich Schütze. 1993. Word space. In Advances in Neural Information Processing Systems 5, pages 895–902. Morgan Kaufmann.

Mark S. Seidenberg. 2007. Connectionist models of reading. In Gareth Gaskell, editor, Oxford Handbook of Psycholinguistics, chapter 14, pages 235–250. Oxford University Press.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016 (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Lokendra Shastri and Venkat Ajjanagadde. 1993. From simple associations to systematic reasoning: A connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16:417–451.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana.

Paul Smolensky. 1988. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11:1–17.

Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159–216.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of ACL 2013 (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP 2013, pages 1631–1642, Seattle, Washington, USA.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Greedy, joint syntactic-semantic parsing with stack LSTMs. In Proceedings of CoNLL 2016, pages 187–197, Berlin, Germany.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL 2019, pages 4593–4601, Florence, Italy.

Ivan Titov and James Henderson. 2007a. A latent variable model for generative dependency parsing. In Proceedings of the Tenth International Conference on Parsing Technologies, pages 144–155, Prague, Czech Republic.

Ivan Titov and James Henderson. 2007b. A latent variable model for generative dependency parsing. In Proceedings of the International Conference on Parsing Technologies (IWPT'07), Prague, Czech Republic.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL 2010, pages 384–394, Uppsala, Sweden.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems 28, pages 2773–2781. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of ACL 2019, pages 5797–5808, Florence, Italy.

Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of CoNLL 2015, pages 142–152, Beijing, China.