The Limitations of Limited Context for Constituency Parsing
Yuchen Li and Andrej Risteski
Carnegie Mellon University
yuchenl4@andrew.cmu.edu, aristesk@andrew.cmu.edu

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 2675–2687, August 1–6, 2021. ©2021 Association for Computational Linguistics.

Abstract

Incorporating syntax into neural approaches in NLP has a multitude of practical and scientific benefits. For instance, a language model that is syntax-aware is likely to be able to produce better samples; even a discriminative model like BERT with a syntax module could be used for core NLP tasks like unsupervised syntactic parsing. Rapid progress in recent years was arguably spurred on by the empirical success of the Parsing-Reading-Predict architecture of (Shen et al., 2018a), later simplified by the Ordered Neuron LSTM of (Shen et al., 2019). Most notably, this is the first time neural approaches were able to successfully perform unsupervised syntactic parsing (evaluated by various metrics like F1 score). However, even heuristic (much less fully mathematical) understanding of why and when these architectures work is lagging severely behind. In this work, we answer representational questions raised by the architectures in (Shen et al., 2018a, 2019), as well as some transition-based syntax-aware language models (Dyer et al., 2016): what kind of syntactic structure can current neural approaches to syntax represent? Concretely, we ground this question in the sandbox of probabilistic context-free grammars (PCFGs), and identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to when forced to make parsing decisions. We show that with limited context (either bounded or unidirectional), there are PCFGs for which these approaches cannot represent the max-likelihood parse; conversely, if the context is unlimited, they can represent the max-likelihood parse of any PCFG.

1 Introduction

Neural approaches have been steadily making their way to NLP in recent years. By and large, however, the neural techniques that have been scaled up the most and receive widespread usage do not explicitly try to encode discrete structure that is natural to language, e.g. syntax. The reason for this is perhaps not surprising: neural models have largely achieved substantial improvements in unsupervised settings, BERT (Devlin et al., 2019) being the de-facto method for unsupervised pre-training in most NLP settings. On the other hand, unsupervised syntactic tasks, e.g. unsupervised syntactic parsing, have long been known to be very difficult (Htut et al., 2018). However, since incorporating syntax has been shown to improve language modeling (Kim et al., 2019b) as well as natural language inference (Chen et al., 2017; Pang et al., 2019; He et al., 2020), syntactic parsing remains important even in the current era when large pre-trained models, like BERT (Devlin et al., 2019), are available.

Arguably, the breakthrough works in unsupervised constituency parsing in a neural manner were (Shen et al., 2018a, 2019), achieving F1 scores of 42.8 and 49.4 on the WSJ Penn Treebank dataset (Htut et al., 2018; Shen et al., 2019). Both of these architectures, however (especially Shen et al., 2018a), are quite intricate, and it is difficult to evaluate what their representational power is (i.e. what kinds of structure they can recover). Moreover, as subsequent, more thorough evaluations show (Kim et al., 2019b,a), these methods still have a rather large performance gap with the oracle binary tree (the best binary parse tree according to F1 score), raising the question of what is missing in these methods.

We theoretically answer both questions raised in the prior paragraph. We quantify the representational power of two major frameworks in neural approaches to syntax: learning a syntactic distance (Shen et al., 2018a,b, 2019) and learning to parse through sequential transitions (Dyer et al., 2016; Chelba, 1997). To formalize our results, we consider the well-established sandbox of probabilistic context-free grammars (PCFGs). Namely, we ask:
When is a neural model based on a syntactic distance or transitions able to represent the max-likelihood parse of a sentence generated from a PCFG?

We focus on a crucial "hyperparameter" common to practical implementations of both families of methods that turns out to govern the representational power: the amount and type of context the model is allowed to use when making its predictions. Briefly, for every position t in the sentence, syntactic distance models learn a distance d_t to the previous token, and the tree is then inferred from these distances; transition-based models iteratively construct the parse tree by deciding, at each position t, what operations to perform on a partial parse up to token t. A salient feature of both is the context, that is, which tokens is d_t a function of (correspondingly, which tokens can the choice of operations at token t depend on)?

We show that when the context is either bounded (that is, d_t only depends on a bounded window around the t-th token) or unidirectional (that is, d_t only considers the tokens to the left of the t-th token), there are PCFGs for which no distance metric (correspondingly, no algorithm to choose the sequence of transitions) works. On the other hand, if the context is unbounded in both directions, then both methods work: that is, for any parse, we can design a distance metric (correspondingly, a sequence of transitions) that recovers it.

This is of considerable importance: in practical implementations the context is either bounded (e.g. in Shen et al., 2018a, the distance metric is parametrized by a convolutional kernel with a constant width) or unidirectional (e.g. in Shen et al., 2019, the distance metric is computed by an LSTM, which performs a left-to-right computation). This formally confirms a conjecture of Htut et al. (2018), who suggested that because these models commit to parsing decisions in a left-to-right fashion and are trained as part of a language model, it may be difficult for them to capture sufficiently complex syntactic dependencies. Our techniques are fairly generic and seem amenable to analyzing other approaches to syntax. Finally, while the existence of a particular PCFG that is problematic for these methods does not necessarily imply that the difficulties will carry over to real-life data, the PCFGs used in our proofs closely track linguistic intuitions about syntactic structures that are difficult to infer: the parse depends on words that come much later in the sentence.

2 Overview of Results

We consider several neural architectures that have shown success in various syntactic tasks, most notably unsupervised constituency parsing and syntax-aware language modeling. The general framework these architectures fall under is as follows: to parse a sentence W = w_1 w_2 ... w_n with a trained neural model, the sentence W is input into the model, which outputs o_t at each step t, and finally all the outputs {o_t}_{t=1}^n are utilized to produce the parse.

Given unbounded time and space resources, by a seminal result of Siegelmann and Sontag (1992), an RNN implementation of this framework is Turing complete. In practice it is common to restrict the form of the output o_t in some way. In this paper, we consider the two most common approaches, in which o_t is a real number representing a syntactic distance (Section 2.1) (Shen et al., 2018a,b, 2019) or a sequence of parsing operations (Section 2.2) (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We proceed to describe our results for each architecture in turn.
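To make the shared framework concrete, the following minimal sketch (in Python) spells out the two restricted output formats analyzed below: a per-position syntactic distance and a per-position sequence of transitions. The type and class names (SyntacticDistanceModel, TransitionModel, and so on) are illustrative choices of ours, not names from any of the cited implementations.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Sentence = List[str]            # W = w_1 ... w_n
Context = Tuple[Sentence, int]  # in the "full context" setting, c_t = (W, t)

class SyntacticDistanceModel(Protocol):
    def distance(self, sentence: Sentence, t: int) -> float:
        """Return d_t = d(w_{t-1}, w_t | c_t) for 2 <= t <= n."""

@dataclass
class Transition:
    kind: str        # "NT", "SHIFT", or "REDUCE" (see Definition 4.7)
    label: str = ""  # non-terminal symbol for NT(X)

class TransitionModel(Protocol):
    def transitions(self, sentence: Sentence, t: int) -> List[Transition]:
        """Return the transitions z^t_1 ... z^t_{N_t} emitted after reading w_t."""
```

Both interfaces expose the context only through the arguments they receive; the results below hinge on whether that context includes tokens to the right of position t.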
2.1 Syntactic distance

Syntactic distance-based neural parsers train a neural network to learn a distance for each pair of adjacent words, depending on the context surrounding the pair of words under consideration. The distances are then used to induce a tree structure (Shen et al., 2018a,b).

For a sentence W = w_1 w_2 ... w_n, the syntactic distance between w_{t-1} and w_t (2 ≤ t ≤ n) is defined as d_t = d(w_{t-1}, w_t | c_t), where c_t is the context that d_t takes into consideration.¹ We will show that restricting the surrounding context, either in directionality or in size, results in poor representational power, while full context confers essentially perfect representational power with respect to PCFGs. Concretely, if the context is full, we show:

Theorem (Informal, full context). For a sentence W generated by any PCFG, if the computation of d_t has as context the full sentence and the position index under consideration, i.e. c_t = (W, t) and d_t = d(w_{t-1}, w_t | c_t), then d_t can induce the maximum likelihood parse of W.

¹ Note that this is not a conditional distribution; we use this notation for convenience.

On the flip side, if the context is unidirectional (i.e. unbounded left-context from the start of the sentence, and even possibly with a bounded look-ahead), the representational power becomes severely impoverished:

Theorem (Informal, limitation of left-to-right parsing via syntactic distance). There exists a PCFG G such that for any distance measure d_t whose computation incorporates only bounded context in at least one direction (left or right), e.g.

c_t = (w_0, w_1, ..., w_{t+L_0}),
d_t = d(w_{t-1}, w_t | c_t),

the probability that d_t induces the max likelihood parse is arbitrarily low.

In practice, for computational efficiency, parametrizations of syntactic distances fall under the above assumptions of restricted context (Shen et al., 2018a). This puts the ability of these models to learn a complex PCFG syntax into considerable doubt. For formal definitions, see Section 4.2. For formal theorem statements and proofs, see Section 5.

Subsequently we consider ON-LSTM, an architecture proposed by Shen et al. (2019) improving on their previous work (Shen et al., 2018a), which is also based on learning a syntactic distance, but in (Shen et al., 2019) the distances are derived from the values of a carefully structured master forget gate (see Section 6). While we show that ON-LSTM can in principle losslessly represent any parse tree (Theorem 3), calculating the gate values in a left-to-right fashion (as is done in practice) is subject to the same limitations as the syntactic distance approach:

Theorem (Informal, limitation of syntactic distance estimation based on ON-LSTM). There exists a PCFG G for which the probability that the syntactic distance converted from an ON-LSTM induces the max likelihood parse is arbitrarily low.

For a formal statement, see Section 6 and in particular Theorem 4.

2.2 Transition-based parsing

In principle, the output o_t at each position t of a left-to-right neural model for syntactic parsing need not be restricted to a real-numbered distance or a carefully structured vector. It can also be a combinatorial structure, e.g. a sequence of transitions (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We adopt a simplification of the neural parameterization in (Dyer et al., 2016) (see Definition 4.7).

With full context, Dyer et al. (2016) describe an algorithm to find a sequence of transitions representing any parse tree, via a "depth-first, left-to-right traversal" of the tree. On the other hand, without full context, we prove that transition-based parsing suffers from the same limitations:

Theorem (Informal, limitation of transition-based parsing without full context). There exists a PCFG G such that for any learned transition-based parser with bounded context in at least one direction (left or right), the probability that it returns the max likelihood parse is arbitrarily low.

For a formal statement, see Section 7, and in particular Theorem 5.

Remark. There is no immediate connection between the syntactic distance-based approaches (including ON-LSTM) and the transition-based parsing framework, so the limitations of transition-based parsing do not directly imply the stated negative results for syntactic distance or ON-LSTM, and vice versa.

2.3 The counterexample family

Most of our theorems proving limitations of bounded and unidirectional context are based on a PCFG family (Definition 2.1) which draws inspiration from natural language, as already suggested in (Htut et al., 2018): later words in a sentence can force different syntactic structures earlier in the sentence. For example, consider the two sentences "I drink coffee with milk." and "I drink coffee with friends." Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence, too, as shown in Figure 1. To formalize this intuition, we define the following PCFG.
Definition 2.1 (Right-influenced PCFG). Let m ≥ 2, L_0 ≥ 1 be positive integers. The grammar G_{m,L_0} has start symbol S, other non-terminals A_k, B_k, A^l_k, A^r_k, B^0_k for all k ∈ {1, 2, ..., m}, and terminals a_i for all i ∈ {1, 2, ..., m+1+L_0} and c_j for all j ∈ {1, 2, ..., m}. The rules of the grammar are

S → A_k B_k, ∀k ∈ {1, 2, ..., m}            with prob. 1/m
A_k → A^l_k A^r_k                            with prob. 1
A^l_k →* a_1 a_2 ... a_k                     with prob. 1
A^r_k →* a_{k+1} a_{k+2} ... a_{m+1}         with prob. 1
B_k → B^0_k c_k                              with prob. 1
B^0_k →* a_{m+2} a_{m+3} ... a_{m+1+L_0}     with prob. 1

in which →* means that the left-hand side expands into the right-hand side through a sequence of rules that conform to the requirements of the Chomsky normal form (CNF, Definition 4.4). Hence the grammar G_{m,L_0} is in CNF. The language of this grammar is

L(G_{m,L_0}) = {l_k = a_1 a_2 ... a_{m+1+L_0} c_k : 1 ≤ k ≤ m}.

The parse of an arbitrary l_k is shown in Figure 2. Each l_k corresponds to a unique parse determined by the choice of k. The structure of this PCFG is such that, for the parsing algorithms we consider that proceed in a "left-to-right" fashion on l_k, before processing the last token c_k they cannot infer the syntactic structure of a_1 a_2 ... a_{m+1} any better than randomly guessing one of the m possibilities. This is the main intuition behind Theorems 2 and 5.

Remark. While our theorems focus on the limitation of "left-to-right" parsing, a symmetric argument implies the same limitation for "right-to-left" parsing. Thus, our claim is that unidirectional context (in either direction) limits the expressive power of parsing models.

[Figure 1: The parse trees of the two sentences "I drink coffee with milk." and "I drink coffee with friends.". Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence.]

[Figure 2: The structure of the parse tree of the string l_k = a_1 a_2 ... a_{m+1+L_0} c_k ∈ L(G_{m,L_0}). Any l_{k_1} and l_{k_2} are almost the same except for the last token: the prefix a_1 a_2 ... a_{m+1+L_0} is shared among all strings in L(G_{m,L_0}). However, their parses differ with respect to where A_k is split. The last token c_k is unique to l_k and hence determines the correct parse according to G_{m,L_0}.]
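The intuition behind Definition 2.1 can be checked mechanically. The following sketch (a hypothetical helper of our own, not code from the paper) enumerates L(G_{m,L_0}) together with the position of the top-level split inside a_1 ... a_{m+1} for each string, making explicit that every string shares the same prefix while requiring a different split.

```python
def right_influenced_language(m: int, L0: int):
    """Enumerate (string, split position k) pairs for G_{m,L0} of Definition 2.1.

    Each string is l_k = a_1 ... a_{m+1+L0} c_k; its gold parse splits the
    prefix a_1 ... a_{m+1} between a_k and a_{k+1}.
    """
    prefix = [f"a{i}" for i in range(1, m + 1 + L0 + 1)]  # shared by all l_k
    for k in range(1, m + 1):
        yield prefix + [f"c{k}"], k

if __name__ == "__main__":
    m, L0 = 3, 1
    for sentence, k in right_influenced_language(m, L0):
        print(" ".join(sentence), "-> split between", f"a{k}", "and", f"a{k+1}")
    # All m sentences agree on the first m+1+L0 tokens, so any parser whose
    # decisions about a_1 ... a_{m+1} ignore the final token c_k must make the
    # same decision for every l_k, and is therefore correct on at most one of them.
```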
3 Related Works

Neural models for parsing were first successfully implemented for supervised settings, e.g. (Vinyals et al., 2015; Dyer et al., 2016; Shen et al., 2018b). Unsupervised tasks remained seemingly out of reach until the proposal of the Parsing-Reading-Predict Network (PRPN) by Shen et al. (2018a), whose performance was thoroughly verified by extensive experiments in (Htut et al., 2018). The follow-up paper (Shen et al., 2019) introducing the ON-LSTM architecture radically simplified the architecture in (Shen et al., 2018a), while still ultimately attempting to fit a distance metric with the help of carefully designed master forget gates. Subsequent work by Kim et al. (2019a) departed from the usual way neural techniques are integrated in NLP, with great success: they proposed a neural parameterization of the EM algorithm for learning a PCFG, but in a manner that leverages semantic information as well, achieving a large improvement on unsupervised parsing tasks.²

In addition to constituency parsing, dependency parsing is another common task for syntactic parsing, but for our analyses of the ability of various approaches to represent the max-likelihood parse of sentences generated from PCFGs, we focus on the task of constituency parsing. Moreover, it is important to note that there is another line of work aiming to probe the ability of models trained without explicit syntactic consideration (e.g. BERT) to nevertheless discover some (rudimentary) syntactic elements (Bisk and Hockenmaier, 2015; Linzen et al., 2016; Choe and Charniak, 2016; Kuncoro et al., 2018; Williams et al., 2018; Goldberg, 2019; Htut et al., 2019; Hewitt and Manning, 2019; Reif et al., 2019). However, to date, we have not been able to extract parse trees achieving scores that are close to the oracle binarized trees on standard benchmarks (Kim et al., 2019b,a).

Methodologically, our work is closely related to a long line of works aiming to characterize the representational power of neural models (e.g. RNNs, LSTMs) through the lens of formal languages and formal models of computation. Some of the works of this flavor are empirical in nature (e.g. LSTMs have been shown to possess stronger abilities to recognize some context-free languages and even some context-sensitive languages, compared with simple RNNs (Gers and Schmidhuber, 2001; Suzgun et al., 2019) or GRUs (Weiss et al., 2018; Suzgun et al., 2019)); some results are theoretical in nature (e.g. Siegelmann and Sontag (1992)'s proof that with unbounded precision and unbounded time complexity, RNNs are Turing-complete; related results investigate RNNs with bounded precision and computation time (Weiss et al., 2018), as well as memory (Merrill, 2019; Hewitt et al., 2020)). Our work contributes to this line of works, but focuses on the task of syntactic parsing instead.

² By virtue of not relying on bounded or unidirectional context, the Compound PCFG (Kim et al., 2019a) eschews the techniques in our paper. Specifically, by employing a bidirectional LSTM inference network in the process of constructing a tree given a sentence, the parsing is no longer "left-to-right".

4 Preliminaries

In this section, we define some basic concepts and introduce the architectures we will consider.

4.1 Probabilistic context-free grammar

First recall several definitions around formal languages, especially probabilistic context-free grammars:

Definition 4.1 (Probabilistic context-free grammar (PCFG)). Formally, a PCFG (Chomsky, 1956) is a 5-tuple G = (Σ, N, S, R, Π) in which Σ is the set of terminals, N is the set of non-terminals, S ∈ N is the start symbol, R is the set of production rules of the form r = (r_L → r_R), where r_L ∈ N, r_R is of the form B_1 B_2 ... B_m, m ∈ Z^+, and ∀i ∈ {1, 2, ..., m}, B_i ∈ (Σ ∪ N). Finally, Π : R → [0, 1] is the rule probability function, in which for any r = (A → B_1 B_2 ... B_m) ∈ R, Π(r) is the conditional probability P(r_R = B_1 B_2 ... B_m | r_L = A).

Definition 4.2 (Parse tree). Let T_G denote the set of parse trees that G can derive. Each t ∈ T_G is associated with yield(t) ∈ Σ*, the sequence of terminals composed of the leaves of t, and P_T(t) ∈ [0, 1], the probability of the parse tree, defined by the product of the probabilities of the rules in the derivation of t.

Definition 4.3 (Language and sentence). The language of G is L(G) = {s ∈ Σ* : ∃t ∈ T_G, yield(t) = s}. Each s ∈ L(G) is called a sentence in L(G), and is associated with the set of parses T_G(s) = {t ∈ T_G | yield(t) = s}, the set of max likelihood parses arg max_{t ∈ T_G(s)} P_T(t), and its probability P_S(s) = Σ_{t ∈ T_G(s)} P_T(t).

Definition 4.4 (Chomsky normal form (CNF)). A PCFG G = (Σ, N, S, R, Π) is in CNF (Chomsky, 1959) if we require, in addition to Definition 4.1, that each rule r ∈ R is of the form A → B_1 B_2 where B_1, B_2 ∈ N \ {S}; A → a where a ∈ Σ, a ≠ ε; or S → ε, which is only allowed if the empty string ε ∈ L(G).

Every PCFG G can be converted into a PCFG G' in CNF such that L(G) = L(G') (Hopcroft et al., 2006).
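To ground Definitions 4.1-4.4, here is a minimal, hedged sketch of a CNF PCFG together with a Viterbi-style CKY routine that returns the max-likelihood binary parse of a sentence. The data layout and function names are ours, not the paper's, and the sketch assumes the grammar is already in CNF.

```python
import math
from collections import defaultdict

# A tiny CNF PCFG: binary rules A -> B C and lexical rules A -> a, with probabilities.
binary = {("S", ("NP", "VP")): 1.0, ("NP", ("D", "N")): 1.0, ("VP", ("V", "NP")): 1.0}
lexical = {("D", "the"): 1.0, ("N", "dog"): 0.5, ("N", "cat"): 0.5, ("V", "saw"): 1.0}

def viterbi_cky(words):
    """Return (log-prob, tree) of the max-likelihood parse rooted at S."""
    n = len(words)
    best = defaultdict(lambda: (-math.inf, None))  # (i, j, A) -> (logp, tree)
    for i, w in enumerate(words):
        for (A, a), p in lexical.items():
            if a == w and math.log(p) > best[(i, i + 1, A)][0]:
                best[(i, i + 1, A)] = (math.log(p), (A, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, (B, C)), p in binary.items():
                    lb, tb = best[(i, k, B)]
                    lc, tc = best[(k, j, C)]
                    score = math.log(p) + lb + lc
                    if score > best[(i, j, A)][0]:
                        best[(i, j, A)] = (score, (A, tb, tc))
    return best[(0, n, "S")]

print(viterbi_cky("the dog saw the cat".split()))
```

In our setting the question is not how to compute this parse (CKY sees the whole sentence) but whether a distance- or transition-based predictor with restricted context can even represent it.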
4.2 Syntactic distance

The Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018a) is one of the leading approaches to unsupervised constituency parsing. The parsing network (which computes the parse tree, and is hence the only part we focus on in our paper) is a convolutional network that computes the syntactic distances d_t = d(w_{t-1}, w_t) (defined in Section 2.1) based on the past L words. A deterministic greedy tree induction algorithm is then used to produce a parse tree as follows. First, we split the sentence w_1 ... w_n into two constituents, w_1 ... w_{t-1} and w_t ... w_n, where t ∈ argmax{d_t}_{t=2}^n, and form the left and right subtrees. We recursively repeat this procedure for the newly created constituents. An algorithmic form of this procedure is included as Algorithm 1 in Appendix A.

Note that, due to the deterministic nature of the tree-induction process, the ability of PRPN to learn a PCFG is completely contingent upon learning a good syntactic distance.

4.3 The ordered neuron architecture

Building upon the idea of representing the syntactic information with a real-valued distance measure at each position, a simple extension is to associate each position with a learned vector, and then use the vector for syntactic parsing. The ordered-neuron LSTM (ON-LSTM, Shen et al., 2019) proposes that the nodes that are closer to the root in the parse tree generate a longer span of terminals, and therefore should be less frequently "forgotten" than nodes that are farther away from the root. The difference in the frequency of forgetting is captured by a carefully designed master forget gate vector f̃, as shown in Figure 3 (in Appendix B). Formally:

Definition 4.5 (Master forget gates, Shen et al., 2019). Given the input sentence W = w_1 w_2 ... w_n and a trained ON-LSTM, running the ON-LSTM on W gives the master forget gates, which are a sequence of D-dimensional vectors {f̃_t}_{t=1}^n, in which at each position t, f̃_t = f̃_t(w_1, ..., w_t) ∈ [0, 1]^D. Moreover, let f̃_{t,j} represent the j-th dimension of f̃_t. The ON-LSTM architecture requires that f̃_{t,1} = 0, f̃_{t,D} = 1, and ∀i < j, f̃_{t,i} ≤ f̃_{t,j}.

When parsing a sentence, the real-valued master forget gate vector f̃_t at each position t is reduced to a single real number representing the syntactic distance d_t at position t (see (1)) (Shen et al., 2018a). The syntactic distances are then used to obtain a parse.

4.4 Transition-based parsing

In addition to outputting a single real-numbered distance or a vector at each position t, a left-to-right model can also parse a sentence by outputting a sequence of "transitions" at each position t, an idea proposed in some traditional parsing approaches (Sagae and Lavie, 2005; Chelba, 1997; Chelba and Jelinek, 2000) and also in some more recent neural parameterizations (Dyer et al., 2016).

We introduce several items of notation:

• z^t_i: the i-th transition performed when reading in w_t, the t-th token of the sentence W = w_1 w_2 ... w_n.
• N_t: the number of transitions performed between reading in the token w_t and reading in the next token w_{t+1}.
• Z_t: the sequence of transitions after reading in the prefix w_1 w_2 ... w_t of the sentence: Z_t = {(z^j_1, z^j_2, ..., z^j_{N_j}) | j = 1..t}.
• Z: the parse of the sentence W, with Z = Z_n.

We base our analysis on the approach introduced in the parsing version of (Dyer et al., 2016), though that work additionally proposes a generator version.³

³ Dyer et al. (2016) additionally propose some generator transitions. For simplicity, we analyze the simplest form: we only allow the model to return one parse, composed of the parser transitions, for a given input sentence. Note that this simplified variant still confers full representational power in the "full context" setting (see Section 7).

Definition 4.6 (Transition-based parser). A transition-based parser uses a stack (initialized to empty) and an input buffer (initialized with the sentence w_1 ... w_n). At each position t, based on a context c_t, the parser outputs a sequence of parsing transitions {z^t_i}_{i=1}^{N_t}, where each z^t_i can be one of the following transitions (Definition 4.7). The parsing stops when the stack contains one single constituent and the buffer is empty.

Definition 4.7 (Parser transitions, Dyer et al., 2016). A parsing transition can be one of the following three types:

• NT(X): pushes a non-terminal X onto the stack.
• SHIFT: removes the first terminal from the input buffer and pushes it onto the stack.
• REDUCE: pops from the stack until an open non-terminal is encountered, then pops this non-terminal and assembles everything popped to form a new constituent, labels this new constituent using this non-terminal, and finally pushes this new constituent onto the stack.
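Definitions 4.6 and 4.7 describe a deterministic stack machine once the transition sequence is fixed. The sketch below (our own illustrative code, not the implementation of Dyer et al., 2016) executes a given NT/SHIFT/REDUCE sequence and returns the resulting constituent as a nested bracketing; Appendix C walks through the same kind of transitions for "I drink coffee with milk".

```python
def execute_transitions(words, transitions):
    """Run parser transitions (Definition 4.7) over `words`.

    `transitions` is a list like [("NT", "S"), ("SHIFT",), ("REDUCE",), ...].
    Open non-terminals are kept on the stack as ("OPEN", label).
    """
    buffer = list(words)
    stack = []
    for tr in transitions:
        if tr[0] == "NT":
            stack.append(("OPEN", tr[1]))
        elif tr[0] == "SHIFT":
            stack.append(buffer.pop(0))
        elif tr[0] == "REDUCE":
            children = []
            while not (isinstance(stack[-1], tuple) and stack[-1][0] == "OPEN"):
                children.append(stack.pop())
            label = stack.pop()[1]                  # the matching open non-terminal
            stack.append((label, list(reversed(children))))
        else:
            raise ValueError(f"unknown transition {tr}")
    assert len(stack) == 1 and not buffer, "parse must end with one constituent"
    return stack[0]

tree = execute_transitions(
    ["I", "drink"],
    [("NT", "S"), ("NT", "N"), ("SHIFT",), ("REDUCE",),
     ("NT", "V"), ("SHIFT",), ("REDUCE",), ("REDUCE",)],
)
print(tree)  # ('S', [('N', ['I']), ('V', ['drink'])])
```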
In Appendix C, we provide an example of parsing the sentence "I drink coffee with milk" using the set of transitions given by Definition 4.7. The different context specifications and the corresponding representational power of the transition-based parser are discussed in Section 7.

5 Representational Power of Neural Syntactic Distance Methods

In this section we formalize the results on syntactic distance-based methods. Since the tree induction algorithm always generates a binary tree, we consider only PCFGs in Chomsky normal form (CNF) (Definition 4.4), so that the max likelihood parse of a sentence is also a binary tree structure.

To formalize the notion of "representing" a PCFG, we introduce the following definition:

Definition 5.1 (Representing a PCFG with syntactic distance). Let G be any PCFG in Chomsky normal form. A syntactic distance function d is said to be able to p-represent G if, for a set of sentences in L(G) whose total probability is at least p, d can correctly induce the tree structure of the max likelihood parse of these sentences without ambiguity.

Remark. Ambiguities could occur when, for example, there exists t such that d_t = d_{t+1}. In this case, the tree induction algorithm would have to break ties when determining the local structure for w_{t-1} w_t w_{t+1}. We preclude this possibility in Definition 5.1.

In the least restrictive setting, the whole sentence W, as well as the position index t, can be taken into consideration when determining each d_t. We prove that under this setting, there is a syntactic distance measure that can represent any PCFG.

Theorem 1 (Full context). Let c_t = (W, t). For each PCFG G in Chomsky normal form, there exists a syntactic distance measure d_t = d(w_{t-1}, w_t | c_t) that can 1-represent G.

Proof. For any sentence s = s_1 s_2 ... s_n ∈ L(G), let T be its max likelihood parse tree. Since G is in Chomsky normal form, T is a binary tree. We will describe an assignment of {d_t : 2 ≤ t ≤ n} such that their order matches the level at which the branches split in T. Specifically, ∀t ∈ [2, n], let a_t denote the lowest common ancestor of w_{t-1} and w_t in T. Let d′_t denote the shortest distance between a_t and the root of T. Finally, let d_t = n − d′_t. As a result, {d_t : 2 ≤ t ≤ n} induces T.

Remark. Since any PCFG can be converted to Chomsky normal form (Hopcroft et al., 2006), Theorem 1 implies that given the whole sentence and the position index as the context, the syntactic distance has sufficient representational power to capture any PCFG. It does not state, however, that the whole sentence and the position are the minimal contextual information needed for representability, nor does it address training (i.e. optimization) issues.
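The construction in the proof of Theorem 1 is easy to state as code: with full context, the distance at position t can simply be n minus the depth of the lowest common ancestor of w_{t-1} and w_t in the target tree. The following sketch is our own illustration, with the parse tree represented as nested 2-tuples of our choosing.

```python
def lca_depth_distances(tree, n):
    """Given a binary parse `tree` (nested 2-tuples with string leaves) over n
    tokens, return {t: d_t} for 2 <= t <= n, where d_t = n - depth(LCA(w_{t-1}, w_t)).
    """
    distances = {}
    next_leaf = [1]  # 1-based index of the next leaf to be visited

    def walk(node, depth):
        if isinstance(node, str):              # a leaf, i.e. one token w_t
            t = next_leaf[0]
            next_leaf[0] += 1
            return t
        left_last = walk(node[0], depth + 1)
        boundary_t = left_last + 1             # boundary between the two subtrees
        distances[boundary_t] = n - depth      # this node is LCA(w_{t-1}, w_t)
        return walk(node[1], depth + 1)

    walk(tree, 0)
    return distances

# ((w1 w2) (w3 w4)): the root separates w2|w3; deeper nodes separate w1|w2 and w3|w4.
print(lca_depth_distances((("w1", "w2"), ("w3", "w4")), n=4))
# {2: 3, 3: 4, 4: 3}  -- the maximum sits at the top-level split, as required.
```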
Note that ∀1 ≤ k1 < k2 ≤ m, the first m + 1 + L0 Theorem 3 (Lossless representation of a parse tokens of lk1 and lk2 are the same, which implies tree). For any sentence W = w1 w2 ...wn with that the inferred syntactic distances parse tree T in any PCFG in Chomsky normal form, there exists a dimensionality D ∈ Z+ , a sequence {dt : 2 ≤ t ≤ m + 1} of vectors {f˜t }nt=2 in which each f˜t ∈ [0, 1]D , such are the same for lk1 and lk2 at each position t. Thus, that the estimated syntactic distances via (1) induce it is impossible for d to induce the correct parse the structure of T . tree for both lk1 and lk2 . Hence, d is correct on at most one lk ∈ L(Gm,L0 ), which corresponds to Proof. By Theorem 1, there is a syntactic distance probability at most 1/m < . Therefore, d cannot measure {dt }nt=2 that induces the structure of T -represent Gm,L0 . (such that ∀t, dt 6= dt+1 ). For each t = 2..n, set dˆt = k if dt is the k-th Remark. In the counterexample, there are only m smallest entry in {dt }nt=2 , breaking ties arbitrar- possible parse structures for the prefix a1 a2 ...am+1 . ily. Then, each dˆt ∈ [1, n − 1], and {dˆt }nt=2 also Hence, the proved fact that the probability of be- induces the structure of T . ing correct is at most 1/m means that under the Let D = n − 1. For each t = 2..n, let f˜t = restrictions of unbounded look-back and bounded (0, ..., 0, 1, ..., 1) whose lower dˆt dimensions are 0 look-ahead, the distance cannot do better than ran- and higher D − dˆt dimensions are 1. Then, dom guessing for this grammar. D Remark. The above Theorem 2 formalizes the in- dˆft = D − X f˜t,j = D − (D − dˆt ) = dˆt . tuition discussed in (Htut et al., 2018) outlining j=1 an intrinsic limitation of only considering bounded context in one direction. Indeed, for the PCFG con- Therefore, the calculated {dˆft }nt=2 induces the structed in the proof, the failure is a function of the structure of T . context, not because of the fact that we are using a distance-based parser. Although Theorem 3 shows the ability of the Note that as a corollary of the above theorem, if master forget gates to perfectly represent any parse there is no context (ct = null) or the context is tree, a left-to-right parsing can be proved to be both bounded and unidirectional, i.e. unable to return the correct parse with high proba- ct = wt−L wt−L+1 ...wt−1 wt , bility. In the actual implementation in (Shen et al., 2019), the (real-valued) master forget gate vectors then there is a PCFG that cannot be -represented {f˜t }nt=1 are produced by feeding the input sentence by any such d. W = w1 w2 ...wn to a model trained with a lan- guage modeling objective. In other words, f˜t,j is 6 Representational Power of the Ordered calculated as a function of w1 , ..., wt , rather than Neuron Architecture the entire sentence. As such, this left-to-right parser is subject to In this section, we formalize the results character- similar limitations as in Theorem 2: izing the representational power of the ON-LSTM architecture. The master forget gates of the ON- Theorem 4 (Limitation of syntactic distance esti- LSTM, {f˜t }nt=2 in which each f˜t ∈ [0, 1]D , encode mation based on ON-LSTM). For any > 0, there the hierarchical structure of a parse tree, and Shen exists a PCFG G in Chomsky normal form, such et al. (2019) proposes to carry out unsupervised that the syntactic distance measure calculated with constituency parsing via a reduction from the gate (1), dˆft , cannot -represent G. vectors to syntactic distances by setting: D Proof. 
Since by Definition 4.5, f˜t,j is a function dˆft = D − X f˜t,j for t = 2..n (1) of w1 , ..., wt , the estimated syntactic distance dˆft j=1 is also a function of w1 , ..., wt . By Theorem 2, First we show that the gates in ON-LSTM in even with unbounded look-back context w1 , ..., wt , principle form a lossless representation of any there exists a PCFG for which the probability that parse tree. dˆft induces the correct parse is arbitrarily low. 2682
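The reduction (1) and the construction in Theorem 3 are both one-liners in code. The sketch below (our own toy illustration, not the ON-LSTM implementation) builds the step-shaped gate vector from a rank d̂_t and checks that the reduction recovers it.

```python
def gate_from_rank(d_hat: int, D: int):
    """Theorem 3 construction: lower d_hat dimensions 0, remaining D - d_hat dimensions 1."""
    assert 1 <= d_hat <= D
    return [0.0] * d_hat + [1.0] * (D - d_hat)

def distance_from_gate(f_tilde):
    """Equation (1): d^f_t = D - sum_j f_tilde[j]."""
    return len(f_tilde) - sum(f_tilde)

D = 4
for d_hat in range(1, D + 1):
    f = gate_from_rank(d_hat, D)
    assert distance_from_gate(f) == d_hat  # the arithmetic in (1) recovers the rank
print("all ranks recovered")
```

Theorem 4 then follows because, in the actual architecture, f̃_t is computed from the prefix w_1, ..., w_t only, so d̂^f_t inherits the unidirectional-context limitation of Theorem 2.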
7 Representational Power of Transition-Based Parsing

In this section, we analyze a transition-based parsing framework inspired by (Dyer et al., 2016; Chelba and Jelinek, 2000; Chelba, 1997).

Again, we first show that "full context" confers full representational power. Namely, using the terminology of Definition 4.6, we let the context c_t at each position t be the whole sentence W and the position index t. Note that any parse tree can be generated by a sequence of transitions defined in Definition 4.7. Indeed, Dyer et al. (2016) describe an algorithm to find such a sequence of transitions via a "depth-first, left-to-right traversal" of the tree.

Proceeding to limited context, in the setting of typical left-to-right parsing, the context c_t consists of all current and past tokens {w_j}_{j=1}^t and all previous parses {(z^j_1, ..., z^j_{N_j})}_{j=1}^t. We will again prove even stronger negative results, where we allow an optional look-ahead of L_0 input tokens to the right.

Theorem 5 (Limitation of transition-based parsing without full context). For any ε > 0, there exists a PCFG G in Chomsky normal form such that for any learned transition-based parser (Definition 4.6) based on context

c_t = ({w_j}_{j=1}^{t+L_0}, {(z^j_1, ..., z^j_{N_j})}_{j=1}^t),

the sum of the probabilities of the sentences in L(G) for which the parser returns the maximum likelihood parse is less than ε.

Proof. Let m > 1/ε be a positive integer. Consider the PCFG G_{m,L_0} in Definition 2.1. Note that ∀k, S → A_k B_k is applied to yield the string l_k. Then in the parse tree of l_k, a_k and a_{k+1} are the unique pair of adjacent terminals in a_1 a_2 ... a_{m+1} whose lowest common ancestor is the closest to the root. Thus, a different l_k requires a different sequence of transitions within the first m+1 input tokens, i.e. {z^t_i}_{i ≥ 1, 1 ≤ t ≤ m+1}.

For each w ∈ L(G_{m,L_0}), before the last token w_{m+2+L_0} is processed, based on the common prefix w_1 w_2 ... w_{m+1+L_0} = a_1 a_2 ... a_{m+1+L_0}, it is equally likely that w = l_k for each k, with probability 1/m each. Moreover, when processing w_{m+1}, the bounded look-ahead window of size L_0 does not allow access to the final input token w_{m+2+L_0} = c_k. Thus, ∀1 ≤ k_1 < k_2 ≤ m, it is impossible for the parser to return the correct parse tree for both l_{k_1} and l_{k_2} without ambiguity. Hence, the parse is correct on at most one l_k ∈ L(G), which corresponds to probability at most 1/m < ε.

8 Conclusion

In this work, we considered the representational power of two frameworks for constituency parsing prominent in the literature, based on learning a syntactic distance and on learning a sequence of iterative transitions to build the parse tree, in the sandbox of PCFGs. In particular, we show that if the context for calculating distances or deciding on transitions is limited on at least one side (which is typically the case in practice for existing architectures), there are PCFGs for which no good distance metric or sequence of transitions can be chosen to construct the maximum likelihood parse.

This limitation was already suspected in (Htut et al., 2018) as a potential failure mode of leading neural approaches like (Shen et al., 2018a, 2019), and we show formally that this is the case. The PCFGs with this property track the intuition that bounded-context methods will have issues when the parse at a certain position depends heavily on later parts of the sentence.

The conclusions thus suggest re-focusing our attention on methods like (Kim et al., 2019a), which have enjoyed greater success on tasks like unsupervised constituency parsing and do not fall into the paradigm analyzed in our paper. A question of definite further interest is how to augment models that have been successfully scaled up (e.g. BERT) with syntactic information in a principled manner, such that they can capture syntactic structure (like PCFGs). The other question of immediate importance is to understand the interaction between the syntactic and semantic modules in neural architectures: information is shared between such modules in various successful architectures, e.g. (Dyer et al., 2016; Shen et al., 2018a, 2019; Kim et al., 2019a), and the relative pros and cons of doing this are not well understood. Finally, our paper focuses purely on representational power and does not consider algorithmic and statistical aspects of training. As any model architecture comes with its distinct optimization and generalization considerations, and natural language data necessitates modeling the interaction between syntax and semantics, those aspects are well beyond the scope of our analysis using the controlled sandbox of PCFGs, and are interesting directions for future work.
References

Yonatan Bisk and Julia Hockenmaier. 2015. Probing the linguistic strengths and limitations of unsupervised grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1395–1404, Beijing, China. Association for Computational Linguistics.

Ciprian Chelba. 1997. A structured language model. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 498–500, Madrid, Spain. Association for Computational Linguistics.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech & Language, 14(4):283–332.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics.

Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.

Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2(2):137–167.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

F. Gers and J. Schmidhuber. 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities.

Qi He, Han Wang, and Yue Zhang. 2020. Enhancing generalization in natural language inference by syntax. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4973–4978, Online. Association for Computational Linguistics.

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2006. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA.

Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. Grammar induction with neural language models: An unusual replication. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 371–373, Brussels, Belgium. Association for Computational Linguistics.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? ArXiv, abs/1911.12246.

Yoon Kim, Chris Dyer, and Alexander Rush. 2019a. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2369–2385, Florence, Italy. Association for Computational Linguistics.

Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019b. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117, Minneapolis, Minnesota. Association for Computational Linguistics.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne, Australia. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for Computational Linguistics.

Deric Pang, Lucy H. Lin, and Noah A. Smith. 2019. Improving natural language inference with a pretrained parser.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems, volume 32, pages 8594–8603. Curran Associates, Inc.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia. Association for Computational Linguistics.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018b. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia. Association for Computational Linguistics.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations.

Hava T. Siegelmann and Eduardo D. Sontag. 1992. On the computational power of neural nets. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 440–449, New York, NY, USA. Association for Computing Machinery.

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, volume 28, pages 2773–2781. Curran Associates, Inc.

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.
A Tree Induction Algorithm Based on Syntactic Distance

The following algorithm is proposed in (Shen et al., 2018a) to create a parse tree based on a given syntactic distance.

Algorithm 1: Tree induction based on syntactic distance
  Data: sentence W = w_1 w_2 ... w_n; syntactic distances d_t = d(w_{t-1}, w_t | c_t), 2 ≤ t ≤ n
  Result: a parse tree for W
  Initialize the parse tree with a single node n_0 = w_1 w_2 ... w_n
  while there exists a leaf node n = w_i w_{i+1} ... w_j with i < j do
      find k ∈ argmax_{i+1 ≤ k ≤ j} d_k
      create the left child n_l and the right child n_r of n
      n_l ← w_i w_{i+1} ... w_{k-1}
      n_r ← w_k w_{k+1} ... w_j
  end
  return the parse tree rooted at n_0

B ON-LSTM Intuition

See Figure 3 below, which is excerpted from (Shen et al., 2019) with minor adaptation to the notation.

[Figure 3: Relationship between the parse tree, the block view, and the ON-LSTM. Excerpted from (Shen et al., 2019) with minor adaptation to the notation.]
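For completeness, here is a small Python rendering of Algorithm 1. It is a sketch under the convention that distances are passed as a dict {t: d_t} with 1-based positions; variable names are ours. It returns the induced binary tree as nested tuples.

```python
def induce_tree(words, d):
    """Greedy tree induction from syntactic distances (Algorithm 1).

    `words` is w_1 ... w_n (0-indexed list); `d` maps position t (2 <= t <= n,
    1-based) to the distance between w_{t-1} and w_t.
    """
    def build(i, j):  # 1-based inclusive span w_i ... w_j
        if i == j:
            return words[i - 1]
        k = max(range(i + 1, j + 1), key=lambda t: d[t])  # split at the largest distance
        return (build(i, k - 1), build(k, j))
    return build(1, len(words))

# Distances from the Theorem 1 construction for ((w1 w2) (w3 w4)) recover that tree:
print(induce_tree(["w1", "w2", "w3", "w4"], {2: 3, 3: 4, 4: 3}))
# (('w1', 'w2'), ('w3', 'w4'))
```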
C Examples of parsing transitions

Table 1 below shows an example of parsing the sentence "I drink coffee with milk" using the set of transitions given by Definition 4.7, which employs the parsing framework of (Dyer et al., 2016). The parse tree of the sentence is

(S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))

Table 1: Transition-based parsing of the sentence "I drink coffee with milk". Each row lists the stack (elements separated by "|"), the remaining buffer, and the action taken.

Stack || Buffer || Action
(empty) || I drink coffee with milk || NT(S)
(S || I drink coffee with milk || NT(NP)
(S | (NP || I drink coffee with milk || NT(N)
(S | (NP | (N || I drink coffee with milk || SHIFT
(S | (NP | (N | I || drink coffee with milk || REDUCE
(S | (NP (N I)) || drink coffee with milk || NT(VP)
(S | (NP (N I)) | (VP || drink coffee with milk || NT(V)
(S | (NP (N I)) | (VP | (V || drink coffee with milk || SHIFT
(S | (NP (N I)) | (VP | (V | drink || coffee with milk || REDUCE
(S | (NP (N I)) | (VP | (V drink) || coffee with milk || NT(NP)
(S | (NP (N I)) | (VP | (V drink) | (NP || coffee with milk || NT(NP)
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP || coffee with milk || NT(N)
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP | (N || coffee with milk || SHIFT
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP | (N | coffee || with milk || REDUCE
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) || with milk || NT(PP)
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP || with milk || NT(P)
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP | (P || with milk || SHIFT
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP | (P | with || milk || REDUCE
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP | (P with) || milk || NT(N)
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP | (P with) | (N || milk || SHIFT
(S | (NP (N I)) | (VP | (V drink) | (NP | (NP (N coffee)) | (PP | (P with) | (N | milk || (empty) || REDUCE

This is followed by the remaining REDUCE transitions, which close the open PP, NP, VP, and S and leave the single constituent

(S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))

on the stack with an empty buffer.