The Limitations of Limited Context for Constituency Parsing

Page created by Victor Fitzgerald

Cars & Machinery

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

The Limitations of Limited Context for Constituency Parsing

The Limitations of Limited Context for Constituency Parsing

                     Yuchen Li                                        Andrej Risteski
             Carnegie Mellon University                          Carnegie Mellon University
           yuchenl4@andrew.cmu.edu                             aristesk@andrew.cmu.edu

                     Abstract                               the neural techniques that have been scaled-up the
                                                            most and receive widespread usage do not explic-
    Incorporating syntax into neural approaches in          itly try to encode discrete structure that is natural
    NLP has a multitude of practical and scientific
                                                            to language, e.g. syntax. The reason for this is
    benefits. For instance, a language model that
    is syntax-aware is likely to be able to produce         perhaps not surprising: neural models have largely
    better samples; even a discriminative model             achieved substantial improvements in unsupervised
    like BERT with a syntax module could be used            settings, BERT (Devlin et al., 2019) being the de-
    for core NLP tasks like unsupervised syntactic          facto method for unsupervised pre-training in most
    parsing. Rapid progress in recent years was             NLP settings. On the other hand unsupervised syn-
    arguably spurred on by the empirical success            tactic tasks, e.g. unsupervised syntactic parsing,
    of the Parsing-Reading-Predict architecture of
                                                            have long been known to be very difficult tasks
    (Shen et al., 2018a), later simplified by the
    Order Neuron LSTM of (Shen et al., 2019).
                                                            (Htut et al., 2018). However, since incorporating
    Most notably, this is the first time neural ap-         syntax has been shown to improve language model-
    proaches were able to successfully perform un-          ing (Kim et al., 2019b) as well as natural language
    supervised syntactic parsing (evaluated by var-         inference (Chen et al., 2017; Pang et al., 2019; He
    ious metrics like F-1 score).                           et al., 2020), syntactic parsing remains important
    However, even heuristic (much less fully math-          even in the current era when large pre-trained mod-
    ematical) understanding of why and when                 els, like BERT (Devlin et al., 2019), are available.
    these architectures work is lagging severely               Arguably, the breakthrough works in unsuper-
    behind. In this work, we answer representa-             vised constituency parsing in a neural manner were
    tional questions raised by the architectures in         (Shen et al., 2018a, 2019), achieving F1 scores 42.8
    (Shen et al., 2018a, 2019), as well as some
                                                            and 49.4 on the WSJ Penn Treebank dataset (Htut
    transition-based syntax-aware language mod-
    els (Dyer et al., 2016): what kind of syntac-           et al., 2018; Shen et al., 2019). Both of these ar-
    tic structure can current neural approaches             chitectures, however (especially Shen et al., 2018a)
    to syntax represent? Concretely, we ground              are quite intricate, and it’s difficult to evaluate what
    this question in the sandbox of probabilistic           their representational power is (i.e. what kinds of
    context-free-grammars (PCFGs), and identify             structure can they recover). Moreover, as subse-
    a key aspect of the representational power              quent more thorough evaluations show (Kim et al.,
    of these approaches: the amount and direc-
                                                            2019b,a), these methods still have a rather large
    tionality of context that the predictor has ac-
    cess to when forced to make parsing deci-               performance gap with the oracle binary tree (which
    sion. We show that with limited context (either         is the best binary parse tree according to F1-score)
    bounded, or unidirectional), there are PCFGs,           — raising the question of what is missing in these
    for which these approaches cannot represent             methods.
    the max-likelihood parse; conversely, if the               We theoretically answer both questions raised
    context is unlimited, they can represent the            in the prior paragraph. We quantify the represen-
    max-likelihood parse of any PCFG.
                                                            tational power of two major frameworks in neural
                                                            approaches to syntax: learning a syntactic distance
1   Introduction
                                                            (Shen et al., 2018a,b, 2019) and learning to parse
Neural approaches have been steadily making their           through sequential transitions (Dyer et al., 2016;
way to NLP in recent years. By and large however,           Chelba, 1997). To formalize our results, we con-

                                                       2675
                Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
             and the 11th International Joint Conference on Natural Language Processing, pages 2675–2687
                           August 1–6, 2021. ©2021 Association for Computational Linguistics

sider the well-established sandbox of probabilistic       guistic intuitions about difficult syntactic structures
context-free grammars (PCFGs). Namely, we ask:            to infer: the parse depends on words that come
   When is a neural model based on a syntactic            much later in the sentence.
distance or transitions able to represent the max-
likelihood parse of a sentence generated from a           2     Overview of Results
PCFG?                                                     We consider several neural architectures that have
   We focus on a crucial “hyperparameter” com-            shown success in various syntactic tasks, most
mon to practical implementations of both families         notably unsupervised constituency parsing and
of methods that turns out to govern the representa-       syntax-aware language modeling. The general
tional power: the amount and type of context the          framework these architectures fall under is as fol-
model is allowed to use when making its predic-           lows: to parse a sentence W = w1 w2 ...wn with a
tions. Briefly, for every position t in the sentence,     trained neural model, the sentence W is input into
syntactic distance models learn a distance dt to the      the model, which outputs ot at each step t, and fi-
previous token — the tree is then inferred from this      nally all the outputs {ot }nt=1 are utilized to produce
distance; transition-based models iteratively con-        the parse.
struct the parse tree by deciding, at each position t,       Given unbounded time and space resources, by a
what operations to perform on a partial parse up to       seminal result of Siegelmann and Sontag (1992), an
token t. A salient feature of both is the context, that   RNN implementation of this framework is Turing
is, which tokens is dt a function of (correspond-         complete. In practice it is common to restrict the
ingly, which tokens can the choice of operations at       form of the output ot in some way. In this paper,
token t depend on)?                                       we consider the two most common approaches, in
   We show that when the context is either bounded        which ot is a real number representing a syntactic
(that is, dt only depends on a bounded window             distance (Section 2.1) (Shen et al., 2018a,b, 2019)
around the t-th token) or unidirectional (that is,        or a sequence of parsing operations (Section 2.2)
dt only considers the tokens to the left of the t-        (Chelba, 1997; Chelba and Jelinek, 2000; Dyer
th token), there are PCFGs for which no distance          et al., 2016). We proceed to describe our results for
metric (correspondingly, no algorithm to choose           each architecture in turn.
the sequence of transitions) works. On the other
                                                          2.1    Syntactic distance
hand, if the context is unbounded in both directions
then both methods work: that is, for any parse, we        Syntactic distance-based neural parsers train a neu-
can design a distance metric (correspondingly, a          ral network to learn a distance for each pair of
sequence of transitions) that recovers it.                adjacent words, depending on the context surround-
   This is of considerable importance: in practi-         ing the pair of words under consideration. The
cal implementations the context is either bounded         distances are then used to induce a tree structure
(e.g. in Shen et al., 2018a, the distance metric is       (Shen et al., 2018a,b).
parametrized by a convolutional kernel with a con-           For a sentence W = w1 w2 ...wn , the syntactic
stant width) or unidirectional (e.g. in Shen et al.,      distance between wt−1 and wt (2 ≤ t ≤ n) is
2019, the distance metric is computed by a LSTM,          defined as dt = d(wt−1 , wt | ct ), where ct is the
which performs a left-to-right computation).              context that dt takes into consideration 1 . We will
                                                          show that restricting the surrounding context ei-
   This formally confirms a conjecture of Htut et al.
                                                          ther in directionality, or in size, results in a poor
(2018), who suggested that because these models
                                                          representational power, while full context confers
commit to parsing decision in a left-to-right fash-
                                                          essentially perfect representational power with re-
ion and are trained as a part of a language model,
                                                          spect to PCFGs.
it may be difficult for them to capture sufficiently
                                                             Concretely, if the context is full, we show:
complex syntactic dependencies. Our techniques
are fairly generic and seem amenable to analyzing         Theorem (Informal, full context). For sentence W
other approaches to syntax. Finally, while the ex-        generated by any PCFG, if the computation of dt
istence of a particular PCFG that is problematic          has as context the full sentence and the position
for these methods doesn’t necessarily imply that          index under consideration, i.e. ct = (W, t) and
the difficulties will carry over to real-life data, the       1
                                                                Note that this is not a conditional distribution—we use
PCFGs that are used in our proofs closely track lin-      this notation for convenience.

                                                     2676

dt = d(wt−1 , wt | ct ), then dt can induce the maxi- combinatorial structure — e.g. a sequence of tran-
mum likelihood parse of W . sitions (Chelba, 1997; Chelba and Jelinek, 2000;
On the flipside, if the context is unidirectional Dyer et al., 2016). We adopt a simplification of the
(i.e. unbounded left-context from the start of neural parameterization in (Dyer et al., 2016) (see
the sentence, and even possibly with a bounded Definition 4.7).
look-ahead), the representational power becomes With full context, Dyer et al. (2016) describes
severely impoverished: an algorithm to find a sequence of transitions to
represent any parse tree, via a “depth-first, left-
Theorem (Informal, limitation of left-to-right pars- to-right traversal” of the tree. On the other hand,
ing via syntactic distance). There exists a PCFG G without full context, we prove that transition-based
such that for any distance measure dt whose com- parsing suffers from the same limitations:
putation incorporates only bounded context in at
Theorem (Informal, limitation of transition-based
least one direction (left or right), e.g.
parsing without full context). There exists a PCFG
ct = (w0 , w1 , ..., wt+L0 )
G, such that for any learned transition-based
dt = d(wt−1 , wt | ct ) parser with bounded context in at least one direc-
the probability that dt induces the max likelihood tion (left or right), the probability that it returns
parse is arbitrarily low. the max likelihood parse is arbitrarily low.
For a formal statement, see Section 7, and in
In practice, for computational efficiency,
particular Theorem 5.
parametrizations of syntactic distances fall into the
above assumptions of restricted context (Shen et al., Remark. There is no immediate connection be-
2018a). This puts the ability of these models to tween the syntactic distance-based approaches (in-
learn a complex PCFG syntax into considerable cluding ON-LSTM) and the transition-based pars-
doubt. For formal definitions, see Section 4.2. For ing framework, so the limitations of transition-
formal theorem statements and proofs, see Section based parsing does not directly imply the stated
5. negative results for syntactic distance or ON-
Subsequently we consider ON-LSTM, an archi- LSTM, and vice versa.
tecture proposed by Shen et al. (2019) improving 2.3 The counterexample family
their previous work (Shen et al., 2018a), which
Most of our theorems proving limitations on
also is based on learning a syntactic distance, but
bounded and unidirectional context are based on a
in (Shen et al., 2019) the distances are reduced from
PCFG family (Definition 2.1) which draws inspi-
the values of a carefully structured master forget
rations from natural language already suggested in
gate (see Section 6). While we show ON-LSTM
(Htut et al., 2018): later words in a sentence can
can in principle losslessly represent any parse tree
force different syntactic structures earlier in the
(Theorem 3), calculating the gate values in a left
sentence. For example, consider the two sentences:
to right fashion (as is done in practice) is subject
“I drink coffee with milk.” and “I drink coffee with
to the same limitations as the syntactic distance
friends.” Their only difference occurs at their very
approach:
last words, but their parses differ at some earlier
Theorem (Informal, limitation of syntactic dis- words in each sentence, too, as shown in Figure 1.
tance estimation based on ON-LSTM). There ex- To formalize this intuition, we define the follow-
ists a PCFG G for which the probability that the ing PCFG.
syntactic distance converted from an ON-LSTM
Definition 2.1 (Right-influenced PCFG). Let m ≥
induces the max likelihood parse is arbitrarily low.
2, L0 ≥ 1 be positive integers. The grammar Gm,L0
For a formal statement, see Section 6 and in has starting symbol S, other non-terminals
particular Theorem 4.
Ak , Bk , Alk , Ark , Bk0 for all k ∈ {1, 2, ..., m},
2.2 Transition-based parsing
and terminals
In principle, the output ot at each position t of
a left-to-right neural models for syntactic parsing ai for all i ∈ {1, 2, ..., m + 1 + L0 },
need not be restricted to a real-numbered distance
or a carefully structured vector. It can also be a cj for all j ∈ {1, 2, ..., m}.

2677

Figure 1: The parse trees of the two sentences: “I drink coffee with milk.” and “I drink coffee with friends.”. Their
only difference occurs at their very last words, but their parses differ at some earlier words in each sentence

Figure 2: The structure of the parse tree of string lk = a1 a2 ...am+1+L0 ck ∈ L(Gm,L0 ). Note that any lk1 and lk2
are almost the same except for the last token: the prefix a1 a2 ...am+1+L0 is shared among all strings in L(Gm,L0 ).
However, their parses differ with respect to where Ak is split. The last token ck is unique to lk and hence determines
the correct parse according to Gm,L0 .

The rules of the grammar are such that for the parsing algorithms we consider
that proceed in a “left-to-right” fashion on lk , be-
S → Ak Bk , ∀k ∈ {1, 2, . . . , m} w. prob.1/m fore processing the last token ck , it cannot infer the
Ak → Alk Ark w. prob. 1 syntactic structure of a1 a2 ...am+1 any better than
∗ randomly guessing one of the m possibilities. This
Alk → a1 a2 ...ak w. prob. 1
∗
is the main intuition behind Theorems 2 and 5.
Ark → ak+1 ak+2 ...am+1 w. prob. 1
∗
Remark. While our theorems focus on the limita-
Bk → Bk0 ck w. prob. 1 tion of “left-to-right” parsing, a symmetric argu-
Bk0 →∗ am+2 am+3 ...am+1+L0 w. prob. 1 ment implies the same limitation of “right-to-left”
parsing. Thus, our claim is that unidirectional con-
text (in either direction) limits the expressive power
in which →∗ means that the left expands into the of parsing models.
right through a sequence of rules that conform
to the requirements of the Chomsky normal form 3 Related Works
(CNF, Definition 4.4). Hence the grammar Gm,L0
is in CNF. Neural models for parsing were first successfully
The language of this grammar is implemented for supervised settings, e.g. (Vinyals
et al., 2015; Dyer et al., 2016; Shen et al., 2018b).
L(Gm,L0 )={lk =a1 a2 ...am+1+L0 ck : 1 ≤ k ≤ m}. Unsupervised tasks remained seemingly out of
reach, until the proposal of the Parsing-Reading-
The parse of an arbitrary lk is shown in Figure 2. Predict Network (PRPN) by Shen et al. (2018a),
Each lk corresponds to a unique parse determined whose performance was thoroughly verified by ex-
by the choice of k. The structure of this PCFG is tensive experiments in (Htut et al., 2018). The

2678

follow-up paper (Shen et al., 2019) introducing                      memory (Merrill, 2019; Hewitt et al., 2020). Our
the ON-LSTM architecture simplified radically the                    work contributes to this line of works, but focuses
architecture in (Shen et al., 2018a), while still ulti-              on the task of syntactic parsing instead.
mately attempting to fit a distance metric with the
help of carefully designed master forget gates. Sub-                 4     Preliminaries
sequent work by Kim et al. (2019a) departed from                     In this section, we define some basic concepts and
the usual way neural techniques are integrated in                    introduce the architectures we will consider.
NLP, with great success: they proposed a neural
parameterization for the EM algorithm for learning                   4.1    Probabilistic context-free grammar
a PCFG, but in a manner that leverages semantic                      First recall several definitions around formal lan-
information as well — achieving a large improve-                     guage, especially probabilistic context free gram-
ment on unsupervised parsing tasks.2                                 mar:
   In addition to constituency parsing, dependency                   Definition 4.1 (Probabilistic context-free grammar
parsing is another common task for syntactic pars-                   (PCFG)). Formally, a PCFG (Chomsky, 1956) is a
ing, but for our analyses on the ability of various                  5-tuple G = (Σ, N, S, R, Π) in which Σ is the set
approaches to represent the max-likelihood parse                     of terminals, N is the set of non-terminals, S ∈ N
of sentences generated from PCFGs, we focus on                       is the start symbol, R is the set of production rules
the task of constituency parsing. Moreover, it’s                     of the form r = (rL → rR ), where rL ∈ N , rR
important to note that there is another line of work                 is of the form B1 B2 ...Bm , m ∈ Z+ , and ∀i ∈
aiming to probe the ability of models trained with-                  {1, 2, ..., m}, Bi ∈ (Σ ∪ N ). Finally, Π : R 7→
out explicit syntactic consideration (e.g. BERT) to                  [0, 1] is the rule probability function, in which for
nevertheless discover some (rudimentary) syntactic                   any         r = (A → B1 B2 ...Bm ) ∈ R,
elements (Bisk and Hockenmaier, 2015; Linzen
et al., 2016; Choe and Charniak, 2016; Kuncoro                       Π(r) is the conditional probability
et al., 2018; Williams et al., 2018; Goldberg, 2019;                          P (rR = B1 B2 ...Bm | rL = A).
Htut et al., 2019; Hewitt and Manning, 2019; Reif                    Definition 4.2 (Parse tree). Let TG denote the set
et al., 2019). However, to-date, we haven’t been                     of parse trees that G can derive. Each t ∈ TG is
able to extract parse trees achieving scores that                    associated with yield(t) ∈ Σ∗ , the sequence of
are close to the oracle binarized trees on standard                  terminals composed of the leaves of t and PT (t) ∈
benchmarks (Kim et al., 2019b,a).                                    [0, 1], the probability of the parse tree, defined by
   Methodologically, our work is closely related to                  the product of the probabilities of the rules in the
a long line of works aiming to characterize the rep-                 derivation of t.
resentational power of neural models (e.g. RNNs,
                                                                     Definition 4.3 (Language and sentence). The lan-
LSTMs) through the lens of formal languages and
                                                                     guage of G is
formal models of computation. Some of the works
of this flavor are empirical in nature (e.g. LSTMs                     L(G) = {s ∈ Σ∗ : ∃t ∈ TG , yield(t) = s}.
have been shown to possess stronger abilities to
                                                                     Each s ∈ L(G) is called a sentence in L(G), and
recognize some context-free language and even
                                                                     is associated with the set of parses TG (s) = {t ∈
some context-sensitive language, compared with
                                                                     TG | yield(t) = s}, the set of max likelihood
simple RNNs (Gers and Schmidhuber, 2001; Suz-
                                                                     parses, arg
                                                                               Pmaxt∈TG (s) PT (t), and its probability
gun et al., 2019) or GRUs (Weiss et al., 2018; Suz-
                                                                     PS (s) = t∈TG (s) PT (t).
gun et al., 2019)); some results are theoretical in
nature (e.g. Siegelmann and Sontag (1992)’s proof                    Definition 4.4 (Chomsky normal form (CNF)). A
that with unbounded precision and unbounded time                     PCFG G = (Σ, N, S, R, Π) is in CNF (Chomsky,
complexity, RNNs are Turing-complete; related re-                    1959) if we require, in addition to Definition 4.1,
sults investigate RNNs with bounded precision and                    that each rule r ∈ R is in the form A → B1 B2
computation time (Weiss et al., 2018), as well as                    where B1 , B2 ∈ N \ {S}; A → a where a ∈
                                                                     Σ, a 6= ; or S →  which is only allowed if the
   2
     By virtue of not relying on bounded or unidirectional           empty string  ∈ L(G).
context, the Compound PCFG (Kim et al., 2019a) eschews the              Every PCFG G can be converted into a PCFG G0
techniques in our paper. Specifically, by employing a bidirec-
tional LSTM inference network in the process of constructing         in CNF such that L(G) = L(G0 ) (Hopcroft et al.,
a tree given a sentence, the parsing is no longer “left-to-right”.   2006).

                                                                2679

4.2 Syntactic distance 4.4 Transition-based parsing
The Parsing-Reading-Predict Networks (PRPN) In addition to outputting a single real numbered
(Shen et al., 2018a) is one of the leading approaches distance or a vector at each position t, a left-to-right
to unsupervised constituency parsing. The parsing model can also parse a sentence by outputting a
network (which computes the parse tree, hence the sequence of “transitions” at each position t, an idea
only part we focus on in our paper) is a convo- proposed in some traditional parsing approaches
lutional network that computes the syntactic dis- (Sagae and Lavie, 2005; Chelba, 1997; Chelba and
tances dt = d(wt−1 , wt ) (defined in Section 2.1) Jelinek, 2000), and also some more recent neural
based on the past L words. A deterministic greedy parameterization (Dyer et al., 2016).
tree induction algorithm is then used to produce We introduce several items of notation:
a parse tree as follows. First, we split the sen-
tence w1 ...wn into two constituents, w1 ...wt−1 and • zit : the i-th transition performed when reading
wt ...wn , where t ∈ argmax{dt }nt=2 and form the in wt , the t-th token of the sentence
left and right subtrees of t. We recursively repeat
W = w1 w2 ...wn .
this procedure for the newly created constituents.
An algorithmic form of this procedure is included • Nt : the number of transitions performed be-
as Algorithm 1 in Appendix A. tween reading in the token wt and reading in
Note that, due to the deterministic nature of the the next token wt+1 .
tree-induction process, the ability of PRPN to learn • Zt : the sequence of transitions after reading
a PCFG is completely contingent upon learning a in the prefix w1 w2 ...wt of the sentence.
good syntactic distance.
Zt = {(z1j , z2j , ..., zN
j
) | j = 1..t}.
4.3 The ordered neuron architecture j

Building upon the idea of representing the syntactic • Z: the parse of the sentence W . Z = Zn .
information with a real-valued distance measure
at each position, a simple extension is to associate We base our analysis on the approach introduced
each position with a learned vector, and then use the in the parsing version of (Dyer et al., 2016), though
vector for syntactic parsing. The ordered-neuron that work additionally proposes a generator ver-
LSTM (ON-LSTM, Shen et al., 2019) proposes sion. 3
that the nodes that are closer to the root in the Definition 4.6 (Transition-based parser). A
parse tree generate a longer span of terminals, and transition-based parser uses a stack (initialized to
therefore should be less frequently “forgotten” than empty) and an input buffer (initialized with the
nodes that are farther away from the root. The sentence w1 ...wt ). At each position t, based on a
difference in the frequency of forgetting is captured context ct , the parser outputs a sequence of parsing
by a carefully designed master forget gate vector f˜, transitions {zit }N t t
i=1 , where each zi can be one of
as shown in Figure 3 (in Appendix B). Formally: the following transitions (Definition 4.7). The
Definition 4.5 (Master forget gates, Shen et al., parsing stops when the stack contains one single
2019). Given the input sentence W = w1 w2 ...wn constituent, and the buffer is empty.
and a trained ON-LSTM, running the ON-LSTM Definition 4.7 (Parser transitions, Dyer et al.,
on W gives the master forget gates, which are a 2016). A parsing transition can be one of the fol-
sequence of D-dimensional vectors {f˜t }nt=1 , in lowing three types:
which at each position t, f˜t = f˜t (w1 , ..., wt ) ∈
• NT(X) pushes a non-terminal X onto the
[0, 1]D . Moreover, let f˜t,j represent the j-th dimen-
stack.
sion of f˜t . The ON-LSTM architectures requires
that f˜t,1 = 0, f˜t,D = 1, and • SHIFT: removes the first terminal from the
∀i < j, f˜t,i ≤ f˜t,j . input buffer and pushes onto the stack.
When parsing a sentence, the real-valued master 3
Dyer et al. (2016) additionally proposes some generator
forget gate vector f˜t at each position t is reduced to transitions. For simplicity, we analyze the simplest form: we
a single real number representing the syntactic dis- only allow the model to return one parse, composed of the
parser transitions, for a given input sentence. Note that this
tance dt at position t (see (1)) (Shen et al., 2018a). simplified variant still confers full representational power in
Then, use the syntactic distances to obtain a parse. the “full context” setting (see Section 7).

2680

• REDUCE: pops from the stack until an will describe an assignment of {dt : 2 ≤ t ≤ n}
open non-terminal is encountered, then pops such that their order matches the level at which the
this non-terminal and assembles everything branches split in T . Specifically, ∀t ∈ [2, n], let at
popped to form a new constituent, labels this denote the lowest common ancestor of wt−1 and wt
new constituent using this non-terminal, and in T . Let d0t denote the shortest distance between
finally pushes this new constituent onto the at and the root of T . Finally, let dt = n − d0t . As a
stack. result, {dt : 2 ≤ t ≤ n} induces T .

In Appendix Section C, we provide an example Remark. Since any PCFG can be converted to
of parsing the sentence “I drink coffee with milk” Chomsky normal form (Hopcroft et al., 2006), The-
using the set of transitions given by Definition 4.7. orem 1 implies that given the whole sentence and
The different context specifications and the corre- the position index as the context, the syntactic dis-
sponding representational powers of the transition- tance has sufficient representational power to cap-
based parser are discussed in Section 7. ture any PCFG. It does not state, however, that the
whole sentence and the position are the minimal
5 Representational Power of Neural contextual information needed for representabil-
Syntactic Distance Methods ity nor does it address training (i.e. optimization)
In this section we formalize the results on syntactic issues.
distance-based methods. Since the tree induction On the flipside, we show that restricting the con-
algorithm always generates a binary tree, we context even mildly can considerably decrease the rep-
sider only PCFGs in Chomsky normal form (CNF) resentational power. Namely, we show that if con-
(Definition 4.4) so that the max likelihood parse of text is bounded even in a single direction (to the
a sentence is also a binary tree structure. left or to the right), there are PCFGs on which any
To formalize the notion of “representing” a syntactic distance will perform poorly 4 . (Note in
PCFG, we introduce the following definition: the implementation (Shen et al., 2018a) the context
only considers a bounded window to the left.)
Definition 5.1 (Representing PCFG with syntactic
distance). Let G be any PCFG in Chomsky Normal Theorem 2 (Limitation of left-to-right parsing via
Form. A syntactic distance function d is said to be syntactic distance). Let w0 = hSi be the sentence
able to p-represent G if for a set of sentences in start symbol. Let the context
L(G) whose total probability is at least p, d can ct = (w0 , w1 , ..., wt+L0 ).
correctly induce the tree structure of the max likeli- ∀ > 0, there exists a PCFG G in Chomsky nor-
hood parse of these sentences without ambiguity. mal form, such that any syntactic distance measure
Remark. Ambiguities could occur when, for ex- dt = d(wt−1 , wt | ct ) cannot -represent G.
ample, there exists t such that dt = dt+1 . In this
case, the tree induction algorithm would have to Proof. Let m > 1/ be a positive integer. Consider
break ties when determining the local structure for the PCFG Gm,L0 in Definition 2.1.
wt−1 wt wt+1 . We preclude this possibility in Defi- For any k ∈ [m], consider the string lk ∈
nition 5.1. L(Gm,L0 ). Note that in the parse tree of lk , the
In the least restrictive setting, the whole sentence rule S → Ak Bk is applied. Hence, ak and ak+1
W , as well as the position index t can be taken into are the unique pair of adjacent non-terminals in
consideration when determining each dt . We prove a1 a2 ...am+1 whose lowest common ancestor is the
that under this setting, there is a syntactic distance closest to the root in the parse tree of lk . Then, in
measure that can represent any PCFG. order for the syntactic distance metric d to induce
the correct parse tree for lk , dk must be the unique
Theorem 1 (Full context). Let ct = (W, t). maximum in {dt : 2 ≤ t ≤ m + 1}.
For each PCFG G in Chomsky normal form, However, d is restricted to be in the form
there exists a syntactic distance measure dt =
d(wt−1 , wt | ct ) that can 1-represent G. dt = d(wt−1 , wt | w0 , w1 , ..., wt+L0 ).
4
Proof. For any sentence s = s1 s2 ...sn ∈ L(G), In Theorem 2 we prove the more typical case, i.e. un-
bounded left context and bounded right context. The other
let T be its max likelihood parse tree. Since G is case, i.e. bounded left context and unbounded right context,
in Chomsky normal form, T is a binary tree. We can be proved symmetrically.

2681

Note that ∀1 ≤ k1 < k2 ≤ m, the first m + 1 + L0           Theorem 3 (Lossless representation of a parse
tokens of lk1 and lk2 are the same, which implies          tree). For any sentence W = w1 w2 ...wn with
that the inferred syntactic distances                      parse tree T in any PCFG in Chomsky normal form,
                                                           there exists a dimensionality D ∈ Z+ , a sequence
               {dt : 2 ≤ t ≤ m + 1}
                                                           of vectors {f˜t }nt=2 in which each f˜t ∈ [0, 1]D , such
are the same for lk1 and lk2 at each position t. Thus,     that the estimated syntactic distances via (1) induce
it is impossible for d to induce the correct parse         the structure of T .
tree for both lk1 and lk2 . Hence, d is correct on
at most one lk ∈ L(Gm,L0 ), which corresponds to           Proof. By Theorem 1, there is a syntactic distance
probability at most 1/m < . Therefore, d cannot           measure {dt }nt=2 that induces the structure of T
-represent Gm,L0 .                                        (such that ∀t, dt 6= dt+1 ).
                                                              For each t = 2..n, set dˆt = k if dt is the k-th
Remark. In the counterexample, there are only m            smallest entry in {dt }nt=2 , breaking ties arbitrar-
possible parse structures for the prefix a1 a2 ...am+1 .   ily. Then, each dˆt ∈ [1, n − 1], and {dˆt }nt=2 also
Hence, the proved fact that the probability of be-         induces the structure of T .
ing correct is at most 1/m means that under the               Let D = n − 1. For each t = 2..n, let f˜t =
restrictions of unbounded look-back and bounded            (0, ..., 0, 1, ..., 1) whose lower dˆt dimensions are 0
look-ahead, the distance cannot do better than ran-        and higher D − dˆt dimensions are 1. Then,
dom guessing for this grammar.
                                                                            D
Remark. The above Theorem 2 formalizes the in-
                                                               dˆft = D −
                                                                            X
                                                                                  f˜t,j = D − (D − dˆt ) = dˆt .
tuition discussed in (Htut et al., 2018) outlining
                                                                            j=1
an intrinsic limitation of only considering bounded
context in one direction. Indeed, for the PCFG con-           Therefore, the calculated {dˆft }nt=2 induces the
structed in the proof, the failure is a function of the    structure of T .
context, not because of the fact that we are using a
distance-based parser.
                                                              Although Theorem 3 shows the ability of the
   Note that as a corollary of the above theorem, if       master forget gates to perfectly represent any parse
there is no context (ct = null) or the context is          tree, a left-to-right parsing can be proved to be
both bounded and unidirectional, i.e.                      unable to return the correct parse with high proba-
           ct = wt−L wt−L+1 ...wt−1 wt ,                   bility. In the actual implementation in (Shen et al.,
                                                           2019), the (real-valued) master forget gate vectors
then there is a PCFG that cannot be -represented          {f˜t }nt=1 are produced by feeding the input sentence
by any such d.                                             W = w1 w2 ...wn to a model trained with a lan-
                                                           guage modeling objective. In other words, f˜t,j is
6   Representational Power of the Ordered                  calculated as a function of w1 , ..., wt , rather than
    Neuron Architecture                                    the entire sentence.
                                                              As such, this left-to-right parser is subject to
In this section, we formalize the results character-
                                                           similar limitations as in Theorem 2:
izing the representational power of the ON-LSTM
architecture. The master forget gates of the ON-           Theorem 4 (Limitation of syntactic distance esti-
LSTM, {f˜t }nt=2 in which each f˜t ∈ [0, 1]D , encode      mation based on ON-LSTM). For any  > 0, there
the hierarchical structure of a parse tree, and Shen       exists a PCFG G in Chomsky normal form, such
et al. (2019) proposes to carry out unsupervised           that the syntactic distance measure calculated with
constituency parsing via a reduction from the gate         (1), dˆft , cannot -represent G.
vectors to syntactic distances by setting:
                       D
                                                           Proof. Since by Definition 4.5, f˜t,j is a function
          dˆft = D −
                       X
                             f˜t,j for t = 2..n     (1)
                                                           of w1 , ..., wt , the estimated syntactic distance dˆft
                       j=1
                                                           is also a function of w1 , ..., wt . By Theorem 2,
  First we show that the gates in ON-LSTM in               even with unbounded look-back context w1 , ..., wt ,
principle form a lossless representation of any            there exists a PCFG for which the probability that
parse tree.                                                dˆft induces the correct parse is arbitrarily low.

                                                      2682

7 Representational Power of 8 Conclusion
Transition-Based Parsing
In this section, we analyze a transition-based pars- In this work, we considered the representational
ing framework inspired by (Dyer et al., 2016; power of two frameworks for constituency pars-
Chelba and Jelinek, 2000; Chelba, 1997). ing prominent in the literature, based on learning
Again, we proceed to say first that “full context” a syntactic distance and learning a sequence of it-
confers full representational power. Namely, using erative transitions to build the parse tree — in the
the terminology of Definition 4.6, we let the context sandbox of PCFGs. In particular, we show that
ct at each position t be the whole sentence W and if the context for calculating distance/deciding on
the position index t. Note that any parse tree can be transitions is limited at least to one side (which is
generated by a sequence of transitions defined in typically the case in practice for existing architec-
Definition 4.7. Indeed, Dyer et al. (2016) describes tures), there are PCFGs for which no good distance
an algorithm to find such a sequence of transitions metric/sequence of transitions can be chosen to
via a “depth-first, left-to-right traversal” of the tree. construct the maximum likelihood parse.
Proceeding to limited context, in the setting of
typical left-to-right parsing, the context ct consists This limitation was already suspected in (Htut
of all current and past tokens {wj }tj=1 and all pre- et al., 2018) as a potential failure mode of leading
vious parses {(z1j , ..., zN
j
)}tj=1 . We’ll again prove neural approaches like (Shen et al., 2018a, 2019)
j
even stronger negative results, where we allow an and we show formally that this is the case. The
optional look-ahead to L0 input tokens to the right. PCFGs with this property track the intuition that
bounded context methods will have issues when
Theorem 5 (Limitation of transition-based parsing the parse at a certain position depends heavily on
without full context). For any > 0, there exists latter parts of the sentence.
a PCFG G in Chomsky normal form, such that
for any learned transition-based parser (Definition The conclusions thus suggest re-focusing our at-
4.6) based on context tention on methods like (Kim et al., 2019a) which
0
ct = ({wj }t+L j j t have enjoyed greater success on tasks like unsu-
j=1 , {(z1 , ..., zNj )}j=1 ),
pervised constituency parsing, and do not fall in
the sum of the probabilities of the sentences in
the paradigm analyzed in our paper. A question of
L(G) for which the parser returns the maximum
definite further interest is how to augment models
likelihood parse is less than .
that have been successfully scaled up (e.g. BERT)
Proof. Let m > 1/ be a positive integer. Consider in a principled manner with syntactic information,
the PCFG Gm,L0 in Definition 2.1. such that they can capture syntactic structure (like
Note that ∀k, S → Ak Bk is applied to yield PCFGs). The other question of immediate impor-
string lk . Then in the parse tree of lk , ak and tance is to understand the interaction between the
ak+1 are the unique pair of adjacent terminals in syntactic and semantic modules in neural architec-
a1 a2 ...am+1 whose lowest common ancestor is the tures — information is shared between such mod-
closest to the root. Thus, different lk requires a dif- ules in various successful architectures, e.g. (Dyer
ferent sequence of transitions within the first m + 1 et al., 2016; Shen et al., 2018a, 2019; Kim et al.,
input tokens, i.e. {zit }i≥1, 1≤t≤m+1 . 2019a), and the relative pros and cons of doing this
For each w ∈ L(Gm,L0 ), before the last to- are not well understood. Finally, our paper purely
ken wm+2+L0 is processed, based on the common focuses on representational power, and does not
prefix w1 w2 ...wm+1+L0 = a1 a2 ...am+1+L0 , it is consider algorithmic and statistical aspects of train-
equally likely that w = lk , ∀k, w. prob. 1/m each. ing. As any model architecture is associated with
Moreover, when processing wm+1 , the bounded its distinct optimization and generalization consid-
look-ahead window of size L0 does not allow access erations, and natural language data necessitates the
to the final input token am+2+L0 = ck . modeling of the interaction between syntax and
Thus, ∀1 ≤ k1 < k2 ≤ m, it is impossible semantics, those aspects of considerations are well
for the parser to return the correct parse tree for beyond the scope of our analysis in this paper using
both lk1 and lk2 without ambiguity. Hence, the the controlled sandbox of PCFGs, and are interest-
parse is correct on at most one lk ∈ L(G), which ing directions for future work.
corresponds to probability at most 1/m < .

2683

References Yoav Goldberg. 2019. Assessing bert’s syntactic abili-
ties.
Yonatan Bisk and Julia Hockenmaier. 2015. Probing
the linguistic strengths and limitations of unsuper- Qi He, Han Wang, and Yue Zhang. 2020. Enhanc-
vised grammar induction. In Proceedings of the ing generalization in natural language inference by
53rd Annual Meeting of the Association for Compu- syntax. In Findings of the Association for Computa-
tational Linguistics and the 7th International Joint tional Linguistics: EMNLP 2020, pages 4973–4978,
Conference on Natural Language Processing (Vol- Online. Association for Computational Linguistics.
ume 1: Long Papers), pages 1395–1404, Beijing,
China. Association for Computational Linguistics. John Hewitt, Michael Hahn, Surya Ganguli, Percy
Liang, and Christopher D. Manning. 2020. RNNs
Ciprian Chelba. 1997. A structured language model. can generate bounded hierarchical languages with
In 35th Annual Meeting of the Association for Com- optimal memory. In Proceedings of the 2020 Con-
putational Linguistics and 8th Conference of the ference on Empirical Methods in Natural Language
European Chapter of the Association for Computa- Processing (EMNLP), pages 1978–2010, Online. As-
tional Linguistics, pages 498–500, Madrid, Spain. sociation for Computational Linguistics.
Association for Computational Linguistics.
John Hewitt and Christopher D. Manning. 2019. A
Ciprian Chelba and Frederick Jelinek. 2000. Struc- structural probe for finding syntax in word repre-
tured language modeling. Computer Speech & Lan- sentations. In Proceedings of the 2019 Conference
guage, 14(4):283 – 332. of the North American Chapter of the Association
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui for Computational Linguistics: Human Language
Jiang, and Diana Inkpen. 2017. Enhanced LSTM Technologies, Volume 1 (Long and Short Papers),
for natural language inference. In Proceedings of pages 4129–4138, Minneapolis, Minnesota. Associ-
the 55th Annual Meeting of the Association for Com- ation for Computational Linguistics.
putational Linguistics (Volume 1: Long Papers),
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ull-
pages 1657–1668, Vancouver, Canada. Association
man. 2006. Introduction to Automata Theory, Lan-
for Computational Linguistics.
guages, and Computation (3rd Edition). Addison-
Do Kook Choe and Eugene Charniak. 2016. Parsing Wesley Longman Publishing Co., Inc., USA.
as language modeling. In Proceedings of the 2016
Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman.
Conference on Empirical Methods in Natural Lan-
2018. Grammar induction with neural language
guage Processing, pages 2331–2336, Austin, Texas.
models: An unusual replication. In Proceedings
Association for Computational Linguistics.
of the 2018 EMNLP Workshop BlackboxNLP: An-
N. Chomsky. 1956. Three models for the description of alyzing and Interpreting Neural Networks for NLP,
language. IRE Transactions on Information Theory, pages 371–373, Brussels, Belgium. Association for
2(3):113–124. Computational Linguistics.

Noam Chomsky. 1959. On certain formal properties Phu Mon Htut, Jason Phang, Shikha Bordia, and
of grammars. Information and Control, 2(2):137 – Samuel R. Bowman. 2019. Do attention heads
167. in bert track syntactic dependencies? ArXiv,
abs/1911.12246.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of Yoon Kim, Chris Dyer, and Alexander Rush. 2019a.
deep bidirectional transformers for language under- Compound probabilistic context-free grammars for
standing. In Proceedings of the 2019 Conference grammar induction. In Proceedings of the 57th An-
of the North American Chapter of the Association nual Meeting of the Association for Computational
for Computational Linguistics: Human Language Linguistics, pages 2369–2385, Florence, Italy. Asso-
Technologies, Volume 1 (Long and Short Papers), ciation for Computational Linguistics.
pages 4171–4186, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics. Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kun-
coro, Chris Dyer, and Gábor Melis. 2019b. Unsu-
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, pervised recurrent neural network grammars. In Pro-
and Noah A. Smith. 2016. Recurrent neural network ceedings of the 2019 Conference of the North Amer-
grammars. In Proceedings of the 2016 Conference ican Chapter of the Association for Computational
of the North American Chapter of the Association Linguistics: Human Language Technologies, Vol-
for Computational Linguistics: Human Language ume 1 (Long and Short Papers), pages 1105–1117,
Technologies, pages 199–209, San Diego, California. Minneapolis, Minnesota. Association for Computa-
Association for Computational Linguistics. tional Linguistics.

F. Gers and J. Schmidhuber. 2001. Lstm recur- Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yo-
rent networks learn simple context-free and context- gatama, Stephen Clark, and Phil Blunsom. 2018.
sensitive languages. IEEE transactions on neural LSTMs can learn syntax-sensitive dependencies
networks, 12 6:1333–40. well, but modeling structure makes them better. In

2684

Proceedings of the 56th Annual Meeting of the As- Building Bridges, pages 44–54, Florence. Associa-
sociation for Computational Linguistics (Volume 1: tion for Computational Linguistics.
Long Papers), pages 1426–1436, Melbourne, Aus-
tralia. Association for Computational Linguistics. Oriol Vinyals, Ł ukasz Kaiser, Terry Koo, Slav Petrov,
Ilya Sutskever, and Geoffrey Hinton. 2015. Gram-
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. mar as a foreign language. In Advances in Neural
2016. Assessing the ability of LSTMs to learn Information Processing Systems, volume 28, pages
syntax-sensitive dependencies. Transactions of the 2773–2781. Curran Associates, Inc.
Association for Computational Linguistics, 4:521–
535. Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018.
On the practical computational power of finite pre-
William Merrill. 2019. Sequential neural networks cision RNNs for language recognition. In Proceed-
as automata. In Proceedings of the Workshop on ings of the 56th Annual Meeting of the Association
Deep Learning and Formal Languages: Building for Computational Linguistics (Volume 2: Short Pa-
Bridges, pages 1–13, Florence. Association for Com- pers), pages 740–745, Melbourne, Australia. Asso-
putational Linguistics. ciation for Computational Linguistics.
Deric Pang, Lucy H. Lin, and Noah A. Smith. 2019. Adina Williams, Andrew Drozdov, and Samuel R.
Improving natural language inference with a pre- Bowman. 2018. Do latent tree learning models iden-
trained parser. tify meaningful structure in sentences? Transac-
tions of the Association for Computational Linguis-
Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B tics, 6:253–267.
Viegas, Andy Coenen, Adam Pearce, and Been Kim.
2019. Visualizing and measuring the geometry of
bert. In Advances in Neural Information Processing
Systems, volume 32, pages 8594–8603. Curran As-
sociates, Inc.
Kenji Sagae and Alon Lavie. 2005. A classifier-based
parser with linear run-time complexity. In Proceed-
ings of the Ninth International Workshop on Pars-
ing Technology, pages 125–132, Vancouver, British
Columbia. Association for Computational Linguis-
tics.
Yikang Shen, Zhouhan Lin, Chin wei Huang, and
Aaron Courville. 2018a. Neural language modeling
by jointly learning syntax and lexicon. In Interna-
tional Conference on Learning Representations.
Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessan-
dro Sordoni, Aaron Courville, and Yoshua Bengio.
2018b. Straight to the tree: Constituency parsing
with neural syntactic distance. In Proceedings of the
56th Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers), pages
1171–1180, Melbourne, Australia. Association for
Computational Linguistics.
Yikang Shen, Shawn Tan, Alessandro Sordoni, and
Aaron Courville. 2019. Ordered neurons: Integrat-
ing tree structures into recurrent neural networks. In
International Conference on Learning Representa-
tions.
Hava T. Siegelmann and Eduardo D. Sontag. 1992. On
the computational power of neural nets. In Proceed-
ings of the Fifth Annual Workshop on Computational
Learning Theory, COLT ’92, page 440–449, New
York, NY, USA. Association for Computing Machin-
ery.
Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and
Sebastian Gehrmann. 2019. LSTM networks can
perform dynamic counting. In Proceedings of the
Workshop on Deep Learning and Formal Languages:

2685

A    Tree Induction Algorithm Based on Syntactic Distance
The following algorithm is proposed in (Shen et al., 2018a) to create a parse tree based on a given syntactic
distance.

 Algorithm 1: Tree induction based on syntactic distance
  Data: Sentence W = w1 w2 ...wn , syntactic distances dt = d(wt−1 , wt | ct ), 2 ≤ t ≤ n
  Result: A parse tree for W
  Initialize the parse tree with a single node n0 = w1 w2 ...wn ;
  while ∃ leaf node n = wi wi+1 ...wj where i < j do
       Find k ∈ arg maxi+1≤k≤j dk ;
       Create the left child nl and the right child nr of n ;
       nl ← wi wi+1 ...wk−1 ;
       nr ← wk wk+1 ...wj ;
  end
  return The parse tree rooted at n0 .

B   ON-LSTM Intuition
See Figure 3 below, which is excerpted from (Shen et al., 2019) with minor adaptation to the notation.

Figure 3: Relationship between the parse tree, the block view, and the ON-LSTM. Excerpted from (Shen et al.,
2019) with minor adaptation to the notation.

                                                   2686

C   Examples of parsing transitions
Table 1 below shows an example of parsing the sentence “I drink coffee with milk” using the set of
transitions given by Definition 4.7, which employs the parsing framework of (Dyer et al., 2016). The
parse tree of the sentence is given by
                                             S

                                     NP            VP

                                      N
                                             V              NP
                                      I
                                           drink    NP               PP

                                                     N           P         N

                                                   coffee    with         milk

       Stack                                                         Buffer                     Action
                                                                     I drink coffee with milk   NT(S)
       (S                                                            I drink coffee with milk   NT(NP)
       (S | (NP                                                      I drink coffee with milk   NT(N)
       (S | (NP | (N                                                 I drink coffee with milk   SHIFT
       (S | (NP | (N | I                                             drink coffee with milk     REDUCE
       (S | (NP (N I))                                               drink coffee with milk     NT(VP)
       (S | (NP (N I)) | (VP                                         drink coffee with milk     NT(V)
       (S | (NP (N I)) | (VP | (V                                    drink coffee with milk     SHIFT
       (S | (NP (N I)) | (VP | (V | drink                            coffee with milk           REDUCE
       (S | (NP (N I)) | (VP | (V drink)                             coffee with milk           NT(NP)
       (S | (NP (N I)) | (VP | (V | drink) |
          (NP                                                        coffee with milk           NT(NP)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP                                                  coffee with milk           NT(N)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP | (N                                             coffee with milk           SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP | (N | coffee                                    with milk                  REDUCE
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee))                                      with milk                  NT(PP)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP                                with milk                  NT(P)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P                           with milk                  SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P | with                    milk                       REDUCE
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with)                     milk                       NT(N)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with) | (N                milk                       SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with) | (N | milk                                    REDUCE
       (S (NP (N I)) (VP (V drink)
          (NP (NP (N coffee)) (PP (P with) (N milk)))))

                Table 1: Transition-based parsing of the sentence “I drink coffee with milk”.

                                                   2687

You can also read