The Limitations of Limited Context for Constituency Parsing


Yuchen Li (Carnegie Mellon University, yuchenl4@andrew.cmu.edu)
Andrej Risteski (Carnegie Mellon University, aristesk@andrew.cmu.edu)

Abstract

Incorporating syntax into neural approaches in NLP has a multitude of practical and scientific benefits. For instance, a language model that is syntax-aware is likely to be able to produce better samples; even a discriminative model like BERT with a syntax module could be used for core NLP tasks like unsupervised syntactic parsing. Rapid progress in recent years was arguably spurred on by the empirical success of the Parsing-Reading-Predict architecture of (Shen et al., 2018a), later simplified by the Order Neuron LSTM of (Shen et al., 2019). Most notably, this is the first time neural approaches were able to successfully perform unsupervised syntactic parsing (evaluated by various metrics like F-1 score).

However, even heuristic (much less fully mathematical) understanding of why and when these architectures work is lagging severely behind. In this work, we answer representational questions raised by the architectures in (Shen et al., 2018a, 2019), as well as some transition-based syntax-aware language models (Dyer et al., 2016): what kind of syntactic structure can current neural approaches to syntax represent? Concretely, we ground this question in the sandbox of probabilistic context-free grammars (PCFGs), and identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to when forced to make parsing decisions. We show that with limited context (either bounded or unidirectional), there are PCFGs for which these approaches cannot represent the max-likelihood parse; conversely, if the context is unlimited, they can represent the max-likelihood parse of any PCFG.

1   Introduction

Neural approaches have been steadily making their way to NLP in recent years. By and large however, the neural techniques that have been scaled up the most and receive widespread usage do not explicitly try to encode discrete structure that is natural to language, e.g. syntax. The reason for this is perhaps not surprising: neural models have largely achieved substantial improvements in unsupervised settings, BERT (Devlin et al., 2019) being the de-facto method for unsupervised pre-training in most NLP settings. On the other hand, unsupervised syntactic tasks, e.g. unsupervised syntactic parsing, have long been known to be very difficult (Htut et al., 2018). However, since incorporating syntax has been shown to improve language modeling (Kim et al., 2019b) as well as natural language inference (Chen et al., 2017; Pang et al., 2019; He et al., 2020), syntactic parsing remains important even in the current era when large pre-trained models, like BERT (Devlin et al., 2019), are available.

Arguably, the breakthrough works in unsupervised constituency parsing in a neural manner were (Shen et al., 2018a, 2019), achieving F1 scores of 42.8 and 49.4 on the WSJ Penn Treebank dataset (Htut et al., 2018; Shen et al., 2019). Both of these architectures, however (especially Shen et al., 2018a), are quite intricate, and it's difficult to evaluate what their representational power is (i.e. what kinds of structure they can recover). Moreover, as subsequent more thorough evaluations show (Kim et al., 2019b,a), these methods still have a rather large performance gap with the oracle binary tree (which is the best binary parse tree according to F1-score), raising the question of what is missing in these methods.

We theoretically answer both questions raised in the prior paragraph. We quantify the representational power of two major frameworks in neural approaches to syntax: learning a syntactic distance (Shen et al., 2018a,b, 2019) and learning to parse through sequential transitions (Dyer et al., 2016; Chelba, 1997).

To formalize our results, we consider the well-established sandbox of probabilistic context-free grammars (PCFGs). Namely, we ask:

When is a neural model based on a syntactic distance or transitions able to represent the max-likelihood parse of a sentence generated from a PCFG?

We focus on a crucial "hyperparameter" common to practical implementations of both families of methods that turns out to govern the representational power: the amount and type of context the model is allowed to use when making its predictions. Briefly, for every position $t$ in the sentence, syntactic distance models learn a distance $d_t$ to the previous token — the tree is then inferred from this distance; transition-based models iteratively construct the parse tree by deciding, at each position $t$, what operations to perform on a partial parse up to token $t$. A salient feature of both is the context, that is, which tokens is $d_t$ a function of (correspondingly, which tokens can the choice of operations at token $t$ depend on)?

We show that when the context is either bounded (that is, $d_t$ only depends on a bounded window around the $t$-th token) or unidirectional (that is, $d_t$ only considers the tokens to the left of the $t$-th token), there are PCFGs for which no distance metric (correspondingly, no algorithm to choose the sequence of transitions) works. On the other hand, if the context is unbounded in both directions, then both methods work: that is, for any parse, we can design a distance metric (correspondingly, a sequence of transitions) that recovers it.

This is of considerable importance: in practical implementations the context is either bounded (e.g. in Shen et al., 2018a, the distance metric is parametrized by a convolutional kernel with a constant width) or unidirectional (e.g. in Shen et al., 2019, the distance metric is computed by an LSTM, which performs a left-to-right computation).

This formally confirms a conjecture of Htut et al. (2018), who suggested that because these models commit to parsing decisions in a left-to-right fashion and are trained as part of a language model, it may be difficult for them to capture sufficiently complex syntactic dependencies. Our techniques are fairly generic and seem amenable to analyzing other approaches to syntax. Finally, while the existence of a particular PCFG that is problematic for these methods doesn't necessarily imply that the difficulties will carry over to real-life data, the PCFGs that are used in our proofs closely track linguistic intuitions about syntactic structures that are difficult to infer: the parse depends on words that come much later in the sentence.

2   Overview of Results

We consider several neural architectures that have shown success in various syntactic tasks, most notably unsupervised constituency parsing and syntax-aware language modeling. The general framework these architectures fall under is as follows: to parse a sentence $W = w_1 w_2 \ldots w_n$ with a trained neural model, the sentence $W$ is input into the model, which outputs $o_t$ at each step $t$, and finally all the outputs $\{o_t\}_{t=1}^{n}$ are utilized to produce the parse.

Given unbounded time and space resources, by a seminal result of Siegelmann and Sontag (1992), an RNN implementation of this framework is Turing complete. In practice it is common to restrict the form of the output $o_t$ in some way. In this paper, we consider the two most common approaches, in which $o_t$ is a real number representing a syntactic distance (Section 2.1) (Shen et al., 2018a,b, 2019) or a sequence of parsing operations (Section 2.2) (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We proceed to describe our results for each architecture in turn.

2.1   Syntactic distance

Syntactic distance-based neural parsers train a neural network to learn a distance for each pair of adjacent words, depending on the context surrounding the pair of words under consideration. The distances are then used to induce a tree structure (Shen et al., 2018a,b).

For a sentence $W = w_1 w_2 \ldots w_n$, the syntactic distance between $w_{t-1}$ and $w_t$ ($2 \le t \le n$) is defined as $d_t = d(w_{t-1}, w_t \mid c_t)$, where $c_t$ is the context that $d_t$ takes into consideration.^1 We will show that restricting the surrounding context, either in directionality or in size, results in poor representational power, while full context confers essentially perfect representational power with respect to PCFGs.

^1 Note that this is not a conditional distribution — we use this notation for convenience.

Concretely, if the context is full, we show:

Theorem (Informal, full context). For a sentence $W$ generated by any PCFG, if the computation of $d_t$ has as context the full sentence and the position index under consideration, i.e. $c_t = (W, t)$ and $d_t = d(w_{t-1}, w_t \mid c_t)$, then $d_t$ can induce the maximum likelihood parse of $W$.

On the flipside, if the context is unidirectional (i.e. unbounded left-context from the start of the sentence, and even possibly with a bounded look-ahead), the representational power becomes severely impoverished:

Theorem (Informal, limitation of left-to-right parsing via syntactic distance). There exists a PCFG $G$ such that for any distance measure $d_t$ whose computation incorporates only bounded context in at least one direction (left or right), e.g.

$$c_t = (w_0, w_1, \ldots, w_{t+L_0}), \qquad d_t = d(w_{t-1}, w_t \mid c_t),$$

the probability that $d_t$ induces the max likelihood parse is arbitrarily low.

In practice, for computational efficiency, parametrizations of syntactic distances fall into the above assumptions of restricted context (Shen et al., 2018a). This puts the ability of these models to learn a complex PCFG syntax into considerable doubt. For formal definitions, see Section 4.2. For formal theorem statements and proofs, see Section 5.
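To make the distinction between context regimes concrete, the sketch below (our illustration, not code from any of the cited papers; toy_score and all other names are hypothetical) writes the three regimes as plain Python signatures. The only difference between them is which slice of the sentence the scorer is allowed to read when producing $d_t$.

```python
# Illustrative only: the point is which tokens each d_t may depend on,
# not how the score itself is computed.

def toy_score(visible_tokens, boundary):
    # Stand-in for a trained scorer; a hash keeps the sketch runnable.
    return hash((tuple(visible_tokens), boundary)) % 100 / 100.0

def d_full_context(sentence, t):
    """Full context: d_t may read the whole sentence W and the position t."""
    return toy_score(sentence, t)

def d_left_to_right(sentence, t, lookahead=0):
    """Unidirectional context: d_t reads w_1 .. w_{t+L0} only (left-to-right, bounded look-ahead)."""
    return toy_score(sentence[: t + lookahead], t)

def d_bounded_window(sentence, t, width=2):
    """Bounded context: d_t reads a constant-width window ending at w_t (cf. a convolutional parametrization)."""
    return toy_score(sentence[max(0, t - width) : t], t)

sentence = "I drink coffee with milk".split()
print([round(d_left_to_right(sentence, t, lookahead=1), 2) for t in range(2, len(sentence) + 1)])
```

The negative results below apply to the second and third signatures; the positive result (Theorem 1) applies to the first.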
Subsequently we consider ON-LSTM, an architecture proposed by Shen et al. (2019) improving their previous work (Shen et al., 2018a), which is also based on learning a syntactic distance, but in (Shen et al., 2019) the distances are reduced from the values of a carefully structured master forget gate (see Section 6). While we show that ON-LSTM can in principle losslessly represent any parse tree (Theorem 3), calculating the gate values in a left-to-right fashion (as is done in practice) is subject to the same limitations as the syntactic distance approach:

Theorem (Informal, limitation of syntactic distance estimation based on ON-LSTM). There exists a PCFG $G$ for which the probability that the syntactic distance converted from an ON-LSTM induces the max likelihood parse is arbitrarily low.

For a formal statement, see Section 6 and in particular Theorem 4.

2.2   Transition-based parsing

In principle, the output $o_t$ at each position $t$ of a left-to-right neural model for syntactic parsing need not be restricted to a real-numbered distance or a carefully structured vector. It can also be a combinatorial structure — e.g. a sequence of transitions (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We adopt a simplification of the neural parameterization in (Dyer et al., 2016) (see Definition 4.7).

With full context, Dyer et al. (2016) describes an algorithm to find a sequence of transitions to represent any parse tree, via a "depth-first, left-to-right traversal" of the tree. On the other hand, without full context, we prove that transition-based parsing suffers from the same limitations:

Theorem (Informal, limitation of transition-based parsing without full context). There exists a PCFG $G$, such that for any learned transition-based parser with bounded context in at least one direction (left or right), the probability that it returns the max likelihood parse is arbitrarily low.

For a formal statement, see Section 7, and in particular Theorem 5.

Remark. There is no immediate connection between the syntactic distance-based approaches (including ON-LSTM) and the transition-based parsing framework, so the limitations of transition-based parsing do not directly imply the stated negative results for syntactic distance or ON-LSTM, and vice versa.

2.3   The counterexample family

Most of our theorems proving limitations on bounded and unidirectional context are based on a PCFG family (Definition 2.1), which draws inspiration from an observation about natural language already suggested in (Htut et al., 2018): later words in a sentence can force different syntactic structures earlier in the sentence. For example, consider the two sentences "I drink coffee with milk." and "I drink coffee with friends." Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence, too, as shown in Figure 1.

To formalize this intuition, we define the following PCFG.

Definition 2.1 (Right-influenced PCFG). Let $m \ge 2$, $L_0 \ge 1$ be positive integers. The grammar $G_{m,L_0}$ has starting symbol $S$, other non-terminals

$$A_k, B_k, A_k^l, A_k^r, B_k' \quad \text{for all } k \in \{1, 2, \ldots, m\},$$

and terminals

$$a_i \text{ for all } i \in \{1, 2, \ldots, m+1+L_0\}, \qquad c_j \text{ for all } j \in \{1, 2, \ldots, m\}.$$

Figure 1: The parse trees of the two sentences “I drink coffee with milk.” and “I drink coffee with friends.” Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence.

Figure 2: The structure of the parse tree of the string $l_k = a_1 a_2 \ldots a_{m+1+L_0} c_k \in L(G_{m,L_0})$. Note that any $l_{k_1}$ and $l_{k_2}$ are almost the same except for the last token: the prefix $a_1 a_2 \ldots a_{m+1+L_0}$ is shared among all strings in $L(G_{m,L_0})$. However, their parses differ with respect to where $A_k$ is split. The last token $c_k$ is unique to $l_k$ and hence determines the correct parse according to $G_{m,L_0}$.

The rules of the grammar are

$$
\begin{aligned}
S &\to A_k B_k, \ \forall k \in \{1, 2, \ldots, m\} && \text{w. prob. } 1/m \\
A_k &\to A_k^l A_k^r && \text{w. prob. } 1 \\
A_k^l &\to^* a_1 a_2 \ldots a_k && \text{w. prob. } 1 \\
A_k^r &\to^* a_{k+1} a_{k+2} \ldots a_{m+1} && \text{w. prob. } 1 \\
B_k &\to^* B_k' c_k && \text{w. prob. } 1 \\
B_k' &\to^* a_{m+2} a_{m+3} \ldots a_{m+1+L_0} && \text{w. prob. } 1
\end{aligned}
$$

in which $\to^*$ means that the left-hand side expands into the right-hand side through a sequence of rules that conform to the requirements of the Chomsky normal form (CNF, Definition 4.4). Hence the grammar $G_{m,L_0}$ is in CNF.

The language of this grammar is

$$L(G_{m,L_0}) = \{\, l_k = a_1 a_2 \ldots a_{m+1+L_0} c_k : 1 \le k \le m \,\}.$$

The parse of an arbitrary $l_k$ is shown in Figure 2. Each $l_k$ corresponds to a unique parse determined by the choice of $k$. The structure of this PCFG is such that for the parsing algorithms we consider that proceed in a "left-to-right" fashion on $l_k$, before processing the last token $c_k$, they cannot infer the syntactic structure of $a_1 a_2 \ldots a_{m+1}$ any better than randomly guessing one of the $m$ possibilities. This is the main intuition behind Theorems 2 and 5.

Remark. While our theorems focus on the limitation of "left-to-right" parsing, a symmetric argument implies the same limitation for "right-to-left" parsing. Thus, our claim is that unidirectional context (in either direction) limits the expressive power of parsing models.
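The counterexample is easy to instantiate programmatically. The sketch below (our own illustrative code, not part of the paper; all names are ours) enumerates $L(G_{m,L_0})$ and checks the property the proofs rely on: all $m$ strings share the same prefix, while the correct top-level split of $a_1 \ldots a_{m+1}$ depends on the final token $c_k$.

```python
# Illustrative sketch of Definition 2.1.

def right_influenced_language(m, L0):
    """Return {k: l_k} where l_k = a_1 ... a_{m+1+L0} c_k."""
    prefix = [f"a{i}" for i in range(1, m + 2 + L0)]      # a_1 ... a_{m+1+L0}
    return {k: prefix + [f"c{k}"] for k in range(1, m + 1)}

def correct_split(k):
    # In the parse of l_k, the rule A_k -> A_k^l A_k^r splits the prefix between a_k and a_{k+1}.
    return k

m, L0 = 4, 1
sentences = right_influenced_language(m, L0)
for k, lk in sentences.items():
    print(" ".join(lk), "| split between positions", correct_split(k), "and", correct_split(k) + 1)

# The shared prefix is what defeats any left-to-right parser with bounded look-ahead:
# its decisions about positions 1..m+1 are made before c_k is visible.
prefixes = {tuple(lk[: m + 1 + L0]) for lk in sentences.values()}
assert len(prefixes) == 1
```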
                                                            reach, until the proposal of the Parsing-Reading-
  The parse of an arbitrary lk is shown in Figure 2.        Predict Network (PRPN) by Shen et al. (2018a),
Each lk corresponds to a unique parse determined            whose performance was thoroughly verified by ex-
by the choice of k. The structure of this PCFG is           tensive experiments in (Htut et al., 2018). The

The follow-up paper (Shen et al., 2019) introducing the ON-LSTM architecture radically simplified the architecture in (Shen et al., 2018a), while still ultimately attempting to fit a distance metric with the help of carefully designed master forget gates. Subsequent work by Kim et al. (2019a) departed from the usual way neural techniques are integrated in NLP, with great success: they proposed a neural parameterization for the EM algorithm for learning a PCFG, but in a manner that leverages semantic information as well — achieving a large improvement on unsupervised parsing tasks.^2

^2 By virtue of not relying on bounded or unidirectional context, the Compound PCFG (Kim et al., 2019a) eschews the techniques in our paper. Specifically, by employing a bidirectional LSTM inference network in the process of constructing a tree given a sentence, the parsing is no longer "left-to-right".

In addition to constituency parsing, dependency parsing is another common task for syntactic parsing, but for our analyses of the ability of various approaches to represent the max-likelihood parse of sentences generated from PCFGs, we focus on the task of constituency parsing. Moreover, it's important to note that there is another line of work aiming to probe the ability of models trained without explicit syntactic consideration (e.g. BERT) to nevertheless discover some (rudimentary) syntactic elements (Bisk and Hockenmaier, 2015; Linzen et al., 2016; Choe and Charniak, 2016; Kuncoro et al., 2018; Williams et al., 2018; Goldberg, 2019; Htut et al., 2019; Hewitt and Manning, 2019; Reif et al., 2019). However, to date, we haven't been able to extract parse trees achieving scores that are close to the oracle binarized trees on standard benchmarks (Kim et al., 2019b,a).

Methodologically, our work is closely related to a long line of works aiming to characterize the representational power of neural models (e.g. RNNs, LSTMs) through the lens of formal languages and formal models of computation. Some of the works of this flavor are empirical in nature (e.g. LSTMs have been shown to possess stronger abilities to recognize some context-free languages and even some context-sensitive languages, compared with simple RNNs (Gers and Schmidhuber, 2001; Suzgun et al., 2019) or GRUs (Weiss et al., 2018; Suzgun et al., 2019)); some results are theoretical in nature (e.g. Siegelmann and Sontag (1992)'s proof that with unbounded precision and unbounded time complexity, RNNs are Turing-complete; related results investigate RNNs with bounded precision and computation time (Weiss et al., 2018), as well as memory (Merrill, 2019; Hewitt et al., 2020)). Our work contributes to this line of works, but focuses on the task of syntactic parsing instead.

4   Preliminaries

In this section, we define some basic concepts and introduce the architectures we will consider.

4.1   Probabilistic context-free grammar

First recall several definitions around formal languages, especially probabilistic context-free grammars:

Definition 4.1 (Probabilistic context-free grammar (PCFG)). Formally, a PCFG (Chomsky, 1956) is a 5-tuple $G = (\Sigma, N, S, R, \Pi)$ in which $\Sigma$ is the set of terminals, $N$ is the set of non-terminals, $S \in N$ is the start symbol, and $R$ is the set of production rules of the form $r = (r_L \to r_R)$, where $r_L \in N$, $r_R$ is of the form $B_1 B_2 \ldots B_m$, $m \in \mathbb{Z}^+$, and $\forall i \in \{1, 2, \ldots, m\}$, $B_i \in (\Sigma \cup N)$. Finally, $\Pi : R \mapsto [0, 1]$ is the rule probability function, in which for any $r = (A \to B_1 B_2 \ldots B_m) \in R$, $\Pi(r)$ is the conditional probability $P(r_R = B_1 B_2 \ldots B_m \mid r_L = A)$.

Definition 4.2 (Parse tree). Let $T_G$ denote the set of parse trees that $G$ can derive. Each $t \in T_G$ is associated with $\mathrm{yield}(t) \in \Sigma^*$, the sequence of terminals composed of the leaves of $t$, and $P_T(t) \in [0, 1]$, the probability of the parse tree, defined by the product of the probabilities of the rules in the derivation of $t$.

Definition 4.3 (Language and sentence). The language of $G$ is

$$L(G) = \{\, s \in \Sigma^* : \exists t \in T_G, \ \mathrm{yield}(t) = s \,\}.$$

Each $s \in L(G)$ is called a sentence in $L(G)$, and is associated with the set of parses $T_G(s) = \{ t \in T_G \mid \mathrm{yield}(t) = s \}$, the set of max likelihood parses $\arg\max_{t \in T_G(s)} P_T(t)$, and its probability $P_S(s) = \sum_{t \in T_G(s)} P_T(t)$.

Definition 4.4 (Chomsky normal form (CNF)). A PCFG $G = (\Sigma, N, S, R, \Pi)$ is in CNF (Chomsky, 1959) if we require, in addition to Definition 4.1, that each rule $r \in R$ is of the form $A \to B_1 B_2$ where $B_1, B_2 \in N \setminus \{S\}$; $A \to a$ where $a \in \Sigma$, $a \neq \epsilon$; or $S \to \epsilon$, which is only allowed if the empty string $\epsilon \in L(G)$.

Every PCFG $G$ can be converted into a PCFG $G'$ in CNF such that $L(G) = L(G')$ (Hopcroft et al., 2006).
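As a concrete companion to Definitions 4.1-4.3, the following minimal container (our illustrative code; the class and method names are ours, not from the paper) stores a PCFG's rules with their probabilities and computes the probability of a derivation as the product of rule probabilities.

```python
from collections import defaultdict

class SimplePCFG:
    """Toy PCFG container mirroring Definition 4.1: rules map a non-terminal
    to (right-hand side, probability) pairs."""

    def __init__(self, start="S"):
        self.start = start
        self.rules = defaultdict(dict)          # lhs -> {rhs tuple: probability}

    def add_rule(self, lhs, rhs, prob):
        self.rules[lhs][tuple(rhs)] = prob

    def derivation_probability(self, derivation):
        """P_T(t) of Definition 4.2: product of the probabilities of the rules
        used in the derivation, given as a list of (lhs, rhs) pairs."""
        p = 1.0
        for lhs, rhs in derivation:
            p *= self.rules[lhs][tuple(rhs)]
        return p

# A two-sentence CNF grammar: S -> A B (1/2) | A C (1/2), A -> "a", B -> "b", C -> "c".
g = SimplePCFG()
g.add_rule("S", ("A", "B"), 0.5)
g.add_rule("S", ("A", "C"), 0.5)
for nt, term in [("A", "a"), ("B", "b"), ("C", "c")]:
    g.add_rule(nt, (term,), 1.0)

assert g.derivation_probability(
    [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]) == 0.5
```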

4.2   Syntactic distance

The Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018a) is one of the leading approaches to unsupervised constituency parsing. The parsing network (which computes the parse tree, hence the only part we focus on in our paper) is a convolutional network that computes the syntactic distances $d_t = d(w_{t-1}, w_t)$ (defined in Section 2.1) based on the past $L$ words. A deterministic greedy tree induction algorithm is then used to produce a parse tree as follows. First, we split the sentence $w_1 \ldots w_n$ into two constituents, $w_1 \ldots w_{t-1}$ and $w_t \ldots w_n$, where $t \in \arg\max \{d_t\}_{t=2}^{n}$, which become the left and right subtrees of the root. We recursively repeat this procedure for the newly created constituents. An algorithmic form of this procedure is included as Algorithm 1 in Appendix A.

Note that, due to the deterministic nature of the tree-induction process, the ability of PRPN to learn a PCFG is completely contingent upon learning a good syntactic distance.

4.3   The ordered neuron architecture

Building upon the idea of representing the syntactic information with a real-valued distance measure at each position, a simple extension is to associate each position with a learned vector, and then use the vector for syntactic parsing. The ordered-neuron LSTM (ON-LSTM, Shen et al., 2019) proposes that the nodes that are closer to the root in the parse tree generate a longer span of terminals, and therefore should be less frequently "forgotten" than nodes that are farther away from the root. The difference in the frequency of forgetting is captured by a carefully designed master forget gate vector $\tilde{f}$, as shown in Figure 3 (in Appendix B). Formally:

Definition 4.5 (Master forget gates, Shen et al., 2019). Given the input sentence $W = w_1 w_2 \ldots w_n$ and a trained ON-LSTM, running the ON-LSTM on $W$ gives the master forget gates, which are a sequence of $D$-dimensional vectors $\{\tilde{f}_t\}_{t=1}^{n}$, in which at each position $t$, $\tilde{f}_t = \tilde{f}_t(w_1, \ldots, w_t) \in [0, 1]^D$. Moreover, let $\tilde{f}_{t,j}$ represent the $j$-th dimension of $\tilde{f}_t$. The ON-LSTM architecture requires that $\tilde{f}_{t,1} = 0$, $\tilde{f}_{t,D} = 1$, and

$$\forall i < j, \quad \tilde{f}_{t,i} \le \tilde{f}_{t,j}.$$

When parsing a sentence, the real-valued master forget gate vector $\tilde{f}_t$ at each position $t$ is reduced to a single real number representing the syntactic distance $d_t$ at position $t$ (see (1)) (Shen et al., 2018a). The syntactic distances are then used to obtain a parse.

4.4   Transition-based parsing

In addition to outputting a single real-numbered distance or a vector at each position $t$, a left-to-right model can also parse a sentence by outputting a sequence of "transitions" at each position $t$, an idea proposed in some traditional parsing approaches (Sagae and Lavie, 2005; Chelba, 1997; Chelba and Jelinek, 2000), and also in some more recent neural parameterizations (Dyer et al., 2016).

We introduce several items of notation:

• $z_i^t$: the $i$-th transition performed when reading in $w_t$, the $t$-th token of the sentence $W = w_1 w_2 \ldots w_n$.

• $N_t$: the number of transitions performed between reading in the token $w_t$ and reading in the next token $w_{t+1}$.

• $Z_t$: the sequence of transitions after reading in the prefix $w_1 w_2 \ldots w_t$ of the sentence, $Z_t = \{ (z_1^j, z_2^j, \ldots, z_{N_j}^j) \mid j = 1..t \}$.

• $Z$: the parse of the sentence $W$. $Z = Z_n$.

We base our analysis on the approach introduced in the parsing version of (Dyer et al., 2016), though that work additionally proposes a generator version.^3

^3 Dyer et al. (2016) additionally proposes some generator transitions. For simplicity, we analyze the simplest form: we only allow the model to return one parse, composed of the parser transitions, for a given input sentence. Note that this simplified variant still confers full representational power in the "full context" setting (see Section 7).

Definition 4.6 (Transition-based parser). A transition-based parser uses a stack (initialized to empty) and an input buffer (initialized with the sentence $w_1 \ldots w_n$). At each position $t$, based on a context $c_t$, the parser outputs a sequence of parsing transitions $\{z_i^t\}_{i=1}^{N_t}$, where each $z_i^t$ can be one of the following transitions (Definition 4.7). The parsing stops when the stack contains one single constituent, and the buffer is empty.

Definition 4.7 (Parser transitions, Dyer et al., 2016). A parsing transition can be one of the following three types:

• NT(X): pushes a non-terminal X onto the stack.

• SHIFT: removes the first terminal from the input buffer and pushes it onto the stack.

• REDUCE: pops from the stack until an open non-terminal is encountered, then pops this non-terminal and assembles everything popped to form a new constituent, labels this new constituent using this non-terminal, and finally pushes this new constituent onto the stack.

In Appendix Section C, we provide an example of parsing the sentence "I drink coffee with milk" using the set of transitions given by Definition 4.7.
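For readers who want to trace the mechanics, the snippet below (our illustrative sketch, not the RNNG implementation of Dyer et al., 2016, and not necessarily the exact transition sequence of Appendix C) executes NT/SHIFT/REDUCE with an explicit stack and buffer on one plausible bracketing of that sentence.

```python
# Minimal executor for the transitions of Definition 4.7 (illustrative only).

def run_transitions(tokens, transitions):
    stack, buffer = [], list(tokens)
    for op in transitions:
        if isinstance(op, tuple) and op[0] == "NT":   # ("NT", "NP"): push an open non-terminal
            stack.append(("OPEN", op[1]))
        elif op == "SHIFT":                           # move the next terminal from the buffer
            stack.append(buffer.pop(0))
        elif op == "REDUCE":                          # pop up to the open non-terminal, close it
            children = []
            while not (isinstance(stack[-1], tuple) and stack[-1][0] == "OPEN"):
                children.append(stack.pop())
            label = stack.pop()[1]
            stack.append("(%s %s)" % (label, " ".join(reversed(children))))
    assert len(stack) == 1 and not buffer             # stopping condition of Definition 4.6
    return stack[0]

ops = [("NT", "S"), ("NT", "NP"), "SHIFT", "REDUCE",
       ("NT", "VP"), "SHIFT", ("NT", "NP"), "SHIFT",
       ("NT", "PP"), "SHIFT", "SHIFT", "REDUCE", "REDUCE", "REDUCE", "REDUCE"]
print(run_transitions("I drink coffee with milk".split(), ops))
# (S (NP I) (VP drink (NP coffee (PP with milk))))
```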
The different context specifications and the corresponding representational powers of the transition-based parser are discussed in Section 7.

5   Representational Power of Neural Syntactic Distance Methods

In this section we formalize the results on syntactic distance-based methods. Since the tree induction algorithm always generates a binary tree, we consider only PCFGs in Chomsky normal form (CNF) (Definition 4.4), so that the max likelihood parse of a sentence is also a binary tree structure.

To formalize the notion of "representing" a PCFG, we introduce the following definition:

Definition 5.1 (Representing a PCFG with syntactic distance). Let $G$ be any PCFG in Chomsky normal form. A syntactic distance function $d$ is said to be able to $p$-represent $G$ if, for a set of sentences in $L(G)$ whose total probability is at least $p$, $d$ can correctly induce the tree structure of the max likelihood parse of these sentences without ambiguity.

Remark. Ambiguities could occur when, for example, there exists $t$ such that $d_t = d_{t+1}$. In this case, the tree induction algorithm would have to break ties when determining the local structure for $w_{t-1} w_t w_{t+1}$. We preclude this possibility in Definition 5.1.

In the least restrictive setting, the whole sentence $W$, as well as the position index $t$, can be taken into consideration when determining each $d_t$. We prove that under this setting, there is a syntactic distance measure that can represent any PCFG.

Theorem 1 (Full context). Let $c_t = (W, t)$. For each PCFG $G$ in Chomsky normal form, there exists a syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ that can 1-represent $G$.

Proof. For any sentence $s = s_1 s_2 \ldots s_n \in L(G)$, let $T$ be its max likelihood parse tree. Since $G$ is in Chomsky normal form, $T$ is a binary tree. We will describe an assignment of $\{d_t : 2 \le t \le n\}$ such that their order matches the level at which the branches split in $T$. Specifically, $\forall t \in [2, n]$, let $a_t$ denote the lowest common ancestor of $w_{t-1}$ and $w_t$ in $T$. Let $d'_t$ denote the shortest distance between $a_t$ and the root of $T$. Finally, let $d_t = n - d'_t$. As a result, $\{d_t : 2 \le t \le n\}$ induces $T$.
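The construction is easy to check numerically. The sketch below (our illustration; the tree and names are ours) computes, for each pair of adjacent leaves of a binary tree given as nested pairs, the depth of their lowest common ancestor and sets $d_t = n - \mathrm{depth}$; feeding these distances to the greedy induction of Algorithm 1 (Appendix A) recovers the tree.

```python
# Worked instance of the Theorem 1 construction (illustrative code).

def lca_depths(tree, depth=0, boundaries=None, leaves=None):
    """Collect, for each boundary between adjacent leaves, the depth of their LCA."""
    if boundaries is None:
        boundaries, leaves = [], []
    if isinstance(tree, str):                    # a terminal
        leaves.append(tree)
    else:
        left, right = tree
        lca_depths(left, depth + 1, boundaries, leaves)
        boundaries.append(depth)                 # this node is the LCA across the split below it
        lca_depths(right, depth + 1, boundaries, leaves)
    return leaves, boundaries

tree = ("I", ("drink", ("coffee", ("with", "milk"))))
leaves, depths = lca_depths(tree)
n = len(leaves)
distances = {t: n - depths[t - 2] for t in range(2, n + 1)}   # d_t = n - d'_t
print(distances)                                  # {2: 5, 3: 4, 4: 3, 5: 2}: unique max directly under the root
```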
Remark. Since any PCFG can be converted to Chomsky normal form (Hopcroft et al., 2006), Theorem 1 implies that given the whole sentence and the position index as the context, the syntactic distance has sufficient representational power to capture any PCFG. It does not state, however, that the whole sentence and the position are the minimal contextual information needed for representability, nor does it address training (i.e. optimization) issues.

On the flipside, we show that restricting the context even mildly can considerably decrease the representational power. Namely, we show that if the context is bounded even in a single direction (to the left or to the right), there are PCFGs on which any syntactic distance will perform poorly.^4 (Note that in the implementation of (Shen et al., 2018a) the context only considers a bounded window to the left.)

^4 In Theorem 2 we prove the more typical case, i.e. unbounded left context and bounded right context. The other case, i.e. bounded left context and unbounded right context, can be proved symmetrically.

Theorem 2 (Limitation of left-to-right parsing via syntactic distance). Let $w_0 = \langle S \rangle$ be the sentence start symbol. Let the context

$$c_t = (w_0, w_1, \ldots, w_{t+L_0}).$$

$\forall \epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that any syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ cannot $\epsilon$-represent $G$.

Proof. Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L_0}$ in Definition 2.1.

For any $k \in [m]$, consider the string $l_k \in L(G_{m,L_0})$. Note that in the parse tree of $l_k$, the rule $S \to A_k B_k$ is applied. Hence, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \ldots a_{m+1}$ whose lowest common ancestor is the closest to the root in the parse tree of $l_k$. Then, in order for the syntactic distance metric $d$ to induce the correct parse tree for $l_k$, $d_{k+1}$ must be the unique maximum in $\{d_t : 2 \le t \le m+1\}$.

However, $d$ is restricted to be of the form

$$d_t = d(w_{t-1}, w_t \mid w_0, w_1, \ldots, w_{t+L_0}).$$

Note that $\forall\, 1 \le k_1 < k_2 \le m$, the first $m+1+L_0$ tokens of $l_{k_1}$ and $l_{k_2}$ are the same, which implies that the inferred syntactic distances

$$\{d_t : 2 \le t \le m+1\}$$

are the same for $l_{k_1}$ and $l_{k_2}$ at each position $t$. Thus, it is impossible for $d$ to induce the correct parse tree for both $l_{k_1}$ and $l_{k_2}$. Hence, $d$ is correct on at most one $l_k \in L(G_{m,L_0})$, which corresponds to probability at most $1/m < \epsilon$. Therefore, $d$ cannot $\epsilon$-represent $G_{m,L_0}$.

Remark. In the counterexample, there are only $m$ possible parse structures for the prefix $a_1 a_2 \ldots a_{m+1}$. Hence, the proved fact that the probability of being correct is at most $1/m$ means that under the restrictions of unbounded look-back and bounded look-ahead, the distance cannot do better than random guessing for this grammar.

Remark. The above Theorem 2 formalizes the intuition discussed in (Htut et al., 2018) outlining an intrinsic limitation of only considering bounded context in one direction. Indeed, for the PCFG constructed in the proof, the failure is due to the restricted context, not to the fact that we are using a distance-based parser.

Note that as a corollary of the above theorem, if there is no context ($c_t = \mathrm{null}$) or the context is both bounded and unidirectional, i.e.

$$c_t = w_{t-L} w_{t-L+1} \ldots w_{t-1} w_t,$$

then there is a PCFG that cannot be $\epsilon$-represented by any such $d$.

6   Representational Power of the Ordered Neuron Architecture

In this section, we formalize the results characterizing the representational power of the ON-LSTM architecture. The master forget gates of the ON-LSTM, $\{\tilde{f}_t\}_{t=2}^{n}$ in which each $\tilde{f}_t \in [0, 1]^D$, encode the hierarchical structure of a parse tree, and Shen et al. (2019) proposes to carry out unsupervised constituency parsing via a reduction from the gate vectors to syntactic distances by setting:

$$\hat{d}^f_t = D - \sum_{j=1}^{D} \tilde{f}_{t,j} \quad \text{for } t = 2..n \tag{1}$$

First we show that the gates in ON-LSTM in principle form a lossless representation of any parse tree.

Theorem 3 (Lossless representation of a parse tree). For any sentence $W = w_1 w_2 \ldots w_n$ with parse tree $T$ in any PCFG in Chomsky normal form, there exists a dimensionality $D \in \mathbb{Z}^+$ and a sequence of vectors $\{\tilde{f}_t\}_{t=2}^{n}$, in which each $\tilde{f}_t \in [0, 1]^D$, such that the estimated syntactic distances via (1) induce the structure of $T$.

Proof. By Theorem 1, there is a syntactic distance measure $\{d_t\}_{t=2}^{n}$ that induces the structure of $T$ (such that $\forall t$, $d_t \neq d_{t+1}$).

For each $t = 2..n$, set $\hat{d}_t = k$ if $d_t$ is the $k$-th smallest entry in $\{d_t\}_{t=2}^{n}$, breaking ties arbitrarily. Then, each $\hat{d}_t \in [1, n-1]$, and $\{\hat{d}_t\}_{t=2}^{n}$ also induces the structure of $T$.

Let $D = n - 1$. For each $t = 2..n$, let $\tilde{f}_t = (0, \ldots, 0, 1, \ldots, 1)$ whose lower $\hat{d}_t$ dimensions are 0 and higher $D - \hat{d}_t$ dimensions are 1. Then,

$$\hat{d}^f_t = D - \sum_{j=1}^{D} \tilde{f}_{t,j} = D - (D - \hat{d}_t) = \hat{d}_t.$$

Therefore, the calculated $\{\hat{d}^f_t\}_{t=2}^{n}$ induces the structure of $T$.

Although Theorem 3 shows the ability of the master forget gates to perfectly represent any parse tree, a left-to-right parser can be proved to be unable to return the correct parse with high probability. In the actual implementation in (Shen et al., 2019), the (real-valued) master forget gate vectors $\{\tilde{f}_t\}_{t=1}^{n}$ are produced by feeding the input sentence $W = w_1 w_2 \ldots w_n$ to a model trained with a language modeling objective. In other words, $\tilde{f}_{t,j}$ is calculated as a function of $w_1, \ldots, w_t$, rather than the entire sentence.

As such, this left-to-right parser is subject to similar limitations as in Theorem 2:

Theorem 4 (Limitation of syntactic distance estimation based on ON-LSTM). For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that the syntactic distance measure calculated with (1), $\hat{d}^f_t$, cannot $\epsilon$-represent $G$.

Proof. Since by Definition 4.5, $\tilde{f}_{t,j}$ is a function of $w_1, \ldots, w_t$, the estimated syntactic distance $\hat{d}^f_t$ is also a function of $w_1, \ldots, w_t$. By Theorem 2, even with unbounded look-back context $w_1, \ldots, w_t$, there exists a PCFG for which the probability that $\hat{d}^f_t$ induces the correct parse is arbitrarily low.
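The two steps of the Theorem 3 construction (rank the distances, then write each rank as a monotone 0/1 gate vector) can be checked directly. The snippet below (our illustrative code, not the ON-LSTM implementation) builds the gate vectors from a set of distances and verifies that the reduction (1) recovers the ranks, which induce the same tree.

```python
# Numerical check of the Theorem 3 construction (illustrative only).

def gates_from_distances(distances):
    """distances: [d_2, ..., d_n] with distinct entries. Returns (gates, ranks, D)."""
    n = len(distances) + 1
    D = n - 1
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for r, i in enumerate(order, start=1):       # set \hat{d}_t = k if d_t is the k-th smallest
        ranks[i] = r
    gates = [[0] * r + [1] * (D - r) for r in ranks]   # lower \hat{d}_t dims are 0, the rest are 1
    return gates, ranks, D

def reduce_gates(gates, D):
    """Equation (1): \hat{d}^f_t = D - sum_j f_{t,j}."""
    return [D - sum(f) for f in gates]

distances = [5, 4, 3, 2]                         # e.g. the distances from the Theorem 1 sketch
gates, ranks, D = gates_from_distances(distances)
assert reduce_gates(gates, D) == ranks           # the reduction returns the ranks, as in the proof
```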
7   Representational Power of Transition-Based Parsing

In this section, we analyze a transition-based parsing framework inspired by (Dyer et al., 2016; Chelba and Jelinek, 2000; Chelba, 1997).

Again, we first show that "full context" confers full representational power. Namely, using the terminology of Definition 4.6, we let the context $c_t$ at each position $t$ be the whole sentence $W$ and the position index $t$. Note that any parse tree can be generated by a sequence of transitions defined in Definition 4.7. Indeed, Dyer et al. (2016) describes an algorithm to find such a sequence of transitions via a "depth-first, left-to-right traversal" of the tree.

Proceeding to limited context, in the setting of typical left-to-right parsing, the context $c_t$ consists of all current and past tokens $\{w_j\}_{j=1}^{t}$ and all previous parses $\{(z_1^j, \ldots, z_{N_j}^j)\}_{j=1}^{t}$. We'll again prove even stronger negative results, where we allow an optional look-ahead to $L_0$ input tokens to the right.

Theorem 5 (Limitation of transition-based parsing without full context). For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that for any learned transition-based parser (Definition 4.6) based on context

$$c_t = (\{w_j\}_{j=1}^{t+L_0}, \ \{(z_1^j, \ldots, z_{N_j}^j)\}_{j=1}^{t}),$$

the sum of the probabilities of the sentences in $L(G)$ for which the parser returns the maximum likelihood parse is less than $\epsilon$.

Proof. Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L_0}$ in Definition 2.1.

Note that $\forall k$, $S \to A_k B_k$ is applied to yield the string $l_k$. Then in the parse tree of $l_k$, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \ldots a_{m+1}$ whose lowest common ancestor is the closest to the root. Thus, each $l_k$ requires a different sequence of transitions within the first $m+1$ input tokens, i.e. $\{z_i^t\}_{i \ge 1,\ 1 \le t \le m+1}$.

For each $w \in L(G_{m,L_0})$, before the last token $w_{m+2+L_0}$ is processed, based on the common prefix $w_1 w_2 \ldots w_{m+1+L_0} = a_1 a_2 \ldots a_{m+1+L_0}$, it is equally likely that $w = l_k$, $\forall k$, with probability $1/m$ each.

Moreover, when processing $w_{m+1}$, the bounded look-ahead window of size $L_0$ does not allow access to the final input token $w_{m+2+L_0} = c_k$.

Thus, $\forall\, 1 \le k_1 < k_2 \le m$, it is impossible for the parser to return the correct parse tree for both $l_{k_1}$ and $l_{k_2}$ without ambiguity. Hence, the parse is correct on at most one $l_k \in L(G)$, which corresponds to probability at most $1/m < \epsilon$.

8   Conclusion

In this work, we considered the representational power of two frameworks for constituency parsing prominent in the literature, based on learning a syntactic distance and learning a sequence of iterative transitions to build the parse tree — in the sandbox of PCFGs. In particular, we show that if the context for calculating the distance or deciding on transitions is limited on at least one side (which is typically the case in practice for existing architectures), there are PCFGs for which no good distance metric or sequence of transitions can be chosen to construct the maximum likelihood parse.

This limitation was already suspected in (Htut et al., 2018) as a potential failure mode of leading neural approaches like (Shen et al., 2018a, 2019), and we show formally that this is the case. The PCFGs with this property track the intuition that bounded-context methods will have issues when the parse at a certain position depends heavily on later parts of the sentence.

The conclusions thus suggest re-focusing our attention on methods like (Kim et al., 2019a), which have enjoyed greater success on tasks like unsupervised constituency parsing and do not fall in the paradigm analyzed in our paper. A question of definite further interest is how to augment models that have been successfully scaled up (e.g. BERT) in a principled manner with syntactic information, such that they can capture syntactic structure (like PCFGs). The other question of immediate importance is to understand the interaction between the syntactic and semantic modules in neural architectures — information is shared between such modules in various successful architectures, e.g. (Dyer et al., 2016; Shen et al., 2018a, 2019; Kim et al., 2019a), and the relative pros and cons of doing this are not well understood. Finally, our paper purely focuses on representational power, and does not consider algorithmic and statistical aspects of training. As any model architecture is associated with its distinct optimization and generalization considerations, and natural language data necessitates modeling the interaction between syntax and semantics, those considerations are well beyond the scope of our analysis in this paper using the controlled sandbox of PCFGs, and are interesting directions for future work.
References

Yonatan Bisk and Julia Hockenmaier. 2015. Probing the linguistic strengths and limitations of unsupervised grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1395–1404, Beijing, China. Association for Computational Linguistics.

Ciprian Chelba. 1997. A structured language model. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 498–500, Madrid, Spain. Association for Computational Linguistics.

Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech & Language, 14(4):283–332.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics.

N. Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.

Noam Chomsky. 1959. On certain formal properties of grammars. Information and Control, 2(2):137–167.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California. Association for Computational Linguistics.

F. Gers and J. Schmidhuber. 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities.

Qi He, Han Wang, and Yue Zhang. 2020. Enhancing generalization in natural language inference by syntax. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4973–4978, Online. Association for Computational Linguistics.

John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. 2020. RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2006. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA.

Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. 2018. Grammar induction with neural language models: An unusual replication. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 371–373, Brussels, Belgium. Association for Computational Linguistics.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv, abs/1911.12246.

Yoon Kim, Chris Dyer, and Alexander Rush. 2019a. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2369–2385, Florence, Italy. Association for Computational Linguistics.

Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019b. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117, Minneapolis, Minnesota. Association for Computational Linguistics.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne, Australia. Association for Computational Linguistics.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

William Merrill. 2019. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence. Association for Computational Linguistics.

Deric Pang, Lucy H. Lin, and Noah A. Smith. 2019. Improving natural language inference with a pretrained parser.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems, volume 32, pages 8594–8603. Curran Associates, Inc.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia. Association for Computational Linguistics.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018b. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia. Association for Computational Linguistics.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2019. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations.

Hava T. Siegelmann and Eduardo D. Sontag. 1992. On the computational power of neural nets. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 440–449, New York, NY, USA. Association for Computing Machinery.

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, volume 28, pages 2773–2781. Curran Associates, Inc.

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.
A    Tree Induction Algorithm Based on Syntactic Distance
The following algorithm is proposed in (Shen et al., 2018a) to construct a parse tree from a given set of syntactic distances; a Python sketch of the procedure follows the algorithm box.

 Algorithm 1: Tree induction based on syntactic distance
  Data: Sentence W = w_1 w_2 ... w_n, syntactic distances d_t = d(w_{t-1}, w_t | c_t), 2 ≤ t ≤ n
  Result: A parse tree for W
  Initialize the parse tree with a single node n_0 = w_1 w_2 ... w_n ;
  while ∃ leaf node n = w_i w_{i+1} ... w_j where i < j do
       Find k ∈ arg max_{i+1 ≤ t ≤ j} d_t ;
       Create the left child n_l and the right child n_r of n ;
       n_l ← w_i w_{i+1} ... w_{k-1} ;
       n_r ← w_k w_{k+1} ... w_j ;
  end
  return The parse tree rooted at n_0 .
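
For concreteness, here is a minimal Python sketch of Algorithm 1, written recursively (equivalent to the while-loop above, since the order in which leaves are expanded does not affect the result). The function name induce_tree and the representation of trees as nested tuples are our own illustrative choices and not part of (Shen et al., 2018a); ties in the arg max are broken by taking the first maximizer.

    def induce_tree(words, d):
        """Build a binary parse tree from syntactic distances (Algorithm 1).

        words: list of tokens w_1 ... w_n
        d:     mapping with d[t] = d(w_{t-1}, w_t | c_t) for 2 <= t <= n,
               using 1-based positions to match the notation above
        Returns a nested tuple; a single token is a leaf.
        """
        def build(i, j):
            if i == j:                 # a single word is a leaf
                return words[i - 1]
            # split point: a maximizer of d_k over i+1 <= k <= j
            k = max(range(i + 1, j + 1), key=lambda t: d[t])
            return (build(i, k - 1), build(k, j))

        return build(1, len(words))

For example, induce_tree(["I", "drink", "coffee"], {2: 0.9, 3: 0.4}) returns ('I', ('drink', 'coffee')): the first split is made at the position with the largest distance.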

B   ON-LSTM Intuition
See Figure 3 below, which relates the parse tree, the block view, and the ON-LSTM.

Figure 3: Relationship between the parse tree, the block view, and the ON-LSTM. Excerpted from (Shen et al.,
2019) with minor adaptation to the notation.

C   Examples of parsing transitions
Table 1 below shows an example of parsing the sentence “I drink coffee with milk” using the set of transitions given by Definition 4.7, which employs the parsing framework of (Dyer et al., 2016); a code sketch of a generic transition executor in this style is given after the table. The parse tree of the sentence, in bracketed form, is

    (S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))

       Stack                                                         Buffer                     Action
                                                                     I drink coffee with milk   NT(S)
       (S                                                            I drink coffee with milk   NT(NP)
       (S | (NP                                                      I drink coffee with milk   NT(N)
       (S | (NP | (N                                                 I drink coffee with milk   SHIFT
       (S | (NP | (N | I                                             drink coffee with milk     REDUCE
       (S | (NP (N I))                                               drink coffee with milk     NT(VP)
       (S | (NP (N I)) | (VP                                         drink coffee with milk     NT(V)
       (S | (NP (N I)) | (VP | (V                                    drink coffee with milk     SHIFT
       (S | (NP (N I)) | (VP | (V | drink                            coffee with milk           REDUCE
       (S | (NP (N I)) | (VP | (V drink)                             coffee with milk           NT(NP)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP                                                        coffee with milk             NT(NP)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP                                                  coffee with milk           NT(N)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP | (N                                             coffee with milk           SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP | (N | coffee                                    with milk                  REDUCE
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee))                                      with milk                  NT(PP)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP                                with milk                  NT(P)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P                           with milk                  SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P | with                    milk                       REDUCE
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with)                     milk                       NT(N)
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with) | (N                milk                       SHIFT
       (S | (NP (N I)) | (VP | (V drink) |
          (NP | (NP (N coffee)) | (PP | (P with) | (N | milk                                    REDUCE
       (S (NP (N I)) (VP (V drink)
          (NP (NP (N coffee)) (PP (P with) (N milk)))))

                Table 1: Transition-based parsing of the sentence “I drink coffee with milk”.
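
For illustration, the following is a minimal Python sketch of an executor for top-down parsing transitions in the style of (Dyer et al., 2016): NT(X) pushes an open nonterminal, SHIFT moves the next buffer word onto the stack, and REDUCE pops completed items down to the nearest open nonterminal and closes it. The function name parse, the tuple encoding of actions, and the string-based stack are our own illustrative choices; in particular, this sketch issues one REDUCE per constituent, which is slightly more fine-grained than the rows of Table 1, so it should be read as a sketch rather than a verbatim implementation of Definition 4.7.

    def parse(actions, buffer):
        """Run top-down constituency-parsing transitions (RNNG style)."""
        def is_open(item):
            # an open nonterminal is pushed as "(X" and has no closing paren yet
            return item.startswith("(") and not item.endswith(")")

        stack, buf = [], list(buffer)
        for act in actions:
            if act[0] == "NT":           # push an open nonterminal, e.g. "(NP"
                stack.append("(" + act[1])
            elif act[0] == "SHIFT":      # move the next buffer word onto the stack
                stack.append(buf.pop(0))
            elif act[0] == "REDUCE":     # close the innermost open nonterminal
                children = []
                while not is_open(stack[-1]):
                    children.append(stack.pop())
                stack.append(stack.pop() + " " + " ".join(reversed(children)) + ")")
        assert not buf and len(stack) == 1, "actions must consume the buffer and build one tree"
        return stack[0]

    # The sentence of Table 1, with one REDUCE per constituent:
    acts = [("NT", "S"), ("NT", "NP"), ("NT", "N"), ("SHIFT",), ("REDUCE",), ("REDUCE",),
            ("NT", "VP"), ("NT", "V"), ("SHIFT",), ("REDUCE",),
            ("NT", "NP"), ("NT", "NP"), ("NT", "N"), ("SHIFT",), ("REDUCE",), ("REDUCE",),
            ("NT", "PP"), ("NT", "P"), ("SHIFT",), ("REDUCE",),
            ("NT", "N"), ("SHIFT",), ("REDUCE",),
            ("REDUCE",), ("REDUCE",), ("REDUCE",), ("REDUCE",)]
    parse(acts, "I drink coffee with milk".split())
    # -> '(S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))'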
