Neural Bi-Lexicalized PCFG Induction - Songlin Yang, Yanpeng Zhao, Kewei Tu
Songlin Yang♣, Yanpeng Zhao♦, Kewei Tu♣*
♣ School of Information Science and Technology, ShanghaiTech University
Shanghai Engineering Research Center of Intelligent Vision and Imaging
Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
♦ ILCC, University of Edinburgh
{yangsl,tukw}@shanghaitech.edu.cn  yannzhao.ed@gmail.com
arXiv:2105.15021v1 [cs.CL] 31 May 2021
* Corresponding Author

Abstract

Neural lexicalized PCFGs (L-PCFGs) (Zhu et al., 2020) have been shown effective in grammar induction. However, to reduce computational complexity, they make a strong independence assumption on the generation of the child word and thus bilexical dependencies are ignored. In this paper, we propose an approach to parameterize L-PCFGs without making implausible independence assumptions. Our approach directly models bilexical dependencies and meanwhile reduces both learning and representation complexities of L-PCFGs. Experimental results on the English WSJ dataset confirm the effectiveness of our approach in improving both running speed and unsupervised parsing performance.

1 Introduction

Probabilistic context-free grammars (PCFGs) have been an important probabilistic approach to syntactic analysis (Lari and Young, 1990; Jelinek et al., 1992). They assign a probability to each of the parses admitted by CFGs and rank them by plausibility in such a way that the ambiguity of CFGs can be ameliorated. Still, due to the strong independence assumption of CFGs, vanilla PCFGs (Charniak, 1996) are far from adequate for highly ambiguous text.

A common premise for tackling the issue is to incorporate lexical information and weaken the independence assumption. There have been many approaches proposed under this premise (Magerman, 1995; Collins, 1997; Johnson, 1998; Klein and Manning, 2003). Among them, lexicalized PCFGs (L-PCFGs) are a relatively straightforward formalism (Collins, 2003). L-PCFGs extend PCFGs by associating a word, i.e., the lexical head, with each grammar symbol. They can thus exploit lexical information to disambiguate parsing decisions and are much more expressive than vanilla PCFGs. However, they suffer from representation and inference complexities. For representation, the addition of lexical information greatly increases the number of parameters to be estimated and exacerbates the data sparsity problem during learning, so the expectation-maximisation (EM) based estimation of L-PCFGs has to rely on sophisticated smoothing techniques and factorizations (Collins, 2003). As for inference, the CYK algorithm for L-PCFGs has an O(l^5 |G|) complexity, where l is the sentence length and |G| is the grammar constant. Although Eisner and Satta (1999) manage to reduce the complexity to O(l^4 |G|), inference with L-PCFGs is still relatively slow, making them less popular nowadays.

Recently, Zhu et al. (2020) combine the ideas of factorizing the binary rule probabilities (Collins, 2003) and neural parameterization (Kim et al., 2019) and propose neural L-PCFGs (NL-PCFGs), achieving good results in both unsupervised dependency and constituency parsing. Neural parameterization is the key to their success: it facilitates informed smoothing (Kim et al., 2019), reduces the number of learnable parameters for large grammars (Chiu and Rush, 2020; Yang et al., 2021), and enables advanced gradient-based optimization techniques instead of the traditional EM algorithm (Eisner, 2016). However, Zhu et al. (2020) oversimplify the binary rules to decrease the complexity of the inside/CYK algorithm in learning (i.e., estimating the marginal sentence log-likelihood) and inference. Specifically, they make a strong independence assumption on the generation of the child word such that it is only dependent on the nonterminal symbol. Bilexical dependencies, which have been shown useful in unsupervised dependency parsing (Han et al., 2017; Yang et al., 2020), are thus ignored.
To model bilexical dependencies and meanwhile reduce complexities, we draw inspiration from the canonical polyadic decomposition (CPD) (Kolda and Bader, 2009) and propose a latent-variable based neural parameterization of L-PCFGs. Cohen et al. (2013) and Yang et al. (2021) have used CPD to decrease the complexities of PCFGs, and our work can be seen as an extension of their work to L-PCFGs. We further adopt the unfold-refold transformation technique (Eisner and Blatz, 2007) to decrease complexities. By using this technique, we show that the time complexity of the inside algorithm implemented by Zhu et al. (2020) can be improved from cubic to quadratic in the number of nonterminals m. The inside algorithm of our proposed method has a linear complexity in m after combining CPD and unfold-refold.

We evaluate our model on the benchmark Wall Street Journal (WSJ) dataset. Our model surpasses the strong baseline NL-PCFG (Zhu et al., 2020) by 2.9% mean F1 and 1.3% mean UUAS under CYK decoding. When using Minimum Bayes-Risk (MBR) decoding, our model performs even better. We provide an efficient implementation of our proposed model at https://github.com/sustcsonglin/TN-PCFG.

2 Background

2.1 Lexicalized CFGs

We first introduce the formalization of CFGs. A CFG is defined as a 5-tuple G = (S, N, P, Σ, R), where S is the start symbol, N is a finite set of nonterminal symbols, P is a finite set of preterminal symbols,¹ Σ is a finite set of terminal symbols, and R is a set of rules in the following form:

    S → A,        A ∈ N
    A → B C,      A ∈ N; B, C ∈ N ∪ P
    T → w,        T ∈ P, w ∈ Σ

N, P and Σ are mutually disjoint. We will use 'nonterminals' to indicate N ∪ P when it is clear from the context.

¹ An alternative definition of CFGs does not distinguish nonterminals N (constituent labels) from preterminals P (part-of-speech tags) and treats both as nonterminals.

Lexicalized CFGs (L-CFGs) (Collins, 2003) extend CFGs by associating a word with each of the nonterminals:

    S → A[w_p],                A ∈ N
    A[w_p] → B[w_p] C[w_q],    A ∈ N; B, C ∈ N ∪ P
    A[w_p] → C[w_q] B[w_p],    A ∈ N; B, C ∈ N ∪ P
    T[w_p] → w_p,              T ∈ P

where w_p, w_q ∈ Σ are the headwords of the constituents spanned by the associated grammar symbols, and p, q are the word positions in the sentence. We refer to A, a parent nonterminal annotated by the headword w_p, as the head-parent. In binary rules, we refer to a child nonterminal as the head-child if it inherits the headword of the head-parent (e.g., B[w_p]) and as the non-head-child otherwise (e.g., C[w_q]). A head-child appears as either the left child or the right child. We denote the head direction by D ∈ {⇐, ⇒}, where ⇐ means the head-child appears as the left child.

2.2 Grammar induction with lexicalized probabilistic CFGs

Lexicalized probabilistic CFGs (L-PCFGs) extend L-CFGs by assigning each production rule r = A → γ a scalar π_r such that it forms a valid categorical probability distribution given the left-hand side A. Note that preterminal rules always have a probability of 1 because they define a deterministic generating process.

Grammar induction with L-PCFGs follows the same recipe as grammar induction with PCFGs. As with PCFGs, we maximize the log-likelihood of each observed sentence w = w_1, ..., w_l:

    log p(w) = log Σ_{t ∈ T_{G_L}(w)} p(t),    (1)

where p(t) = Π_{r ∈ t} π_r and T_{G_L}(w) consists of all possible lexicalized parse trees of the sentence w under an L-PCFG G_L. We can compute the marginal p(w) of the sentence using the inside algorithm in polynomial time. The core recursion of the inside algorithm is formalized in Equation 3 (Table 1). It recursively computes the probability s_{i,j}^{A,p} of a head-parent A[w_p] spanning the substring w_i, ..., w_{j−1} (p ∈ [i, j−1]). Terms A1 and A2 in Equation 3 cover the cases of the head-child as the left child and the right child, respectively.
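To make the recursion concrete, the sketch below spells out Equation 3 as plain loops over spans, head positions, and split points. It is only an illustration of the quantities involved (the tensor `rules`, the chart `s`, and all sizes are hypothetical), not the authors' implementation, and it deliberately materializes the full order-6 rule tensor whose cost Section 2.3 discusses.

```python
# Naive lexicalized inside recursion (Equation 3); purely illustrative and unoptimized.
# rules[A, B, C, wp, wq, 0] stands for p(A[wp] -> B[wp] C[wq])   (head-child on the left)
# rules[A, B, C, wp, wq, 1] stands for p(A[wp] -> B[wq] C[wp])   (head-child on the right)
import torch

def naive_inside(sent, rules, num_nt, num_pt):
    l = len(sent)
    m, mp = num_nt, num_nt + num_pt          # nonterminals come first, then preterminals
    # s[i, j, A, p]: probability of head-parent A[w_p] spanning w_i ... w_{j-1}
    s = torch.zeros(l + 1, l + 1, mp, l)
    for i in range(l):                        # base case: preterminal rules have probability 1
        s[i, i + 1, m:, i] = 1.0
    for width in range(2, l + 1):
        for i in range(l - width + 1):
            j = i + width
            for A in range(m):
                for p in range(i, j):
                    total = 0.0
                    for k in range(p + 1, j):          # Term A1: head-child is the left child
                        for q in range(k, j):
                            M = rules[A, :, :, sent[p], sent[q], 0]
                            total = total + s[i, k, :, p] @ M @ s[k, j, :, q]
                    for k in range(i + 1, p + 1):      # Term A2: head-child is the right child
                        for q in range(i, k):
                            M = rules[A, :, :, sent[p], sent[q], 1]
                            total = total + s[i, k, :, q] @ M @ s[k, j, :, p]
                    s[i, j, A, p] = total
    return s

# toy usage; in a real grammar `rules` must be normalized over (B, C, wq, D) given (A, wp)
V, m, pt = 5, 2, 3
rules = torch.rand(m, m + pt, m + pt, V, V, 2)
chart = naive_inside([0, 3, 1, 4], rules, m, pt)
```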
2.3 Challenges of L-PCFG induction

The major difference between L-PCFGs and vanilla PCFGs is that the former use word-annotated nonterminals, so the nonterminal number of L-PCFGs is up to |Σ| times the number of nonterminals in PCFGs. As the grammar size is largely determined by the number of binary rules and increases approximately cubically in the nonterminal number, representing L-PCFGs has a high space complexity of O(m^3 |Σ|^2) (m is the nonterminal number). Specifically, it requires an order-6 probability tensor for binary rules, with the dimensions representing A, B, C, w_p, w_q, and the head direction D, respectively. With so many rules, L-PCFGs are very prone to the data sparsity problem in rule probability estimation. Collins (2003) suggests factorizing the binary rule probabilities according to specific independence assumptions, but his approach still relies on complicated smoothing techniques to be effective.

The addition of lexical heads also scales up the computational complexity of the inside algorithm by a factor of O(l^2) and brings it up to O(l^5 m^3). Eisner and Satta (1999) point out that, by changing the order of summations in Term A1 (A2) of Equation 3, one can cache and reuse Term B1 (B2) in Equation 4 and reduce the computational complexity to O(l^4 m^2 + l^3 m^3). This is an example application of unfold-refold as noted by Eisner and Blatz (2007). However, the complexity is still cubic in m, making it expensive to increase the total number of nonterminals.

2.4 Neural L-PCFGs

Zhu et al. (2020) apply neural parameterization to tackle the data sparsity issue and to reduce the total number of learnable parameters of L-PCFGs. Considering the head-child as the left child (similarly for the other case), they further factorize the binary rule probability as:

    p(A[w_p] → B[w_p] C[w_q]) = p(B, ⇐, C | A, w_p) · p(w_q | C).    (2)

Bayesian networks representing the original probability and the factorization are illustrated in Figure 1 (a) and (b).

[Figure 1: (a) The original parameterization of L-PCFGs. (b) The parameterization of Zhu et al. (2020): Wq is independent of B, D, A, Wp given C. (c) Our proposed parameterization. We slightly abuse the Bayesian network notation by grouping variables. In the standard notation, there would be arcs from the parent variables to each grouped variable as well as arcs between the grouped variables.]

With the factorized binary rule probability in Equation 2, Term A1 in Equation 3 can be rewritten as Equation 5. Zhu et al. (2020) implement the inside algorithm by caching Term C1-1 in Equation 6, resulting in a time complexity of O(l^4 m^3 + l^3 m), which is cubic in m. We note that we can use unfold-refold to further cache Term C1-2 in Equation 6 and reduce the time complexity of the inside algorithm to O(l^4 m^2 + l^3 m + l^2 m^2), which is quadratic in m.

Although the factorization of Equation 2 reduces the space and time complexity of the inside algorithm of L-PCFGs, it is based on the independence assumption that the generation of w_q is independent of A, B, D and w_p given the non-head-child C. This assumption can be violated in many scenarios and hence reduces the expressiveness of the grammar. For example, suppose C is Noun; then even if we know B is Verb, we still need to know D to determine whether w_q is an object or a subject of the verb, and then need to know the actual verb w_p to pick a likely noun as w_q.

3 Factorization with latent variable

Our main goal is to find a parameterization that removes the implausible independence assumptions of Zhu et al. (2020) while decreasing the complexities of the original L-PCFGs.

To reduce the representation complexity, we draw inspiration from the canonical polyadic decomposition (CPD). CPD factorizes an n-th order tensor into n two-dimensional matrices. Each matrix consists of two dimensions: one dimension comes from the original n-th order tensor and the other dimension is shared by all the n matrices. The shared dimension can be marginalized to recover the original n-th order tensor. From a probabilistic perspective, the shared dimension can be regarded as a latent variable.
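As a toy illustration of this view of CPD (the shapes and names below are assumptions, not the grammar's), a third-order tensor can be recovered by marginalizing a shared latent index over per-mode factor matrices:

```python
# CPD as marginalization over a shared latent index h; illustrative only.
import torch

rank, d1, d2, d3 = 4, 3, 5, 6
weights = torch.softmax(torch.rand(rank), dim=0)          # mixture over the latent index
U = torch.softmax(torch.rand(rank, d1), dim=-1)           # each row is a categorical distribution
V = torch.softmax(torch.rand(rank, d2), dim=-1)
W = torch.softmax(torch.rand(rank, d3), dim=-1)

# T[i, j, k] = sum_h weights[h] * U[h, i] * V[h, j] * W[h, k]
T = torch.einsum("h,hi,hj,hk->ijk", weights, U, V, W)

# with normalized factors, T is a valid joint distribution with the latent index summed out
assert torch.allclose(T.sum(), torch.tensor(1.0), atol=1e-5)
```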
In the spirit of CPD, we introduce a latent variable H to decompose the order-6 probability tensor p(B, C, D, w_q | A, w_p). Instead of fully decomposing the tensor, we empirically find that binding some of the variables leads to better results. Our best factorization is as follows (also illustrated by a Bayesian network in Figure 1 (c)):

    p(B, C, W_q, D | A, W_p) = Σ_H p(H | A, W_p) · p(B | H) · p(C, D | H) · p(W_q | H).    (9)

According to d-separation (Pearl, 1988), when A and w_p are given, B, C, w_q, and D are interdependent due to the existence of H. In other words, our factorization does not make any independence assumption beyond the original binary rule. The domain size of H is analogous to the tensor rank in CPD and thus influences the expressiveness of our proposed model.

Based on our factorization approach, the binary rule probability is factorized as

    p(A[w_p] → B[w_p] C[w_q]) = Σ_H p(H | A, w_p) · p(B | H) · p(C, ⇐ | H) · p(w_q | H),    (10)

and

    p(A[w_p] → B[w_q] C[w_p]) = Σ_H p(H | A, w_p) · p(C | H) · p(B, ⇒ | H) · p(w_q | H).    (11)

We also follow Zhu et al. (2020) and factorize the start rule as

    p(S → A[w_p]) = p(A | S) · p(w_p | A).    (12)

Computational complexity: Considering the head-child as the left child (similarly for the other case), we apply Equation 10 in Term A1 of Equation 3 and obtain Equation 7. Rearranging the summations in Equation 7 gives Equation 8, where Terms D1-1 and D1-2 can be cached and reused; this also uses the unfold-refold technique. The final time complexity of the inside computation with our factorization approach is O(l^4 d_H + l^2 m d_H) (d_H is the domain size of the latent variable H), which is linear in m.

(3)  s_{i,j}^{A,p} = Term A1 + Term A2, where
     Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · p(A[w_p] → B[w_p] C[w_q])
     Term A2 = Σ_{k=i+1}^{p} Σ_{q=i}^{k−1} Σ_{B,C} s_{i,k}^{B,q} · s_{k,j}^{C,p} · p(A[w_p] → B[w_q] C[w_p])

(4)  s_{i,j}^{A,p} = Term B1 + Term B2, where
     Term B1 = Σ_{k=p+1}^{j−1} Σ_B s_{i,k}^{B,p} Σ_{q=k}^{j−1} Σ_C s_{k,j}^{C,q} · p(A[w_p] → B[w_p] C[w_q])
     Term B2 = Σ_{k=i+1}^{p} Σ_C s_{k,j}^{C,p} Σ_{q=i}^{k−1} Σ_B s_{i,k}^{B,q} · p(A[w_p] → B[w_q] C[w_p])

(5)  Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · p(B, ⇐, C | A, w_p) · p(w_q | C)
     (the last two factors are the factorization of p(A[w_p] → B[w_p] C[w_q]) in Equation 2)

(6)          = Σ_{k=p+1}^{j−1} Σ_B Σ_C [ s_{i,k}^{B,p} · p(B, ⇐, C | A, w_p) ]_{Term C1-1} · [ Σ_{q=k}^{j−1} s_{k,j}^{C,q} · p(w_q | C) ]_{Term C1-2}

(7)  Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · Σ_H p(H | A, w_p) p(B | H) p(C, ⇐ | H) p(w_q | H)
     (the sum over H is the factorization of p(A[w_p] → B[w_p] C[w_q]) in Equation 10)

(8)          = Σ_H p(H | A, w_p) Σ_{k=p+1}^{j−1} [ Σ_B s_{i,k}^{B,p} · p(B | H) ]_{Term D1-1} · [ Σ_{q=k}^{j−1} Σ_C s_{k,j}^{C,q} · p(C, ⇐ | H) · p(w_q | H) ]_{Term D1-2}

Table 1: Recursive formulas of the inside algorithm for Eisner and Satta (1999) (Equation 4), Zhu et al. (2020) (Equations 5-6), and our formalism (Equations 7-8), respectively. s_{i,j}^{A,p} indicates the probability of a head nonterminal A[w_p] spanning the substring w_i, ..., w_{j−1}, where p is the position of the headword in the sentence.
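The sketch below shows how Term A1 in Equation 8 could be computed for a single span (i, j) and head position p with cached Terms D1-1 and D1-2, assuming the head-child is the left child. The tensor names and shapes are illustrative assumptions rather than the released implementation; note that no object of size m × m is ever formed, which is where the linear dependence on m comes from.

```python
# Schematic computation of Term A1 (Equation 8) with cached Terms D1-1 and D1-2.
import torch

def term_a1(s_left, s_right, p_h_given_a, p_b_given_h, p_cdir_given_h, p_wq_given_h):
    # s_left:  [K, M]     s_left[k, B]     = s_{i,k}^{B,p}  (left child, headed by w_p)
    # s_right: [K, Q, M]  s_right[k, q, C] = s_{k,j}^{C,q}  (right child, headed by w_q);
    #                     in a real chart, entries for invalid (k, q) pairs would be zero
    # p_h_given_a:    [A, H]   p(H | A, w_p)
    # p_b_given_h:    [H, M]   p(B | H)
    # p_cdir_given_h: [H, M]   p(C, <= | H)
    # p_wq_given_h:   [H, Q]   p(w_q | H) evaluated at the actual words of the span
    d1_1 = torch.einsum("km,hm->kh", s_left, p_b_given_h)          # Term D1-1, per split point k
    tmp  = torch.einsum("kqm,hm->kqh", s_right, p_cdir_given_h)
    d1_2 = torch.einsum("kqh,hq->kh", tmp, p_wq_given_h)           # Term D1-2, per split point k
    inner = (d1_1 * d1_2).sum(0)                                   # sum over split points k -> [H]
    return torch.einsum("ah,h->a", p_h_given_a, inner)             # one value per parent A

# toy shapes: 3 split points, 4 positions, 6 child symbols, 5 parents, 8 latent states
K, Q, M, A, H = 3, 4, 6, 5, 8
out = term_a1(torch.rand(K, M), torch.rand(K, Q, M), torch.rand(A, H),
              torch.rand(H, M), torch.rand(H, M), torch.rand(H, Q))
```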
Choices of factorization: If we follow the intuition of CPD, then we would assume that B, C, D, and w_q are all independent conditioned on H. However, properly relaxing this strong assumption by binding some variables could benefit our model. Though there are many different choices of binding the variables, some bindings can easily be ruled out. For instance, binding B and C prevents us from caching Term D1-1 and Term D1-2 in Equation 7 and thus we cannot implement the inside algorithm efficiently; binding C and w_q leads to a high computational complexity because we would have to compute a high-dimensional (m|Σ|) categorical distribution. In Section 6.3, we conduct an ablation study on the impact of different choices of factorization.

Neural parameterizations: We follow Kim et al. (2019) and Zhu et al. (2020) and define the following neural parameterization:

    p(A | S) = exp(u_S^⊤ f_1(w_A)) / Σ_{A' ∈ N} exp(u_S^⊤ f_1(w_{A'}))
    p(w | A) = exp(u_A^⊤ f_2(w_w)) / Σ_{w' ∈ Σ} exp(u_A^⊤ f_2(w_{w'}))
    p(B | H) = exp(u_H^⊤ w_B) / Σ_{B' ∈ N ∪ P} exp(u_H^⊤ w_{B'})
    p(w | H) = exp(u_H^⊤ f_3(w_w)) / Σ_{w' ∈ Σ} exp(u_H^⊤ f_3(w_{w'}))
    p(C, ⇐ | H) = exp(u_H^⊤ w_{C⇐}) / Σ_{C' ∈ M} exp(u_H^⊤ w_{C'})
    p(C, ⇒ | H) = exp(u_H^⊤ w_{C⇒}) / Σ_{C' ∈ M} exp(u_H^⊤ w_{C'})
    p(H | A, w) = exp(u_H^⊤ f_4([w_A; w_w])) / Σ_{H' ∈ H} exp(u_{H'}^⊤ f_4([w_A; w_w]))

where H = {H_1, ..., H_{d_H}}, M = (N ∪ P) × {⇐, ⇒}, u and w are nonterminal embeddings and word embeddings respectively, and f_1(·), f_2(·), f_3(·), f_4(·) are neural networks with residual layers (He et al., 2016) (the full parameterization is shown in the Appendix).
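A minimal sketch of this kind of parameterization is given below; module and dimension names are assumptions, the MLP stands in for the residual networks detailed in the Appendix, and this is not the released code.

```python
# Softmax parameterization of p(B | H) and p(H | A, w) over shared embeddings; illustrative only.
import torch, torch.nn as nn

class RuleParam(nn.Module):
    def __init__(self, n_latent, n_child, n_nt, vocab, dim=256):
        super().__init__()
        self.u_h = nn.Parameter(torch.randn(n_latent, dim))      # latent-state embeddings u_H
        self.w_child = nn.Parameter(torch.randn(n_child, dim))   # child-symbol embeddings w_B
        self.w_nt = nn.Embedding(n_nt, dim)                      # nonterminal embeddings w_A
        self.w_word = nn.Embedding(vocab, dim)                   # word embeddings w_w
        # plain MLP in place of the residual network f_4 used in the paper
        self.f4 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def p_b_given_h(self):
        # p(B | H): softmax over child symbols for each latent state
        return torch.softmax(self.u_h @ self.w_child.t(), dim=-1)      # [n_latent, n_child]

    def p_h_given_a_w(self, a_ids, w_ids):
        # p(H | A, w): score each latent state against f_4([w_A; w_w]) and normalize over H
        x = self.f4(torch.cat([self.w_nt(a_ids), self.w_word(w_ids)], dim=-1))
        return torch.softmax(x @ self.u_h.t(), dim=-1)                  # [batch, n_latent]

params = RuleParam(n_latent=300, n_child=45, n_nt=15, vocab=10000)
probs = params.p_h_given_a_w(torch.tensor([0, 3]), torch.tensor([17, 42]))   # [2, 300]
```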
4 Experimental setup

4.1 Dataset

We conduct experiments on the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1994). We use the same preprocessing pipeline as in Kim et al. (2019). Specifically, punctuation is removed from all data splits and the top 10,000 frequent words in the training data are used as the vocabulary. For dependency grammar induction, we follow Zhu et al. (2020) and use the Stanford typed dependency representation (de Marneffe and Manning, 2008).

4.2 Hyperparameters

We optimize our model using the Adam optimizer with β_1 = 0.75, β_2 = 0.999, and learning rate 0.001. All parameters are initialized with Xavier uniform initialization. We set the dimension of all embeddings to 256 and the ratio of the nonterminal number to the preterminal number to 1:2. Our best model uses 15 nonterminals, 30 preterminals, and d_H = 300. We use grid search to tune the nonterminal number (from 5 to 30) and the domain size d_H of the latent H (from 50 to 500).

4.3 Evaluation

We run each model four times with different random seeds and for ten epochs. We train our models on training sentences of length ≤ 40 with batch size 8 and test them on the whole test set. For each run, we perform early stopping and select the best model according to the perplexity of the development set. We use two different parsing methods: a variant of the CYK algorithm (Eisner and Satta, 1999) and Minimum Bayes-Risk (MBR) decoding (Smith and Eisner, 2006).² For constituency grammar induction, we report the means and standard deviations of sentence-level F1 scores.³ For dependency grammar induction, we report the unlabeled directed attachment score (UDAS) and the unlabeled undirected attachment score (UUAS).

² In MBR decoding, we use automatic differentiation (Eisner, 2016; Rush, 2020) to estimate the marginals of spans and arcs, and then use the CYK and Eisner algorithms for constituency and dependency parsing, respectively.
³ Following Kim et al. (2019), we remove all trivial spans (single-word spans and sentence-level spans). Sentence-level means that we compute F1 for each sentence and then average over all sentences.
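Footnote 2 rests on the fact that span marginals are gradients of the log-partition function with respect to span log-potentials. The toy example below illustrates this idea for unlabeled binary trees; it is only a sketch, not the decoding code used in the experiments.

```python
# Span marginals via back-propagation through a tiny inside algorithm (log-space).
# Each tree's log-weight is the sum of the log-potentials of its spans.
import torch

n = 4
log_phi = torch.randn(n + 1, n + 1, requires_grad=True)    # log-potential of span (i, j)

inside = [[None] * (n + 1) for _ in range(n + 1)]
for i in range(n):
    inside[i][i + 1] = log_phi[i, i + 1]
for width in range(2, n + 1):
    for i in range(n - width + 1):
        j = i + width
        splits = torch.stack([inside[i][k] + inside[k][j] for k in range(i + 1, j)])
        inside[i][j] = log_phi[i, j] + torch.logsumexp(splits, dim=0)

log_z = inside[0][n]            # log-partition over all binary trees of the sentence
log_z.backward()
span_marginals = log_phi.grad   # probability that a sampled tree contains span (i, j)
```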
5 Main result

We present our main results in Table 2. Our model is referred to as Neural Bi-Lexicalized PCFG (NBL-PCFG). We mainly compare our approach against recent PCFG-based models: neural PCFG (N-PCFG) and compound PCFG (C-PCFG) (Kim et al., 2019), tensor-decomposition-based neural PCFG (TN-PCFG) (Yang et al., 2021), and neural L-PCFG (NL-PCFG) (Zhu et al., 2020). We report both the official results of Zhu et al. (2020) and our reimplementation.

    Model              F1           UDAS         UUAS
    Official results
    N-PCFG⋆            50.8
    C-PCFG⋆            55.2
    NL-PCFG⋆           55.3         39.7         53.3
    TN-PCFG†           57.7
    Our results
    NL-PCFG⋆           53.3±2.1     23.8±1.1     47.4±1.0
    NL-PCFG†           57.4±1.4     25.3±1.3     47.2±0.7
    NBL-PCFG⋆          58.2±1.5     37.1±2.8     54.6±1.3
    NBL-PCFG†          60.4±1.6     39.1±2.8     56.1±1.3
    For reference
    S-DIORA            57.6
    StructFormer       54.0         46.2         61.6
    Oracle Trees       84.3

Table 2: Unlabeled sentence-level F1 scores, unlabeled directed attachment scores (UDAS) and unlabeled undirected attachment scores (UUAS) on the WSJ test data. † indicates MBR decoding; ⋆ indicates CYK decoding. Recall that the official result of Zhu et al. (2020) uses compound parameterization while our reimplementation removes the compound parameterization. S-DIORA: Drozdov et al. (2020). StructFormer: Shen et al. (2020).

We do not use the compound trick (Kim et al., 2019) in our implementations of lexicalized PCFGs because we empirically find that using it results in unstable training and does not necessarily bring performance improvements.

We draw three key observations: (1) Our model achieves the best F1 and UUAS scores under both CYK and MBR decoding. It is also comparable to the official NL-PCFG in the UDAS score. (2) When we remove the compound parameterization from NL-PCFG, its F1 score drops slightly while its UDAS and UUAS scores drop dramatically. This implies that compound parameterization is the key to achieving excellent dependency grammar induction performance in NL-PCFG. (3) MBR decoding outperforms CYK decoding.

Regarding UDAS, our model significantly outperforms NL-PCFGs if compound parameterization is not used (37.1 vs. 23.8 with CYK decoding), showing that explicitly modeling bilexical relationships is helpful in dependency grammar induction. However, when compound parameterization is used, the UDAS of NL-PCFGs is greatly improved, slightly surpassing that of our model. We believe this is because compound parameterization greatly weakens the independence assumption of NL-PCFGs (i.e., the child word is dependent on C only) by leaking bilexical information via the global sentence embedding. On the other hand, NBL-PCFGs are already expressive enough, and thus compound parameterization brings no further increase of their expressiveness but makes learning more difficult.

6 Analysis

In the following experiments, we report results using MBR decoding by default. We also use d_H = 300 by default unless otherwise specified.

6.1 Influence of the domain size of H

d_H (the domain size of H) influences the expressiveness of our model. Figure 2a illustrates perplexities and F1 scores with the increase of d_H and a fixed nonterminal number of 10 (plots of UDAS and UUAS can be found in the Appendix). We can see that when d_H is small, the model has a high perplexity and a low F1 score, indicating the limited expressiveness of NBL-PCFGs. When d_H is larger than 300, the perplexity plateaus and the F1 score starts to decrease, possibly because of overfitting.

6.2 Influence of nonterminal number

Figure 2b illustrates perplexities and F1 scores with the increase of the nonterminal number and fixed d_H = 300 (plots of UDAS and UUAS can be found in the Appendix). We observe that increasing the nonterminal number has only a minor influence on NBL-PCFGs. We speculate that this is because the number of word-annotated nonterminals (m|Σ|) is already sufficiently large even if m is small. On the other hand, the nonterminal number has a big influence on NL-PCFGs. This is most likely because NL-PCFGs make the independence assumption that the generation of w_q is solely determined by the non-head-child C and thus require more nonterminals so that C has the capacity of conveying information from A, B, D and w_p. Using more nonterminals (> 30) seems to be helpful for NL-PCFGs, but would be computationally too expensive due to the quadratically increased complexity in the number of nonterminals.
6.3 Influence of different variable bindings

Table 3 presents the results of our models with the following bindings:

• D-alone: D is generated alone.
• D-wq: D is generated with w_q.
• D-B: D is generated with the head-child B.
• D-C: D is generated with the non-head-child C.

               F1      UDAS    UUAS    Perplexity
    D-C        60.4    39.1    56.1    161.9
    D-alone    57.2    32.8    54.1    164.8
    D-wq       47.7    45.7    58.6    176.8
    D-B        47.8    36.9    54.0    169.6

Table 3: Binding the head direction D with different variables.

Clearly, binding D and C (the default setting for NBL-PCFG) results in the lowest perplexity and the highest F1 score. Binding D and w_q has a surprisingly good performance in unsupervised dependency parsing.

We find that how to bind the head direction has a huge impact on the unsupervised parsing performance, and we give the following intuition. Usually, given a headword and its type, the children generated in each direction would be different. So D is intuitively more related to w_q and C than to B; on the other hand, B depends more on the headword. In Table 3 we can see that D-B has a lower UDAS score than D-C and D-wq, which is consistent with this intuition. Notably, in Zhu et al. (2020), their Factorization III has a significantly lower UDAS than the default model (35.5 vs. 25.9), and the only difference is whether the generation of C is dependent on the head direction. This is also consistent with our intuition.

6.4 Qualitative analysis

We analyze the parsing performance of different PCFG extensions by breaking down their recall numbers by constituent label (see Table 4). NPs and VPs cover most of the gold constituents in the WSJ test set. TN-PCFGs have the best performance in predicting NPs, and NBL-PCFGs have better performance in predicting the other labels on average.

                 N-PCFG†   C-PCFG†   TN-PCFG†   NL-PCFG   NBL-PCFG
    NP           72.3%     73.6%     75.4%      74.0%     66.2%
    VP           28.1%     45.0%     48.4%      44.3%     61.1%
    PP           73.0%     71.4%     67.0%      68.4%     77.7%
    SBAR         53.6%     54.8%     50.3%      49.4%     63.8%
    ADJP         40.8%     44.3%     53.6%      55.5%     59.7%
    ADVP         43.8%     61.6%     59.5%      57.1%     59.1%
    Perplexity   254.3     196.3     207.3      181.2     161.9

Table 4: Recall on six frequent constituent labels and perplexities on the WSJ test data. † means that the results are reported by Yang et al. (2021).

We further analyze the quality of our induced trees. Our model prefers to predict left-headed constituents (i.e., constituents headed by the leftmost word). VPs are usually left-headed in English, so our model has a much higher recall on VPs and correctly predicts their headwords. SBARs often start with "which" and "that", and PPs often start with prepositions such as "of" and "for". Our model often relies on these words to predict the correct constituents and hence erroneously predicts these words as the headwords, which hurts the dependency accuracy. For NPs, we find our model often makes mistakes in predicting adjective-noun phrases. For example, the correct parse of "a rough market" is (a (rough market)), but our model predicts ((a rough) market) instead.

7 Discussion on dependency annotation schemes

What should be regarded as the headword is still debatable in linguistics, especially around function words (Zwicky, 1993). For example, in the phrase "the company", some linguists argue that "the" should be the headword (Abney, 1972). These disagreements are reflected in the dependency annotation schemes. Researchers have found that different dependency annotation schemes result in very different evaluation scores of unsupervised dependency parsing (Noji, 2016; Shen et al., 2020). In our experiments, we use the Stanford Dependencies annotation scheme in order to compare with NL-PCFGs. Stanford Dependencies prefers to select content words as headwords. However, as we discussed in previous sections, our model prefers to select function words (e.g., of, which, for) as headwords for SBARs or PPs. This explains why our model can outperform all the baselines on constituency parsing but not on dependency parsing (as judged by Stanford Dependencies) at the same time. Table 3 shows that there is a trade-off between the F1 score and UDAS, which suggests that adapting our model to Stanford Dependencies would hurt its ability to identify constituents.

8 Speed comparison

In practice, the forward and backward pass of the inside algorithm consumes the majority of the running time in training an N(B)L-PCFG. The existing implementation by Zhu et al. (2020)⁴ does not employ efficient parallelization and has a cubic time complexity in the number of nonterminals.

⁴ https://github.com/neulab/neural-lpcfg
We provide an efficient reimplementation (we follow Zhang et al. (2020) to batchify) of the inside algorithm based on Equation 6. We refer to the implementation which caches Term C1-1 as re-impl-1 and to the implementation which caches Term C1-2 as re-impl-2.

We measure the time of a single forward and backward pass of the inside algorithm with batch size 1 on a single Titan V GPU. Figure 3a illustrates the time with the increase of the sentence length and a fixed nonterminal number of 10. The original implementation of NL-PCFG by Zhu et al. (2020) takes much more time when sentences are long. For example, when the sentence length is 40, it needs 6.80s, while our fast implementation takes 0.43s and our NBL-PCFG takes only 0.30s. Figure 3b illustrates the time with the increase of the nonterminal number m and a fixed sentence length of 30. The original implementation runs out of 12GB memory when m = 30. re-impl-2 is faster than re-impl-1 when increasing m as it has a better time complexity in m (quadratic for re-impl-2, cubic for re-impl-1). Our NBL-PCFGs have a linear complexity in m, and as we can see in the figure, our NBL-PCFGs are much faster when m is large.

[Figure 2: The change of F1 scores and perplexities with the change of |H| and the nonterminal number.]

[Figure 3: Total time for performing the inside algorithm and automatic differentiation with different sentence lengths and nonterminal numbers.]
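A rough sketch of such a timing setup is given below, under the assumptions that a CUDA device is available and that `inside` is any differentiable implementation returning a scalar sentence log-likelihood; it is not the benchmark script used for Figure 3.

```python
# Timing a single forward + backward pass of an inside implementation; illustrative only.
import time, torch

def time_inside(inside, sentence, n_warmup=3, n_runs=10):
    for _ in range(n_warmup):                      # warm-up runs to exclude one-off setup costs
        inside(sentence).backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        loss = inside(sentence)                    # forward pass of the inside algorithm
        loss.backward()                            # backward pass via automatic differentiation
    if torch.cuda.is_available():
        torch.cuda.synchronize()                   # wait for GPU work before stopping the clock
    return (time.perf_counter() - start) / n_runs  # gradient buffers are ignored in this sketch
```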
9 Related Work

Unsupervised parsing has a long history but has regained great attention in recent years. In unsupervised dependency parsing, most methods are based on the Dependency Model with Valence (DMV) (Klein and Manning, 2004). Neurally parameterized DMVs have obtained state-of-the-art performance (Jiang et al., 2016; Han et al., 2017, 2019; Yang et al., 2020). However, they rely on gold POS tags and sophisticated initializations (e.g., K&M initialization or initialization with the parsing result of another unsupervised model). Noji et al. (2016) propose a left-corner parsing-based DMV model to limit the stack depth of center-embedding, which is insensitive to initialization but needs gold POS tags. He et al. (2018) propose a latent-variable based DMV model, which does not need gold POS tags but requires good initialization and high-quality induced POS tags. See Han et al. (2020) for a survey of unsupervised dependency parsing. Compared to these methods, our method does not require gold/induced POS tags or sophisticated initializations, though its performance lags behind some of these previous methods.

Recent unsupervised constituency parsers can be roughly categorized into the following groups: (1) PCFG-based methods. Depth-bounded PCFGs (Jin et al., 2018a,b) limit the stack depth of center-embedding.
Neurally parameterized PCFGs (Jin et al., 2019; Kim et al., 2019; Zhu et al., 2020; Yang et al., 2021) use neural networks to produce grammar rule probabilities. (2) Deep Inside-Outside Recursive Auto-encoder (DIORA) based methods (Drozdov et al., 2019a,b, 2020; Hong et al., 2020; Sahay et al., 2021). They use neural networks to mimic the inside-outside algorithm and are trained with masked language model objectives. (3) Syntactic distance-based methods (Shen et al., 2018, 2019, 2020). They encode hidden syntactic trees into syntactic distances and inject them into language models. (4) Probing-based methods (Kim et al., 2020; Li et al., 2020). They extract phrase-structure trees based on the attention distributions of large pre-trained language models. In addition to these methods, Cao et al. (2020) use constituency tests and Shi et al. (2021) make use of naturally-occurring bracketings such as hyperlinks on webpages to train parsers. Multimodal information such as images (Shi et al., 2019; Zhao and Titov, 2020; Jin and Schuler, 2020) and videos (Zhang et al., 2021) has also been exploited for unsupervised constituency parsing.

We are only aware of a few previous studies in unsupervised joint dependency and constituency parsing. Klein and Manning (2004) propose a joint DMV and CCM (Klein and Manning, 2002) model. Shen et al. (2020) propose a transformer-based method, in which they define syntactic distances to guide the attention of transformers. Zhu et al. (2020) propose neural L-PCFGs for unsupervised joint parsing.

10 Conclusion

We have presented a new formalism of lexicalized PCFGs. Our formalism relies on the canonical polyadic decomposition to factorize the probability tensor of binary rules. The factorization reduces the space and time complexity of lexicalized PCFGs while keeping the independence assumptions encoded in the original binary rules intact. We further parameterize our model using neural networks and present an efficient implementation of our model. On the English WSJ test data, our model achieves the lowest perplexity, outperforms all the existing extensions of PCFGs in constituency grammar induction, and is comparable to strong baselines in dependency grammar induction.

Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (61976139).
References

Steven P. Abney. 1972. The English noun phrase in its sentential aspect.
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Unsupervised parsing via constituency tests. In EMNLP 2020.
Eugene Charniak. 1996. Tree-bank grammars. Technical report.
Justin Chiu and Alexander Rush. 2020. Scaling hidden Markov language models. In EMNLP 2020.
Shay B. Cohen, Giorgio Satta, and Michael Collins. 2013. Approximate PCFG parsing using tensor decomposition. In NAACL-HLT 2013.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL 1997.
Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
Andrew Drozdov, Subendhu Rongali, Yi-Pei Chen, Tim O'Gorman, Mohit Iyyer, and Andrew McCallum. 2020. Unsupervised parsing with S-DIORA: Single tree encoding for deep inside-outside recursive autoencoders. In EMNLP 2020.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019a. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In NAACL-HLT 2019.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019b. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In NAACL-HLT 2019.
Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Workshop on Structured Prediction for NLP.
Jason Eisner and John Blatz. 2007. Program transformations for optimization of parsing algorithms and other weighted logic programs. In FG 2006.
Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head automaton grammars. In ACL 1999.
Wenjuan Han, Yong Jiang, Hwee Tou Ng, and Kewei Tu. 2020. A survey of unsupervised dependency parsing. In COLING 2020.
Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. In EMNLP 2017.
Wenjuan Han, Yong Jiang, and Kewei Tu. 2019. Enhancing unsupervised generative dependency parser with contextual information. In ACL 2019.
Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Unsupervised learning of syntactic structure with invertible neural projections. In EMNLP 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR 2016.
Ruyue Hong, Jiong Cai, and Kewei Tu. 2020. Deep inside-outside recursive autoencoder with all-span objective. In COLING 2020.
F. Jelinek, J. D. Lafferty, and R. L. Mercer. 1992. Basic methods of probabilistic context free grammars. In Speech Recognition and Understanding. Springer.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In EMNLP 2016.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018a. Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction. In EMNLP 2018.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018b. Unsupervised grammar induction with depth-bounded PCFG. Transactions of the Association for Computational Linguistics, 6:211–224.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, and William Schuler. 2019. Unsupervised learning of PCFGs with normalizing flow. In ACL 2019.
Lifeng Jin and William Schuler. 2020. Grounded PCFG induction with images. In AACL-IJCNLP 2020.
Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.
Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In ICLR 2020.
Yoon Kim, Chris Dyer, and Alexander Rush. 2019. Compound probabilistic context-free grammars for grammar induction. In ACL 2019.
Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL 2004.
Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In ACL 2002.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL 2003.
Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35–56.
Bowen Li, Taeuk Kim, Reinald Kim Amplayo, and Frank Keller. 2020. Heads-up! Unsupervised constituency parsing via self-attention heads. In AACL-IJCNLP 2020.
David M. Magerman. 1995. Statistical decision-tree models for parsing. In ACL 1995.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology Workshop 1994.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.
Hiroshi Noji. 2016. Left-corner methods for syntactic modeling with universal structural constraints. CoRR, abs/1608.00293.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson. 2016. Using left-corner parsing to encode universal structural constraints in grammar induction. In EMNLP 2016.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Alexander Rush. 2020. Torch-struct: Deep structured prediction library. In ACL 2020 System Demonstrations.
Atul Sahay, Anshul Nasery, Ayush Maheshwari, Ganesh Ramakrishnan, and Rishabh Iyer. 2021. Rule augmented unsupervised constituency parsing.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron C. Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In ICLR 2018.
Yikang Shen, Shawn Tan, Seyedarian Hosseini, Zhouhan Lin, Alessandro Sordoni, and Aaron C. Courville. 2019. Ordered memory. CoRR, abs/1910.13466.
Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, and Aaron Courville. 2020. StructFormer: Joint unsupervised induction of dependency and constituency structure from masked language modeling.
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. In ACL 2019.
Tianze Shi, Ozan İrsoy, Igor Malioutov, and Lillian Lee. 2021. Learning syntax from naturally-occurring bracketings. In NAACL-HLT 2021.
David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In COLING/ACL 2006 Poster Sessions.
Songlin Yang, Yong Jiang, Wenjuan Han, and Kewei Tu. 2020. Second-order unsupervised neural dependency parsing. In COLING 2020.
Songlin Yang, Yanpeng Zhao, and Kewei Tu. 2021. PCFGs can do better: Inducing probabilistic context-free grammars with many symbols. In NAACL-HLT 2021.
Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, and Jiebo Luo. 2021. Video-aided unsupervised grammar induction. In NAACL-HLT 2021.
Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In IJCAI 2020.
Yanpeng Zhao and Ivan Titov. 2020. Visually grounded compound PCFGs. In EMNLP 2020.
Hao Zhu, Yonatan Bisk, and Graham Neubig. 2020. The return of lexical dependencies: Neural lexicalized PCFGs. Transactions of the Association for Computational Linguistics, 8:647–661.
A. Zwicky. 1993. Heads in grammatical theory: Heads, bases and functors.
A Full Parameterization

We give the full parameterizations of the following probability distributions:

    p(A | S) = exp(u_S^⊤ h_1(w_A)) / Σ_{A' ∈ N} exp(u_S^⊤ h_1(w_{A'}))
    p(w | A) = exp(u_A^⊤ h_2(w_w)) / Σ_{w' ∈ Σ} exp(u_A^⊤ h_2(w_{w'}))
    p(w | H) = exp(u_H^⊤ h_3(w_w)) / Σ_{w' ∈ Σ} exp(u_H^⊤ h_3(w_{w'}))
    p(H | A, w) = exp(u_H^⊤ f([w_A; w_w])) / Σ_{H' ∈ H} exp(u_{H'}^⊤ f([w_A; w_w]))

    h_i(x) = g_{i,1}(g_{i,2}(W_i x))
    g_{i,j}(y) = ReLU(V_{i,j} ReLU(U_{i,j} y)) + y
    f([x; y]) = h_4(ReLU(W [x; y]) + y)

B Influence of the domain size of H and the number of nonterminals

Figure 4 illustrates the change of UUAS and UDAS with the increase of d_H. We find similar tendencies compared to the change of F1 scores and perplexities with the increase of d_H; d_H = 300 performs best. Figure 5 illustrates the change of UUAS and UDAS when increasing the number of nonterminals. We can see that NL-PCFGs benefit from using more nonterminals, while NBL-PCFGs have a better performance when the number of nonterminals is relatively small.

[Figure 4: Influence of d_H on UUAS and UDAS.]

[Figure 5: Influence of the number of nonterminals on UUAS and UDAS.]
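A sketch of the residual blocks h_i and g_{i,j} defined in Appendix A is given below; nn.Linear layers (which include biases) stand in for the weight matrices and the dimension is an assumption, so this is only an illustration rather than the released implementation.

```python
# Residual blocks following the form of h_i, g_{i,j}, and f in Appendix A; illustrative only.
import torch, torch.nn as nn

class ResLayer(nn.Module):               # g_{i,j}(y) = ReLU(V ReLU(U y)) + y
    def __init__(self, dim):
        super().__init__()
        self.u, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim)
    def forward(self, y):
        return torch.relu(self.v(torch.relu(self.u(y)))) + y

class HBlock(nn.Module):                 # h_i(x) = g_{i,1}(g_{i,2}(W_i x))
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)
        self.g1, self.g2 = ResLayer(dim), ResLayer(dim)
    def forward(self, x):
        return self.g1(self.g2(self.w(x)))

class FBlock(nn.Module):                 # f([x; y]) = h_4(ReLU(W [x; y]) + y)
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(2 * dim, dim)
        self.h4 = HBlock(dim)
    def forward(self, x, y):
        return self.h4(torch.relu(self.w(torch.cat([x, y], dim=-1))) + y)
```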