Neural Bi-Lexicalized PCFG Induction - Songlin Yang, Yanpeng Zhao, Kewei Tu
Songlin Yang♣, Yanpeng Zhao♦, Kewei Tu♣*
♣ School of Information Science and Technology, ShanghaiTech University
Shanghai Engineering Research Center of Intelligent Vision and Imaging
Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
♦ ILCC, University of Edinburgh
{yangsl,tukw}@shanghaitech.edu.cn  yannzhao.ed@gmail.com
arXiv:2105.15021v1 [cs.CL] 31 May 2021
* Corresponding Author

Abstract

Neural lexicalized PCFGs (L-PCFGs) (Zhu et al., 2020) have been shown effective in grammar induction. However, to reduce computational complexity, they make a strong independence assumption on the generation of the child word and thus bilexical dependencies are ignored. In this paper, we propose an approach to parameterize L-PCFGs without making implausible independence assumptions. Our approach directly models bilexical dependencies and meanwhile reduces both learning and representation complexities of L-PCFGs. Experimental results on the English WSJ dataset confirm the effectiveness of our approach in improving both running speed and unsupervised parsing performance.

1 Introduction

Probabilistic context-free grammars (PCFGs) have been an important probabilistic approach to syntactic analysis (Lari and Young, 1990; Jelinek et al., 1992). They assign a probability to each of the parses admitted by CFGs and rank them by plausibility in such a way that the ambiguity of CFGs can be ameliorated. Still, due to the strong independence assumption of CFGs, vanilla PCFGs (Charniak, 1996) are far from adequate for highly ambiguous text.

A common premise for tackling the issue is to incorporate lexical information and weaken the independence assumption. There have been many approaches proposed under this premise (Magerman, 1995; Collins, 1997; Johnson, 1998; Klein and Manning, 2003). Among them, lexicalized PCFGs (L-PCFGs) are a relatively straightforward formalism (Collins, 2003). L-PCFGs extend PCFGs by associating a word, i.e., the lexical head, with each grammar symbol. They can thus exploit lexical information to disambiguate parsing decisions and are much more expressive than vanilla PCFGs. However, they suffer from representation and inference complexities. For representation, the addition of lexical information greatly increases the number of parameters to be estimated and exacerbates the data sparsity problem during learning, so the expectation-maximisation (EM) based estimation of L-PCFGs has to rely on sophisticated smoothing techniques and factorizations (Collins, 2003). As for inference, the CYK algorithm for L-PCFGs has an O(l^5 |G|) complexity, where l is the sentence length and |G| is the grammar constant. Although Eisner and Satta (1999) manage to reduce the complexity to O(l^4 |G|), inference with L-PCFGs is still relatively slow, making them less popular nowadays.

Recently, Zhu et al. (2020) combine the ideas of factorizing the binary rule probabilities (Collins, 2003) and neural parameterization (Kim et al., 2019) and propose neural L-PCFGs (NL-PCFGs), achieving good results in both unsupervised dependency and constituency parsing. Neural parameterization is the key to their success: it facilitates informed smoothing (Kim et al., 2019), reduces the number of learnable parameters for large grammars (Chiu and Rush, 2020; Yang et al., 2021), and enables advanced gradient-based optimization techniques instead of the traditional EM algorithm (Eisner, 2016). However, Zhu et al. (2020) oversimplify the binary rules to decrease the complexity of the inside/CYK algorithm in learning (i.e., estimating the marginal sentence log-likelihood) and inference. Specifically, they make a strong independence assumption on the generation of the child word such that it is only dependent on the nonterminal symbol. Bilexical dependencies, which have been shown useful in unsupervised dependency parsing (Han et al., 2017; Yang et al., 2020), are thus ignored.
To model bilexical dependencies and meanwhile reduce complexities, we draw inspiration from the canonical polyadic decomposition (CPD) (Kolda and Bader, 2009) and propose a latent-variable based neural parameterization of L-PCFGs. Cohen et al. (2013) and Yang et al. (2021) have used CPD to decrease the complexities of PCFGs, and our work can be seen as an extension of their work to L-PCFGs. We further adopt the unfold-refold transformation technique (Eisner and Blatz, 2007) to decrease complexities. By using this technique, we show that the time complexity of the inside algorithm implemented by Zhu et al. (2020) can be improved from cubic to quadratic in the number of nonterminals m. The inside algorithm of our proposed method has a linear complexity in m after combining CPD and unfold-refold.

We evaluate our model on the benchmark Wall Street Journal (WSJ) dataset. Our model surpasses the strong baseline NL-PCFG (Zhu et al., 2020) by 2.9% mean F1 and 1.3% mean UUAS under CYK decoding. When using Minimum Bayes-Risk (MBR) decoding, our model performs even better. We provide an efficient implementation of our proposed model at https://github.com/sustcsonglin/TN-PCFG.

2 Background

2.1 Lexicalized CFGs

We first introduce the formalization of CFGs. A CFG is defined as a 5-tuple G = (S, N, P, Σ, R), where S is the start symbol, N is a finite set of nonterminal symbols, P is a finite set of preterminal symbols,¹ Σ is a finite set of terminal symbols, and R is a set of rules in the following form:

    S → A,        A ∈ N
    A → B C,      A ∈ N; B, C ∈ N ∪ P
    T → w,        T ∈ P, w ∈ Σ

N, P and Σ are mutually disjoint. We will use 'nonterminals' to indicate N ∪ P when it is clear from the context.

¹ An alternative definition of CFGs does not distinguish nonterminals N (constituent labels) from preterminals P (part-of-speech tags) and treats both as nonterminals.

Lexicalized CFGs (L-CFGs) (Collins, 2003) extend CFGs by associating a word with each of the nonterminals:

    S → A[w_p],                A ∈ N
    A[w_p] → B[w_p] C[w_q],    A ∈ N; B, C ∈ N ∪ P
    A[w_p] → C[w_q] B[w_p],    A ∈ N; B, C ∈ N ∪ P
    T[w_p] → w_p,              T ∈ P

where w_p, w_q ∈ Σ are the headwords of the constituents spanned by the associated grammar symbols, and p, q are the word positions in the sentence. We refer to A, a parent nonterminal annotated by the headword w_p, as the head-parent. In binary rules, we refer to a child nonterminal as the head-child if it inherits the headword of the head-parent (e.g., B[w_p]) and as the non-head-child otherwise (e.g., C[w_q]). A head-child appears as either the left child or the right child. We denote the head direction by D ∈ {⇐, ⇒}, where ⇐ means the head-child appears as the left child.

2.2 Grammar induction with lexicalized probabilistic CFGs

Lexicalized probabilistic CFGs (L-PCFGs) extend L-CFGs by assigning each production rule r = A → γ a scalar π_r such that it forms a valid categorical probability distribution given the left-hand side A. Note that preterminal rules always have a probability of 1 because they define a deterministic generating process.

Grammar induction with L-PCFGs follows the same recipe as grammar induction with PCFGs. As with PCFGs, we maximize the log-likelihood of each observed sentence w = w_1, ..., w_l:

    log p(w) = log Σ_{t ∈ T_{G_L}(w)} p(t),    (1)

where p(t) = Π_{r ∈ t} π_r and T_{G_L}(w) consists of all possible lexicalized parse trees of the sentence w under an L-PCFG G_L. We can compute the marginal p(w) of the sentence using the inside algorithm in polynomial time. The core recursion of the inside algorithm is formalized in Equation 3 (Table 1). It recursively computes the probability s_{i,j}^{A,p} of a head-parent A[w_p] spanning the substring w_i, ..., w_{j−1} (p ∈ [i, j−1]). Terms A1 and A2 in Equation 3 cover the cases of the head-child as the left child and the right child, respectively.
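To make the recursion concrete, the sketch below spells out Equation 3 as plain loops over spans, head positions, and split points. It is only an illustration of the quantities involved (the tensor `rules`, the chart `s`, and all sizes are hypothetical), not the authors' implementation, and it deliberately materializes the full order-6 rule tensor whose cost Section 2.3 discusses.

```python
# Naive lexicalized inside recursion (Equation 3); purely illustrative and unoptimized.
# rules[A, B, C, wp, wq, 0] stands for p(A[wp] -> B[wp] C[wq])   (head-child on the left)
# rules[A, B, C, wp, wq, 1] stands for p(A[wp] -> B[wq] C[wp])   (head-child on the right)
import torch

def naive_inside(sent, rules, num_nt, num_pt):
    l = len(sent)
    m, mp = num_nt, num_nt + num_pt          # nonterminals come first, then preterminals
    # s[i, j, A, p]: probability of head-parent A[w_p] spanning w_i ... w_{j-1}
    s = torch.zeros(l + 1, l + 1, mp, l)
    for i in range(l):                        # base case: preterminal rules have probability 1
        s[i, i + 1, m:, i] = 1.0
    for width in range(2, l + 1):
        for i in range(l - width + 1):
            j = i + width
            for A in range(m):
                for p in range(i, j):
                    total = 0.0
                    for k in range(p + 1, j):          # Term A1: head-child is the left child
                        for q in range(k, j):
                            M = rules[A, :, :, sent[p], sent[q], 0]
                            total = total + s[i, k, :, p] @ M @ s[k, j, :, q]
                    for k in range(i + 1, p + 1):      # Term A2: head-child is the right child
                        for q in range(i, k):
                            M = rules[A, :, :, sent[p], sent[q], 1]
                            total = total + s[i, k, :, q] @ M @ s[k, j, :, p]
                    s[i, j, A, p] = total
    return s

# toy usage; in a real grammar `rules` must be normalized over (B, C, wq, D) given (A, wp)
V, m, pt = 5, 2, 3
rules = torch.rand(m, m + pt, m + pt, V, V, 2)
chart = naive_inside([0, 3, 1, 4], rules, m, pt)
```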
2.3 Challenges of L-PCFG induction

The major difference between L-PCFGs and vanilla PCFGs is that the former use word-annotated nonterminals, so the nonterminal number of L-PCFGs is up to |Σ| times the number of nonterminals in PCFGs. As the grammar size is largely determined by the number of binary rules and increases approximately cubically in the nonterminal number, representing L-PCFGs has a high space complexity of O(m^3 |Σ|^2) (m is the nonterminal number). Specifically, it requires an order-6 probability tensor for binary rules, with the dimensions representing A, B, C, w_p, w_q, and the head direction D, respectively. With so many rules, L-PCFGs are very prone to the data sparsity problem in rule probability estimation. Collins (2003) suggests factorizing the binary rule probabilities according to specific independence assumptions, but his approach still relies on complicated smoothing techniques to be effective.

The addition of lexical heads also scales up the computational complexity of the inside algorithm by a factor of O(l^2) and brings it up to O(l^5 m^3). Eisner and Satta (1999) point out that, by changing the order of summations in Term A1 (A2) of Equation 3, one can cache and reuse Term B1 (B2) in Equation 4 and reduce the computational complexity to O(l^4 m^2 + l^3 m^3). This is an example application of unfold-refold as noted by Eisner and Blatz (2007). However, the complexity is still cubic in m, making it expensive to increase the total number of nonterminals.

2.4 Neural L-PCFGs

Zhu et al. (2020) apply neural parameterization to tackle the data sparsity issue and to reduce the total number of learnable parameters of L-PCFGs. Considering the head-child as the left child (similarly for the other case), they further factorize the binary rule probability as:

    p(A[w_p] → B[w_p] C[w_q]) = p(B, ⇐, C | A, w_p) · p(w_q | C).    (2)

Bayesian networks representing the original probability and the factorization are illustrated in Figure 1 (a) and (b).

[Figure 1: (a) The original parameterization of L-PCFGs. (b) The parameterization of Zhu et al. (2020): Wq is independent of B, D, A, Wp given C. (c) Our proposed parameterization. We slightly abuse the Bayesian network notation by grouping variables. In the standard notation, there would be arcs from the parent variables to each grouped variable as well as arcs between the grouped variables.]

With the factorized binary rule probability in Equation 2, Term A1 in Equation 3 can be rewritten as Equation 5. Zhu et al. (2020) implement the inside algorithm by caching Term C1-1 in Equation 6, resulting in a time complexity of O(l^4 m^3 + l^3 m), which is cubic in m. We note that we can use unfold-refold to further cache Term C1-2 in Equation 6 and reduce the time complexity of the inside algorithm to O(l^4 m^2 + l^3 m + l^2 m^2), which is quadratic in m.

Although the factorization of Equation 2 reduces the space and time complexity of the inside algorithm of L-PCFGs, it is based on the independence assumption that the generation of w_q is independent of A, B, D and w_p given the non-head-child C. This assumption can be violated in many scenarios and hence reduces the expressiveness of the grammar. For example, suppose C is Noun; then even if we know B is Verb, we still need to know D to determine whether w_q is an object or a subject of the verb, and then need to know the actual verb w_p to pick a likely noun as w_q.

3 Factorization with latent variable

Our main goal is to find a parameterization that removes the implausible independence assumptions of Zhu et al. (2020) while decreasing the complexities of the original L-PCFGs.

To reduce the representation complexity, we draw inspiration from the canonical polyadic decomposition (CPD). CPD factorizes an n-th order tensor into n two-dimensional matrices. Each matrix consists of two dimensions: one dimension comes from the original n-th order tensor and the other dimension is shared by all the n matrices. The shared dimension can be marginalized to recover the original n-th order tensor. From a probabilistic perspective, the shared dimension can be regarded as a latent variable.
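As a toy illustration of this view of CPD (the shapes and names below are assumptions, not the grammar's), a third-order tensor can be recovered by marginalizing a shared latent index over per-mode factor matrices:

```python
# CPD as marginalization over a shared latent index h; illustrative only.
import torch

rank, d1, d2, d3 = 4, 3, 5, 6
weights = torch.softmax(torch.rand(rank), dim=0)          # mixture over the latent index
U = torch.softmax(torch.rand(rank, d1), dim=-1)           # each row is a categorical distribution
V = torch.softmax(torch.rand(rank, d2), dim=-1)
W = torch.softmax(torch.rand(rank, d3), dim=-1)

# T[i, j, k] = sum_h weights[h] * U[h, i] * V[h, j] * W[h, k]
T = torch.einsum("h,hi,hj,hk->ijk", weights, U, V, W)

# with normalized factors, T is a valid joint distribution with the latent index summed out
assert torch.allclose(T.sum(), torch.tensor(1.0), atol=1e-5)
```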
In the spirit of CPD, we introduce a latent variable H to decompose the order-6 probability tensor p(B, C, D, w_q | A, w_p). Instead of fully decomposing the tensor, we empirically find that binding some of the variables leads to better results. Our best factorization is as follows (also illustrated by a Bayesian network in Figure 1 (c)):

    p(B, C, W_q, D | A, W_p) = Σ_H p(H | A, W_p) · p(B | H) · p(C, D | H) · p(W_q | H).    (9)

According to d-separation (Pearl, 1988), when A and w_p are given, B, C, w_q, and D are interdependent due to the existence of H. In other words, our factorization does not make any independence assumption beyond the original binary rule. The domain size of H is analogous to the tensor rank in CPD and thus influences the expressiveness of our proposed model.

Based on our factorization approach, the binary rule probability is factorized as

    p(A[w_p] → B[w_p] C[w_q]) = Σ_H p(H | A, w_p) · p(B | H) · p(C, ⇐ | H) · p(w_q | H),    (10)

and

    p(A[w_p] → B[w_q] C[w_p]) = Σ_H p(H | A, w_p) · p(C | H) · p(B, ⇒ | H) · p(w_q | H).    (11)

We also follow Zhu et al. (2020) and factorize the start rule as

    p(S → A[w_p]) = p(A | S) · p(w_p | A).    (12)

Computational complexity: Considering the head-child as the left child (similarly for the other case), we apply Equation 10 in Term A1 of Equation 3 and obtain Equation 7. Rearranging the summations in Equation 7 gives Equation 8, where Terms D1-1 and D1-2 can be cached and reused; this also uses the unfold-refold technique. The final time complexity of the inside computation with our factorization approach is O(l^4 d_H + l^2 m d_H) (d_H is the domain size of the latent variable H), which is linear in m.

(3)  s_{i,j}^{A,p} = Term A1 + Term A2, where
     Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · p(A[w_p] → B[w_p] C[w_q])
     Term A2 = Σ_{k=i+1}^{p} Σ_{q=i}^{k−1} Σ_{B,C} s_{i,k}^{B,q} · s_{k,j}^{C,p} · p(A[w_p] → B[w_q] C[w_p])

(4)  s_{i,j}^{A,p} = Term B1 + Term B2, where
     Term B1 = Σ_{k=p+1}^{j−1} Σ_B s_{i,k}^{B,p} Σ_{q=k}^{j−1} Σ_C s_{k,j}^{C,q} · p(A[w_p] → B[w_p] C[w_q])
     Term B2 = Σ_{k=i+1}^{p} Σ_C s_{k,j}^{C,p} Σ_{q=i}^{k−1} Σ_B s_{i,k}^{B,q} · p(A[w_p] → B[w_q] C[w_p])

(5)  Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · p(B, ⇐, C | A, w_p) · p(w_q | C)
     (the last two factors are the factorization of p(A[w_p] → B[w_p] C[w_q]) in Equation 2)

(6)          = Σ_{k=p+1}^{j−1} Σ_B Σ_C [ s_{i,k}^{B,p} · p(B, ⇐, C | A, w_p) ]_{Term C1-1} · [ Σ_{q=k}^{j−1} s_{k,j}^{C,q} · p(w_q | C) ]_{Term C1-2}

(7)  Term A1 = Σ_{k=p+1}^{j−1} Σ_{q=k}^{j−1} Σ_{B,C} s_{i,k}^{B,p} · s_{k,j}^{C,q} · Σ_H p(H | A, w_p) p(B | H) p(C, ⇐ | H) p(w_q | H)
     (the sum over H is the factorization of p(A[w_p] → B[w_p] C[w_q]) in Equation 10)

(8)          = Σ_H p(H | A, w_p) Σ_{k=p+1}^{j−1} [ Σ_B s_{i,k}^{B,p} · p(B | H) ]_{Term D1-1} · [ Σ_{q=k}^{j−1} Σ_C s_{k,j}^{C,q} · p(C, ⇐ | H) · p(w_q | H) ]_{Term D1-2}

Table 1: Recursive formulas of the inside algorithm for Eisner and Satta (1999) (Equation 4), Zhu et al. (2020) (Equations 5-6), and our formalism (Equations 7-8), respectively. s_{i,j}^{A,p} indicates the probability of a head nonterminal A[w_p] spanning the substring w_i, ..., w_{j−1}, where p is the position of the headword in the sentence.
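The sketch below shows how Term A1 in Equation 8 could be computed for a single span (i, j) and head position p with cached Terms D1-1 and D1-2, assuming the head-child is the left child. The tensor names and shapes are illustrative assumptions rather than the released implementation; note that no object of size m × m is ever formed, which is where the linear dependence on m comes from.

```python
# Schematic computation of Term A1 (Equation 8) with cached Terms D1-1 and D1-2.
import torch

def term_a1(s_left, s_right, p_h_given_a, p_b_given_h, p_cdir_given_h, p_wq_given_h):
    # s_left:  [K, M]     s_left[k, B]     = s_{i,k}^{B,p}  (left child, headed by w_p)
    # s_right: [K, Q, M]  s_right[k, q, C] = s_{k,j}^{C,q}  (right child, headed by w_q);
    #                     in a real chart, entries for invalid (k, q) pairs would be zero
    # p_h_given_a:    [A, H]   p(H | A, w_p)
    # p_b_given_h:    [H, M]   p(B | H)
    # p_cdir_given_h: [H, M]   p(C, <= | H)
    # p_wq_given_h:   [H, Q]   p(w_q | H) evaluated at the actual words of the span
    d1_1 = torch.einsum("km,hm->kh", s_left, p_b_given_h)          # Term D1-1, per split point k
    tmp  = torch.einsum("kqm,hm->kqh", s_right, p_cdir_given_h)
    d1_2 = torch.einsum("kqh,hq->kh", tmp, p_wq_given_h)           # Term D1-2, per split point k
    inner = (d1_1 * d1_2).sum(0)                                   # sum over split points k -> [H]
    return torch.einsum("ah,h->a", p_h_given_a, inner)             # one value per parent A

# toy shapes: 3 split points, 4 positions, 6 child symbols, 5 parents, 8 latent states
K, Q, M, A, H = 3, 4, 6, 5, 8
out = term_a1(torch.rand(K, M), torch.rand(K, Q, M), torch.rand(A, H),
              torch.rand(H, M), torch.rand(H, M), torch.rand(H, Q))
```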
Choices of factorization: If we follow the intuition of CPD, then we would assume that B, C, D, and w_q are all independent conditioned on H. However, properly relaxing this strong assumption by binding some variables could benefit our model. Though there are many different choices of binding the variables, some bindings can easily be ruled out. For instance, binding B and C prevents us from caching Term D1-1 and Term D1-2 in Equation 7 and thus we cannot implement the inside algorithm efficiently; binding C and w_q leads to a high computational complexity because we would have to compute a high-dimensional (m|Σ|) categorical distribution. In Section 6.3, we conduct an ablation study on the impact of different choices of factorization.

Neural parameterizations: We follow Kim et al. (2019) and Zhu et al. (2020) and define the following neural parameterization:

    p(A | S) = exp(u_S^⊤ f_1(w_A)) / Σ_{A' ∈ N} exp(u_S^⊤ f_1(w_{A'}))
    p(w | A) = exp(u_A^⊤ f_2(w_w)) / Σ_{w' ∈ Σ} exp(u_A^⊤ f_2(w_{w'}))
    p(B | H) = exp(u_H^⊤ w_B) / Σ_{B' ∈ N ∪ P} exp(u_H^⊤ w_{B'})
    p(w | H) = exp(u_H^⊤ f_3(w_w)) / Σ_{w' ∈ Σ} exp(u_H^⊤ f_3(w_{w'}))
    p(C, ⇐ | H) = exp(u_H^⊤ w_{C⇐}) / Σ_{C' ∈ M} exp(u_H^⊤ w_{C'})
    p(C, ⇒ | H) = exp(u_H^⊤ w_{C⇒}) / Σ_{C' ∈ M} exp(u_H^⊤ w_{C'})
    p(H | A, w) = exp(u_H^⊤ f_4([w_A; w_w])) / Σ_{H' ∈ H} exp(u_{H'}^⊤ f_4([w_A; w_w]))

where H = {H_1, ..., H_{d_H}}, M = (N ∪ P) × {⇐, ⇒}, u and w are nonterminal embeddings and word embeddings respectively, and f_1(·), f_2(·), f_3(·), f_4(·) are neural networks with residual layers (He et al., 2016) (the full parameterization is shown in the Appendix).
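A minimal sketch of this kind of parameterization is given below; module and dimension names are assumptions, the MLP stands in for the residual networks detailed in the Appendix, and this is not the released code.

```python
# Softmax parameterization of p(B | H) and p(H | A, w) over shared embeddings; illustrative only.
import torch, torch.nn as nn

class RuleParam(nn.Module):
    def __init__(self, n_latent, n_child, n_nt, vocab, dim=256):
        super().__init__()
        self.u_h = nn.Parameter(torch.randn(n_latent, dim))      # latent-state embeddings u_H
        self.w_child = nn.Parameter(torch.randn(n_child, dim))   # child-symbol embeddings w_B
        self.w_nt = nn.Embedding(n_nt, dim)                      # nonterminal embeddings w_A
        self.w_word = nn.Embedding(vocab, dim)                   # word embeddings w_w
        # plain MLP in place of the residual network f_4 used in the paper
        self.f4 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def p_b_given_h(self):
        # p(B | H): softmax over child symbols for each latent state
        return torch.softmax(self.u_h @ self.w_child.t(), dim=-1)      # [n_latent, n_child]

    def p_h_given_a_w(self, a_ids, w_ids):
        # p(H | A, w): score each latent state against f_4([w_A; w_w]) and normalize over H
        x = self.f4(torch.cat([self.w_nt(a_ids), self.w_word(w_ids)], dim=-1))
        return torch.softmax(x @ self.u_h.t(), dim=-1)                  # [batch, n_latent]

params = RuleParam(n_latent=300, n_child=45, n_nt=15, vocab=10000)
probs = params.p_h_given_a_w(torch.tensor([0, 3]), torch.tensor([17, 42]))   # [2, 300]
```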
4 Experimental setup

4.1 Dataset

We conduct experiments on the Wall Street Journal (WSJ) corpus of the Penn Treebank (Marcus et al., 1994). We use the same preprocessing pipeline as in Kim et al. (2019). Specifically, punctuation is removed from all data splits and the top 10,000 frequent words in the training data are used as the vocabulary. For dependency grammar induction, we follow Zhu et al. (2020) and use the Stanford typed dependency representation (de Marneffe and Manning, 2008).

4.2 Hyperparameters

We optimize our model using the Adam optimizer with β_1 = 0.75, β_2 = 0.999, and learning rate 0.001. All parameters are initialized with Xavier uniform initialization. We set the dimension of all embeddings to 256 and the ratio of the nonterminal number to the preterminal number to 1:2. Our best model uses 15 nonterminals, 30 preterminals, and d_H = 300. We use grid search to tune the nonterminal number (from 5 to 30) and the domain size d_H of the latent H (from 50 to 500).

4.3 Evaluation

We run each model four times with different random seeds and for ten epochs. We train our models on training sentences of length ≤ 40 with batch size 8 and test them on the whole test set. For each run, we perform early stopping and select the best model according to the perplexity of the development set. We use two different parsing methods: a variant of the CYK algorithm (Eisner and Satta, 1999) and Minimum Bayes-Risk (MBR) decoding (Smith and Eisner, 2006).² For constituency grammar induction, we report the means and standard deviations of sentence-level F1 scores.³ For dependency grammar induction, we report the unlabeled directed attachment score (UDAS) and the unlabeled undirected attachment score (UUAS).

² In MBR decoding, we use automatic differentiation (Eisner, 2016; Rush, 2020) to estimate the marginals of spans and arcs, and then use the CYK and Eisner algorithms for constituency and dependency parsing, respectively.
³ Following Kim et al. (2019), we remove all trivial spans (single-word spans and sentence-level spans). Sentence-level means that we compute F1 for each sentence and then average over all sentences.
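Footnote 2 rests on the fact that span marginals are gradients of the log-partition function with respect to span log-potentials. The toy example below illustrates this idea for unlabeled binary trees; it is only a sketch, not the decoding code used in the experiments.

```python
# Span marginals via back-propagation through a tiny inside algorithm (log-space).
# Each tree's log-weight is the sum of the log-potentials of its spans.
import torch

n = 4
log_phi = torch.randn(n + 1, n + 1, requires_grad=True)    # log-potential of span (i, j)

inside = [[None] * (n + 1) for _ in range(n + 1)]
for i in range(n):
    inside[i][i + 1] = log_phi[i, i + 1]
for width in range(2, n + 1):
    for i in range(n - width + 1):
        j = i + width
        splits = torch.stack([inside[i][k] + inside[k][j] for k in range(i + 1, j)])
        inside[i][j] = log_phi[i, j] + torch.logsumexp(splits, dim=0)

log_z = inside[0][n]            # log-partition over all binary trees of the sentence
log_z.backward()
span_marginals = log_phi.grad   # probability that a sampled tree contains span (i, j)
```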
5 Main result

We present our main results in Table 2. Our model is referred to as Neural Bi-Lexicalized PCFG (NBL-PCFG). We mainly compare our approach against recent PCFG-based models: neural PCFG (N-PCFG) and compound PCFG (C-PCFG) (Kim et al., 2019), tensor-decomposition-based neural PCFG (TN-PCFG) (Yang et al., 2021), and neural L-PCFG (NL-PCFG) (Zhu et al., 2020). We report both the official results of Zhu et al. (2020) and our reimplementation.

    Model              F1           UDAS         UUAS
    Official results
    N-PCFG⋆            50.8
    C-PCFG⋆            55.2
    NL-PCFG⋆           55.3         39.7         53.3
    TN-PCFG†           57.7
    Our results
    NL-PCFG⋆           53.3±2.1     23.8±1.1     47.4±1.0
    NL-PCFG†           57.4±1.4     25.3±1.3     47.2±0.7
    NBL-PCFG⋆          58.2±1.5     37.1±2.8     54.6±1.3
    NBL-PCFG†          60.4±1.6     39.1±2.8     56.1±1.3
    For reference
    S-DIORA            57.6
    StructFormer       54.0         46.2         61.6
    Oracle Trees       84.3

Table 2: Unlabeled sentence-level F1 scores, unlabeled directed attachment scores (UDAS) and unlabeled undirected attachment scores (UUAS) on the WSJ test data. † indicates MBR decoding; ⋆ indicates CYK decoding. Recall that the official result of Zhu et al. (2020) uses compound parameterization while our reimplementation removes the compound parameterization. S-DIORA: Drozdov et al. (2020). StructFormer: Shen et al. (2020).

We do not use the compound trick (Kim et al., 2019) in our implementations of lexicalized PCFGs because we empirically find that using it results in unstable training and does not necessarily bring performance improvements.

We draw three key observations: (1) Our model achieves the best F1 and UUAS scores under both CYK and MBR decoding. It is also comparable to the official NL-PCFG in the UDAS score. (2) When we remove the compound parameterization from NL-PCFG, its F1 score drops slightly while its UDAS and UUAS scores drop dramatically. This implies that compound parameterization is the key to achieving excellent dependency grammar induction performance in NL-PCFG. (3) MBR decoding outperforms CYK decoding.

Regarding UDAS, our model significantly outperforms NL-PCFGs if compound parameterization is not used (37.1 vs. 23.8 with CYK decoding), showing that explicitly modeling bilexical relationships is helpful in dependency grammar induction. However, when compound parameterization is used, the UDAS of NL-PCFGs is greatly improved, slightly surpassing that of our model. We believe this is because compound parameterization greatly weakens the independence assumption of NL-PCFGs (i.e., the child word is dependent on C only) by leaking bilexical information via the global sentence embedding. On the other hand, NBL-PCFGs are already expressive enough, and thus compound parameterization brings no further increase of their expressiveness but makes learning more difficult.

6 Analysis

In the following experiments, we report results using MBR decoding by default. We also use d_H = 300 by default unless otherwise specified.

6.1 Influence of the domain size of H

d_H (the domain size of H) influences the expressiveness of our model. Figure 2a illustrates perplexities and F1 scores with the increase of d_H and a fixed nonterminal number of 10 (plots of UDAS and UUAS can be found in the Appendix). We can see that when d_H is small, the model has a high perplexity and a low F1 score, indicating the limited expressiveness of NBL-PCFGs. When d_H is larger than 300, the perplexity plateaus and the F1 score starts to decrease, possibly because of overfitting.

6.2 Influence of nonterminal number

Figure 2b illustrates perplexities and F1 scores with the increase of the nonterminal number and fixed d_H = 300 (plots of UDAS and UUAS can be found in the Appendix). We observe that increasing the nonterminal number has only a minor influence on NBL-PCFGs. We speculate that this is because the number of word-annotated nonterminals (m|Σ|) is already sufficiently large even if m is small. On the other hand, the nonterminal number has a big influence on NL-PCFGs. This is most likely because NL-PCFGs make the independence assumption that the generation of w_q is solely determined by the non-head-child C and thus require more nonterminals so that C has the capacity of conveying information from A, B, D and w_p. Using more nonterminals (> 30) seems to be helpful for NL-PCFGs, but would be computationally too expensive due to the quadratically increased complexity in the number of nonterminals.
6.3 Influence of different variable bindings

Table 3 presents the results of our models with the following bindings:

• D-alone: D is generated alone.
• D-wq: D is generated with w_q.
• D-B: D is generated with the head-child B.
• D-C: D is generated with the non-head-child C.

               F1      UDAS    UUAS    Perplexity
    D-C        60.4    39.1    56.1    161.9
    D-alone    57.2    32.8    54.1    164.8
    D-wq       47.7    45.7    58.6    176.8
    D-B        47.8    36.9    54.0    169.6

Table 3: Binding the head direction D with different variables.

Clearly, binding D and C (the default setting for NBL-PCFG) results in the lowest perplexity and the highest F1 score. Binding D and w_q has a surprisingly good performance in unsupervised dependency parsing.

We find that how to bind the head direction has a huge impact on the unsupervised parsing performance, and we give the following intuition. Usually, given a headword and its type, the children generated in each direction would be different. So D is intuitively more related to w_q and C than to B; on the other hand, B depends more on the headword. In Table 3 we can see that D-B has a lower UDAS score than D-C and D-wq, which is consistent with this intuition. Notably, in Zhu et al. (2020), their Factorization III has a significantly lower UDAS than the default model (35.5 vs. 25.9), and the only difference is whether the generation of C is dependent on the head direction. This is also consistent with our intuition.

6.4 Qualitative analysis

We analyze the parsing performance of different PCFG extensions by breaking down their recall numbers by constituent label (see Table 4). NPs and VPs cover most of the gold constituents in the WSJ test set. TN-PCFGs have the best performance in predicting NPs, and NBL-PCFGs have better performance in predicting the other labels on average.

                 N-PCFG†   C-PCFG†   TN-PCFG†   NL-PCFG   NBL-PCFG
    NP           72.3%     73.6%     75.4%      74.0%     66.2%
    VP           28.1%     45.0%     48.4%      44.3%     61.1%
    PP           73.0%     71.4%     67.0%      68.4%     77.7%
    SBAR         53.6%     54.8%     50.3%      49.4%     63.8%
    ADJP         40.8%     44.3%     53.6%      55.5%     59.7%
    ADVP         43.8%     61.6%     59.5%      57.1%     59.1%
    Perplexity   254.3     196.3     207.3      181.2     161.9

Table 4: Recall on six frequent constituent labels and perplexities on the WSJ test data. † means that the results are reported by Yang et al. (2021).

We further analyze the quality of our induced trees. Our model prefers to predict left-headed constituents (i.e., constituents headed by the leftmost word). VPs are usually left-headed in English, so our model has a much higher recall on VPs and correctly predicts their headwords. SBARs often start with "which" and "that", and PPs often start with prepositions such as "of" and "for". Our model often relies on these words to predict the correct constituents and hence erroneously predicts these words as the headwords, which hurts the dependency accuracy. For NPs, we find our model often makes mistakes in predicting adjective-noun phrases. For example, the correct parse of "a rough market" is (a (rough market)), but our model predicts ((a rough) market) instead.

7 Discussion on dependency annotation schemes

What should be regarded as the headword is still debatable in linguistics, especially around function words (Zwicky, 1993). For example, in the phrase "the company", some linguists argue that "the" should be the headword (Abney, 1972). These disagreements are reflected in the dependency annotation schemes. Researchers have found that different dependency annotation schemes result in very different evaluation scores of unsupervised dependency parsing (Noji, 2016; Shen et al., 2020). In our experiments, we use the Stanford Dependencies annotation scheme in order to compare with NL-PCFGs. Stanford Dependencies prefers to select content words as headwords. However, as we discussed in previous sections, our model prefers to select function words (e.g., of, which, for) as headwords for SBARs or PPs. This explains why our model can outperform all the baselines on constituency parsing but not on dependency parsing (as judged by Stanford Dependencies) at the same time. Table 3 shows that there is a trade-off between the F1 score and UDAS, which suggests that adapting our model to Stanford Dependencies would hurt its ability to identify constituents.

8 Speed comparison

In practice, the forward and backward pass of the inside algorithm consumes the majority of the running time in training an N(B)L-PCFG. The existing implementation by Zhu et al. (2020)⁴ does not employ efficient parallelization and has a cubic time complexity in the number of nonterminals.

⁴ https://github.com/neulab/neural-lpcfg
We provide an efficient reimplementation (we follow Zhang et al. (2020) to batchify) of the inside algorithm based on Equation 6. We refer to the implementation which caches Term C1-1 as re-impl-1 and to the implementation which caches Term C1-2 as re-impl-2.

We measure the time of a single forward and backward pass of the inside algorithm with batch size 1 on a single Titan V GPU. Figure 3a illustrates the time with the increase of the sentence length and a fixed nonterminal number of 10. The original implementation of NL-PCFG by Zhu et al. (2020) takes much more time when sentences are long. For example, when the sentence length is 40, it needs 6.80s, while our fast implementation takes 0.43s and our NBL-PCFG takes only 0.30s. Figure 3b illustrates the time with the increase of the nonterminal number m and a fixed sentence length of 30. The original implementation runs out of 12GB memory when m = 30. re-impl-2 is faster than re-impl-1 when increasing m as it has a better time complexity in m (quadratic for re-impl-2, cubic for re-impl-1). Our NBL-PCFGs have a linear complexity in m, and as we can see in the figure, our NBL-PCFGs are much faster when m is large.

[Figure 2: The change of F1 scores and perplexities with the change of |H| and the nonterminal number.]

[Figure 3: Total time for performing the inside algorithm and automatic differentiation with different sentence lengths and nonterminal numbers.]
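A rough sketch of such a timing setup is given below, under the assumptions that a CUDA device is available and that `inside` is any differentiable implementation returning a scalar sentence log-likelihood; it is not the benchmark script used for Figure 3.

```python
# Timing a single forward + backward pass of an inside implementation; illustrative only.
import time, torch

def time_inside(inside, sentence, n_warmup=3, n_runs=10):
    for _ in range(n_warmup):                      # warm-up runs to exclude one-off setup costs
        inside(sentence).backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        loss = inside(sentence)                    # forward pass of the inside algorithm
        loss.backward()                            # backward pass via automatic differentiation
    if torch.cuda.is_available():
        torch.cuda.synchronize()                   # wait for GPU work before stopping the clock
    return (time.perf_counter() - start) / n_runs  # gradient buffers are ignored in this sketch
```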
9 Related Work

Unsupervised parsing has a long history but has regained great attention in recent years. In unsupervised dependency parsing, most methods are based on the Dependency Model with Valence (DMV) (Klein and Manning, 2004). Neurally parameterized DMVs have obtained state-of-the-art performance (Jiang et al., 2016; Han et al., 2017, 2019; Yang et al., 2020). However, they rely on gold POS tags and sophisticated initializations (e.g., K&M initialization or initialization with the parsing result of another unsupervised model). Noji et al. (2016) propose a left-corner parsing-based DMV model to limit the stack depth of center-embedding, which is insensitive to initialization but needs gold POS tags. He et al. (2018) propose a latent-variable based DMV model, which does not need gold POS tags but requires good initialization and high-quality induced POS tags. See Han et al. (2020) for a survey of unsupervised dependency parsing. Compared to these methods, our method does not require gold/induced POS tags or sophisticated initializations, though its performance lags behind some of these previous methods.

Recent unsupervised constituency parsers can be roughly categorized into the following groups: (1) PCFG-based methods. Depth-bounded PCFGs (Jin et al., 2018a,b) limit the stack depth of center-embedding.
Neurally parameterized PCFGs (Jin et al., 2019; Kim et al., 2019; Zhu et al., 2020; Yang et al., 2021) use neural networks to produce grammar rule probabilities. (2) Deep Inside-Outside Recursive Auto-encoder (DIORA) based methods (Drozdov et al., 2019a,b, 2020; Hong et al., 2020; Sahay et al., 2021). They use neural networks to mimic the inside-outside algorithm and are trained with masked language model objectives. (3) Syntactic distance-based methods (Shen et al., 2018, 2019, 2020). They encode hidden syntactic trees into syntactic distances and inject them into language models. (4) Probing-based methods (Kim et al., 2020; Li et al., 2020). They extract phrase-structure trees based on the attention distributions of large pre-trained language models. In addition to these methods, Cao et al. (2020) use constituency tests and Shi et al. (2021) make use of naturally-occurring bracketings such as hyperlinks on webpages to train parsers. Multimodal information such as images (Shi et al., 2019; Zhao and Titov, 2020; Jin and Schuler, 2020) and videos (Zhang et al., 2021) has also been exploited for unsupervised constituency parsing.

We are only aware of a few previous studies in unsupervised joint dependency and constituency parsing. Klein and Manning (2004) propose a joint DMV and CCM (Klein and Manning, 2002) model. Shen et al. (2020) propose a transformer-based method, in which they define syntactic distances to guide the attention of transformers. Zhu et al. (2020) propose neural L-PCFGs for unsupervised joint parsing.

10 Conclusion

We have presented a new formalism of lexicalized PCFGs. Our formalism relies on the canonical polyadic decomposition to factorize the probability tensor of binary rules. The factorization reduces the space and time complexity of lexicalized PCFGs while keeping the independence assumptions encoded in the original binary rules intact. We further parameterize our model using neural networks and present an efficient implementation of our model. On the English WSJ test data, our model achieves the lowest perplexity, outperforms all the existing extensions of PCFGs in constituency grammar induction, and is comparable to strong baselines in dependency grammar induction.

Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (61976139).
References

Steven P. Abney. 1972. The English noun phrase in its sentential aspect.
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Unsupervised parsing via constituency tests. In EMNLP 2020.
Eugene Charniak. 1996. Tree-bank grammars. Technical report.
Justin Chiu and Alexander Rush. 2020. Scaling hidden Markov language models. In EMNLP 2020.
Shay B. Cohen, Giorgio Satta, and Michael Collins. 2013. Approximate PCFG parsing using tensor decomposition. In NAACL-HLT 2013.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In ACL 1997.
Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
Andrew Drozdov, Subendhu Rongali, Yi-Pei Chen, Tim O'Gorman, Mohit Iyyer, and Andrew McCallum. 2020. Unsupervised parsing with S-DIORA: Single tree encoding for deep inside-outside recursive autoencoders. In EMNLP 2020.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019a. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In NAACL-HLT 2019.
Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019b. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In NAACL-HLT 2019.
Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Workshop on Structured Prediction for NLP.
Jason Eisner and John Blatz. 2007. Program transformations for optimization of parsing algorithms and other weighted logic programs. In FG 2006.
Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head automaton grammars. In ACL 1999.
Wenjuan Han, Yong Jiang, Hwee Tou Ng, and Kewei Tu. 2020. A survey of unsupervised dependency parsing. In COLING 2020.
Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. In EMNLP 2017.
Wenjuan Han, Yong Jiang, and Kewei Tu. 2019. Enhancing unsupervised generative dependency parser with contextual information. In ACL 2019.
Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Unsupervised learning of syntactic structure with invertible neural projections. In EMNLP 2018.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR 2016.
Ruyue Hong, Jiong Cai, and Kewei Tu. 2020. Deep inside-outside recursive autoencoder with all-span objective. In COLING 2020.
F. Jelinek, J. D. Lafferty, and R. L. Mercer. 1992. Basic methods of probabilistic context free grammars. In Speech Recognition and Understanding. Springer.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2016. Unsupervised neural dependency parsing. In EMNLP 2016.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018a. Depth-bounding is effective: Improvements and evaluation of unsupervised PCFG induction. In EMNLP 2018.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, William Schuler, and Lane Schwartz. 2018b. Unsupervised grammar induction with depth-bounded PCFG. Transactions of the Association for Computational Linguistics, 6:211–224.
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, and William Schuler. 2019. Unsupervised learning of PCFGs with normalizing flow. In ACL 2019.
Lifeng Jin and William Schuler. 2020. Grounded PCFG induction with images. In AACL-IJCNLP 2020.
Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.
Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In ICLR 2020.
Yoon Kim, Chris Dyer, and Alexander Rush. 2019. Compound probabilistic context-free grammars for grammar induction. In ACL 2019.
Dan Klein and Christopher Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL 2004.
Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In ACL 2002.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL 2003.
Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35–56.
Bowen Li, Taeuk Kim, Reinald Kim Amplayo, and Frank Keller. 2020. Heads-up! Unsupervised constituency parsing via self-attention heads. In AACL-IJCNLP 2020.
David M. Magerman. 1995. Statistical decision-tree models for parsing. In ACL 1995.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology Workshop 1994.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Coling 2008 Workshop on Cross-Framework and Cross-Domain Parser Evaluation.
Hiroshi Noji. 2016. Left-corner methods for syntactic modeling with universal structural constraints. CoRR, abs/1608.00293.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson. 2016. Using left-corner parsing to encode universal structural constraints in grammar induction. In EMNLP 2016.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Alexander Rush. 2020. Torch-struct: Deep structured prediction library. In ACL 2020 System Demonstrations.
Atul Sahay, Anshul Nasery, Ayush Maheshwari, Ganesh Ramakrishnan, and Rishabh Iyer. 2021. Rule augmented unsupervised constituency parsing.
Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron C. Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In ICLR 2018.
Yikang Shen, Shawn Tan, Seyedarian Hosseini, Zhouhan Lin, Alessandro Sordoni, and Aaron C. Courville. 2019. Ordered memory. CoRR, abs/1910.13466.
Yikang Shen, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, and Aaron Courville. 2020. StructFormer: Joint unsupervised induction of dependency and constituency structure from masked language modeling.
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu. 2019. Visually grounded neural syntax acquisition. In ACL 2019.
Tianze Shi, Ozan İrsoy, Igor Malioutov, and Lillian Lee. 2021. Learning syntax from naturally-occurring bracketings. In NAACL-HLT 2021.
David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In COLING/ACL 2006 Poster Sessions.
Songlin Yang, Yong Jiang, Wenjuan Han, and Kewei Tu. 2020. Second-order unsupervised neural dependency parsing. In COLING 2020.
Songlin Yang, Yanpeng Zhao, and Kewei Tu. 2021. PCFGs can do better: Inducing probabilistic context-free grammars with many symbols. In NAACL-HLT 2021.
Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, and Jiebo Luo. 2021. Video-aided unsupervised grammar induction. In NAACL-HLT 2021.
Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In IJCAI 2020.
Yanpeng Zhao and Ivan Titov. 2020. Visually grounded compound PCFGs. In EMNLP 2020.
Hao Zhu, Yonatan Bisk, and Graham Neubig. 2020. The return of lexical dependencies: Neural lexicalized PCFGs. Transactions of the Association for Computational Linguistics, 8:647–661.
A. Zwicky. 1993. Heads in grammatical theory: Heads, bases and functors.
A Full Parameterization

We give the full parameterizations of the following probability distributions:

    p(A | S) = exp(u_S^⊤ h_1(w_A)) / Σ_{A' ∈ N} exp(u_S^⊤ h_1(w_{A'}))
    p(w | A) = exp(u_A^⊤ h_2(w_w)) / Σ_{w' ∈ Σ} exp(u_A^⊤ h_2(w_{w'}))
    p(w | H) = exp(u_H^⊤ h_3(w_w)) / Σ_{w' ∈ Σ} exp(u_H^⊤ h_3(w_{w'}))
    p(H | A, w) = exp(u_H^⊤ f([w_A; w_w])) / Σ_{H' ∈ H} exp(u_{H'}^⊤ f([w_A; w_w]))

    h_i(x) = g_{i,1}(g_{i,2}(W_i x))
    g_{i,j}(y) = ReLU(V_{i,j} ReLU(U_{i,j} y)) + y
    f([x; y]) = h_4(ReLU(W [x; y]) + y)

B Influence of the domain size of H and the number of nonterminals

Figure 4 illustrates the change of UUAS and UDAS with the increase of d_H. We find similar tendencies compared to the change of F1 scores and perplexities with the increase of d_H; d_H = 300 performs best. Figure 5 illustrates the change of UUAS and UDAS when increasing the number of nonterminals. We can see that NL-PCFGs benefit from using more nonterminals, while NBL-PCFGs have a better performance when the number of nonterminals is relatively small.

[Figure 4: Influence of d_H on UUAS and UDAS.]

[Figure 5: Influence of the number of nonterminals on UUAS and UDAS.]
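A sketch of the residual blocks h_i and g_{i,j} defined in Appendix A is given below; nn.Linear layers (which include biases) stand in for the weight matrices and the dimension is an assumption, so this is only an illustration rather than the released implementation.

```python
# Residual blocks following the form of h_i, g_{i,j}, and f in Appendix A; illustrative only.
import torch, torch.nn as nn

class ResLayer(nn.Module):               # g_{i,j}(y) = ReLU(V ReLU(U y)) + y
    def __init__(self, dim):
        super().__init__()
        self.u, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim)
    def forward(self, y):
        return torch.relu(self.v(torch.relu(self.u(y)))) + y

class HBlock(nn.Module):                 # h_i(x) = g_{i,1}(g_{i,2}(W_i x))
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)
        self.g1, self.g2 = ResLayer(dim), ResLayer(dim)
    def forward(self, x):
        return self.g1(self.g2(self.w(x)))

class FBlock(nn.Module):                 # f([x; y]) = h_4(ReLU(W [x; y]) + y)
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(2 * dim, dim)
        self.h4 = HBlock(dim)
    def forward(self, x, y):
        return self.h4(torch.relu(self.w(torch.cat([x, y], dim=-1))) + y)
```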