Investigating Non-local Features for Neural Constituency Parsing
Leyang Cui∗ (Zhejiang University; School of Engineering, Westlake University), Sen Yang∗ (School of Engineering, Westlake University), Yue Zhang† (School of Engineering, Westlake University; Institute of Advanced Technology, Westlake Institute for Advanced Study)
{cuileyang, zhangyue}@westlake.edu.cn, senyang.stu@gmail.com
∗ The first two authors contributed equally to this work. † Corresponding author.
arXiv:2109.12814v2 [cs.CL] 28 Mar 2022

Abstract

Thanks to the strong representation power of neural encoders, neural chart-based parsers have achieved highly competitive performance by using local features. Recently, it has been shown that non-local features in CRF structures lead to improvements. In this paper, we investigate injecting non-local features into the training process of a local span-based parser, by predicting constituent n-gram non-local patterns and ensuring consistency between non-local patterns and local constituents. Results show that our simple method gives better results than the self-attentive parser on both PTB and CTB. Besides, our method achieves state-of-the-art BERT-based performance on PTB (95.92 F1) and strong performance on CTB (92.31 F1). Our parser also achieves better or competitive performance in multilingual and zero-shot cross-domain settings compared with the baseline.

1 Introduction

Constituency parsing is a fundamental task in natural language processing, which provides useful information for downstream tasks such as machine translation (Wang et al., 2018), natural language inference (Chen et al., 2017) and text summarization (Xu and Durrett, 2019). Over the recent years, with advances in deep learning and pre-training, neural chart-based constituency parsers (Stern et al., 2017a; Kitaev and Klein, 2018) have achieved highly competitive results on benchmarks like the Penn Treebank (PTB) and the Penn Chinese Treebank (CTB) by solely using local span prediction.

[Figure 1: An example of the non-local n-gram pattern features: the 3-gram pattern (3, 11, {VBD NP PP}) is composed of two constituent nodes and one part-of-speech node; the 2-gram pattern (7, 11, {NP PP}) is composed of two constituent nodes.]

The above methods take the contextualized representation (e.g., BERT) of a text span as input, and use a local classifier network to calculate the scores of the span being a syntactic constituent, together with its constituent label. For testing, the output layer uses a non-parametric dynamic programming algorithm (e.g., CKY) to find the highest-scoring tree. Without explicitly modeling structured dependencies between different constituents, these methods give competitive results compared to non-local discrete parsers (Stern et al., 2017a; Kitaev and Klein, 2018). One possible explanation for their strong performance is that the powerful neural encoders are capable of capturing implicit output correlations of the tree structure (Stern et al., 2017a; Gaddy et al., 2018; Teng and Zhang, 2018).

Recent work has shown that modeling non-local output dependencies can benefit neural structured prediction tasks, such as NER (Ma and Hovy, 2016), CCG supertagging (Cui and Zhang, 2019) and dependency parsing (Zhang et al., 2020a). Thus, an interesting research question is whether injecting non-local tree structure features is also beneficial to neural chart-based constituency parsing. To this end, we introduce two auxiliary training objectives. The first is Pattern Prediction. As shown in Figure 1, we define a pattern as the n-gram constituents sharing the same parent.¹ We ask the model to predict the pattern based on its span representation, which directly injects the non-local

¹ Patterns are mainly composed of n-gram constituents but also include part-of-speech tags as auxiliary.
constituent tree structure into the encoder.

To allow stronger interaction between non-local patterns and local constituents, we further propose a Consistency loss, which regularizes the co-occurrence between constituents and patterns by collecting corpus-level statistics. In particular, we count whether the constituents can be a sub-tree of the pattern based on the training set. For instance, both NNS and NP are legal to occur as sub-trees of the 3-gram pattern {VBD NP PP} in Figure 1, while S or ADJP cannot be contained within this pattern based on grammar rules. Similarly, for the 2-gram pattern {NP PP} highlighted in Figure 1, both IN and NP are consistent constituents, but JJ is not. The Consistency loss can be considered as injecting prior linguistic knowledge into our model, which forces the encoder to understand the grammar rules. Non-local dependencies among the constituents that share the same pattern are thus explicitly modeled. We denote our model as Injecting Non-local Features for neural Chart-based parsers (NFC).

We conduct experiments on both PTB and CTB. Equipped with BERT, NFC achieves 95.92 F1 on the PTB test set, which is the best reported performance for BERT-based single-model parsers. For Chinese constituency parsing, NFC achieves highly competitive results (92.31 F1) on CTB, outperforming the baseline self-attentive parser (91.98 F1) and a 0-th order neural CRF parser (92.27 F1) (Zhang et al., 2020b). To further test the generalization ability, we annotate a multi-domain test set in English, including dialogue, forum, law, literature and review domains. Experiments demonstrate that NFC is robust in zero-shot cross-domain settings. Finally, NFC also performs competitively on other languages, using the SPMRL 2013/2014 shared tasks, establishing the best reported results on three rich-resource languages. We release our code and models at https://github.com/RingoS/nfc-parser.

2 Related Work

Constituency Parsing. There are mainly two lines of approaches for constituency parsing. Transition-based methods process the input words sequentially and construct the output constituency tree incrementally by predicting a series of local transition actions (Zhang and Clark, 2009; Cross and Huang, 2016; Liu and Zhang, 2017). For these methods, the sequence of transition actions makes a traversal over a constituent tree. Although transition-based methods directly model partial tree structures, their local decision nature may lead to error propagation (Goldberg and Nivre, 2013) and worse performance compared with methods that model long-term dependencies (McDonald and Nivre, 2011; Zhang and Nivre, 2012). Similar to transition-based methods, NFC also directly models partial tree structures. The difference is that we inject tree structure information using two additional loss functions. Thus, our integration of non-local constituent features is implicit in the encoder, rather than explicit in the decoding process. While the relative effectiveness is empirical, it could potentially alleviate error propagation.

Chart-based methods score each span independently and perform global search over all possible trees to find the highest-scoring tree given a sentence. Durrett and Klein (2015) added nonlinear features, computed with a feed-forward neural network, to a traditional CRF parser. Stern et al. (2017b) first used an LSTM to represent span features. Kitaev and Klein (2018) adopted a self-attentive encoder instead of the LSTM encoder to boost parser performance. Mrini et al. (2020) proposed label attention layers to replace self-attention layers. Zhou and Zhao (2019) integrated constituency and dependency structures into head-driven phrase structure grammar. Tian et al. (2020) used span attention to produce span representations, replacing the subtraction of the hidden states at the span boundaries. Despite their success, the above work mainly focuses on how to better encode features over the input sentence. In contrast, we take the encoder of Kitaev and Klein (2018) intact, being the first to explore new ways to introduce non-local training signals into local neural chart-based parsers.

Modeling Label Dependency. There is a line of work focusing on modeling non-local output dependencies. Zhang and Zhang (2010) used a Bayesian network to encode the label dependency in multi-label learning. For neural sequence labeling, Zhou and Xu (2015) and Ma and Hovy (2016) built a CRF layer on top of neural encoders to capture label transition patterns. Pislar and Rei (2020) introduced a sentence-level constraint to encourage the model to generate coherent NER predictions. Cui and Zhang (2019) investigated a label attention network to model the label dependency by producing label distributions in sequence labeling tasks. Gui et al. (2020) proposed a two-stage label decoding framework based on Bayesian network to
model long-term label dependencies. For syntactic parsing, Zhang et al. (2020b) demonstrated that a structured tree CRF can boost parsing performance over a graph-based dependency parser. Our work is in line with these in the sense that we consider non-local structure information for neural structure prediction. To our knowledge, we are the first to inject sub-tree structures into neural chart-based encoders for constituency parsing.

[Figure 2: The three training objectives in NFC: the constituency score s(i, j, ·), the pattern score p̂_{i,j} and the consistency matrix, all computed from the span representation v_{i,j} = h_j − h_i produced by the sequence encoder.]

3 Baseline

Our baseline is adopted from the parsing model of Kitaev and Klein (2018) and Kitaev et al. (2019). Given a sentence X = {x_1, ..., x_n}, its corresponding constituency parse tree T is composed of a set of labeled spans

    T = {(i_t, j_t, l_t^c)}_{t=1}^{|T|}    (1)

where i_t and j_t represent the t-th constituent span's fencepost positions and l_t^c represents the constituent label. The model assigns a score s(T) to tree T, which can be decomposed as

    s(T) = Σ_{(i,j,l^c) ∈ T} s(i, j, l^c)    (2)

Following Kitaev et al. (2019), we use BERT with a self-attentive encoder as the scoring function s(i, j, ·), and a chart decoder to perform a global-optimal search over all possible trees to find the highest-scoring tree given the sentence. In particular, given an input sentence X = {x_1, ..., x_n}, a list of hidden representations H_1^n = {h_1, h_2, ..., h_n} is produced by the encoder, where h_i is a hidden representation of the input token x_i. Following previous work, the representation of a span (i, j) is constructed by:

    v_{i,j} = h_j − h_i    (3)

Finally, v_{i,j} is fed into an MLP to produce real-valued scores s(i, j, ·) for all constituency labels:

    s(i, j, ·) = W_2^c ReLU(W_1^c v_{i,j} + b_1^c) + b_2^c    (4)

where W_1^c, W_2^c, b_1^c and b_2^c are trainable parameters; W_2^c ∈ R^{|H|×|L^c|} can be considered as the constituency label embedding matrix (Cui and Zhang, 2019), where each column in W_2^c corresponds to the embedding of a particular constituent label. |H| represents the hidden dimension and |L^c| is the size of the constituency label set.

Training. The model is trained to satisfy the margin-based constraints

    s(T*) ≥ s(T) + Δ(T, T*)    (5)

where T* denotes the gold parse tree and Δ is the Hamming loss. The hinge loss can be written as

    L_cons = max(0, max_{T ≠ T*} [s(T) + Δ(T, T*)] − s(T*))    (6)

During inference, the optimal tree

    T̂ = argmax_T s(T)    (7)

is obtained using a CKY-like algorithm.

4 Additional Training Objectives

We propose two auxiliary training objectives to inject non-local features into the encoder, which rely only on the annotations in the constituency treebank, but not on external resources.

4.1 Instance-level Pattern Loss

We define n-gram constituents which share the same parent node as a pattern. We use a triplet (i^p, j^p, l^p) to denote a pattern span beginning at the i^p-th word and ending at the j^p-th word; l^p is the corresponding pattern label. For the constituency parse tree in Figure 1, (3, 11, {VBD NP PP}) is a 3-gram pattern.

Similar to Eq 4, an MLP is used for transforming span representations to pattern prediction probabilities:

    p̂_{i,j} = Softmax(W_2^p ReLU(W_1^p v_{i,j} + b_1^p) + b_2^p)    (8)

where W_1^p, W_2^p, b_1^p and b_2^p are trainable parameters; W_2^p ∈ R^{|H|×|L^p|} can be considered as the pattern label embedding matrix, where each column corresponds to the embedding of a particular pattern label and |L^p| represents the size of the pattern label set.
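To make the pattern definition concrete, the sketch below enumerates the n-gram patterns of a toy tree: every internal node with n ≥ 2 children yields one triple (i, j, child labels), matching the triples of Figure 1. This is only an illustrative reading of Section 4.1, not the authors' code; the nested-tuple tree encoding and the function name are hypothetical.

```python
# Hypothetical sketch: extracting n-gram "patterns" (sibling constituents
# sharing one parent, Section 4.1) from a constituency tree.
# An internal node is (label, [children]); a leaf is (POS tag, word).

def extract_patterns(node, start=0, patterns=None):
    """Return (end_position, patterns); each pattern is a triple
    (i, j, child_labels) covering the children of one internal node."""
    if patterns is None:
        patterns = []
    label, children = node
    if isinstance(children, str):          # leaf: (POS tag, word)
        return start + 1, patterns
    pos = start
    child_labels = []
    for child in children:
        child_labels.append(child[0])
        pos, _ = extract_patterns(child, pos, patterns)
    if len(child_labels) >= 2:             # an n-gram pattern, n >= 2
        patterns.append((start, pos, tuple(child_labels)))
    return pos, patterns

# Toy tree for "She ate rice": (S (NP (PRP She)) (VP (VBD ate) (NP (NN rice))))
tree = ("S", [("NP", [("PRP", "She")]),
              ("VP", [("VBD", "ate"), ("NP", [("NN", "rice")])])])
_, pats = extract_patterns(tree)
print(pats)   # [(1, 3, ('VBD', 'NP')), (0, 3, ('NP', 'VP'))]
```

Note that, as in Figure 1, a pattern may mix constituent labels and part-of-speech tags, since a child of an internal node can itself be a pre-terminal.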
For each instance, the cross-entropy loss between the predicted patterns and the gold patterns is calculated as

    L_pat = − Σ_{i=1}^{n} Σ_{j=1}^{n} p_{i,j} log p̂_{i,j}    (9)

We use the span-level cross-entropy loss for patterns (Eq 9) instead of the margin loss in Eq 6, because our pattern-prediction objective aims to augment span representations by greedily classifying each pattern span, rather than to reconstruct the constituency parse tree through dynamic programming.

4.2 Corpus-level Consistency Loss

Constituency scores and pattern probabilities are produced based on a shared span representation; however, the two are subsequently predicted separately. Therefore, although the span representations contain both constituent and pattern information, the dependencies between constituent and pattern predictions are not explicitly modeled. Intuitively, constituents are distributed non-uniformly in patterns, and such correlation can be obtained from corpus-level statistics. We propose a consistency loss, which explicitly models the non-local dependencies among constituents that belong to the same pattern. As mentioned in the introduction, we regard all constituent spans within a pattern span as being consistent with the pattern span. Take 2-gram patterns for example, which represent two neighboring subtrees covering a text span. The constituents that belong to the two subtrees, including the top constituents and internal sub-constituents, are considered as being consistent. We consider only the constituent labels but not their corresponding span locations for this task.

This loss can be understood first at the instance level. In particular, if a constituent span (i_t, j_t, l_t^c) is a subtree of a pattern span (i_{t′}, j_{t′}, l_{t′}^p), i.e. i_t ≥ i_{t′} and j_t ≤ j_{t′}, the pair (l_t^c, l_{t′}^p) is marked as consistent. The resulting consistency matrix Y provides information regarding non-local dependencies among constituents and patterns.

An intuitive method to predict the consistency matrix Y is to make use of the constituency label embedding matrix W_2^c (see Eq 4 for definition), the pattern label embedding matrix W_2^p (see Eq 8 for definition) and the span representations V (see Eq 3 for definition):

    Ŷ = Sigmoid((W_2^{c⊤} U_1 V)(V^⊤ U_2 W_2^p))    (10)

where U_1, U_2 ∈ R^{|H|×|H|} are trainable parameters. Intuitively, the left term, W_2^{c⊤} U_1 V, integrates the representations of the pattern span and all possible constituent label embeddings. The second term, V^⊤ U_2 W_2^p, integrates features of the span and all pattern embeddings. Each binary element in the resulting Ŷ ∈ R^{|L^c|×|L^p|} denotes whether a particular constituent label is consistent with a particular pattern in the given span context. Eq 10 can be computed at the instance level for ensuring consistency between patterns and constituents. However, this naive method is difficult to train and computationally infeasible, because the span representation matrix V ∈ R^{|H|×n²} is composed of n² span representations v_{i,j} ∈ R^{|H|}, and the asymptotic complexity is

    O((|L^p| + |L^c|)(|H|² + n²|H|) + |L^p||L^c|n²)    (11)

for a single training instance.

We instead use a corpus-level constraint on the non-local dependencies among constituents and patterns. In this way, Eq 10 is reduced to be independent of individual span representations:

    Ŷ = Sigmoid(W_2^{c⊤} U W_2^p)    (12)

where U ∈ R^{|H|×|H|} is trainable. This trick decreases the asymptotic complexity to O(|L^c||H|² + |L^p||L^c||H|). The cross-entropy loss between the predicted consistency matrix and the gold consistency labels is used to optimize the model:

    L_reg = − Σ_{i=1}^{|L^c|} Σ_{j=1}^{|L^p|} [Y_{ij} log Ŷ_{ij} + (1 − Y_{ij}) log(1 − Ŷ_{ij})]    (13)
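The corpus-level gold consistency matrix can be sketched as follows: a constituent label is consistent with a pattern label if, anywhere in the training set, a constituent span with that label lies inside a span covered by that pattern. A minimal illustration, with hypothetical names and data layout (the authors' actual construction may differ in detail):

```python
# Hypothetical sketch of the corpus-level gold consistency matrix Y
# (Section 4.2): constituent label c is consistent with pattern label p
# iff some constituent span labeled c lies inside some span covered by p
# anywhere in the training treebank.

from collections import defaultdict

def gold_consistency(treebank):
    """treebank: list of (constituents, patterns) per tree, where
    constituents = [(i, j, label)] and patterns = [(i, j, pattern_label)].
    Returns a mapping: pattern label -> set of consistent constituent labels."""
    consistent = defaultdict(set)
    for constituents, patterns in treebank:
        for pi, pj, p in patterns:
            for ci, cj, c in constituents:
                if pi <= ci and cj <= pj:   # constituent span inside pattern span
                    consistent[p].add(c)
    return consistent

# Toy corpus: one tree containing the 3-gram pattern of Figure 1.
corpus = [([(3, 7, "NP"), (7, 11, "PP"), (8, 11, "NP")],
           [(3, 11, ("VBD", "NP", "PP"))])]
Y = gold_consistency(corpus)
print(sorted(Y[("VBD", "NP", "PP")]))   # ['NP', 'PP']
```

A label absent from the set (e.g. S or ADJP in the introduction's example) corresponds to a 0 entry of Y for that pattern.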
Table 1: Dataset statistics. # - number of sentences.

    Data  | Lang / Domain | # Train | # Dev | # Test
    PTB   | English       | 39,832  | 1,700 | 2,416
    CTB   | Chinese       | 17,544  | 352   | 348
    SPMRL | French        | 14,759  | 1,235 | 2,541
    SPMRL | German        | 40,472  | 5,000 | 5,000
    SPMRL | Korean        | 23,010  | 2,066 | 2,287
    SPMRL | Basque        | 7,577   | 948   | 946
    SPMRL | Polish        | 6,578   | 821   | 822
    SPMRL | Hungarian     | 8,146   | 1,051 | 1,009
    MCTB  | Dialogue      | -       | -     | 1,000
    MCTB  | Forum         | -       | -     | 1,000
    MCTB  | Law           | -       | -     | 1,000
    MCTB  | Literature    | -       | -     | 1,000
    MCTB  | Review        | -       | -     | 1,000

4.3 Training

Given a constituency tree, we minimize the sum of the three objectives to optimize the parser:

    L = L_cons + L_pat + L_reg    (14)

4.4 Computational Cost

The training parameters added by NFC are W_1^p ∈ R^{|H|×|H|}, W_2^p ∈ R^{|H|×|L^p|}, b_1^p ∈ R^{|H|} and b_2^p ∈ R^{|L^p|} in Eq 8, and U ∈ R^{|H|×|H|} in Eq 12. Taking training on PTB as an example, NFC adds less than 0.7M parameters to the 342M-parameter baseline model (Kitaev and Klein, 2018) based on BERT-large-uncased during training. NFC is identical to our baseline self-attentive parser (Kitaev and Klein, 2018) during inference.

5 Experiments

We empirically compare NFC with the baseline parser in different settings, including in-domain, cross-domain and multilingual benchmarks.

5.1 Dataset

Table 1 shows the detailed statistics of our datasets.

In-domain. We conduct experiments on both English and Chinese, using the Penn Treebank (Marcus et al., 1993) as our English dataset, with standard splits of sections 02-21 for training, section 22 for development and section 23 for testing. For Chinese, we split the Penn Chinese Treebank (CTB) 5.1 (Xue et al., 2005), taking articles 001-270 and 440-1151 as the training set, articles 301-325 as the development set and articles 271-300 as the test set.

Cross-domain. To test the robustness of our methods across different domains, we further annotate five test sets in the dialogue, forum, law, literature and review domains. For the dialogue domain, we randomly sample dialogue utterances from Wizard of Wikipedia (Dinan et al., 2019), which is a chit-chat dialogue benchmark produced by humans. For the forum domain, we use users' communication records from Reddit, crawled and released by Völske et al. (2017). For the law domain, we sample text from the European Court of Human Rights Database (Stiansen and Voeten, 2019), which includes detailed judicial decision patterns. For the literature domain, we download literary fiction from Project Gutenberg². For the review domain, we use plain text across a variety of product genres, released by the SNAP Amazon Review Dataset (He and McAuley, 2016). After obtaining the plain text, we ask annotators whose majors are linguistics to annotate constituency parse trees following the PTB guidelines. We name our dataset the Multi-domain Constituency Treebank (MCTB). More details of the dataset are documented in Yang et al. (2022).

Multi-lingual. For multilingual testing, we select three rich-resource languages from the SPMRL 2013-2014 shared task (Seddah et al., 2013): French, German and Korean, which include at least 10,000 training instances, and three low-resource languages: Hungarian, Basque and Polish.

5.2 Setup

Our code is based on the open-sourced code of Kitaev and Klein (2018)³. The training process is terminated if no improvement in development F1 is obtained in the last 60 epochs. We evaluate the models which have the best F1 on the development set. For fair comparison, all reported results and baselines are augmented with BERT. We adopt BERT-large-uncased for English, BERT-base for Chinese and BERT-multi-lingual-uncased for other languages. Most of our hyper-parameters are adopted from Kitaev and Klein (2018) and Fried et al. (2019). For the scales of the two additional losses, we set the scale of the pattern loss to 1.0 and the scale of the consistency loss to 5.0 for all experiments. To reduce the model size, we filter out those non-

² https://www.gutenberg.org/
³ Available at https://github.com/nikitakit/self-attentive-parser.
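Under the assumption that the loss scales reported in Section 5.2 multiply the two auxiliary terms of Eq 14, the combined training objective can be sketched as below; the function and constant names are illustrative only, not the authors' code.

```python
# Minimal sketch of the combined objective of Eq 14, L = L_cons + L_pat + L_reg.
# Eq 14 writes an unweighted sum; Section 5.2 reports scaling the pattern loss
# by 1.0 and the consistency loss by 5.0 in all experiments.

PATTERN_SCALE = 1.0        # scale for L_pat (Eq 9), from Section 5.2
CONSISTENCY_SCALE = 5.0    # scale for L_reg (Eq 13), from Section 5.2

def total_loss(l_cons, l_pat, l_reg):
    # hinge loss (Eq 6) + scaled pattern loss + scaled consistency loss
    return l_cons + PATTERN_SCALE * l_pat + CONSISTENCY_SCALE * l_reg

print(round(total_loss(0.8, 0.3, 0.1), 6))   # 1.6
```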
[Table 2: in-domain results (Model, LR, LP, F1); table contents not recovered in this transcription] ... improvement of 0.20% F1 on PTB (p
Table 4: Constituency parsing results with BERT (F1 scores) on the cross-domain test set. PTB is the in-domain test set; the remaining columns are cross-domain.

    Model                   | PTB   | Bio   | Dialogue | Forum | Law   | Literature | Review | Avg
    Liu and Zhang (2017)    | 95.65 | 86.33 | 85.56    | 85.42 | 91.50 | 84.84      | 83.53  | 86.20
    Kitaev and Klein (2018) | 95.72 | 86.61 | 86.30    | 86.29 | 92.08 | 86.10      | 83.88  | 86.88
    NFC                     | 95.92 | 86.43 | 89.85    | 88.52 | 95.43 | 90.75      | 88.10  | 89.85

Table 5: Multilingual experiment results on SPMRL test sets. † indicates our reproduced baselines.

    Model                    | French | German | Korean | Avg (rich) | Hungarian | Basque | Polish | Avg (low) | Avg
    Kitaev and Klein (2018)  | 87.42  | 90.20  | 88.80  | 88.81      | 94.90     | 91.63  | 96.36  | 94.30     | 91.55
    Nguyen et al. (2020)     | 86.69  | 90.28  | 88.71  | 88.56      | 94.24     | 92.02  | 96.14  | 94.13     | 91.34
    Kitaev and Klein (2018)† | 87.38  | 90.25  | 88.91  | 88.85      | 94.56     | 91.66  | 96.14  | 94.12     | 91.48
    NFC                      | 87.51  | 90.43  | 89.07  | 89.00      | 94.95     | 91.73  | 96.33  | 94.34     | 91.67

5.4 Cross-domain Experiments

We compare the generalization of our methods with the baselines in Table 4. In particular, all the parsers are trained on the PTB training set, validated on the PTB development set, and tested on the cross-domain test sets in the zero-shot setting. As shown in the table, our model achieves the best reported results on 5 of the 6 cross-domain test sets, with an average F1 score of 89.85%, outperforming our baseline parser by 2.97% points. This shows that structure information is useful for improving cross-domain performance, which is consistent with findings from previous work (Fried et al., 2019).

[Figure 3: Pearson correlation of n-gram pattern distribution between the PTB training set and different test sets.]

To better understand the benefit of pattern features, we calculate the Pearson correlation of n-gram pattern distributions between the PTB training set and various test sets in Figure 3. First, we find that the correlation between the PTB training set and the PTB test set is close to 1.0, which verifies the effectiveness of the corpus-level pattern knowledge during inference. Second, the 3-gram pattern correlation of all domains exceeds 0.75, demonstrating that n-gram pattern knowledge is robust across domains, which supports the strong performance of NFC in the zero-shot cross-domain setting. Third, pattern correlation decreases significantly as n increases, which suggests that transferable non-local information is limited to a certain window size of n-gram constituents.

5.5 Multilingual Experiments

We compare NFC with Kitaev and Klein (2018) and Nguyen et al. (2020) on SPMRL. The results are shown in Table 5. Nguyen et al. (2020) use a pointer network to predict a sequence of pointing decisions for constituency parsing. As can be seen, Nguyen et al. (2020) do not show obvious advantages over Kitaev and Klein (2018). NFC outperforms these two methods on the three rich-resource languages. For example, NFC achieves 89.07% F1 on Korean, outperforming Kitaev and Klein (2018) by 0.27% F1, suggesting that NFC is generally effective across languages. However, NFC does not give better results compared with Kitaev and Klein (2018) on low-resource languages. One possible explanation is that it is difficult to obtain prior linguistic knowledge from corpus-level statistics using a relatively small number of instances.

6 Analysis

6.1 n-gram Pattern Level Performance

Figure 4 shows the pattern-level F1 before and after introducing the two auxiliary training objectives. In particular, we calculate the pattern-level F1 by calculating the F1 score for patterns based on the constituency trees predicted by CKY decoding.
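The pattern-level F1 described here can be sketched as a multiset comparison between the pattern triples read off the gold trees and those read off the predicted trees. The helper below is illustrative, with hypothetical names, not the evaluation script used in the paper:

```python
# Hypothetical sketch of pattern-level F1 (Section 6.1): precision, recall
# and F1 over the multisets of (i, j, pattern_label) triples extracted from
# gold trees and from the trees predicted by CKY decoding.

from collections import Counter

def pattern_f1(gold_patterns, pred_patterns):
    """Each argument is a list of (i, j, pattern_label) triples."""
    gold, pred = Counter(gold_patterns), Counter(pred_patterns)
    overlap = sum((gold & pred).values())   # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

gold = [(0, 3, ("NP", "VP")), (1, 3, ("VBD", "NP"))]
pred = [(0, 3, ("NP", "VP")), (1, 3, ("VBD", "PP"))]
print(pattern_f1(gold, pred))   # 0.5
```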
[Figure 4: Pattern-level F1 on different English datasets: (a) F1 scores measured by 3-gram patterns; (b) F1 scores measured by 2-gram patterns. Note that we train NFC based on 3-gram patterns in English; there is no direct supervision signal for 2-gram patterns.]

[Figure 5: F1 scores versus minimum constituent span length on the PTB test set. Note that constituent spans shorter than 30 account for approximately 98.5% of all spans in the PTB test set.]

Although our baseline parser with BERT achieves 95.76% F1 on PTB, the pattern-level F1 is only 80.28% measured by 3-gram patterns. When testing on the dialogue domain, the result is reduced to only 57.47% F1, which indicates that even a strong neural encoder still has difficulty capturing constituent dependencies from the input sequence alone. After introducing the pattern and consistency losses, NFC significantly outperforms the baseline parser as measured by 3-gram pattern F1. Though there is no direct supervision signal for 2-gram patterns, NFC also gives better results on 2-gram pattern F1, as 2-gram patterns are subsumed by 3-gram patterns. This suggests that NFC can effectively represent sub-tree structures.

6.2 F1 against Span Length

We compare the performance of the baseline and our method on constituent spans of different word lengths. Figure 5 shows the trends of F1 scores on the PTB test set as the minimum constituent span length increases. Our method shows a minor improvement at the beginning, but the gap becomes more evident as the minimum span length increases, demonstrating its advantage in capturing more sophisticated constituency label dependencies.

[Figure 6: Exact matching (EM) score across different domains. EM indicates the percentage of sentences whose predicted trees are entirely correct.]

6.3 Exact Match

The exact match score represents the percentage of sentences whose predicted trees are entirely the same as the golden trees. Producing exactly matched trees could improve user experience in practical scenarios and benefit downstream applications on other tasks (Petrov and Klein, 2007; Kummerfeld et al., 2012). We compare the exact match scores of NFC with those of the baseline parser.
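A minimal sketch of the EM metric, comparing each predicted tree with its gold tree as a set of labeled spans (names are illustrative; the paper's exact tree comparison may differ):

```python
# Hypothetical sketch of the exact match (EM) score of Section 6.3: the
# percentage of sentences whose predicted tree equals the gold tree.
# A tree is represented here as a frozenset of (i, j, label) spans.

def exact_match(gold_trees, pred_trees):
    """Return EM as a percentage over parallel lists of trees."""
    assert len(gold_trees) == len(pred_trees)
    correct = sum(1 for g, p in zip(gold_trees, pred_trees) if g == p)
    return 100.0 * correct / len(gold_trees)

gold = [frozenset({(0, 2, "NP"), (0, 5, "S")}), frozenset({(0, 3, "S")})]
pred = [frozenset({(0, 2, "NP"), (0, 5, "S")}), frozenset({(0, 3, "VP")})]
print(exact_match(gold, pred))   # 50.0
```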
As shown in Figure 6, NFC achieves large improvements in exact match score for all domains. For instance, NFC achieves a 33.40% exact match score in the review domain, outperforming the baseline by 10.2% points. We assume that this results from the fact that NFC successfully ensures the output tree structure by modeling non-local correlations.

6.4 Model Efficiency

As mentioned in Section 4.4, NFC only introduces a few training parameters to the baseline model (Kitaev and Klein, 2018). For PTB, NFC takes about 19 hours to train with a single RTX 2080Ti, while the baseline takes about 13 hours. For CTB, the approximate training time is 12 hours for NFC and 7 hours for the baseline. Our inference time is the same as that of the baseline parser, since no further computational operations are added to the inference phase. Both take around 11 seconds to parse PTB section 23 (2,416 sentences, an average of 23.5 tokens per sentence).

7 Conclusion

We investigated graph-based constituency parsing with non-local features – both in the sense that the features are not restricted to one constituent, and in the sense that they are not restricted to each training instance. Experimental results verify the effectiveness of injecting non-local features into neural chart-based constituency parsing. Equipped with pre-trained BERT, our method achieves 95.92% F1 on PTB and 92.31% F1 on CTB. We further demonstrated that the proposed method gives better or competitive results in multilingual and zero-shot cross-domain settings.

Acknowledgements

We appreciate the insightful comments from the anonymous reviewers. We thank Zhiyang Teng for the insightful discussions. We gratefully acknowledge funding from the National Natural Science Foundation of China (NSFC No.61976180).

Ethical Considerations

As mentioned in Section 5.1, we collected the raw data from free and publicly available sources that have no copyright or privacy issues. We recruited our annotators from the linguistics departments of local universities through public advertisement with a specified pay rate. All of our annotators are senior undergraduate students or graduate students in linguistics majors who took this annotation as a part-time job. We manually shuffled the data so that all batches of to-be-annotated data have similar lengths on average. An annotator could annotate around 25 instances per hour. We pay them 50 CNY an hour; the local minimum salary in the year 2021 is 22 CNY per hour for part-time jobs.

Our annotated data only involves factual information (i.e., syntactic annotation), but not opinions, attitudes or beliefs. Therefore, the annotation job does not belong to human subject research, and IRB approval is not required.

References

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics.

James Cross and Liang Huang. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1–11, Austin, Texas. Association for Computational Linguistics.

Leyang Cui and Yue Zhang. 2019. Hierarchically-refined label attention network for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4115–4128, Hong Kong, China. Association for Computational Linguistics.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations.

Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 302–312, Beijing, China. Association for Computational Linguistics.

Daniel Fried, Nikita Kitaev, and Dan Klein. 2019. Cross-domain generalization of neural constituency parsers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 323–330, Florence, Italy. Association for Computational Linguistics.
David Gaddy, Mitchell Stern, and Dan Klein. 2018. What's going on in neural constituency parsers? An analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 999–1010, New Orleans, Louisiana. Association for Computational Linguistics.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

Tao Gui, Jiacheng Ye, Qi Zhang, Zhengyan Li, Zichu Fei, Yeyun Gong, and Xuanjing Huang. 2020. Uncertainty-aware label refinement for sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2316–2326, Online. Association for Computational Linguistics.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 507–517, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Nikita Kitaev, Steven Cao, and Dan Klein. 2019. Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3499–3505, Florence, Italy. Association for Computational Linguistics.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686, Melbourne, Australia. Association for Computational Linguistics.

Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1048–1059, Jeju Island, Korea. Association for Computational Linguistics.

Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. CoRR, abs/1603.01354.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1):197–230.

Khalil Mrini, Franck Dernoncourt, Quan Hung Tran, Trung Bui, Walter Chang, and Ndapa Nakashole. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 731–742, Online. Association for Computational Linguistics.

Thanh-Tung Nguyen, Xuan-Phi Nguyen, Shafiq Joty, and Xiaoli Li. 2020. Efficient constituency parsing by pointing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3284–3294, Online. Association for Computational Linguistics.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411, Rochester, New York. Association for Computational Linguistics.

Miruna Pislar and Marek Rei. 2020. Seeing both the forest and the trees: Multi-head attention for joint classification on different compositional levels. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3761–3775, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Djamé Seddah, Reut Tsarfaty, Sandra Kübler, Marie Candito, Jinho D. Choi, Richárd Farkas, Jennifer Foster, Iakes Goenaga, Koldo Gojenola Galletebeitia, Yoav Goldberg, Spence Green, Nizar Habash, Marco Kuhlmann, Wolfgang Maier, Joakim Nivre, Adam Przepiórkowski, Ryan Roth, Wolfgang Seeker, Yannick Versley, Veronika Vincze, Marcin Woliński, Alina Wróblewska, and Eric Villemonte de la Clergerie. 2013. Overview of the SPMRL 2013 shared task: A cross-framework evaluation of parsing morphologically rich languages. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 146–182, Seattle, Washington, USA. Association for Computational Linguistics.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017a. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada. Association for Computational Linguistics.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017b. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada. Association for Computational Linguistics.

Øyvind Stiansen and Erik Voeten. 2019. ECtHR judgments.

Zhiyang Teng and Yue Zhang. 2018. Two local models for neural constituent parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 119–132, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yuanhe Tian, Yan Song, Fei Xia, and Tong Zhang. 2020. Improving constituency parsing with span attention. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1691–1703, Online. Association for Computational Linguistics.

Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Workshop on New Frontiers in Summarization at EMNLP 2017, pages 59–63. Association for Computational Linguistics.

Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A tree-based decoder for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4772–4777, Brussels, Belgium. Association for Computational Linguistics.

Jiacheng Xu and Greg Durrett. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3292–3303, Hong Kong, China. Association for Computational Linguistics.

Naiwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Sen Yang, Leyang Cui, Ruoxi Ning, Di Wu, and Yue Zhang. 2022. Challenges to open-domain constituency parsing. In Findings of the Association for Computational Linguistics: ACL 2022.

Min-Ling Zhang and Kun Zhang. 2010. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 999–1008, New York, NY, USA. Association for Computing Machinery.

Yu Zhang, Zhenghua Li, and Min Zhang. 2020a. Efficient second-order TreeCRF for neural dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3295–3305, Online. Association for Computational Linguistics.

Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020b. Fast and accurate neural CRF constituency parsing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4046–4053. International Joint Conferences on Artificial Intelligence Organization.

Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discriminative model. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 162–171, Paris, France. Association for Computational Linguistics.

Yue Zhang and Joakim Nivre. 2012. Analyzing the effect of global learning and beam-search on transition-based dependency parsing. In Proceedings of COLING 2012: Posters, pages 1391–1400, Mumbai, India. The COLING 2012 Organizing Committee.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127–1137, Beijing, China. Association for Computational Linguistics.

Junru Zhou and Hai Zhao. 2019. Head-Driven Phrase Structure Grammar parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2396–2408, Florence, Italy. Association for Computational Linguistics.