StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling
Yikang Shen* (Mila/Université de Montréal), Yi Tay (Google Research), Che Zheng (Google Research), Dara Bahri (Google Research), Donald Metzler (Google Research), Aaron Courville (Mila/Université de Montréal)

Abstract

There are two major classes of natural language grammars — the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing only one class of grammars, we introduce a novel model, StructFormer, that can simultaneously induce dependency and constituency structure. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and a dependency graph. We then integrate the induced dependency relations into the transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

1 Introduction

Human languages have a rich latent structure. This structure is multifaceted, with the two major classes of grammar being dependency and constituency structures. There has been an exciting breadth of recent work targeted at learning this structure in a data-driven unsupervised fashion (Klein and Manning, 2002; Klein, 2005; Le and Zuidema, 2015; Shen et al., 2018c; Kim et al., 2019a). The core principle behind recent methods that induce structure from data is simple: provide an inductive bias that is conducive for structure to emerge as a byproduct of some self-supervised training, e.g., language modeling. To this end, a wide range of models have been proposed that are able to successfully learn grammar structures (Shen et al., 2018a,c; Wang et al., 2019; Kim et al., 2019b,a). However, most of these works focus on inducing either constituency or dependency structures alone.

In this paper, we make two important technical contributions. First, we introduce a new neural model, StructFormer, that is able to simultaneously induce both dependency structure and constituency structure. Specifically, our approach aims to unify latent structure induction of different types of grammar within the same framework. Second, StructFormer is able to induce dependency structures from raw data in an end-to-end unsupervised fashion. Most existing approaches induce dependency structures from other syntactic information like gold POS tags (Klein and Manning, 2004; Cohen and Smith, 2009; Jiang et al., 2016). Previous works trained from words alone often require additional information, like pre-trained word clustering (Spitkovsky et al., 2011), pre-trained word embeddings (He et al., 2018), acoustic cues (Pate and Goldwater, 2013), or annotated data from related languages (Cohen et al., 2011).

We introduce a new inductive bias that enables Transformer models to induce a directed dependency graph in a fully unsupervised manner. To avoid the necessity of using grammar labels during training, we use a distance-based parsing mechanism. The parsing mechanism predicts a sequence of Syntactic Distances T (Shen et al., 2018b) and a sequence of Syntactic Heights ∆ (Luo et al., 2019) to represent dependency graphs and constituency trees at the same time. Examples of ∆ and T are illustrated in Figure 1a.

Based on the syntactic distances (T) and syntactic heights (∆), we provide a new dependency-constrained self-attention layer to replace the multi-head self-attention layer in the standard transformer model. More specifically, each new attention head can only attend to its parent (to avoid confusion with the self-attention head, we use "parent" to denote "head" in the dependency graph) or its dependents in the predicted dependency structure, through a weighted sum of the relations shown in Figure 1b.

* Corresponding author: yikang.shn@gmail.ca. Work done while interning at Google Research.
Figure 1: An example of our parsing mechanism and dependency-constrained self-attention mechanism. The parsing network first predicts the syntactic distances T and syntactic heights ∆ to represent the latent structure of the input sentence "I like cats". Then the parent and dependent relations are computed in a differentiable manner from T and ∆. (a) An example of Syntactic Distances T (grey bars) and Syntactic Heights ∆ (white bars). In this example, "like" is the parent (head) of the constituents (like cats) and (I like cats). (b) Two types of dependency relations. The parent distribution allows each token to attend on its parent. The dependent distribution allows each token to attend on its dependents. For example, the parent of "cats" is "like"; "cats" and "I" are dependents of "like". Each attention head receives a different weighted sum of these relations.

In this way, we replace the complete graph in the standard transformer model with a differentiable directed dependency graph. During the process of training on a downstream task (e.g. masked language modeling), the model will gradually converge to a reasonable dependency graph via gradient descent.

Incorporating the new parsing mechanism, the dependency-constrained self-attention, and the Transformer architecture, we introduce a new model named StructFormer. The proposed model can perform unsupervised dependency and constituency parsing at the same time, and can leverage the parsing results to achieve strong performance on masked language model tasks.

2 Related Work

Previous works on unsupervised dependency parsing are primarily based on the dependency model with valence (DMV) (Klein and Manning, 2004) and its extensions (Daumé III, 2009; Gillenwater et al., 2010). To effectively learn the DMV model for better parsing accuracy, a variety of inductive biases and handcrafted features, such as correlations between parameters of grammar rules involving different part-of-speech (POS) tags, have been proposed to incorporate prior information into learning. The most recent progress is the neural DMV model (Jiang et al., 2016), which uses a neural network model to predict the grammar rule probabilities based on the distributed representation of POS tags. However, most existing unsupervised dependency parsing algorithms require the gold POS tags to be provided as inputs. These gold POS tags are labeled by humans and can be potentially difficult (or prohibitively expensive) to obtain for large corpora. Spitkovsky et al. (2011) proposed to overcome this problem with unsupervised word clustering that can dynamically assign tags to each word considering its context. He et al. (2018) overcame the problem by combining the DMV model with an invertible neural network to jointly model discrete syntactic structure and continuous word representations.

Unsupervised constituency parsing has recently received more attention. PRPN (Shen et al., 2018a) and ON-LSTM (Shen et al., 2018c) induce tree structure by introducing an inductive bias to recurrent neural networks. PRPN proposes a parsing network to compute the syntactic distance of all word pairs, while a reading network uses the syntactic structure to attend to relevant memories. ON-LSTM allows hidden neurons to learn long-term or short-term information by a novel gating mechanism and activation function. In URNNG (Kim et al., 2019b), amortized variational inference was applied between a recurrent neural network grammar (RNNG) (Dyer et al., 2016) decoder and a tree structure inference network, which encourages the decoder to generate reasonable tree structures. DIORA (Drozdov et al., 2019) proposed using inside-outside dynamic programming to compose latent representations from all possible binary trees.
The representations of the inside and outside passes from the same sentences are optimized to be close to each other. The compound PCFG (Kim et al., 2019a) achieves grammar induction by maximizing the marginal likelihood of the sentences which are generated by a probabilistic context-free grammar (PCFG). Tree Transformer (Wang et al., 2019) adds extra locality constraints to the Transformer encoder's self-attention to encourage the attention heads to follow a tree structure, such that each token can only attend on nearby neighbors in lower layers and gradually extend the attention field to further tokens when climbing to higher layers. Neural L-PCFG (Zhu et al., 2020) demonstrated that PCFG can benefit from modeling lexical dependencies. Similar to StructFormer, the Neural L-PCFG induces both constituents and dependencies within a single model.

Though large-scale pre-trained models have dominated most natural language processing tasks, some recent work indicates that neural network models can see accuracy gains by leveraging syntactic information rather than ignoring it (Marcheggiani and Titov, 2017). Strubell et al. (2018) introduce a syntactically-informed self-attention that forces one attention head to attend on the syntactic governor of the input token. Omote et al. (2019) and Deguchi et al. (2019) argue that dependency-informed self-attention can improve the Transformer's performance on machine translation. Kuncoro et al. (2020) show that syntactic biases help large-scale pre-trained models, like BERT, achieve better language understanding.

3 Syntactic Distance and Height

In this section, we first reintroduce the concepts of syntactic distance and height, then discuss their relation in the context of StructFormer.

3.1 Syntactic Distance

Syntactic distance is proposed in Shen et al. (2018b) to quantify the process of splitting sentences into smaller constituents.

Definition 3.1. Let T be a constituency tree for sentence (w1, ..., wn). The height of the lowest common ancestor for consecutive words xi and xi+1 is τ̃i. Syntactic distances T = (τ1, ..., τn−1) are defined as a sequence of n − 1 real scalars that share the same rank as (τ̃1, ..., τ̃n−1).

In other words, each syntactic distance τi is associated with a split point (i, i + 1) and specifies the relative order in which the sentence will be split into smaller components. Thus, any sequence of n − 1 real values can be unambiguously mapped to an unlabeled binary constituency tree with n leaves through Algorithm 1 (Shen et al., 2018b). As Shen et al. (2018c,a) and Wang et al. (2019) pointed out, the syntactic distance reflects the information communication between constituents. More concretely, a large syntactic distance τi represents that short-term or local information should not be communicated between (x≤i) and (x>i). While cooperating with appropriate neural network architectures, we can leverage this feature to build unsupervised dependency parsing models.

Algorithm 1: Distance to binary constituency tree
1: function Constituent(w, d)
2:   if d = [] then
3:     T ⇐ Leaf(w)
4:   else
5:     i ⇐ arg max_i(d)
6:     childl ⇐ Constituent(w≤i, d<i)
7:     childr ⇐ Constituent(w>i, d>i)
8:     T ⇐ Node(childl, childr)
9:   return T

Algorithm 2: Converting a binary constituency tree to a dependency graph
1: function Dependent(T, ∆)
2:   if T = w then
3:     D ⇐ [], parent ⇐ w
4:   else
5:     childl, childr ⇐ T
6:     Dl, parentl ⇐ Dependent(childl, ∆)
7:     Dr, parentr ⇐ Dependent(childr, ∆)
8:     D ⇐ Union(Dl, Dr)
9:     if ∆(parentl) > ∆(parentr) then
10:      D.add(parentl ← parentr)
11:      parent ⇐ parentl
12:    else
13:      D.add(parentr ← parentl)
14:      parent ⇐ parentr
15:  return D, parent

3.2 Syntactic Height

Syntactic height is proposed in Luo et al. (2019), where it is used to capture the distance to the root node in a dependency graph. A word with high syntactic height means it is close to the root node. In this paper, to match the definition of syntactic distance, we redefine syntactic height as:

Definition 3.2. Let D be a dependency graph for sentence (w1, ..., wn). The height of a token wi in D is δ̃i. The syntactic heights of D can be any sequence of n real scalars ∆ = (δ1, ..., δn) that share the same rank as (δ̃1, ..., δ̃n).
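To make the two procedures concrete, here is a minimal Python sketch of Algorithm 1 and Algorithm 2 on the sentence from Figure 1. This is our own illustration, not the authors' released code; the function names and the toy distance and height values are invented.

# Algorithm 1: split at the largest distance, recurse on both halves.
def constituent(words, distances):
    if not distances:
        return words[0]                      # a leaf is just the word itself
    i = max(range(len(distances)), key=lambda k: distances[k])
    left = constituent(words[:i + 1], distances[:i])
    right = constituent(words[i + 1:], distances[i + 1:])
    return (left, right)                     # an internal node is a pair

# Algorithm 2: the head with the larger height becomes the parent of the other.
def dependent(tree, height):
    if isinstance(tree, str):
        return [], tree                      # no relations; the word is its own head
    left, right = tree
    rel_l, parent_l = dependent(left, height)
    rel_r, parent_r = dependent(right, height)
    relations = rel_l + rel_r
    if height[parent_l] > height[parent_r]:
        relations.append((parent_r, parent_l))   # (dependent, parent)
        return relations, parent_l
    relations.append((parent_l, parent_r))
    return relations, parent_r

words = ["I", "like", "cats"]
distances = [2.0, 1.0]                       # split "I | like cats" first
height = {"I": 1.0, "like": 2.0, "cats": 1.0}
tree = constituent(words, distances)         # ('I', ('like', 'cats'))
relations, root = dependent(tree, height)    # [('cats', 'like'), ('I', 'like')], root 'like'
print(tree, relations, root)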
Although the syntactic height is defined based on the dependency structure, we cannot rebuild the original dependency structure from syntactic heights alone, since there is no information about whether a token should be attached to the left side or the right side. However, given an unlabelled constituency tree, we can convert it into a dependency graph with the help of syntactic distance. The conversion process is similar to the standard process of converting a constituency treebank into a dependency treebank (Gelbukh et al., 2005). Instead of using the constituent labels and POS tags to identify the parent of each constituent, we simply assign the token with the largest syntactic height as the parent of each constituent. The conversion algorithm is described in Algorithm 2. In Appendix A.1, we also propose a joint algorithm that takes T and ∆ as inputs and jointly outputs a constituency tree and a dependency graph.

3.3 The relation between Syntactic Distance and Height

As discussed previously, the syntactic distance controls information communication between the two sides of a split point. The syntactic height quantifies the centrality of each token in the dependency graph. A token with large syntactic height tends to have more long-term dependency relations that connect different parts of the sentence together. In StructFormer, we quantify the syntactic distance and height on the same scale. Given a split point (i, i + 1) and its syntactic distance τi, only tokens xj with δj > τi can attend across the split point (i, i + 1). Thus, tokens with small syntactic height are limited to attending to nearby tokens. Figure 2 provides an example of T, ∆ and the respective dependency graph D.

Figure 2: An example of T, ∆ and the respective dependency graph D. Solid lines represent dependency relations between tokens. StructFormer only allows tokens with a dependency relation to attend on each other.

However, if the left and right boundary syntactic distances of a constituent [l, r] are too large, all words in [l, r] will be forced to only attend to other words in [l, r]. Their contextual embeddings will not be able to encode the full context. To avoid this phenomenon, we propose calibrating T according to ∆ in Appendix A.2.

4 StructFormer

In this section, we present the StructFormer model. Figure 3a shows the architecture of StructFormer, which includes a parser network and a Transformer module. The parser network predicts T and ∆, then passes them to a set of differentiable functions to generate dependency distributions. The Transformer module takes these distributions and the sentence as input to compute a contextual embedding for each position. StructFormer can be trained in an end-to-end fashion on a masked language model task. In this setting, the gradient back-propagates through the relation distributions into the parser.

Figure 3: The architecture of StructFormer. (a) Model Architecture. (b) Parsing Network. The parser takes shared word embeddings as input and outputs syntactic distances T, syntactic heights ∆, and dependency distributions between tokens. The transformer layers take word embeddings and dependency distributions as input and output contextualized embeddings for the input words.

4.1 Parsing Network

As shown in Figure 3b, the parsing network takes word embeddings as input and feeds them into several convolution layers:

sl,i = tanh(Conv(sl−1,i−W, ..., sl−1,i+W))   (1)

where sl,i is the output of the l-th layer at the i-th position, s0,i is the input embedding of token wi, and 2W + 1 is the convolution kernel size.
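As an illustration of Eq. (1), the PyTorch sketch below stacks centered 1-D convolutions with a tanh activation. It is our own reading of the equation rather than the released implementation; the kernel half-width W and hidden size are assumptions (Section 5 only specifies Lp = 3 convolution layers and dmodel = 512).

import torch
import torch.nn as nn

class ParserConvStack(nn.Module):
    # Sketch of the parsing network's convolution stack in Eq. (1): each layer
    # looks at a window of 2W + 1 neighbouring positions and applies tanh.
    def __init__(self, hidden_size=512, num_layers=3, half_width=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(hidden_size, hidden_size,
                      kernel_size=2 * half_width + 1, padding=half_width)
            for _ in range(num_layers)
        ])

    def forward(self, embeddings):            # embeddings: (batch, length, hidden)
        s = embeddings.transpose(1, 2)        # Conv1d expects (batch, hidden, length)
        for conv in self.layers:
            s = torch.tanh(conv(s))           # s_{l,i} = tanh(Conv(s_{l-1,i-W..i+W}))
        return s.transpose(1, 2)              # s_{N,i} for every position i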
Given the output of the convolution stack sN,i, we parameterize the syntactic distance T as:

τi = W1τ tanh(W2τ [sN,i, sN,i+1] + b2τ) + b1τ,  for 1 ≤ i ≤ n − 1;   τ0 = τn = ∞   (2)

where τi is the contextualized distance for the i-th split point, between tokens wi and wi+1. The syntactic height ∆ is parameterized in a similar way:

δi = W1δ tanh(W2δ sN,i + b2δ) + b1δ   (3)

4.2 Estimate the Dependency Distribution

Given T and ∆, we now explain how to estimate the probability p(xj|xi) that the j-th token is the parent of the i-th token. The first step is identifying the smallest legal constituent C(xi) that contains xi and whose parent is not xi. The second step is identifying the parent of that constituent, xj = Pr(C(xi)). Given the discussion in Section 3.2, the parent of C(xi) must be the parent of xi. Thus, the two stages of identifying the parent of xi can be formulated as:

D(xi) = Pr(C(xi))   (4)

In StructFormer, C(xi) is represented as a constituent [l, r], where l is the starting index (l ≤ i) of C(xi) and r is the ending index (r ≥ i) of C(xi). In a dependency graph, xi is only connected to its parent and dependents. This means that xi has no direct connection to the outside of C(xi). In other words, C(xi) = [l, r] is the smallest constituent that satisfies:

δi < τl−1,   δi < τr   (5)

where τl−1 is the first distance on the left of xi that is larger than δi, and τr is the first distance on the right of xi that is larger than δi. For example, in Figure 2, the right boundary of C(x4) is given by τ8 = ∞ > δ4, thus C(x4) = [4, 8]. To make this process differentiable, we define τk as a real value and δi as a probability distribution p(δ̃i). For simplicity and efficiency of computation, we directly parameterize the cumulative distribution function p(δ̃i > τk) with a sigmoid function:

p(δ̃i > τk) = σ((δi − τk)/µ1)   (6)

where σ is the sigmoid function, δi is the mean of the distribution p(δ̃i), and µ1 is a learnable temperature term. Thus, the probability that the l-th (l < i) token is inside C(xi) is equal to the probability that δ̃i is larger than the maximum distance τ between l and i:

p(l ∈ C(xi)) = p(δ̃i > max(τl, ..., τi−1)) = σ((δi − max(τl, ..., τi−1))/µ1)   (7)

Then we can compute the probability distribution for l:

p(l|i) = p(l ∈ C(xi)) − p(l − 1 ∈ C(xi))
       = σ((δi − max(τl, ..., τi−1))/µ1) − σ((δi − max(τl−1, ..., τi−1))/µ1)   (8)

Similarly, we can compute the probability distribution for r:

p(r|i) = σ((δi − max(τi, ..., τr−1))/µ1) − σ((δi − max(τi, ..., τr))/µ1)   (9)

The probability distribution for [l, r] = C(xi) can then be computed as:

pC([l, r]|i) = p(l|i) p(r|i) if l ≤ i ≤ r, and 0 otherwise   (10)

The second step is to identify the parent of [l, r]. For any constituent [l, r], we choose j = argmax_{k∈[l,r]}(δk) as the parent of [l, r]. In the previous example, given the constituent [4, 8], the maximum syntactic height is δ6 = 4.5, thus Pr([4, 8]) = x6. We use a softmax function to parameterize the probability pPr(j|[l, r]):

pPr(j|[l, r]) = exp(δj/µ2) / Σ_{l≤k≤r} exp(δk/µ2) if l ≤ j ≤ r, and 0 otherwise   (11)

Given the probabilities pPr(j|[l, r]) and pC([l, r]|i), we can compute the dependency distribution by marginalizing over all constituents that contain xi:

pD(j|i) = Σ_{l≤i≤r} pC([l, r]|i) pPr(j|[l, r]) for j ≠ i, and pD(i|i) = 0   (12)
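The following NumPy sketch walks through Eqs. (6)-(12) by brute force. It is our own illustration, not the paper's vectorised implementation, and the toy distances and heights at the bottom are invented for the three-word example from Figure 1.

import numpy as np

def dependency_distribution(tau, delta, mu1=1.0, mu2=1.0):
    # Soft parent distribution p_D(j|i) from split-point distances tau (length
    # n-1) and token heights delta (length n), following Eqs. (6)-(12).
    n = len(delta)
    delta = np.asarray(delta, dtype=float)
    # c[k] is the distance of the split on the left of token k; both sentence
    # boundaries get +inf, matching tau_0 = tau_n = inf in Eq. (2).
    c = np.concatenate(([np.inf], np.asarray(tau, dtype=float), [np.inf]))

    def inside(i, lo, hi):
        # sigma((delta_i - max of the splits c[lo..hi]) / mu1), cf. Eq. (7)
        if lo > hi:
            return 1.0                        # no split to cross
        z = (delta[i] - c[lo:hi + 1].max()) / mu1
        return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

    p_d = np.zeros((n, n))
    for i in range(n):
        for l in range(i + 1):
            p_l = inside(i, l + 1, i) - inside(i, l, i)              # Eq. (8)
            for r in range(i, n):
                p_r = inside(i, i + 1, r) - inside(i, i + 1, r + 1)  # Eq. (9)
                p_c = p_l * p_r                                      # Eq. (10)
                w = np.exp(delta[l:r + 1] / mu2)                     # Eq. (11)
                p_d[i, l:r + 1] += p_c * w / w.sum()                 # Eq. (12)
    np.fill_diagonal(p_d, 0.0)                # a token is never its own parent
    return p_d

# Toy example for "I like cats": a high split after "I", "like" is the highest token.
print(dependency_distribution(tau=[2.0, 1.0], delta=[1.0, 3.0, 1.0]).round(2))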
4.3 Dependency-Constrained Multi-head Self-Attention

The multi-head self-attention in the transformer can be seen as an information propagation mechanism on the complete graph G = (X, E), where the set of vertices X contains all n tokens in the sentence, and the set of edges E contains all possible word pairs (xi, xj). StructFormer replaces the complete graph G with a soft dependency graph D = (X, A), where A is the n × n matrix of probabilities, with Aij = pD(j|i) the probability that the j-th token is the parent of the i-th token. The reason we call its edges directed is that each specific head is only allowed to propagate information either from parent to dependent or from dependent to parent. To do so, StructFormer associates each attention head with a probability distribution over the parent and dependent relations:

pparent = exp(wparent) / (exp(wparent) + exp(wdep))   (13)

pdep = exp(wdep) / (exp(wparent) + exp(wdep))   (14)

where wparent and wdep are learnable parameters associated with each attention head, and pparent is the probability that this head will propagate information from parent to dependent, and vice versa. The model learns to assign this association from the downstream task via gradient descent. We can then compute the probability that information can be propagated from node j to node i via this head:

pi,j = pparent pD(j|i) + pdep pD(i|j)   (15)

However, Htut et al. (2019) pointed out that different heads tend to associate with different types of universal dependency relations (including nsubj, obj, advmod, etc.), but there is no generalist head that can work with all different relations. To accommodate this observation, we compute an individual probability for each head and pair of tokens (xi, xj):

qi,j = sigmoid(QK^T / √dk)   (16)

where Q and K are the query and key matrices in a standard transformer model and dk is the dimension of the attention head. The equation is inspired by the scaled dot-product attention in the transformer. We replace the original softmax function with a sigmoid function, so qi,j becomes an independent probability that indicates whether xi should attend on xj through the current attention head. In the end, we propose to replace the transformer's scaled dot-product attention with our dependency-constrained self-attention:

Attention(Qi, Kj, Vj, D) = pi,j qi,j Vj   (17)
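A compact PyTorch sketch of one such head is given below. It is our own reading of Eqs. (13)-(17) rather than the released implementation (which may organise heads, projections, and scaling differently); p_d stands for the matrix of pD(j|i) values produced by the parser.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DependencyConstrainedHead(nn.Module):
    # Sketch of one dependency-constrained attention head, Eqs. (13)-(17).
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        # Eqs. (13)-(14): learnable soft choice between the two relation types.
        self.relation_logits = nn.Parameter(torch.zeros(2))  # [w_parent, w_dep]

    def forward(self, x, p_d):
        # x: (batch, n, d_model); p_d[b, i, j] = p_D(j|i) from the parser.
        q, k, v = self.q(x), self.k(x), self.v(x)
        d_head = q.size(-1)
        # Eq. (16): sigmoid instead of softmax -> independent attend/ignore gates.
        gate = torch.sigmoid(q @ k.transpose(-1, -2) / d_head ** 0.5)
        # Eq. (15): mix the parent and dependent views of the dependency graph.
        p_parent, p_dep = F.softmax(self.relation_logits, dim=0)
        prob = p_parent * p_d + p_dep * p_d.transpose(-1, -2)
        # Eq. (17): weight the value vectors by both probabilities.
        return (prob * gate) @ v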
5 Experiments

We evaluate the proposed model on three tasks: Masked Language Modeling, Unsupervised Constituency Parsing, and Unsupervised Dependency Parsing.

Our implementation of StructFormer is close to the original Transformer encoder (Vaswani et al., 2017), except that we put the layer normalization in front of each layer, similar to the T5 model (Raffel et al., 2019). We found that this modification allows the model to converge faster. For all experiments, we set the number of layers L = 8, the embedding size and hidden size to dmodel = 512, the number of self-attention heads to h = 8, the feed-forward size to dff = 2048, the dropout rate to 0.1, and the number of convolution layers in the parsing network to Lp = 3.

5.1 Masked Language Model

Masked Language Modeling (MLM) has been widely used as a pretraining objective for large-scale pretraining models. In BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), the authors found that MLM perplexities on a held-out evaluation set have a positive correlation with end-task performance. We trained and evaluated our model on two different datasets: the Penn TreeBank (PTB) and BLLIP. In our MLM experiments, each token has an independent chance to be replaced by a mask token <mask>, except that we never replace the <unk> token. The training and evaluation objective for the masked language model is to predict the replaced tokens. The performance of MLM is evaluated by measuring perplexity on the masked words.
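As a small illustration of this masking scheme (our own sketch; the exact preprocessing code is not shown in the paper), each position is masked independently with the dataset-specific rate reported below, and <unk> tokens are left untouched:

import random

def mask_tokens(tokens, mask_rate, seed=None):
    # Independently replace each token by <mask> with probability mask_rate,
    # never masking <unk>; the model is trained to predict the replaced tokens.
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok != "<unk>" and rng.random() < mask_rate:
            masked.append("<mask>")
            targets.append(tok)          # prediction targets
        else:
            masked.append(tok)
            targets.append(None)         # positions that are not predicted
    return masked, targets

print(mask_tokens("the <unk> bought shares of the company".split(), mask_rate=0.3, seed=0))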
PTB is a standard dataset for language modeling (Mikolov et al., 2012) and unsupervised constituency parsing (Shen et al., 2018c; Kim et al., 2019a). Following the setting proposed in Shen et al. (2018c), we use Mikolov et al. (2012)'s preprocessing, which removes all punctuation and replaces low-frequency tokens with <unk>. The preprocessing results in a vocabulary size of 10001 (including special tokens such as <unk> and <mask>). For PTB, we use a 30% mask rate.

BLLIP is a large Penn Treebank-style parsed corpus of approximately 24 million sentences. We train and evaluate StructFormer on three splits of BLLIP: BLLIP-XS (40k sentences, 1M tokens), BLLIP-SM (200K sentences, 5M tokens), and BLLIP-MD (600K sentences, 14M tokens). They are obtained by randomly sampling sections from the BLLIP 1987-89 Corpus Release 1. All models are tested on a shared held-out test set (20k sentences, 500k tokens). Following the settings provided in Hu et al. (2020), we use a subword-level vocabulary extracted from the GPT-2 pre-trained model rather than one built from the BLLIP training corpora. For BLLIP, we use a 15% mask rate.

Table 1: Masked Language Model perplexities on different datasets.

Model          PTB    BLLIP-XS  BLLIP-SM  BLLIP-MD
Transformer    64.05  93.90     19.92     14.31
StructFormer   60.94  57.28     18.70     13.70

The masked language model results are shown in Table 1. StructFormer consistently outperforms our Transformer baseline. This result aligns with previous observations that linguistically informed self-attention can help Transformers achieve stronger performance. We also observe that StructFormer converges much faster than the standard Transformer model.

5.2 Unsupervised Constituency Parsing

The unsupervised constituency parsing task compares the latent tree structure induced by the model with those annotated by human experts. We use Algorithm 1 to predict constituency trees from the T predicted by StructFormer. Following the experiment settings proposed in Shen et al. (2018c), we take the model trained on the PTB dataset and evaluate it on the WSJ test set. The WSJ test set is section 23 of the WSJ corpus; it contains 2416 human-expert-labeled sentences. Punctuation is ignored during the evaluation.

Table 2: Unsupervised constituency parsing results. * results are from Kim et al. (2020). UF1 stands for Unlabeled F1.

Methods                             UF1
RANDOM                              21.6
LBRANCH                             9.0
RBRANCH                             39.8
PRPN (Shen et al., 2018a)           37.4 (0.3)
ON-LSTM (Shen et al., 2018c)        47.7 (1.5)
Tree-T (Wang et al., 2019)          49.5
URNNG (Kim et al., 2019b)           52.4
C-PCFG (Kim et al., 2019a)          55.2
Neural L-PCFGs (Zhu et al., 2020)   55.31
StructFormer                        54.0 (0.3)

Table 2 shows that our model achieves strong results on unsupervised constituency parsing. While the C-PCFG (Kim et al., 2019a) achieves a stronger parsing performance thanks to its strong linguistic constraints (e.g. a finite set of production rules), StructFormer may have a broader domain of application. For example, it can replace the standard transformer encoder in most popular large-scale pre-trained language models (e.g. BERT and RoBERTa) and in transformer-based machine translation models. Different from the transformer-based Tree-T (Wang et al., 2019), we did not directly use constituents to restrict the self-attention receptive field, yet StructFormer achieves a stronger constituency parsing performance. This result may suggest that dependency relations are more suitable for grammar induction in transformer-based models. Table 3 shows that our model achieves strong accuracy when predicting Noun Phrases (NP), Prepositional Phrases (PP), Adjective Phrases (ADJP), and Adverb Phrases (ADVP).

Table 3: Fraction of ground-truth constituents that were predicted as a constituent by the models, broken down by label (i.e. label recall).

Label  PRPN   ON     C-PCFG  Tree-T  Ours
SBAR   50.0%  52.5%  56.1%   36.4%   48.7%
NP     59.2%  64.5%  74.7%   67.6%   72.1%
VP     46.7%  41.0%  41.7%   38.5%   43.0%
PP     57.2%  54.4%  68.8%   52.3%   74.1%
ADJP   44.3%  38.1%  40.4%   24.7%   51.9%
ADVP   32.8%  31.6%  52.5%   55.1%   69.5%

5.3 Unsupervised Dependency Parsing

The unsupervised dependency parsing evaluation compares the induced dependency relations with those in a reference dependency graph. The most common metric is the Unlabeled Attachment Score (UAS), which measures the percentage of tokens that are correctly attached to their parent in the reference tree. Another widely used metric for unsupervised dependency parsing is the Undirected Unlabeled Attachment Score (UUAS), which measures the percentage of undirected and unlabeled reference connections that are recovered by the induced tree. Similar to the unsupervised constituency parsing evaluation, we take the model trained on the PTB dataset and evaluate it on the WSJ test set (section 23). For the WSJ test set, reference dependency graphs are converted from its human-annotated constituency trees. However, there are two different sets of rules for the conversion: the Stanford dependencies and the CoNLL dependencies. While Stanford dependencies are used as reference dependencies in previous unsupervised parsing papers, we noticed that our model sometimes outputs dependency structures that are closer to the CoNLL dependencies.
Therefore, we report UAS and UUAS for both Stanford and CoNLL dependencies. Following the setting of previous papers (Jiang et al., 2016), we ignore punctuation during evaluation. To obtain dependency relations from our model, we compute the argmax of the dependency distribution:

k = argmax_{j≠i} pD(j|i)   (18)

and assign the k-th token as the parent of the i-th token.

Table 5: Dependency parsing results on the WSJ test set. Starred entries (*) benefit from additional punctuation-based constraints. Daggered entries (†) benefit from larger additional training data. Baseline results are from He et al. (2018).

Methods                                   UAS
w/o gold POS tags
  DMV (Klein and Manning, 2004)           35.8
  E-DMV (Headden III et al., 2009)        38.2
  UR-A E-DMV (Tu and Honavar, 2012)       46.1
  CS* (Spitkovsky et al., 2013)           64.4*
  Neural E-DMV (Jiang et al., 2016)       42.7
  Gaussian DMV (He et al., 2018)          43.1 (1.2)
  INP (He et al., 2018)                   47.9 (1.2)
  Neural L-PCFGs (Zhu et al., 2020)       40.5 (2.9)
  StructFormer                            46.2 (0.4)
w/ gold POS tags (for reference only)
  DMV (Klein and Manning, 2004)           39.7
  UR-A E-DMV (Tu and Honavar, 2012)       57.0
  MaxEnc (Le and Zuidema, 2015)           65.8
  Neural E-DMV (Jiang et al., 2016)       57.6
  CRFAE (Cai et al., 2017)                55.7
  L-NDMV† (Han et al., 2017)              63.2

Table 5 shows that our model achieves competitive dependency parsing performance compared to other models that do not require gold POS tags. While most of the baseline models still rely on some kind of latent POS tags or pre-trained word embeddings, StructFormer can be seen as an easy-to-use alternative that works in an end-to-end fashion. Table 4 shows that our model recovers 61.6% of undirected dependency relations. Given the strong performance on both dependency parsing and masked language modeling, we believe that the dependency graph schema could be a viable substitute for the complete graph schema used in the standard transformer. Appendix A.4 provides examples of the parent distribution.

Table 4: The performance of StructFormer with different combinations of attention masks. UAS stands for Unlabeled Attachment Score. UUAS stands for Undirected Unlabeled Attachment Score.

Relations   MLM PPL     Constituency UF1  Stanford UAS  Stanford UUAS  CoNLL UAS    CoNLL UUAS
parent+dep  60.9 (1.0)  54.0 (0.3)        46.2 (0.4)    61.6 (0.4)     36.2 (0.1)   56.3 (0.2)
parent      63.0 (1.2)  40.2 (3.5)        32.4 (5.6)    49.1 (5.7)     30.0 (3.7)   50.0 (5.3)
dep         63.2 (0.6)  51.8 (2.4)        15.2 (18.2)   41.6 (16.8)    20.2 (12.2)  44.7 (13.9)

Since our model uses a mixture of the relation probability distributions for each self-attention head, we also studied how different combinations of relations affect the performance of our model. Table 4 shows that the model achieves the best performance while using both parent and dependent relations. The model suffers more on dependency parsing if the parent relation is removed, and if the dependent relation is removed, the model suffers more on constituency parsing. Appendix A.3 shows the weights for the parent and dependent relations learnt from MLM tasks. It is interesting to observe that StructFormer tends to focus on the parent relations in the first layer, and starts to use both relations from the second layer.

6 Conclusion

In this paper, we introduce a novel dependency and constituency joint parsing framework. Based on this framework, we propose StructFormer, a new unsupervised parsing algorithm that performs unsupervised dependency and constituency parsing at the same time. We also introduce a novel dependency-constrained self-attention mechanism that allows each attention head to focus on a specific mixture of dependency relations. This brings Transformers closer to modeling a directed dependency graph. The experiments show promising results that StructFormer can induce meaningful dependency and constituency structures and achieve better performance on masked language model tasks. This research provides a new path toward building more linguistic bias into pre-trained language models.
References

Jiong Cai, Yong Jiang, and Kewei Tu. 2017. CRF autoencoder for unsupervised dependency parsing. arXiv preprint arXiv:1708.01018.

Shay B Cohen, Dipanjan Das, and Noah A Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 50–61.

Shay B Cohen and Noah A Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 74–82.

Hal Daumé III. 2009. Unsupervised search-based structured prediction. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 209–216.

Hiroyuki Deguchi, Akihiro Tamura, and Takashi Ninomiya. 2019. Dependency-based self-attention for transformer NMT. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 239–246.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recursive auto-encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1129–1141.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. In Proceedings of NAACL-HLT, pages 199–209.

Alexander Gelbukh, Sulema Torres, and Hiram Calvo. 2005. Transforming a constituency treebank into a dependency treebank. Procesamiento del lenguaje natural, (35):145–152.

Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, and Ben Taskar. 2010. Sparsity in dependency grammar induction. ACL 2010, page 194.

Wenjuan Han, Yong Jiang, and Kewei Tu. 2017. Dependency grammar induction with neural lexicalization and big training data. arXiv preprint arXiv:1708.00801.

Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Unsupervised learning of syntactic structure with invertible neural projections. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1292–1302.

William P Headden III, Mark Johnson, and David McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 101–109.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger P Levy. 2020. A systematic assessment of syntactic generalization in neural language models. arXiv preprint arXiv:2005.03692.

Yong Jiang, Wenjuan Han, Kewei Tu, et al. 2016. Unsupervised neural dependency parsing. Association for Computational Linguistics (ACL).

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. arXiv preprint arXiv:2002.00737.

Yoon Kim, Chris Dyer, and Alexander M Rush. 2019a. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2369–2385.

Yoon Kim, Alexander M Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019b. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117.

Dan Klein. 2005. The unsupervised learning of natural language structure. Stanford University, Stanford.

Dan Klein and Christopher D Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 128–135.

Dan Klein and Christopher D Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 478–485.
Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, and Phil Blunsom. 2020. Syntactic structure distillation pretraining for bidirectional encoders. arXiv preprint arXiv:2005.13482.

Phong Le and Willem Zuidema. 2015. Unsupervised dependency parsing: Let's use supervised parsers. arXiv preprint arXiv:1504.04666.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Hongyin Luo, Lan Jiang, Yonatan Belinkov, and James Glass. 2019. Improving neural language models by segmenting, attending, and predicting the future. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1483–1493.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826.

Tomáš Mikolov et al. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April, 80:26.

Yutaro Omote, Akihiro Tamura, and Takashi Ninomiya. 2019. Dependency-based relative positional encoding for transformer NMT. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 854–861.

John K Pate and Sharon Goldwater. 2013. Unsupervised dependency parsing with acoustic cues. Transactions of the Association for Computational Linguistics, 1:63–74.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Yikang Shen, Zhouhan Lin, Chin-wei Huang, and Aaron Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018b. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. 2018c. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations.

Valentin I Spitkovsky, Hiyan Alshawi, Angel Chang, and Dan Jurafsky. 2011. Unsupervised dependency parsing without gold part-of-speech tags. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1281–1290.

Valentin I Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2013. Breaking out of local optima with count transforms and model recombination: A study in grammar induction.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199.

Kewei Tu and Vasant Honavar. 2012. Unambiguity regularization for unsupervised learning of probabilistic grammars. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1324–1334.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. 2019. Tree transformer: Integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1060–1070.

Hao Zhu, Yonatan Bisk, and Graham Neubig. 2020. The return of lexical dependencies: Neural lexicalized PCFGs. Transactions of the Association for Computational Linguistics, 8:647–661.
A Appendix

A.1 Joint Dependency and Constituency Parsing

Algorithm 3: The joint dependency and constituency parsing algorithm. Inputs are a sequence of words w, syntactic distances d, and syntactic heights h. Outputs are a binary constituency tree T, a dependency graph D represented as a set of dependency relations, the parent of the dependency graph D, and the syntactic height of that parent.
1: function BuildTree(w, d, h)
2:   if d = [] and w = [w] and h = [h] then
3:     T ⇐ Leaf(w), D ⇐ [], parent ⇐ w, height ⇐ h
4:   else
5:     i ⇐ arg max(d)
6:     Tl, Dl, parentl, heightl ⇐ BuildTree(w≤i, d<i, h≤i)
7:     Tr, Dr, parentr, heightr ⇐ BuildTree(w>i, d>i, h>i)
8:     T ⇐ Node(Tl, Tr)
9:     D ⇐ Union(Dl, Dr)
10:    if heightl > heightr then
11:      D.add(parentl ← parentr)
12:      parent ⇐ parentl, height ⇐ heightl
13:    else
14:      D.add(parentr ← parentl)
15:      parent ⇐ parentr, height ⇐ heightr
16:  return T, D, parent, height

A.2 Calibrating the Syntactic Distance and Height

In Section 3.3, we explained the relation between ∆ and T: if δi < τj, the i-th word won't be able to attend beyond the j-th split point. However, in a specific case, this constraint will isolate a constituent [l, r] from the rest of the sentence. If τl−1 and τr are both larger than all heights δl, ..., δr in the constituent, then no word in [l, r] will be able to attend to the outside of the constituent. This phenomenon will prevent their output contextual embeddings from encoding the full context. To avoid it, we propose to calibrate the syntactic distance T according to the syntactic height ∆. First, we compute the maximum syntactic height for each constituent:

δ[l,r] = max(δl, ..., δr),  l < r   (19)

Then, we use a ReLU activation function to keep only the positive part of the margin by which the constituent's boundary distances exceed this height:

ReLU(min(τl−1 − δ[l,r], τr − δ[l,r]))   (20)

To make sure all constituents are not isolated while maintaining the rank of T, we subtract all T by the maximum of this margin, taken over all constituents except the whole sentence [1, n]:

τ̂i = τi − max_{[l,r]≠[1,n]} ReLU(min(τl−1 − δ[l,r], τr − δ[l,r]))   (21)
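For concreteness, the sketch below (ours, reusing the conventions of the sketch in Section 3) fuses Algorithms 1 and 2 into the single recursion of Algorithm 3, returning the tree, the relations, and the head of each span in one pass; the toy inputs at the bottom are invented.

def build_tree(words, distances, heights):
    # Returns (tree, relations, head, head_height) for the given span.
    if not distances:                        # a single word: leaf, no relations
        return words[0], [], words[0], heights[0]
    i = max(range(len(distances)), key=lambda k: distances[k])
    tree_l, rel_l, head_l, h_l = build_tree(words[:i + 1], distances[:i], heights[:i + 1])
    tree_r, rel_r, head_r, h_r = build_tree(words[i + 1:], distances[i + 1:], heights[i + 1:])
    relations = rel_l + rel_r
    if h_l > h_r:                            # the higher head governs the lower one
        relations.append((head_r, head_l))   # (dependent, parent)
        return (tree_l, tree_r), relations, head_l, h_l
    relations.append((head_l, head_r))
    return (tree_l, tree_r), relations, head_r, h_r

tree, rels, root, _ = build_tree(["I", "like", "cats"], [2.0, 1.0], [1.0, 2.0, 1.0])
print(tree, rels, root)   # ('I', ('like', 'cats'))  [('cats', 'like'), ('I', 'like')]  like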
A.3 Dependency Relation Weights for Self-attention Heads

Figure 4: Dependency relation weights learnt on different datasets: (a) weights learnt on PTB; (b) weights learnt on BLLIP-SM. Row i contains the relation weights for all attention heads in the i-th transformer layer; "p" represents the parent relation and "d" the dependent relation. We observe a clearer preference for each attention head in the model trained on BLLIP-SM, probably because BLLIP-SM has significantly more training data. It is also interesting to notice that the first layer tends to focus on parent relations.
A.4 Dependency Distribution Examples

Figure 5: Two dependency distribution examples, (a) and (b), from the WSJ test set. Each row is the parent distribution for the respective word. The sum of each distribution may not equal 1. Against our intuition, the distributions are not very sharp. This is partially due to the ambiguous nature of the dependency graph: as previously discussed, at least two styles of dependency rules (CoNLL and Stanford) exist, and without extra constraints or supervision, the model seems to try to model both of them at the same time. One interesting direction for future work is finding an inductive bias that encourages the model to converge to a specific style of dependency graph.
A.5 The Performance of StructFormer with different mask rates

Table 6: The performance of StructFormer on PTB dataset with different mask rates. Dependency parsing is especially affected by the masks. Mask rate 0.3 provides the best and the most stable performance.

Mask rate  MLM PPL      Constituency UF1  Stanford UAS  Stanford UUAS  CoNLL UAS    CoNLL UUAS
0.1        45.3 (1.2)   51.45 (2.7)       31.4 (11.9)   51.2 (8.1)     32.3 (5.2)   52.4 (4.5)
0.2        50.4 (1.3)   54.0 (0.6)        37.4 (12.6)   55.6 (8.8)     33.0 (5.7)   53.5 (4.7)
0.3        60.9 (1.0)   54.0 (0.3)        46.2 (0.4)    61.6 (0.4)     36.2 (0.1)   56.3 (0.2)
0.4        76.9 (1.2)   53.5 (1.5)        34.0 (10.3)   52.0 (7.4)     29.5 (5.4)   50.6 (4.1)
0.5        100.3 (1.4)  53.2 (0.9)        36.3 (9.8)    53.6 (6.8)     30.6 (4.2)   51.3 (3.2)