Attention Can Reflect Syntactic Structure (If You Let It)

Vinit Ravishankar∗†  Artur Kulmizev∗‡  Mostafa Abdou§  Anders Søgaard§  Joakim Nivre‡

† Language Technology Group, Department of Informatics, University of Oslo
‡ Department of Linguistics and Philology, Uppsala University
§ Department of Computer Science, University of Copenhagen
† vinitr@ifi.uio.no  ‡ {artur.kulmizev,joakim.nivre}@lingfil.uu.se  § {abdou,soegaard}@di.ku.dk

∗ Equal contribution. Order was decided by a coin toss.

arXiv:2101.10927v1 [cs.CL] 26 Jan 2021

Abstract

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English — a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.

1 Introduction

In recent years, the attention mechanism proposed by Bahdanau et al. (2014) has become an indispensable component of many NLP systems. Its widespread adoption was, in part, heralded by the introduction of the Transformer architecture (Vaswani et al., 2017a), which constrains a soft alignment to be learned across discrete states in the input (self-attention), rather than across input and output (e.g., Xu et al., 2015; Rocktäschel et al., 2015). The Transformer has, by now, supplanted the popular LSTM (Hochreiter and Schmidhuber, 1997) as NLP's feature-encoder-of-choice, largely due to its compatibility with parallelized training regimes and ability to handle long-distance dependencies.

Certainly, the nature of attention as a distribution over tokens lends itself to a straightforward interpretation of a model's inner workings. Bahdanau et al. (2014) illustrate this nicely in the context of seq2seq machine translation, showing that the attention learned by their models reflects expected cross-lingual idiosyncrasies between English and French, e.g., concerning word order. With self-attentive Transformers, interpretation becomes slightly more difficult, as attention is distributed across words within the input itself. This is further compounded by the use of multiple layers and heads, each combination of which yields its own alignment, representing a different (possibly redundant) view of the data. Given the similarity of such attention matrices to the score matrices employed in arc-factored dependency parsing (McDonald et al., 2005a,b), a salient question concerning interpretability becomes: Can we expect some combination of these parameters to capture linguistic structure in the form of a dependency tree, especially if the model performs well on NLP tasks? If not, can we relax the expectation and examine the extent to which subcomponents of the linguistic structure, such as subject-verb relations, are represented? This prospect was first posed by Raganato et al. (2018) for MT encoders and later explored by Clark et al. (2019) for BERT. Ultimately, the consensus of these and other studies (Voita et al., 2019; Htut et al., 2019; Limisiewicz et al., 2020) was that, while there appears to exist no "generalist" head responsible for extracting full dependency structures, standalone heads often specialize in capturing individual grammatical relations.
Unfortunately, most of such studies focused their experiments entirely on English, which is typologically favored to succeed in such scenarios due to its rigid word order and lack of inflectional morphology. It remains to be seen whether the attention patterns of such models can capture structural features across typologically diverse languages, or if the reported experiments on English are a misrepresentation of local positional heuristics as such. Furthermore, though previous work has investigated how attention patterns might change after fine-tuning on different tasks (Htut et al., 2019), a recent debate about attention as an explanatory mechanism (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) has cast the entire enterprise in doubt. Indeed, it remains to be seen whether fine-tuning on an explicit structured prediction task, e.g. dependency parsing, can force attention to represent the structure being learned, or if the patterns observed in pretrained models are not altered in any meaningful way.

To address these issues, we investigate the prospect of extracting linguistic structure from the attention weights of multilingual Transformer-based language models. In light of the surveyed literature, our research questions are as follows:

1. Can we decode dependency trees for some languages better than others?
2. Do the same layer–head combinations track the same relations across languages?
3. How do attention patterns change after fine-tuning with explicit syntactic annotation?
4. Which components of the model are involved in these changes?

In answering these questions, we believe we can shed further light on the (cross-)linguistic properties of Transformer-based language models, as well as address the question of attention patterns being a reliable representation of linguistic structure.

2 Attention as Structure

Transformers  The focus of the present study is mBERT, a multilingual variant of the exceedingly popular language model (Devlin et al., 2019). BERT is built upon the Transformer architecture (Vaswani et al., 2017b), which is a self-attention-based encoder-decoder model (though only the encoder is relevant to our purposes). A Transformer takes a sequence of vectors x = [x_1, x_2, ..., x_n] as input and applies a positional encoding to them, in order to retain the order of words in a sentence. These inputs are then transformed into query (Q), key (K), and value (V) vectors via three separate linear transformations and passed to an attention mechanism. A single attention head computes scaled dot-product attention between K and Q, outputting a weighted sum of V:

    Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

For multi-head attention (MHA), the same process is repeated for k heads, allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017b). Ultimately, the output of all heads is concatenated and passed through a linear projection W^O:

    H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

    MHA(Q, K, V) = concat(H_1, H_2, ..., H_k) W^O    (3)

Every layer also consists of a feed-forward network (FFN), consisting of two dense layers with ReLU activation functions. For each layer, therefore, the output of MHA is passed through a LayerNorm with residual connections, passed through the FFN, and then through another LayerNorm with residual connections.
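As a concrete illustration of Equations (1)–(3), the following NumPy sketch implements scaled dot-product attention and multi-head attention. It is a minimal reference implementation of the formulas above rather than mBERT's actual code; the toy dimensions and the random weight matrices are our own placeholder assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Eq. (1): softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention matrix of one head
    return weights @ V, weights

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # Eqs. (2)-(3): one head per (W_Q[i], W_K[i], W_V[i]); outputs are concatenated
    # and projected by W_O.
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        H_i, _ = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
        heads.append(H_i)
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example with mBERT-like sizes: n = 5 tokens, model dim 768, k = 12 heads of size 64.
n, d_model, k, d_k = 5, 768, 12, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(k, d_model, d_k)) * 0.02 for _ in range(3))
W_O = rng.normal(size=(k * d_k, d_model)) * 0.02
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 768)
```

For the decoding experiments below, it is the per-head (n × n) weight matrices, one for each of the 12 × 12 layer-head combinations, that are of interest.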
Searching for structure  Often, the line of inquiry regarding interpretability in NLP has been concerned with extracting and analyzing linguistic information from neural network models of language (Belinkov and Glass, 2019). Recently, such investigations have targeted Transformer models (Hewitt and Manning, 2019; Rosa and Mareček, 2019; Tenney et al., 2019), at least in part because the self-attention mechanism employed by these models offers a possible window into their inner workings.

With large-scale machine translation and language models being openly distributed for experimentation, several researchers have wondered if self-attention is capable of representing syntactic structure, despite not being trained with any overt parsing objective. In pursuit of this question, Raganato et al. (2018) applied a maximum-spanning-tree algorithm over the attention weights of several trained MT models, comparing them with gold trees from Universal Dependencies (Nivre et al., 2016, 2020). They found that, while the accuracy was not comparable to that of a supervised parser, it was nonetheless higher than several strong baselines, implying that some structure was consistently represented. Clark et al. (2019) corroborated the same findings for BERT when decoding full trees, but observed that individual dependency relations were often tracked by specialized heads and were decodable with much higher accuracy than some fixed-offset baselines. Concurrently, Voita et al. (2019) made a similar observation about heads specializing in specific dependency relations, proposing a coarse taxonomy of head attention functions: positional, where heads attend to adjacent tokens; syntactic, where heads attend to specific syntactic relations; and rare words, where heads point to the least frequent tokens in the sentence. Htut et al. (2019) followed Raganato et al. (2018) in decoding dependency trees from BERT-based models, finding that fine-tuning on two classification tasks did not produce syntactically plausible attention patterns. Lastly, Limisiewicz et al. (2020) modified UD annotation to better represent attention patterns and introduced a supervised head-ensembling method for consolidating shared syntactic information across heads.

Does attention have explanatory value?  Though many studies have yielded insight about how attention behaves in a variety of models, the question of whether it can be seen as a "faithful" explanation of model predictions has been subject to much recent debate. For example, Jain and Wallace (2019) present compelling arguments that attention does not offer a faithful explanation of predictions. Primarily, they demonstrate that there is little correlation between standard feature importance measures and attention weights. Furthermore, they contend that there exist counterfactual attention distributions, which are substantially different from learned attention weights but that do not alter a model's predictions. Using a similar methodology, Serrano and Smith (2019) corroborate that attention does not provide an adequate account of an input component's importance.

In response to these findings, Wiegreffe and Pinter (2019) question the assumptions underlying such claims. Attention, they argue, is not a primitive, i.e., it cannot be detached from the rest of a model's components as is done in the experiments of Jain and Wallace (2019). They propose a set of four analyses to test whether a given model's attention mechanism can provide meaningful explanation and demonstrate that the alternative attention distributions found via adversarial training methods do, in fact, perform poorly compared to standard attention mechanisms. On a theoretical level, they argue that, although attention weights do not give an exclusive "faithful" explanation, they do provide a meaningful plausible explanation.

This discussion is relevant to our study because it remains unclear whether attending to syntactic structure serves, in practice, as a plausible explanation for model behavior, or whether it is even capable of serving as such. Indeed, the studies of Raganato et al. (2018) and Clark et al. (2019) relate a convincing but incomplete picture — tree decoding accuracy just marginally exceeds baselines, and various relations tend to be tracked across varying heads and layers. Thus, our fine-tuning experiments (detailed in the following section) serve to enable an "easy" setting wherein we explicitly inform our models of the same structure that we are trying to extract. We posit that, if, after fine-tuning, syntactic structures were still not decodable from the attention weights, one could safely conclude that these structures are being stored via a non-transparent mechanism that may not even involve attention weights. Such an insight would allow us to conclude that attention weights cannot provide even a plausible explanation for models relying on syntax.

3 Experimental Design

To examine the extent to which we can decode dependency trees from attention patterns, we run a tree decoding algorithm over mBERT's attention heads — before and after fine-tuning via a parsing objective. We surmise that doing so will enable us to determine if attention can be interpreted as a reliable mechanism for capturing linguistic structure.

3.1 Model

We employ mBERT¹ in our experiments, which has been shown to perform well across a variety of NLP tasks (Hu et al., 2020; Kondratyuk and Straka, 2019a) and to capture aspects of syntactic structure cross-lingually (Pires et al., 2019; Chi et al., 2020). mBERT features 12 layers with 768 hidden units and 12 attention heads, with a joint WordPiece sub-word vocabulary across languages. The model was trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias, where principled sampling was employed to enforce a balance between high- and low-resource languages.

¹ https://github.com/google-research/bert
3.2 Decoding Algorithm

For decoding dependency trees, we follow Raganato et al. (2018) in applying the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu, 1965) to every layer/head combination available in mBERT (12 × 12 = 144 in total). In order for the matrices to correspond to gold treebank tokenization, we remove the cells corresponding to the BERT delimiter tokens ([CLS] and [SEP]). In addition to this, we sum the columns and average the rows corresponding to the constituent subwords of gold tokens, respectively (Clark et al., 2019). Lastly, since attention patterns across heads may differ in whether they represent heads attending to their dependents or vice versa, we take our input to be the element-wise product of a given attention matrix and its transpose (A ∘ A^T). We liken this to the joint probability of a head attending to its dependent and a dependent attending to its head, similarly to Limisiewicz et al. (2020). Per this point, we also follow Htut et al. (2019) in evaluating the decoded trees via Undirected Unlabeled Attachment Score (UUAS) — the percentage of undirected edges recovered correctly. Since we discount directionality, this is effectively a less strict measure than UAS, but one that has a long tradition in unsupervised dependency parsing since Klein and Manning (2004).
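The per-head decoding step can be sketched as follows. Because UUAS ignores edge direction, this sketch extracts an undirected maximum spanning tree from the symmetrized matrix A ∘ A^T (via SciPy's minimum spanning tree on transformed weights) rather than running Chu-Liu-Edmonds proper, which suffices when only undirected edges are scored. The helper names are ours, and the attention matrix is assumed to be already aligned with gold tokenization as described above.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def decode_undirected_tree(attention):
    """Decode an undirected spanning tree from one head's (n x n) attention matrix."""
    sym = attention * attention.T              # A ∘ A^T: joint head/dependent weight
    np.fill_diagonal(sym, 0.0)
    # Turn the maximum-spanning-tree problem into a minimum one with strictly
    # positive weights (SciPy treats zeros in a dense matrix as "no edge").
    weights = (sym.max() + 1.0) - sym
    np.fill_diagonal(weights, 0.0)
    mst = minimum_spanning_tree(weights).toarray()
    rows, cols = np.nonzero(mst)
    return {tuple(sorted((i, j))) for i, j in zip(rows, cols)}

def uuas(predicted_edges, gold_heads):
    """UUAS: fraction of gold (head, dependent) pairs recovered, ignoring direction.

    gold_heads[i] is the 0-based index of token i's head, or None for the root.
    """
    gold_edges = {tuple(sorted((i, h))) for i, h in enumerate(gold_heads) if h is not None}
    return len(predicted_edges & gold_edges) / len(gold_edges)

# Toy example: 4 tokens, a head that mostly attends along the gold arcs.
A = np.array([[0.1, 0.7, 0.1, 0.1],
              [0.6, 0.1, 0.2, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.7, 0.1]])
edges = decode_undirected_tree(A)
print(edges, uuas(edges, gold_heads=[1, None, 1, 2]))  # perfect recovery: UUAS = 1.0
```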
3.3 Data

For our data, we employ the Parallel Universal Dependencies (PUD) treebanks, as collected in UD v2.4 (Nivre et al., 2019). PUD was first released as part of the CoNLL 2017 shared task (Zeman et al., 2018), containing 1000 parallel sentences, which were (professionally) translated from English, German, French, Italian, and Spanish to 14 other languages. The sentences are taken from two domains, news and wikipedia, the latter implying some overlap with mBERT's training data (though we did not investigate this). We include all PUD treebanks except Thai.²

² Thai is the only treebank that does not have a non-PUD treebank available in UD, which we need for our fine-tuning experiments.

3.4 Fine-Tuning Details

In addition to exploring pretrained mBERT's attention weights, we are also interested in how attention might be guided by a training objective that learns the exact tree structure we aim to decode. To this end, we employ the graph-based decoding algorithm of the biaffine parser introduced by Dozat and Manning (2016). We replace the standard BiLSTM encoder for this parser with the entire mBERT network, which we fine-tune with the parsing loss. The full parser decoder consists of four dense layers, two for head/child representations for dependency arcs (dim. 500) and two for head/child representations for dependency labels (dim. 100). These are transformed into the label space via a bilinear transform.
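As a rough sketch of the arc-scoring component just described, the PyTorch module below implements a biaffine arc scorer with head/dependent representations of dimension 500, in the spirit of Dozat and Manning (2016). It is a simplified stand-in rather than the authors' parser: the label scorer, masking, and training loss are omitted, and all names and defaults are ours.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Score every (dependent, head) pair from encoder states, following the biaffine recipe."""

    def __init__(self, enc_dim=768, arc_dim=500):
        super().__init__()
        # Separate MLPs for the "head" and "dependent" views of each token.
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.empty(arc_dim, arc_dim))   # bilinear term
        self.u = nn.Parameter(torch.zeros(arc_dim))            # head-only bias term
        nn.init.xavier_uniform_(self.U)

    def forward(self, states):
        # states: (batch, n, enc_dim) contextual vectors, e.g. mBERT's top layer.
        h = self.head_mlp(states)   # (batch, n, arc_dim) candidate-head representations
        d = self.dep_mlp(states)    # (batch, n, arc_dim) dependent representations
        # scores[b, i, j] = d_i^T U h_j + u^T h_j: plausibility of token j heading token i.
        bilinear = torch.einsum("bia,ac,bjc->bij", d, self.U, h)
        bias = torch.einsum("c,bjc->bj", self.u, h).unsqueeze(1)
        return bilinear + bias      # (batch, n, n) arc scores

scorer = BiaffineArcScorer()
print(scorer(torch.randn(2, 7, 768)).shape)  # torch.Size([2, 7, 7])
```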
After training the parser, we can decode the fine-tuned mBERT parameters in the same fashion as described in Section 3.2. We surmise that, if attention heads are capable of tracking hierarchical relations between words in any capacity, it is precisely in this setting that this ability would be attested. In addition to this, we are interested in which individual components of the mBERT network are capable of steering attention patterns towards syntactic structure. We believe that addressing this question will help us not only in interpreting decisions made by BERT-based neural parsers, but also in aiding us in developing syntax-aware models in general (Strubell et al., 2018; Swayamdipta et al., 2018). As such — beyond fine-tuning all parameters of the mBERT network (our basic setting) — we perform a series of ablation experiments wherein we update only one set of parameters per training cycle, e.g. the query weights W_i^Q, and leave everything else frozen. This gives us a set of 6 models, described below. For each model, all non-BERT parser components are always left unfrozen.

• KEY: only the K components of the transformer are unfrozen; these are the representations of tokens that are paying attention to other tokens.
• QUERY: only the Q components are unfrozen; these, conversely, are the representations of tokens being paid attention to.
• KQ: both keys and queries are unfrozen.
• VALUE: semantic value vectors per token (V) are unfrozen; they are composed after being weighted with attention scores obtained from the K/Q matrices.
• DENSE: the dense feed-forward networks in the attention mechanism; all three per layer are unfrozen.
• NONE: the basic setting with nothing frozen; all parameters are updated with the parsing loss.

We fine-tune each of these models on a concatenation of all PUD treebanks for 20 epochs, which effectively makes our model multilingual. We do so in order to 1) control for domain and annotation confounds, since all PUD sentences are parallel and are natively annotated (unlike converted UD treebanks, for instance); 2) increase the number of training samples for fine-tuning, as each PUD treebank features only 1000 sentences; and 3) induce a better parser through multilinguality, as in Kondratyuk and Straka (2019b). Furthermore, in order to gauge the overall performance of our parser across all ablated settings, we evaluate on the test set of the largest non-PUD treebank available for each language, since PUD only features test partitions. When training, we employ a combined dense/sparse Adam optimiser, at a learning rate of 3 × 10⁻⁵. We rescale gradients to have a maximum norm of 5.
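The ablation settings can be expressed as parameter-freezing rules. The sketch below assumes the Hugging Face Transformers naming scheme for BERT parameters (attention.self.query, attention.self.key, attention.self.value, and the dense sublayers); the mapping from the settings above to these name patterns is our assumption, not a detail taken from the paper, and the non-BERT parser modules are assumed to be kept trainable separately.

```python
from transformers import BertModel

# Parameter-name substrings per ablation setting (Hugging Face BERT naming assumed).
COMPONENTS = {
    "KEY":   ["attention.self.key"],
    "QUERY": ["attention.self.query"],
    "KQ":    ["attention.self.key", "attention.self.query"],
    "VALUE": ["attention.self.value"],
    "DENSE": ["attention.output.dense", "intermediate.dense", "output.dense"],
    "NONE":  [],  # nothing frozen: every mBERT parameter stays trainable
}

def freeze_all_but(model, setting):
    """Freeze every mBERT parameter except the chosen component group."""
    patterns = COMPONENTS[setting]
    for name, param in model.named_parameters():
        if setting == "NONE":
            param.requires_grad = True
        else:
            param.requires_grad = any(p in name for p in patterns)

model = BertModel.from_pretrained("bert-base-multilingual-cased")
freeze_all_but(model, "VALUE")
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable))  # 12 layers x (weight + bias) for the value projections = 24
```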
Table 1: Adjacent-branching baseline and maximum UUAS decoding accuracy per PUD treebank, expressed as the best score followed by the layer-head combination that achieves it (in parentheses). PRE refers to the basic mBERT model before fine-tuning, while all rows below it correspond to the different fine-tuned models described in Section 3.4. (The best score per language is indicated in bold in the original.)

            AR          CS          DE          EN          ES          FI          FR          HI          ID
BASELINE    50          40          36          36          40          42          40          46          47
PRE         53 (7-6)    53 (10-8)   49 (10-8)   47 (10-8)   50 (9-5)    48 (10-8)   41 (2-3)    48 (2-3)    50 (9-5)
NONE        76 (11-10)  78 (11-10)  76 (11-10)  71 (10-11)  77 (10-11)  66 (10-11)  45 (11-10)  72 (11-10)  75 (11-10)
KEY         62 (10-8)   64 (10-8)   58 (11-12)  53 (10-8)   59 (11-12)  56 (10-8)   41 (7-12)   54 (10-8)   59 (10-8)
QUERY       69 (11-4)   74 (10-8)   70 (11-4)   66 (11-4)   73 (11-4)   63 (10-8)   42 (11-4)   62 (11-4)   67 (11-4)
KQ          71 (11-4)   76 (11-4)   70 (11-4)   65 (11-4)   74 (11-4)   62 (11-4)   43 (10-11)  64 (11-4)   69 (11-4)
VALUE       75 (12-5)   72 (12-5)   72 (12-5)   64 (12-5)   76 (12-5)   59 (12-5)   45 (12-5)   63 (12-5)   73 (12-5)
DENSE       68 (11-10)  71 (11-10)  65 (11-10)  60 (10-8)   67 (12-10)  61 (11-10)  42 (10-8)   65 (11-10)  66 (11-10)

            IT          JA          KO          PL          PT          RU          SV          TR          ZH
BASELINE    40          43          55          45          41          42          39          52          41
PRE         41 (6-4)    45 (2-3)    64 (9-2)    52 (10-8)   50 (9-5)    51 (10-8)   51 (10-8)   55 (3-8)    42 (2-3)
NONE        58 (11-10)  42 (11-10)  64 (11-10)  75 (11-10)  76 (11-10)  75 (10-8)   74 (10-8)   55 (3-8)    38 (2-3)
KEY         47 (9-2)    44 (2-3)    62 (10-8)   64 (10-8)   58 (11-12)  61 (10-8)   59 (12-10)  55 (3-12)   41 (2-3)
QUERY       54 (11-4)   45 (2-3)    65 (10-8)   72 (11-4)   70 (11-4)   70 (10-8)   68 (11-4)   56 (10-8)   42 (2-3)
KQ          55 (11-4)   44 (2-3)    64 (11-4)   73 (11-4)   73 (11-4)   69 (11-4)   69 (11-4)   55 (11-4)   41 (2-3)
VALUE       55 (12-5)   45 (2-3)    66 (10-8)   73 (12-5)   74 (12-5)   69 (12-5)   65 (12-5)   57 (12-5)   42 (3-8)
DENSE       49 (9-5)    44 (3-12)   64 (11-10)  70 (11-10)  64 (12-5)   67 (11-10)  64 (11-10)  55 (11-10)  40 (3-12)

4 Decoding mBERT Attention

The second row of Table 1 (PRE) depicts the UUAS after running our decoding algorithm over mBERT attention matrices, per language. We see a pattern familiar from Clark et al. (2019), among others — namely, that attention patterns extracted directly from mBERT appear to be incapable of decoding dependency trees beyond a threshold of 50–60% UUAS accuracy. However, we also note that, in all languages, the attention-decoding algorithm outperforms a BASELINE (row 1) that draws an (undirected) edge between any two adjacent words in linear order, which implies that some non-linear structures are captured with regularity. Indeed, head 8 in layer 10 appears to be particularly strong in this regard, returning the highest UUAS for 7 languages. Interestingly, the accuracy patterns across layers depicted in Figure 1 tend to follow an identical trend for all languages, with nearly all heads in layer 7 returning high within-language accuracies.

Figure 1: UUAS of MST decoding per layer and head, across languages. Heads (y-axis) are sorted by accuracy for easier visualization.

It appears that attention for some languages (Arabic, Czech, Korean, Turkish) is comparatively easier to decode than for others (French, Italian, Japanese, Chinese). A possible explanation for this result is that dependency relations between content words, which are favored by the UD annotation, are more likely to be adjacent in the morphologically rich languages of the first group (without intervening function words). This assumption seems to be corroborated by the high baseline scores for Arabic, Korean and Turkish (but not Czech). Conversely, the low baseline scores and the likewise low decoding accuracies for the latter four languages are difficult to characterize. Indeed, we could not identify what factors — typological, annotation, tokenization or otherwise — would set French and Italian apart from the remaining languages in terms of score. However, we hypothesize that the tokenization and our treatment of subword tokens play a part in attempting to decode attention from Chinese and Japanese representations. Per the mBERT documentation,³ Chinese and Japanese Kanji character spans within the CJK Unicode range are character-tokenized. This lies in contrast with all other languages (Korean Hangul and Japanese Hiragana and Katakana included), which rely on whitespace and WordPiece (Wu et al., 2016). It is thus possible that the attention distributions for these two languages (at least where CJK characters are relevant) are devoted to composing words, rather than structural relations, which will distort the attention matrices that we compute to correspond with gold tokenization (e.g. by maxing rows and averaging columns).

³ https://github.com/google-research/bert/blob/master/multilingual.md

Relation analysis  We can disambiguate what sort of structures are captured with regularity by looking at the UUAS returned per dependency relation. Figure 2 (left) shows that adjectival modifiers (amod, mean UUAS = 85 ± 12) and determiners (det, 88 ± 6) are among the easiest relations to decode across languages. Indeed, words that are connected by these relations are often adjacent to each other and may be simple to decode if a head is primarily concerned with tracking linear order. To verify the extent to which this might be happening, we plot the aforementioned decoding accuracy as a function of select relations' positional baselines in Figure 2 (right). The positional baselines, in this case, are calculated by picking the most frequent offset at which a dependent occurs with respect to its head, e.g., −1 for det in English, meaning one position to the left of the head. Interestingly, while we observe significant variation across the positional baselines for amod and det, the decoding accuracy remains quite high.

Figure 2: Left: UUAS per relation across languages (best layer/head combination indicated in cell). Right: Best UUAS as a function of best positional baseline (derived from the treebank), selected relations.
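The positional baseline described above can be computed directly from treebank statistics. The sketch below assumes a minimal (head_index, deprel) encoding of parsed sentences and reports, per relation, the most frequent signed offset of the dependent relative to its head together with the fraction of arcs that offset recovers; the data format and function names are illustrative.

```python
from collections import Counter, defaultdict

def positional_baseline(sentences):
    """Per relation: the most frequent dependent-minus-head offset and its hit rate.

    `sentences` is a list of sentences; each token is a (head_index, deprel)
    tuple with 1-based head indices and 0 marking the root.
    """
    offsets = defaultdict(Counter)
    for sent in sentences:
        for dep_pos, (head, rel) in enumerate(sent, start=1):
            if head != 0:
                offsets[rel][dep_pos - head] += 1
    baseline = {}
    for rel, counts in offsets.items():
        best_offset, hits = counts.most_common(1)[0]
        baseline[rel] = (best_offset, hits / sum(counts.values()))
    return baseline

# Toy example: "the dog barked" -> det(dog, the), nsubj(barked, dog), root(barked)
sentences = [[(2, "det"), (3, "nsubj"), (0, "root")]]
print(positional_baseline(sentences))
# {'det': (-1, 1.0), 'nsubj': (-1, 1.0)}
```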
In slight contrast to this, the core subject (nsubj, 58 ± 16 SD) and object (obj, 64 ± 13) relations prove to be more difficult to decode. Unlike the aforementioned relations, nsubj and obj are much more sensitive to the word order properties of the language at hand. For example, while a language like English, with Subject-Verb-Object (SVO) order, might have the subject frequently appear to the left of the verb, an SOV language like Hindi might have it several positions further away, with an object and its potential modifiers intervening. Indeed, the best positional baseline for English nsubj is 39 UUAS, while it is only 10 for Hindi. Despite this variation, the relation seems to be tracked with some regularity by the same head (layer 3, head 9), returning 60 UUAS for English and 52 for Hindi. The same can largely be said for obj, where the positional baselines return 51 ± 18. In this latter case, however, the heads tend to be distributed quite differently across languages. Finally, the results for the obj relation provide some support for our earlier explanation concerning morphologically rich languages, as Arabic, Czech, Korean and Turkish all have among the highest accuracies (as well as positional baselines).

5 Fine-Tuning Experiments

Next, we investigate the effect fine-tuning has on UUAS decoding. Row 3 in Table 1 (NONE) indicates that fine-tuning does result in large improvements to UUAS decoding across most languages, often by margins as high as ~30%. This shows that with an explicit parsing objective, attention heads are capable of serving as explanatory mechanisms for syntax; syntactic structure can be made to be transparently stored in the heads, in a manner that does not require additional probe fitting or parameterized transformation to extract.

Given that we do manage to decode reasonable syntactic trees, we can then refine our question — which components are capable of learning these trees? One obvious candidate is the key/query component pair, given that attention weights are a scaled softmax of a composition of the two. Figure 3 (top) shows the difference between pretrained UUAS and fine-tuned UUAS per layer, across models and languages. Interestingly, the best parsing accuracies do not appear to vary much depending on what component is frozen. We do see a clear trend, however, in that decoding the attention patterns of the fine-tuned model typically yields better UUAS than the pretrained model, particularly in the highest layers. Indeed, the lowest layer at which fine-tuning appears to improve decoding is layer 7. This implies that, regardless of which component remains frozen, the parameters facing any sort of significant and positive update tend to be those appearing towards the higher end of the network, closer to the output.

For the frozen components, the best improvements in UUAS are seen at the final layer in VALUE, which is also the only model that shows consistent improvement, as well as the highest average improvement in mean scores⁴ for the last few layers. Perhaps most interestingly, the mean UUAS (Figure 3, bottom) for our "attentive" components – keys, queries, and their combination – does not appear to have improved by much after fine-tuning. In contrast, the maximum does show considerable improvement; this seems to imply that although all components appear to be more or less equally capable of learning decodable heads, the attentive components, when fine-tuned, appear to sharpen fewer heads.

Note that the only difference between keys and queries in an attention mechanism is that keys are transposed to index attention from/to appropriately. Surprisingly, KEY and QUERY appear to act somewhat differently, with QUERY being almost uniformly better than KEY with the best heads, whilst KEY is slightly better with averages, implying distinctions in how both store information. Furthermore, allowing both keys and queries to update seems to result in an interesting contradiction – the ultimate layer, which has reasonable maximums and averages for both KEY and QUERY, now seems to show a UUAS drop almost uniformly. This is also true for the completely unfrozen encoder.

⁴ The inner average is over all heads; the outer is over all languages.

Figure 3: (Top) Best scores across all heads, per language; (bottom) mean scores across all heads, per language. The languages (hidden from the x-axis for brevity) are, in order: ar, cs, de, en, es, fi, fr, hi, id, it, ja, ko, pl, pt, ru, sv, tr, zh.
Supervised Parsing  In addition to decoding trees from attention matrices, we also measure supervised UAS/LAS on a held-out test set.⁵ Based on Figure 4, it is apparent that all settings result in generally the same UAS. This is somewhat expected; Lauscher et al. (2020) see better results on parsing with the entire encoder frozen, implying that the task is easy enough for a biaffine parser to learn, given frozen mBERT representations.⁶ The LAS distinction is, however, rather interesting: there is a marked difference between how important the dense layers are, as opposed to the attentive components. This is likely not reflected in our UUAS probe as, strictly speaking, labelling arcs is not equivalent to searching for structure in sentences, but more akin to classifying pre-identified structures. We also note that DENSE appears to be better than NONE on average, implying that non-dense components might actually be hurting labelling capacity.

⁵ Note that the test set in our scenario is from the actual, non-parallel language treebank; as such, we left Korean out of this comparison due to annotation differences.
⁶ Due to training on concatenated PUD sets, however, our results are not directly comparable.

Figure 4: Mean UAS and LAS when evaluating different models on language-specific treebanks (Korean excluded due to annotation differences). MBERT refers to models where the entire mBERT network is frozen as input to the parser.

In brief, consolidating the two sets of results above, we can draw three interesting conclusions about the components:

1. Value vectors are best aligned with syntactic dependencies; this is reflected both in the best head at the upper layers and in the average score across all heads.
2. Dense layers appear to have moderate informative capacity, but appear to have the best learning capacity for the task of arc labelling.
3. Perhaps most surprisingly, key and query vectors do not appear to make any outstanding contributions, save for sharpening a smaller subset of heads.

Our last result is especially surprising for UUAS decoding. Keys and queries, fundamentally, combine to form the attention weight matrix, which is precisely what we use to decode trees. One would expect that allowing these components to learn from labelled syntax would result in the best improvements to decoding, but all three have surprisingly negligible mean improvements. This indicates that we need to further improve our understanding of how attentive structure and weighting really works.

Cross-linguistic observations  We notice no clear cross-linguistic trends here across different component sets; however, certain languages do stand out as being particularly hard to decode from the fine-tuned parser. These include Japanese, Korean, Chinese, French and Turkish. For the first three, we hypothesise that tokenization clashes with mBERT's internal representations may play a role.
Indeed, as we hypothesized in Section 3.2, it could be the case that the composition of CJK characters into gold tokens for Chinese and Japanese may degrade the representations (and their corresponding attention) therein. Furthermore, for Japanese and Korean specifically, it has been observed that tokenization strategies employed by different treebanks could drastically influence the conclusions one may draw about their inherent hierarchical structure (Kulmizev et al., 2020). Turkish and French are admittedly more difficult to diagnose. Note, however, that we fine-tuned our model on a concatenation of all PUD treebanks. As such, any deviation from PUD's annotation norms is likely to be heavily penalised, by virtue of signal from other languages drowning out these differences.

6 Conclusion

In this study, we revisited the prospect of decoding dependency trees from the self-attention patterns of Transformer-based language models. We elected to extend our experiments to 18 languages in order to gain better insight into how tree decoding accuracy might be affected in the face of (modest) typological diversity. Surprisingly, across all languages, we were able to decode dependency trees from attention patterns more accurately than an adjacent-linking baseline, implying that some structure was indeed being tracked by the mechanism. In looking at specific relation types, we corroborated previous studies in showing that particular layer-head combinations tracked the same relation with regularity across languages, despite typological differences concerning word order, etc.

In investigating the extent to which attention can be guided to properly capture structural relations between input words, we fine-tuned mBERT as input to a dependency parser. This, we found, yielded large improvements over the pretrained attention patterns in terms of decoding accuracy, demonstrating that the attention mechanism was learning to represent the structural objective of the parser. In addition to fine-tuning the entire mBERT network, we conducted a series of experiments wherein we updated only select components of the model and left the remainder frozen. Most surprisingly, we observed that the Transformer parameters designed for composing the attention matrix, K and Q, were only modestly capable of guiding the attention towards resembling the dependency structure. In contrast, it was the value (V) parameters, which are used for computing a weighted sum over the KQ-produced attention, that yielded the most faithful representations of the linguistic structure via attention.

Though prior work (Kovaleva et al., 2019; Zhao and Bethard, 2020) seems to indicate a lack of substantial change in attention patterns after fine-tuning on syntax- and semantics-oriented classification tasks, the opposite effect has been observed with fine-tuning on negation scope resolution, where a more explanatory attention mechanism can be induced (Htut et al., 2019). Our results are similar to the latter, and we demonstrate that, given explicit syntactic annotation, attention weights do end up storing more transparently decodable structure. It is, however, still unclear which sets of Transformer parameters are best suited for learning this information and storing it in the form of attention.

Acknowledgements

Our experiments were run on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway, under the NeIC-NLPL umbrella. Mostafa and Anders were funded by a Google Focused Research Award. We would like to thank Daniel Dakota and Ali Basirat for some fruitful discussions and the anonymous reviewers for their excellent feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 479–486.

Dan Kondratyuk and Milan Straka. 2019a. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Dan Kondratyuk and Milan Straka. 2019b. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4364–4373, Hong Kong, China. Association for Computational Linguistics.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual Transformers. arXiv preprint arXiv:2005.00633.

Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. 2020. Universal Dependencies according to BERT: Both more specific and more general. arXiv preprint arXiv:2004.14620.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Dan Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Rudolf Rosa and David Mareček. 2019. Inducing syntactic trees from BERT representations. arXiv preprint arXiv:1906.11511.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. arXiv preprint arXiv:1706.03762.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

Yiyun Zhao and Steven Bethard. 2020. How does BERT's attention change when you fine-tune? An analysis methodology and a case study in negation scope.
Figure 5: Positional scores across relations for all languages.
Figure 6: Decoding UUAS as a function of best positional baselines.

Figure 7: Parsing scores across components and languages.