Attention Can Reflect Syntactic Structure (If You Let It)

Vinit Ravishankar∗†  Artur Kulmizev∗‡  Mostafa Abdou§  Anders Søgaard§  Joakim Nivre‡

† Language Technology Group, Department of Informatics, University of Oslo
‡ Department of Linguistics and Philology, Uppsala University
§ Department of Computer Science, University of Copenhagen
† vinitr@ifi.uio.no  ‡ {artur.kulmizev,joakim.nivre}@lingfil.uu.se  § {abdou,soegaard}@di.ku.dk

∗ Equal contribution. Order was decided by a coin toss.

arXiv:2101.10927v1 [cs.CL] 26 Jan 2021

Abstract

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English — a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.

1 Introduction

In recent years, the attention mechanism proposed by Bahdanau et al. (2014) has become an indispensable component of many NLP systems. Its widespread adoption was, in part, heralded by the introduction of the Transformer architecture (Vaswani et al., 2017a), which constrains a soft alignment to be learned across discrete states in the input (self-attention), rather than across input and output (e.g., Xu et al., 2015; Rocktäschel et al., 2015). The Transformer has, by now, supplanted the popular LSTM (Hochreiter and Schmidhuber, 1997) as NLP's feature-encoder-of-choice, largely due to its compatibility with parallelized training regimes and ability to handle long-distance dependencies.

Certainly, the nature of attention as a distribution over tokens lends itself to a straightforward interpretation of a model's inner workings. Bahdanau et al. (2014) illustrate this nicely in the context of seq2seq machine translation, showing that the attention learned by their models reflects expected cross-lingual idiosyncrasies between English and French, e.g., concerning word order. With self-attentive Transformers, interpretation becomes slightly more difficult, as attention is distributed across words within the input itself. This is further compounded by the use of multiple layers and heads, each combination of which yields its own alignment, representing a different (possibly redundant) view of the data. Given the similarity of such attention matrices to the score matrices employed in arc-factored dependency parsing (McDonald et al., 2005a,b), a salient question concerning interpretability becomes: Can we expect some combination of these parameters to capture linguistic structure in the form of a dependency tree, especially if the model performs well on NLP tasks? If not, can we relax the expectation and examine the extent to which subcomponents of the linguistic structure, such as subject-verb relations, are represented? This prospect was first posed by Raganato et al. (2018) for MT encoders and later explored by Clark et al. (2019) for BERT. Ultimately, the consensus of these and other studies (Voita et al., 2019; Htut et al., 2019; Limisiewicz et al., 2020) was that, while there appears to exist no "generalist" head responsible for extracting full dependency structures, standalone heads often specialize in capturing individual grammatical relations.
Unfortunately, most of such studies focused their experiments entirely on English, which is typologically favored to succeed in such scenarios due to its rigid word order and lack of inflectional morphology. It remains to be seen whether the attention patterns of such models can capture structural features across typologically diverse languages, or if the reported experiments on English are a misrepresentation of local positional heuristics as such. Furthermore, though previous work has investigated how attention patterns might change after fine-tuning on different tasks (Htut et al., 2019), a recent debate about attention as an explanatory mechanism (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) has cast the entire enterprise in doubt. Indeed, it remains to be seen whether fine-tuning on an explicit structured prediction task, e.g. dependency parsing, can force attention to represent the structure being learned, or if the patterns observed in pretrained models are not altered in any meaningful way.

To address these issues, we investigate the prospect of extracting linguistic structure from the attention weights of multilingual Transformer-based language models. In light of the surveyed literature, our research questions are as follows:

1. Can we decode dependency trees for some languages better than others?
2. Do the same layer–head combinations track the same relations across languages?
3. How do attention patterns change after fine-tuning with explicit syntactic annotation?
4. Which components of the model are involved in these changes?

In answering these questions, we believe we can shed further light on the (cross-)linguistic properties of Transformer-based language models, as well as address the question of attention patterns being a reliable representation of linguistic structure.

2 Attention as Structure

Transformers  The focus of the present study is mBERT, a multilingual variant of the exceedingly popular language model (Devlin et al., 2019). BERT is built upon the Transformer architecture (Vaswani et al., 2017b), which is a self-attention-based encoder-decoder model (though only the encoder is relevant to our purposes). A Transformer takes a sequence of vectors x = [x_1, x_2, ..., x_n] as input and applies a positional encoding to them, in order to retain the order of words in a sentence. These inputs are then transformed into query (Q), key (K), and value (V) vectors via three separate linear transformations and passed to an attention mechanism. A single attention head computes scaled dot-product attention between K and Q, outputting a weighted sum of V:

    Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

For multi-head attention (MHA), the same process is repeated for k heads, allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017b). Ultimately, the output of all heads is concatenated and passed through a linear projection W^O:

    H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

    MHA(Q, K, V) = concat(H_1, H_2, ..., H_k) W^O    (3)

Every layer also consists of a feed-forward network (FFN), consisting of two dense layers with ReLU activation functions. For each layer, therefore, the output of MHA is passed through a LayerNorm with residual connections, passed through the FFN, and then through another LayerNorm with residual connections.
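As a concrete illustration of Equations (1)–(3), the following NumPy sketch implements scaled dot-product attention and multi-head attention. It is a minimal reference implementation of the formulas above rather than mBERT's actual code; the toy dimensions and the random weight matrices are our own placeholder assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Eq. (1): softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention matrix of one head
    return weights @ V, weights

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # Eqs. (2)-(3): one head per (W_Q[i], W_K[i], W_V[i]); outputs are concatenated
    # and projected by W_O.
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        H_i, _ = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
        heads.append(H_i)
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example with mBERT-like sizes: n = 5 tokens, model dim 768, k = 12 heads of size 64.
n, d_model, k, d_k = 5, 768, 12, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(k, d_model, d_k)) * 0.02 for _ in range(3))
W_O = rng.normal(size=(k * d_k, d_model)) * 0.02
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 768)
```

For the decoding experiments below, it is the per-head (n × n) weight matrices, one for each of the 12 × 12 layer-head combinations, that are of interest.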
Searching for structure  Often, the line of inquiry regarding interpretability in NLP has been concerned with extracting and analyzing linguistic information from neural network models of language (Belinkov and Glass, 2019). Recently, such investigations have targeted Transformer models (Hewitt and Manning, 2019; Rosa and Mareček, 2019; Tenney et al., 2019), at least in part because the self-attention mechanism employed by these models offers a possible window into their inner workings.

With large-scale machine translation and language models being openly distributed for experimentation, several researchers have wondered if self-attention is capable of representing syntactic structure, despite not being trained with any overt parsing objective. In pursuit of this question, Raganato et al. (2018) applied a maximum-spanning-tree algorithm over the attention weights of several trained MT models, comparing them with gold trees from Universal Dependencies (Nivre et al., 2016, 2020). They found that, while the accuracy was not comparable to that of a supervised parser, it was nonetheless higher than several strong baselines, implying that some structure was consistently represented. Clark et al. (2019) corroborated the same findings for BERT when decoding full trees, but observed that individual dependency relations were often tracked by specialized heads and were decodable with much higher accuracy than some fixed-offset baselines. Concurrently, Voita et al. (2019) made a similar observation about heads specializing in specific dependency relations, proposing a coarse taxonomy of head attention functions: positional, where heads attend to adjacent tokens; syntactic, where heads attend to specific syntactic relations; and rare words, where heads point to the least frequent tokens in the sentence. Htut et al. (2019) followed Raganato et al. (2018) in decoding dependency trees from BERT-based models, finding that fine-tuning on two classification tasks did not produce syntactically plausible attention patterns. Lastly, Limisiewicz et al. (2020) modified UD annotation to better represent attention patterns and introduced a supervised head-ensembling method for consolidating shared syntactic information across heads.

Does attention have explanatory value?  Though many studies have yielded insight about how attention behaves in a variety of models, the question of whether it can be seen as a "faithful" explanation of model predictions has been subject to much recent debate. For example, Jain and Wallace (2019) present compelling arguments that attention does not offer a faithful explanation of predictions. Primarily, they demonstrate that there is little correlation between standard feature importance measures and attention weights. Furthermore, they contend that there exist counterfactual attention distributions, which are substantially different from learned attention weights but that do not alter a model's predictions. Using a similar methodology, Serrano and Smith (2019) corroborate that attention does not provide an adequate account of an input component's importance.

In response to these findings, Wiegreffe and Pinter (2019) question the assumptions underlying such claims. Attention, they argue, is not a primitive, i.e., it cannot be detached from the rest of a model's components as is done in the experiments of Jain and Wallace (2019). They propose a set of four analyses to test whether a given model's attention mechanism can provide meaningful explanation and demonstrate that the alternative attention distributions found via adversarial training methods do, in fact, perform poorly compared to standard attention mechanisms. On a theoretical level, they argue that, although attention weights do not give an exclusive "faithful" explanation, they do provide a meaningful plausible explanation.

This discussion is relevant to our study because it remains unclear whether attending to syntactic structure serves, in practice, as a plausible explanation for model behavior, or whether it is even capable of serving as such. Indeed, the studies of Raganato et al. (2018) and Clark et al. (2019) relate a convincing but incomplete picture — tree decoding accuracy just marginally exceeds baselines, and various relations tend to be tracked across varying heads and layers. Thus, our fine-tuning experiments (detailed in the following section) serve to enable an "easy" setting wherein we explicitly inform our models of the same structure that we are trying to extract. We posit that, if, after fine-tuning, syntactic structures were still not decodable from the attention weights, one could safely conclude that these structures are being stored via a non-transparent mechanism that may not even involve attention weights. Such an insight would allow us to conclude that attention weights cannot provide even a plausible explanation for models relying on syntax.

3 Experimental Design

To examine the extent to which we can decode dependency trees from attention patterns, we run a tree decoding algorithm over mBERT's attention heads — before and after fine-tuning via a parsing objective. We surmise that doing so will enable us to determine if attention can be interpreted as a reliable mechanism for capturing linguistic structure.

3.1 Model

We employ mBERT¹ in our experiments, which has been shown to perform well across a variety of NLP tasks (Hu et al., 2020; Kondratyuk and Straka, 2019a) and to capture aspects of syntactic structure cross-lingually (Pires et al., 2019; Chi et al., 2020). mBERT features 12 layers with 768 hidden units and 12 attention heads, with a joint WordPiece sub-word vocabulary across languages. The model was trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias, where principled sampling was employed to enforce a balance between high- and low-resource languages.

¹ https://github.com/google-research/bert
3.2 Decoding Algorithm

For decoding dependency trees, we follow Raganato et al. (2018) in applying the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu, 1965) to every layer/head combination available in mBERT (12 × 12 = 144 in total). In order for the matrices to correspond to gold treebank tokenization, we remove the cells corresponding to the BERT delimiter tokens ([CLS] and [SEP]). In addition to this, we sum the columns and average the rows corresponding to the constituent subwords of gold tokens, respectively (Clark et al., 2019). Lastly, since attention patterns across heads may differ in whether they represent heads attending to their dependents or vice versa, we take our input to be the element-wise product of a given attention matrix and its transpose (A ∘ A^T). We liken this to the joint probability of a head attending to its dependent and a dependent attending to its head, similarly to Limisiewicz et al. (2020). Per this point, we also follow Htut et al. (2019) in evaluating the decoded trees via Undirected Unlabeled Attachment Score (UUAS) — the percentage of undirected edges recovered correctly. Since we discount directionality, this is effectively a less strict measure than UAS, but one that has a long tradition in unsupervised dependency parsing since Klein and Manning (2004).
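The per-head decoding step can be sketched as follows. Because UUAS ignores edge direction, this sketch extracts an undirected maximum spanning tree from the symmetrized matrix A ∘ A^T (via SciPy's minimum spanning tree on transformed weights) rather than running Chu-Liu-Edmonds proper, which suffices when only undirected edges are scored. The helper names are ours, and the attention matrix is assumed to be already aligned with gold tokenization as described above.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def decode_undirected_tree(attention):
    """Decode an undirected spanning tree from one head's (n x n) attention matrix."""
    sym = attention * attention.T              # A ∘ A^T: joint head/dependent weight
    np.fill_diagonal(sym, 0.0)
    # Turn the maximum-spanning-tree problem into a minimum one with strictly
    # positive weights (SciPy treats zeros in a dense matrix as "no edge").
    weights = (sym.max() + 1.0) - sym
    np.fill_diagonal(weights, 0.0)
    mst = minimum_spanning_tree(weights).toarray()
    rows, cols = np.nonzero(mst)
    return {tuple(sorted((i, j))) for i, j in zip(rows, cols)}

def uuas(predicted_edges, gold_heads):
    """UUAS: fraction of gold (head, dependent) pairs recovered, ignoring direction.

    gold_heads[i] is the 0-based index of token i's head, or None for the root.
    """
    gold_edges = {tuple(sorted((i, h))) for i, h in enumerate(gold_heads) if h is not None}
    return len(predicted_edges & gold_edges) / len(gold_edges)

# Toy example: 4 tokens, a head that mostly attends along the gold arcs.
A = np.array([[0.1, 0.7, 0.1, 0.1],
              [0.6, 0.1, 0.2, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.7, 0.1]])
edges = decode_undirected_tree(A)
print(edges, uuas(edges, gold_heads=[1, None, 1, 2]))  # perfect recovery: UUAS = 1.0
```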
3.3 Data

For our data, we employ the Parallel Universal Dependencies (PUD) treebanks, as collected in UD v2.4 (Nivre et al., 2019). PUD was first released as part of the CoNLL 2017 shared task (Zeman et al., 2018), containing 1000 parallel sentences, which were (professionally) translated from English, German, French, Italian, and Spanish to 14 other languages. The sentences are taken from two domains, news and wikipedia, the latter implying some overlap with mBERT's training data (though we did not investigate this). We include all PUD treebanks except Thai.²

² Thai is the only treebank that does not have a non-PUD treebank available in UD, which we need for our fine-tuning experiments.

3.4 Fine-Tuning Details

In addition to exploring pretrained mBERT's attention weights, we are also interested in how attention might be guided by a training objective that learns the exact tree structure we aim to decode. To this end, we employ the graph-based decoding algorithm of the biaffine parser introduced by Dozat and Manning (2016). We replace the standard BiLSTM encoder for this parser with the entire mBERT network, which we fine-tune with the parsing loss. The full parser decoder consists of four dense layers, two for head/child representations for dependency arcs (dim. 500) and two for head/child representations for dependency labels (dim. 100). These are transformed into the label space via a bilinear transform.
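As a rough sketch of the arc-scoring component just described, the PyTorch module below implements a biaffine arc scorer with head/dependent representations of dimension 500, in the spirit of Dozat and Manning (2016). It is a simplified stand-in rather than the authors' parser: the label scorer, masking, and training loss are omitted, and all names and defaults are ours.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Score every (dependent, head) pair from encoder states, following the biaffine recipe."""

    def __init__(self, enc_dim=768, arc_dim=500):
        super().__init__()
        # Separate MLPs for the "head" and "dependent" views of each token.
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.empty(arc_dim, arc_dim))   # bilinear term
        self.u = nn.Parameter(torch.zeros(arc_dim))            # head-only bias term
        nn.init.xavier_uniform_(self.U)

    def forward(self, states):
        # states: (batch, n, enc_dim) contextual vectors, e.g. mBERT's top layer.
        h = self.head_mlp(states)   # (batch, n, arc_dim) candidate-head representations
        d = self.dep_mlp(states)    # (batch, n, arc_dim) dependent representations
        # scores[b, i, j] = d_i^T U h_j + u^T h_j: plausibility of token j heading token i.
        bilinear = torch.einsum("bia,ac,bjc->bij", d, self.U, h)
        bias = torch.einsum("c,bjc->bj", self.u, h).unsqueeze(1)
        return bilinear + bias      # (batch, n, n) arc scores

scorer = BiaffineArcScorer()
print(scorer(torch.randn(2, 7, 768)).shape)  # torch.Size([2, 7, 7])
```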
After training the parser, we can decode the fine-tuned mBERT parameters in the same fashion as described in Section 3.2. We surmise that, if attention heads are capable of tracking hierarchical relations between words in any capacity, it is precisely in this setting that this ability would be attested. In addition to this, we are interested in which individual components of the mBERT network are capable of steering attention patterns towards syntactic structure. We believe that addressing this question will help us not only in interpreting decisions made by BERT-based neural parsers, but also in aiding us in developing syntax-aware models in general (Strubell et al., 2018; Swayamdipta et al., 2018). As such — beyond fine-tuning all parameters of the mBERT network (our basic setting) — we perform a series of ablation experiments wherein we update only one set of parameters per training cycle, e.g. the query weights W_i^Q, and leave everything else frozen. This gives us a set of 6 models, described below. For each model, all non-BERT parser components are always left unfrozen.

• KEY: only the K components of the transformer are unfrozen; these are the representations of tokens that are paying attention to other tokens.
• QUERY: only the Q components are unfrozen; these, conversely, are the representations of tokens being paid attention to.
• KQ: both keys and queries are unfrozen.
• VALUE: semantic value vectors per token (V) are unfrozen; they are composed after being weighted with attention scores obtained from the K/Q matrices.
• DENSE: the dense feed-forward networks in the attention mechanism; all three per layer are unfrozen.
• NONE: the basic setting with nothing frozen; all parameters are updated with the parsing loss.

We fine-tune each of these models on a concatenation of all PUD treebanks for 20 epochs, which effectively makes our model multilingual. We do so in order to 1) control for domain and annotation confounds, since all PUD sentences are parallel and are natively annotated (unlike converted UD treebanks, for instance); 2) increase the number of training samples for fine-tuning, as each PUD treebank features only 1000 sentences; and 3) induce a better parser through multilinguality, as in Kondratyuk and Straka (2019b). Furthermore, in order to gauge the overall performance of our parser across all ablated settings, we evaluate on the test set of the largest non-PUD treebank available for each language, since PUD only features test partitions. When training, we employ a combined dense/sparse Adam optimiser, at a learning rate of 3 × 10⁻⁵. We rescale gradients to have a maximum norm of 5.
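The ablation settings can be expressed as parameter-freezing rules. The sketch below assumes the Hugging Face Transformers naming scheme for BERT parameters (attention.self.query, attention.self.key, attention.self.value, and the dense sublayers); the mapping from the settings above to these name patterns is our assumption, not a detail taken from the paper, and the non-BERT parser modules are assumed to be kept trainable separately.

```python
from transformers import BertModel

# Parameter-name substrings per ablation setting (Hugging Face BERT naming assumed).
COMPONENTS = {
    "KEY":   ["attention.self.key"],
    "QUERY": ["attention.self.query"],
    "KQ":    ["attention.self.key", "attention.self.query"],
    "VALUE": ["attention.self.value"],
    "DENSE": ["attention.output.dense", "intermediate.dense", "output.dense"],
    "NONE":  [],  # nothing frozen: every mBERT parameter stays trainable
}

def freeze_all_but(model, setting):
    """Freeze every mBERT parameter except the chosen component group."""
    patterns = COMPONENTS[setting]
    for name, param in model.named_parameters():
        if setting == "NONE":
            param.requires_grad = True
        else:
            param.requires_grad = any(p in name for p in patterns)

model = BertModel.from_pretrained("bert-base-multilingual-cased")
freeze_all_but(model, "VALUE")
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable))  # 12 layers x (weight + bias) for the value projections = 24
```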
Table 1: Adjacent-branching baseline and maximum UUAS decoding accuracy per PUD treebank, expressed as the best score followed by the layer-head combination that achieves it (in parentheses). PRE refers to the basic mBERT model before fine-tuning, while all rows below it correspond to the different fine-tuned models described in Section 3.4. (The best score per language is indicated in bold in the original.)

            AR          CS          DE          EN          ES          FI          FR          HI          ID
BASELINE    50          40          36          36          40          42          40          46          47
PRE         53 (7-6)    53 (10-8)   49 (10-8)   47 (10-8)   50 (9-5)    48 (10-8)   41 (2-3)    48 (2-3)    50 (9-5)
NONE        76 (11-10)  78 (11-10)  76 (11-10)  71 (10-11)  77 (10-11)  66 (10-11)  45 (11-10)  72 (11-10)  75 (11-10)
KEY         62 (10-8)   64 (10-8)   58 (11-12)  53 (10-8)   59 (11-12)  56 (10-8)   41 (7-12)   54 (10-8)   59 (10-8)
QUERY       69 (11-4)   74 (10-8)   70 (11-4)   66 (11-4)   73 (11-4)   63 (10-8)   42 (11-4)   62 (11-4)   67 (11-4)
KQ          71 (11-4)   76 (11-4)   70 (11-4)   65 (11-4)   74 (11-4)   62 (11-4)   43 (10-11)  64 (11-4)   69 (11-4)
VALUE       75 (12-5)   72 (12-5)   72 (12-5)   64 (12-5)   76 (12-5)   59 (12-5)   45 (12-5)   63 (12-5)   73 (12-5)
DENSE       68 (11-10)  71 (11-10)  65 (11-10)  60 (10-8)   67 (12-10)  61 (11-10)  42 (10-8)   65 (11-10)  66 (11-10)

            IT          JA          KO          PL          PT          RU          SV          TR          ZH
BASELINE    40          43          55          45          41          42          39          52          41
PRE         41 (6-4)    45 (2-3)    64 (9-2)    52 (10-8)   50 (9-5)    51 (10-8)   51 (10-8)   55 (3-8)    42 (2-3)
NONE        58 (11-10)  42 (11-10)  64 (11-10)  75 (11-10)  76 (11-10)  75 (10-8)   74 (10-8)   55 (3-8)    38 (2-3)
KEY         47 (9-2)    44 (2-3)    62 (10-8)   64 (10-8)   58 (11-12)  61 (10-8)   59 (12-10)  55 (3-12)   41 (2-3)
QUERY       54 (11-4)   45 (2-3)    65 (10-8)   72 (11-4)   70 (11-4)   70 (10-8)   68 (11-4)   56 (10-8)   42 (2-3)
KQ          55 (11-4)   44 (2-3)    64 (11-4)   73 (11-4)   73 (11-4)   69 (11-4)   69 (11-4)   55 (11-4)   41 (2-3)
VALUE       55 (12-5)   45 (2-3)    66 (10-8)   73 (12-5)   74 (12-5)   69 (12-5)   65 (12-5)   57 (12-5)   42 (3-8)
DENSE       49 (9-5)    44 (3-12)   64 (11-10)  70 (11-10)  64 (12-5)   67 (11-10)  64 (11-10)  55 (11-10)  40 (3-12)

4 Decoding mBERT Attention

The second row of Table 1 (PRE) depicts the UUAS after running our decoding algorithm over mBERT attention matrices, per language. We see a pattern familiar from Clark et al. (2019), among others — namely, that attention patterns extracted directly from mBERT appear to be incapable of decoding dependency trees beyond a threshold of 50–60% UUAS accuracy. However, we also note that, in all languages, the attention-decoding algorithm outperforms a BASELINE (row 1) that draws an (undirected) edge between any two adjacent words in linear order, which implies that some non-linear structures are captured with regularity. Indeed, head 8 in layer 10 appears to be particularly strong in this regard, returning the highest UUAS for 7 languages. Interestingly, the accuracy patterns across layers depicted in Figure 1 tend to follow an identical trend for all languages, with nearly all heads in layer 7 returning high within-language accuracies.

Figure 1: UUAS of MST decoding per layer and head, across languages. Heads (y-axis) are sorted by accuracy for easier visualization.

It appears that attention for some languages (Arabic, Czech, Korean, Turkish) is comparatively easier to decode than for others (French, Italian, Japanese, Chinese). A possible explanation for this result is that dependency relations between content words, which are favored by the UD annotation, are more likely to be adjacent in the morphologically rich languages of the first group (without intervening function words). This assumption seems to be corroborated by the high baseline scores for Arabic, Korean and Turkish (but not Czech). Conversely, the low baseline scores and the likewise low decoding accuracies for the latter four languages are difficult to characterize. Indeed, we could not identify what factors — typological, annotation, tokenization or otherwise — would set French and Italian apart from the remaining languages in terms of score. However, we hypothesize that the tokenization and our treatment of subword tokens play a part in attempting to decode attention from Chinese and Japanese representations. Per the mBERT documentation,³ Chinese and Japanese Kanji character spans within the CJK Unicode range are character-tokenized. This lies in contrast with all other languages (Korean Hangul and Japanese Hiragana and Katakana included), which rely on whitespace and WordPiece (Wu et al., 2016). It is thus possible that the attention distributions for these two languages (at least where CJK characters are relevant) are devoted to composing words, rather than structural relations, which will distort the attention matrices that we compute to correspond with gold tokenization (e.g. by maxing rows and averaging columns).

³ https://github.com/google-research/bert/blob/master/multilingual.md

Relation analysis  We can disambiguate what sort of structures are captured with regularity by looking at the UUAS returned per dependency relation. Figure 2 (left) shows that adjectival modifiers (amod, mean UUAS = 85 ± 12) and determiners (det, 88 ± 6) are among the easiest relations to decode across languages. Indeed, words that are connected by these relations are often adjacent to each other and may be simple to decode if a head is primarily concerned with tracking linear order. To verify the extent to which this might be happening, we plot the aforementioned decoding accuracy as a function of select relations' positional baselines in Figure 2 (right). The positional baselines, in this case, are calculated by picking the most frequent offset at which a dependent occurs with respect to its head, e.g., −1 for det in English, meaning one position to the left of the head. Interestingly, while we observe significant variation across the positional baselines for amod and det, the decoding accuracy remains quite high.

Figure 2: Left: UUAS per relation across languages (best layer/head combination indicated in cell). Right: Best UUAS as a function of best positional baseline (derived from the treebank), selected relations.
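The positional baseline described above can be computed directly from treebank statistics. The sketch below assumes a minimal (head_index, deprel) encoding of parsed sentences and reports, per relation, the most frequent signed offset of the dependent relative to its head together with the fraction of arcs that offset recovers; the data format and function names are illustrative.

```python
from collections import Counter, defaultdict

def positional_baseline(sentences):
    """Per relation: the most frequent dependent-minus-head offset and its hit rate.

    `sentences` is a list of sentences; each token is a (head_index, deprel)
    tuple with 1-based head indices and 0 marking the root.
    """
    offsets = defaultdict(Counter)
    for sent in sentences:
        for dep_pos, (head, rel) in enumerate(sent, start=1):
            if head != 0:
                offsets[rel][dep_pos - head] += 1
    baseline = {}
    for rel, counts in offsets.items():
        best_offset, hits = counts.most_common(1)[0]
        baseline[rel] = (best_offset, hits / sum(counts.values()))
    return baseline

# Toy example: "the dog barked" -> det(dog, the), nsubj(barked, dog), root(barked)
sentences = [[(2, "det"), (3, "nsubj"), (0, "root")]]
print(positional_baseline(sentences))
# {'det': (-1, 1.0), 'nsubj': (-1, 1.0)}
```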
In slight contrast to this, the core subject (nsubj, 58 ± 16 SD) and object (obj, 64 ± 13) relations prove to be more difficult to decode. Unlike the aforementioned relations, nsubj and obj are much more sensitive to the word order properties of the language at hand. For example, while a language like English, with Subject-Verb-Object (SVO) order, might have the subject frequently appear to the left of the verb, an SOV language like Hindi might have it several positions further away, with an object and its potential modifiers intervening. Indeed, the best positional baseline for English nsubj is 39 UUAS, while it is only 10 for Hindi. Despite this variation, the relation seems to be tracked with some regularity by the same head (layer 3, head 9), returning 60 UUAS for English and 52 for Hindi. The same can largely be said for obj, where the positional baselines return 51 ± 18. In this latter case, however, the heads tend to be distributed quite differently across languages. Finally, the results for the obj relation provide some support for our earlier explanation concerning morphologically rich languages, as Arabic, Czech, Korean and Turkish all have among the highest accuracies (as well as positional baselines).

5 Fine-Tuning Experiments

Next, we investigate the effect fine-tuning has on UUAS decoding. Row 3 in Table 1 (NONE) indicates that fine-tuning does result in large improvements to UUAS decoding across most languages, often by margins as high as ~30%. This shows that with an explicit parsing objective, attention heads are capable of serving as explanatory mechanisms for syntax; syntactic structure can be made to be transparently stored in the heads, in a manner that does not require additional probe fitting or parameterized transformation to extract.

Given that we do manage to decode reasonable syntactic trees, we can then refine our question — which components are capable of learning these trees? One obvious candidate is the key/query component pair, given that attention weights are a scaled softmax of a composition of the two. Figure 3 (top) shows the difference between pretrained UUAS and fine-tuned UUAS per layer, across models and languages. Interestingly, the best parsing accuracies do not appear to vary much depending on what component is frozen. We do see a clear trend, however, in that decoding the attention patterns of the fine-tuned model typically yields better UUAS than the pretrained model, particularly in the highest layers. Indeed, the lowest layer at which fine-tuning appears to improve decoding is layer 7. This implies that, regardless of which component remains frozen, the parameters facing any sort of significant and positive update tend to be those appearing towards the higher end of the network, closer to the output.

For the frozen components, the best improvements in UUAS are seen at the final layer in VALUE, which is also the only model that shows consistent improvement, as well as the highest average improvement in mean scores⁴ for the last few layers. Perhaps most interestingly, the mean UUAS (Figure 3, bottom) for our "attentive" components – keys, queries, and their combination – does not appear to have improved by much after fine-tuning. In contrast, the maximum does show considerable improvement; this seems to imply that although all components appear to be more or less equally capable of learning decodable heads, the attentive components, when fine-tuned, appear to sharpen fewer heads.

Note that the only difference between keys and queries in an attention mechanism is that keys are transposed to index attention from/to appropriately. Surprisingly, KEY and QUERY appear to act somewhat differently, with QUERY being almost uniformly better than KEY with the best heads, whilst KEY is slightly better with averages, implying distinctions in how both store information. Furthermore, allowing both keys and queries to update seems to result in an interesting contradiction – the ultimate layer, which has reasonable maximums and averages for both KEY and QUERY, now seems to show a UUAS drop almost uniformly. This is also true for the completely unfrozen encoder.

⁴ The inner average is over all heads; the outer is over all languages.

Figure 3: (Top) Best scores across all heads, per language; (bottom) mean scores across all heads, per language. The languages (hidden from the x-axis for brevity) are, in order: ar, cs, de, en, es, fi, fr, hi, id, it, ja, ko, pl, pt, ru, sv, tr, zh.
Supervised Parsing  In addition to decoding trees from attention matrices, we also measure supervised UAS/LAS on a held-out test set.⁵ Based on Figure 4, it is apparent that all settings result in generally the same UAS. This is somewhat expected; Lauscher et al. (2020) see better results on parsing with the entire encoder frozen, implying that the task is easy enough for a biaffine parser to learn, given frozen mBERT representations.⁶ The LAS distinction is, however, rather interesting: there is a marked difference between how important the dense layers are, as opposed to the attentive components. This is likely not reflected in our UUAS probe as, strictly speaking, labelling arcs is not equivalent to searching for structure in sentences, but more akin to classifying pre-identified structures. We also note that DENSE appears to be better than NONE on average, implying that non-dense components might actually be hurting labelling capacity.

⁵ Note that the test set in our scenario is from the actual, non-parallel language treebank; as such, we left Korean out of this comparison due to annotation differences.
⁶ Due to training on concatenated PUD sets, however, our results are not directly comparable.

Figure 4: Mean UAS and LAS when evaluating different models on language-specific treebanks (Korean excluded due to annotation differences). MBERT refers to models where the entire mBERT network is frozen as input to the parser.

In brief, consolidating the two sets of results above, we can draw three interesting conclusions about the components:

1. Value vectors are best aligned with syntactic dependencies; this is reflected both in the best head at the upper layers and in the average score across all heads.
2. Dense layers appear to have moderate informative capacity, but appear to have the best learning capacity for the task of arc labelling.
3. Perhaps most surprisingly, key and query vectors do not appear to make any outstanding contributions, save for sharpening a smaller subset of heads.

Our last result is especially surprising for UUAS decoding. Keys and queries, fundamentally, combine to form the attention weight matrix, which is precisely what we use to decode trees. One would expect that allowing these components to learn from labelled syntax would result in the best improvements to decoding, but all three have surprisingly negligible mean improvements. This indicates that we need to further improve our understanding of how attentive structure and weighting really works.

Cross-linguistic observations  We notice no clear cross-linguistic trends here across different component sets; however, certain languages do stand out as being particularly hard to decode from the fine-tuned parser. These include Japanese, Korean, Chinese, French and Turkish. For the first three, we hypothesise that tokenization clashes with mBERT's internal representations may play a role.
Indeed, as we hypothesized in Section 3.2, it could be the case that the composition of CJK characters into gold tokens for Chinese and Japanese may degrade the representations (and their corresponding attention) therein. Furthermore, for Japanese and Korean specifically, it has been observed that tokenization strategies employed by different treebanks could drastically influence the conclusions one may draw about their inherent hierarchical structure (Kulmizev et al., 2020). Turkish and French are admittedly more difficult to diagnose. Note, however, that we fine-tuned our model on a concatenation of all PUD treebanks. As such, any deviation from PUD's annotation norms is likely to be heavily penalised, by virtue of signal from other languages drowning out these differences.

6 Conclusion

In this study, we revisited the prospect of decoding dependency trees from the self-attention patterns of Transformer-based language models. We elected to extend our experiments to 18 languages in order to gain better insight into how tree decoding accuracy might be affected in the face of (modest) typological diversity. Surprisingly, across all languages, we were able to decode dependency trees from attention patterns more accurately than an adjacent-linking baseline, implying that some structure was indeed being tracked by the mechanism. In looking at specific relation types, we corroborated previous studies in showing that particular layer-head combinations tracked the same relation with regularity across languages, despite typological differences concerning word order, etc.

In investigating the extent to which attention can be guided to properly capture structural relations between input words, we fine-tuned mBERT as input to a dependency parser. This, we found, yielded large improvements over the pretrained attention patterns in terms of decoding accuracy, demonstrating that the attention mechanism was learning to represent the structural objective of the parser. In addition to fine-tuning the entire mBERT network, we conducted a series of experiments wherein we updated only select components of the model and left the remainder frozen. Most surprisingly, we observed that the Transformer parameters designed for composing the attention matrix, K and Q, were only modestly capable of guiding the attention towards resembling the dependency structure. In contrast, it was the value (V) parameters, which are used for computing a weighted sum over the KQ-produced attention, that yielded the most faithful representations of the linguistic structure via attention.

Though prior work (Kovaleva et al., 2019; Zhao and Bethard, 2020) seems to indicate a lack of substantial change in attention patterns after fine-tuning on syntax- and semantics-oriented classification tasks, the opposite effect has been observed with fine-tuning on negation scope resolution, where a more explanatory attention mechanism can be induced (Htut et al., 2019). Our results are similar to the latter, and we demonstrate that, given explicit syntactic annotation, attention weights do end up storing more transparently decodable structure. It is, however, still unclear which sets of Transformer parameters are best suited for learning this information and storing it in the form of attention.

Acknowledgements

Our experiments were run on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway, under the NeIC-NLPL umbrella. Mostafa and Anders were funded by a Google Focused Research Award. We would like to thank Daniel Dakota and Ali Basirat for some fruitful discussions and the anonymous reviewers for their excellent feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 479–486.

Dan Kondratyuk and Milan Straka. 2019a. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Dan Kondratyuk and Milan Straka. 2019b. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4364–4373, Hong Kong, China. Association for Computational Linguistics.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual Transformers. arXiv preprint arXiv:2005.00633.

Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. 2020. Universal Dependencies according to BERT: Both more specific and more general. arXiv preprint arXiv:2004.14620.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Dan Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Rudolf Rosa and David Mareček. 2019. Inducing syntactic trees from BERT representations. arXiv preprint arXiv:1906.11511.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. arXiv preprint arXiv:1706.03762.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

Yiyun Zhao and Steven Bethard. 2020. How does BERT's attention change when you fine-tune? An analysis methodology and a case study in negation scope.
Figure 5: Positional scores across relations for all languages.
Figure 6: Decoding UUAS as a function of best positional baselines.

Figure 7: Parsing scores across components and languages.