Attention Can Reflect Syntactic Structure
(If You Let It)
Vinit Ravishankar†∗   Artur Kulmizev‡∗   Mostafa Abdou§   Anders Søgaard§   Joakim Nivre‡
† Language Technology Group, Department of Informatics, University of Oslo
‡ Department of Linguistics and Philology, Uppsala University
§ Department of Computer Science, University of Copenhagen
† vinitr@ifi.uio.no   ‡ {artur.kulmizev,joakim.nivre}@lingfil.uu.se   § {abdou,soegaard}@di.ku.dk
∗ Equal contribution. Order was decided by a coin toss.

Abstract

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English — a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.

1 Introduction

In recent years, the attention mechanism proposed by Bahdanau et al. (2014) has become an indispensable component of many NLP systems. Its widespread adoption was, in part, heralded by the introduction of the Transformer architecture (Vaswani et al., 2017a), which constrains a soft alignment to be learned across discrete states in the input (self-attention), rather than across input and output (e.g., Xu et al., 2015; Rocktäschel et al., 2015). The Transformer has, by now, supplanted the popular LSTM (Hochreiter and Schmidhuber, 1997) as NLP's feature-encoder-of-choice, largely due to its compatibility with parallelized training regimes and ability to handle long-distance dependencies.

Certainly, the nature of attention as a distribution over tokens lends itself to a straightforward interpretation of a model's inner workings. Bahdanau et al. (2014) illustrate this nicely in the context of seq2seq machine translation, showing that the attention learned by their models reflects expected cross-lingual idiosyncrasies between English and French, e.g., concerning word order. With self-attentive Transformers, interpretation becomes slightly more difficult, as attention is distributed across words within the input itself. This is further compounded by the use of multiple layers and heads, each combination of which yields its own alignment, representing a different (possibly redundant) view of the data. Given the similarity of such attention matrices to the score matrices employed in arc-factored dependency parsing (McDonald et al., 2005a,b), a salient question concerning interpretability becomes: Can we expect some combination of these parameters to capture linguistic structure in the form of a dependency tree, especially if the model performs well on NLP tasks? If not, can we relax the expectation and examine the extent to which subcomponents of the linguistic structure, such as subject-verb relations, are represented? This prospect was first posed by Raganato et al. (2018) for MT encoders, and later explored by Clark et al. (2019) for BERT. Ultimately, the consensus of these and other studies (Voita et al., 2019; Htut et al., 2019; Limisiewicz et al., 2020) was that, while there appears to exist no "generalist" head responsible for extracting full dependency structures, standalone heads often specialize in capturing individual grammatical relations.

Unfortunately, most of such studies focused their experiments entirely on English, which is typologically favored to succeed in such scenarios due to its rigid word order and lack of inflectional morphology. It remains to be seen whether the attention patterns of such models can capture structural features across typologically diverse languages, or if the reported experiments on English are a misrepresentation of local positional heuristics as such. Furthermore, though previous work has investigated how attention patterns might change after fine-tuning on different tasks (Htut et al., 2019), a recent debate about attention as an explanatory mechanism (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) has cast the entire enterprise in doubt. Indeed, it remains to be seen whether fine-tuning on an explicit structured prediction task, e.g. dependency parsing, can force attention to represent the structure being learned, or if the patterns observed in pretrained models are not altered in any meaningful way.

To address these issues, we investigate the prospect of extracting linguistic structure from the attention weights of multilingual Transformer-based language models. In light of the surveyed literature, our research questions are as follows:

1. Can we decode dependency trees for some languages better than others?
2. Do the same layer–head combinations track the same relations across languages?
3. How do attention patterns change after fine-tuning with explicit syntactic annotation?
4. Which components of the model are involved in these changes?

In answering these questions, we believe we can shed further light on the (cross-)linguistic properties of Transformer-based language models, as well as address the question of attention patterns being a reliable representation of linguistic structure.

2 Attention as Structure

Transformers  The focus of the present study is mBERT, a multilingual variant of the exceedingly popular language model (Devlin et al., 2019). BERT is built upon the Transformer architecture (Vaswani et al., 2017b), which is a self-attention-based encoder-decoder model (though only the encoder is relevant to our purposes). A Transformer takes a sequence of vectors x = [x_1, x_2, ..., x_n] as input and applies a positional encoding to them, in order to retain the order of words in a sentence. These inputs are then transformed into query (Q), key (K), and value (V) vectors via three separate linear transformations and passed to an attention mechanism. A single attention head computes scaled dot-product attention between K and Q, outputting a weighted sum of V:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (1)

For multihead attention (MHA), the same process is repeated for k heads, allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017b). Ultimately, the output of all heads is concatenated and passed through a linear projection W^O:

H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)

MHA(Q, K, V) = concat(H_1, H_2, ..., H_k) W^O    (3)

Every layer also consists of a feed-forward network (FFN), consisting of two dense layers with ReLU activation functions. For each layer, therefore, the output of MHA is passed through a LayerNorm with residual connections, passed through the FFN, and then through another LayerNorm with residual connections.
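For concreteness, Equations 1–3 can be written out directly. The following is a minimal NumPy sketch, not the implementation behind mBERT; the dimensions, random weights, and absence of masking are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. 1: softmax(QK^T / sqrt(d_k)) V for a single head."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))         # (n, n) attention matrix
    return weights @ V, weights

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    """Eqs. 2-3: one head per (Wq_i, Wk_i, Wv_i) triple, concatenated and projected."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        H_i, _ = attention(x @ Wq_i, x @ Wk_i, x @ Wv_i)
        heads.append(H_i)
    return np.concatenate(heads, axis=-1) @ Wo

# Toy example: 5 tokens, model dimension 8, 2 heads of dimension 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq = [rng.normal(size=(8, 4)) for _ in range(2)]
Wk = [rng.normal(size=(8, 4)) for _ in range(2)]
Wv = [rng.normal(size=(8, 4)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
print(multi_head_attention(x, Wq, Wk, Wv, Wo).shape)  # (5, 8)
```

In mBERT (Section 3.1), the model dimension of 768 is split across 12 such heads per layer, i.e. 64 dimensions per head.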
Searching for structure  Often, the line of inquiry regarding interpretability in NLP has been concerned with extracting and analyzing linguistic information from neural network models of language (Belinkov and Glass, 2019). Recently, such investigations have targeted Transformer models (Hewitt and Manning, 2019; Rosa and Mareček, 2019; Tenney et al., 2019), at least in part because the self-attention mechanism employed by these models offers a possible window into their inner workings. With large-scale machine translation and language models being openly distributed for experimentation, several researchers have wondered if self-attention is capable of representing syntactic structure, despite not being trained with any overt parsing objective.

In pursuit of this question, Raganato et al. (2018) applied a maximum-spanning-tree algorithm over the attention weights of several trained MT models, comparing them with gold trees from Universal Dependencies (Nivre et al., 2016, 2020). They found that, while the accuracy was not comparable to that of a supervised parser, it was nonetheless higher than several strong baselines, implying that some structure was consistently represented.
Clark et al. (2019) corroborated the same findings for BERT when decoding full trees, but observed that individual dependency relations were often tracked by specialized heads and were decodable with much higher accuracy than some fixed-offset baselines. Concurrently, Voita et al. (2019) made a similar observation about heads specializing in specific dependency relations, proposing a coarse taxonomy of head attention functions: positional, where heads attend to adjacent tokens; syntactic, where heads attend to specific syntactic relations; and rare words, where heads point to the least frequent tokens in the sentence. Htut et al. (2019) followed Raganato et al. (2018) in decoding dependency trees from BERT-based models, finding that fine-tuning on two classification tasks did not produce syntactically plausible attention patterns. Lastly, Limisiewicz et al. (2020) modified UD annotation to better represent attention patterns and introduced a supervised head-ensembling method for consolidating shared syntactic information across heads.

Does attention have explanatory value?  Though many studies have yielded insight about how attention behaves in a variety of models, the question of whether it can be seen as a "faithful" explanation of model predictions has been subject to much recent debate. For example, Jain and Wallace (2019) present compelling arguments that attention does not offer a faithful explanation of predictions. Primarily, they demonstrate that there is little correlation between standard feature importance measures and attention weights. Furthermore, they contend that there exist counterfactual attention distributions, which are substantially different from learned attention weights but that do not alter a model's predictions. Using a similar methodology, Serrano and Smith (2019) corroborate that attention does not provide an adequate account of an input component's importance.

In response to these findings, Wiegreffe and Pinter (2019) question the assumptions underlying such claims. Attention, they argue, is not a primitive, i.e., it cannot be detached from the rest of a model's components as is done in the experiments of Jain and Wallace (2019). They propose a set of four analyses to test whether a given model's attention mechanism can provide meaningful explanation and demonstrate that the alternative attention distributions found via adversarial training methods do, in fact, perform poorly compared to standard attention mechanisms. On a theoretical level, they argue that, although attention weights do not give an exclusive "faithful" explanation, they do provide a meaningful plausible explanation.

This discussion is relevant to our study because it remains unclear whether or not attending to syntactic structure serves, in practice, as plausible explanation for model behavior, or whether or not it is even capable of serving as such. Indeed, the studies of Raganato et al. (2018) and Clark et al. (2019) relate a convincing but incomplete picture — tree decoding accuracy just marginally exceeds baselines and various relations tend to be tracked across varying heads and layers. Thus, our fine-tuning experiments (detailed in the following section) serve to enable an "easy" setting wherein we explicitly inform our models of the same structure that we are trying to extract. We posit that, if, after fine-tuning, syntactic structures were still not decodable from the attention weights, one could safely conclude that these structures are being stored via a non-transparent mechanism that may not even involve attention weights. Such an insight would allow us to conclude that attention weights cannot provide even a plausible explanation for models relying on syntax.

3 Experimental Design

To examine the extent to which we can decode dependency trees from attention patterns, we run a tree decoding algorithm over mBERT's attention heads — before and after fine-tuning via a parsing objective. We surmise that doing so will enable us to determine if attention can be interpreted as a reliable mechanism for capturing linguistic structure.

3.1 Model

We employ mBERT1 in our experiments, which has been shown to perform well across a variety of NLP tasks (Hu et al., 2020; Kondratyuk and Straka, 2019a) and capture aspects of syntactic structure cross-lingually (Pires et al., 2019; Chi et al., 2020). mBERT features 12 layers with 768 hidden units and 12 attention heads, with a joint WordPiece sub-word vocabulary across languages. The model was trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias, where principled sampling was employed to enforce a balance between high- and low-resource languages.

1 https://github.com/google-research/bert
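For readers who want to reproduce the attention extraction, the sketch below shows one way to obtain the full stack of per-layer, per-head attention matrices from the publicly released mBERT checkpoint via the HuggingFace transformers library. This is an illustration of the general procedure, not the authors' own tooling.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_attentions=True)
model.eval()

inputs = tok("Attention can reflect syntactic structure .", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, heads, seq_len, seq_len); stacking gives (12, 12, seq_len, seq_len).
attn = torch.stack(out.attentions).squeeze(1)
print(attn.shape)
```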
3.2 Decoding Algorithm

For decoding dependency trees, we follow Raganato et al. (2018) in applying the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu, 1965) to every layer/head combination available in mBERT (12 × 12 = 144 in total). In order for the matrices to correspond to gold treebank tokenization, we remove the cells corresponding to the BERT delimiter tokens ([CLS] and [SEP]). In addition to this, we sum the columns and average the rows corresponding to the constituent subwords of gold tokens, respectively (Clark et al., 2019). Lastly, since attention patterns across heads may differ in whether they represent heads attending to their dependents or vice versa, we take our input to be the element-wise product of a given attention matrix and its transpose (A ∘ A⊤). We liken this to the joint probability of a head attending to its dependent and a dependent attending to its head, similarly to Limisiewicz et al. (2020). Per this point, we also follow Htut et al. (2019) in evaluating the decoded trees via Undirected Unlabeled Attachment Score (UUAS) — the percentage of undirected edges recovered correctly. Since we discount directionality, this is effectively a less strict measure than UAS, but one that has a long tradition in unsupervised dependency parsing since Klein and Manning (2004).
3.3 Data

For our data, we employ the Parallel Universal Dependencies (PUD) treebanks, as collected in UD v2.4 (Nivre et al., 2019). PUD was first released as part of the CoNLL 2017 shared task (Zeman et al., 2018), containing 1000 parallel sentences, which were (professionally) translated from English, German, French, Italian, and Spanish to 14 other languages. The sentences are taken from two domains, news and wikipedia, the latter implying some overlap with mBERT's training data (though we did not investigate this). We include all PUD treebanks except Thai.2

2 Thai is the only treebank that does not have a non-PUD treebank available in UD, which we need for our fine-tuning experiments.

3.4 Fine-Tuning Details

In addition to exploring pretrained mBERT's attention weights, we are also interested in how attention might be guided by a training objective that learns the exact tree structure we aim to decode. To this end, we employ the graph-based decoding algorithm of the biaffine parser introduced by Dozat and Manning (2016). We replace the standard BiLSTM encoder for this parser with the entire mBERT network, which we fine-tune with the parsing loss. The full parser decoder consists of four dense layers, two for head/child representations for dependency arcs (dim. 500) and two for head/child representations for dependency labels (dim. 100). These are transformed into the label space via a bilinear transform.
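A biaffine arc scorer in this style can be sketched in a few lines of PyTorch. The arc dimension (500) follows the text; the class and variable names, the initialization, and the omission of dropout and of the label scorer are our own simplifications rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, enc_dim=768, arc_dim=500):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        # Bilinear term; the extra column acts as a bias on the head representation.
        self.W = nn.Parameter(torch.randn(arc_dim, arc_dim + 1) * 0.01)

    def forward(self, enc):                          # enc: (batch, seq, enc_dim)
        head = self.head_mlp(enc)                    # (batch, seq, arc_dim)
        dep = self.dep_mlp(enc)                      # (batch, seq, arc_dim)
        ones = torch.ones(head.size(0), head.size(1), 1, device=enc.device)
        head = torch.cat([head, ones], dim=-1)       # (batch, seq, arc_dim + 1)
        # scores[b, d, h] = score for token h being the syntactic head of token d
        return dep @ self.W @ head.transpose(1, 2)   # (batch, seq, seq)

scorer = BiaffineArcScorer()
enc = torch.randn(2, 10, 768)        # e.g. mBERT output for two 10-token sentences
arc_scores = scorer(enc)             # (2, 10, 10); trained with a per-token head loss
```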
After training the parser, we can decode the fine-tuned mBERT parameters in the same fashion as described in Section 3.2. We surmise that, if attention heads are capable of tracking hierarchical relations between words in any capacity, it is precisely in this setting that this ability would be attested. In addition to this, we are interested in what individual components of the mBERT network are capable of steering attention patterns towards syntactic structure. We believe that addressing this question will help us not only in interpreting decisions made by BERT-based neural parsers, but also in aiding us in developing syntax-aware models in general (Strubell et al., 2018; Swayamdipta et al., 2018). As such — beyond fine-tuning all parameters of the mBERT network (our basic setting) — we perform a series of ablation experiments wherein we update only one set of parameters per training cycle, e.g. the Query weights W_i^Q, and leave everything else frozen. This gives us a set of 6 models, which are described below (a schematic sketch of the freezing setup follows the list). For each model, all non-BERT parser components are always left unfrozen.

• KEY: only the K components of the transformer are unfrozen; these are the representations of tokens that are paying attention to other tokens.
• QUERY: only the Q components are unfrozen; these, conversely, are the representations of tokens being paid attention to.
• KQ: both keys and queries are unfrozen.
• VALUE: semantic value vectors per token (V) are unfrozen; they are composed after being weighted with attention scores obtained from the K/Q matrices.
• DENSE: the dense feed-forward networks in the attention mechanism; all three per layer are unfrozen.
• NONE: the basic setting with nothing frozen; all parameters are updated with the parsing loss.
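The freezing scheme can be implemented by toggling requires_grad on parameter groups. The sketch below relies on the parameter naming of the HuggingFace BertModel implementation (e.g. encoder.layer.N.attention.self.key); the authors' actual code may differ, and the non-BERT parser components, which are not part of this model object, would remain trainable in every setting.

```python
from transformers import AutoModel

# Substrings of parameter names that stay trainable in each ablation setting.
PATTERNS = {
    "key":   ["attention.self.key."],
    "query": ["attention.self.query."],
    "kq":    ["attention.self.key.", "attention.self.query."],
    "value": ["attention.self.value."],
    "dense": ["attention.output.dense.", "intermediate.dense.", "output.dense."],
    "none":  [""],   # the empty pattern matches every name, i.e. nothing is frozen
}

def freeze_all_but(model, setting):
    unfrozen = PATTERNS[setting]
    for name, param in model.named_parameters():
        param.requires_grad = any(p in name for p in unfrozen)

mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
freeze_all_but(mbert, "value")     # e.g. the VALUE ablation
trainable = sum(p.numel() for p in mbert.parameters() if p.requires_grad)
print(f"trainable mBERT parameters: {trainable}")
```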
          AR     CS     DE     EN     ES     FI     FR     HI     ID     IT     JA     KO     PL     PT     RU     SV     TR     ZH
BASELINE  50     40     36     36     40     42     40     46     47     40     43     55     45     41     42     39     52     41
PRE       53     53     49     47     50     48     41     48     50     41     45     64     52     50     51     51     55     42
          7-6    10-8   10-8   10-8   9-5    10-8   2-3    2-3    9-5    6-4    2-3    9-2    10-8   9-5    10-8   10-8   3-8    2-3
NONE      76     78     76     71     77     66     45     72     75     58     42     64     75     76     75     74     55     38
          11-10  11-10  11-10  10-11  10-11  10-11  11-10  11-10  11-10  11-10  11-10  11-10  11-10  11-10  10-8   10-8   3-8    2-3
KEY       62     64     58     53     59     56     41     54     59     47     44     62     64     58     61     59     55     41
          10-8   10-8   11-12  10-8   11-12  10-8   7-12   10-8   10-8   9-2    2-3    10-8   10-8   11-12  10-8   12-10  3-12   2-3
QUERY     69     74     70     66     73     63     42     62     67     54     45     65     72     70     70     68     56     42
          11-4   10-8   11-4   11-4   11-4   10-8   11-4   11-4   11-4   11-4   2-3    10-8   11-4   11-4   10-8   11-4   10-8   2-3
KQ        71     76     70     65     74     62     43     64     69     55     44     64     73     73     69     69     55     41
          11-4   11-4   11-4   11-4   11-4   11-4   10-11  11-4   11-4   11-4   2-3    11-4   11-4   11-4   11-4   11-4   11-4   2-3
VALUE     75     72     72     64     76     59     45     63     73     55     45     66     73     74     69     65     57     42
          12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   12-5   2-3    10-8   12-5   12-5   12-5   12-5   12-5   3-8
DENSE     68     71     65     60     67     61     42     65     66     49     44     64     70     64     67     64     55     40
          11-10  11-10  11-10  10-8   12-10  11-10  10-8   11-10  11-10  9-5    3-12   11-10  11-10  12-5   11-10  11-10  11-10  3-12

Table 1: Adjacent-branching baseline and maximum UUAS decoding accuracy per PUD treebank, expressed as best score and best layer-head combination for UUAS decoding (the layer-head of the best head is given beneath each score). PRE refers to the basic mBERT model before fine-tuning, while all rows below correspond to different fine-tuned models described in Section 3.4. Best score indicated in bold.
We fine-tune each of these models on a concatenation of all PUD treebanks for 20 epochs, which effectively makes our model multilingual. We do so in order to 1) control for domain and annotation confounds, since all PUD sentences are parallel and are natively annotated (unlike converted UD treebanks, for instance); 2) increase the number of training samples for fine-tuning, as each PUD treebank features only 1000 sentences; and 3) induce a better parser through multilinguality, as in Kondratyuk and Straka (2019b). Furthermore, in order to gauge the overall performance of our parser across all ablated settings, we evaluate on the test set of the largest non-PUD treebank available for each language, since PUD only features test partitions. When training, we employ a combined dense/sparse Adam optimiser, at a learning rate of 3 × 10^-5. We rescale gradients to have a maximum norm of 5.

4 Decoding mBERT Attention

The second row of Table 1 (PRE) depicts the UUAS after running our decoding algorithm over mBERT attention matrices, per language. We see a familiar pattern to that in Clark et al. (2019) among others — namely that attention patterns extracted directly from mBERT appear to be incapable of decoding dependency trees beyond a threshold of 50–60% UUAS accuracy. However, we also note that, in all languages, the attention-decoding algorithm outperforms a BASELINE (row 1) that draws an (undirected) edge between any two adjacent words in linear order, which implies that some non-linear structures are captured with regularity. Indeed, head 8 in layer 10 appears to be particularly strong in this regard, returning the highest UUAS for 7 languages.

Figure 1: UUAS of MST decoding per layer and head, across languages. Heads (y-axis) are sorted by accuracy for easier visualization.

Interestingly, the accuracy patterns across layers depicted in Figure 1 tend to follow an identical trend for all languages, with nearly all heads in layer 7 returning high within-language accuracies.
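The adjacent-branching BASELINE and its UUAS can be made concrete with a few lines of code. This sketch is ours, not the evaluation script used in the paper, and it assumes 0-indexed head lists with -1 marking the root.

```python
def adjacent_baseline_uuas(gold_heads):
    """UUAS of a chain tree that links every pair of linearly adjacent words."""
    n = len(gold_heads)
    pred = {frozenset((i, i + 1)) for i in range(n - 1)}
    gold = {frozenset((i, h)) for i, h in enumerate(gold_heads) if h >= 0}
    return len(pred & gold) / len(gold)

# Heads [1, -1, 1, 1] yield gold edges {0-1, 1-2, 1-3}; the chain recovers
# two of the three, so the baseline scores about 0.67 on this sentence.
print(adjacent_baseline_uuas([1, -1, 1, 1]))
```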
It appears that attention for some languages (Arabic, Czech, Korean, Turkish) is comparatively easier to decode than others (French, Italian, Japanese, Chinese). A possible explanation for this result is that dependency relations between content words, which are favored by the UD annotation, are more likely to be adjacent in the morphologically rich languages of the first group (without intervening function words). This assumption seems to be corroborated by the high baseline scores for Arabic, Korean and Turkish (but not Czech). Conversely, the low baseline scores and the likewise low decoding accuracies for the latter four languages are difficult to characterize. Indeed, we could not identify what factors — typological, annotation, tokenization or otherwise — would set French and Italian apart from the remaining languages in terms of score. However, we hypothesize that the tokenization and our treatment of subword tokens plays a part in attempting to decode attention from Chinese and Japanese representations. Per the mBERT documentation,3 Chinese and Japanese Kanji character spans within the CJK Unicode range are character-tokenized. This lies in contrast with all other languages (Korean Hangul and Japanese Hiragana and Katakana included), which rely on whitespace and WordPiece (Wu et al., 2016). It is thus possible that the attention distributions for these two languages (at least where CJK characters are relevant) are devoted to composing words, rather than structural relations, which will distort the attention matrices that we compute to correspond with gold tokenization (e.g. by maxing rows and averaging columns).

3 https://github.com/google-research/bert/blob/master/multilingual.md

Figure 2: Left: UUAS per relation across languages (best layer/head combination indicated in cell). Right: Best UUAS as a function of best positional baseline (derived from the treebank), selected relations.

Relation analysis  We can disambiguate what sort of structures are captured with regularity by looking at the UUAS returned per dependency relation. Figure 2 (left) shows that adjectival modifiers (amod, mean UUAS = 85 ± 12) and determiners (det, 88 ± 6) are among the easiest relations to decode across languages. Indeed, words that are connected by these relations are often adjacent to each other and may be simple to decode if a head is primarily concerned with tracking linear order. To verify the extent to which this might be happening, we plot the aforementioned decoding accuracy as a function of select relations' positional baselines in Figure 2 (right). The positional baselines, in this case, are calculated by picking the most frequent offset at which a dependent occurs with respect to its head, e.g., −1 for det in English, meaning one position to the left of the head. Interestingly, while we observe significant variation across the positional baselines for amod and det, the decoding accuracy remains quite high.
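The positional baselines used in Figure 2 (right) are straightforward to compute from a treebank. The sketch below is a minimal illustration under an assumed input format (per-sentence lists of (head, relation) pairs, 0-indexed, with -1 for the root); it is not the authors' evaluation code, and the toy data is invented.

```python
from collections import Counter, defaultdict

def positional_baselines(sentences):
    """Most frequent dependent-minus-head offset per relation, and how often
    that single offset is correct (the per-relation positional baseline)."""
    offsets = defaultdict(Counter)
    for sent in sentences:
        for dep, (head, rel) in enumerate(sent):
            if head >= 0:
                offsets[rel][dep - head] += 1
    baselines = {}
    for rel, counter in offsets.items():
        offset, count = counter.most_common(1)[0]
        baselines[rel] = (offset, count / sum(counter.values()))
    return baselines

toy = [
    [(1, "det"), (2, "nsubj"), (-1, "root"), (2, "obj")],
    [(1, "det"), (2, "nsubj"), (-1, "root")],
]
print(positional_baselines(toy))
# {'det': (-1, 1.0), 'nsubj': (-1, 1.0), 'obj': (1, 1.0)}
```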
In slight contrast to this, the core subject (nsubj, 58 ± 16 SD) and object (obj, 64 ± 13) relations prove to be more difficult to decode. Unlike the aforementioned relations, nsubj and obj are much more sensitive to the word order properties of the language at hand. For example, while a language like English, with Subject-Verb-Object (SVO) order, might have the subject frequently appear to the left of the verb, an SOV language like Hindi might have it several positions further away, with an object and its potential modifiers intervening. Indeed, the best positional baseline for English nsubj is 39 UUAS, while it is only 10 for Hindi. Despite this variation, the relation seems to be tracked with some regularity by the same head (layer 3, head 9), returning 60 UUAS for English and 52 for Hindi. The same can largely be said for obj, where the positional baselines return 51 ± 18. In this latter case, however, the heads tend to be much differently distributed across languages. Finally, the results for the obj relation provide some support for our earlier explanation concerning morphologically rich languages, as Arabic, Czech, Korean and Turkish all have among the highest accuracies (as well as positional baselines).

5 Fine-Tuning Experiments

Next, we investigate the effect fine-tuning has on UUAS decoding. Row 3 in Table 1 (NONE) indicates that fine-tuning does result in large improvements to UUAS decoding across most languages, often by margins as high as ∼30%. This shows that with an explicit parsing objective, attention heads are capable of serving as explanatory mechanisms for syntax; syntactic structure can be made to be transparently stored in the heads, in a manner that does not require additional probe fitting or parameterized transformation to extract.

Given that we do manage to decode reasonable syntactic trees, we can then refine our question — what components are capable of learning these trees? One obvious candidate is the key/query component pair, given that attention weights are a scaled softmax of a composition of the two. Figure 3 (top) shows the difference between pretrained UUAS and fine-tuned UUAS per layer, across models and languages. Interestingly, the best parsing accuracies do not appear to vary much depending on what component is frozen. We do see a clear trend, however, in that decoding the attention patterns of the fine-tuned model typically yields better UUAS than the pretrained model, particularly in the highest layers. Indeed, the lowest layer at which fine-tuning appears to improve decoding is layer 7. This implies that, regardless of which component remains frozen, the parameters facing any sort of significant and positive update tend to be those appearing towards the higher end of the network, closer to the output.

For the frozen components, the best improvements in UUAS are seen at the final layer in VALUE, which is also the only model that shows consistent improvement, as well as the highest average improvement in mean scores4 for the last few layers. Perhaps most interestingly, the mean UUAS (Figure 3 (bottom)) for our "attentive" components – keys, queries, and their combination – does not appear to have improved by much after fine-tuning. In contrast, the maximum does show considerable improvement; this seems to imply that although all components appear to be more or less equally capable of learning decodable heads, the attentive components, when fine-tuned, appear to sharpen fewer heads.

4 The inner average is over all heads; the outer is over all languages.

Note that the only difference between keys and queries in an attention mechanism is that keys are transposed to index attention from/to appropriately. Surprisingly, KEY and QUERY appear to act somewhat differently, with QUERY being almost uniformly better than KEY with the best heads, whilst KEY is slightly better with averages, implying distinctions in how both store information. Furthermore, allowing both keys and queries seems to result in an interesting contradiction – the ultimate layer, which has reasonable maximums and averages for both KEY and QUERY, now seems to show a UUAS drop almost uniformly. This is also true for the completely unfrozen encoder.

Supervised Parsing  In addition to decoding trees from attention matrices, we also measure supervised UAS/LAS on a held-out test set.5 Based on Figure 4, it is apparent that all settings result in generally the same UAS. This is somewhat expected; Lauscher et al. (2020) see better results on parsing with the entire encoder frozen, implying that the task is easy enough for a biaffine parser to learn, given frozen mBERT representations.6 The LAS distinction is, however, rather interesting: there is a marked difference between how important the dense layers are, as opposed to the attentive components. This is likely not reflected in our UUAS probe as, strictly speaking, labelling arcs is not equivalent to searching for structure in sentences, but more akin to classifying pre-identified structures. We also note that DENSE appears to be better than NONE on average, implying that non-dense components might actually be hurting labelling capacity.

5 Note that the test set in our scenario is from the actual, non-parallel language treebank; as such, we left Korean out of this comparison due to annotation differences.
6 Due to training on concatenated PUD sets, however, our results are not directly comparable.
Figure 3: (Top) best scores across all heads, per language; (bottom) mean scores across all heads, per language. The languages (hidden from the X-axis for brevity) are, in order, ar, cs, de, en, es, fi, fr, hi, id, it, ja, ko, pl, pt, ru, sv, tr, zh.

Figure 4: Mean UAS and LAS when evaluating different models on language-specific treebanks (Korean excluded due to annotation differences). MBERT refers to models where the entire mBERT network is frozen as input to the parser.

In brief, consolidating the two sets of results above, we can draw three interesting conclusions about the components:

1. Value vectors are best aligned with syntactic dependencies; this is reflected both in the best head at the upper layers, and the average score across all heads.
2. Dense layers appear to have moderate informative capacity, but appear to have the best learning capacity for the task of arc labelling.
3. Perhaps most surprisingly, Key and Query vectors do not appear to make any outstanding contributions, save for sharpening a smaller subset of heads.

Our last result is especially surprising for UUAS decoding. Keys and queries, fundamentally, combine to form the attention weight matrix, which is precisely what we use to decode trees. One would expect that allowing these components to learn from labelled syntax would result in the best improvements to decoding, but all three have surprisingly negligible mean improvements. This indicates that we need to further improve our understanding of how attentive structure and weighting really works.

Cross-linguistic observations  We notice no clear cross-linguistic trends here across different component sets; however, certain languages do stand out as being particularly hard to decode from the fine-tuned parser. These include Japanese, Korean, Chinese, French and Turkish. For the first three, we hypothesise that tokenization clashes with mBERT's internal representations may play a role.
Indeed, as we hypothesized in Section 3.2, it could be the case that the composition of CJK characters into gold tokens for Chinese and Japanese may degrade the representations (and their corresponding attention) therein. Furthermore, for Japanese and Korean specifically, it has been observed that tokenization strategies employed by different treebanks could drastically influence the conclusions one may draw about their inherent hierarchical structure (Kulmizev et al., 2020). Turkish and French are admittedly more difficult to diagnose. Note, however, that we fine-tuned our model on a concatenation of all PUD treebanks. As such, any deviation from PUD's annotation norms is therefore likely to be heavily penalised, by virtue of signal from other languages drowning out these differences.

6 Conclusion

In this study, we revisited the prospect of decoding dependency trees from the self-attention patterns of Transformer-based language models. We elected to extend our experiments to 18 languages in order to gain better insight about how tree decoding accuracy might be affected in the face of (modest) typological diversity. Surprisingly, across all languages, we were able to decode dependency trees from attention patterns more accurately than an adjacent-linking baseline, implying that some structure was indeed being tracked by the mechanism. In looking at specific relation types, we corroborated previous studies in showing that particular layer-head combinations tracked the same relation with regularity across languages, despite typological differences concerning word order, etc.

In investigating the extent to which attention can be guided to properly capture structural relations between input words, we fine-tuned mBERT as input to a dependency parser. This, we found, yielded large improvements over the pretrained attention patterns in terms of decoding accuracy, demonstrating that the attention mechanism was learning to represent the structural objective of the parser. In addition to fine-tuning the entire mBERT network, we conducted a series of experiments, wherein we updated only select components of the model and left the remainder frozen. Most surprisingly, we observed that the Transformer parameters designed for composing the attention matrix, K and Q, were only modestly capable of guiding the attention towards resembling the dependency structure. In contrast, it was the Value (V) parameters, which are used for computing a weighted sum over the KQ-produced attention, that yielded the most faithful representations of the linguistic structure via attention.

Though prior work (Kovaleva et al., 2019; Zhao and Bethard, 2020) seems to indicate that there is a lack of a substantial change in attention patterns after fine-tuning on syntax- and semantics-oriented classification tasks, the opposite effect has been observed with fine-tuning on negation scope resolution, where a more explanatory attention mechanism can be induced (Htut et al., 2019). Our results are similar to the latter, and we demonstrate that given explicit syntactic annotation, attention weights do end up storing more transparently decodable structure. It is, however, still unclear which sets of transformer parameters are best suited for learning this information and storing it in the form of attention.

Acknowledgements

Our experiments were run on resources provided by UNINETT Sigma2 - the National Infrastructure for High Performance Computing and Data Storage in Norway, under the NeIC-NLPL umbrella. Mostafa and Anders were funded by a Google Focused Research Award. We would like to thank Daniel Dakota and Ali Basirat for some fruitful discussions and the anonymous reviewers for their excellent feedback.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv:2003.11080 [cs].

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 479–486.

Dan Kondratyuk and Milan Straka. 2019a. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Dan Kondratyuk and Milan Straka. 2019b. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4364–4373, Hong Kong, China. Association for Computational Linguistics.

Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.

Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual Transformers. arXiv:2005.00633 [cs].

Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. 2020. Universal Dependencies according to BERT: Both more specific and more general. arXiv preprint arXiv:2004.14620.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Dan Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC).

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Alessandro Raganato, Jörg Tiedemann, et al. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. The Association for Computational Linguistics.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Rudolf Rosa and David Mareček. 2019. Inducing syntactic trees from BERT representations. arXiv preprint arXiv:1906.11511.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. arXiv:1706.03762 [cs].

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.

Yiyun Zhao and Steven Bethard. 2020. How does BERT's attention change when you fine-tune? An analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Figure 5: Positional scores across relations for all languages.
Figure 6: Decoding UUAS as a function of best positional baselines.

Figure 7: Parsing scores across components and languages.