What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
Timothee Mickus, Université de Lorraine, CNRS, ATILF (tmickus@atilf.fr)
Denis Paperno, Utrecht University (d.paperno@uu.nl)
Mathieu Constant, Université de Lorraine, CNRS, ATILF (mconstant@atilf.fr)
Kees van Deemter, Utrecht University (c.j.vandeemter@uu.nl)

arXiv:1911.05758v2 [cs.CL] 8 May 2020

Abstract

Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.

1 Introduction

A recent success story of NLP, BERT (Devlin et al., 2018) stands at the crossroad of two key innovations that have brought about significant improvements over previous state-of-the-art results. On the one hand, BERT models are an instance of contextual embeddings (McCann et al., 2017; Peters et al., 2018), which have been shown to be subtle and accurate representations of words within sentences. On the other hand, BERT is a variant of the Transformer architecture (Vaswani et al., 2017), which has set a new state-of-the-art on a wide variety of tasks ranging from machine translation (Ott et al., 2018) to language modeling (Dai et al., 2019). BERT-based models have significantly increased state-of-the-art over the GLUE benchmark for natural language understanding (Wang et al., 2019b), and most of the best scoring models for this benchmark include or elaborate on BERT. Using BERT representations has become in many cases a new standard approach: for instance, all submissions at the recent shared task on gendered pronoun resolution (Webster et al., 2019) were based on BERT. Furthermore, BERT serves both as a strong baseline and as a basis for a fine-tuned state-of-the-art word sense disambiguation pipeline (Wang et al., 2019a).

Analyses aiming to understand the mechanical behavior of Transformers in general, and BERT in particular, have suggested that they compute word representations through implicitly learned syntactic operations (Raganato and Tiedemann, 2018; Clark et al., 2019; Coenen et al., 2019; Jawahar et al., 2019, a.o.): representations computed through the 'attention' mechanisms of Transformers can arguably be seen as weighted sums of intermediary representations from the previous layer, with many attention heads assigning higher weights to syntactically related tokens (however, contrast with Brunner et al., 2019; Serrano and Smith, 2019).

Complementing these previous studies, in this paper we adopt a more theory-driven lexical semantic perspective. While a clear parallel was established between 'traditional' noncontextual embeddings and the theory of distributional semantics (a.o. Lenci, 2018; Boleda, 2019), this link is not automatically extended to contextual embeddings: some authors (Westera and Boleda, 2019) even explicitly consider only "context-invariant" representations as distributional semantics. Hence we study to what extent BERT, as a contextual embedding architecture, satisfies the properties expected from a natural contextualized extension of distributional semantics models (DSMs).

DSMs assume that meaning is derived from use in context. DSMs are nowadays systematically represented using vector spaces (Lenci, 2018). They generally map each word in the domain of the model to a numeric vector on the basis of distributional criteria; vector components are inferred from text data. DSMs have also been computed for linguistic items other than words, e.g., word senses—based both on meaning inventories (Rothe and Schütze, 2015) and word sense induction techniques (Bartunov et al., 2015)—or meaning exemplars (Reisinger and Mooney, 2010; Erk and Padó, 2010; Reddy et al., 2011). The default approach has however been to produce representations for word types. Word properties encoded by DSMs vary from morphological information (Marelli and Baroni, 2015; Bonami and Paperno, 2018) to geographic information (Louwerse and Zwaan, 2009), to social stereotypes (Bolukbasi et al., 2016) and to referential properties (Herbelot and Vecchi, 2015).
A reason why contextualized embeddings have not been equated to distributional semantics may lie in that they are "functions of the entire input sentence" (Peters et al., 2018). Whereas traditional DSMs match word types with numeric vectors, contextualized embeddings produce distinct vectors per token. Ideally, the contextualized nature of these embeddings should reflect the semantic nuances that context induces in the meaning of a word—with varying degrees of subtlety, ranging from broad word-sense disambiguation (e.g. 'bank' as a river embankment or as a financial institution) to narrower subtypes of word usage ('bank' as a corporation or as a physical building) and to more context-specific nuances.

Regardless of how apt contextual embeddings such as BERT are at capturing increasingly finer semantic distinctions, we expect the contextual variation to preserve the basic DSM properties. Namely, we expect that the space structure encodes meaning similarity and that variation within the embedding space is semantic in nature. Similar words should be represented with similar vectors, and only semantically pertinent distinctions should affect these representations.

We connect our study with previous work in section 2 before detailing the two approaches we followed. First, we verify in section 3 that BERT embeddings form natural clusters when grouped by word types, which on any account should be groups of similar words and thus be assigned similar vectors. Second, we test in sections 4 and 5 that contextualized word vectors do not encode semantically irrelevant features: in particular, leveraging some knowledge from the architectural design of BERT, we address whether BERT representations in odd and even sentences of running text are free of systematic differences—a property we refer to as cross-sentence coherence. In section 4, we test whether we can observe cross-sentence coherence for single tokens, whereas in section 5 we study to what extent incoherence of representations across sentences affects the similarity structure of the semantic space. We summarize our findings in section 6.

2 Theoretical background & connections

Word embeddings have been said to be 'all-purpose' representations, capable of unifying the otherwise heterogeneous domain that is NLP (Turney and Pantel, 2010). To some extent this claim spearheaded the evolution of NLP: focus recently shifted from task-specific architectures with limited applicability to universal architectures requiring little to no adaptation (Radford, 2018; Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019; Liu et al., 2019, a.o.).

Word embeddings are linked to the distributional hypothesis, according to which "you shall know a word from the company it keeps" (Firth, 1957). Accordingly, the meaning of a word can be inferred from the effects it has on its context (Harris, 1954); as this framework equates the meaning of a word to the set of its possible usage contexts, it corresponds more to holistic theories of meaning (Quine, 1960, a.o.) than to truth-value accounts (Frege, 1892, a.o.). In early works on word embeddings (Bengio et al., 2003, e.g.), a straightforward parallel between word embeddings and distributional semantics could be made: the former are distributed representations of word meaning, the latter a theory stating that a word's meaning is drawn from its distribution. In short, word embeddings could be understood as a vector-based implementation of the distributional hypothesis. This parallel is much less obvious for contextual embeddings: are constantly changing representations truly an apt description of the meaning of a word?

More precisely, the literature on distributional semantics has put forth and discussed many mathematical properties of embeddings: embeddings are equivalent to count-based matrices (Levy and Goldberg, 2014b), expected to be linearly dependent (Arora et al., 2016), expressible as a unitary matrix (Smith et al., 2017) or as a perturbation of an identity matrix (Yin and Shen, 2018). All these properties have however been formalized for non-contextual embeddings: they were formulated using the tools of matrix algebra, under the assumption that embedding matrix rows correspond to word types. Hence they cannot be applied as such to contextual embeddings. This disconnect in the literature leaves unanswered the question of what consequences there are to framing contextualized embeddings as DSMs.
The analyses that contextual embeddings have been subjected to differ from most analyses of distributional semantics models. Peters et al. (2018) analyzed through an extensive ablation study of ELMo what information is captured by each layer of their architecture. Devlin et al. (2018) discussed what part of their architecture is critical to the performances of BERT, comparing pre-training objectives, number of layers and training duration. Other works (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019; Clark et al., 2019; Voita et al., 2019; Michel et al., 2019) have introduced specific procedures to understand how attention-based architectures function on a mechanical level. Recent research has however questioned the pertinence of these attention-based analyses (Serrano and Smith, 2019; Brunner et al., 2019); moreover these works have focused more on the inner workings of the networks than on their adequacy with theories of meaning.

One trait of DSMs that is very often encountered, discussed and exploited in the literature is the fact that the relative positions of embeddings are not random. Early vector space models, by design, required that words with similar meanings lie near one another (Salton et al., 1975); as a consequence, regions of the vector space describe coherent semantic fields.¹ Despite the importance of this characteristic, the question whether BERT contextual embeddings depict a coherent semantic space on their own has been left mostly untouched by papers focusing on analyzing BERT or Transformers (with some exceptions, e.g. Coenen et al., 2019).

¹ Vectors encoding contrasts between words are also expected to be coherent (Mikolov et al., 2013b), although this assumption has been subjected to criticism (Linzen, 2016).

Moreover, many analyses of how meaning is represented in attention-based networks or contextual embeddings include "probes" (learned models such as classifiers) as part of their evaluation setup to 'extract' information from the embeddings (Peters et al., 2018; Tang et al., 2018; Coenen et al., 2019; Chang and Chen, 2019, e.g.). Yet this methodology has been criticized as potentially conflicting with the intended purpose of studying the representations themselves (Wieting and Kiela, 2019; Cover, 1965); cf. also Hewitt and Liang (2019) for a discussion. We refrain from using learned probes in favor of a more direct assessment of the coherence of the semantic space.

3 Experiment 1: Word Type Cohesion

The trait of distributional spaces that we focus on in this study is that similar words should lie in similar regions of the semantic space. This should hold all the more so for identical words, which ought to be maximally similar. By design, contextualized embeddings like BERT exhibit variation within vectors corresponding to identical word types. Thus, if BERT is a DSM, we expect that word types form natural, distinctive clusters in the embedding space. Here, we assess the coherence of word type clusters by means of their silhouette scores (Rousseeuw, 1987).

3.1 Data & Experimental setup

Throughout our experiments, we used the Gutenberg corpus as provided by the NLTK platform, out of which we removed older texts (King John's Bible and Shakespeare). Sentences are enumerated two by two; each pair of sentences is then used as a distinct input source for BERT. As we treat the BERT algorithm as a black box, we retrieve only the embeddings from the last layer, discarding all intermediary representations and attention weights. We used the bert-large-uncased model in all experiments²; therefore most of our experiments are done on word-pieces.

² Measurements were conducted before the release of the bert-large-uncased-whole-words model.

To study the basic coherence of BERT's semantic space, we can consider types as clusters of tokens—i.e. specific instances of contextualized embeddings—and thus leverage the tools of cluster analysis. In particular, the silhouette score is generally used to assess whether a specific observation \vec{v} is well assigned to a given cluster C_i drawn from a set of possible clusters C. The silhouette score is defined in eq. 1:

sep(\vec{v}, C_i) = \min_{C_j \in C \setminus \{C_i\}} \operatorname{mean}_{\vec{v}' \in C_j} d(\vec{v}, \vec{v}')
coh(\vec{v}, C_i) = \operatorname{mean}_{\vec{v}' \in C_i \setminus \{\vec{v}\}} d(\vec{v}, \vec{v}')
silh(\vec{v}, C_i) = \frac{sep(\vec{v}, C_i) - coh(\vec{v}, C_i)}{\max\{sep(\vec{v}, C_i), coh(\vec{v}, C_i)\}}    (1)

We used Euclidean distance for d. In our case, observations \vec{v} therefore correspond to tokens (that is, word-piece tokens), and clusters C_i to types.

Silhouette scores consist in computing for each vector observation \vec{v} a cohesion score (viz. the average distance to other observations in the cluster C_i) and a separation score (viz. the minimal average distance to the observations of any other cluster, i.e. the minimal 'cost' of assigning \vec{v} to any other cluster than C_i). Optimally, cohesion is to be minimized and separation is to be maximized, and this is reflected in the silhouette score itself: scores are defined between -1 and 1; -1 denotes that the observation \vec{v} should be assigned to another cluster than C_i, whereas 1 denotes that the observation \vec{v} is entirely consistent with the cluster C_i. Keeping track of silhouette scores for a large number of vectors quickly becomes intractable, hence we use a slightly modified version of the above definition, and compute separation and cohesion using the distance to the average vector of a cluster rather than the average distance to other vectors in a cluster, as suggested by Vendramin et al. (2013). Though results are not entirely equivalent, as they ignore the inner structure of clusters, they still present a gross view of the consistency of the vector space under study.

We do note two caveats with our proposed methodology. Firstly, BERT uses subword representations, and thus BERT tokens do not necessarily correspond to words. However we may conjecture that some subwords exhibit coherent meanings, based on whether they tightly correspond to morphemes—e.g. '##s', '##ing' or '##ness'. Secondly, we group word types based on character strings; yet only monosemous words should describe perfectly coherent clusters—whereas we expect some degree of variation for polysemous words and homonyms according to how widely their meanings may vary.
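To make the measure concrete, the sketch below computes such type-level silhouette scores with NumPy. It is an illustration rather than the authors' implementation: it assumes that last-layer token embeddings have already been extracted and grouped by word-piece type into a dictionary of arrays, and it uses the centroid-based simplification described above in place of the full pairwise-distance definition.

```python
import numpy as np

def silhouette_scores(type_clusters):
    """Centroid-based silhouette scores (cf. eq. 1) for tokens grouped by type.

    type_clusters: dict mapping a word-piece type to an array of shape
    (n_tokens, dim) with the contextualized embeddings of its tokens.
    Returns a dict mapping each type to the per-token silhouette scores.
    """
    # Average vector of each cluster, standing in for all pairwise distances.
    centroids = {t: vecs.mean(axis=0) for t, vecs in type_clusters.items()}
    scores = {}
    for t, vecs in type_clusters.items():
        others = np.stack([c for u, c in centroids.items() if u != t])
        # cohesion: distance to the centroid of the token's own type
        coh = np.linalg.norm(vecs - centroids[t], axis=1)
        # separation: distance to the closest centroid of any other type
        sep = np.linalg.norm(vecs[:, None, :] - others[None], axis=2).min(axis=1)
        scores[t] = (sep - coh) / np.maximum(sep, coh)
    return scores

# Toy usage with random vectors standing in for BERT embeddings.
rng = np.random.default_rng(0)
toy = {w: rng.normal(size=(30, 8)) + i for i, w in enumerate(["dog", "bank", "##ing"])}
print({w: round(float(s.mean()), 3) for w, s in silhouette_scores(toy).items()})
```

With actual BERT embeddings, a negative average score for a type indicates tokens that lie closer to some other type's centroid than to their own.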
3.2 Results & Discussion

We compared cohesion to separation scores using a paired Student's t-test, and found a significant effect (p-value < 2.2 · 10^-16). This highlights that cohesion scores are lower than separation scores. The effect size as measured by Cohen's d (d = -0.121) is however rather small, suggesting that cohesion scores are only 12% lower than separation scores. More problematically, we can see in figure 1 that 25.9% of the tokens have a negative silhouette score: one out of four tokens would be better assigned to some other type than the one they belong to. When aggregating scores by types, we found that 10% of types contained only tokens with negative silhouette scores.

[Figure 1: Distribution of token silhouette scores]

The standards we expect of DSMs are not always upheld strictly; the median and mean score are respectively at 0.08 and 0.06, indicating a general trend of low scores, even when they are positive. We previously noted that both the use of subword representations in BERT and polysemy and homonymy might impact these results. The amount of meaning variation induced by polysemy and homonymy can be estimated by using a dictionary as a sense inventory. The number of distinct entries for a type serves as a proxy measure of how much its meaning varies in use. We thus used a linear model to predict silhouette scores with log-scaled frequency and log-scaled definition counts, as listed in the Wiktionary, as predictors. We selected tokens for which we found at least one entry in the Wiktionary, out of which we then randomly sampled 10000 observations. Both definition counts and frequency were found to be significant predictors, leading the silhouette score to decrease. This suggests that polysemy degrades the cohesion score of the type cluster, which is compatible with what one would expect from a DSM. We moreover observed that monosemous words yielded higher silhouette scores than polysemous words (p < 2.2 · 10^-16, Cohen's d = 0.236), though they still include a substantial number of tokens with negative silhouette scores.

Similarity also includes related words, and not only tokens of the same type. Other studies (Vial et al., 2019; Coenen et al., 2019, e.g.) already stressed that BERT embeddings perform well on word-level semantic tasks. To directly assess whether BERT captures this broader notion of similarity, we used the MEN word similarity dataset (Bruni et al., 2014), which lists pairs of English words with human annotated similarity ratings. We removed pairs containing words for which we had no representation, leaving us with 2290 pairs. We then computed the Spearman correlation between similarity ratings and the cosine of the average BERT embeddings of the two paired word types, and found a correlation of 0.705, showing that cosine similarity of average BERT embeddings encodes semantic similarity. For comparison, a word2vec DSM (Mikolov et al., 2013a, henceforth W2V) trained on BooksCorpus (Zhu et al., 2015) using the same tokenization as BERT achieved a correlation of 0.669.
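The word-similarity check can be sketched as follows. The snippet is illustrative only: it assumes that per-type average BERT vectors are already available in a dictionary, and that the MEN file contains one "word1 word2 rating" triple per line (the exact file format is an assumption, not something specified here).

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def men_correlation(men_path, type_vectors):
    """Spearman correlation between MEN ratings and cosines of average
    per-type embeddings; pairs with a missing vector are skipped."""
    human, model = [], []
    with open(men_path) as men_file:
        for line in men_file:
            word1, word2, rating = line.split()  # assumed "word1 word2 rating" lines
            if word1 in type_vectors and word2 in type_vectors:
                human.append(float(rating))
                model.append(cosine(type_vectors[word1], type_vectors[word2]))
    rho, _ = spearmanr(human, model)
    return rho

# type_vectors: dict mapping a word to its average contextualized embedding.
```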
4 Experiment 2: Cross-Sentence Coherence

As observed in the previous section, overall the word type coherence in BERT tends to match our basic expectations. In this section, we do further tests, leveraging our knowledge of the design of BERT. We detail the effects of jointly using segment encodings to distinguish between paired input sentences and residual connections.

4.1 Formal approach

We begin by examining the architectural design of BERT. We give some elements relevant to our study here and refer the reader to the original papers by Vaswani et al. (2017) and Devlin et al. (2018), introducing Transformers and BERT, for a more complete description. On a formal level, BERT is a deep neural network composed of superposed layers of computations. Each layer is composed of two "sub-layers": the first performing "multi-head attention", the second being a simple feed-forward network. Throughout all layers, after each sub-layer, residual connections and layer normalization are applied, thus the intermediary output \vec{o}^L after sub-layer L can be written as a function of the input \vec{x}^L, as \vec{o}^L = \mathrm{LayerNorm}(\mathrm{Sub}^L(\vec{x}^L) + \vec{x}^L).

BERT is optimized on two training objectives. The first, called masked language model, is a variation on the Cloze test for reading proficiency (Taylor, 1953). The second, called next sentence prediction (NSP), corresponds to predicting whether two sentences are found one next to the other in the original corpus or not. Each example passed as input to BERT is comprised of two sentences, either contiguous sentences from a document, or randomly selected sentences. A special token [SEP] is used to indicate sentence boundaries, and the full sentence is prepended with a second special token [CLS] used to perform the actual prediction for NSP. Each token is transformed into an input vector using an input embedding matrix. To distinguish between tokens from the first and the second sentence, the model adds a learned feature vector \vec{seg}_A to all tokens from first sentences, and a distinct learned feature vector \vec{seg}_B to all tokens from second sentences; these feature vectors are called 'segment encodings'. Lastly, as Transformer models do not have an implicit representation of word order, information regarding the index i of the token in the sentence is added using a positional encoding p(i). Therefore, if the initial training example was "My dog barks. It is a pooch.", the actual input would correspond to the following sequence of vectors:

\vec{[CLS]} + \vec{seg}_A + p(0), \vec{My} + \vec{seg}_A + p(1), \vec{dog} + \vec{seg}_A + p(2), \vec{barks} + \vec{seg}_A + p(3), \vec{.} + \vec{seg}_A + p(4), \vec{[SEP]} + \vec{seg}_A + p(5),
\vec{It} + \vec{seg}_B + p(6), \vec{is} + \vec{seg}_B + p(7), \vec{a} + \vec{seg}_B + p(8), \vec{pooch} + \vec{seg}_B + p(9), \vec{.} + \vec{seg}_B + p(10), \vec{[SEP]} + \vec{seg}_B + p(11)
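A toy NumPy sketch of how these three components are combined is given below. The dimensions and the randomly initialized lookup tables are stand-ins for BERT's learned word, segment and position embedding matrices, which are not reproduced here; only the additive composition mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16                                   # toy dimension; BERT-large uses 1024
tokens = ["[CLS]", "my", "dog", "barks", ".", "[SEP]",
          "it", "is", "a", "pooch", ".", "[SEP]"]
segments = ["A"] * 6 + ["B"] * 6           # first vs. second sentence

# Stand-ins for the learned lookup tables (word, segment and position embeddings).
word_emb = {w: rng.normal(size=dim) for w in set(tokens)}
seg_emb = {"A": rng.normal(size=dim), "B": rng.normal(size=dim)}
pos_emb = [rng.normal(size=dim) for _ in range(len(tokens))]

# Input vector of token i: word embedding + segment encoding + positional encoding.
inputs = np.stack([word_emb[w] + seg_emb[s] + pos_emb[i]
                   for i, (w, s) in enumerate(zip(tokens, segments))])
print(inputs.shape)                        # (12, 16)
```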
Due to the general use of residual connections, marking the sentences using the segment encodings \vec{seg}_A and \vec{seg}_B can introduce a systematic offset within sentences. Consider that the first layer uses as input vectors corresponding to word, position, and sentence information: \vec{w} + \vec{seg}_i + p(i); for simplicity, let \vec{\imath}_i = \vec{w} + p(i); we also ignore the rest of the input as it does not impact this reformulation. The output from the first sub-layer \vec{o}^1_i can be written:

\vec{o}^1_i = \mathrm{LayerNorm}(\mathrm{Sub}^1(\vec{\imath}_i + \vec{seg}_i) + \vec{\imath}_i + \vec{seg}_i)
         = \vec{b}^1 + \frac{1}{\sigma^1_i} \vec{g}^1 \odot \mathrm{Sub}^1(\vec{\imath}_i + \vec{seg}_i) + \frac{1}{\sigma^1_i} \vec{g}^1 \odot \vec{\imath}_i - \frac{1}{\sigma^1_i} \vec{g}^1 \odot \mu(\mathrm{Sub}^1(\vec{\imath}_i + \vec{seg}_i) + \vec{\imath}_i + \vec{seg}_i) + \frac{1}{\sigma^1_i} \vec{g}^1 \odot \vec{seg}_i
         = \tilde{o}^1_i + \frac{1}{\sigma^1_i} \vec{g}^1 \odot \vec{seg}_i    (2)

This equation is obtained by simply injecting the definition of layer normalization.³ Therefore, by recurrence, the final output \vec{o}^L_i for a given token \vec{w} + \vec{seg}_i + p(i) can be written as:

\vec{o}^L_i = \tilde{o}^L_i + \left( \prod_{l=1}^{L} \frac{1}{\sigma^l_i} \right) \left( \bigodot_{l=1}^{L} \vec{g}^l \right) \odot \vec{seg}_i    (3)

³ Layer normalization after sub-layer l is defined as \mathrm{LayerNorm}_l(\vec{x}) = \vec{b}_l + \frac{\vec{g}_l}{\sigma} \odot (\vec{x} - \mu(\vec{x})) = \vec{b}_l + \frac{1}{\sigma} \vec{g}_l \odot \vec{x} - \frac{1}{\sigma} \vec{g}_l \odot \mu(\vec{x}), where \vec{b}_l is a bias, \odot denotes element-wise multiplication, \vec{g}_l is a "gain" parameter, \sigma is the standard deviation of the components of \vec{x}, and \mu(\vec{x}) = \langle \bar{x}, \ldots, \bar{x} \rangle is a vector with all components defined as the mean of the components of \vec{x}.

This rewriting trick shows that segment encodings are partially preserved in the output. All embeddings within a sentence contain a shift in a specific direction, determined only by the initial segment encoding and the learned gain parameters for layer normalization. In figure 2, we illustrate what this systematic shift might entail. Prior to the application of the segment encoding bias, the semantic space is structured by similarity ('pooch' is near 'dog'); with the bias, we find a different set of characteristics: in our toy example, tokens are linearly separable by sentences.

[Figure 2: Segment encoding bias. (a) A toy semantic space without the segment encoding bias, where 'pooch' lies near 'dog'; (b) the same space with the bias, where the tokens of the two sentences become linearly separable.]

4.2 Data & Experimental setup

If BERT properly describes a semantic vector space, we should, on average, observe no significant difference in token encoding imputable to the segment the token belongs to. For a given word type w, we may constitute two groups: w_segA, the set of tokens for this type w belonging to first sentences in the inputs, and w_segB, the set of tokens of w belonging to second sentences. If BERT counterbalances the segment encodings, random differences should cancel out, and therefore the mean of all tokens w_segA should be equivalent to the mean of all tokens w_segB.

We used the same dataset as in section 3. This setting (where all paired input sentences are drawn from running text) allows us to focus on the effects of the segment encodings. We retrieved the output embeddings of the last BERT layer and grouped them per word type. To assess the consistency of a group of embeddings with respect to a purported reference, we used a mean of squared errors (MSE): given a group of embeddings E and a reference vector \vec{r}, we computed how much each vector in E strayed from the reference \vec{r}. It is formally defined as:

\mathrm{MSE}(E, \vec{r}) = \frac{1}{\#E} \sum_{\vec{v} \in E} \sum_{d} (\vec{v}_d - \vec{r}_d)^2    (4)

This MSE can also be understood as the average squared distance to the reference \vec{r}. When \vec{r} = \bar{E}, i.e. \vec{r} is set to be the average vector in E, the MSE measures the variance of E via Euclidean distance. We then used the MSE function to construct pairs of observations: for each word type w, and for each segment encoding seg_i, we computed two scores: MSE(w_segi, mean of w_segi)—which gives us an assessment of how coherent the set of embeddings w_segi is with respect to the mean vector in that set—and MSE(w_segi, mean of w_segj)—which assesses how coherent the same group of embeddings is with respect to the mean vector for the embeddings of the same type, but from the other segment seg_j. If no significant contrast between these two scores can be observed, then BERT counterbalances the segment encodings and is coherent across sentences.
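The paired comparison described above can be sketched as follows. The snippet assumes that, for every word type, the token embeddings have already been split into a first-sentence group and a second-sentence group of NumPy arrays; the paired t-test over the resulting MSE scores mirrors the analysis reported in the next section.

```python
import numpy as np
from scipy.stats import ttest_rel

def mse(embeddings, reference):
    """Average squared Euclidean distance to a reference vector (eq. 4)."""
    return float(((embeddings - reference) ** 2).sum(axis=1).mean())

def segment_contrast(groups):
    """groups: dict mapping a type to a pair (first-sentence tokens,
    second-sentence tokens), each of shape (n_tokens, dim).
    Compares MSE to the own-segment mean against MSE to the cross-segment mean."""
    same, cross = [], []
    for seg_a, seg_b in groups.values():
        mean_a, mean_b = seg_a.mean(axis=0), seg_b.mean(axis=0)
        same += [mse(seg_a, mean_a), mse(seg_b, mean_b)]   # own-segment reference
        cross += [mse(seg_a, mean_b), mse(seg_b, mean_a)]  # other-segment reference
    return ttest_rel(same, cross)
```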
4.3 Results & Discussion

We compared results using a paired Student's t-test, which highlighted a significant difference based on which segment types belonged to (p-value < 2.2 · 10^-16); the effect size (Cohen's d = -0.527) was found to be stronger than what we computed when assessing whether tokens cluster according to their types (cf. section 3). A visual representation of these results, log-scaled, is shown in figure 3. For all sets w_segi, the average embedding from the set itself was systematically a better fit than the average embedding from the paired set w_segj. We also noted that a small number of items yielded a disproportionate difference in MSE scores and that frequent word types had smaller differences in MSE scores: roughly speaking, very frequent items—punctuation signs, stopwords, frequent word suffixes—received embeddings that are almost coherent across sentences.

[Figure 3: Log-scaled MSE per reference]

Although the observed positional effect of embeddings' inconsistency might be entirely due to segment encodings, additional factors might be at play. In particular, BERT uses absolute positional encoding vectors to order words within a sequence: the first word w1 is marked with the positional encoding p(1), the second word w2 with p(2), and so on until the last word, wn, marked with p(n). As these positional encodings are added to the word embeddings, the same remark made earlier on the impact of residual connections may apply to these positional encodings as well. Lastly, we also note that many downstream applications use a single segment encoding per input, and thus sidestep the caveat stressed here.

5 Experiment 3: Sentence-level structure

We have seen that BERT embeddings do not fully respect cross-sentence coherence; the same type receives somewhat different representations for occurrences in even and odd sentences. However, comparing tokens of the same type in consecutive sentences is not necessarily the main application of BERT and related models. Does the segment-based representational variance affect the structure of the semantic space, instantiated in similarities between tokens of different types? Here we investigate how segment encodings impact the relation between any two tokens in a given sentence.

5.1 Data & Experimental setup

Consistent with previous experiments, we used the same dataset (cf. section 3); in this experiment also, mitigating the impact of the NSP objective was crucial. Sentences were thus passed two by two as input to the BERT model. As cosine has been traditionally used to quantify semantic similarity between words (Mikolov et al., 2013b; Levy and Goldberg, 2014a, e.g.), we then computed pairwise cosines of the tokens in each sentence. This allows us to reframe our assessment of whether lexical contrasts are coherent across sentences as a comparison of semantic dissimilarity across sentences. More formally, we compute the following set of cosine scores C_S for each sentence S:

C_S = \{ \cos(\vec{v}, \vec{u}) \mid \vec{v} \neq \vec{u} \wedge \vec{v}, \vec{u} \in E_S \}    (5)

with E_S the set of embeddings for the sentence S. In this analysis, we compare the union of all sets of cosine scores for first sentences against the union of all sets of cosine scores for second sentences. To avoid asymmetry, we remove the [CLS] token (only present in first sentences), and as with previous experiments we neutralize the effects of the NSP objective by using only consecutive sentences as input.

5.2 Results & Discussion

We compared cosine scores for first and second sentences using a Wilcoxon rank sum test. We observed a significant effect, however small (Cohen's d = 0.011). This may perhaps be due to data idiosyncrasies, and indeed when comparing with a W2V (Mikolov et al., 2013a) trained on BooksCorpus (Zhu et al., 2015) using the same tokenization as BERT, we do observe a significant effect (p < 0.05). However the effect size is six times smaller (d = 0.002) than what we found for BERT representations; moreover, when varying the sample size (cf. figure 4), p-values for BERT representations drop much faster to statistical significance.

[Figure 4: Wilcoxon tests, 1st vs. 2nd sentences; p-value as a function of sample size (1K to the full corpus) for BERT and W2V, against the p = 0.05 threshold.]
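The comparison just described can be sketched as follows; it is an illustration of the procedure rather than a reproduction of the authors' code, and it assumes lists of per-sentence embedding matrices from which the [CLS] token has already been removed for first sentences.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ranksums

def cosine_set(sentence_embeddings):
    """All pairwise cosines between distinct tokens of one sentence (eq. 5)."""
    norms = np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    normed = sentence_embeddings / norms
    return [float(u @ v) for u, v in combinations(normed, 2)]

def compare_sentence_positions(first_sentences, second_sentences):
    """Both arguments: lists of (n_tokens, dim) arrays, one per sentence."""
    cos_first = [c for sent in first_sentences for c in cosine_set(sent)]
    cos_second = [c for sent in second_sentences for c in cosine_set(sent)]
    return ranksums(cos_first, cos_second)
```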
A possible reason for the larger discrepancy observed in BERT representations might be that BERT uses absolute positional encodings, i.e. the kth word of the input is encoded with p(k). Therefore, although all first sentences of a given length l will be indexed with the same set of positional encodings {p(1), . . . , p(l)}, only second sentences of a given length l preceded by first sentences of a given length j share the exact same set of positional encodings {p(j + 1), . . . , p(j + l)}. As highlighted previously, the residual connections ensure that the segment encodings were partially preserved in the output embedding; the same argument can be made for positional encodings. In any event, the fact is that we do observe on BERT representations an effect of segment on sentence-level structure. This effect is greater than one can blame on data idiosyncrasies, as verified by the comparison with a traditional DSM such as W2V. If we are to consider BERT as a DSM, we must do so at the cost of cross-sentence coherence.

The analysis above suggests that embeddings for tokens drawn from first sentences live in a different semantic space than tokens drawn from second sentences, i.e. that BERT contains two DSMs rather than one. If so, the comparison between two sentence representations from a single input would be meaningless, or at least less coherent than the comparison of two sentence representations drawn from the same sentence position. To test this conjecture, we use two compositional semantics benchmarks: STS (Cer et al., 2017) and SICK-R (Marelli et al., 2014). These datasets are structured as triplets, grouping a pair of sentences with a human-annotated relatedness score. The original presentation of BERT (Devlin et al., 2018) did include a downstream application to these datasets, but employed a learned classifier, which obfuscates results (Wieting and Kiela, 2019; Cover, 1965; Hewitt and Liang, 2019). Hence we simply reduce the sequence of tokens within each sentence into a single vector by summing them, a simplistic yet robust semantic composition method. We then compute the Spearman correlation between the cosines of the two sum vectors and the sentence pair's relatedness score. We compare two setups: a "two sentences input" scheme (or 2 sent. ipt. for short)—where we use the sequences of vectors obtained by passing the two sentences as a single input—and a "one sentence input" scheme (1 sent. ipt.)—using two distinct inputs of a single sentence each.

Table 1: Correlation (Spearman ρ) of cosine similarity and relatedness ratings on the STS and SICK-R benchmarks

Model                 STS cor.   SICK-R cor.
Skip-Thought          0.25560    0.48762
USE                   0.66686    0.68997
InferSent             0.67646    0.70903
BERT, 2 sent. ipt.    0.35913    0.36992
BERT, 1 sent. ipt.    0.48241    0.58695
W2V                   0.37017    0.53356

Results are reported in table 1; we also provide comparisons with three different sentence encoders and the aforementioned W2V model. As we had suspected, using sum vectors drawn from a two sentence input scheme degrades performances below the W2V baseline. On the other hand, a one sentence input scheme seems to produce coherent sentence representations: in that scenario, BERT performs better than W2V and the older sentence encoder Skip-Thought (Kiros et al., 2015), but worse than the modern USE (Cer et al., 2018) and InferSent (Conneau et al., 2017). The comparison with W2V also shows that BERT representations over a coherent input are more likely to include some form of compositional knowledge than traditional DSMs; however it is difficult to decide whether some true form of compositionality is achieved by BERT or whether these performances are entirely a by-product of the positional encodings. In favor of the former, other research has suggested that Transformer-based architectures perform syntactic operations (Raganato and Tiedemann, 2018; Hewitt and Manning, 2019; Clark et al., 2019; Jawahar et al., 2019; Voita et al., 2019; Michel et al., 2019). In all, these results suggest that the semantic space of token representations from second sentences differs from that of embeddings from first sentences.
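The sentence-level protocol described above can be sketched as follows. The snippet assumes that token embeddings for each sentence of a pair have already been extracted under the chosen input scheme, and that the human relatedness ratings are supplied as a parallel list; loading STS or SICK-R itself is omitted.

```python
import numpy as np
from scipy.stats import spearmanr

def sum_vector(token_embeddings):
    """Simplistic composition: sum the token vectors of one sentence."""
    return token_embeddings.sum(axis=0)

def relatedness_correlation(sents_a, sents_b, gold_scores):
    """sents_a, sents_b: lists of (n_tokens, dim) arrays for the paired
    sentences; gold_scores: the human relatedness ratings."""
    cosines = []
    for a, b in zip(sents_a, sents_b):
        u, v = sum_vector(a), sum_vector(b)
        cosines.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    rho, _ = spearmanr(cosines, gold_scores)
    return rho
```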
6 Conclusions

Our experiments have focused on testing to what extent similar words lie in similar regions of BERT's latent semantic space. Although we saw that type-level semantics seem to match our general expectations about DSMs, focusing on details leaves us with a much foggier picture.

The main issue stems from BERT's "next sentence prediction" objective, which requires tokens to be marked according to which sentence they belong. This introduces a distinction between first and second sentence of the input that runs contrary to our expectations in terms of cross-sentence coherence. The validity of such a distinction for lexical semantics may be questioned, yet its effects can be measured. The primary assessment conducted in section 3 shows that token representations did tend to cluster naturally according to their types, yet a finer study detailed in section 4 highlights that tokens from distinct sentence positions (even vs. odd) tend to have different representations. This can be seen as a direct consequence of BERT's architecture: residual connections, along with the use of specific vectors to encode sentence position, entail that tokens for a given sentence position are 'shifted' with respect to tokens for the other position. Encodings have a substantial effect on the structure of the semantic subspaces of the two sentences in BERT input. Our experiments demonstrate that assuming sameness of the semantic space across the two input sentences can lead to a significant performance drop in semantic textual similarity.

One way to overcome this violation of cross-sentence coherence would be to consider first and second sentence representations as belonging to distinct distributional semantic spaces. The fact that first sentences were shown to have on average higher pairwise cosines than second sentences can be partially explained by the use of absolute positional encodings in BERT representations. Although positional encodings are required so that the model does not devolve into a bag-of-word system, absolute encodings are not: other works have proposed alternative relative position encodings (Shaw et al., 2018; Dai et al., 2019, e.g.); replacing the former with the latter may alleviate the gap in lexical contrasts. Other related questions that we must leave to future work encompass testing on other BERT models, such as the whole-words model or that of Liu et al. (2019), which differs only by its training objectives, as well as on other contextual embedding architectures.

Our findings suggest that the formulation of the NSP objective of BERT obfuscates its relation to distributional semantics, by introducing a systematic distinction between first and second sentences which impacts the output embeddings. Similarly, other works (Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019; Liu et al., 2019) stress that the usefulness and pertinence of the NSP task were not obvious. These studies favored an empirical point of view; here, we have shown what sorts of caveats came along with such artificial distinctions from the perspective of a theory of lexical semantics. We hope that future research will extend and refine these findings, and further our understanding of the peculiarities of neural architectures as models of linguistic structure.

Acknowledgments

We thank Quentin Gliosca whose remarks have been extremely helpful to this work. We also thank Olivier Bonami as well as three anonymous reviewers for their thoughtful criticism. The work was supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissements d'Avenir" program: Idex Lorraine Université d'Excellence (reference: ANR-15-IDEX-0004).
References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. Linear algebraic structure of word senses, with applications to polysemy. CoRR, abs/1601.03764.
Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry P. Vetrov. 2015. Breaking sticks and ambiguities with adaptive skip-gram. CoRR, abs/1502.07257.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
Gemma Boleda. 2019. Distributional semantics and linguistic theory. CoRR, abs/1905.01896.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc.
Olivier Bonami and Denis Paperno. 2018. A characterisation of the inflection-derivation opposition in a distributional vector space. Lingua e Linguaggio.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res., 49:1–47.
Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, and Roger Wattenhofer. 2019. On the validity of self-attention as explanation in Transformer models. arXiv e-prints, arXiv:1908.04211.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.
Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval@ACL 2017), pages 1–14, Vancouver, Canada.
Ting-Yun Chang and Yun-Nung Chen. 2019. What does this word mean? Explaining contextualized embeddings with natural language definition.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin Wattenberg. 2019. Visualizing and measuring the geometry of BERT. arXiv e-prints, arXiv:1906.02715.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
Thomas M. Cover. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers, 14(3):326–334.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In ACL.
J. R. Firth. 1957. A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59:1–32.
Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50.
Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.
Aurélie Herbelot and Eva Maria Vecchi. 2015. Building a shared world: mapping distributional to model-theoretic semantic spaces. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 22–32. Association for Computational Linguistics.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. arXiv e-prints, arXiv:1909.03368.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In NAACL-HLT.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. CoRR, abs/1901.07291.
Alessandro Lenci. 2018. Distributional models of word meaning. Annual Review of Linguistics, 4:151–171.
Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180. Association for Computational Linguistics.
Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc.
Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18, Berlin, Germany. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Max M. Louwerse and Rolf A. Zwaan. 2009. Language encodes geographical information. Cognitive Science, 33:51–73.
Marco Marelli and Marco Baroni. 2015. Affixation in semantic space: Modeling morpheme meanings with compositional distributional semantics. Psychological Review, 122(3):485–515.
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? arXiv e-prints, arXiv:1905.10650.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, Brussels, Belgium. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Willard Van Orman Quine. 1960. Word and Object. MIT Press.
Alec Radford. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium. Association for Computational Linguistics.
Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy, and Suresh Manandhar. 2011. Dynamic and static prototype vectors for semantic composition. In IJCNLP.
Joseph Reisinger and Raymond Mooney. 2010. Multi-prototype vector-space models of word meaning. Pages 109–117.
Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. CoRR, abs/1507.01127.
Peter Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53–65.
G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620.
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana. Association for Computational Linguistics.
Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France. OpenReview.net.
Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2018. An analysis of attention mechanisms: The case of word sense disambiguation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 26–35, Brussels, Belgium. Association for Computational Linguistics.
Wilson Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 30:415–433.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Lucas Vendramin, Pablo A. Jaskowiak, and Ricardo J. G. B. Campello. 2013. On the combination of relative clustering validity criteria. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM), pages 4:1–4:12, New York, NY, USA. ACM.
Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of WordNet for neural word sense disambiguation. In Global Wordnet Conference, Wroclaw, Poland.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR.
Kellie Webster, Marta R. Costa-jussà, Christian Hardmeier, and Will Radford. 2019. Gendered ambiguous pronoun (GAP) shared task at the Gender Bias in NLP Workshop 2019. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 1–7, Florence, Italy. Association for Computational Linguistics.
Matthijs Westera and Gemma Boleda. 2019. Don't blame distributional semantics if it can't do entailment. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 120–133, Gothenburg, Sweden. Association for Computational Linguistics.
John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA. OpenReview.net.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
Zi Yin and Yuanyuan Shen. 2018. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems 31, pages 887–898. Curran Associates, Inc.
Yukun Zhu, Jamie Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.