SimCSE: Simple Contrastive Learning of Sentence Embeddings
Tianyu Gao†* Xingcheng Yao‡* Danqi Chen†
†Department of Computer Science, Princeton University
‡Institute for Interdisciplinary Information Sciences, Tsinghua University
{tianyug,danqic}@cs.princeton.edu yxc18@mails.tsinghua.edu.cn

arXiv:2104.08821v3 [cs.CL] 9 Sep 2021

Abstract

This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation and that removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERTbase achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement over previous best results. We also show—both theoretically and empirically—that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and that it better aligns positive pairs when supervised signals are available.¹

*The first two authors contributed equally (listed in alphabetical order). This work was done when Xingcheng visited the Princeton NLP group remotely.
¹Our code and pre-trained models are publicly available at https://github.com/princeton-nlp/SimCSE.

1 Introduction

Learning universal sentence embeddings is a fundamental problem in natural language processing and has been studied extensively in the literature (Kiros et al., 2015; Hill et al., 2016; Conneau et al., 2017; Logeswaran and Lee, 2018; Cer et al., 2018; Reimers and Gurevych, 2019, inter alia). In this work, we advance state-of-the-art sentence embedding methods and demonstrate that a contrastive objective can be extremely effective when coupled with pre-trained language models such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). We present SimCSE, a simple contrastive sentence embedding framework, which can produce superior sentence embeddings from either unlabeled or labeled data.

Our unsupervised SimCSE simply predicts the input sentence itself, with only dropout (Srivastava et al., 2014) used as noise (Figure 1(a)). In other words, we pass the same sentence to the pre-trained encoder twice: by applying the standard dropout twice, we obtain two different embeddings as "positive pairs". We then take other sentences in the same mini-batch as "negatives", and the model predicts the positive one among the negatives. Although it may appear strikingly simple, this approach outperforms training objectives such as predicting next sentences (Logeswaran and Lee, 2018) and discrete data augmentation (e.g., word deletion and replacement) by a large margin, and even matches previous supervised methods. Through careful analysis, we find that dropout acts as minimal "data augmentation" of hidden representations, while removing it leads to a representation collapse.

Our supervised SimCSE builds upon the recent success of using natural language inference (NLI) datasets for sentence embeddings (Conneau et al., 2017; Reimers and Gurevych, 2019) and incorporates annotated sentence pairs in contrastive learning (Figure 1(b)). Unlike previous work that casts NLI as a 3-way classification task (entailment, neutral, and contradiction), we leverage the fact that entailment pairs can be naturally used as positive instances. We also find that adding the corresponding contradiction pairs as hard negatives further improves performance. This simple use of NLI datasets achieves a substantial improvement compared to prior methods using the same datasets. We also compare to other labeled sentence-pair datasets and find that NLI datasets are especially effective for learning sentence embeddings.
Figure 1: (a) Unsupervised SimCSE predicts the input sentence itself from in-batch negatives, with different hidden dropout masks applied in two forward passes. (b) Supervised SimCSE leverages the NLI datasets and takes the entailment (premise-hypothesis) pairs as positives, and contradiction pairs as well as other in-batch instances as negatives.

To better understand the strong performance of SimCSE, we borrow the analysis tool from Wang and Isola (2020), which takes alignment between semantically-related positive pairs and uniformity of the whole representation space to measure the quality of learned embeddings. Through empirical analysis, we find that our unsupervised SimCSE essentially improves uniformity while avoiding degenerated alignment via dropout noise, thus improving the expressiveness of the representations. The same analysis shows that the NLI training signal can further improve alignment between positive pairs and produce better sentence embeddings. We also draw a connection to the recent findings that pre-trained word embeddings suffer from anisotropy (Ethayarajh, 2019; Li et al., 2020) and prove—through a spectrum perspective—that the contrastive learning objective "flattens" the singular value distribution of the sentence embedding space, hence improving uniformity.

We conduct a comprehensive evaluation of SimCSE on seven standard semantic textual similarity (STS) tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017; Marelli et al., 2014) and seven transfer tasks (Conneau and Kiela, 2018). On the STS tasks, our unsupervised and supervised models achieve a 76.3% and 81.6% averaged Spearman's correlation respectively using BERTbase, a 4.2% and 2.2% improvement compared to previous best results. We also achieve competitive performance on the transfer tasks. Finally, we identify an incoherent evaluation issue in the literature and consolidate results of different settings for future work in the evaluation of sentence embeddings.

2 Background: Contrastive Learning

Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). It assumes a set of paired examples D = {(x_i, x_i^+)}_{i=1}^m, where x_i and x_i^+ are semantically related. We follow the contrastive framework in Chen et al. (2020) and take a cross-entropy objective with in-batch negatives (Chen et al., 2017; Henderson et al., 2017): let h_i and h_i^+ denote the representations of x_i and x_i^+; the training objective for (x_i, x_i^+) with a mini-batch of N pairs is

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}},   (1)

where \tau is a temperature hyperparameter and \mathrm{sim}(h_1, h_2) is the cosine similarity \frac{h_1^\top h_2}{\|h_1\| \cdot \|h_2\|}. In this work, we encode input sentences using a pre-trained language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019): h = f_\theta(x), and then fine-tune all the parameters using the contrastive learning objective (Eq. 1).
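As a concrete reference, the in-batch objective of Eq. 1 can be written in a few lines of PyTorch. This is a minimal sketch rather than the released implementation; the temperature value and the assumption that h and h+ arrive as aligned [N, d] batches are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h, h_pos, tau=0.05):
    """Cross-entropy objective with in-batch negatives (Eq. 1).

    h, h_pos: [N, d] representations of x_i and x_i^+;
    for each i, every h_j^+ with j != i acts as a negative.
    """
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    sim = h @ h_pos.t() / tau                        # [N, N] cosine similarities / tau
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)              # diagonal entries are the positives
```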
Positive instances. One critical question in contrastive learning is how to construct (x_i, x_i^+) pairs. In visual representations, an effective solution is to take two random transformations of the same image (e.g., cropping, flipping, distortion and rotation) as x_i and x_i^+ (Dosovitskiy et al., 2014). A similar approach has been recently adopted in language representations (Wu et al., 2020; Meng et al., 2021) by applying augmentation techniques such as word deletion, reordering, and substitution. However, data augmentation in NLP is inherently difficult because of its discrete nature. As we will see in §3, simply using standard dropout on intermediate representations outperforms these discrete operators.

In NLP, a similar contrastive learning objective has been explored in different contexts (Henderson et al., 2017; Gillick et al., 2019; Karpukhin et al., 2020). In these cases, (x_i, x_i^+) are collected from supervised datasets such as question-passage pairs. Because of the distinct nature of x_i and x_i^+, these approaches always use a dual-encoder framework, i.e., two independent encoders f_θ1 and f_θ2 for x_i and x_i^+. For sentence embeddings, Logeswaran and Lee (2018) also use contrastive learning with a dual-encoder approach, by forming the current sentence and the next sentence as (x_i, x_i^+).

Alignment and uniformity. Recently, Wang and Isola (2020) identify two key properties related to contrastive learning—alignment and uniformity—and propose to use them to measure the quality of representations. Given a distribution of positive pairs p_pos, alignment calculates the expected distance between embeddings of the paired instances (assuming representations are already normalized):

\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^+) \right\|^2.   (2)

On the other hand, uniformity measures how well the embeddings are uniformly distributed:

\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{i.i.d.}{\sim} p_{\mathrm{data}}} e^{-2 \| f(x) - f(y) \|^2},   (3)

where p_data denotes the data distribution. These two metrics are well aligned with the objective of contrastive learning: positive instances should stay close and embeddings for random instances should scatter on the hypersphere. In the following sections, we will also use the two metrics to justify the inner workings of our approaches.
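Both metrics can be estimated directly from a sample of embeddings. The sketch below is one illustrative implementation of Eqs. 2 and 3, under the assumption that the embeddings are already L2-normalized.

```python
import torch

def alignment(x, x_pos):
    """l_align (Eq. 2): expected squared distance between embeddings of positive pairs.
    x, x_pos: [N, d] normalized embeddings of paired instances."""
    return (x - x_pos).norm(p=2, dim=1).pow(2).mean()

def uniformity(x):
    """l_uniform (Eq. 3): log of the average Gaussian potential over pairs of embeddings.
    x: [N, d] normalized embeddings sampled from the data distribution."""
    sq_dists = torch.pdist(x, p=2).pow(2)            # pairwise squared distances
    return sq_dists.mul(-2).exp().mean().log()
```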
3 Unsupervised SimCSE

The idea of unsupervised SimCSE is extremely simple: we take a collection of sentences {x_i}_{i=1}^m and use x_i^+ = x_i. The key ingredient to make this work with identical positive pairs is the use of independently sampled dropout masks for x_i and x_i^+. In standard training of Transformers (Vaswani et al., 2017), dropout masks are placed on fully-connected layers as well as attention probabilities (default p = 0.1). We denote h_i^z = f_θ(x_i, z), where z is a random mask for dropout. We simply feed the same input to the encoder twice and get two embeddings with different dropout masks z, z', and the training objective of SimCSE becomes

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i}, h_i^{z_i'})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i}, h_j^{z_j'})/\tau}},   (4)

for a mini-batch of N sentences. Note that z is just the standard dropout mask in Transformers and we do not add any additional dropout.

Dropout noise as data augmentation. We view this as a minimal form of data augmentation: the positive pair takes exactly the same sentence, and their embeddings only differ in dropout masks. We compare this approach to other training objectives on the STS-B development set (Cer et al., 2017).² Table 1 compares our approach to common data augmentation techniques such as crop, word deletion and replacement, which can be viewed as h = f_θ(g(x), z), where g is a (random) discrete operator on x.

Table 1: Comparison of data augmentations on the STS-B development set (Spearman's correlation). Crop k%: keep 100−k% of the length; word deletion k%: delete k% of words; synonym replacement: use nlpaug (Ma, 2019) to randomly replace one word with its synonym; MLM k%: use BERTbase to replace k% of words.

Data augmentation               STS-B
None (unsup. SimCSE)            82.5
Crop 10% / 20% / 30%            77.8 / 71.4 / 63.6
Word deletion 10% / 20% / 30%   75.9 / 72.2 / 68.2
Delete one word                 75.9
  w/o dropout                   74.2
Synonym replacement             77.4
MLM 15%                         62.2

Table 2: Comparison of different unsupervised objectives (STS-B development set, Spearman's correlation). The two columns denote whether we use one encoder or two independent encoders. Next 3 sentences: randomly sample one from the next 3 sentences. Delete one word: delete one word randomly (see Table 1).

Training objective      f_θ      (f_θ1, f_θ2)
Next sentence           67.1     68.9
Next 3 sentences        67.4     68.8
Delete one word         75.9     73.1
Unsupervised SimCSE     82.5     80.7

²We randomly sample 10^6 sentences from English Wikipedia and fine-tune BERTbase with learning rate = 3e-5, N = 64. In all our experiments, no STS training sets are used.
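The whole unsupervised training signal therefore comes from encoding the same batch twice while dropout is active. The following sketch with the Hugging Face transformers library illustrates one training step; the model name, [CLS] pooling, and temperature are assumptions for illustration, not the exact released training code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so the two passes use different masks

def unsup_simcse_step(sentences, tau=0.05):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # Two forward passes over the same inputs -> two independent dropout masks (z, z')
    h1 = encoder(**batch).last_hidden_state[:, 0]    # [CLS] vectors, first pass
    h2 = encoder(**batch).last_hidden_state[:, 0]    # [CLS] vectors, second pass
    h1, h2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = h1 @ h2.t() / tau                          # in-batch similarities (Eq. 4)
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)
```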
We note that even deleting one word hurts performance, and none of the discrete augmentations outperforms dropout noise.

We also compare this self-prediction training objective to the next-sentence objective used in Logeswaran and Lee (2018), taking either one encoder or two independent encoders. As shown in Table 2, we find that SimCSE performs much better than the next-sentence objectives (82.5 vs 67.4 on STS-B) and that using one encoder instead of two makes a significant difference in our approach.

Why does it work? To further understand the role of dropout noise in unsupervised SimCSE, we try out different dropout rates in Table 3 and observe that all the variants underperform the default dropout probability p = 0.1 from Transformers. We find two extreme cases particularly interesting: "no dropout" (p = 0) and "fixed 0.1" (using the default dropout p = 0.1 but the same dropout masks for the pair). In both cases, the resulting embeddings for the pair are exactly the same, and this leads to a dramatic performance degradation. We take the checkpoints of these models every 10 steps during training and visualize the alignment and uniformity metrics³ in Figure 2, along with a simple data augmentation model "delete one word". As clearly shown, starting from pre-trained checkpoints, all models greatly improve uniformity. However, the alignment of the two special variants also degrades drastically, while our unsupervised SimCSE keeps a steady alignment, thanks to the use of dropout noise. This also demonstrates that starting from a pre-trained checkpoint is crucial, for it provides good initial alignment. Finally, "delete one word" improves the alignment yet achieves a smaller gain on the uniformity metric, and eventually underperforms unsupervised SimCSE.

Table 3: Effects of different dropout probabilities p on the STS-B development set (Spearman's correlation, BERTbase). Fixed 0.1: default 0.1 dropout rate but apply the same dropout mask on both x_i and x_i^+.

p       0.0    0.01   0.05   0.1    0.15   0.2    0.5    Fixed 0.1
STS-B   71.1   72.6   81.1   82.5   81.4   80.5   71.0   43.6

Figure 2: \ell_align-\ell_uniform plot for unsupervised SimCSE, "no dropout", "fixed 0.1", and "delete one word". We visualize checkpoints every 10 training steps and the arrows indicate the training direction. For both \ell_align and \ell_uniform, lower numbers are better.

4 Supervised SimCSE

We have demonstrated that adding dropout noise is able to keep a good alignment for positive pairs (x, x^+) ~ p_pos. In this section, we study whether we can leverage supervised datasets to provide better training signals for improving the alignment of our approach. Prior work (Conneau et al., 2017; Reimers and Gurevych, 2019) has demonstrated that supervised natural language inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2018) are effective for learning sentence embeddings, by predicting whether the relationship between two sentences is entailment, neutral or contradiction. In our contrastive learning framework, we instead directly take (x_i, x_i^+) pairs from supervised datasets and use them to optimize Eq. 1.

Choices of labeled data. We first explore which supervised datasets are especially suitable for constructing positive pairs (x_i, x_i^+). We experiment with a number of datasets with sentence-pair examples, including 1) QQP⁴: Quora question pairs; 2) Flickr30k (Young et al., 2014): each image is annotated with 5 human-written captions and we consider any two captions of the same image as a positive pair; 3) ParaNMT (Wieting and Gimpel, 2018): a large-scale back-translation paraphrase dataset⁵; and finally 4) NLI datasets: SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). We train the contrastive learning model (Eq. 1) with different datasets and compare the results in Table 4. For a fair comparison, we also run experiments with the same number of training pairs. Among all the options, using entailment pairs from the NLI (SNLI + MNLI) datasets performs the best. We think this is reasonable, as the NLI datasets consist of high-quality, crowd-sourced pairs. Also, human annotators are expected to write the hypotheses manually based on the premises, so the two sentences tend to have less lexical overlap. For instance, we find that the lexical overlap (F1 measured between two bags of words) for the entailment pairs (SNLI + MNLI) is 39%, while it is 60% and 55% for QQP and ParaNMT.

³We take STS-B pairs with a score higher than 4 as p_pos and all STS-B sentences as p_data.
⁴https://www.quora.com/q/quoradata/
⁵ParaNMT is automatically constructed by machine translation systems. Strictly speaking, we should not call it "supervised". It underperforms our unsupervised SimCSE though.
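For reference, the bag-of-words F1 used for the lexical-overlap numbers above can be computed as follows; whitespace tokenization and lowercasing are simplifying assumptions.

```python
from collections import Counter

def bow_f1(sent_a: str, sent_b: str) -> float:
    """Token-level F1 between two bags of words (a simple lexical-overlap measure)."""
    a, b = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    overlap = sum((a & b).values())          # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)
```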
Table 4: Comparisons of different supervised datasets as positive pairs. Results are Spearman's correlations on the STS-B development set using BERTbase (we use the same hyperparameters as the final SimCSE model). Numbers in brackets denote the # of pairs. Sample: subsampling 134k positive pairs for a fair comparison among datasets; full: using the full dataset. In the last block, we use entailment pairs as positives and contradiction pairs as hard negatives (our final model).

Dataset                              sample   full
Unsup. SimCSE (1m)                   -        82.5
QQP (134k)                           81.8     81.8
Flickr30k (318k)                     81.5     81.4
ParaNMT (5m)                         79.7     78.7
SNLI+MNLI entailment (314k)          84.1     84.9
SNLI+MNLI neutral (314k)⁸            82.6     82.9
SNLI+MNLI contradiction (314k)       77.5     77.6
SNLI+MNLI all (942k)                 81.7     81.9
SNLI+MNLI entailment + hard neg.     -        86.2
  + ANLI (52k)                       -        85.0

Contradiction as hard negatives. Finally, we further take advantage of the NLI datasets by using their contradiction pairs as hard negatives.⁶ In NLI datasets, given one premise, annotators are required to manually write one sentence that is absolutely true (entailment), one that might be true (neutral), and one that is definitely false (contradiction). Therefore, for each premise and its entailment hypothesis, there is an accompanying contradiction hypothesis⁷ (see Figure 1 for an example).

Formally, we extend (x_i, x_i^+) to (x_i, x_i^+, x_i^-), where x_i is the premise and x_i^+ and x_i^- are the entailment and contradiction hypotheses. The training objective \ell_i is then defined by (N is the mini-batch size):

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)}.   (5)

As shown in Table 4, adding hard negatives can further improve performance (84.9 → 86.2) and this is our final supervised SimCSE. We also tried to add the ANLI dataset (Nie et al., 2020) or combine it with our unsupervised SimCSE approach, but didn't find a meaningful improvement. We also considered a dual-encoder framework in supervised SimCSE and it hurt performance (86.2 → 84.2).

⁶We also experimented with adding neutral hypotheses as hard negatives. See Section 6.3 for more discussion.
⁷In fact, one premise can have multiple contradiction hypotheses. In our implementation, we only sample one as the hard negative and we did not find a difference by using more.
⁸Though our final model only takes entailment pairs as positive instances, here we also try taking neutral and contradiction pairs from the NLI datasets as positive pairs.
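A minimal sketch of the supervised objective with in-batch and hard negatives (Eq. 5) is given below; how the (premise, entailment, contradiction) triplets are batched is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(h, h_pos, h_neg, tau=0.05):
    """Supervised objective with hard negatives (Eq. 5).

    h:     [N, d] premise embeddings
    h_pos: [N, d] entailment hypotheses (positives)
    h_neg: [N, d] contradiction hypotheses (hard negatives)
    """
    h, h_pos, h_neg = (F.normalize(t, dim=-1) for t in (h, h_pos, h_neg))
    pos_sim = h @ h_pos.t() / tau                    # [N, N] positives + in-batch negatives
    neg_sim = h @ h_neg.t() / tau                    # [N, N] in-batch hard negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1)    # [N, 2N]
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```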
5 Connection to Anisotropy

Recent work identifies an anisotropy problem in language representations (Ethayarajh, 2019; Li et al., 2020), i.e., the learned embeddings occupy a narrow cone in the vector space, which severely limits their expressiveness. Gao et al. (2019) demonstrate that language models trained with tied input/output embeddings lead to anisotropic word embeddings, and this is further observed by Ethayarajh (2019) in pre-trained contextual representations. Wang et al. (2020) show that the singular values of the word embedding matrix in a language model decay drastically: except for a few dominating singular values, all others are close to zero.

A simple way to alleviate the problem is post-processing, either to eliminate the dominant principal components (Arora et al., 2017; Mu and Viswanath, 2018), or to map embeddings to an isotropic distribution (Li et al., 2020; Su et al., 2021). Another common solution is to add regularization during training (Gao et al., 2019; Wang et al., 2020). In this work, we show that—both theoretically and empirically—the contrastive objective can also alleviate the anisotropy problem.

The anisotropy problem is naturally connected to uniformity (Wang and Isola, 2020), as both highlight that embeddings should be evenly distributed in the space. Intuitively, optimizing the contrastive learning objective can improve uniformity (or ease the anisotropy problem), as the objective pushes negative instances apart. Here, we take a singular spectrum perspective—a common practice in analyzing word embeddings (Mu and Viswanath, 2018; Gao et al., 2019; Wang et al., 2020)—and show that the contrastive objective can "flatten" the singular value distribution of sentence embeddings and make the representations more isotropic.

Following Wang and Isola (2020), the asymptotics of the contrastive learning objective (Eq. 1) can be expressed by the following equation when the number of negative instances approaches infinity (assuming f(x) is normalized):

-\frac{1}{\tau} \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \left[ f(x)^\top f(x^+) \right] + \mathbb{E}_{x \sim p_{\mathrm{data}}} \left[ \log \mathbb{E}_{x^- \sim p_{\mathrm{data}}} \left[ e^{f(x)^\top f(x^-)/\tau} \right] \right],   (6)

where the first term keeps positive instances similar and the second pushes negative pairs apart. When p_data is uniform over finite samples {x_i}_{i=1}^m, with h_i = f(x_i), we can derive the following formula from the second term with Jensen's inequality:

\mathbb{E}_{x \sim p_{\mathrm{data}}} \left[ \log \mathbb{E}_{x^- \sim p_{\mathrm{data}}} \left[ e^{f(x)^\top f(x^-)/\tau} \right] \right] = \frac{1}{m} \sum_{i=1}^{m} \log \left( \frac{1}{m} \sum_{j=1}^{m} e^{h_i^\top h_j / \tau} \right) \geq \frac{1}{\tau m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} h_i^\top h_j.   (7)

Let W be the sentence embedding matrix corresponding to {x_i}_{i=1}^m, i.e., the i-th row of W is h_i. Optimizing the second term in Eq. 6 essentially minimizes an upper bound of the summation of all elements in WW^\top, i.e., \mathrm{Sum}(WW^\top) = \sum_{i=1}^{m} \sum_{j=1}^{m} h_i^\top h_j.

Since we normalize h_i, all elements on the diagonal of WW^\top are 1 and then \mathrm{tr}(WW^\top) (the sum of all eigenvalues) is a constant. According to Merikoski (1984), if all elements in WW^\top are positive, which is the case most of the time according to Figure G.1, then \mathrm{Sum}(WW^\top) is an upper bound for the largest eigenvalue of WW^\top. When minimizing the second term in Eq. 6, we therefore reduce the top eigenvalue of WW^\top and inherently "flatten" the singular spectrum of the embedding space. Hence, contrastive learning is expected to alleviate the representation degeneration problem and improve the uniformity of sentence embeddings.

Compared to post-processing methods in Li et al. (2020); Su et al. (2021), which only aim to encourage isotropic representations, contrastive learning also optimizes for aligning positive pairs via the first term in Eq. 6, which is the key to the success of SimCSE. A quantitative analysis is given in §7.
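The flattening effect can be checked empirically by inspecting the singular values of the sentence embedding matrix W (we report such an analysis in Appendix F). A minimal sketch, assuming embeddings are collected into an [m, d] tensor:

```python
import torch
import torch.nn.functional as F

def singular_spectrum(embeddings):
    """Singular values of the normalized sentence embedding matrix W.

    A spectrum dominated by a few large values indicates an anisotropic
    (degenerated) space; a flatter spectrum indicates a more isotropic one.
    """
    W = F.normalize(embeddings, dim=-1)              # rows h_i with unit norm
    return torch.linalg.svdvals(W)                   # singular values in descending order
```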
6 Experiment

6.1 Evaluation Setup

We conduct our experiments on 7 semantic textual similarity (STS) tasks. Note that all our STS experiments are fully unsupervised and no STS training sets are used. Even for supervised SimCSE, we simply mean that we take extra labeled datasets for training, following previous work (Conneau et al., 2017). We also evaluate on 7 transfer learning tasks and provide detailed results in Appendix E. We share a similar sentiment with Reimers and Gurevych (2019) that the main goal of sentence embeddings is to cluster semantically similar sentences, and hence take STS as the main result.

Semantic textual similarity tasks. We evaluate on 7 STS tasks: STS 2012–2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017) and SICK-Relatedness (Marelli et al., 2014). When comparing to previous work, we identify invalid comparison patterns in published papers in the evaluation settings, including (a) whether to use an additional regressor, (b) Spearman's vs Pearson's correlation, and (c) how the results are aggregated (Table B.1). We discuss the detailed differences in Appendix B and choose to follow the setting of Reimers and Gurevych (2019) in our evaluation (no additional regressor, Spearman's correlation, and "all" aggregation). We also report our replicated study of previous work as well as our results evaluated in a different setting in Table B.2 and Table B.3. We call for unifying the setting in evaluating sentence embeddings for future research.

Training details. We start from pre-trained checkpoints of BERT (Devlin et al., 2019) (uncased) or RoBERTa (Liu et al., 2019) (cased) and take the [CLS] representation as the sentence embedding⁹ (see §6.3 for a comparison between different pooling methods). We train unsupervised SimCSE on 10^6 randomly sampled sentences from English Wikipedia, and train supervised SimCSE on the combination of MNLI and SNLI datasets (314k). More training details can be found in Appendix A.

⁹There is an MLP layer over [CLS] in BERT's original implementation and we keep it with random initialization.
6.2 Main Results

We compare unsupervised and supervised SimCSE to previous state-of-the-art sentence embedding methods on STS tasks. Unsupervised baselines include average GloVe embeddings (Pennington et al., 2014), average BERT or RoBERTa embeddings¹⁰, and post-processing methods such as BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021). We also compare to several recent methods using a contrastive objective, including 1) IS-BERT (Zhang et al., 2020), which maximizes the agreement between global and local features; 2) DeCLUTR (Giorgi et al., 2021), which takes different spans from the same document as positive pairs; 3) CT (Carlsson et al., 2021), which aligns embeddings of the same sentence from two different encoders.¹¹ Other supervised methods include InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018), and SBERT/SRoBERTa (Reimers and Gurevych, 2019) with post-processing methods (BERT-flow, whitening, and CT). We provide more details of these baselines in Appendix C.

Table 5: Sentence embedding performance on STS tasks (Spearman's correlation, "all" setting). We highlight the highest numbers among models with the same pre-trained encoder. ♣: results from Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); all other results are reproduced or reevaluated by ourselves. For BERT-flow (Li et al., 2020) and whitening (Su et al., 2021), we only report the "NLI" setting (see Table C.1).

Model                             STS12  STS13  STS14  STS15  STS16  STS-B  SICK-R  Avg.
Unsupervised models
GloVe embeddings (avg.)♣          55.14  70.66  59.73  68.25  63.66  58.02  53.76   61.32
BERTbase (first-last avg.)        39.70  59.38  49.67  66.03  66.19  53.87  62.06   56.70
BERTbase-flow                     58.40  67.10  60.85  75.16  71.22  68.66  64.47   66.55
BERTbase-whitening                57.83  66.90  60.90  75.08  71.31  68.24  63.73   66.28
IS-BERTbase♥                      56.77  69.24  61.21  75.23  70.16  69.21  64.25   66.58
CT-BERTbase                       61.63  76.80  68.47  77.50  76.48  74.31  69.19   72.05
∗SimCSE-BERTbase                  68.40  82.41  74.38  80.91  78.56  76.85  72.23   76.25
RoBERTabase (first-last avg.)     40.88  58.74  49.07  65.63  61.48  58.55  61.63   56.57
RoBERTabase-whitening             46.99  63.24  57.23  71.36  68.99  61.36  62.91   61.73
DeCLUTR-RoBERTabase               52.41  75.19  65.52  77.12  78.63  72.41  68.62   69.99
∗SimCSE-RoBERTabase               70.16  81.77  73.24  81.36  80.65  80.22  68.56   76.57
∗SimCSE-RoBERTalarge              72.86  83.99  75.62  84.77  81.80  81.98  71.26   78.90
Supervised models
InferSent-GloVe♣                  52.86  66.75  62.15  72.77  66.87  68.03  65.65   65.01
Universal Sentence Encoder♣       64.49  67.80  64.61  76.83  73.18  74.92  76.69   71.22
SBERTbase♣                        70.97  76.53  73.19  79.09  74.30  77.03  72.91   74.89
SBERTbase-flow                    69.78  77.27  74.35  82.01  77.46  79.12  76.21   76.60
SBERTbase-whitening               69.65  77.57  74.66  82.27  78.39  79.52  76.91   77.00
CT-SBERTbase                      74.84  83.20  78.07  83.84  77.93  81.46  76.42   79.39
∗SimCSE-BERTbase                  75.30  84.67  80.19  85.40  80.82  84.25  80.39   81.57
SRoBERTabase♣                     71.54  72.49  70.80  78.74  73.69  77.77  74.46   74.21
SRoBERTabase-whitening            70.46  77.07  74.46  81.64  76.43  79.49  76.65   76.60
∗SimCSE-RoBERTabase               76.53  85.21  80.95  86.03  82.57  85.83  80.50   82.52
∗SimCSE-RoBERTalarge              77.46  87.27  82.36  86.66  83.93  86.70  81.95   83.76
Table 5 shows the evaluation results on the 7 STS tasks. SimCSE can substantially improve results on all the datasets with or without extra NLI supervision, greatly outperforming the previous state-of-the-art models. Specifically, our unsupervised SimCSE-BERTbase improves the previous best averaged Spearman's correlation from 72.05% to 76.25%, even comparable to supervised baselines. When using NLI datasets, SimCSE-BERTbase further pushes the state-of-the-art results to 81.57%. The gains are more pronounced on RoBERTa encoders, and our supervised SimCSE achieves 83.76% with RoBERTalarge.

In Appendix E, we show that SimCSE also achieves on-par or better transfer task performance compared to existing work, and that an auxiliary MLM objective can further boost performance.

¹⁰Following Su et al. (2021), we take the average of the first and the last layers, which is better than only taking the last.
¹¹We do not compare to CLEAR (Wu et al., 2020), because they use their own version of pre-trained models, and the numbers appear to be much lower. Also note that CT is concurrent work to ours.
6.3 Ablation Studies

We investigate the impact of different pooling methods and hard negatives. All reported results in this section are based on the STS-B development set. We provide more ablation studies (normalization, temperature, and MLM objectives) in Appendix D.

Pooling methods. Reimers and Gurevych (2019); Li et al. (2020) show that taking the average embeddings of pre-trained models (especially from both the first and last layers) leads to better performance than [CLS]. Table 6 shows the comparison between different pooling methods in both unsupervised and supervised SimCSE. For the [CLS] representation, the original BERT implementation takes an extra MLP layer on top of it. Here, we consider three different settings for [CLS]: 1) keeping the MLP layer; 2) no MLP layer; 3) keeping the MLP during training but removing it at testing time. We find that for unsupervised SimCSE, taking the [CLS] representation with MLP only during training works the best; for supervised SimCSE, different pooling methods do not matter much. By default, we take [CLS] with MLP (train) for unsupervised SimCSE and [CLS] with MLP for supervised SimCSE.

Table 6: Ablation studies of different pooling methods in unsupervised and supervised SimCSE. [CLS] w/ MLP (train): using MLP on [CLS] during training but removing it during testing. The results are based on the development set of STS-B using BERTbase.

Pooler                  Unsup.   Sup.
[CLS] w/ MLP            81.7     86.2
[CLS] w/ MLP (train)    82.5     85.8
[CLS] w/o MLP           80.9     86.2
First-last avg.         81.2     86.1

Hard negatives. Intuitively, it may be beneficial to differentiate hard negatives (contradiction examples) from other in-batch negatives. Therefore, we extend our training objective defined in Eq. 5 to incorporate weighting of different negatives:

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^+)/\tau} + \alpha^{\mathbb{1}_i^j} e^{\mathrm{sim}(h_i, h_j^-)/\tau} \right)},   (8)

where \mathbb{1}_i^j \in \{0, 1\} is an indicator that equals 1 if and only if i = j. We train SimCSE with different values of \alpha and evaluate the trained models on the development set of STS-B. We also consider taking neutral hypotheses as hard negatives. As shown in Table 7, \alpha = 1 performs the best, and neutral hypotheses do not bring further gains.

Table 7: STS-B development results with different hard negative policies. "N/A": no hard negative.

Hard neg   N/A    Contradiction          Contradiction + Neutral
α          -      0.5    1.0    2.0      1.0
STS-B      84.9   86.1   86.2   86.2     85.3

7 Analysis

In this section, we conduct further analyses to understand the inner workings of SimCSE.

Uniformity and alignment. Figure 3 shows the uniformity and alignment of different sentence embedding models along with their averaged STS results. In general, models which have both better alignment and uniformity achieve better performance, confirming the findings in Wang and Isola (2020). We also observe that (1) though pre-trained embeddings have good alignment, their uniformity is poor (i.e., the embeddings are highly anisotropic); (2) post-processing methods like BERT-flow and BERT-whitening greatly improve uniformity but also suffer a degeneration in alignment; (3) unsupervised SimCSE effectively improves the uniformity of pre-trained embeddings while keeping a good alignment; (4) incorporating supervised data in SimCSE further amends alignment. In Appendix F, we further show that SimCSE can effectively flatten the singular value distribution of pre-trained embeddings. In Appendix G, we demonstrate that SimCSE provides more distinguishable cosine similarities between different sentence pairs.

Qualitative comparison. We conduct a small-scale retrieval experiment using SBERTbase and SimCSE-BERTbase. We use 150k captions from the Flickr30k dataset and take any random sentence as a query to retrieve similar sentences (based on cosine similarity). As the examples shown in Table 8 suggest, the sentences retrieved by SimCSE have a higher quality compared to those retrieved by SBERT.
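For completeness, the cosine-similarity retrieval used in this comparison can be sketched as follows; the sentence embeddings are assumed to come from whichever encoder (SBERT or SimCSE) is being evaluated.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, corpus_embs, k=3):
    """Return indices of the top-k corpus sentences by cosine similarity to the query."""
    query_emb = F.normalize(query_emb, dim=-1)       # [d]
    corpus_embs = F.normalize(corpus_embs, dim=-1)   # [M, d]
    scores = corpus_embs @ query_emb                 # [M] cosine similarities
    return scores.topk(k).indices
```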
Table 8: Retrieved top-3 examples by SBERT and supervised SimCSE from Flickr30k (150k sentences).

Query: A man riding a small boat in a harbor.
SBERTbase:
 #1 A group of men traveling over the ocean in a small boat.
 #2 Two men sit on the bow of a colorful boat.
 #3 A man wearing a life jacket is in a small boat on a lake.
Supervised SimCSE-BERTbase:
 #1 A man on a moored blue and white boat.
 #2 A man is riding in a boat on the water.
 #3 A man in a blue boat on the water.

Query: A dog runs on the green grass near a wooden fence.
SBERTbase:
 #1 A dog runs on the green grass near a grove of trees.
 #2 A brown and white dog runs through the green grass.
 #3 The dogs run in the green field.
Supervised SimCSE-BERTbase:
 #1 The dog by the fence is running on the grass.
 #2 Dog running through grass in fenced area.
 #3 A dog runs on the green grass near a grove of trees.

Figure 3: \ell_align-\ell_uniform plot of models based on BERTbase. Color of points and numbers in brackets represent average STS performance (Spearman's correlation). Next3Sent: "next 3 sentences" from Table 2.

8 Related Work

Early work in sentence embeddings builds upon the distributional hypothesis by predicting surrounding sentences of a given one (Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018). Pagliardini et al. (2018) show that simply augmenting the idea of word2vec (Mikolov et al., 2013) with n-gram embeddings leads to strong results. Several recent (and concurrent) approaches adopt contrastive objectives (Zhang et al., 2020; Giorgi et al., 2021; Wu et al., 2020; Meng et al., 2021; Carlsson et al., 2021; Kim et al., 2021; Yan et al., 2021) by taking different views—from data augmentation or different copies of models—of the same sentence or document. Compared to these works, SimCSE uses the simplest idea by taking different outputs of the same sentence from standard dropout, and performs the best on STS tasks.

Supervised sentence embeddings are promised to have stronger performance compared to unsupervised counterparts. Conneau et al. (2017) propose to fine-tune a Siamese model on NLI datasets, which is further extended to other encoders or pre-trained models (Cer et al., 2018; Reimers and Gurevych, 2019). Furthermore, Wieting and Gimpel (2018); Wieting et al. (2020) demonstrate that bilingual and back-translation corpora provide useful supervision for learning semantic similarity. Another line of work focuses on regularizing embeddings (Li et al., 2020; Su et al., 2021; Huang et al., 2021) to alleviate the representation degeneration problem (as discussed in §5), and yields substantial improvement over pre-trained language models.

9 Conclusion

In this work, we propose SimCSE, a simple contrastive learning framework, which greatly improves state-of-the-art sentence embeddings on semantic textual similarity tasks. We present an unsupervised approach which predicts the input sentence itself with dropout noise and a supervised approach utilizing NLI datasets. We further justify the inner workings of our approach by analyzing the alignment and uniformity of SimCSE along with other baseline models. We believe that our contrastive objective, especially the unsupervised one, may have a broader application in NLP. It provides a new perspective on data augmentation with text input, and can be extended to other continuous representations and integrated in language model pre-training.

Acknowledgements

We thank Tao Lei, Jason Lee, Zhengyan Zhang, Jinhyuk Lee, Alexander Wettig, Zexuan Zhong, and the members of the Princeton NLP group for helpful discussion and valuable feedback. This research is supported by a Graduate Fellowship at Princeton University and a gift award from Apple.
References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511.

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics and the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.

Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. 2013. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1, pages 32–43.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP), pages 632–642.

Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, and Magnus Sahlgren. 2021. Semantic re-tuning with contrastive tension. In International Conference on Learning Representations (ICLR).

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pages 169–174.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607.

Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural network-based collaborative filtering. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 767–776.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In International Conference on Language Resources and Evaluation (LREC).

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. 2014. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), volume 27.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65.

Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Representation degeneration problem in training natural language generation models. In International Conference on Learning Representations (ICLR).

Dan Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Computational Natural Language Learning (CoNLL), pages 528–537.

John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. 2021. DeCLUTR: Deep contrastive learning for unsupervised textual representations. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 879–895.

Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742. IEEE.

Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1367–1377.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, and Nan Duan. 2021. WhiteningBERT: An easy unsupervised sentence embedding approach. arXiv preprint arXiv:2104.01767.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781.

Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-guided contrastive learning for BERT sentence representations. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 2528–2540.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems (NIPS), pages 3294–3302.

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations (ICLR).

Edward Ma. 2019. NLP augmentation. https://github.com/makcedward/nlpaug.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In International Conference on Language Resources and Evaluation (LREC), pages 216–223.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473.

Jorma Kaarlo Merikoski. 1984. On the trace and the sum of elements of a matrix. Linear Algebra and its Applications, 60:177–185.

Tomas Mikolov, Ilya Sutskever, Kai Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS).

Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations (ICLR).

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Association for Computational Linguistics (ACL), pages 4885–4901.

Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 528–540.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Association for Computational Linguistics (ACL), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Association for Computational Linguistics (ACL), pages 115–124.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nils Reimers, Philip Beyer, and Iryna Gurevych. 2016. Task-oriented intrinsic evaluation of semantic textual similarity. In International Conference on Computational Linguistics (COLING), pages 87–96.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research (JMLR), 15(1):1929–1958.

Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 6000–6010.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In The 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Lingxiao Wang, Jing Huang, Kevin Huang, Ziniu Hu, Guangtao Wang, and Quanquan Gu. 2020. Improving neural language generation with spectrum control. In International Conference on Learning Representations (ICLR).

Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML), pages 9929–9939.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

John Wieting and Kevin Gimpel. 2018. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Association for Computational Linguistics (ACL), pages 451–462.

John Wieting, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A bilingual generative transformer for semantic sentence embedding. In Empirical Methods in Natural Language Processing (EMNLP), pages 1581–1594.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pages 38–45.

Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma. 2020. CLEAR: Contrastive learning for sentence representation. arXiv preprint arXiv:2012.15466.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 5065–5075.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610.
A Training Details

We implement SimCSE with the transformers package (Wolf et al., 2020). For supervised SimCSE, we train our models for 3 epochs, evaluate the model every 250 training steps on the development set of STS-B, and keep the best checkpoint for the final evaluation on test sets. We do the same for unsupervised SimCSE, except that we train the model for one epoch. We carry out a grid search of batch size ∈ {64, 128, 256, 512} and learning rate ∈ {1e-5, 3e-5, 5e-5} on the STS-B development set and adopt the hyperparameter settings in Table A.1. We find that SimCSE is not sensitive to batch sizes as long as the learning rate is tuned accordingly, which contradicts the finding that contrastive learning requires large batch sizes (Chen et al., 2020). This is probably because all SimCSE models start from pre-trained checkpoints, which already provide a good set of initial parameters.

Table A.1: Batch sizes and learning rates for SimCSE.

                 Unsupervised                                      Supervised
                 BERT-base  BERT-large  RoBERTa-base  RoBERTa-large  base   large
Batch size       64         64          512           512            512    512
Learning rate    3e-5       1e-5        1e-5          3e-5           5e-5   1e-5

For both unsupervised and supervised SimCSE, we take the [CLS] representation with an MLP layer on top of it as the sentence representation. Specifically, for unsupervised SimCSE, we discard the MLP layer and only use the [CLS] output during test, since we find that it leads to better performance (ablation study in §6.3).

Finally, we introduce one more optional variant which adds a masked language modeling (MLM) objective (Devlin et al., 2019) as an auxiliary loss to Eq. 1: \ell + \lambda \cdot \ell^{\mathrm{mlm}} (\lambda is a hyperparameter). This helps SimCSE avoid catastrophic forgetting of token-level knowledge. As we will show in Table D.2, we find that adding this term can help improve performance on transfer tasks (but not on sentence-level STS tasks).

B Different Settings for STS Evaluation

We elaborate the differences in STS evaluation settings in previous work in terms of (a) whether to use additional regressors; (b) reported metrics; and (c) different ways to aggregate results.

Table B.1: STS evaluation protocols used in different papers. "Reg.": whether an additional regressor is used; "aggr.": methods to aggregate different subset results.

Paper                          Reg.   Metric     Aggr.
Hill et al. (2016)                    Both       all
Conneau et al. (2017)          X      Pearson    mean
Conneau and Kiela (2018)       X      Pearson    mean
Reimers and Gurevych (2019)           Spearman   all
Zhang et al. (2020)                   Spearman   all
Li et al. (2020)                      Spearman   wmean
Su et al. (2021)                      Spearman   wmean
Wieting et al. (2020)                 Pearson    mean
Giorgi et al. (2021)                  Spearman   mean
Ours                                  Spearman   all

Additional regressors. The default SentEval implementation applies a linear regressor on top of frozen sentence embeddings for STS-B and SICK-R, and trains the regressor on the training sets of the two tasks, while most sentence representation papers take the raw embeddings and evaluate in an unsupervised way. In our experiments, we do not apply any additional regressors and directly take cosine similarities for all STS tasks.

Metrics. Both Pearson's and Spearman's correlation coefficients are used in the literature. Reimers et al. (2016) argue that Spearman correlation, which measures the rankings instead of the actual scores, better suits the need of evaluating sentence embeddings. For all of our experiments, we report Spearman's rank correlation.

Aggregation methods. Given that each year's STS challenge contains several subsets, there are different choices to gather results from them: one way is to concatenate all the topics and report the overall Spearman's correlation (denoted as "all"), and the other is to calculate results for different subsets separately and average them (denoted as "mean" if it is a simple average or "wmean" if weighted by the subset sizes). However, most papers do not state which method they take, making it challenging to draw a fair comparison. We take some of the most recent work as examples: SBERT (Reimers and Gurevych, 2019), BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021)¹². In Table B.2, we compare our reproduced results to reported results of SBERT and BERT-whitening, and find that Reimers and Gurevych (2019) take the "all" setting but Li et al. (2020); Su et al. (2021) take the "wmean" setting, even though Li et al. (2020) claim that they take the same setting as Reimers and Gurevych (2019).

¹²Li et al. (2020) and Su et al. (2021) have consistent results, so we assume that they take the same evaluation and just take BERT-whitening in experiments here.