Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation
Yingjun Du (University of Amsterdam), Nithin Holla (Amberscript), Xiantong Zhen (University of Amsterdam), Cees G.M. Snoek (University of Amsterdam), Ekaterina Shutova (University of Amsterdam)

arXiv:2106.02960v1 [cs.CL] 5 Jun 2021

Abstract

A critical challenge faced by supervised word sense disambiguation (WSD) is the lack of large annotated datasets with sufficient coverage of words in their diversity of senses. This inspired recent research on few-shot WSD using meta-learning. While such work has successfully applied meta-learning to learn new word senses from very few examples, its performance still lags behind its fully-supervised counterpart. Aiming to further close this gap, we propose a model of semantic memory for WSD in a meta-learning setting. Semantic memory encapsulates prior experiences seen throughout the lifetime of the model, which aids better generalization in limited data settings. Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork. We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce (e.g. one-shot) scenarios and produces meaning prototypes that capture similar senses of distinct words.

1 Introduction

Disambiguating word meaning in context is at the heart of any natural language understanding task or application, whether it is performed explicitly or implicitly. Traditionally, word sense disambiguation (WSD) has been defined as the task of explicitly labeling word usages in context with sense labels from a pre-defined sense inventory. The majority of approaches to WSD rely on (semi-)supervised learning (Yuan et al., 2016; Raganato et al., 2017a,b; Hadiwinoto et al., 2019; Huang et al., 2019; Scarlini et al., 2020; Bevilacqua and Navigli, 2020) and make use of training corpora manually annotated for word senses. Typically, these methods require a fairly large number of annotated training examples per word. This problem is exacerbated by the dramatic imbalances in sense frequencies, which further increase the need for annotation to capture a diversity of senses and to obtain sufficient training data for rare senses.

This motivated recent research on few-shot WSD, where the objective of the model is to learn new, previously unseen word senses from only a small number of examples. Holla et al. (2020a) presented a meta-learning approach to few-shot WSD, as well as a benchmark for this task. Meta-learning makes use of an episodic training regime, where a model is trained on a collection of diverse few-shot tasks and is explicitly optimized to perform well when learning from a small number of examples per task (Snell et al., 2017; Finn et al., 2017; Triantafillou et al., 2020). Holla et al. (2020a) have shown that meta-learning can be successfully applied to learn new word senses from as few as one example per sense. Yet, the overall model performance in settings where data is highly limited (e.g. one- or two-shot learning) still lags behind that of fully supervised models.

In the meantime, machine learning research demonstrated the advantages of a memory component for meta-learning in limited data settings (Santoro et al., 2016a; Munkhdalai and Yu, 2017a; Munkhdalai et al., 2018; Zhen et al., 2020). The memory stores general knowledge acquired in learning related tasks, which facilitates the acquisition of new concepts and recognition of previously unseen classes with limited labeled data (Zhen et al., 2020). Inspired by these advances, we introduce the first model of semantic memory for WSD in a meta-learning setting. In meta-learning, prototypes are embeddings around which other data points of the same class are clustered (Snell et al., 2017). Our semantic memory stores prototypical representations of word senses seen during training, generalizing over the contexts in which they are used. This rich contextual information aids in learning new senses of previously unseen words
that appear in similar contexts, from very few examples.

The design of our prototypical representation of word sense takes inspiration from prototype theory (Rosch, 1975), an established account of category representation in psychology. It stipulates that semantic categories are formed around prototypical members, new members are added based on resemblance to the prototypes and category membership is a matter of degree. In line with this account, our models learn prototypical representations of word senses from their linguistic context. To do this, we employ a neural architecture for learning probabilistic class prototypes: variational prototype networks, augmented with a variational semantic memory (VSM) component (Zhen et al., 2020).

Unlike deterministic prototypes in prototypical networks (Snell et al., 2017), we model class prototypes as distributions and perform variational inference of these prototypes in a hierarchical Bayesian framework. Unlike deterministic memory access in memory-based meta-learning (Santoro et al., 2016b; Munkhdalai and Yu, 2017a), we access memory by Monte Carlo sampling from a variational distribution. Specifically, we first perform variational inference to obtain a latent memory variable and then perform another step of variational inference to obtain the prototype distribution. Furthermore, we enhance the memory update of vanilla VSM with a novel adaptive update rule involving a hypernetwork (Ha et al., 2016) that controls the weight of the updates. We call our approach β-VSM to denote the adaptive weight β for memory updates.

We experimentally demonstrate the effectiveness of this approach for few-shot WSD, advancing the state of the art in this task. Furthermore, we observe the highest performance gains on word senses with the fewest training examples, emphasizing the benefits of semantic memory for truly few-shot learning scenarios. Our analysis of the meaning prototypes acquired in the memory suggests that they are able to capture related senses of distinct words, demonstrating the generalization capabilities of our memory component. We make our code publicly available to facilitate further research.

2 Related work

Word sense disambiguation. Knowledge-based approaches to WSD (Lesk, 1986; Agirre et al., 2014; Moro et al., 2014) rely on lexical resources such as WordNet (Miller et al., 1990) and do not require a corpus manually annotated with word senses. Alternatively, supervised learning methods treat WSD as a word-level classification task for ambiguous words and rely on sense-annotated corpora for training. Early supervised learning approaches trained classifiers with hand-crafted features (Navigli, 2009; Zhong and Ng, 2010) and word embeddings (Rothe and Schütze, 2015; Iacobacci et al., 2016) as input. Raganato et al. (2017a) proposed a benchmark for WSD based on the SemCor corpus (Miller et al., 1994) and found that supervised methods outperform the knowledge-based ones.

Neural models for supervised WSD include LSTM-based (Hochreiter and Schmidhuber, 1997) classifiers (Kågebäck and Salomonsson, 2016; Melamud et al., 2016; Raganato et al., 2017b), a nearest neighbour classifier with ELMo embeddings (Peters et al., 2018), as well as a classifier based on pretrained BERT representations (Hadiwinoto et al., 2019). Recently, hybrid approaches incorporating information from lexical resources into neural architectures have gained traction. GlossBERT (Huang et al., 2019) fine-tunes BERT with WordNet sense definitions as additional input. EWISE (Kumar et al., 2019) learns continuous sense embeddings as targets, aided by dictionary definitions and lexical knowledge bases. Scarlini et al. (2020) present a semi-supervised approach for obtaining sense embeddings with the aid of a lexical knowledge base, enabling WSD with a nearest neighbor algorithm. By further exploiting the graph structure of WordNet and integrating it with BERT, EWISER (Bevilacqua and Navigli, 2020) achieves the current state-of-the-art performance on the benchmark by Raganato et al. (2017a): an F1 score of 80.1%. Unlike few-shot WSD, these works do not fine-tune the models on new words during testing. Instead, they train on a training set and evaluate on a test set where words and senses might have been seen during training.

Meta-learning. Meta-learning, or learning to learn (Schmidhuber, 1987; Bengio et al., 1991; Thrun and Pratt, 1998), is a learning paradigm where a model is trained on a distribution of tasks so as to enable rapid learning on new tasks. By solving a large number of different tasks, it aims to leverage the acquired knowledge to learn new, unseen tasks. The training set, referred to as the
meta-training set, consists of episodes, each corresponding to a distinct task. Every episode is further divided into a support set containing just a handful of examples for learning the task, and a query set containing examples for task evaluation. In the meta-training phase, for each episode, the model adapts to the task using the support set, and its performance on the task is evaluated on the corresponding query set. The initial parameters of the model are then adjusted based on the loss on the query set. By repeating the process on several episodes/tasks, the model produces representations that enable rapid adaptation to a new task. The test set, referred to as the meta-test set, also consists of episodes with a support and query set. The meta-test set corresponds to new tasks that were not seen during meta-training. During meta-testing, the meta-trained model is first fine-tuned on a small number of examples in the support set of each meta-test episode and then evaluated on the accompanying query set. The average performance on all such query sets measures the few-shot learning ability of the model.

Metric-based meta-learning methods (Koch et al., 2015; Vinyals et al., 2016; Sung et al., 2018; Snell et al., 2017) learn a kernel function and make predictions on the query set based on the similarity with the support set examples. Model-based methods (Santoro et al., 2016b; Munkhdalai and Yu, 2017a) employ external memory and make predictions based on examples retrieved from the memory. Optimization-based methods (Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018; Antoniou et al., 2019) directly optimize for generalizability over tasks in their training objective.

Meta-learning has been applied to a range of tasks in NLP, including machine translation (Gu et al., 2018), relation classification (Obamuyide and Vlachos, 2019), text classification (Yu et al., 2018; Geng et al., 2019), hypernymy detection (Yu et al., 2020), and dialog generation (Qian and Yu, 2019). It has also been used to learn across distinct NLP tasks (Dou et al., 2019; Bansal et al., 2019) as well as across different languages (Nooralahzadeh et al., 2020; Li et al., 2020). Bansal et al. (2020) show that meta-learning during self-supervised pretraining of language models leads to improved few-shot generalization on downstream tasks.

Holla et al. (2020a) propose a framework for few-shot word sense disambiguation, where the goal is to disambiguate new words during meta-testing. Meta-training consists of episodes formed from multiple words whereas meta-testing has one episode corresponding to each of the test words. They show that prototype-based methods, namely prototypical networks (Snell et al., 2017) and first-order ProtoMAML (Triantafillou et al., 2020), obtain promising results, in contrast with model-agnostic meta-learning (MAML) (Finn et al., 2017).

Memory-based models. Memory mechanisms (Weston et al., 2014; Graves et al., 2014; Krotov and Hopfield, 2016) have recently drawn increasing attention. In the memory-augmented neural network (Santoro et al., 2016b), given an input, the memory read and write operations are performed by a controller, using soft attention for reads and a least recently used access module for writes. Meta Network (Munkhdalai and Yu, 2017b) uses two memory modules: a key-value memory in combination with slow and fast weights for one-shot learning. An external memory was introduced to enhance recurrent neural networks in Munkhdalai et al. (2019), in which memory is conceptualized as an adaptable function and implemented as a deep neural network. Semantic memory has recently been introduced by Zhen et al. (2020) for few-shot learning to enhance prototypical representations of objects, where memory recall is cast as a variational inference problem.

In NLP, Tang et al. (2016) use content and location-based neural attention over external memory for aspect-level sentiment classification. Das et al. (2017) use key-value memory for question answering on knowledge bases. Mem2Seq (Madotto et al., 2018) is an architecture for task-oriented dialog that combines attention-based memory with pointer networks (Vinyals et al., 2015). Geng et al. (2020) propose Dynamic Memory Induction Networks for few-shot text classification, which utilize dynamic routing (Sabour et al., 2017) over a static memory module. Episodic memory has been used in lifelong learning on language tasks, as a means to perform experience replay (d’Autume et al., 2019; Han et al., 2020; Holla et al., 2020b).

3 Task and dataset

We treat WSD as a word-level classification problem where ambiguous words are to be classified into their senses given the context. In traditional WSD, the goal is to generalize to new contexts of word-sense pairs. Specifically, the test set consists of word-sense pairs that were seen during training. On the other hand, in few-shot WSD, the goal is to generalize to new words and senses altogether. The meta-testing phase involves further adapting the models (on the small support set) to new words that were not seen during training and evaluates them on new contexts (using the query set). It deviates from the standard N-way, K-shot classification setting in few-shot learning since the words may have a different number of senses and each sense may have a different number of examples (Holla et al., 2020a), making it a more realistic few-shot learning setup (Triantafillou et al., 2020).

Dataset. We use the few-shot WSD benchmark provided by Holla et al. (2020a). It is based on the SemCor corpus (Miller et al., 1994), annotated with senses from the New Oxford American Dictionary by Yuan et al. (2016). The dataset consists of words grouped into meta-training, meta-validation and meta-test sets. The meta-test set consists of new words that were not part of the meta-training and meta-validation sets. There are four setups varying in the number of sentences in the support set |S| = 4, 8, 16, 32. |S| = 4 corresponds to an extreme few-shot learning scenario for most words, whereas |S| = 32 comes closer to the number of sentences per word encountered in standard WSD setups. For |S| = 4, 8, 16, 32, the number of unique words in the meta-training / meta-validation / meta-test sets is 985/166/270, 985/163/259, 799/146/197 and 580/85/129 respectively. We use the publicly available standard dataset splits.2

2 MetaWSD

Episodes. The meta-training episodes were created by first sampling a set of words and a fixed number of senses per word, followed by sampling example sentences for these word-sense pairs. This strategy allows for a combinatorially large number of episodes. Every meta-training episode has |S| sentences in both the support and query sets, and corresponds to the distinct task of disambiguating between the sampled word-sense pairs. The total number of meta-training episodes is 10,000. In the meta-validation and meta-test sets, each episode corresponds to the task of disambiguating a single, previously unseen word between all its senses. For every meta-test episode, the model is fine-tuned on a few examples in the support set and its generalizability is evaluated on the query set. In contrast to the meta-training episodes, the meta-test episodes reflect a natural distribution of senses in the corpus, including class imbalance, providing a realistic evaluation setting.

4 Methods

4.1 Model architectures

We experiment with the same model architectures as Holla et al. (2020a). The model $f_\theta$, with parameters $\theta$, takes words $x_i$ as input and produces a per-word representation vector $f_\theta(x_i)$ for $i = 1, \dots, L$, where $L$ is the length of the sentence. Sense predictions are only made for ambiguous words using the corresponding word representation.

GloVe+GRU. Single-layer bi-directional GRU (Cho et al., 2014) network followed by a single linear layer, that takes GloVe embeddings (Pennington et al., 2014) as input. GloVe embeddings capture all senses of a word. We thus evaluate a model’s ability to disambiguate from sense-agnostic input.

ELMo+MLP. A multi-layer perceptron (MLP) network that receives contextualized ELMo embeddings (Peters et al., 2018) as input. Their contextualised nature makes ELMo embeddings better suited to capture meaning variation than the static ones. Since ELMo is not fine-tuned, this model has the lowest number of learnable parameters.

BERT. Pretrained BERT-BASE (Devlin et al., 2019) model followed by a linear layer, fully fine-tuned on the task. BERT underlies state-of-the-art approaches to WSD.

4.2 Prototypical Network

Our few-shot learning approach builds upon prototypical networks (Snell et al., 2017), which are widely used for few-shot image classification and have been shown to be successful in WSD (Holla et al., 2020a). A prototypical network computes a prototype $z_k = \frac{1}{K} \sum_{k} f_\theta(x_k)$ of each word sense (where $K$ is the number of examples for each word sense) through an embedding function $f_\theta$, which is realized as the aforementioned architectures. It computes a distribution over classes for a query sample $x$ given a distance function $d(\cdot, \cdot)$ as the softmax over its negative distances to the prototypes in the embedding space:

$$p(y_i = k \mid x) = \frac{\exp(-d(f_\theta(x), z_k))}{\sum_{k'} \exp(-d(f_\theta(x), z_{k'}))} \quad (1)$$
However, the resulting prototypes may not be sufficiently representative of word senses as semantic categories when using a single deterministic vector, computed as the average of only a few examples. Such representations lack expressiveness and may not encompass sufficient intra-class variance, which is needed to distinguish between different fine-grained word senses. Moreover, large uncertainty arises in the single prototype due to the small number of samples.

4.3 Variational Prototype Network

Variational prototype network (Zhen et al., 2020) (VPN) is a powerful model for learning latent representations from small amounts of data, where the prototype $z$ of each class is treated as a distribution. Given a task with a support set $S$ and query set $Q$, the objective of VPN takes the following form:

$$L_{\mathrm{VPN}} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \Big[ -\frac{1}{L_z} \sum_{l_z=1}^{L_z} \log p(y_i \mid x_i, z^{(l_z)}) + \lambda D_{\mathrm{KL}}\big[ q(z \mid S) \,\|\, p(z \mid x_i) \big] \Big] \quad (2)$$

where $q(z \mid S)$ is the variational posterior over $z$, $p(z \mid x_i)$ is the prior, and $L_z$ is the number of Monte Carlo samples for $z$. The prior and posterior are assumed to be Gaussian. The re-parameterization trick (Kingma and Welling, 2013) is adopted to enable back-propagation with gradient descent, i.e., $z^{(l_z)} = f(S, \epsilon^{(l_z)})$, $\epsilon^{(l_z)} \sim \mathcal{N}(0, I)$, $f(\cdot, \cdot) = \mu_z + \sigma_z \odot \epsilon^{(l_z)}$, where the mean $\mu_z$ and diagonal covariance $\sigma_z$ are generated from the posterior inference network with $S$ as input. The amortization technique is employed for the implementation of VPN. The posterior network takes the mean word representations in the support set $S$ as input and returns the parameters of $q(z \mid S)$. Similarly, the prior network produces the parameters of $p(z \mid x_i)$ by taking the query word representation $x_i \in Q$ as input. The conditional predictive log-likelihood is implemented as a cross-entropy loss.

4.4 β-Variational Semantic Memory

In order to leverage the shared common knowledge between different tasks to improve disambiguation in future tasks, we incorporate variational semantic memory (VSM) as in Zhen et al. (2020). It consists of two main processes: memory recall, which retrieves relevant information that fits with specific tasks based on the support set of the current task; and memory update, which effectively collects new information from the task and gradually consolidates the semantic knowledge in the memory. We adopt a similar memory mechanism and introduce an improved update rule for memory consolidation.

Figure 1: Computational graph of variational semantic memory for few-shot WSD. M is the semantic memory module, S the support set, x and y are the query sample and label, and z is the word sense prototype.

Memory recall. The memory recall of VSM aims to choose the related content from the memory, and is accomplished by variational inference. It introduces latent memory $m$ as an intermediate stochastic variable, and infers $m$ from the addressed memory $M$. The approximate variational posterior $q(m \mid M, S)$ over the latent memory $m$ is obtained empirically by

$$q(m \mid M, S) = \sum_{a=1}^{|M|} \gamma_a \, p(m \mid M_a), \quad (3)$$

where

$$\gamma_a = \frac{\exp g(M_a, S)}{\sum_{i} \exp g(M_i, S)} \quad (4)$$

Here $g(\cdot)$ is the dot product, $|M|$ is the number of memory slots, $M_a$ is the memory content at slot $a$ and stores the prototype of samples in each class, and we take the mean representation of samples in $S$.

The variational posterior over the prototype then becomes:

$$\tilde{q}(z \mid M, S) \approx \frac{1}{L_m} \sum_{l_m=1}^{L_m} q(z \mid m^{(l_m)}, S), \quad (5)$$

where $m^{(l_m)}$ is a Monte Carlo sample drawn from the distribution $q(m \mid M, S)$, and $L_m$ is the number of samples. By incorporating the latent memory $m$ from Eq. (3), we achieve the objective for variational semantic memory as follows:

$$L_{\mathrm{VSM}} = \sum_{i=1}^{|Q|} \Big[ -\mathbb{E}_{q(z \mid S, m)} \big[ \log p(y_i \mid x_i, z) \big] + \lambda_z D_{\mathrm{KL}}\big[ q(z \mid S, m) \,\|\, p(z \mid x_i) \big] + \lambda_m D_{\mathrm{KL}}\Big[ \sum_{a}^{|M|} \gamma_a \, p(m \mid M_a) \,\Big\|\, p(m \mid S) \Big] \Big] \quad (6)$$

where $p(m \mid S)$ is the introduced prior over $m$, and $\lambda_z$ and $\lambda_m$ are hyperparameters. The overall computational graph of VSM is shown in Figure 1. Similarly, the posterior and prior over $m$ are also assumed to be Gaussian and obtained by using amortized inference networks; more details are provided in Appendix A.1.

Memory update. The memory update should effectively absorb new useful information to enrich the memory content. VSM employs an update rule as follows:

$$M_c \leftarrow \beta M_c + (1 - \beta) \bar{M}_c, \quad (7)$$

where $M_c$ is the memory content corresponding to class $c$, $\bar{M}_c$ is obtained using graph attention (Veličković et al., 2017), and $\beta \in (0, 1)$ is a hyperparameter.

Adaptive memory update. Although VSM was shown to be promising for few-shot image classification, it can be seen from the experiments by Zhen et al. (2020) that different values of β have considerable influence on the performance. β determines the extent to which memory is updated at each iteration. In the original VSM, β is treated as a hyperparameter obtained by cross-validation, which is time-consuming and inflexible in dealing with different datasets. To address this problem, we propose an adaptive memory update rule by learning β from data using a lightweight hypernetwork (Ha et al., 2016). To be more specific, we obtain β by a function $f_\beta(\cdot)$ implemented as an MLP with a sigmoid activation function in the output layer. The hypernetwork takes $\bar{M}_c$ as input and returns the value of β:

$$\beta = f_\beta(\bar{M}_c) \quad (8)$$

Moreover, to prevent the possibility of endless growth of memory value, we propose to scale down the memory value whenever $\|M_c\|_2 > 1$. This is achieved by scaling as follows:

$$M_c = \frac{M_c}{\max(1, \|M_c\|_2)} \quad (9)$$

When we update memory, we feed the newly obtained memory $\bar{M}_c$ into the hypernetwork $f_\beta(\cdot)$ and output the adaptive β for the update. We provide a more detailed implementation of β-VSM in Appendix A.1.

5 Experiments and results

Experimental setup. The size of the shared linear layer and memory content of each word sense is 64, 256, and 192 for GloVe+GRU, ELMo+MLP and BERT respectively. The activation function of the shared linear layer is tanh for GloVe+GRU and ReLU for the rest. The inference networks $g_\phi(\cdot)$ for calculating the prototype distribution and $g_\psi(\cdot)$ for calculating the memory distribution are all three-layer MLPs, with the size of each hidden layer being 64, 256, and 192 for GloVe+GRU, ELMo+MLP and BERT. The activation function of their hidden layers is ELU (Clevert et al., 2016), and the output layer does not use any activation function. Each batch during meta-training includes 16 tasks. The hypernetwork $f_\beta(\cdot)$ is also a three-layer MLP, with the size of the hidden state consistent with that of the memory contents. The linear layer activation function is ReLU for the hypernetwork. For BERT and |S| = {4, 8}, $\lambda_z$ = 0.001, $\lambda_m$ = 0.0001 and the learning rate is 5e−6; for |S| = 16, $\lambda_z$ = 0.0001, $\lambda_m$ = 0.0001 and the learning rate is 1e−6; for |S| = 32, $\lambda_z$ = 0.001, $\lambda_m$ = 0.0001 and the learning rate is 1e−5. Hyperparameters for other models are reported in Appendix A.2. All the hyperparameters are chosen using the meta-validation set. The number of slots in memory is consistent with the number of senses in the meta-training set: 2915 for |S| = 4 and 8; 2452 for |S| = 16; 1937 for |S| = 32. The evaluation metric is the word-level macro F1 score, averaged over all episodes in the meta-test set. The parameters are optimized using Adam (Kingma and Ba, 2014).

We compare our methods against several baselines and state-of-the-art approaches. The nearest neighbor classifier baseline (NearestNeighbor) predicts a query example’s sense as the sense of the support example closest in the word embedding space (ELMo and BERT) in terms of cosine distance. The episodic fine-tuning baseline (EF-ProtoNet) is one where only meta-testing is
performed, starting from a randomly initialized model. Prototypical network (ProtoNet) and ProtoFOMAML achieve the highest few-shot WSD performance to date on the benchmark of Holla et al. (2020a).

Results. In Table 1, we show the average macro F1 scores of the models, with their mean and standard deviation obtained over five independent runs. Our proposed β-VSM achieves the new state-of-the-art performance on few-shot WSD with all the embedding functions, across all the setups with varying |S|. For GloVe+GRU, where the input is sense-agnostic embeddings, our model improves disambiguation compared to ProtoNet by 1.8% for |S| = 4 and by 2.4% for |S| = 32. With contextual embeddings as input, β-VSM with ELMo+MLP also leads to improvements compared to the previous best ProtoFOMAML for all |S|. Holla et al. (2020a) obtained state-of-the-art performance with BERT, and β-VSM further advances this, resulting in a gain of 0.9 to 2.2%. The consistent improvements with different embedding functions and support set sizes suggest that our β-VSM is effective for few-shot WSD for varying number of shots and senses as well as across model architectures.

                                             Average macro F1 score
Embedding/Encoder  Method                 |S| = 4         |S| = 8         |S| = 16        |S| = 32
-                  MajoritySenseBaseline  0.247           0.259           0.264           0.261
GloVe+GRU          NearestNeighbor        –               –               –               –
GloVe+GRU          EF-ProtoNet            0.522 ± 0.008   0.539 ± 0.009   0.538 ± 0.003   0.562 ± 0.005
GloVe+GRU          ProtoNet               0.579 ± 0.004   0.601 ± 0.003   0.633 ± 0.008   0.654 ± 0.004
GloVe+GRU          ProtoFOMAML            0.577 ± 0.011   0.616 ± 0.005   0.626 ± 0.005   0.631 ± 0.008
GloVe+GRU          β-VSM (Ours)           0.597 ± 0.005   0.631 ± 0.004   0.652 ± 0.006   0.678 ± 0.007
ELMo+MLP           NearestNeighbor        0.624           0.641           0.645           0.654
ELMo+MLP           EF-ProtoNet            0.609 ± 0.008   0.635 ± 0.004   0.661 ± 0.004   0.683 ± 0.003
ELMo+MLP           ProtoNet               0.656 ± 0.006   0.688 ± 0.004   0.709 ± 0.006   0.731 ± 0.006
ELMo+MLP           ProtoFOMAML            0.670 ± 0.005   0.700 ± 0.004   0.724 ± 0.003   0.737 ± 0.007
ELMo+MLP           β-VSM (Ours)           0.679 ± 0.006   0.709 ± 0.005   0.735 ± 0.004   0.758 ± 0.005
BERT               NearestNeighbor        0.681           0.704           0.716           0.741
BERT               EF-ProtoNet            0.594 ± 0.008   0.655 ± 0.004   0.682 ± 0.005   0.721 ± 0.009
BERT               ProtoNet               0.696 ± 0.011   0.750 ± 0.008   0.755 ± 0.002   0.766 ± 0.003
BERT               ProtoFOMAML            0.719 ± 0.005   0.756 ± 0.007   0.744 ± 0.007   0.761 ± 0.005
BERT               β-VSM (Ours)           0.728 ± 0.012   0.773 ± 0.005   0.776 ± 0.003   0.788 ± 0.003

Table 1: Model performance comparison on the meta-test words using different embedding functions.

6 Analysis and discussion

To analyze the contributions of different components in our method, we perform an ablation study by comparing ProtoNet, VPN, VSM and β-VSM and present the macro F1 scores in Table 2.

Role of variational prototypes. VPN consistently outperforms ProtoNet with all embedding functions (by around 1% F1 score on average). The results indicate that the probabilistic prototypes provide more informative representations of word senses compared to deterministic vectors. The highest gains were obtained in case of GloVe+GRU (1.7% F1 score with |S| = 8), suggesting that probabilistic prototypes are particularly useful for models that rely on static word embeddings, as they capture uncertainty in contextual interpretation.

Role of variational semantic memory. We show the benefit of VSM by comparing it with VPN. VSM consistently surpasses VPN with all three embedding functions. According to our analysis, VSM makes the prototypes of different word senses more distinctive and distant from each other. The senses in memory provide more context information, enabling larger intra-class variations to be captured, and thus lead to improvements upon VPN.

Role of adaptive β. To demonstrate the effectiveness of the hypernetwork for adaptive β, we compare β-VSM with VSM where β is tuned by cross-validation. It can be seen from Table 2 that there is consistent improvement over VSM. Thus, the learned adaptive β acquires the ability to determine how much of the contents of memory needs to be updated based on the current new memory. β-VSM enables the memory content of different word senses to be more representative by better absorbing information from data with adaptive update, resulting in improved performance.
                                 Average macro F1 score
Embedding/Encoder  Method   |S| = 4         |S| = 8         |S| = 16        |S| = 32
GloVe+GRU          ProtoNet 0.579 ± 0.004   0.601 ± 0.003   0.633 ± 0.008   0.654 ± 0.004
GloVe+GRU          VPN      0.583 ± 0.005   0.618 ± 0.005   0.641 ± 0.007   0.668 ± 0.005
GloVe+GRU          VSM      0.587 ± 0.004   0.625 ± 0.004   0.645 ± 0.006   0.670 ± 0.005
GloVe+GRU          β-VSM    0.597 ± 0.005   0.631 ± 0.004   0.652 ± 0.006   0.678 ± 0.007
ELMo+MLP           ProtoNet 0.656 ± 0.006   0.688 ± 0.004   0.709 ± 0.006   0.731 ± 0.006
ELMo+MLP           VPN      0.661 ± 0.005   0.694 ± 0.006   0.718 ± 0.004   0.741 ± 0.004
ELMo+MLP           VSM      0.670 ± 0.006   0.707 ± 0.006   0.726 ± 0.005   0.750 ± 0.004
ELMo+MLP           β-VSM    0.679 ± 0.006   0.709 ± 0.005   0.735 ± 0.004   0.758 ± 0.005
BERT               ProtoNet 0.696 ± 0.011   0.750 ± 0.008   0.755 ± 0.002   0.766 ± 0.003
BERT               VPN      0.703 ± 0.011   0.761 ± 0.007   0.762 ± 0.004   0.779 ± 0.002
BERT               VSM      0.717 ± 0.013   0.769 ± 0.006   0.770 ± 0.005   0.784 ± 0.002
BERT               β-VSM    0.728 ± 0.012   0.773 ± 0.005   0.776 ± 0.003   0.788 ± 0.003

Table 2: Ablation study comparing the meta-test performance of the different variants of prototypical networks.

Figure 2: Distribution of average macro F1 scores over number of senses for BERT-based models with |S| = 16. Panels: (a) ProtoNet, (b) VPN, (c) VSM, (d) β-VSM.

Variation of performance with the number of senses. In order to further probe into the strengths of β-VSM, we analyze the macro F1 scores of the different models averaged over all the words in the meta-test set with a particular number of senses. In Figure 2, we show a bar plot of the scores obtained from BERT for |S| = 16. For words with a low number of senses, the task corresponds to a higher number of effective shots and vice versa. It can be seen that the different models perform roughly the same for words with fewer senses, i.e., 2 to 4. VPN is comparable to ProtoNet in its distribution of scores. But with semantic memory, VSM improves the performance on words with a higher number of senses. β-VSM further boosts the scores for such words on average. The same trend is observed for |S| = 8 (see Appendix A.3). Therefore, the improvements of β-VSM over ProtoNet come from tasks with fewer shots, indicating that VSM is particularly effective at disambiguation in low-shot scenarios.

Visualization of prototypes. To study the distinction between the prototype distributions of word senses obtained by β-VSM, VSM and VPN, we visualize them using t-SNE (Van der Maaten and Hinton, 2008). Figure 3 shows prototype distributions based on BERT for the word draw. Different colored ellipses indicate the distribution of its different senses obtained from the support set. Different colored points indicate the representations of the query examples. β-VSM makes the prototypes of different word senses of the same word more distinctive and distant from each other, with less overlap, compared to the other models. Notably, the representations of query examples are closer to their corresponding prototype distribution for β-VSM, thereby resulting in improved performance.

We also visualize the prototype distributions of similar vs. dissimilar senses of multiple words in Figure 4 (see Appendix A.4 for example sentences). The blue ellipse corresponds to the ‘set up’ sense of launch from the meta-test samples. Green and gray ellipses correspond to a similar sense of the words start and establish from the memory. We can see that they are close to each other. Orange and purple ellipses correspond to other senses of the words start and establish from the memory, and they are well separated. For a given query word, our model is thus able to retrieve related senses from the memory and exploit them to make its word sense distribution more representative and distinctive.
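The retrieval of related senses described above rests on the addressing weights of Eq. (4): a softmax over dot products between each memory slot and the mean support representation. The sketch below illustrates only that weighting step in plain Python. The slot names and vectors are hypothetical toy values, and the subsequent sampling of the latent memory $m$ from the mixture in Eq. (3) is left out.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def addressing_weights(memory, support_mean):
    """Eq. (4): softmax over dot products g(M_a, S) for every memory slot a."""
    slots = list(memory)
    scores = [dot(memory[a], support_mean) for a in slots]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return {a: e / total for a, e in zip(slots, exps)}


# Hypothetical memory slots (one prototype per stored sense).
memory = {
    "start (set up)": [0.9, 0.1],
    "establish (set up)": [0.8, 0.2],
    "start (other sense)": [-0.7, 0.6],
}
support_mean = [1.0, 0.0]  # hypothetical mean support representation for a query word
gamma = addressing_weights(memory, support_mean)
```

Slots whose stored prototype points in a similar direction to the support representation receive higher weight, which is the sense in which related senses of other words can contribute to the prototype of a new word.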
Figure 3: Prototype distributions of distinct senses of draw with different models.

Figure 4: Prototype distributions of similar sense of launch (blue), start (green) and establish (grey). Distinct senses: start (orange) and establish (purple).

7 Conclusion

In this paper, we presented a model of variational semantic memory for few-shot WSD. We use a variational prototype network to model the prototype of each word sense as a distribution. To leverage the shared common knowledge between tasks, we incorporate semantic memory into the probabilistic model of prototypes in a hierarchical Bayesian framework. VSM is able to acquire long-term, general knowledge that enables learning new senses from very few examples. Furthermore, we propose adaptive β-VSM which learns an adaptive memory update rule from data using a lightweight hypernetwork. The consistent new state-of-the-art performance with three different embedding functions shows the benefit of our model in boosting few-shot WSD.

Since meaning disambiguation is central to many natural language understanding tasks, models based on semantic memory are a promising direction in NLP, more generally. Future work might investigate the role of memory in modeling meaning variation across domains and languages, as well as in tasks that integrate knowledge at different levels of linguistic hierarchy.
International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question answering on knowledge bases and text using universal schema and memory networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 358–365, Vancouver, Canada. Association for Computational Linguistics.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3622–3631, Brussels, Belgium. Association for Computational Linguistics.

David Ha, Andrew Dai, and Quoc V Le. 2016.