Conditional Generators of Words Definitions
Artyom Gadetsky
National Research University Higher School of Economics
artygadetsky@yandex.ru

Ilya Yakubovskiy
Joom
yakubovskiy@joom.com

Dmitry Vetrov
National Research University Higher School of Economics
Samsung-HSE Laboratory
vetrovd@yandex.ru

arXiv:1806.10090v1 [cs.CL] 26 Jun 2018

Abstract

We explore the recently introduced definition modeling technique, which provides a tool for evaluating different distributed vector representations of words through modeling dictionary definitions of words. In this work, we study the problem of word ambiguity in definition modeling and propose a possible solution that employs latent variable modeling and soft attention mechanisms. Our quantitative and qualitative evaluation and analysis of the model show that taking word ambiguity and polysemy into account leads to performance improvement.

1 Introduction

Continuous representations of words are used in many natural language processing (NLP) applications. Using pre-trained high-quality word embeddings is most effective when millions of training examples are not available, which is true for most tasks in NLP (Kumar et al., 2016; Karpathy and Fei-Fei, 2015). Recently, several unsupervised methods were introduced to learn word vectors from large corpora of texts (Mikolov et al., 2013; Pennington et al., 2014; Joulin et al., 2016). Learned vector representations have been shown to have useful and interesting properties. For example, Mikolov et al. (2013) showed that vector operations such as subtraction or addition reflect semantic relations between words. Despite all these properties, it is hard to evaluate embeddings precisely, because analogy and word similarity tasks measure the learned information only indirectly.

Quite recently, Noraset et al. (2017) introduced a more direct way to evaluate word embeddings. The authors suggested considering definition modeling as the evaluation task. In definition modeling, vector representations of words are used for conditional generation of the corresponding word definitions. The primary motivation is that a high-quality word embedding should contain all the information needed to reconstruct the definition. An important drawback of the Noraset et al. (2017) definition models is that they cannot take into account words with several different meanings. This issue is related to the word sense disambiguation task, a common problem in natural language processing. Examples are polysemantic words such as "bank" or "spring", whose meanings can only be disambiguated using their contexts. In such cases, the proposed models tend to generate definitions based on the most frequent meaning of the corresponding word. Therefore, building models that incorporate word sense disambiguation is an important research direction in natural language processing.

In this work, we study the problem of word ambiguity in the definition modeling task. We propose several models as possible solutions to it. One of them is based on the recently proposed Adaptive Skip-gram model (Bartunov et al., 2016), a generalized version of the original Skip-gram Word2Vec that can distinguish word meanings using word context. The second is an attention-based model that uses the context of the word being defined to determine the components of the embedding referring to the relevant word meaning. Our contributions are as follows: (1) we introduce two models based on recurrent neural network (RNN) language models; (2) we collect a new dataset of definitions, which contains more unique words than the one proposed in Noraset et al. (2017), and supplement it with examples of word usage; (3) finally, in the experiments section we show that our models outperform previously proposed models and are able to generate definitions that depend on the meaning of the word.
2 Related Work

2.1 Constructing Embeddings Using Dictionary Definitions

Several works utilize word definitions to learn embeddings. For example, Hill et al. (2016) use definitions to construct sentence embeddings. The authors propose to train a recurrent neural network that produces an embedding of the dictionary definition that is close to the embedding of the corresponding word. The model is evaluated on the reverse dictionary task. Bahdanau et al. (2017) suggest using definitions to compute embeddings for out-of-vocabulary words. In comparison to the work of Hill et al. (2016), their dictionary reader network is trained end-to-end for a specific task.

2.2 Definition Modeling

Definition modeling was introduced in the work of Noraset et al. (2017). The goal of the definition model p(D|w*) is to predict the probability of the words in the definition D = {w_1, ..., w_T} given the word being defined w*. The joint probability is decomposed into separate conditional probabilities, each of which is modeled using a recurrent neural network with a soft-max activation applied to its logits:

    p(D|w^*) = \prod_{t=1}^{T} p(w_t | w_{<t}, w^*)    (1)

3 Word Embeddings

Many NLP applications represent words as continuous vectors that are further used as features in machine learning models. Therefore, learning high-quality vector representations is an important task.

3.1 Skip-gram

One of the most popular and frequently used vector representations is the Skip-gram model. The original Skip-gram model consists of grouped word prediction tasks. Each task is formulated as the prediction of a word v given a word w using their input and output representations:

    p(v|w, \theta) = \frac{\exp(\mathrm{in}_w^T \mathrm{out}_v)}{\sum_{v'=1}^{V} \exp(\mathrm{in}_w^T \mathrm{out}_{v'})}    (2)

where \theta and V stand for the set of input and output word representations and the dictionary size, respectively. These individual prediction tasks are grouped so as to independently predict all adjacent words (within some sliding window) y = {y_1, ..., y_C} given the central word x:

    p(y|x, \theta) = \prod_j p(y_j | x, \theta)    (3)

The joint probability of the model is written as follows:

    p(Y|X, \theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta)    (4)

where (X, Y) = {x_i, y_i}_{i=1}^{N} are the training pairs of central words and their contexts.
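To make the objective above concrete, the following is a minimal NumPy sketch of the word prediction probability in eq. (2) and the window factorization in eq. (3). The vocabulary size, embedding dimension, and random initialization are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

V, d = 10000, 300                       # vocabulary size and embedding dimension (assumed)
in_emb = np.random.randn(V, d) * 0.01   # input representations in_w
out_emb = np.random.randn(V, d) * 0.01  # output representations out_v

def skipgram_prob(w, v):
    """p(v | w, theta): softmax over all output vectors, as in eq. (2)."""
    logits = out_emb @ in_emb[w]            # in_w^T out_v' for every v' in the vocabulary
    logits -= logits.max()                  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[v]

def window_log_prob(center, context):
    """log p(y | x, theta) for a central word and its context window, as in eq. (3)."""
    return sum(np.log(skipgram_prob(center, y)) for y in context)

# Example: probability of seeing word 42 near word 7, and the log-probability of a window.
print(skipgram_prob(7, 42))
print(window_log_prob(7, [42, 13, 256]))
```

In practice the full softmax over V words is too expensive, which is why hierarchical soft-max or negative sampling is used, as discussed next.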
3.2 Adaptive Skip-gram

In comparison with Skip-gram, the Adaptive Skip-gram model (AdaGram) assumes several meanings for each word and therefore keeps several vector representations for each word. The authors introduce a latent variable z that encodes the index of the meaning and extend (2) to p(v|z, w, \theta). They use a hierarchical soft-max approach rather than negative sampling to avoid computing the full denominator:

    p(v|z = k, w, \theta) = \prod_{n \in path(v)} \sigma(ch(n) \, \mathrm{in}_{wk}^T \mathrm{out}_n)    (5)

Here \mathrm{in}_{wk} stands for the input representation of word w with meaning index k, and the output representations are associated with the nodes of a binary tree whose leaves are all possible words in the model vocabulary, each with a unique path from the root to the corresponding leaf. ch(n) is a function that returns 1 or -1 for each node in path(·), depending on whether n is a left or a right child of the previous node in the path. A Huffman tree is often used for computational efficiency.

To automatically determine the number of meanings for each word, the authors use the constructive definition of the Dirichlet process via its stick-breaking representation (p(z = k|w, \beta)), which is a commonly used prior distribution on discrete latent variables when the number of possible values is unknown (e.g. infinite mixtures).

One important property of the model is its ability to disambiguate words using context. More formally, after training on data D = {x_i, y_i}_{i=1}^{N} we may compute the posterior probability of a word meaning given the context and take the word vector with the highest probability:

    p(z = k|x, y, \theta) \propto p(y|x, z = k, \theta) \int p(z = k|\beta, x) q(\beta) d\beta    (8)

This knowledge about the word meaning will be further utilized in one of our models as disambiguation(x|y).

4 Models

In this section, we describe our extension of the original definition model. The goal of the extended definition model is to predict the probability of a definition D = {w_1, ..., w_T} given a word being defined w* and its context C = {c_1, ..., c_m} (e.g. an example of the use of this word). As motivated earlier, the context provides the information needed to identify the word meaning. The joint probability is again decomposed into conditional probabilities, each of which is provided with the information about the context:

    p(D|w^*, C) = \prod_{t=1}^{T} p(w_t | w_{<t}, w^*, C)    (9)
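As an illustration of the factorization in eq. (9), the sketch below shows a conditional LSTM language model that receives, at every step, the embedding of the current definition token concatenated with a conditioning vector computed from the word being defined and its context. The layer sizes and the interface are assumptions for illustration, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class ConditionalDefinitionLM(nn.Module):
    """Models p(w_t | w_<t, w*, C) by feeding [cond; v_t] to an LSTM at each step."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, definition, cond):
        # definition: (B, T) ids of w_1..w_T; cond: (B, emb_dim) vector encoding the
        # word being defined, e.g. an AdaGram sense vector or the masked embedding
        # a* of eq. (11) below.
        v = self.emb(definition)                            # (B, T, emb_dim)
        cond = cond.unsqueeze(1).expand(-1, v.size(1), -1)  # repeat the condition per step
        h, _ = self.rnn(torch.cat([cond, v], dim=-1))
        return self.out(h)                                  # next-word logits over the vocabulary
```

Training maximizes the log-likelihood of eq. (9), e.g. with a cross-entropy loss between the logits at step t and the next definition token. The concrete conditioning vectors used in our models are an Adaptive Skip-gram sense vector (S+I-Adaptive) and the masked embedding described below (S+I-Attention).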
The AdaGram-based approach, however, tends to produce similar vector representations with smoothed meanings, due to theoretical guarantees on the number of learned components. To overcome this problem, and to avoid the careful hyper-parameter tuning it requires, we introduce the following model:

    h_t = g([a^*; v_t], h_{t-1})
    a^* = v^* \odot \mathrm{mask}    (11)
    \mathrm{mask} = \sigma\left(W \frac{\sum_{i=1}^{m} ANN(c_i)}{m} + b\right)

where \odot is an element-wise product, \sigma is the logistic sigmoid function, and ANN is the attention neural network, a feed-forward neural network. We motivate these updates by the fact that, after training a Skip-gram model on a large corpus, the vector representation of each word absorbs information about every meaning of the word. Using a soft binary mask that depends on the word context, we extract the components of the word embedding relevant to the corresponding meaning. We refer to this model as Input Attention (I-Attention).

4.3 Attention SkipGram

For the attention-based model, we use different embeddings for context words. Because of that, we pre-train the attention block, containing the embeddings, the attention neural network and the linear layer weights, by optimizing a negative sampling loss function in the same manner as the original Skip-gram model:

    \log \sigma(v'^{T}_{w_O} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'^{T}_{w_i} v_{w_I}) \right]    (12)

where v'_{w_O}, v_{w_I} and v'_{w_i} are the vector representations of the "positive" example, the anchor word and the negative examples, respectively. The vector v_{w_I} is computed using the embedding of w_I and the attention mechanism proposed in the previous section.
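A minimal PyTorch-style sketch of the context mask from eq. (11), using a separate embedding table for context words as described above. The hidden size and the two-layer form of the attention network ANN are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class InputAttention(nn.Module):
    """Soft binary mask over the embedding of the defined word, as in eq. (11)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.ctx_emb = nn.Embedding(vocab_size, emb_dim)  # separate embeddings for context words
        self.ann = nn.Sequential(                         # feed-forward attention network ANN
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim),
        )
        self.linear = nn.Linear(emb_dim, emb_dim)         # W and b of eq. (11)

    def forward(self, v_star, context):
        # v_star: (B, emb_dim) embedding of w*; context: (B, m) ids of c_1..c_m
        pooled = self.ann(self.ctx_emb(context)).mean(dim=1)  # (1/m) * sum_i ANN(c_i)
        mask = torch.sigmoid(self.linear(pooled))             # sigma(W (.) + b)
        return v_star * mask                                  # a* = v* element-wise mask
```

The masked vector a* then replaces the plain embedding of w* as the conditioning vector of the definition language model.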
5 Experiments

5.1 Data

We collected a new dataset of definitions using the OxfordDictionaries.com (2018) API. Each entry is a triplet containing a word, its definition, and an example of the use of this word in the given meaning. It is important to note that in our dataset words can have one or more meanings, depending on the corresponding entry in the Oxford Dictionary. Table 1 shows basic statistics of the new dataset.

Split        train      val      test
#Words       33,128     8,867    8,850
#Entries     97,855     12,232   12,232
#Tokens      1,078,828  134,486  133,987
Avg length   11.03      10.99    10.95

Table 1: Statistics of the new dataset.

5.2 Pre-training

It is well known that a good language model can often improve metrics such as BLEU for a particular NLP task (Jozefowicz et al., 2016). Accordingly, we decided to pre-train our models. For this purpose, the WikiText-103 dataset (Merity et al., 2016) was chosen. During pre-training we set v* (eq. 10) to the zero vector to make our models purely unconditional. Embeddings for these language models were initialized with Google Word2Vec vectors (https://code.google.com/archive/p/word2vec/) and were fine-tuned. Figure 1 shows that this procedure helps to decrease perplexity and prevents over-fitting. The Attention Skip-gram vectors were also trained on WikiText-103.

[Figure 1: Perplexities of the S+I-Attention model with pre-training (solid lines) and trained from scratch (dashed lines).]

5.3 Results

Both our models are LSTM networks (Hochreiter and Schmidhuber, 1997) with an embedding layer. The attention-based model has its own embedding layer, mapping context words to vector representations. First, we pre-train our models using the procedure described above. Then, we train them on the collected dataset, maximizing the log-likelihood objective using Adam (Kingma and Ba, 2014). We also anneal the learning rate by a factor of 10 if the validation loss does not decrease between epochs. We use the original Adaptive Skip-gram vectors, obtained from the official repository (https://github.com/sbos/AdaGram.jl), as inputs to S+I-Adaptive. We compare the different models using perplexity and BLEU score on the test set. The BLEU score is computed only for the models with the lowest perplexity and only on the test words that have multiple meanings. The results are presented in Table 3. We see that both models that utilize knowledge about the meaning of the word perform better than the competing one. We generated definitions using the S+I-Attention model with a simple temperature sampling algorithm (τ = 0.1). Table 2 shows the examples. The source code and dataset will be freely available at https://github.com/agadetsky/pytorch-definitions.

Model                PPL     BLEU
S+G+CH+HE (1)        45.62   11.62 ± 0.05
S+G+CH+HE (2)        46.12   -
S+G+CH+HE (3)        46.80   -
S + I-Adaptive (2)   46.08   11.53 ± 0.03
S + I-Adaptive (3)   46.93   -
S + I-Attention (2)  43.54   12.08 ± 0.02
S + I-Attention (3)  44.9    -

Table 3: Performance comparison between the best model proposed by Noraset et al. (2017) and our models on the test set. The number in brackets is the number of LSTM layers. BLEU is averaged across 3 trials.

Word | Context | Definition
star | she got star treatment | a person who is very important
star | bright star in the sky | a small circle of a celestial object or planet that is seen in a circle
sentence | sentence in prison | an act of restraining someone or something
sentence | write up the sentence | a piece of text written to be printed
head | the head of a man | the upper part of a human body
head | he will be the head of the office | the chief part of an organization, institution, etc.
reprint | they never reprinted the famous treatise | a written or printed version of a book or other publication
rape | the woman was raped on her way home at night | the act of killing
invisible | he pushed the string through an inconspicuous hole | not able to be seen
shake | my faith has been shaken | cause to be unable to think clearly
nickname | the nickname for the u.s. constitution is 'old ironsides' | a name for a person or thing that is not genuine

Table 2: Examples of definitions generated by the S + I-Attention model for words and contexts from the test set.

6 Conclusion

In this paper, we proposed two definition models that can work with polysemantic words. We evaluate them using perplexity and measure the definition generation accuracy with the BLEU score. The obtained results show that incorporating information about word senses leads to improved metrics. Moreover, the generated definitions show that even an implicit word context can help to distinguish word meanings. In future work, we plan to explore the individual components of word embeddings and the mask produced by our attention-based model to gain a deeper understanding of vector representations of words.

Acknowledgments

This work was partly supported by Samsung Research, Samsung Electronics, Sberbank AI Lab and the Russian Science Foundation grant 17-71-20072.
References

Dzmitry Bahdanau, Tom Bosc, Stanislaw Jastrzebski, Edward Grefenstette, Pascal Vincent, and Yoshua Bengio. 2017. Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In Artificial Intelligence and Statistics, pages 130–138.

Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1378–1387. PMLR.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Andriy Mnih and Geoffrey E. Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.

Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pages 3259–3266. AAAI Press.

OxfordDictionaries.com. 2018. Oxford University Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.