Multi-Prototype Vector-Space Models of Word Meaning
Joseph Reisinger and Raymond J. Mooney
Department of Computer Science
The University of Texas at Austin
1 University Station C0500, Austin, TX 78712-0233
joeraii@cs.utexas.edu, mooney@cs.utexas.edu

Abstract

Current vector-space models of lexical semantics create a single “prototype” vector to represent the meaning of a word. However, due to lexical ambiguity, encoding word meaning with a single vector is problematic. This paper presents a method that uses clustering to produce multiple “sense-specific” vectors for each word. This approach provides a context-dependent vector representation of word meaning that naturally accommodates homonymy and polysemy. Experimental comparisons to human judgements of semantic similarity for both isolated words and words in sentential contexts demonstrate the superiority of this approach over both prototype- and exemplar-based vector-space models.

1 Introduction

Automatically judging the degree of semantic similarity between words is an important task useful in text classification (Baker and McCallum, 1998), information retrieval (Sanderson, 1994), textual entailment, and other language processing tasks. The standard empirical approach to this task exploits the distributional hypothesis, i.e. that similar words appear in similar contexts (Curran and Moens, 2002; Lin and Pantel, 2002; Pereira et al., 1993). Traditionally, word types are represented by a single vector of contextual features derived from co-occurrence information, and semantic similarity is computed using some measure of vector distance (Lee, 1999; Lowe, 2001).

However, due to homonymy and polysemy, capturing the semantics of a word with a single vector is problematic. For example, the word club is similar to both bat and association, which are not at all similar to each other. Word meaning violates the triangle inequality when viewed at the level of word types, posing a problem for vector-space models (Tversky and Gati, 1982). A single “prototype” vector is simply incapable of capturing phenomena such as homonymy and polysemy. Also, most vector-space models are context independent, while the meaning of a word clearly depends on context. The word club in “The caveman picked up the club” is similar to bat in “John hit the robber with a bat,” but not in “The bat flew out of the cave.”

We present a new resource-lean vector-space model that represents a word’s meaning by a set of distinct “sense specific” vectors. The similarity of two isolated words A and B is defined as the minimum distance between one of A’s vectors and one of B’s vectors. In addition, a context-dependent meaning for a word is determined by choosing one of the vectors in its set based on minimizing the distance to the vector representing the current context. Consequently, the model supports judging the similarity of both words in isolation and words in context.

The set of vectors for a word is determined by unsupervised word sense discovery (WSD) (Schütze, 1998), which clusters the contexts in which a word appears. In previous work, vector-space lexical similarity and word sense discovery have been treated as two separate tasks. This paper shows how they can be combined to create an improved vector-space model of lexical semantics. First, a word’s contexts are clustered to produce groups of similar context vectors. An average “prototype” vector is then computed separately for each cluster, producing a set of vectors for each word. Finally, as described above, these cluster vectors can be used to determine the semantic similarity of both isolated words and words in context. The approach is completely modular, and can integrate any clustering method with any traditional vector-space model.

We present experimental comparisons to human judgements of semantic similarity for both isolated words and words in sentential context. The results demonstrate the superiority of a clustered approach over both traditional prototype and exemplar-based vector-space models. For example, given the isolated target word singer, our method produces the most similar word vocalist, while using a single prototype gives musician. Given the word cell in the context “The book was published while Piasecki was still in prison, and a copy was delivered to his cell,” the standard approach produces protein while our method yields incarcerated.

The remainder of the paper is organized as follows: Section 2 gives relevant background on prototype and exemplar methods for lexical semantics, Section 3 presents our multi-prototype method, Section 4 presents our experimental evaluations, Section 5 discusses future work, and Section 6 concludes.

[Figure 1: Overview of the multi-prototype approach to near-synonym discovery for a single target word independent of context: contexts are collected, occurrences are clustered, and cluster centroids are used as prototype vectors. The figure shows four discovered clusters of contexts for the word position, with associated near-synonym lists: (1) location, importance, bombing; (2) post, appointment, role, job; (3) intensity, winds, hour, gust; (4) lineman, tackle, role, scorer. Note the “hurricane” sense of position (cluster 3) is not typically considered appropriate in WSD.]

2 Background

Psychological concept models can be roughly divided into two classes:

1. Prototype models represent concepts by an abstract prototypical instance, similar to a cluster centroid in parametric density estimation.

2. Exemplar models represent concepts by a concrete set of observed instances, similar to non-parametric approaches to density estimation in statistics (Ashby and Alfonso-Reese, 1995).

Such models have been widely studied in the psychology literature (Griffiths et al., 2007; Love et al., 2004; Rosseel, 2002). Tversky and Gati (1982) famously showed that conceptual similarity violates the triangle inequality, lending evidence for exemplar-based models in psychology.
Exemplar models have previously been used for lexical semantics problems such as selectional preference (Erk, 2007) and thematic fit (Vandekerckhove et al., 2009). Individual exemplars can be quite noisy, and the model can incur high computational overhead at prediction time, since naively computing the similarity between two words using each occurrence in a textual corpus as an exemplar requires O(n²) comparisons. Instead, the standard approach is to compute a single prototype vector for each word from its occurrences.

This paper presents a multi-prototype vector-space model for lexical semantics with a single parameter K (the number of clusters) that generalizes both prototype (K = 1) and exemplar (K = N, the total number of instances) methods. By employing multiple prototypes per word, vector-space models can account for homonymy, polysemy, and thematic variation in word usage. Furthermore, such approaches require only O(K²) comparisons for computing similarity, yielding potential computational savings over the exemplar approach when K ≪ N, while reaping many of the same benefits.

Previous work on lexical semantic relatedness has focused on two approaches: (1) mining monolingual or bilingual dictionaries or other pre-existing resources to construct networks of related words (Agirre and Edmond, 2006; Ramage et al., 2009), and (2) using the distributional hypothesis to automatically infer a vector-space prototype of word meaning from large corpora (Agirre et al., 2009; Curran, 2004; Harris, 1954). The former approach tends to have greater precision, but depends on hand-crafted dictionaries and cannot, in general, model sense frequency (Budanitsky and Hirst, 2006). The latter approach is fundamentally more scalable, as it does not rely on specific resources and can model corpus-specific sense distributions. However, the distributional approach can suffer from poor precision, as thematically similar words (e.g., singer and actor) and antonyms often occur in similar contexts (Lin et al., 2003).

Unsupervised word-sense discovery has been studied by a number of researchers (Agirre and Edmond, 2006; Schütze, 1998). Most work has focused on corpus-based distributional approaches, varying the vector-space representation, e.g. by incorporating syntactic and co-occurrence information from the words surrounding the target term (Pereira et al., 1993; Pantel and Lin, 2002).

3 Multi-Prototype Vector-Space Models

Our approach is similar to standard vector-space models of word meaning, with the addition of a per-word-type clustering step: occurrences of a specific word type are collected from the corpus and clustered using any appropriate method (§3.1). Similarity between two word types is then computed as a function of their cluster centroids (§3.2), instead of the centroid of all the word’s occurrences. Figure 1 gives an overview of this process.

3.1 Clustering Occurrences

Multiple prototypes for each word w are generated by clustering feature vectors v(c) derived from each occurrence c ∈ C(w) in a large textual corpus and collecting the resulting cluster centroids π_k(w), k ∈ [1, K]. This approach is commonly employed in unsupervised word sense discovery; however, we do not assume that clusters correspond to traditional word senses. Rather, we rely on clusters only to capture meaningful variation in word usage.

Our experiments employ a mixture of von Mises-Fisher distributions (movMF) clustering method with first-order unigram contexts (Banerjee et al., 2005). Feature vectors v(c) are composed of individual features I(c, f), taken as all unigrams f ∈ F occurring in a 10-word window around w. Like spherical k-means (Dhillon and Modha, 2001), movMF models semantic relatedness using cosine similarity, a standard measure of textual similarity. However, movMF introduces an additional per-cluster concentration parameter controlling its semantic breadth, allowing it to more accurately model non-uniformities in the distribution of cluster sizes. Based on preliminary experiments comparing various clustering methods, we found that movMF gave the best results.
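To make the clustering step concrete, the following minimal Python sketch builds unigram-window context vectors and clusters them. It uses spherical k-means as a stand-in for movMF (the two differ in movMF's per-cluster concentration parameter, as noted above), and all function and variable names are our own illustrative choices rather than code from the paper:

    import numpy as np

    def context_vector(tokens, position, vocab_index, window=5):
        # Bag-of-unigrams vector for one occurrence: counts of all words in
        # a 10-word window (5 on either side) of the target, per Section 3.1.
        v = np.zeros(len(vocab_index))
        lo, hi = max(0, position - window), position + window + 1
        for i, tok in enumerate(tokens[lo:hi], start=lo):
            if i != position and tok in vocab_index:
                v[vocab_index[tok]] += 1.0
        return v

    def spherical_kmeans(X, k, iters=25, seed=0):
        # Cluster unit-normalized rows of X by cosine similarity; the
        # centroids serve as the prototypes pi_1(w), ..., pi_K(w).
        rng = np.random.default_rng(seed)
        X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            assign = (X @ centroids.T).argmax(axis=1)  # nearest prototype
            for j in range(k):
                members = X[assign == j]
                if len(members):
                    c = members.sum(axis=0)
                    centroids[j] = c / np.linalg.norm(c)
        return centroids

In practice the raw counts would be reweighted (with tf-idf or χ²) before clustering, as described in §3.2 below.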
3.2 Measuring Semantic Similarity

The similarity between two words in a multi-prototype model can be computed straightforwardly, requiring only simple modifications to standard distributional similarity methods such as those presented by Curran (2004). Given words w and w′, we define two noncontextual clustered similarity metrics to measure the similarity of isolated words:

    \mathrm{AvgSim}(w, w') \stackrel{\mathrm{def}}{=} \frac{1}{K^2} \sum_{j=1}^{K} \sum_{k=1}^{K} d(\pi_k(w), \pi_j(w'))

    \mathrm{MaxSim}(w, w') \stackrel{\mathrm{def}}{=} \max_{1 \le j \le K,\, 1 \le k \le K} d(\pi_k(w), \pi_j(w'))

where d(·, ·) is a standard distributional similarity measure. In AvgSim, word similarity is computed as the average similarity of all pairs of prototype vectors; in MaxSim the similarity is the maximum over all pairwise prototype similarities. All results reported in this paper use cosine similarity,

    \mathrm{Cos}(w, w') = \frac{\sum_{f \in F} I(w, f) \cdot I(w', f)}{\sqrt{\sum_{f \in F} I(w, f)^2}\, \sqrt{\sum_{f \in F} I(w', f)^2}}

although the main results also hold for weighted Jaccard similarity. We compare across two different feature functions, tf-idf weighting and χ² weighting, chosen due to their ubiquity in the literature (Agirre et al., 2009; Curran, 2004).

In AvgSim, all prototype pairs contribute equally to the similarity computation; thus two words are judged as similar if many of their senses are similar. MaxSim, on the other hand, only requires a single pair of prototypes to be close for the words to be judged similar. Thus, MaxSim models the similarity of words that share only a single sense (e.g. bat and club), at the cost of lower robustness to noisy clusters that might be introduced when K is large.
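The two metrics translate directly into code. A sketch, reusing numpy and the prototype arrays from the clustering sketch above (the helper names are ours; cosine plays the role of the base measure d):

    def cos_sim(a, b):
        # Cosine similarity, the base distributional measure d in the paper.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def avg_sim(protos_w, protos_v):
        # AvgSim: mean similarity over all K x K prototype pairs.
        return float(np.mean([cos_sim(p, q) for p in protos_w for q in protos_v]))

    def max_sim(protos_w, protos_v):
        # MaxSim: similarity of the single closest pair of prototypes.
        return max(cos_sim(p, q) for p in protos_w for q in protos_v)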
When contextual information is available, AvgSim and MaxSim can be modified to produce more precise similarity computations:

    \mathrm{AvgSimC}(w, w') \stackrel{\mathrm{def}}{=} \frac{1}{K^2} \sum_{j=1}^{K} \sum_{k=1}^{K} d_{c,w,k}\, d_{c',w',j}\, d(\pi_k(w), \pi_j(w'))

    \mathrm{MaxSimC}(w, w') \stackrel{\mathrm{def}}{=} d(\hat{\pi}(w), \hat{\pi}(w'))

where d_{c,w,k} = d(v(c), π_k(w)) is the likelihood of context c belonging to cluster π_k(w), and π̂(w) = π_{argmax_{1 ≤ k ≤ K} d_{c,w,k}}(w), the maximum likelihood cluster for w in context c. Thus, AvgSimC corresponds to soft cluster assignment, weighting each similarity term in AvgSim by the likelihood of the word contexts appearing in their respective clusters. MaxSimC corresponds to hard assignment, using only the most probable cluster assignment. Note that AvgSim and MaxSim can be thought of as special cases of AvgSimC and MaxSimC with uniform weight given to each cluster; hence AvgSimC and MaxSimC can be used to compare words in context to isolated words as well.
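A sketch of the contextual variants follows, continuing the helpers above. The paper leaves d(v(c), π_k(w)) as a generic likelihood; normalizing clipped cosine scores to sum to one, as done here, is our simplification:

    def cluster_weights(ctx_vec, protos):
        # Approximates d_{c,w,k}: how well context c fits each prototype of w.
        sims = np.array([max(cos_sim(ctx_vec, p), 0.0) for p in protos])
        if sims.sum() == 0:
            return np.full(len(protos), 1.0 / len(protos))  # uniform fallback
        return sims / sims.sum()

    def avg_sim_c(ctx_w, protos_w, ctx_v, protos_v):
        # AvgSimC: soft assignment -- every prototype pair contributes,
        # weighted by how likely each word's context belongs to that cluster.
        dw = cluster_weights(ctx_w, protos_w)
        dv = cluster_weights(ctx_v, protos_v)
        return float(sum(dw[k] * dv[j] * cos_sim(protos_w[k], protos_v[j])
                         for k in range(len(protos_w))
                         for j in range(len(protos_v))))

    def max_sim_c(ctx_w, protos_w, ctx_v, protos_v):
        # MaxSimC: hard assignment -- compare only each word's most
        # probable prototype given its context.
        pw = protos_w[int(np.argmax([cos_sim(ctx_w, p) for p in protos_w]))]
        pv = protos_v[int(np.argmax([cos_sim(ctx_v, p) for p in protos_v]))]
        return cos_sim(pw, pv)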
4 Experimental Evaluation

4.1 Corpora

We employed two corpora to train our models:

1. A snapshot of English Wikipedia taken on Sept. 29th, 2009. Wikitext markup is removed, as are articles with fewer than 100 words, leaving 2.8M articles with a total of 2.05B words.

2. The third edition English Gigaword corpus, with articles containing fewer than 100 words removed, leaving 6.6M articles and 3.9B words (Graff, 2003).

Wikipedia covers a wider range of sense distributions, whereas Gigaword contains only newswire text and tends to employ fewer senses of most ambiguous words. Our method outperforms baseline methods even on Gigaword, indicating its advantages even when the corpus covers few senses.

4.2 Judging Semantic Similarity

To evaluate the quality of the various models, we first compared their lexical similarity measurements to human similarity judgements from the WordSim-353 data set (Finkelstein et al., 2001). This test corpus contains multiple human judgements on 353 word pairs, covering both monosemous and polysemous words, each rated on a 1–10 integer scale. Spearman’s rank correlation (ρ) with average human judgements (Agirre et al., 2009) was used to measure the quality of the various models.

Figure 2 plots Spearman’s ρ on WordSim-353 against the number of clusters (K) for the Wikipedia and Gigaword corpora, using pruned tf-idf and χ² features. [Footnote (feature pruning): We find that results using tf-idf features are extremely sensitive to feature pruning, while χ² features are more robust. In all experiments we prune tf-idf features by their overall weight, taking the top 5000; this setting was found to optimize the performance of the single-prototype approach.] In general, pruned tf-idf features yield higher correlation than χ² features. Using AvgSim, the multi-prototype approach (K > 1) yields higher correlation than the single-prototype approach (K = 1) across all corpora and feature types, achieving state-of-the-art results with pruned tf-idf features. This result is statistically significant in all cases for tf-idf, and for K ∈ [2, 10] on Wikipedia and K > 4 on Gigaword for χ² features. [Footnote: Significance is calculated using the large-sample approximation of the Spearman rank test (p < 0.05).] MaxSim yields similar performance when K < 10, but performance degrades as K increases.

It is possible to circumvent the model-selection problem (choosing the best value of K) by simply combining the prototypes from clusterings of different sizes. This approach represents words using both semantically broad and semantically tight prototypes, similar to hierarchical clustering. Table 1 and Figure 2 (squares) show the results of such a combined approach, where the prototypes for clusterings of size 2–5, 10, 20, 50, and 100 are unioned to form a single large prototype set. In general, this approach works about as well as picking the optimal value of K, even outperforming the single best cluster size for Wikipedia.

Finally, we also compared our method to a pure exemplar approach, averaging similarity across all occurrence pairs. [Footnote: Averaging across all pairs was found to yield higher correlation than averaging over only the most similar pairs.] Table 1 summarizes the results. The exemplar approach yields significantly higher correlation than the single-prototype approach in all cases except Gigaword with tf-idf features (p < 0.05). Furthermore, it performs significantly worse than the combined multi-prototype approach for tf-idf features, and does not differ significantly for χ² features. Overall, this result indicates that the multi-prototype approach performs at least as well as the exemplar approach in the worst case, and significantly outperforms it when using the best feature representation / corpus pair.
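For concreteness, here is a sketch of the combined setting and of the evaluation statistic, reusing the helpers above. The cluster sizes follow the paper; spearmanr from scipy is a standard implementation of the rank correlation, and the variable names in the comment are hypothetical:

    from scipy.stats import spearmanr

    def combined_prototypes(occurrence_vectors,
                            sizes=(2, 3, 4, 5, 10, 20, 50, 100)):
        # Union the prototypes from clusterings of several sizes,
        # sidestepping the choice of a single K (Section 4.2, "combined").
        return np.vstack([spherical_kmeans(occurrence_vectors, k)
                          for k in sizes])

    # Evaluation: rank-correlate model similarities against mean human
    # ratings over the 353 pairs (model_scores and human_means are
    # hypothetical lists of 353 floats each).
    # rho, p_value = spearmanr(model_scores, human_means)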
Table 1: Spearman correlation on the WordSim-353 dataset, broken down by corpus and feature type (K = 5, 20, 50 are multi-prototype with AvgSim).

                      prototype    exemplar     K=5          K=20         K=50         combined
    Wikipedia tf-idf  0.53±0.02    0.60±0.06    0.69±0.02    0.76±0.01    0.76±0.01    0.77±0.01
    Wikipedia χ²      0.54±0.03    0.65±0.07    0.58±0.02    0.56±0.02    0.52±0.03    0.59±0.04
    Gigaword tf-idf   0.49±0.02    0.48±0.10    0.64±0.02    0.61±0.02    0.61±0.02    0.62±0.02
    Gigaword χ²       0.25±0.03    0.41±0.14    0.32±0.03    0.35±0.03    0.33±0.03    0.34±0.03

Table 2: Words used in predicting near-synonyms.

    homonymous: carrier, crane, cell, company, issue, interest, match, media, nature, party, practice, plant, racket, recess, reservation, rock, space, value
    polysemous: cause, chance, journal, market, network, policy, power, production, series, trading, train

[Figure 2: WordSim-353 rank correlation vs. number of clusters (log scale) for the Wikipedia (left) and Gigaword (right) corpora. Horizontal bars show the performance of the single prototype; squares indicate performance when combining across clusterings. Error bars depict 95% confidence intervals using the Spearman test.]

4.3 Predicting Near-Synonyms

We next evaluated the multi-prototype approach on its ability to determine the most closely related words for a given target word (using the Wikipedia corpus with tf-idf features). The top k most similar words were computed for each prototype of each target word. Using a forced-choice setup, human subjects were asked to evaluate the quality of these near-synonyms relative to those produced by a single prototype. Participants on Amazon’s Mechanical Turk (http://mturk.com) (Snow et al., 2008) were asked to choose between two possible alternatives (one from a prototype model and one from a multi-prototype model) as being most similar to a given target word. The target words were presented either in isolation or in a sentential context randomly selected from the corpus. Table 2 lists the ambiguous words used for this task. They are grouped into homonyms (words with very distinct senses) and polysemes (words with related senses). All words were chosen such that their usages occur within the same part of speech.

In the non-contextual task, 79 unique raters completed 7,620 comparisons, of which 72 were discarded due to poor performance on a known test set. [Footnote (rater reliability): The reliability of Mechanical Turk raters is quite variable, so we computed an accuracy score for each rater by including a control question with a known correct answer in each HIT. Control questions were generated by selecting a random word from WordNet 3.0 and including as possible choices a word in the same synset (correct answer) and a word in a synset with a high path distance (incorrect answer). Raters who got less than 50% of these control questions correct, or who spent too little time on the HIT, were discarded.] In the contextual task, 127 raters completed 9,930 comparisons, of which 87 were discarded.
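The ranking step behind this evaluation is simple; a sketch under the assumption that a candidate vocabulary with precomputed prototype sets is available (all names here are illustrative, not from the paper):

    def top_near_synonyms(target_protos, candidate_protos, k=5, sim=max_sim):
        # Rank candidate words by similarity to the target's prototype set;
        # `candidate_protos` maps each word to its own prototype array, and
        # `sim` can be avg_sim or max_sim from Section 3.2.
        ranked = sorted(candidate_protos.items(),
                        key=lambda item: sim(target_protos, item[1]),
                        reverse=True)
        return [word for word, _ in ranked[:k]]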
[Figure 3: (left) Non-contextual near-synonym prediction: the fraction of raters preferring multi-prototype results vs. the number of clusters for isolated words. Colored squares indicate performance when combining across clusterings; 95% confidence intervals are computed using the Wald test. (right) Contextual near-synonym prediction: the same evaluation for words in a sentential context chosen either from the minority sense or the majority sense.]

For the non-contextual case, Figure 3 (left) plots the fraction of raters preferring the multi-prototype prediction (using AvgSim) over that of a single prototype as the number of clusters is varied. When asked to choose between the single best word for each method (top word), the multi-prototype prediction is chosen significantly more frequently (i.e., the result is above 0.5) when the number of clusters is small, but the two methods perform similarly for larger numbers of clusters (Wald test, α = 0.05). Clustering more accurately identifies homonyms’ clearly distinct senses and produces prototypes that better capture the different uses of these words. As a result, compared to using a single prototype, our approach produces better near-synonyms for homonyms than for polysemes. However, given the right number of clusters, it also produces better results for polysemous words.

The near-synonym prediction task highlights one of the weaknesses of the multi-prototype approach: as the number of clusters increases, the number of occurrences assigned to each cluster decreases, increasing noise and resulting in some poor prototypes that mainly cover outliers. The word similarity task is somewhat robust to this phenomenon, but synonym prediction is more affected, since only the top predicted choice is used. When raters are forced to choose between the top three predictions for each method (presented as top set in Figure 3, left), the effect of this noise is reduced and the multi-prototype approach remains dominant even for a large number of clusters. This indicates that although more clusters can capture finer-grained sense distinctions, they can also introduce noise.

When presented with words in context (Figure 3, right), raters found no significant difference between the two methods for words used in their majority sense. [Footnote: Results for the multi-prototype method are generated using AvgSimC (soft assignment), as this was found to significantly outperform MaxSimC.] [Footnote: Sense frequency was determined using Google; senses were labeled manually by trained human evaluators.] However, when a minority sense is presented (e.g. the “prison” sense of cell), raters prefer the choice predicted by the multi-prototype approach.
This result is to be expected, since the single prototype mainly reflects the majority sense, preventing it from predicting appropriate synonyms for a minority sense. Also, once again, the performance of the multi-prototype approach is better for homonyms than for polysemes.

4.4 Predicting Variation in Human Ratings

Variance in pairwise prototype distances can help explain the variance in human similarity judgements for a given word pair. We evaluate this hypothesis empirically on WordSim-353 by computing the Spearman correlation between the variance of the per-cluster similarity computations, V[D] with D = {d(π_k(w), π_j(w′)) : 1 ≤ k, j ≤ K}, and the variance of the human annotations for that pair. Correlations for each dataset are shown in Figure 4 (left). In general, we find a statistically significant negative correlation between these values using χ² features, indicating that as the entropy of the pairwise cluster similarities increases (i.e., prototypes become more similar and similarities become uniform), rater disagreement increases. This result is intuitive: if the occurrences of a particular word cannot be easily separated into coherent clusters (perhaps indicating high polysemy instead of homonymy), then human judgement will naturally be more difficult.

[Figure 4: Plots of variance correlation; lower numbers indicate higher negative correlation, i.e., that prototype entropy predicts rater disagreement.]

Rater variance depends more directly on the actual word similarity: word pairs at the extreme ranges of similarity have significantly lower variance, as raters are more certain. By removing word pairs with similarity judgements in the middle two quartile ranges (4.4 to 7.5), we find significantly higher variance correlation (Figure 4, right). This result indicates that multi-prototype similarity variance accounts for a secondary effect, separate from the primary effect that variance is naturally lower for ratings in extreme ranges.

Although the entropy of the prototypes correlates with the variance of the human ratings, we find that the individual senses captured by each prototype do not correspond to human intuition for a given word, e.g. the “hurricane” sense of position in Figure 1. This notion is evaluated empirically by computing the correlation between the predicted similarity using the contextual multi-prototype method and human similarity judgements for different usages of the same word. The Usage Similarity (USim) data set collected by Erk et al. (2009) provides such similarity scores from human raters. However, we find no evidence of correlation between USim scores and their corresponding prototype similarity scores (ρ = 0.04), indicating that prototype vectors may not correspond well to human senses.
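The quantity V[D] is straightforward to compute from the prototype sets; a sketch, again reusing the helpers above (the rating-variance pairing in the comment uses hypothetical variable names):

    def prototype_similarity_variance(protos_w, protos_v):
        # V[D] for D = {d(pi_k(w), pi_j(w')) : 1 <= k, j <= K}, Section 4.4.
        sims = [cos_sim(p, q) for p in protos_w for q in protos_v]
        return float(np.var(sims))

    # Correlate with rater disagreement over the WordSim-353 pairs:
    # rho, p_value = spearmanr(
    #     [prototype_similarity_variance(pw, pv)
    #      for pw, pv in word_pair_prototypes],
    #     human_rating_variances)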
5 Discussion and Future Work

Table 3 compares the inferred synonyms for several target words, generally demonstrating the ability of the multi-prototype model to improve the precision of inferred near-synonyms (e.g. in the case of singer or need), as well as its ability to include synonyms from less frequent senses (e.g., the experiment sense of research or the verify sense of prove). However, there are a number of ways it could be improved:

Feature representations: Multiple prototypes improve Spearman correlation on WordSim-353 compared to previous methods using the same underlying representation (Agirre et al., 2009). However, we have not yet evaluated its performance when using more powerful feature representations, such as those based on Latent or Explicit Semantic Analysis (Deerwester et al., 1990; Gabrilovich and Markovitch, 2007). Due to its modularity, the multi-prototype approach can easily incorporate such advances in order to further improve its effectiveness.
Table 3: Examples of the top 5 inferred near-synonyms (“Inferred Thesaurus”) using the single- and multi-prototype approaches (multi-prototype results are merged across prototypes). In general, such clustering improves the precision and coverage of the inferred near-synonyms.

    bass      single: guitar, drums, rhythm, piano, acoustic
              multi:  basses, contrabass, rhythm, guitar, drums
    claim     single: argue, say, believe, assert, contend
              multi:  assert, contend, allege, argue, insist
    hold      single: carry, take, receive, reach, maintain
              multi:  carry, maintain, receive, accept, reach
    maintain  single: ensure, establish, achieve, improve, promote
              multi:  preserve, ensure, establish, retain, restore
    prove     single: demonstrate, reveal, ensure, confirm, say
              multi:  demonstrate, verify, confirm, reveal, admit
    research  single: studies, work, study, training, development
              multi:  studies, experiments, study, investigations, training
    singer    single: musician, actress, actor, guitarist, composer
              multi:  vocalist, guitarist, musician, singer-songwriter, singers

Nonparametric clustering: The success of the combined approach indicates that the optimal number of clusters may vary per word. A more principled approach to selecting the number of prototypes per word is to employ a clustering model with infinite capacity, e.g. the Dirichlet Process Mixture Model (Rasmussen, 2000). Such a model would naturally allow more polysemous words to adopt more flexible representations.

Cluster similarity metrics: Besides AvgSim and MaxSim, there are many similarity metrics over mixture models, e.g. KL-divergence, which may correlate better with human similarity judgements.

Comparing to traditional senses: Compared to WordNet, our best-performing clusterings are significantly more fine-grained. Furthermore, they often do not correspond to agreed-upon semantic distinctions (e.g., the “hurricane” sense of position in Fig. 1). We posit that the finer-grained senses actually capture useful aspects of word meaning, leading to better correlation with WordSim-353. However, it would be good to compare prototypes learned from supervised sense inventories to prototypes produced by automatic clustering.

Joint model: The current method clusters the contexts of each word independently, so the senses discovered for w cannot influence the senses discovered for w′ ≠ w. Sharing statistical strength across similar words could yield better results for rarer words.

6 Conclusions

We presented a resource-light model for vector-space word meaning that represents words as collections of prototype vectors, naturally accounting for lexical ambiguity. The multi-prototype approach uses word sense discovery to partition a word’s contexts and construct “sense specific” prototypes for each cluster. Doing so significantly increases the accuracy of lexical-similarity computation, as demonstrated by improved correlation with human similarity judgements and the generation of better near-synonyms according to human evaluators. Furthermore, we show that, although performance is sensitive to the number of prototypes, combining prototypes across a large range of clusterings performs nearly as well as the ex-post best clustering. Finally, variance in the prototype similarities is found to correlate with inter-annotator disagreement, suggesting psychological plausibility.

Acknowledgements

We would like to thank Katrin Erk for helpful discussions and for making the USim data set available. This work was supported by an NSF Graduate Research Fellowship and a Google Research Award. Experiments were run on the Mastodon Cluster, provided by NSF Grant EIA-0303609.

References

Eneko Agirre and Phillip Edmond. 2006. Word Sense Disambiguation: Algorithms and Applications (Text, Speech and Language Technology). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proc. of NAACL-HLT-09, pages 19–27.
F. Gregory Ashby and Leola A. Alfonso-Reese. 1995. Categorization as probability density estimation. J. Math. Psychol., 39(2):216–233.

L. Douglas Baker and Andrew K. McCallum. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96–103.

Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, and Suvrit Sra. 2005. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345–1382.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

James R. Curran and Marc Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 59–66.

James R. Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, University of Edinburgh, College of Science.

Scott C. Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

Inderjit S. Dhillon and Dharmendra S. Modha. 2001. Concept decompositions for large sparse text data using clustering. Machine Learning, 42:143–175.

Katrin Erk. 2007. A simple, similarity-based model for selectional preferences. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proc. of ACL-09.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept revisited. In Proc. of WWW-01, pages 406–414, New York, NY, USA. ACM.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of IJCAI-07, pages 1606–1611.

David Graff. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Tom L. Griffiths, Kevin R. Canini, Adam N. Sanborn, and Daniel J. Navarro. 2007. Unifying rational models of categorization via the hierarchical Dirichlet process. In Proc. of CogSci-07.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Lillian Lee. 1999. Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pages 25–32.

Dekang Lin and Patrick Pantel. 2002. Concept discovery from text. In Proc. of COLING-02, pages 1–7.

Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming Zhou. 2003. Identifying synonyms among distributionally similar words. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1492–1493. Morgan Kaufmann.

Bradley C. Love, Douglas L. Medin, and Todd M. Gureckis. 2004. SUSTAIN: A network model of category learning. Psych. Review, 111(2):309–332.

Will Lowe. 2001. Towards a theory of semantic space. In Proceedings of the 23rd Annual Meeting of the Cognitive Science Society, pages 576–581.

Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In Proc. of SIGKDD-02, pages 613–619, New York, NY, USA. ACM.

Fernando C. N. Pereira, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL-93), pages 183–190, Columbus, Ohio.

Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random walks for text semantic similarity. In Proc. of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 23–31.

Carl E. Rasmussen. 2000. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, pages 554–560. MIT Press.

Yves Rosseel. 2002. Mixture models of categorization. J. Math. Psychol., 46(2):178–210.

Mark Sanderson. 1994. Word sense disambiguation and information retrieval. In Proc. of SIGIR-94, pages 142–151.

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP-08.

Amos Tversky and Itamar Gati. 1982. Similarity, separability, and the triangle inequality. Psychological Review, 89(2):123–154.

Bram Vandekerckhove, Dominiek Sandra, and Walter Daelemans. 2009. A robust and extensible exemplar-based model of thematic fit. In Proc. of EACL 2009, pages 826–834. Association for Computational Linguistics.