Brighter than Gold: Figurative Language in User Generated Comparisons - Vlad Niculae
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Brighter than Gold: Figurative Language in User Generated Comparisons Vlad Niculae and Cristian Danescu-Niculescu-Mizil MPI-SWS Cornell University vniculae@mpi-sws.org, cristian@mpi-sws.org Abstract following two examples paraphrased from Ama- zon product reviews: Comparisons are common linguistic de- vices used to indicate the likeness of two (1) Sterling is much cheaper than gold. things. Often, this likeness is not meant (2) Her voice makes this song shine brighter than gold. in the literal sense—for example, “I slept In (1) the comparison draws on the relation be- like a log” does not imply that logs ac- tween the price property shared by the two metals, tually sleep. In this paper we propose a sterling and gold. While (2) also draws on a com- computational study of figurative compar- mon property (brightness), the polysemantic use isons, or similes. Our starting point is a (vocal timbre vs. light reflection) makes the com- new large dataset of comparisons extracted parison figurative. from product reviews and annotated for Importantly, there is no general rule separating figurativeness. We use this dataset to char- literal from figurative comparisons. More gen- acterize figurative language in naturally erally, the distinction between figurative and lit- occurring comparisons and reveal linguis- eral language is blurred and subjective (Hanks, tic patterns indicative of this phenomenon. 2006). Multiple criteria for delimiting the two We operationalize these insights and ap- have been proposed in the linguistic and philo- ply them to a new task with high relevance sophical literature—for a comprehensive review, to text understanding: distinguishing be- see Shutova (2010)—but they are not without ex- tween figurative and literal comparisons. ceptions, and are often hard to operationalize in a Finally, we apply this framework to ex- computational framework. When considering the plore the social context in which figurative specific case of comparisons, such criteria cannot language is produced, showing that simi- be directly applied. les are more likely to accompany opinions showing extreme sentiment, and that they Recently, the simile has received increasing at- are uncommon in reviews deemed helpful. tention from linguists and lexicographers (Moon, 2008; Moon, 2011; Hanks, 2013) as it became 1 Introduction clearer that similes need to be treated separately from metaphors since they operate on funda- In argument similes are like songs in love; they describe much, but prove nothing. mentally different principles (Bethlehem, 1996). — Franz Kafka Metaphors are linguistically simple structures hid- ing a complex mapping between two domains, Comparisons are fundamental linguistic devices through which many properties are transferred. that express the likeness of two things—be it en- For example the conceptual metaphor of life as tities, concepts or ideas. Given that their work- a journey can be instantiated in many particular ing principle is to emphasize the relation between ways: being at a fork in the road, reaching the end the shared properties of two arguments (Bredin, of the line (Lakoff and Johnson, 1980). In contrast, 1998), comparisons can synthesize important se- the semantic context of similes tends to be very mantic knowledge. shallow, transferring a single property (Hanks, Often, comparisons are not meant to be under- 2013). Their more explicit syntactic structure al- stood literally. Figurative comparisons are an im- lows, in exchange, for more lexical creativity. As portant figure of speech called simile. Consider the Hanks (2013) puts it, similes “tend to license all 2008 Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2008–2018, October 25-29, 2014, Doha, Qatar. c 2014 Association for Computational Linguistics
sorts of logical mayhem.” Moreover, the over- that figurative comparisons are more likely to ac- lap between the expressive range of similes and company reviews showing extreme sentiment, and metaphors is now known to be only partial: there that they are uncommon in opinions deemed as be- are similes that cannot be rephrased as metaphors, ing helpful. To the best of our knowledge, this is and the other way around (Israel et al., 2004). This the first time figurative language is tied to the so- suggests that figurativeness in similes should be cial context in which it appears. modeled differently than in metaphors. To further To summarize, the main contributions of this underline the necessity of a computational model work are as follows: for similes, we give the first estimate of their fre- quency in the wild: over 30% of comparisons are • it introduces the first large dataset of compar- figurative.1 We also confirm that a state of the art isons with figurativeness annotations (Sec- metaphor detection system performs poorly when tion 3); applied directly to the task of detecting similes. In this work we propose a computational study • it unveils new linguistic patterns characteriz- of figurative language in comparisons. To this end, ing figurative comparisons (Section 4); we build the first large collection of naturally oc- • it introduces the task of distinguishing figura- curring comparisons with figurativeness annota- tive from literal comparisons (Section 5); tion, which we make publicly available. Using this resource we explore the linguistic patterns that • it establishes the relation between figurative characterize similes, and group them in two con- language and the social context in which it ceptually distinctive classes. The first class con- appears (Section 6). tains cues that are agnostic of the context in which the comparison appears (domain-agnostic cues). 2 Further Related Work For example, we find that the higher the seman- tic similarity between the two arguments, the less Corpus studies on figurative language in compar- likely it is for the comparison to be figurative—in isons are scarce, and none directly address the the examples above, sterling is semantically very distinction between figurative and literal compar- similar to gold, both being metals, but song and isons. Roncero et al. (2006) observed, by search- gold are semantically dissimilar. The second type ing the web for several stereotypical comparisons of cues are domain-specific, drawing on the in- (e.g., education is like a stairway), that similes tuition that the domain in which a comparison is are more likely to be accompanied by explana- used is a factor in determining its figurativeness. tions than equivalent metaphors (e.g., education We find, for instance, that the less specific a com- is a stairway). Related to figurativeness is irony, parison is to the domain in which it appears, the which Veale (2012a) finds to often be lexically more likely it is to be used in a figurative sense marked. By using a similar insight to filter out (e.g., in example (2), gold is very unexpected in ironic comparisons, and by assuming that the rest the musical domain). are literal, Veale and Hao (2008) learn stereotyp- We successfully exploit these insights in a new ical knowledge about the world from frequently prediction task relevant to text understanding: dis- compared terms. A similar process has been ap- criminating figurative comparisons from literal plied to both English and Chinese by Li et al. ones. Encouraged by the high accuracy of our (2012), thereby encouraging the idea that the trope system—which is within 10% of that obtained by behaves similarly in different languages. A related human annotators—we automatically extend the system is the Jigsaw Bard (Veale and Hao, 2011), figurativeness labels to 80,000 comparisons occur- a thesaurus driven by figurative conventional sim- ring in product reviews. This enables us to conduct iles extracted from the Google N-grams. This sys- a fine-grained analysis of how comparison usage tem aims to build and generate canned expressions interacts with their social context, opening up a by using items frequently associated with the sim- research direction with applications in sentiment ile pattern above. An extension of the principles of analysis and opinion mining. In particular we find the Jigsaw Bard is found in Thesaurus Rex (Veale 1 and Li, 2013), a data-driven partition of words into This estimate is based on the set of noun-noun compar- isons with non-identical arguments collected for this study ad-hoc categories. Thesaurus Rex is constructed from Amazon.com product reviews. using simple comparison and hypernym patterns 2009
and is able to provide weighted lists of categories • the V EHICLE: it is the object of the compari- for given words. son and is also usually a noun phrase; In text understanding systems, literal compar- isons are used to detect analogies between related • the shared P ROPERTY or ground: it expresses geographical places (Lofi et al., 2014). Tandon et what the two entities have in common—it can al. (2014) use relative comparative patterns (e.g., be explicit but is often implicit, left for the X is heavier than Y) to enrich a common-sense reader to infer; knowledge base. Jindal and Liu (2006) extract • the E VENT (eventuality or state): usually a graded comparisons from various sources, with verb, it sets the frame for the observation of the objective of mining consumer opinion about the common property; products. They note that identifying objective vs. subjective comparisons—related to literality—is • the C OMPARATOR: commonly a preposition an important future direction. Given that many (like) or part of an adjectival phrase (better comparisons are figurative, a system that discrim- than), it is the trigger word or phrase that inates literal from figurative comparisons is essen- marks the presence of a comparison. tial for such text understanding and information retrieval systems. The literal example (1) would be segmented as: The vast majority of previous work on figu- [Sterling /T OPIC] [is /E VENT] much [cheaper rative language focused on metaphor detection. /P ROPERTY] [than /C OMPARATOR] [gold /V E - Tsvetkov et al. (2014a) propose a cross-lingual HICLE ] system based on word-level conceptual features and they evaluate it on Subject-Verb-Object triples 3.2 Annotation and Adjective-Noun pairs. Their features include People resort to comparisons often when mak- and extend the idea of abstractness used by Turney ing descriptions, as they are a powerful way of et al. (2011) for Adjective-Noun metaphors. Hovy expressing properties by example. For this rea- et al. (2013) contribute an unrestricted metaphor son we collect a dataset of user-generated compar- corpus and propose a method based on tree ker- isons in Amazon product reviews (McAuley and nels. Bridging the gap between metaphor identifi- Leskovec, 2013), where users have to be descrip- cation and interpretation, Shutova and Sun (2013) tive and precise, but also to express personal opin- proposed an unsupervised system to learn source- ion. We supplement the data with a smaller set of target domain mappings. The system fits concep- comparisons from WaCky and WaCkypedia (Ba- tual metaphor theory (Lakoff and Johnson, 1980) roni et al., 2009) to cover more genres. In pre- well, at the cost of not being able to tackle figu- liminary work, we experimented with dependency rative language in general, and similes in particu- parse tree patterns for extracting comparisons and lar, as similes do not map entire domains to one labeling their parts (Niculae, 2013). We use the another. Since similes operate on fundamentally same approach, but with an improved set of pat- different principles than metaphors, our work pro- terns, to extract comparisons with the C OMPARA - poses a computational approach tailored specifi- TORS like, as and than.2 We keep only the matches cally for comparisons. where the T OPIC and the V EHICLE are nouns, and the P ROPERTY, if present, is an adjective, which 3 Background and Data is the typical case. Also, the head words of the 3.1 Structure of a comparison constituents are constrained to occur in the distri- butional resources used (Baroni and Lenci, 2010; Unlike metaphors, which are generally unre- Faruqui and Dyer, 2014).3 stricted, comparisons are more structured but also more lexically and semantically varied. This en- 2 We process the review corpus with part-of-speech tag- ables a more structured computational representa- ging using the IRC model for TweetNLP (Owoputi et al., 2013; Forsyth and Martell, 2007) and dependency parsing tion of which we take advantage. The constituents using the TurboParser standard model (Martins et al., 2010). of a comparison according to Hanks (2012) are: 3 Due to the strong tendency of comparisons with the same T OPIC and V EHICLE to be trivially literal in the WaCky examples, we filtered out such examples from the Amazon • the T OPIC, sometimes called tenor: it is usu- product reviews. We also filtered proper nouns using a capi- ally a noun phrase and acts as logical subject; talization heuristic. 2010
We proceed to validate and annotate for figu- Domain fig. lit. % fig. rativeness a random sample of the comparisons Books 177 313 36% extracted using the automated process described Music 45 68 40% above. The annotation is performed using crowd- Electronics 23 105 18% sourcing on the Amazon Mechanical Turk plat- Jewelery 9 126 7% form, in two steps. First, the annotators are asked to determine whether a displayed sentence is in- WaCky 19 79 19% deed a comparison between the highlighted words (T OPIC and V EHICLE). Sentences qualified by Total 273 609 31% two out of three annotators as comparisons are used in the second round, where the task is to Table 1: Figurativeness annotation results. Only rate how metaphorical a comparison is. We use comparisons where all three annotators agree are a scale of 1 to 4 following Turney et al. (2011), considered. and then binarize to consider scores of 1–2 as lit- eral and 3–4 as figurative. Finally, in this work we them in our experiments. It is worth noting that only consider comparisons where all three annota- the existing corpora annotated for metaphor can- tors agree on this binary notion of figurativeness. not be directly used to study comparisons. For ex- For both tasks, we provide guidelines mostly in ample, in TroFi (Birke and Sarkar, 2006), a cor- the form of examples and intuition, motivated on pus of 6436 sentences annotated for figurative- one hand by the annotators not having specialized ness, we only find 42 noun-noun comparisons with knowledge, and on the other hand by the observa- sentence-level (thus noisy) figurativeness labels. tion that the literal-figurative distinction is subjec- tive. All annotators have the master worker qual- 4 Linguistic Insights ification, reside in the U.S. and completed a lin- We now proceed to exploring the linguistic pat- guistic background questionnaire that verifies their terns that discriminate figurative from literal com- experience with English. In both tasks, control parisons. We consider two broad classes of cues, sentences with confidently known labels are used which we discuss next. to filter low quality answers; in addition, we test annotators with a simple paraphrasing task shown 4.1 Domain-specific cues to be effective for eliciting and verifying linguis- Figurative language is often used for striking ef- tic attention (Munro et al., 2010). Both tasks fects, and comparisons are used to describe new seem relatively difficult for humans, with inter- things in terms of something given (Hanks, 2013). annotator agreement given by Fleiss’ k of 0.48 Since the norms that define what is surprising and for the comparison identification task and of 0.54 what is well-known vary across domains, we ex- for the figurativeness annotation after binarization. pect that such contextual information should play This is comparable to 0.57 reported by Hovy et al. an important role in figurative language detection. (2013) for general metaphor labeling. We show This is a previously unexplored dimension of figu- some statistics about the collected data in Table 1. rative language, and Amazon product reviews of- Overall, this is a costly process: out of 2400 auto- fer a convenient testbed for this intuition since cat- matically extracted comparison candidates, about egory information is provided. 60% were deemed by the annotators to be actual comparisons and only 12% end up being selected Specificity To estimate whether a compari- confidently enough as figurative comparisons. son can be considered striking in a particular Our dataset of human-filtered comparisons, domain—whether it references images or ideas with the scores given by the three annotators, that are unexpected in its context—we employ a is made publicly available to encourage further simple measure of word specificity with respect to work.4 This also includes about 400 comparisons a domain: the ratio of the word frequency within where the annotators do not agree perfectly on bi- the domain and the word frequency in all domains nary figurativeness. Such cases can be interest- being considered.5 It should be noted that speci- ing to other analyses, even if we don’t consider ficity is not purely a function of the word, but 5 We measure specificity for the V EHICLE, P ROPERTY 4 http://vene.ro/figurative-comparisons/ and E VENT. 2011
1.4 1.2 1.4 1.2 1.0 1.2 1.0 1.0 0.8 0.8 imageability specificity 0.8 similarity 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0.0 0.0 0.2 0.0 0.2 0.4 0.2 0.4 figurative literal figurative literal figurative literal (a) V EHICLE specificity. (b) T OPIC-V EHICLE similarity. (c) Imageability of the P ROPERTY. Figure 1: Distribution of some of the features we use, across literal and figurative comparisons in the test set. The profile of the plot is a kernel density estimation of the distribution, and the markers indicate the median and the first and third quartiles. of the word and the context in which it appears. a Z-test comparing all Amazon categories. With A comparison in the music domain that involves the exception of books and music reviews, that melodies is not surprising: have similar ratios, all other pairs of categories show significantly different proportions of figura- But the title song really feels like a pretty bland vocal melody [...] tive comparisons (p < 0.01). But the same word can play a very different role 4.2 Domain-agnostic cues in another context, for example, book reviews: Linguistic studies of figurative language suggest Her books are like sweet melodies that flow that there is a fundamental generic notion of fig- through your head. urativeness. We attempt to capture this notion in the context of comparisons using syntactic and se- Indeed, the word melody has a specificity of 96% mantic information. in the music domain and only of 3% in the books domain. Topic-Vehicle similarity The default role of lit- An analysis on the labeled data confirms that eral comparisons is to assert similarity of things. literal comparisons do indeed tend to have more Therefore, we expect that a high semantic simi- domain-specific V EHICLES (Mann-Whitney U larity between the T OPIC and the V EHICLE of a test, p < 0.01) than figurative ones. Further- comparison is a sign of literal usage, as we pre- more, the distribution of specificity across both viously hypothesized in preliminary work (Nicu- types of comparisons, as shown in Figure 1a, has lae, 2013). To test this hypothesis, we compute the appearance of a mixture model of general and T OPIC-V EHICLE similarity using Distributional specific words. Figurative comparison V EHICLES Memory (Baroni and Lenci, 2010), a freely avail- largely exhibit only the general component of the able distributional semantics resource that cap- mixture.6 tures word relationships through grammatical role co-occurrence. Domain label An analysis of the annotation re- By applying this measure to our data, we find sults reveals that the percentage of comparisons that there is indeed an important difference be- that are figurative differs widely across domains, tween the distributions of T OPIC-V EHICLE simi- as indicated in the last column in Table 1. This larity in figurative and literal comparisons (shown suggests that simply knowing the domain of a in Figure 1b); the means of the two distribu- text can serve to adjust some prior expectation tions are significantly different (Mann-Whitney about figurative language presence and therefore p < 0.01). improve detection. We test this hypothesis using 6 Metaphor-inspired features We also seek to The mass around 0.25 in Figure 1a is largely explained by generic words such as thing, others, nothing, average and understand to what extent insights provided by barely specific words like veil, reputation, dream, garbage. computational work on metaphor detection can be 2012
more concrete less concrete are more imageable (Mann-Whitney p < 0.01), as more imageable cinnamon, kiss devil, happiness illustrated by Figure 1c. This is in agreement with less imageable casque, pugilist aspect, however Hanks (2005), who observed that similes are char- acterized by their appeal to sensory imagination. Table 2: Examples of words with high and low concreteness and imageability scores from the Definiteness We introduce another simple but MRC Psycholinguistic Database. effective syntactic cue that relates to concreteness: the presence of a definite article versus an indefi- nite one (or none at all). We search for the indefi- applied in the context of comparisons. To that end nite articles a and an and the definite article the in we consider features shown to provide state of the each component of a comparison. art performance in the task of metaphor detection We find that similes tend to have indefinite arti- (Tsvetkov et al., 2014a): abstractness, imageabil- cles in the V EHICLE more often and definite arti- ity and supersenses. cles less often (Mann-Whitney p < 0.01). In par- Abstractness and imageability features are de- ticular, 59% of comparisons where the V EHICLE rived from the MRC Psycholinguistic Database has a indefinite article are figurative, as opposed (Coltheart, 1981), a dictionary based on manually to 13% of the comparisons where V EHICLE has a annotated datasets of psycholinguistic norms. Im- definite article. ageability is the property of a word to arouse a mental image, be it in the form of a mental pic- 5 Prediction Task ture, sound or any other sense. Concreteness is defined as “any word that refers to objects, materi- We now turn to the task of predicting whether a als or persons,” while abstractness, at the other end comparison is figurative or literal. Not only does of the spectrum, is represented by words that can- this task allow us to assess and compare the effi- not be usually experienced by the senses (Paivio ciency of the linguistic cues we discussed, but it is et al., 1968). Table 2 shows a few examples of also highly relevant in the context of natural lan- words with high and low concreteness and image- guage understanding systems. ability scores. Supersenses are a very coarse form We conduct a logistic regression analysis, and of meaning representation. Tsvetkov et al. (2014a) compare the efficiency of the features derived used WordNet (Miller, 1995) semantic classes from our analysis to a bag of words baseline. for nouns and verbs, for example noun.body, In addition to the features inspired by the pre- noun.animal, verb.consumption, or verb.motion. viously described linguistic insights, we also try For adjectives, Tsvetkov et al. (2014b) developed to computationally capture the lexical usage pat- and made available a novel classification in the terns of comparisons using a version of bag of same spirit.7 We compute abstractness, image- words adapted to the comparison structure. In this ability and supersenses for the T OPIC, V EHICLE, slotted bag of words system, features correspond E VENT, and P ROPERTY.8 We concatenate these to occurrence of words within constituents (e.g., features with the raw vector representations of the bright ∈ P ROPERTY). constituents, following Tsvetkov et al. (2014a). We perform a stratified split of our compari- We find that such features relate to figurative son dataset into equal train and test sets (each set comparisons in a meaningful way. For example, containing 408 comparisons, out of which 134 are out of all comparisons with explicit properties, fig- figurative),9 and use a 5-fold stratified cross vali- urative comparisons tend to have properties that dation over the training set to choose the optimal 7 value for the logistic regression regularization pa- Following Tsvetkov et al. (2014a) we train a classifier to predict these features from a vector space representation of a rameter and the type of regularization (ℓ1 or ℓ2 ) for word. We use the same cross-lingually optimized represen- each feature set.10 tation from Faruqui and Dyer (2014) and a simpler classifier, 9 a logistic regression, which we find to perform as well as the The entire analysis described in Section 4 is only con- random forests used in Tsvetkov et al. (2014a). We treat su- ducted on the training set. Also, in order to ensure that we are persense prediction as a multi-label problem and apply a one- assessing the performance of the classifier on unseen com- versus-all transformation, effectively learning a linear classi- parisons, we discard from our dataset all those with the same fier for each supersense. T OPIC and V EHICLE pair. 8 10 If the P ROPERTY is implicit, all corresponding features We use the logistic regression implementation are set to zero. An extra binary feature indicates whether the of liblinear (Fan et al., 2008) wrapped by the P ROPERTY is explicit or implicit. scikit-learn library (Pedregosa et al., 2011). 2013
Model # features Acc. P R F1 AUC Bag of words 1970 0.79 0.63 0.84 0.72 0.87 Slotted bag of words 1840 0.80 0.64 0.90 0.75 0.89 Domain-agnostic cues 357 0.81 0.70 0.74 0.72 0.90 only metaphor inspired 345 0.75 0.60 0.72 0.65 0.84 Domain-specific cues 8 0.69 0.51 0.81 0.63 0.76 All linguistic insight cues 365 0.86 0.76 0.83 0.79 0.92 Full 2202 0.88 0.80 0.84 0.82 0.94 Human - 0.96 0.92 0.96 0.94 - Table 3: Classification performance on the test set for the different sets of features we considered; human performance is shown for reference. Classifier performance The performance on the classification task is summarized in Table 3. 0.85 all linguistic insights We note that the bag of words baseline is remark- metaphor inspired ably strong, because of common idiomatic simi- 0.80 Accuracy les that can be captured through keywords. Our 0.75 full system (which relies on our linguistically in- spired cues discussed in Section 4 in addition to 0.70 slotted bag of words) significantly outperforms the bag of words baseline and the slotted bag of words 0.65 20 40 60 80 100 system in terms of accuracy, F1 score and AUC Percentage of training data used (p < 0.05),11 suggesting that linguistic insights complement idiomatic simile matching. Impor- tantly, a system using only our linguistic insight Figure 2: Learning curves. Each point is obtained cues also significantly improves over the baseline by fitting a model on 10 random subsets of the in terms of accuracy and AUC and it is not signif- training set. Error bars show 95% confidence in- icantly different from the full system in terms of tervals. performance, in spite of having about an order of magnitude fewer features. It is also worth noting Comparison to human performance To gauge that the domain-specific cues play an important how well humans would perform at the classifica- role in bringing the performance to this level by tion task on the actual test data, we perform an- capturing a different aspect of what it means for a other Amazon Mechanical Turk evaluation on 140 comparison to be figurative. examples from the test set. For the evaluation, The features used by the state of the art we use majority voting between the three anno- metaphor detection system of Tsvetkov et al. tators,12 and compare to the agreed labels in the (2014a), adapted to the comparison structure, per- dataset. Estimated human accuracy is 96%, plac- form poorly by themselves and do not improve ing our full system within 10% of human accuracy. significantly over the baseline. This is consis- tent with the theoretical motivation that figura- Feature analysis The predictive analysis we tiveness in comparisons requires special compu- perform allows us to investigate to what extent the tational treatment, as discussed in Section 1. Fur- features inspired by our linguistic insights have thermore, the linguistic insight features not only discriminative power, and whether they actually significantly outperform the metaphor inspired cover different aspects of figurativeness. features (p < 0.05), but are also better at exploit- 12 Majority voting helps account for the noise inherent to ing larger amounts of data, as shown in Figure 2. crowdsourced annotation, which is less accurate than profes- sional annotation. Taking the less optimistic invididual turker 11 All statistical significance results in this paragraph are answers, human performance is on the same level as our full obtained from 5000 bootstrap samples. system. 2014
Feature Coef. Example where the feature is positively activated T OPIC-V EHICLE similarity −11.3 the older man was wiser and stronger than the boy V EHICLE specificity −5.8 the cord is more durable than the adapter [Electronics] V EHICLE imageability 4.9 the explanations are as clear as mud V EHICLE communication supersense −4.6 the book reads like six short articles V EHICLE indefiniteness 4.0 his fame drew foreigners to him like a magnet life ∈ V EHICLE 7.1 the hero is truly larger than life: godlike, yet flawed picture ∈ V EHICLE −6.0 the necklace looks just like the picture other ∈ V EHICLE −5.9 this one is just as nice as the other others ∈ V EHICLE −5.5 some songs are more memorable than others crap ∈ V EHICLE 4.7 the headphones sounded like crap Table 4: Top 5 linguistic insight features (top) and slotted bag of words features (bottom) in the full model and their logistic regression coefficients. A positive coefficient means the feature indicates figurativeness. Table 4 shows the best linguistic insight and ture is specific to the review setting, as products slotted bag of words features selected by the full are accompanied by photos, and for certain kinds model. The strongest feature by far is the seman- of products, the resemblance of the product with tic similarity between the T OPIC and the V EHI - the image is an important factor for potential buy- CLE . By itself, this feature gets 70% accuracy and ers.13 The bag of words systems are furthermore 61% F1 score. able to learn idiomatic comparisons by identify- The rest of the top features involve mostly the ing common figurative V EHICLES such as life and V EHICLE. This suggests that the V EHICLE is the crap, corresponding to fixed expressions such as most informative element of a comparison when it larger than life. comes to figurativeness. Features involving other constituents also get selected, but with slightly Error analysis Many of the errors made by our lower weights, not making it to the top. full system involve indirect semantic mechanisms V EHICLE specificity is one of the strongest fea- such as metonymy. For example, the false pos- tures, with positive values indicating literal com- itive the typeface was larger than most books parisons. This confirms our intuition that domain really means larger than the typefaces found in information is important to discriminate figurative most books, but without the implicit expansion the from literal language. meaning can appear figurative. A similar kind of ellipsis makes the example a lot [of songs] are Of the adapted metaphor features, the noun even better than sugar be wrongly classified as communication supersense and the imageability literal. Another source of error is polysemy. Ex- of the V EHICLE make it to the top. Nouns with amples like the rejuvelac formula is about 10 times low communication rating occurring in the train- better than yogurt are misclassified because of the ing set include puddles, arrangements, carbohy- multiple meanings of the word formula, one being drates while nouns with high communication rat- closely related to yogurt and food, but the more ing include languages and subjects. common ones being general and abstract, suggest- Presence of an indefinite article in the V EHICLE ing figurativeness. is a strong indicator of figurativeness. By them- selves, the definiteness and indefiniteness features 6 Social Correlates perform quite well, attaining 78% accuracy and 67% F1 score. The advantage of studying comparisons situated The salient bag of words features correspond to in a social context is that we can understand how specific types of comparisons. The words other their usage interacts with internal and external hu- and others in the V EHICLE indicate comparisons man factors. An internal factor is the sentiment of between the same kind of arguments, for exam- 13 This feature is highly correlated with the domain: it ap- ple some songs are more memorable than others, pears 25 times in the training set, 24 of which in the jewelery and these are likely to be literal. The word pic- domain and once in book reviews. 2015
figurative comparison ratio figurative comparison ratio 0.38 0.50 0.36 0.45 0.34 0.40 0.32 0.35 0.30 0.30 1 2 3 4 5 0.0 0.2 0.4 0.6 0.8 1.0 stars helpfulness ratio (a) Figurative comparisons are more likely to be found in re- (b) Helpful reviews are less likely to contain figurative compar- views with strongly polarized sentiment. isons. Figure 3: Interaction between figurative language and social context aspects. Error bars show 95% confidence intervals. The dashed horizontal line marks the average proportion of figurative comparisons. In Figure 3b the average proportion is different because we only consider reviews rated by at least 10 readers. the user towards the reviewed product, indicated pears. We find that comparisons in helpful re- by the star rating of the review. An external factor views14 are less likely to be figurative. Figure 3b present in the data is how helpful the review is per- shows a near-constant high ratio of figurative com- ceived by other users. In this section we analyze parisons among unhelpful and average reviews; as how these factors interact with figurative language helpfulness increases, figurative comparisons be- in comparisons. come less frequent. We further validate that this To gain insight about fine grained interactions effect is not a confound of the distribution of help- with human factors at larger scale, we use our clas- fulness ratings across reviews of different polarity sifier to find over 80,000 figurative and literal com- by controlling for the star rating: given a fixed star parisons from the same four categories. The trends rating, the proportion of figurative comparisons is we reveal also hold significantly on the manually still lower in helpful (helpfulness over 50%) than annotated data. in unhelpful (helpfulness under 50%) reviews; this difference is significant (Mann-Whitney p < 0.01) Sentiment While it was previously noted that for all classes of ratings except one-star. The similes often transmit strong affect (Hanks, 2005; size of the manually annotated data does not al- Veale, 2012a; Veale, 2012b), the connection be- low for star rating stratification, but the overall dif- tween figurativeness and sentiment was never em- ference is statistically significant (Mann-Whitney pirically validated. The setting of product reviews p < 0.01). This result encourages further exper- is convenient for investigating this issue, since imentation to determine whether there is a causal the star ratings associated with the reviews can link between the use of figurative language in user be used as sentiment labels. We find that com- generated content and its external perception. parisons are indeed significantly more likely to be figurative when the users express strong opin- 7 Conclusions and Future Work ions, i.e., in one-star or five-star reviews (Mann- Whitney p < 0.02 on the manually annotated This work proposes a computational study of fig- data). Figure 3a shows how the proportion of fig- urative language in comparisons. Starting from urative comparisons varies with the polarity of the a new dataset of naturally occurring comparisons review. with figurativeness annotation (which we make publicly available) we explore linguistic patterns Helpfulness It is also interesting to understand that are indicative of similes. We show that these to what extent figurative language relates to the 14 In order to have reliable helpfulness scores, we only con- external perception of the content in which it ap- sider reviews that have been rated by at least by ten readers. 2016
insights can be successfully operationalized in a dataset. We are thankful to the anonymous review- new prediction task: distinguishing literal from ers, whose comments were like a breath of fresh figurative comparisons. Our system reaches ac- air. We acknowledge the help of the Amazon Me- curacy that is within 10% of human performance, chanical Turk annotators and of the MPI-SWS stu- and is outperforming a state of the art metaphor dents involved in pilot experiments. detection system, thus confirming the need for Vlad Niculae was supported in part by the a computational approach tailored specifically to European Commission, Education & Training, comparisons. While we take a data-driven ap- Erasmus Mundus: EMMC 2008-0083, Erasmus proach, our annotated dataset can be useful for Mundus Masters in NLP & HLT. more theoretical studies of the kinds of compar- isons and similes people use. We discover that domain knowledge is an im- References portant factor in identifying similes. This suggests Marco Baroni and Alessandro Lenci. 2010. Dis- that future work on automatic detection of figura- tributional memory: A general framework for corpus-based semantics. Computational Linguis- tive language should consider contextual parame- tics, 36(4):673–721. ters such as the topic and community where the content appears. Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: Furthermore, we are the first to tie figurative A collection of very large linguistically processed language to the social context in which it is pro- web-crawled corpora. Language Resources and duced and show its relation to internal and exter- Evaluation, 43(3):209–226. nal human factors such as opinion sentiment and Louise Shabat Bethlehem. 1996. Simile and figurative helpfulness. Future investigation into the causal language. Poetics Today, 17(2):203–240. effects of these interactions could lead to a better understanding of the role of figurative language in Julia Birke and Anoop Sarkar. 2006. A clustering ap- proach for nearly unsupervised recognition of non- persuasion and rhetorics. literal language. In Proceedings of EACL. In our work, we consider common noun T OP - ICS and V EHICLES and adjectival P ROPERTIES . Hugh Bredin. 1998. Comparisons and similes. Lin- gua, 105(1):67–78. This is the most typical case, but supporting other parts of speech—such as proper nouns, pronouns, Max Coltheart. 1981. The MRC psycholinguistic and adverbs—can make a difference in many ap- database. The Quarterly Journal of Experimental plications. Capturing compositional interaction Psychology, 33(4):497–505. between the parts of the comparison could lead to Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang- more flexible models that give less weight to the Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: V EHICLE. A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874. This study is also the first to estimate how prevalent similes are in the wild, and reports that Manaal Faruqui and Chris Dyer. 2014. Improving about one third of the comparisons we consider are vector space word representations using multilingual figurative. This is suggestive of the need to build correlation. In Proceedings of EACL. systems that can properly process figurative com- Eric N Forsyth and Craig H Martell. 2007. Lexical and parisons in order to correctly harness the semantic discourse analysis of online chat dialog. In Proceed- information encapsulated in comparisons. ings of ICSC. Patrick Hanks. 2005. Similes and sets: The English Acknowledgements preposition ‘like’. In R. Blatná and V. Petkevic, ed- itors, Languages and Linguistics: Festschrift for Fr. We would like to thank Yulia Tsvetkov for con- Cermak. Charles University, Prague. structive discussion about figurative language and Patrick Hanks. 2006. Metaphoricity is grad- about her and her co-authors’ work. We are grate- able. Trends in Linguistic Studies and Monographs, ful for the suggestions of Patrick Hanks, Con- 171:17. stantin Orăsan, Sylviane Cardey, Izabella Thomas, Patrick Hanks. 2012. The roles and structure of com- Ekaterina Shutova, Tony Veale, and Niket Tan- parisons, similes, and metaphors in natural language don. We extend our gratitude to Julian McAuley (an analogical system). Presented at the Stockholm for preparing and sharing the Amazon review Metaphor Festival. 2017
Patrick Hanks. 2013. Lexical Analysis: Norms and Allan Paivio, John C Yuille, and Stephen A Madigan. Exploitations. MIT Press. 1968. Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psy- Dirk Hovy, Shashank Srivastava, Sujay Kumar Jauhar, chology, 76(1p2):1. Mrinmaya Sachan, Kartik Goyal, Huiying Li, Whit- ney Sanders, and Eduard Hovy. 2013. Identifying Fabian Pedregosa, Gaël Varoquaux, Alexandre Gram- metaphorical word use with tree kernels. In Pro- fort, Vincent Michel, Bertrand Thirion, Olivier ceedings of the NAACL Workshop on Metaphors for Grisel, Mathieu Blondel, Peter Prettenhofer, Ron NLP. Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine M. Israel, J.R. Harding, and V. Tobin. 2004. On simile. Learning Research, 12:2825–2830. Language, Culture, and Mind. CSLI Publications. Carlos Roncero, John M Kennedy, and Ron Smyth. Nitin Jindal and Bing Liu. 2006. Identifying compara- 2006. Similes on the internet have explanations. tive sentences in text documents. In Proceedings of Psychonomic Bulletin & Review, 13(1):74–77. SIGIR. Ekaterina Shutova and Lin Sun. 2013. Unsupervised George Lakoff and Mark Johnson. 1980. Metaphors metaphor identification using hierarchical graph fac- we live by. University of Chicago Press. torization clustering. In Proceedings of NAACL- Bin Li, Jiajun Chen, and Yingjie Zhang. 2012. Web HLT. based collection and comparison of cognitive prop- Ekaterina Shutova. 2010. Models of metaphor in NLP. erties in English and Chinese. In Proceedings of the In Proceedings of ACL. Joint Workshop on Automatic Knowledge Base Con- struction and Web-scale Knowledge Extraction. Niket Tandon, Gerard de Melo, and Gerhard Weikum. 2014. Smarter than you think: Acquiring compara- Christoph Lofi, Christian Nieke, and Nigel Collier. tive commonsense knowledge from the web. In Pro- 2014. Discriminating rhetorical analogies in social ceedings of AAAI. media. In Proceedings of EACL. Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, André FT Martins, Noah A Smith, Eric P Xing, Pe- Eric Nyberg, and Chris Dyer. 2014a. Metaphor de- dro MQ Aguiar, and Mário AT Figueiredo. 2010. tection with cross-lingual model transfer. In Pro- Turbo parsers: Dependency parsing by approximate ceedings of ACL. variational inference. In Proceedings of EMNLP. Yulia Tsvetkov, Nathan Schneider, Dirk Hovy, Archna Julian McAuley and Jure Leskovec. 2013. Hidden fac- Bhatia, Manaal Faruqui, and Chris Dyer. 2014b. tors and hidden topics: Understanding rating dimen- Augmenting English adjective senses with super- sions with review text. In Proceedings of RecSys. senses. In Proceedings of LREC. George A Miller. 1995. WordNet: a lexical Peter D Turney, Yair Neuman, Dan Assaf, and Yohai database for English. Communications of the ACM, Cohen. 2011. Literal and metaphorical sense iden- 38(11):39–41. tification through concrete and abstract context. In Rosamund Moon. 2008. Conventionalized as-similes Proceedings of EMNLP. in English: A problem case. International Journal of Corpus Linguistics, 13(1):3–37. Tony Veale and Yanfen Hao. 2008. A context-sensitive framework for lexical ontologies. Knowledge Engi- Rosamund Moon. 2011. Simile and dissimilarity. neering Review, 23(1):101–115. Journal of Literary Semantics, 40(2):133–157. Tony Veale and Yanfen Hao. 2011. Exploiting ready- Robert Munro, Steven Bethard, Victor Kuperman, mades in linguistic creativity: A system demonstra- Vicky Tzuyin Lai, Robin Melnick, Christopher tion of the Jigsaw Bard. In Proceedings of ACL (Sys- Potts, Tyler Schnoebelen, and Harry Tily. 2010. tem Demonstrations). Crowdsourcing and language studies: The new gen- eration of linguistic data. In Proceedings of the Tony Veale and Guofu Li. 2013. Creating similarity: NAACL Workshop on Creating Speech and Lan- Lateral thinking for vertical similarity judgments. In guage Data with Amazon’s Mechanical Turk. Proceedings of ACL. Vlad Niculae. 2013. Comparison pattern matching and Tony Veale. 2012a. A computational exploration of creative simile recognition. In Proceedings of the creative similes. Metaphor in Use: Context, Cul- Joint Symposium on Semantic Processing. ture, and Communication, 38:329. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Tony Veale. 2012b. A context-sensitive, multi-faceted Kevin Gimpel, Nathan Schneider, and Noah A model of lexico-conceptual affect. In Proceedings Smith. 2013. Improved part-of-speech tagging for of ACL. online conversational text with word clusters. In Proceedings of NAACL-HLT. 2018
You can also read