Generating Metaphoric Paraphrases - Kevin Stowe, Leonardo Ribeiro, Iryna Gurevych
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Generating Metaphoric Paraphrases Kevin Stowe, Leonardo Ribeiro, Iryna Gurevych UKP Lab, Technical University of Darmstadt www.ukp.tu-darmstadt.de Abstract rely on lexical diversity and conceptual knowl- edge. The bulk of work in metaphor has gone This work describes the task of metaphoric to identifying metaphor expressions or generat- paraphrase generation, in which we are given a literal sentence and are charged with ing interpretations for them (Shutova, 2015; Veale et al., 2016). Whereas previous approaches focus arXiv:2002.12854v1 [cs.CL] 28 Feb 2020 generating a metaphoric paraphrase. We propose two different models for this task: a on classification, we instead focus on generation: lexical replacement baseline and a novel se- how can we create novel, interesting, and valid quence to sequence model, ’metaphor mask- metaphoric expressions? ing’, that generates free metaphoric para- This task has many possible applications, in- phrases. We use crowdsourcing to evalu- ate our results, as well as developing an cluding creative writing assistance, where users automatic metric for evaluating metaphoric can employ metaphor generation to develop more paraphrases. We show that while the lex- interesting, persuasive writing. Lakoff and John- ical replacement baseline is capable of pro- son (1980) suggest that not only can metaphors ducing accurate paraphrases, they often lack capture similarity between domains, they ac- metaphoricity, while our metaphor masking tually can generate the similarity, allowing us model excels in generating metaphoric sen- to view concepts in new ways; optimistically, tences while performing nearly as well with metaphor generation may allow us to discover regard to fluency and paraphrase quality. new metaphoric ideas to foster understanding and growth in scientific areas. This is particularly true 1 Introduction in the domain of education, where new metaphors Metaphors have long posed significant problems can be instructive both for teachers and students to researchers across a wide variety of fields. (Marshall, 1990). Metaphors are also critical for While humans seem capable of easily under- proper interaction between humans and computa- standing even complex metaphors, it remains tional agents: humans produce metaphors easily, difficult to devise a formal analysis that cap- and to have natural communication with compu- tures the depth and breadth of meanings pro- tational models will require them to be able to do duced by novel metaphors. We typically think of the same (Zhang, 2008; Wallington et al., 2011). metaphors within the Conceptual Metaphor frame- In contrast to previous work generating novel work (Lakoff and Johnson, 1980; Lakoff, 1993), metaphors (§2.2), we are the first to tackle in which metaphors are based in conceptual map- metaphor paraphrase generation, and we hope our pings between different domains: we have cog- work can function as a jumping-off point for this nitive concepts that can be used to represent and challenging and interesting task. This task is a understand other concepts, and these mappings particularly difficult task for a variety of reasons. can be expressed linguistically to form concrete First, metaphors have the potential to be enor- metaphoric expressions. mously creative, deviating greatly from "standard" While there are many different computational language, which means normal language models approaches to metaphoric language, the field re- may have difficulty in producing good metaphors. mains challenging and, in some areas, relatively Traditional paraphrasing systems attempt to keep unexplored. The variety of meanings captured the sentences relatively similar, while in fact we by creative metaphors pose numerous problems to need sentences that vary substantially, in order to natural language processing researchers, as they enforce metaphor production.
This leads also to significant problems: there with metaphoric counterparts. This yields coher- are countless possible metaphor paraphrases for ent utterances, but limits the flexibility of the out- any given utterance and there are numerous pos- put. Second, we will consider this a sequence to sible metaphoric mappings that can be evoked, sequence (seq2seq) problem, and employ a novel yielding slightly different semantic connotations. generation technique dubbed "metaphor masking" Consider the following example: to hide important words in the input during train- ing and evaluation, forcing the seq2seq model to 1. The company was losing money rapidly. learn the appropriate contexts for metaphoric and literal words. This also requires knowledge of the This sentence has numerable possible key words before paraphrasing, but allows for sub- metaphoric paraphrases, evoking many different stantially more flexibility in generation. metaphors: Our contribution is thus threefold: 2. The company was hemorrhaging money. • We formalize the task of metaphor genera- 3. The company’s finances were circling the tion, elucidating the datasets and experimen- drain. tal setup necessary. 4. The business fell off of a cliff. • We implement a lexical replacement-based 5. Profits collapsed. baseline, as well as a novel seq2seq architec- ture based on "metaphor masking". In 2 and 3, "money" is conceptualized as • We perform analysis of generated metaphors, blood and water respectively, and from concep- identifying strengths and weaknesses for tual metaphor theory we see that this evokes the each method. MONEY IS A LIQUID mapping. In 4, "finances" is conceptualized as a physical entity, and fur- 2 Related Work ther, one that can experience harm, perhaps evok- ing the ECONOMIC HARM IS PHYSICAL INJURY While our task is new, it bears similarity to a mapping. In 5, the company’s profits are con- variety of better known NLP benchmarks. In ceptualized as a building, evoking the frequent the metaphor community, most of the efforts are metaphor of social and economic constructs being focused on identification and interpretation of conceptualized as physical constructions, in this metaphors. We will instead focus on our two key case specifically FINANCES ARE BUILDINGS. components, paraphrasing and generation, as they Note that there is a seemingly endless variety of relate to metaphors. metaphoric expressions that can fairly consistently capture the same general meaning, with a wide va- 2.1 Literal Paraphrasing riety of lexical variation. This makes metaphoric Previous work investigates paraphrasing from paraphrases extremely difficult to evaluate auto- metaphoric utterances to literal ones with the goal matically: traditional metrics for generation (such of providing interpretations (Mao et al., 2018; as BLEU (Banerjee and Lavie, 2005) and ROUGE Shutova, 2010). Shutova et al. (2010) treats (Papineni et al., 2002)) rely heavily on word identification and interpretation jointly, and gener- overlap, which is actually counterproductive for ates literal paraphrases for metaphoric adjective- metaphoric paraphrasing: we would like our gen- noun phrases. Vector space models have also been erated phrases to have less word overlap, as inter- employed successfully for generating literal para- esting metaphors are likely to share little lexical phrases. Shutova et al. (2012) identify a set of overlap with the original inputs. For this reason candidate paraphrases based on context and word we rely on crowdsourcing, evaluating metaphoric- vectors, and then use a model of selectional pref- ity, fluency, and paraphrase quality. erences to pick the most literal paraphrase. They We approach the problem of metaphoric para- require no training data, and achieve promising re- phrase generation from a variety of backgrounds, sults for unsupervised literal paraphrasing. each with their own positives and negatives. First, Similarly, Mao et al. (2018) build a metaphor we will consider the problem one of lexical re- identification system using word vectors, and also placement, in which we identify the important use it to generate paraphrases for metaphoric sen- words in the literal utterance and replace them tences. This is done by replacing the verbs that are
identified as metaphoric with the most likely lit- phrasing work, Gagliano et al. (2016) build off of eral candidates. They use Word2Vec embeddings Word2Vec, using the generated vectors to identify (Mikolov et al., 2013) combined with WordNet to poetic relationships between words, developing a identify relations between literal and metaphoric vector-based interpretation of conceptual blends lexemes. This allows for replacement of rarer, (Fauconnier and Turner, 1996). They identify more metaphoric senses to concrete literal ones, "connector words" between concepts, allowing for but doesn’t provide a solution for transitioning the creation of linguistic metaphors that accurately from a literal sense to an appropriate metaphoric capture these conceptual metaphoric mappings. one. Thus their work is effective at metaphoric to More recently there have been efforts using literal paraphrasing, but functions only in this di- deep learning methods to generate metaphoric ex- rection; we will restructure their algorithm for the pressions more freely, using sequence-to-sequence metaphoric direction as a lexical baseline in §4.1. models. Most notable is Yu et al. (2019), who use neural models to generate metaphoric expres- 2.2 Metaphor Generation sions in an unsupervised manner. They identify source and target verbs automatically from cor- With regard to metaphor generation, most efforts pora, and use these to train a neural language have been to generate metaphors at the lexical or model. Our work is similar: they encode both lit- phrase level, using template- and heuristic-based eral and metaphoric pairs and produce metaphoric methods. Early work in computational metaphor outputs based on verbs, but their generation task is generation involves generating simple "A is like free. We are instead working on the more con- B" expressions, based on probabilistic relation- strained task of generating specific paraphrases ships between words (Abe et al., 2006; Terai and from literal utterances. Nakagawa, 2010). These methods are effective to This is the experimental paradigm we will be a degree, but lack the flexibility necessary to in- following: given a literal phrase, we generate stantiate natural language metaphors. a metaphoric paraphrase that should capture the Other early approaches to metaphor genera- same meaning. Unlike previous work, our meth- tion are rooted in knowledge bases. Hervas et ods are broadly applicable to free text: we are al. (2007) build a metaphor generation system not limited to paraphrasing individual words or by identifying metaphoric domains, building map- phrases, but rather use deep learning models for pings between the source and target, and replacing full natural language generation, which can then appropriate references with the built metaphors. freely create literal paraphrases. To our knowl- They show the difficulty of determining appropri- edge, our work is the first to attempt to explicitly ate target domains for metaphors in context. Oth- generate metaphoric paraphrases. ers use WordNet, building knowledge representa- tions through semantic information from defini- 3 Data tions (Veale and Hao, 2008). Other works seek to generate conceptual Our goal is to generate metaphoric paraphrases metaphors, rather than open linguistic expres- for given literal phrases. Data for this task is ex- sions. These approaches, designed to generate tremely sparse: there aren’t any large scale paral- conceptual metaphor mappings such as MONEY lel corpora containing literal and metaphoric para- IS A LIQUID , vary from WordNet- and selectional phrases. Most useful is that of the Mohammad et preference-based (Mason, 2004), clustering over al. (2016). Their dataset includes multiple parts; WordNet senses (Gandy et al., 2013), and us- importantly, it contains 171 metaphoric sentences ing proposition databases built from syntactic re- extracted from WordNet, with manually generated lations (Ovchinnikova et al., 2014). While this literal paraphrases. These are high quality annota- task is interesting and useful, particularly for do- tions, and we will use this dataset for evaluation. ing proper reasoning from metaphoric mappings, While originally built from the side of generating our goal is instead to generate natural linguistic literal paraphrases for metaphoric utterances, it is metaphors, rather than metaphoric mappings. easy enough to reverse the direction, using their Word embedding approaches have been popu- literal paraphrases as input and attempting to gen- lar and effective for lexical metaphor tasks. In erate metaphoric outputs. addition to Mao et al. and Shutova et al.’s para- Note that there are some discrepancies between
the original usage and our intended paraphrase usage. Notably, the dataset was originally built around verbs: the authors replaced the key verbs in each metaphoric sentence to yield a more literal output. This ignores cases where the metaphoric meaning of the sentence is captured by compo- nents other than the verb: 1. The painting seems to capture the essence of Spring. 2. These events could fracture the balance of power. 3. The new moon reflected back at itself from the lake’s surface. In these examples, the verb that was replaced Figure 1: Lexical Replacement Baseline to make a paraphrase is in bold, while the italic phrases could also be construed as metaphoric. In 4.1 Lexical Replacement Baseline particular, 3 is likely to be considered metaphoric regardless of the bolded verb, due to the poetic re- Metaphors often hinge on verbs. This intu- flexive construction "back at itself". This means ition has fueled many identification and interpre- that the resulting "literal" paraphrases contain lit- tation projects, including the inclusion of the verb- eral verbs, but the sentences themselves may still specific identification track of the metaphor detec- contain metaphors. This isn’t prevalent in the data tion shared task (Leong et al., 2018). We imple- and doesn’t impact the experiments, as we are only ment a lexical replacement baseline that takes the trying to generate more metaphoric output sen- literal verb and replaces it with a more metaphoric tences from more literal inputs, but it is impor- counterpart. This is based on the work of Mao et tant to be aware that our paraphrasing task differs al. (2018), who employ this strategy in the other somewhat from the design of the original dataset. direction: they take metaphoric sentences and re- The size of this dataset is small: 171 instances place the metaphoric verbs with literal ones. is not enough to train viable deep learning models, We implement this algorithm for metaphor gen- and large scale parallel corpora for this task don’t eration by reversing their candidate selection. For exist. For this reason, we will use methods that an overview of the process, see Figure 1. We be- are either unsupervised, or don’t rely on parallel gin with a literal sentence with a marked verb (a). data, and can be developed using non-parallel cor- (b) We use the WordNet sense hierarchy to find re- pora. The lexical replacement model is the former, lated words to the input word which will then be requiring no training data. The metaphor mask- "candidates" to replace it, but rather than search- ing seq2seq model uses external training data, but ing "up" the hierarchy for hypernyms, we search does not require the data to be parallel. We use a "down" the hierarchy for troponyms: more spe- masking procedure to generate artificial sentence cific verbs (in bold). We believe that in the lexical pairs for seq2seq training, allowing the model to replacement task, replacement with more specific be function using non-parallel datasets. verbs is likely to yield more metaphoric expres- sions, as these specific verbs require specific con- 4 Methods texts to be understood literally. When placed in an unfamiliar context, they adopt metaphoric mean- We propose two different models for metaphoric ings via a coercion-like process (Steedman and paraphrase generation. First, we implement a lex- Moens, 1988). (c) We follow their algorithm for ical replacement baseline, based on that of Mao et picking the best candidate: we take the mean out- al. (2018). Second, we develop a novel seq2seq put embedding of the context (based on the Google framework that masks metaphoric words to better News Word2Vec vectors (Mikolov et al., 2013)), learn how to generate metaphoric outputs. and select the candidate word that best matches
counters the metaphor mask token. At test time, we provide the model with the literal input, mask the verb, and the model produces an output con- ditioned on the metaphor masking training. An overview of the process is shown in Figure 2. This procedure requires additional annotated data to generate the parallel inputs for train- ing. For this, we employ a number of available metaphor corpora: the VUAMC dataset (Steen et al., 2010), another partition of the Mohammad et al. dataset that contains individual sentences la- belled as literal or metaphoric (Mohammad et al., Figure 2: Metaphor masking for the seq2seq 2016)1 , the Trofi dataset (Birke and Sarkar, 2006), model. and the additional data collected by Stowe et al. (2018). Each of these datasets contains annota- tions of metaphoric verbs, although the annotation that mean by way of cosine similarity. (d) This schema differ, so we expect some variety and noise yields the (more specific) word that best fits the in the model. Combining these datasets yields context, generating a more metaphoric expression. 35,415 verbs, of which 11,593 are metaphoric. This method, then, takes as input a sentence Our final goal is to generate short metaphoric with a known literal verb, generates possible utterances based on the Mohammad et al. (2016) metaphoric candidates to replace that verb, and dataset. In order to match this, we trim our train- chooses the best fitting option. It requires no ex- ing data around the verbs: each verb is treated as ternal training data, but relies on WordNet, and is a separate training instance, along with 7 words restricted to only generating metaphoric verbs. of context on each side. We use all 35,415 sen- 4.2 Metaphor masking model tences as input to the model: non-metaphoric sen- tences are left as-is, with the input mirroring the Sequence to sequence (seq2seq) learning output. Metaphoric data is masked during training, paradigms are vital for a variety of NLP ap- replacing the input verb with a metaphor mask- plications: machine translation, style transfer, ing and using the original as output. This yields natural language generation, and more (Chen 35,415 pairs for training, 11,593 of which con- et al., 2018; Mueller et al., 2017; Dušek et al., tain metaphoric masks. We hypothesize that us- 2020). These methods rely on encoding input ing both literal and metaphoric datasets will allow sentences into vectors, and then applying decoders the model to better distinguish between sentences to generate some output from that input vector. with a metaphor mask and those without, generat- They are often trained on parallel corpora (as in ing stronger metaphoric outputs. We use a trans- the case of machine translation), with the model former architecture (Vaswani et al., 2017) with 6 learning to output some text based on the vector layers in the encoder and decoder. The model uses encoded from the input. 8 heads to learn different attention distributions. Seq2seq models have been used to generate In the end they are concatenated. The hidden size metaphoric text (Yu and Wan, 2019), but here we for encoder and decoder is 512. We use normaliza- are focused on paraphrase generation. In order tion per tokens, with a vocabulary size of 30K. The to apply seq2seq models to this task, we develop model was trained using ADAM optimiser, with a new framework dubbed "metaphor masking". an initial learning rate of 0.5. In this framework, we replace metaphoric words in the input texts with metaphor masks (unique 5 Crowdsourced Evaluation "metaphor" tokens), hiding the lexical item. This creates artificial parallel training data: the input The approaches to evaluating metaphoric and lit- is the masked text, with the hidden metaphorical eral sentences using crowdsourcing include eval- word, and the output is the original text. Through 1 Note that our test data also comes from this source: we this learning paradigm, the model learns that it have removed all text examples from this dataset for all train- needs to generate metaphoric words when it en- ing.
uating hand-generated sentences for metaphoric- ity (Mohammad et al., 2016; Bizzoni and Lap- pin, 2018), evaluation of the output of automatic metaphor generation systems (Yu and Wan, 2019; Veale, 2016), and evaluation of novelty in verbal metaphors (Do Dinh et al., 2018). Uniquely focus- ing on metaphor evaluation, Miyazawa and Miyao (2017) highlight the importance of effective eval- uation. They use four key metrics: metaphoricity, novelty, comprehensibility, and overall evaluation, to measure the success of metaphor generation in Japanese. We will rely on two components that are typ- Figure 3: Evaluation of each model via crowd- ical of metaphor generation. First, we evaluate sourcing. metaphoricity, with the goal of producing coher- ent and interesting metaphors, rather than con- ventional, common language. Second, we eval- gold input (either literal for x, y 0 or metaphoric for uate fluency, attempting to capture the syntac- y, y 0 ) and the generated output, and asked them tic viability of the generated output. Addition- how good of a paraphrase the output was, from ally, as we are attempting to generate paraphrases, "completely unrelated" to "strong paraphrase". we also include crowdsourced evaluation of para- Metaphor evaluation is more difficult, and we phrase quality. attempt to follow previous crowdsourcing ap- Annotators were thus asked to rate sentences proaches for metaphor rating. Based on the with regard to three different factors: metaphoric- schema from Do Dinh et al. (2018) and Yu ity, fluency, and paraphrase quality. Each sentence et al. (2019), we provided basic definitions of was rated by five separate workers on a Likert metaphoricity for crowdworkers, allowing them scale from 1 to 4.2 We filtered out results of users to use their intuitions about what to consider who failed test sentences and those who only com- metaphoric. We found in a pilot study that pleted 1 task, aiming to keep results from consis- providing longer, more complex descriptions of tent and knowledgeable workers. metaphoricity increased the difficulty of the task, Fluency judgments were relatively simple. For so we chose to keep the definition simple.3 this, we asked annotators to rate the sentences Our crowdsourcing setup was repeated for three based on how fluent (from incomprehensible to outputs. The gold metaphors of Mohammad et fluent English) a sentence is. al. (2016), which also contain hand-crafted literal For paraphrase judgments, we used with two paraphrases, the lexical replacement baseline, and different setups. We have access to three com- the output of our experimental system: sentences ponents: the original literal input x; y, the origi- generated via seq2seq with metaphor masking. nal metaphoric paraphrase of x; and y 0 , the gen- 6 Analysis erated metaphoric paraphrase of x. We first eval- uate y, y 0 paraphrasing, comparing the generated The mean scores for the crowdsourced evalua- metaphoric outputs with the gold metaphors from tions for each system are shown in Figure 3.4 the test data, allowing us to compare the system The sentences generated by lexical replacement output to the gold data. We also experimented with bear closer resemblance to the literal inputs: they comparing generated paraphrases to the literal in- have lower metaphoricity scores and higher x, y 0 puts, as these should also be valid paraphrases. paraphrase rankings. This is expected, as the This represents our x, y 0 evaluation, comparing the change to the input is only a single word. The resulting paraphrases with the original literal in- 3 To facilitate future work, full descriptions of the task, puts. For each, we presented the worker with a parameters, payments, and guidelines, along with the crowd- sourced results and codebase, will be released upon publica- 2 We chose this scale rather than the 1 to 5 used by Yu et tion. al. (2019) to encourage workers to avoid "neutral" responses 4 Note that y, y 0 comparison is unavailable for the gold of ’3’. data: there is no y 0 to compare to
Source Text Met Flu PP Input He was lavished with praise Gold He was showered with praise 3 4 4 1 LexRep He was lavished with praise 2.6 3.8 3.8 MM He was pleading with impishly 2.8 2.2 1 Input The moon reflected back at itself from the lake’s surface Gold The moon glared back at itself from the lake’s surface 3.3 4 3 2 LexRep The moon sparkled back at itself from the lake’s surface 3.2 4 3.6 MM The DMZ falls back at itself from the glittering surface 3.25 2.6 1.4 Input She appears among royalty Gold She circulates among royalty 2.4 4 3.6 3 LexRep She manifested among royalty 2.75 3.8 3.4 MM She clawed among doorknob 2.8 2.6 1.6 Input Life in the camp weakened him Gold Life in the camp drained him 2.75 3.75 3.6 4 LexRep Life in the camp emasculated him 3.3 3.8 2.8 MM Life in the camp miss him 1.8 2.2 3 Table 1: Samples with the highest scores for LexRep metaphor masking model shows more similarity tence, but understandably lack metaphoricity (1). to the metaphoric outputs: they have low para- However, in many cases the model often makes phrase similarity in the x, y 0 settings, but better novel and metaphoric word choices, indicating the y, y 0 paraphrase scores, and high metaphoricity. validity of this approach for metaphor generation The fact that the metaphor masking model pro- (2-4). duces metaphoricity scores on par with the gold This baseline has numerous theoretical advan- standard metaphors, and is consistent with the lex- tages and disadvantages. It yields output sen- ical model in terms of fluency and y, y 0 paraphrase tences that are very similar to the inputs, as we quality, shows that this method is very effective at are only replacing a single word. This can be ben- generating metaphoric sentences. The quality of eficial, as the outputs will be necessarily syntacti- the paraphrases is still relatively low, averaged 1.3 cally and semantically coherent except for the re- points below the gold x, y paraphrases, but this is placed word, but also severely restricts the creativ- an important first step in for this task. ity and novelty of the output. In order to understand how each of these mod- A downside is that this method requires knowl- els performed, we do qualitative analysis over the edge of the target verb. Our data has the target results. We examined the results of each model: verb in the literal and metaphoric paraphrases an- what does the lexical replacement baseline do notated, but sometimes these verbs contain par- well, and what benefits can we gain from employ- ticles (such as "start on" and "use up"), which ing our metaphor masking model? make lexical replacement difficult. Just replac- ing the verb and maintaining the particle some- 6.1 Lexical Replacement times yields good results ("I [started] on the prob- Table 1 shows the best lexical replacement lem" → "I [fell] on the problem"), while replacing outputs, based on their improvement over the both verb and particle can also be correct ("they metaphor masking model. The replacement model [used up] their food" → "they [demolished] their performs well in fluency and paraphrase quality, food"). Second, if we apply this method to unseen particularly because it copies most of the input, data, we will first need to identify the target verbs, only replacing a single word. In some cases, making it more reliant on external knowledge and the "best fit" candidate is the original input word. prone to error. Finally, it is dependent on Word- These perform exceedingly well in fluency and Net, which restricts the power and flexibility with paraphrase quality, as they match the input sen- regard to creativity.
Source Text Met Flu PP Input She was saddened by his refusal of her invitation Gold She was crushed by his refusal of her invitation 2.4 4 4 5 LexRep She was saddens by his refusal of her invitation 1.25 2.8 3.25 MM She besieged by his refusal of her invitation 3.2 3.8 3.6 Input The critics overpraised this broadway production Gold The critics puffed up this broadway production 3.2 3.75 3.75 6 LexRep The critics this broadway production 2.2 2.25 2.2 MM The critics hailed this airborne production 3.2 3.75 3 Input The company dismissed him after many years of service Gold The company dumped him after many years of service 2.2 4 3.75 7 LexRep The company scoff him after many years of service 1.5 3 2.4 MM The company downsized him after many years of service 2.6 4 3.5 Input This story will intrigue you Gold This story will grab you 3.2 3.75 4 8 LexRep This story will schemed you 2.8 2.33 1.8 MM This story will help you 3.6 4 2.4 Table 2: Samples with the highest scores for Metaphor Masking While the above examples highlight the strength yield better metaphors. As the model isn’t con- of the lexical replacement baseline, they also strained to a particular resource, it has more power show the weaknesses of the metaphor masking ap- with regard to lexical choice. Example 6 shows proach. Due to the free nature of generation, we another benefit with regard to metaphoricity: the often see words in the generated output that bear model can generate multiple words not present in little to relation to the original input ("impishly" in the input ("hailed", "airborne"), yielding more cre- (1) and "DMZ" in (2)). These kinds of errors eluci- ative utterances, although these are often worse date how the more constrained lexical replacement paraphrases.5 . model tends to yield better paraphrases. While the model often generates strong metaphors, there are also cases where the model 6.2 Metaphor Masking predicts a word for the masked metaphoric word The metaphor masking model tend to generate that is extremely literal (7 and 8), which yields more metaphoric sentences with similar fluency, sentences that are fluent and good paraphrases although they often are not valid paraphrases of but lacking in metaphoricity. This deficit is due the original input. Table 2 shows examples for to the lack of information for the model about which the metaphor masking model performs best the metaphoric class. As our dataset is limited, in comparison with the lexical replacement model. the model doesn’t have enough signal to fully Metaphor masking tends to produce fairly con- distinguish what goes into a metaphoric gap. sistent outputs, which are syntactically regular. More data (both metaphoric and overtly literal) Hiding the metaphoric word causes the model to should help the model generate more surprising make a prediction, yielding varied outputs, and and metaphoric outputs. these are more metaphoric than their inputs. We can also see from Table 2 the weaknesses This model is complementary to the lexical re- of the lexical replacement baseline. As the can- placement model: as it is based on a sequence-to- didates are generated with diverse syntactic end- sequence transformer model, it is relatively free in ings, they often exhibit disagreement with their its generation. It frequently generates words not arguments (5, 7, 8). Additionally, it doesn’t al- in the original input which leads to more creative, ways make metaphoric predictions: in 5, the out- metaphoric outputs. Examples like 5 and 6 show put matches the input verb, yielding an extremely the power of the metaphor masking model: it is ca- 5 Note that the lexical replacement model fails for this sen- pable of generating a wider variety of words that tence, as no candidate word was found for "puffed up".
Source Text Met Flu PP Input I can’t cope with it anymore Gold I can’t hack it anymore 1.5 3 3.75 9 LexRep I can’t improvising with it anymore 1.8 2.3 2 MM I chemotherapy stuck with it gathers 2.25 2 1.5 Input Actions communicate louder than words Gold Actions talk louder than words 3.2 3.5 3.75 10 LexRep Actions conveys louder than words 1.75 3.4 3 MM Seurat culminated pointillism than words 2.5 2 1.2 Input Which horse are you betting on Gold Which horse are you backing 2 4 4 11 LexRep Which horse are you bet on 1.4 2.6 3.8 MM Gillette horse are you going on 3 1.4 1.6 Input A weather vane tops the building Gold A weather vane crowns the building 1.75 3.8 3.8 12 LexRep A weather vane clears the building 2.25 4 2.4 MM A weather verbs raided the building 3.4 1.8 1.2 Input It occurred to him that she had betrayed him Gold It dawned on him that she had betrayed him 2.2 4 3.7 13 LexRep It intervened to him that she had betrayed him 2 3 2.8 MM It comes to him that she had greeted him 2.2 2.8 1.5 Table 3: Sentences for which both LexRep and MM models performed poorly. literal paraphrase. might still prefer the most literal option. A possible solution left for future work is to se- 6.3 Consistent Errors lect the "worst fit" from the candidates: the word The sentences that confound both of our models who’s vector is least likely to match the context. tend to be idiomatic (Table 3). These are cases This would ensure contrast between domains, but where the "metaphoric" meaning of the sentence in preliminary studies lead to the model invariably isn’t captured explicitly by the verb, but rather picking syntactically incomprehensible or seman- spans the entire utterance. For example, in 10, the tically incoherent choices. For future work, we communication metaphors is present, regardless believe better limitations on the candidate selec- of the verb used: the literal verb "communicate" tion, enforcing syntactic constraints while allow- may be less metaphoric as a verb than the gold ing a wider variety of domains, will allow us to im- "talk", but the metaphor of the sentence persists. plement the "worst fit" approach more effectively This causes difficulties for our systems which re- with the potential to generate much more interest- quire metaphoricity to be focused on the verb. ing metaphoric replacements. The lexical replacement model often makes lex- The metaphoric masking model struggles with ical choices that either don’t match the original short sentences: it often generates words that don’t meaning (9), or don’t maintain any metaphoricity fit the context, yielding unparseable expressions (13). As WordNet is a finite resource, the number (see 9-12). The relatively idiomatic nature of these of candidate replacement verbs is often small, and expressions also hinders the model’s performance: this restricts the system from finding truly novel as the metaphoricity isn’t focused singularly on metaphoric expressions. It also may be the case the verb, the model is unable to make accurate pre- that finding the "best fit" word from output vectors dictions about the masked token. is actually counterproductive: Mao et al. (2018) A possible solution here is to expand the mask- use this procedure for finding the best literal para- ing to other parts of speech, or even to phrases. phrases, and although we alter their approach to This would allow the model to generate over more identify more metaphoric candidates, the model complex metaphoric expressions. Additionally,
if our seq2seq model can accurately pick up on that masks metaphoric verbs to encourage genera- masked metaphor tasks, this gives us both flexi- tion of metaphoric outputs. Crowdsourced eval- bility and control over metaphor generation: we uations show that both models are successful at will be able to choose which parts of utterances different aspects of the task: the lexical replace- we would like to metaphoric, allowing for much ment baseline yields consistent paraphrases that more powerful generation systems. lack metaphoricity, while the metaphor masking One consistent problem in this process is the model yields extremely metaphor outputs that of- difficulty of keeping annotation categories inde- ten don’t accurately paraphrase the input. pendent. We find that generated sentences that are Future work in this area is hindered by the lack incoherent syntactically also tend to be considered of available data. In order to improve these meth- bad paraphrases (Spearman correlation of .559, ods, we need better datasets. This couples with the p < .01). It is likely because if a sentence is dif- problem of evaluation: standard evaluation met- ficult to syntactically parse, it is more difficult to rics for language generation are often misleading assess its meaning, making judgments of seman- with regard to metaphors. Better datasets would tic similarity difficult. Additionally, metaphoric- allow for the development of better metrics for ity ratings correlate negatively to a lesser de- evaluation, and in turn better evaluation metrics gree with paraphrase quality (-.112, p ≈ .03). may allow us to build better systems for automat- Strong metaphoric paraphrases likely add addi- ically identifying metaphoric paraphrases, allow- tional meaning or de-emphasize some of the origi- ing us to build better corpora. nal literal meaning, making their paraphrase qual- Another possible direction to explore is the in- ity lower. Interestingly, fluency and metaphor corporation of knowledge representations. Our ratings did not significantly correlate, indicating lexical replacement method relies heavily on that disfluent sentences were neither more or less WordNet, and can make local changes based on metaphoric than their fluent counterparts. a small number of candidate verbs. Our metaphor It is important to note the variety of pos- masking model is relatively free, but neither con- sible generated expressions that are considered tain any knowledge of the metaphors in use. good. Different generated metaphors can even To truly be able to generate metaphors based maintain some of the original literal meaning, on actual metaphoric mappings, we need to in- while highlighting different aspects, as good novel corporate some knowledge of the source and tar- metaphors are known for. Consider the generated get domains involved. This could involve lever- example "This idea harmonizes up with the other aging FrameNet (Baker et al., 1998) or MetaNet one", intended to paraphrase "This idea matches (Dodge et al., 2015), developing a novel metaphor up with the other one". This captures in many knowledge base, or learning domain knowledge senses the original input of "matches up", but also in an unsupervised fashion. Developing metaphor provides something more: not only do the ideas knowledge bases that capture relations between go together, but perhaps they also improve upon domains in a usable way will not only allow for one another. Because of the variety of accept- better metaphor generation, but also better reason- able outputs, automatic generation of metaphoric ing and understanding of texts that make use of paraphrases is exceedingly difficult. For this rea- more complicated metaphoric expressions. How- son, we present an automatic metric for evaluating ever the ordeal is undertaken, generation of coher- metaphoric paraphrases. ent metaphors will inevitably require better repre- sentation of the interaction between the domains 7 Conclusions and Future Work evoked. We’ve established a new task for natural lan- guage generation: the creation of metaphoric para- References phrases for literal sentences. We explore two possible models for accomplishing this task: an Keiga Abe, Sakamoto Kayo, and Masanori Nak- adapted lexical replacement baseline model that agawa. 2006. A computational model of the relies on WordNet to find candidate verbs and the metaphor generation process. In Proceedings output vectors of word embeddings to match their of the 28th Annual Meeting of the Cognitive Sci- contexts, and a seq2seq transformer-based model ence Society, pages 937–942.
C. F. Baker, C.J. Fillmore, and J.B. Lowe. 1998. Denver, Colorado. Association for Computa- The Berkeley FrameNet project. In Proceed- tional Linguistics. ings of the 36th Annual Meeting of the As- sociation for Computational Linguistics and Ondřej Dušek, Jekaterina Novikova, and Verena the 17th International Conference on Compu- Rieser. 2020. Evaluating the state-of-the-art tational Linguistics, pages 86–90, Montreal, of end-to-end natural language generation: The Canada. Association for Computational Lin- E2E NLG challenge. Computer Speech & Lan- guistics. guage, 59:123–156. Satanjeev Banerjee and Alon Lavie. 2005. ME- Gilles Fauconnier and Mark Turner. 1996. Blend- TEOR: An automatic metric for MT evaluation ing as a central process of grammar. In Adele with improved correlation with human judg- Goldberg, editor, Conceptual Structure, Dis- ments. In Proceedings of the ACL Workshop course, and Language. Cambridge University on Intrinsic and Extrinsic Evaluation Measures Press. for Machine Translation and/or Summarization, Andrea Gagliano, Emily Paul, Kyle Booten, and pages 65–72, Ann Arbor, Michigan. Associa- Marti A. Hearst. 2016. Intersecting word vec- tion for Computational Linguistics. tors to take figurative language to new heights. Julia Birke and Anoop Sarkar. 2006. A clustering In Proceedings of the Fifth Workshop on Com- approach for nearly unsupervised recognition of putational Linguistics for Literature, pages 20– nonliteral language. In Proceedings of the 11th 31, San Diego, California, USA. Association Conference of the European Chapter of the As- for Computational Linguistics. sociation for Computational Linguistics, pages 329–336, Trento, Italy. Association for Compu- Lisa Gandy, Nadji Allan, Mark Atallah, Ophir tational Linguistics. Frieder, Newton Howard, Sergey Kanareykin, Moshe Koppel, Mark Last, Yair Neuman, and Yuri Bizzoni and Shalom Lappin. 2018. Predict- Shlomo Argamon. 2013. Automatic identi- ing human metaphor paraphrase judgments with fication of conceptual metaphors with limited deep neural networks. In Proceedings of the knowledge. In Proceedings of the 27th AAAI Workshop on Figurative Language Processing, Conference on Artificial Intelligence, pages pages 45–55, New Orleans, Louisiana. Associ- 328–334, Bellevue, Washington. AAAI Press. ation for Computational Linguistics. Raquel Hervás, Rui P. Costa, Hugo Costa, Pablo Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Gervás, and Francisco C. Pereira. 2007. En- Johnson, Wolfgang Macherey, George Fos- richment of automatically generated texts using ter, Llion Jones, Niki Parmar, Mike Schus- metaphor. In Proceedings of the Sixth Mexi- ter, Zhifeng Chen, Yonghui Wu, and Macduff can International Conference on Artificial Intel- Hughes. 2018. The best of both worlds: Com- ligence, pages 944–954, Aguascalientes, Mex- bining recent advances in neural machine trans- ico. Springer. lation. cs.CL/1804.09849v2. George Lakoff. 1993. The contemporary theory of Erik-Lân Do Dinh, Hannah Wieland, and Iryna metaphor. In Andrew Ortony, editor, Metaphor Gurevych. 2018. Weeding out conventional- and Thought, pages 202–251. University Press ized metaphors: A corpus of novel metaphor Cambridge. annotations. In Proceedings of the 2018 Con- ference on Empirical Methods in Natural Lan- George Lakoff and Mark Johnson. 1980. guage Processing, pages 1412–1424, Brussels, Metaphors We Live By. University of Chicago Belgium. Association for Computational Lin- Press, Chicago. guistics. Chee Wee (Ben) Leong, Beata Beigman Kle- Ellen Dodge, Jisup Hong, and Elise Stickles. banov, and Ekaterina Shutova. 2018. A report 2015. Metanet: Deep semantic automatic on the 2018 VUA metaphor detection shared metaphor analysis. In Proceedings of the Third task. In Proceedings of the Workshop on Figu- Workshop on Metaphor in NLP, pages 40–49, rative Language Processing, pages 56–66, New
Orleans, Louisiana. Association for Computa- ceedings of the 40th Annual Meeting of the As- tional Linguistics. sociation for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania. Associa- Rui Mao, Chenghua Lin, and Frank Guerin. 2018. tion for Computational Linguistics. Word embedding and WordNet based metaphor identification and interpretation. In Proceed- Ekaterina Shutova. 2010. Automatic Metaphor ings of the 56th Annual Meeting of the As- Interpretation as a Paraphrasing Task. In The sociation for Computational Linguistics, pages 2010 Annual Conference of the North Ameri- 1222–1231, Melbourne, Australia. Association can Chapter of the Association for Computa- for Computational Linguistics. tional Linguistics, pages 1029–1037, Los Ange- les, California. Association for Computational Hermine H. Marshall. 1990. This issue: Linguistics. Metaphors we learn by. Theory Into Practice, 29(2):70–70. Ekaterina Shutova. 2015. Design and Evaluation of Metaphor Processing Systems. Computa- Zachary J. Mason. 2004. CorMet: A compu- tional Linguistics, 41:579–623. tational, corpus-based conventional metaphor extraction system. Computational Linguistics, Ekaterina Shutova, Tim Van de Cruys, and Anna 30(1):23–44. Korhonen. 2012. Unsupervised metaphor para- phrasing using a vector space model. In Pro- Tomas Mikolov, Kai Chen, Greg Corrado, and ceedings of the 24th International Conference Jeffrey Dean. 2013. Efficient estimation of on Computational Linguistics, pages 1121– word representations in vector space. CoRR, 1130, Mumbai, India. COLING 2012 Organiz- abs/1301.3781. ing Committee. Akira Miyazawa and Yusuke Miyao. 2017. Eval- uation metrics for automatically generated Mark Steedman and Marc Moens. 1988. Tempo- metaphorical expressions. In The 12th Interna- ral ontology and temporal reference. Computa- tional Conference on Computational Semantics, tional Linguistics, 2(14):15–28. Montpellier, France. Association for Computa- G.J. Steen, A.G. Dorst, J.B. Herrmann, A.A. tional Linguistics. Kaal, T. Krennmayr, and T. Pasma. 2010. A Saif Mohammad, Ekaterina Shutova, and Peter method for linguistic metaphor identification. Turney. 2016. Metaphor as a medium for emo- From MIP to MIPVU. Converging Evidence in tion: An empirical study. In Proceedings of the Language and Communication Research. John Fifth Joint Conference on Lexical and Compu- Benjamins. tational Semantics, pages 23–33, Berlin, Ger- Kevin Stowe and Martha Palmer. 2018. Lever- many. Association for Computational Linguis- aging syntactic constructions for metaphor pro- tics. cessing. In Workshop on Figurative Lan- Jonas Mueller, David Gifford, and Tommi guage Processing, pages 17–26, New Orleans, Jaakkola. 2017. Sequence to better sequence: Louisiana. Association for Computational Lin- Continuous revision of combinatorial struc- guistics. tures. In Proceedings of the 34th International Asuka Terai and Masanori Nakagawa. 2010. Conference on Machine Learning, pages 2536– A computational system of metaphor gener- 2544, Sydney, Australia. PMLR. ation with evaluation mechanism. In Inter- Ekatarina Ovchinnikova, Vladimir Zaytsev, national Conference on Artificial Neural Net- Suzanne Wertheim, and Ross Israel. 2014. works, pages 142–147, Thessaloniki, Greece. Generating conceptual metaphors from propo- Springer. sition stores. cs.CL/1409.7619. Ashish Vaswani, Noam Shazeer, Niki Parmar, Kishore Papineni, Salim Roukos, Todd Ward, and Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Wei-Jing Zhu. 2002. Bleu: a method for auto- Łukasz Kaiser, and Illia Polosukhin. 2017. At- matic evaluation of machine translation. In Pro- tention is all you need. In 31st Conference on
Neural Information Processing Systems, pages 5998–6008, Long Beach, California. Curran Associates, Inc. Tony Veale. 2016. Round up the usual suspects: Knowledge-based metaphor generation. In Pro- ceedings of the Fourth Workshop on Metaphor in NLP, pages 34–41, San Diego, California. Association for Computational Linguistics. Tony Veale and Yanfen Hao. 2008. A fluid knowl- edge representation for understanding and gen- erating creative metaphors. In Proceedings of the 22nd International Conference on Compu- tational Linguistics, pages 945–952, Manch- ester, UK. COLING 2008 Organizing Commit- tee. Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A computational perspective. Synthesis Lectures on Human Language Technologies, 9(1):1–160. Alan Wallington, Rodrigo Agerri, John Barnden, Mark Lee, and Tim Rumbell. 2011. Affect transfer by metaphor for an intelligent conver- sational agent. In Affective computing and sen- timent analysis. Emotion, metaphor and termi- nology, volume 45, pages 53–66. Zhiwei Yu and Xiaojun Wan. 2019. How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor gen- eration. In Proceedings of the 2019 Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies, pages 861–871, Min- neapolis, Minnesota. Association for Computa- tional Linguistics. Li Zhang. 2008. Metaphorical affect sensing in an intelligent conversational agent. In Proceed- ings of the Fifth International Conference on Advances in Computer Entertainment Technol- ogy, pages 100–106, Yokohama, Japan.
You can also read