Fixed That for You: Generating Contrastive Claims with Semantic Edits

Christopher Hidey and Kathleen McKeown
Department of Computer Science, Columbia University, New York, NY 10027
chidey@cs.columbia.edu, kathy@cs.columbia.edu

Proceedings of NAACL-HLT 2019, pages 1756-1767, Minneapolis, Minnesota, June 2-7, 2019. (c) 2019 Association for Computational Linguistics.

Abstract

Understanding contrastive opinions is a key component of argument generation. Central to an argument is the claim, a statement that is in dispute. Generating a counter-argument then requires generating a response in contrast to the main claim of the original argument. To generate contrastive claims, we create a corpus of Reddit comment pairs self-labeled by posters using the acronym FTFY (fixed that for you). We then train neural models on these pairs to edit the original claim and produce a new claim with a different view. We demonstrate significant improvement over a sequence-to-sequence baseline in BLEU score and a human evaluation for fluency, coherence, and contrast.

1 Introduction

In the Toulmin model (1958), often used in computational argumentation research, the center of the argument is the claim, a statement that is in dispute (Govier, 2010). In recent years, there has been increased interest in argument generation (Bilu and Slonim, 2016; Hua and Wang, 2018; Le et al., 2018). Given an argument, a system that generates counter-arguments would need to 1) identify the claims to refute, 2) generate a new claim with a different view, and 3) find supporting evidence for the new claim. We focus on the second task, which requires an understanding of contrast. A system that can generate claims with different views is a step closer to understanding and generating arguments (Apothloz et al., 1993). We build on previous work in automated claim generation (Bilu et al., 2015), which examined generating opposing claims via explicit negation. However, researchers also noted that not every claim has an exact opposite. Consider a claim from Reddit:

    (1) Get employers out of the business, pass universal single-payer healthcare.

This is an example of a policy claim, a view on what should be done (Schiappa and Nordin, 2013). While negation of this claim is a plausible response (e.g. asserting there should be no change by stating Do not get employers out of the business, do not pass universal healthcare), negation limits the diversity of responses that can lead to a productive dialogue. Instead, consider a response that provides an alternative suggestion:

    (2) Get employers out of the business, deregulate and allow cross-state competition.

In Example 1, the speaker believes in an increased role for government while in Example 2, the speaker believes in a decreased one. As these views are on different sides of the political spectrum, it is unlikely that a single speaker would utter both claims. In related work, de Marneffe et al. (2008) define two sentences as contradictory when they are extremely unlikely to be true simultaneously. We thus define a contrastive claim as one that is likely to be contradictory if made by the speaker of the original claim. Our goal, then, is to develop a method for generating contrastive claims when explicit negation is not the best option. Generating claims in this way also has the benefit of providing new content that can be used for retrieving or generating supporting evidence.

In order to make progress towards generating contrastive responses, we need large, high-quality datasets that illustrate this phenomenon. We construct a dataset of 1,083,520 contrastive comment pairs drawn from Reddit using a predictive model to filter out non-contrastive claims. Each pair contains very similar, partially aligned text but the responder has significantly modified the original post. We use this dataset to model differences in views and generate a new claim given an original comment.
The similarity within these pairs allows us to use them as distantly labeled contrastive word alignments. The word alignments provide semantic information about which words and phrases can be substituted in context in a coherent, meaningful way.

Our contributions are as follows (data and code are available at github.com/chridey/fixedthat):

1. Methods and data for contrastive claim identification to mine comment pairs from Reddit, resulting in a large, continuously growing dataset of 1,083,520 distant-labeled examples.

2. A crowd-labeled set of 2,625 comments, each paired with 5 new contrastive responses generated by additional annotators.

3. Models for generating contrastive claims using neural sequence models and constrained decoding.

In the following sections, we describe the task methodology and the data collection and processing. Next, we present neural models for contrastive claim generation, evaluate our work, and present an error analysis. We then discuss related work in contrast/contradiction, argumentation, and generation before concluding.

2 Task Definition and Motivation

Previous work in claim generation (Bilu et al., 2015) focused on explicit negation to provide opposing claims. While negation plays an important role in argumentation (Apothloz et al., 1993), researchers found that explicit negation may result in incoherent responses (Bilu et al., 2015). Furthermore, recent empirical studies have shown that arguments that provide new content (Wachsmuth et al., 2018) tend to be more effective. While new concepts can be introduced in other ways by finding semantically relevant content, we may find it desirable to explicitly model contrast in order to control the output of the model as part of a rhetorical strategy, e.g. concessions (Musi, 2018). We thus develop a model that generates a contrastive claim given an input claim.

Contrastive claims may differ in more than just viewpoint; they may also contain stylistic differences and paraphrases, among other aspects. We thus propose to model contrastive claims by controlling for context and maintaining the same text between pairs of contrastive claims except for the contrastive word or phrase. Much of the previous work in contrast and contradiction has examined the relationship between words or sentences. In order to understand when words and phrases are contrastive in argumentation, we need to examine them in context. For example, consider the claim Hillary Clinton should be president. A reasonable contrastive claim might be Bernie Sanders should be president (rather than the explicit negation Hillary Clinton should not be president). In this context, Hillary Clinton and Bernie Sanders are contrastive entities as they were both running for president. However, for the claim Hillary Clinton was the most accomplished Secretary of State in recent memory, they would be unrelated. Consider also that we could generate the claim Hillary Clinton should be senator. This contrastive claim is not coherent given the context. Generating a contrastive claim then requires 1) identifying the correct substitution span and 2) generating a response with semantically relevant replacements. While some contrastive claims are not coherent, there are often multiple plausible responses, similar to tasks such as dialogue generation. For example, Donald Trump should be president is just as appropriate as Bernie Sanders should be president. We thus treat this as a dialogue generation task where the goal is to generate a plausible response given an input context.

3 Data

In order to model contrastive claims, we need datasets that reflect this phenomenon.

3.1 Collection

We obtain training data by scraping the social media site Reddit for comments containing the acronym FTFY (https://en.wiktionary.org/wiki/FTFY). FTFY is a common acronym meaning "fixed that for you" (https://reddit.zendesk.com/hc/en-us/articles/205173295-What-do-all-these-acronyms-mean). FTFY responses (hereafter FTFY) are used to respond to another comment by editing part of the "parent comment" (hereafter parent).
Most commonly, FTFY is used for three categories of responses: 1) expressing a contrastive claim (e.g. the parent is Bernie Sanders for president and the FTFY is Hillary Clinton should be president), which may be sarcastic (e.g. Ted Cruz for president becomes Zodiac killer for president); 2) making a joke (e.g. This really piques my interest vs. This really *py*ques my interest); and 3) correcting a typo (e.g. This peaks my interest vs. piques). In Section 3.2, we describe how we identify category 1 (contrastive claims) for modeling.

To obtain historical Reddit data, we mined comments from the site pushshift.io for December 2008 through October 2017. This results in 2,200,258 pairs from Reddit, where a pair consists of a parent and an FTFY. We find that many of the top occurring subreddits are ones where we would expect strong opinions (/r/politics, /r/worldnews, and /r/gaming).

3.2 Classification

To filter the data to only the type of response that we are interested in, we annotated comment pairs for contrastive claims and other types. We use our definition of contrastive claims based on contradiction, where both the parent and FTFY are a claim and they are unlikely to be beliefs held by the same speaker. A joke is a response that does not meaningfully contrast with the parent and commonly takes the form of a pun, rhyme, or oronym. A correction is a response to a typo, which may be a spelling or grammatical error. Any other pair is labeled as "other," including pairs where the parent is not a claim.

In order to identify contrastive claims, we selected a random subset of the Reddit data from prior to September 2017 and annotated 1,993 comments. Annotators were native speakers of English and the inter-annotator agreement using Krippendorff's alpha was 0.72. Contrast occurs in slightly more than half of the sampled cases (51.4%), with jokes (23.0%) and corrections (21.2%) comprising about one quarter each. We then train a binary classifier to predict contrastive claims, thus enabling better quality data for the generation task.

To identify the sentence in the parent that the FTFY responds to and to derive features for classification, we use an edit distance metric to obtain sentence and word alignments between the parent comment and response. As the words in the parent and response are mostly in the same order and most FTFYs contain significant overlap with the parent, it is possible to find alignments by moving a sliding window over the parent. A sample of 100 comments verifies that this approach yields exact word alignments in 75 comments and exact sentence alignments in 93.

Given these pairs of comments, we derive linguistic and structural features for training a binary classifier. For each pair of comments, we compute features for the words in the entire comment span and features from the aligned phrase span only (as identified by edit distance). From the aligned phrases we compute the character edit distance and character Jaccard similarity (both normalized by the number of characters) to attempt to capture jokes and typos (the similarity should be high if the FTFY is inventing an oronym or correcting a spelling error). From the entire comment, we use the percentage of characters copied, as a low percentage may indicate a poor alignment, and the percentage of non-ASCII characters, as many of the jokes use emojis or upside-down text. In addition, we use features from GloVe (Pennington et al., 2014) word embeddings for both the entire comment and the aligned phrases (we found the 50-dimensional Wikipedia+Gigaword embeddings to be sufficient). We include the percentage of words in the embedding vocabulary for both spans for both the parent and FTFY; this feature identifies infrequent words which may be typos or jokes. We compute the cosine similarity of the average word embeddings between the parent and FTFY for both spans. Finally, we use the average word embeddings themselves for both spans for both parent and FTFY.

As we want to model the generation of new content, not explicit negation, we removed any pairs where the difference was only "stop words." The set of stop words includes all the default stop words in the Python library Spacy (spacy.io) combined with expletives and special tokens (we replaced all URLs and usernames). We trained a logistic regression classifier and evaluated using 4-fold cross-validation. We compare to a character overlap baseline where any examples with Jaccard similarity > 0.9 and edit distance < 0.15 were classified as non-contrastive. The goal of this baseline is to illustrate how much of the non-contrastive data involves simple or non-existent substitutions. Results are shown in Table 1. Our model obtains an F-score of 80.25, an 8 point absolute improvement over the baseline.

Model     Precision  Recall  F-score
Majority  51.4       100     67.5
Baseline  67.75      77.19   72.16
LR        74.22      87.60   80.25

Table 1: Results of Identifying Contrastive Claims
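To make the classifier concrete, the sketch below shows one way the character-level and embedding-based similarity features described above could be computed for a parent/FTFY pair. It is a minimal illustration under stated assumptions, not the released code at github.com/chridey/fixedthat: the function names are hypothetical and Python's difflib stands in for the paper's own edit distance alignment.

```python
# Sketch of per-pair similarity features (an assumption, not the authors' implementation).
import difflib
import numpy as np

def char_features(parent_span: str, ftfy_span: str) -> dict:
    """Character-level similarity over the aligned (non-matching) spans."""
    matcher = difflib.SequenceMatcher(None, parent_span, ftfy_span)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Edit distance approximated as the number of unmatched characters,
    # normalized by the total number of characters in both spans.
    total = max(len(parent_span) + len(ftfy_span), 1)
    edit_dist = (total - 2 * matched) / total
    p_chars, f_chars = set(parent_span), set(ftfy_span)
    jaccard = len(p_chars & f_chars) / max(len(p_chars | f_chars), 1)
    return {"char_edit_distance": edit_dist, "char_jaccard": jaccard}

def embedding_features(parent_tokens, ftfy_tokens, glove: dict, dim: int = 50) -> dict:
    """Vocabulary coverage and cosine similarity of averaged GloVe embeddings."""
    def average(tokens):
        vecs = [glove[t] for t in tokens if t in glove]
        coverage = len(vecs) / max(len(tokens), 1)
        mean = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
        return coverage, mean

    p_cov, p_vec = average(parent_tokens)
    f_cov, f_vec = average(ftfy_tokens)
    denom = (np.linalg.norm(p_vec) * np.linalg.norm(f_vec)) or 1.0
    return {"parent_coverage": p_cov, "ftfy_coverage": f_cov,
            "avg_embedding_cosine": float(p_vec @ f_vec / denom)}
```

In this reading, the per-pair feature dictionaries, together with the copied-character and non-ASCII percentages and the averaged embeddings themselves, would be concatenated into a single vector and passed to a standard logistic regression classifier (e.g. scikit-learn's LogisticRegression) evaluated with 4-fold cross-validation.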
3.3 Selection

After using the trained model to classify the remaining data, we have 1,083,797 Reddit pairs. We set aside 10,307 pairs from October 1-20, 2017 for development and October 21-30 for test (6,773), with the remainder used for training. As we are primarily working with sentences, the mean parent length was 16.3 and the mean FTFY length was 14.3.

The resulting test FTFYs are naturally occurring and so do not suffer from annotation artifacts. At the same time, they are noisy and may not reflect the desired phenomenon. Thus, we also conducted an experiment on Amazon Mechanical Turk (AMT) to obtain additional gold references, which are further required by metrics such as BLEU (Papineni et al., 2002). (We paid annotators the U.S. federal minimum wage and the study was approved by an IRB.) We selected 2,625 pairs from the 10 most frequent categories, obtained from the snoopsnoo.com API (see Table 2). These categories form a three-level hierarchy for each subreddit and we use the second level, e.g. for /r/pokemongo the categories are "Pokémon", "Video Games", and "Gaming", so we use "Video Games." Before participating, each annotator was required to pass a qualification test of five questions to gauge their knowledge of that topic. For the movies category, one question we asked was whether for the sentence Steven Spielberg is the greatest director of all time, we could instead use Stanley Kubrick or Paul McCartney. If they passed this test, the annotators were then given the parent comment and "keywords" (the subreddit and three category levels) to provide additional context. We obtained five new FTFYs for each parent and validated them manually to remove obvious spam or trivial negation (e.g. "not" or "can't").

Category     Count   Category    Count
Video Games  1062    Basketball  116
Politics     529     Soccer      99
Football     304     Movies      88
Television   194     Hockey      60
World News   130     Baseball    55

Table 2: Comments for Mechanical Turk

4 Methods

Our goal of generating contrastive claims can be broken down into two primary tasks: 1) identifying the words in the original comment that should be removed or replaced and 2) generating the appropriate substitutions and any necessary context. Initially, we thus experimented with a modular approach: tagging each word in the parent and then using the model predictions to determine whether we should copy, delete, or replace a segment with a new word or phrase. We tried the bi-directional LSTM-CNN-CRF model of Ma and Hovy (2016) and used our edit distance word alignments to obtain labels for copying, deleting, or replacing. However, we found this model performed only slightly above random predictions, and with error propagation, the model is unlikely to produce fluent and accurate output. Instead, we use an end-to-end approach using techniques from machine translation.

4.1 Model

We use neural sequence-to-sequence encoder-decoder models (Sutskever et al., 2014) with attention for our experiments. The tokens from the parent are passed as input to a bi-directional GRU (Cho et al., 2014) to obtain a sequence of encoder hidden states h_i. Our decoder is also a GRU, which at time t generates a hidden state s_t from the previous hidden state s_{t-1} along with the input. When training, the input x_t is computed from the previous word in the gold training data if we are in "teacher forcing" mode (Williams and Zipser, 1989) and otherwise is the prediction made by the model at the previous time step. When testing, we also use the model predictions. The input word w_t may be augmented by additional features, as discussed in Section 4.2. In the baseline scenario, x_t = e(w_t) where e is an embedding. The hidden state s_t is then combined with a context vector h^*_t, which is a weighted combination of the encoder hidden states using an attention mechanism:

    h^*_t = \sum_i \alpha_{ti} h_i

To calculate \alpha_{ti}, we use the attention of Luong et al. (2015), as this encourages the model to select features in the encoder hidden state which correlate with the decoder hidden state; we want this because our input and output are similar.
Our experiments on the development data verified this, as Bahdanau attention (Bahdanau et al., 2015) performed worse. Attention is then calculated as:

    \alpha_{ti} = \frac{\exp(h_i^\top s_t)}{\sum_{s'} \exp(h_{s'}^\top s_t)}

Finally, we make a prediction of a vocabulary word w by using features from the context and decoder hidden state with a projection matrix W and output vocabulary matrix V:

    P(w) = \mathrm{softmax}(V \tanh(W[s_t; h^*_t] + b_w) + b_v)

We explored using a copy mechanism (See et al., 2017) for word prediction but found it difficult to prevent the model from copying the entire input.
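The decoder step above can be summarized in a few lines of PyTorch. This is a simplified, batch-size-one sketch of the dot-product (Luong-style) attention and the vocabulary projection, not the authors' implementation; it assumes the word embedding and GRU sizes are equal and that the bi-directional encoder states have already been reduced to the decoder's hidden size.

```python
# Simplified single-step attention decoder (a sketch under assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.gru_cell = nn.GRUCell(hidden_size, hidden_size)   # decoder GRU
        self.W = nn.Linear(2 * hidden_size, hidden_size)       # projection over [s_t; h*_t]
        self.V = nn.Linear(hidden_size, vocab_size)             # output vocabulary matrix

    def forward(self, x_t, s_prev, encoder_states):
        """x_t: (hidden,) input embedding; s_prev: (hidden,); encoder_states: (T_src, hidden),
        assumed already projected from the bi-directional encoder's 2x-hidden output."""
        s_t = self.gru_cell(x_t.unsqueeze(0), s_prev.unsqueeze(0)).squeeze(0)
        # alpha_ti proportional to exp(h_i^T s_t): dot-product attention of Luong et al. (2015)
        scores = encoder_states @ s_t                      # (T_src,)
        alpha = F.softmax(scores, dim=0)
        context = alpha @ encoder_states                   # h*_t, weighted sum of encoder states
        # P(w) = softmax(V tanh(W [s_t; h*_t])), with bias terms folded into the Linear layers
        logits = self.V(torch.tanh(self.W(torch.cat([s_t, context]))))
        return s_t, F.log_softmax(logits, dim=-1)
```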
4.2 Decoder Representation

Decoder Input: We evaluate two representations of the target input: as a sequence of words and as a sequence of edits. The sequence-of-words approach is the standard encoder-decoder setup; for the example parent Hillary Clinton for president 2020 and FTFY Bernie Sanders for president, we would use the FTFY without modification. Schmaltz et al. (2017) found success modeling error correction using sequence-to-sequence models by representing the target input as a sequence of edits. We apply a similar approach to our problem, generating a target sequence by following the best path in the matrix created by the edit distance algorithm. The new target sequence is the original parent interleaved with "DELETE-N" tokens that specify how many previous words to delete, followed by the newly generated content. For the same example, Hillary Clinton for president 2020, the modified target sequence would be Hillary Clinton DELETE-2 Bernie Sanders for president 2020 DELETE-1.

Counter: Kikuchi et al. (2016) found that by using an embedding for a length variable they were able to control output length via a learned mechanism. In our work, we compute a counter variable which is initially set to the number of new content words the model should generate. During decoding, the counter is decremented if a word is generated that is not in the source input (I) or in the set of stop words (S) defined in Section 3.2. The model uses an embedding e(c_t) for each count, which is parameterized by a count embedding matrix. The input to the decoder state in this scenario is x_t = e(w_t, c_t). At each time step, the count is computed by:

    c_0 = |O \setminus (S \cup I)|  (or a desired count)

    c_{t+1} = \begin{cases} c_t - 1, & \text{if } w_t \notin S \cup I \text{ and } c_t > 0 \\ c_t, & \text{otherwise} \end{cases}

where O is the set of gold output words in training.

For the parent comment Hillary Clinton for president 2020 and FTFY Bernie Sanders for president, the decoder input is presented in Table 3, with the time t in the first row and the inputs w_t and c_t in the second and third rows, respectively. At the start of decoding, the model expects to generate two new content words, which in this example it generates immediately, decrementing the counter. When the counter reaches 0, it only generates stop or input words.

t    0  1       2        3    4
w_t  -  Bernie  Sanders  for  president
c_t  2  1       0        0    0

Table 3: Example of Counter

Unlike the controlled-length scenario, at test time we do not know the number of new content words to generate. However, the count for most FTFYs is between 1 and 5, inclusive, so we can exhaustively search this range during decoding. We experimented with predicting the count but found it to be inaccurate, so we leave this for future work.

Subreddit Information: As the model often needs to disambiguate polysemous words, additional context can be useful. Consider the parent comment this is a strange bug. In a programming subreddit, a sarcastic FTFY might be this is a strange feature. However, in a Pokémon subreddit, an FTFY might be this is a strange dinosaur in an argument over whether Armaldo is a bug or a dinosaur. We thus include additional features to be passed to the encoder at each time step, in the form of an embedding g for each of the three category levels obtained in Section 3.3. These embeddings are concatenated to the input word w_t at each timestep, i.e. x_t = e(w_t, g^1_t, g^2_t, g^3_t).
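A short sketch of the two pieces just described: building the DELETE-N edit sequence for the target side and maintaining the counter during decoding. The helper names are hypothetical and difflib stands in for the paper's edit distance alignment; the counter update mirrors the equations above.

```python
# Sketch of the edit-sequence target and the counter update (assumed helpers, not the authors' preprocessing).
import difflib

def edit_sequence(parent_tokens, ftfy_tokens):
    """Interleave the parent with DELETE-N tokens followed by the newly generated content."""
    target = []
    matcher = difflib.SequenceMatcher(None, parent_tokens, ftfy_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            target.extend(parent_tokens[i1:i2])
        else:  # replace / delete / insert: keep the parent words, mark deletions, add new words
            target.extend(parent_tokens[i1:i2])
            if i2 > i1:
                target.append(f"DELETE-{i2 - i1}")
            target.extend(ftfy_tokens[j1:j2])
    return target

def initial_count(ftfy_tokens, source_tokens, stop_words):
    """c_0 = |O \\ (S U I)|: gold output words outside the source input and stop list."""
    return len(set(ftfy_tokens) - set(source_tokens) - stop_words)

def update_count(c_t, w_t, source_tokens, stop_words):
    """Decrement only when a new content word is generated and the counter is positive."""
    if w_t not in stop_words and w_t not in source_tokens and c_t > 0:
        return c_t - 1
    return c_t

# For the paper's example, edit_sequence(["Hillary", "Clinton", "for", "president", "2020"],
#                                        ["Bernie", "Sanders", "for", "president"])
# yields: Hillary Clinton DELETE-2 Bernie Sanders for president 2020 DELETE-1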
4.3 Objective Function

We use a negative log likelihood objective function L_{NLL} = -\sum_{t \in 1:T} \log P(w^*_t), where w^*_t is the gold token at time t, normalized by each batch. We also include an additional loss term that uses the encoder hidden states to make a binary prediction over the input for whether a token will be copied or inserted/deleted. For the example from Section 4.2, the target for Hillary Clinton for president 2020 would be 0 0 1 1 0. This encourages the model to select features that indicate whether the encoder input will be copied to the output. We use a 2-layer multi-layer perceptron and a binary cross-entropy loss L_{BCE}. The joint loss is then:

    L = L_{NLL} + \lambda L_{BCE}

4.4 Decoding

We use beam search for generation, as this method has proven effective for many neural language generation tasks. For the settings of the model that require a counter, we expand the beam by the count m so that for a beam size k we calculate k * m states.

Filtering: We optionally include a constrained decoding mode where we filter the output based on the counter: when c_t > 1 the end-of-sentence (EOS) score is set to -\infty, and when c_t = 0 the score of any word w \in V \setminus (S \cup I) is set to -\infty. The counter c_t is decremented at every time step as in Section 4.2. In other words, when the counter is zero, we only allow the model to copy or generate stop words; when the counter is positive, we prevent the model from ending the sentence before it generates new content words and decrements the counter. Constrained decoding is possible with any combination of settings, with or without the counter embedding.
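The filtering rule can be expressed as a mask over the next-token scores of each beam hypothesis. The snippet below is a schematic reading of that rule; variable names are illustrative, and the real decoder additionally handles the k * m beam expansion described above.

```python
# Schematic score filtering for constrained decoding (a sketch, not the released decoder).
import torch

NEG_INF = float("-inf")

def constrain_scores(scores, c_t, eos_id, allowed_ids):
    """scores: (vocab,) next-token log-probabilities for one hypothesis.
    c_t: remaining number of new content words to generate.
    allowed_ids: indices of stop words plus words copied from the source input (S U I).
    """
    scores = scores.clone()
    if c_t > 1:
        # Do not allow the hypothesis to end before new content words are produced.
        scores[eos_id] = NEG_INF
    if c_t == 0:
        # Only copying source words or generating stop words is allowed.
        mask = torch.full_like(scores, NEG_INF)
        mask[list(allowed_ids)] = 0.0
        scores = scores + mask
    return scores
```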
4.5 Hyper-parameters and Optimization

We used PyTorch (Paszke et al., 2017) for all experiments. We used 300-dimensional vectors for the word embedding and GRU layers. The count embedding dimension was set to 5, with m = 5 and k = 10 for decoding. The category embedding dimensions were set to 5, 10, and 25 for each of the non-subreddit categories. We also set \lambda = 1 for multi-task learning. We used the Adam optimizer (Kingma and Ba, 2015) with settings of \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8} and a learning rate of 10^{-3} decaying by \gamma = 0.1 every epoch. We used dropout (Srivastava et al., 2014) on the embeddings with a probability of 0.2 and teacher forcing with a probability of 0.5. We used a batch size of 100 with 10 epochs, selecting the best model on the development set based on perplexity. We set the minimum frequency of a word in the vocabulary to 4.

5 Results

For training, development, and testing we use the data described in Section 3.3. The test reference data consists of the Reddit FTFYs and the FTFYs generated from AMT. We evaluate our models using automated metrics and human judgments.

Automated metrics should reflect our joint goals of 1) copying necessary context and 2) making appropriate substitutions. To address point 1, we use BLEU-4 as a measure of similarity between the gold FTFY and the model output. As the FTFY may contain significant overlap with the parent, BLEU indicates how well the model copies the appropriate context. As BLEU mostly reflects span selection rather than the insertion of new content, we need alternative metrics to address point 2. However, addressing point 2 is more difficult due to the variety of possible substitutions, including named entities. For example, if the parent comment is jaguars for the win! and the gold FTFY is chiefs for the win! but the model produces cowboys for the win! (or any of 29 other NFL teams), most metrics would judge this response incorrect even though it would be an acceptable response. Thus we present results using both automated metrics and human evaluation. As an approximation to address point 2, we attempt to measure when the model is making changes rather than just copying the input. To this end, we present two additional metrics: novelty, a measure of whether novel content (non-stop-word) tokens are generated relative to the parent comment, and partial match, a measure of whether the novel tokens in the gold FTFY match any of the novel tokens in the generated FTFY. To provide a reference point, we find that the partial match between two different gold FTFYs (Reddit and AMT) was 11.4% and BLEU was 47.28, which shows the difficulty of automatic evaluation. The scores are lower than expected because the Reddit FTFYs are noisy due to the process in Section 3.2. This also justifies obtaining the AMT FTFYs.
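Novelty and partial match are defined only informally here, so the following sketch makes one reasonable reading explicit; the exact tokenization and stop list used in the paper may differ, and this is not the official scorer.

```python
# One possible reading of the novelty and partial match metrics (an assumption, not the official scorer).
def novel_tokens(tokens, parent_tokens, stop_words):
    """Content tokens that do not appear in the parent comment."""
    return {t for t in tokens if t not in stop_words and t not in parent_tokens}

def novelty(generated, parent, stop_words):
    """1 if the generated FTFY introduces at least one new content token, else 0."""
    return int(bool(novel_tokens(generated, parent, stop_words)))

def partial_match(generated, gold, parent, stop_words):
    """1 if any novel token in the gold FTFY also appears among the generated novel tokens."""
    gen_novel = novel_tokens(generated, parent, stop_words)
    gold_novel = novel_tokens(gold, parent, stop_words)
    return int(bool(gen_novel & gold_novel))

# Corpus-level scores would then be the percentage of test pairs where each indicator is 1.
```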
Results are presented in Table 4. The baseline is a sequence-to-sequence model with attention. For the other components, the counter embedding is referred to as "COUNT," the category/subreddit embeddings as "SUB," the sequence of edits as "EDIT," and the multi-task copy loss as "COPY." The models in the top half of the table use constrained decoding and those in the bottom half are unconstrained, to show the learning capabilities of the models. For each model we compute statistical significance with bootstrap resampling (Koehn, 2004) against the constrained or unconstrained baseline as appropriate, and we find the COUNT and EDIT models to be significantly better for constrained and unconstrained decoding, respectively (p < 0.005).

                                       Reddit             AMT
Model                      Novelty  BLEU-4  % Match  BLEU-4  % Match
Constrained
  Baseline                   79.88   18.81     4.67   40.14    10.06
  COUNT                      89.69   22.61     4.72   47.55    12.55
  COUNT + SUB + COPY         90.45   23.13     4.83   50.05    14.92
  EDIT                       64.64   16.12     3.37   35.48     7.33
  EDIT + COUNT + SUB + COPY  82.96   19.37     4.23   42.69    11.62
Unconstrained
  Baseline                    3.34    7.31     0.73   25.83     0.68
  COUNT                      16.19    8.51     1.95   27.68     2.36
  COUNT + SUB + COPY         16.26    9.62     1.93   31.23     3.81
  EDIT                        7.97   35.41     1.57   74.24     1.56
  EDIT + COUNT + SUB + COPY  39.99   32.59     3.25   67.56     6.25

Table 4: Automatic Evaluation

Under constrained decoding, we see that the "COUNT + SUB + COPY" model performs the best in all metrics, although most of the performance can be attributed to the count embedding. When we allow the model to determine its own output, we find that "EDIT + COUNT" performs the best. In particular, this model does well at understanding which part of the context to select, and even does better than other unconstrained models at selecting appropriate substitutions. However, when we combine this model with constrained decoding, the improvement is smaller than for the other settings. We suspect that because the EDIT model often needs to generate a DELETE-N token before a new response, these longer-term dependencies are hard to capture with constrained decoding but easier if included in training.

We also conducted a human evaluation of the model output on the same subset of 2,625 examples described in Section 3.3. We performed an additional experiment on AMT where we asked annotators to rate responses on fluency, coherence, and contrast. Fluency is a measure of the quality of the grammar and syntax and the likelihood that a native English speaker would utter that statement. Coherence is a measure of whether the response makes sense and is semantically meaningful. Contrast is a measure of how much the response contradicts the original comment. We specified that if the response is different but does not provide a contrasting view it should receive a low rating. Previous work (Bilu et al., 2015) used fluency, clarity/usability (which we combine into coherence), and opposition (where we use contrast).

We used a Likert scale where 5 is strongly agree and 1 is strongly disagree. We used the same data and qualification test from Section 3.3 for each category and used three annotators per example. We asked annotators to judge 4 different pairs: 3 model outputs and the gold Reddit FTFYs for comparison (we did not evaluate the AMT FTFYs as these were generated by the same pool of annotators). We include the baseline, the baseline with constrained decoding, and the best constrained model ("COUNT + SUB + COPY") according to BLEU and partial match. We verified that the annotators understood how to rate contrast by examining the distribution of responses: the annotators selected option 3 (neither) 15% of the time and preferred to select either extreme, 5 (21%) or 1 (27%).

Model        Fluency  Coherence  Contrast
Gold         4.34     4.26       3.01
Baseline     3.49     3.19       1.94
Constrained  3.46     3.32       2.53
Best         3.52     3.46       2.87

Table 5: Human Evaluation

Results are presented in Table 5, showing a clear preference for the best model. Note the degradation in fluency for the constrained baseline, as the model is prevented from generating the EOS token and may repeat tokens up to the maximum length.
Parent: ah yes the wonders of the free market
Model:  ah yes the wonders of government intervention

Parent: i know that this is an unofficial mod , but xp is the best os for this machine
Model:  linux is the best os for this machine

Parent: that 's why it 's important to get all your propaganda from infowars and brietbart
Model:  propaganda from fox news outlets

Table 6: Model Output

6 Qualitative Analysis

We provide three examples of the model output in Table 6, with the first and third from the News and Politics category, demonstrating how the model handles different types of input. In the first example, the contrast is between allowing markets to regulate themselves versus an increased role of government. In the second example, the contradiction is due to the choice of operating system. In the third (invalid) example, the model responds to a sarcastic claim with another right-wing news organization; this response is not a contradiction since it is plausible the original speaker would also utter this statement.

6.1 Error Analysis

We conduct an error analysis by selecting 100 responses where the model did not partially match any of the 6 gold responses, and we found 6 main types of errors. One error is the model identifying an incorrect substitution span while the human responses all selected a different span to replace. We noticed that this occurred 5 times and may require world knowledge to understand which tokens to select. For example, in response to the claim Hillary Clinton could have been president if not for robots, the model generates Donald Trump in place of Hillary Clinton, whereas the gold responses generate humans / votes / Trump's tweets in place of robots. Another type of error is when the responses are not coherent with the parent and the language model instead determines the token selection based on the most recent context (11 cases). For example, given the claim bb-8 gets a girlfriend and poe still does n't have a girlf :') the Reddit FTFY has boyf instead of girlf whereas the model generates ... and poe still does n't have a clue what i 'm talking about . We also found examples where the model chose poorly due to unfiltered jokes or errors in the training data (12 in total). In 15 cases, due to the constrained decoding, the model repeated a word until the maximum length or appended an incoherent phrase. For the most common error, the model made a substitution that was not contrasting, as in Table 6 (19 examples). Finally, we found 38 of the samples were valid responses but did not match the gold, indicating the difficulty of automatic evaluation. For example, in response to the claim Nintendo is the only company that puts customers over profits, the model replaces Nintendo with Rockstar (both video game companies) while the gold FTFYs had other video game companies.

7 Related Work

Understanding contrast and contradiction is key to argumentation as it requires an understanding of differing points-of-view. Recent work examined the negation of claims via explicit negation (Bilu et al., 2015). Other work investigated the detection of different points-of-view in opinionated text (Al Khatib et al., 2012; Paul et al., 2010). Wachsmuth et al. (2017; 2018) retrieved arguments for and against a particular stance using online debate forums. In non-argumentative text, researchers predicted contradictions for types such as negation, antonyms, phrasal, or structural (de Marneffe et al., 2008) or those that can be expressed with functional relations (Ritter et al., 2008). Other researchers have incorporated entailment models (Kloetzer et al., 2013) or crowdsourcing methods (Takabatake et al., 2015). Contradiction has also become a part of the natural language inference (NLI) paradigm, with datasets labeling contradiction, entailment, or neutral (Bowman et al., 2015a; Williams et al., 2018). The increase in resources with contrast and contradiction has resulted in new representations with contrastive meaning (Chen et al., 2015; Nguyen et al., 2016; Vulić, 2018; Conneau et al., 2017). Most of this work has focused on identifying contrast or contradiction while we aim to generate contrast. Furthermore, while contradiction and contrast are present in these corpora, we obtain distant-labeled alignments for contrast at the word and phrase level. Our dataset also includes contrastive concepts and entities while other corpora primarily contain antonyms and explicit negation.
Contrast also appears in the study of stance, where the opinion towards a target may vary. The SemEval 2016 Stance Detection for Twitter task (Mohammad et al., 2016) involved predicting whether a tweet favors a target entity. The Interpretable Semantic Similarity task (Agirre et al., 2016) called for identifying semantic relation types (including opposition) between headlines or captions. Target-specific stance prediction in debates is addressed by Hasan and Ng (2014) and Walker et al. (2012). Fact checking can be viewed as stance toward an event, resulting in research on social media (Lendvai and Reichel, 2016; Mihaylova et al., 2018), politician statements (Vlachos and Riedel, 2014), news articles (Pomerleau and Rao, 2017), and Wikipedia (Thorne et al., 2018).

In computational argumentation mining, identifying claims and other argumentative components is a well-studied task (Stab and Gurevych, 2014). Daxenberger et al. (2017) and Schulz et al. (2018) developed approaches to detect claims across diverse claim detection datasets. Recently, a shared task was developed for argument reasoning comprehension (Habernal et al., 2018). The best system (Choi and Lee, 2018) used models pre-trained on NLI data (Bowman et al., 2015b), which contains contradictions. While this work is concerned with identification of argumentative components, we propose to generate new claims.

In the field of argument generation, Wang and Ling (2016) train neural abstractive summarizers for opinions and arguments. Additional work involved generating opinions given a product rating (Wang and Zhang, 2017). Bilu and Slonim (2016) combine topics and predicates via a template-based classifier; this work involves the generation of claims, but in relation to a topic. Other researchers generated political counter-arguments supported by external evidence (Hua and Wang, 2018) and generated argumentative dialogue by maximizing mutual information (Le et al., 2018). This research considers end-to-end argument generation, which may not be coherent, whereas we focus specifically on contrastive claims.

8 Conclusion

We presented a new source of over 1 million contrastive claim pairs that can be mined from social media sites such as Reddit. We provided an analysis and models to filter noisy training data from 49% down to 25%. We created neural models for generating contrastive claims and obtained significant improvement in automated metrics and human evaluations for Reddit and AMT test data.

Our goal is to incorporate this model into an argumentative dialogue system. In addition to generating claims with a contrasting view, we can also retrieve supporting evidence for the newly-generated claims. Additionally, we plan to experiment with using our model to improve claim detection (Daxenberger et al., 2017) and stance prediction (Bar-Haim et al., 2017); our model could be used to generate artificial data to enhance classification performance on these tasks.

To improve our model, we plan to experiment with retrieval-based approaches to handle low-frequency terms and named entities, as sequence-to-sequence models are likely to have trouble in this environment. One possibility is to incorporate external knowledge with entity linking over Wikipedia articles to find semantically-relevant substitutions.

Another way to improve the model is by introducing controllable generation. One aspect of controllability is intention; our model produces contrastive claims without understanding the view of the original claim. Category embeddings partially address this issue (some labels are "Liberal" or "Conservative"), but labels are not available for all views. Going forward, we hope to classify the viewpoint of the original claim and then generate a claim with a desired orientation. Furthermore, we hope to improve on the generation task by identifying the types of claims we encounter; for example, we may want to change the target in some claims but change the polarity in others.

We also plan to improve the dataset by improving our models for contrastive pair prediction to reduce noise. Finally, we hope that this dataset proves useful for related tasks such as textual entailment (providing examples of contradiction) and argument comprehension (providing counter-examples of arguments), or even unrelated tasks like humor or error correction.

Acknowledgments

We would like to thank the AMT annotators for their work and the anonymous reviewers for their valuable feedback. The authors also thank Alyssa Hwang for providing annotations and Smaranda Muresan, Chris Kedzie, Jessica Ouyang, and Elsbeth Turcan for their helpful comments.
References

Eneko Agirre, Aitor Gonzalez-Agirre, Inigo Lopez-Gazpio, Montse Maritxalar, German Rigau, and Larraitz Uria. 2016. SemEval-2016 Task 2: Interpretable semantic textual similarity. In Proceedings of SemEval-2016.

Khalid Al Khatib, Hinrich Schütze, and Cathleen Kantner. 2012. Automatic detection of point of view differences in Wikipedia. In Proceedings of COLING 2012.

Denis Apothloz, Pierre-Yves Brandt, and Gustavo Quiroz. 1993. The function of negation in argumentation. Journal of Pragmatics, 19:23-38.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of EACL.

Yonatan Bilu, Daniel Hershcovich, and Noam Slonim. 2015. Automatic claim negation: Why, how and when. In Proceedings of the 2nd Workshop on Argumentation Mining.

Yonatan Bilu and Noam Slonim. 2016. Claim synthesis via predicate recycling. In Proceedings of ACL.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In Proceedings of ACL-IJCNLP.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

HongSeok Choi and Hyunju Lee. 2018. GIST at SemEval-2018 Task 12: A network transferring inference knowledge to argument reasoning comprehension task. In Proceedings of SemEval-2018.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP.

Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of EMNLP.

Trudy Govier. 2010. A Practical Study of Argument. Cengage Learning, Wadsworth.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of NAACL-HLT.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of EMNLP.

Xinyu Hua and Lu Wang. 2018. Neural argument generation augmented with externally retrieved evidence. In Proceedings of ACL.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of EMNLP.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Julien Kloetzer, Stijn De Saeger, Kentaro Torisawa, Chikara Hashimoto, Jong-Hoon Oh, Motoki Sano, and Kiyonori Ohtake. 2013. Two-stage method for large-scale acquisition of contradiction pattern pairs using entailment. In Proceedings of EMNLP.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Dieu-Thu Le, Cam Tu Nguyen, and Kim Anh Nguyen. 2018. Dave the debater: A retrieval-based and generative argumentative dialogue agent. In Proceedings of the 5th Workshop on Argument Mining.

Piroska Lendvai and Uwe Reichel. 2016. Contradiction detection for rumorous claims. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM).

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL.

Marie-Catherine de Marneffe, Anna N. Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In Proceedings of ACL-08: HLT.

Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadzhov, and James R. Glass. 2018. Fact checking in community forums. CoRR, abs/1803.03178.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting stance in tweets. In Proceedings of SemEval-2016.

Elena Musi. 2018. How did you change my view? A corpus-based study of concessions' argumentative role. Discourse Studies, 20(2):270-288.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Michael Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In Proceedings of EMNLP.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.

Dean Pomerleau and Delip Rao. 2017. Fake News Challenge.

Alan Ritter, Stephen Soderland, Doug Downey, and Oren Etzioni. 2008. It's a contradiction - no, it's not: A case study using functional relations. In Proceedings of EMNLP.

E. Schiappa and J. P. Nordin. 2013. Argumentation: Keeping Faith with Reason. Pearson Education.

Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting sequence models for sentence correction. In Proceedings of EMNLP.
Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, and Iryna Gurevych. 2018. Multi-task learning for argumentation mining in low-resource settings. CoRR, abs/1804.04083.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of EMNLP.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27.

Yu Takabatake, Hajime Morita, Daisuke Kawahara, Sadao Kurohashi, Ryuichiro Higashinaka, and Yoshihiro Matsuo. 2015. Classification and acquisition of contradictory event pairs using crowdsourcing. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge University Press.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.

Ivan Vulić. 2018. Injecting lexical contrast into word vectors by guiding vector space specialisation. In Proceedings of The Third Workshop on Representation Learning for NLP.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining.

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of ACL.

Marilyn Walker, Pranav Anand, Rob Abbott, and Ricky Grant. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of NAACL-HLT.

Lu Wang and Wang Ling. 2016. Neural network-based abstract generation for opinions and arguments. In Proceedings of NAACL-HLT.

Zhongqing Wang and Yue Zhang. 2017. Opinion recommendation using a neural model. In Proceedings of EMNLP.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280.