Fixed That for You: Generating Contrastive Claims with Semantic Edits
Christopher Hidey
Department of Computer Science
Columbia University
New York, NY 10027
chidey@cs.columbia.edu

Kathleen McKeown
Department of Computer Science
Columbia University
New York, NY 10027
kathy@cs.columbia.edu
Abstract

Understanding contrastive opinions is a key component of argument generation. Central to an argument is the claim, a statement that is in dispute. Generating a counter-argument then requires generating a response in contrast to the main claim of the original argument. To generate contrastive claims, we create a corpus of Reddit comment pairs self-labeled by posters using the acronym FTFY (fixed that for you). We then train neural models on these pairs to edit the original claim and produce a new claim with a different view. We demonstrate significant improvement over a sequence-to-sequence baseline in BLEU score and a human evaluation for fluency, coherence, and contrast.

1 Introduction

In the Toulmin model (1958), often used in computational argumentation research, the center of the argument is the claim, a statement that is in dispute (Govier, 2010). In recent years, there has been increased interest in argument generation (Bilu and Slonim, 2016; Hua and Wang, 2018; Le et al., 2018). Given an argument, a system that generates counter-arguments would need to 1) identify the claims to refute, 2) generate a new claim with a different view, and 3) find supporting evidence for the new claim. We focus on this second task, which requires an understanding of contrast. A system that can generate claims with different views is a step closer to understanding and generating arguments (Apothloz et al., 1993). We build on previous work in automated claim generation (Bilu et al., 2015) which examined generating opposing claims via explicit negation. However, researchers also noted that not every claim has an exact opposite. Consider a claim from Reddit:

    Get employers out of the business, pass universal single-payer healthcare. (1)

This is an example of a policy claim - a view on what should be done (Schiappa and Nordin, 2013). While negation of this claim is a plausible response (e.g. asserting there should be no change by stating Do not get employers out of the business, do not pass universal healthcare), negation limits the diversity of responses that can lead to a productive dialogue. Instead, consider a response that provides an alternative suggestion:

    Get employers out of the business, deregulate and allow cross-state competition. (2)

In Example 1, the speaker believes in an increased role for government while in Example 2, the speaker believes in a decreased one. As these views are on different sides of the political spectrum, it is unlikely that a single speaker would utter both claims. In related work, de Marneffe et al. (2008) define two sentences as contradictory when they are extremely unlikely to be true simultaneously. We thus define a contrastive claim as one that is likely to be contradictory if made by the speaker of the original claim. Our goal, then, is to develop a method for generating contrastive claims when explicit negation is not the best option. Generating claims in this way also has the benefit of providing new content that can be used for retrieving or generating supporting evidence.

In order to make progress towards generating contrastive responses, we need large, high-quality datasets that illustrate this phenomenon. We construct a dataset of 1,083,520 contrastive comment pairs drawn from Reddit using a predictive model to filter out non-contrastive claims. Each pair contains very similar, partially aligned text but the responder has significantly modified the original post. We use this dataset to model differences in views and generate a new claim given an original comment. The similarity within these pairs
Proceedings of NAACL-HLT 2019, pages 1756-1767, Minneapolis, Minnesota, June 2 - June 7, 2019. © 2019 Association for Computational Linguistics

allows us to use them as distantly labeled contrastive word alignments. The word alignments provide semantic information about which words and phrases can be substituted in context in a coherent, meaningful way.

Our contributions[1] are as follows:

1. Methods and data for contrastive claim identification to mine comment pairs from Reddit, resulting in a large, continuously growing dataset of 1,083,520 distant-labeled examples.

2. A crowd-labeled set of 2,625 comments each paired with 5 new contrastive responses generated by additional annotators.

3. Models for generating contrastive claims using neural sequence models and constrained decoding.

In the following sections, we describe the task methodology and data collection and processing. Next, we present neural models for contrastive claim generation and evaluate our work and present an error analysis. We then discuss related work in contrast/contradiction, argumentation, and generation before concluding.

2 Task Definition and Motivation

Previous work in claim generation (Bilu et al., 2015) focused on explicit negation to provide opposing claims. While negation plays an important role in argumentation (Apothloz et al., 1993), researchers found that explicit negation may result in incoherent responses (Bilu et al., 2015). Furthermore, recent empirical studies have shown that arguments that provide new content (Wachsmuth et al., 2018) tend to be more effective. While new concepts can be introduced in other ways by finding semantically relevant content, we may find it desirable to explicitly model contrast in order to control the output of the model as part of a rhetorical strategy, e.g. concessions (Musi, 2018). We thus develop a model that generates a contrastive claim given an input claim.

Contrastive claims may differ in more than just viewpoint; they may also contain stylistic differences and paraphrases, among other aspects. We thus propose to model contrastive claims by controlling for context and maintaining the same text between pairs of contrastive claims except for the contrastive word or phrase. Much of the previous work in contrast and contradiction has examined the relationship between words or sentences. In order to understand when words and phrases are contrastive in argumentation, we need to examine them in context. For example, consider the claim Hillary Clinton should be president. A reasonable contrastive claim might be Bernie Sanders should be president. (rather than the explicit negation Hillary Clinton should not be president.) In this context, Hillary Clinton and Bernie Sanders are contrastive entities as they were both running for president. However, for the claim Hillary Clinton was the most accomplished Secretary of State in recent memory. they would be unrelated. Consider also that we could generate the claim Hillary Clinton should be senator. This contrastive claim is not coherent given the context. Generating a contrastive claim then requires 1) identifying the correct substitution span and 2) generating a response with semantically relevant replacements.

While some contrastive claims are not coherent, there are often multiple plausible responses, similar to tasks such as dialogue generation. For example, Donald Trump should be president is just as appropriate as Bernie Sanders should be president. We thus treat this as a dialogue generation task where the goal is to generate a plausible response given an input context.

3 Data

In order to model contrastive claims, we need datasets that reflect this phenomenon.

3.1 Collection

We obtain training data by scraping the social media site Reddit for comments containing the acronym FTFY.[2] FTFY is a common acronym meaning "fixed that for you."[3] FTFY responses (hereafter FTFY) are used to respond to another comment by editing part of the "parent comment" (hereafter parent). Most commonly, FTFY is used for three categories of responses: 1) expressing a contrastive claim (e.g. the parent is Bernie Sanders for president and the FTFY is Hillary Clinton should be president) which may be sarcastic (e.g. Ted Cruz for president becomes Zodiac

[1] Data and code available at github.com/chridey/fixedthat
[2] https://en.wiktionary.org/wiki/FTFY
[3] https://reddit.zendesk.com/hc/en-us/articles/205173295-What-do-all-these-acronyms-mean
killer for president) 2) making a joke (e.g. This Python library really piques my interest vs. This really *py*ques my interest), and 3) correcting a typo (e.g. This peaks my interest vs. piques). In Section 3.2, we describe how we identify category 1 (contrastive claims) for modeling.

To obtain historical Reddit data, we mined comments from the site pushshift.io for December 2008 through October 2017. This results in 2,200,258 pairs from Reddit, where a pair consists of a parent and an FTFY. We find that many of the top occurring subreddits are ones where we would expect strong opinions (/r/politics, /r/worldnews, and /r/gaming).

3.2 Classification

To filter the data to only the type of response that we are interested in, we annotated comment pairs for contrastive claims and other types. We use our definition of contrastive claims based on contradiction, where both the parent and FTFY are a claim and they are unlikely to be beliefs held by the same speaker. A joke is a response that does not meaningfully contrast with the parent and commonly takes the form of a pun, rhyme, or oronym. A correction is a response to a typo, which may be a spelling or grammatical error. Any other pair is labeled as "other," including pairs where the parent is not a claim.

In order to identify contrastive claims, we selected a random subset of the Reddit data from prior to September 2017 and annotated 1993 comments. Annotators were native speakers of English and the Inter-Annotator Agreement using Krippendorff's alpha was 0.72. Contrast occurs in slightly more than half of the sampled cases (51.4%), with jokes (23.0%) and corrections (21.2%) comprising about one quarter each. We then train a binary classifier to predict contrastive claims, thus enabling better quality data for the generation task.

To identify the sentence in the parent that the FTFY responds to and derive features for classification, we use an edit distance metric to obtain sentence and word alignments between the parent comment and response. As the words in the parent and response are mostly in the same order and most FTFYs contain significant overlap with the parent response, it is possible to find alignments by moving a sliding window over the parent. A sample of 100 comments verifies that this approach yields exact word alignments in 75 comments and exact sentence alignments in 93.

Given these pairs of comments, we derive linguistic and structural features for training a binary classifier. For each pair of comments, we compute features for the words in the entire comment span and features from the aligned phrases span only (as identified by edit distance). From the aligned phrases we compute the character edit distance and character Jaccard similarity (both normalized by the number of characters) to attempt to capture jokes and typos (the similarity should be high if the FTFY is inventing an oronym or correcting a spelling error). From the entire comment, we use the percentage of characters copied as a low percentage may indicate a poor alignment and the percentage of non-ASCII characters as many of the jokes use emojis or upside-down text. In addition, we use features from GloVe (Pennington et al., 2014) word embeddings[4] for both the entire comment and aligned phrases. We include the percentage of words in the embedding vocabulary for both spans for both the parent and FTFY. The reason for this feature is to identify infrequent words which may be typos or jokes. We compute the cosine similarity of the average word embeddings between the parent and FTFY for both spans. Finally, we use average word embeddings for both spans for both parent and FTFY.

As we want to model the generation of new content, not explicit negation, we removed any pairs where the difference was only "stop words." The set of stop words includes all the default stop words in Spacy[5] combined with expletives and special tokens (we replaced all URLs and usernames). We trained a logistic regression classifier and evaluated using 4-fold cross-validation. We compare to a character overlap baseline where any examples with Jaccard similarity > 0.9 and edit distance < 0.15 were classified as non-contrastive. The goal of this baseline is to illustrate how much of the non-contrastive data involves simple or non-existent substitutions. Results are shown in Table 1. Our model obtains an F-score of 80.25 for an 8 point absolute improvement over the baseline.

3.3 Selection

After using the trained model to classify the remaining data, we have 1,083,797 Reddit pairs. We set aside 10,307 pairs from October 1-20, 2017 for

[4] We found the 50-dimensional Wikipedia+Gigaword embeddings to be sufficient
[5] spacy.io
Model     Precision  Recall  F-score
Majority   51.4      100     67.5
Baseline   67.75     77.19   72.16
LR         74.22     87.60   80.25

Table 1: Results of Identifying Contrastive Claims

development and October 21-30 for test (6,773), with the remainder used for training. As we are primarily working with sentences, the mean parent length was 16.3 and FTFY length was 14.3.

The resulting test FTFYs are naturally occurring and so do not suffer from annotation artifacts. At the same time, they are noisy and may not reflect the desired phenomenon. Thus, we also conducted an experiment on Amazon Mechanical Turk[6] (AMT) to obtain additional gold references, which are further required by metrics such as BLEU (Papineni et al., 2002). We selected 2,625 pairs from the 10 most frequent categories[7] (see Table 2). These categories form a three-level hierarchy for each subreddit and we use the second-level, e.g. for /r/pokemongo the categories are "Pokémon", "Video Games", and "Gaming" so we use "Video Games." Before participating, each annotator was required to pass a qualification test - five questions to gauge their knowledge of that topic. For the movies category, one question we asked was whether for the sentence Steven Spielberg is the greatest director of all time, we could instead use Stanley Kubrick or Paul McCartney. If they passed this test, the annotators were then given the parent comment and "keywords" (the subreddit and three category levels) to provide additional context. We obtained five new FTFYs for each parent and validated them manually to remove obvious spam or trivial negation (e.g. "not" or "can't").

Category     Count   Category    Count
Video Games  1062    Basketball  116
Politics      529    Soccer       99
Football      304    Movies       88
Television    194    Hockey       60
World News    130    Baseball     55

Table 2: Comments for Mechanical Turk

4 Methods

Our goal of generating contrastive claims can be broken down into two primary tasks: 1) identifying the words in the original comment that should be removed or replaced and 2) generating the appropriate substitutions and any necessary context. Initially, we thus experimented with a modular approach by tagging each word in the parent and then using the model predictions to determine if we should copy, delete, or replace a segment with a new word or phrase. We tried the bi-directional LSTM-CNN-CRF model of Ma and Hovy (2016) and used our edit distance word alignments to obtain labels for copying, deleting, or replacing. However, we found this model performed slightly above random predictions, and with error propagation, the model is unlikely to produce fluent and accurate output. Instead, we use an end-to-end approach using techniques from machine translation.

4.1 Model

We use neural sequence-to-sequence encoder-decoder models (Sutskever et al., 2014) with attention for our experiments. The tokens from the parent are passed as input to a bi-directional GRU (Cho et al., 2014) to obtain a sequence of encoder hidden states h_i. Our decoder is also a GRU, which at time t generates a hidden state s_t from the previous hidden state s_{t-1} along with the input. When training, the input x_t is computed from the previous word in the gold training data if we are in "teacher forcing" mode (Williams and Zipser, 1989) and otherwise is the prediction made by the model at the previous time step. When testing, we also use the model predictions. The input word w_t may be augmented by additional features, as discussed in Section 4.2. In the baseline scenario x_t = e(w_t) where e is an embedding. The hidden state s_t is then combined with a context vector h*_t, which is a weighted combination of the encoder hidden states using an attention mechanism:

    h*_t = Σ_i α_ti h_i

To calculate α_ti, we use the attention of Luong et al. (2015) as this encourages the model to select features in the encoder hidden state which correlate with the decoder hidden state, which we want because our input and output are similar. Our experiments on the development data verified this,

[6] We paid annotators the U.S. federal minimum wage and the study was approved by an IRB.
[7] Obtained from the snoopsnoo.com API
as Bahdanau attention (Bahdanau et al., 2015) performed worse. Attention is then calculated as:

    α_ti = exp(h_i^T s_t) / Σ_{i'} exp(h_{i'}^T s_t)

Finally, we make a prediction of a vocabulary word w by using features from the context and decoder hidden state with a projection matrix W and output vocabulary matrix V:

    P(w) = softmax(V tanh(W [s_t; h*_t] + b_w) + b_v)

We explored using a copy mechanism (See et al., 2017) for word prediction but found it difficult to prevent the model from copying the entire input.

4.2 Decoder Representation

Decoder Input: We evaluate two representations of the target input: as a sequence of words and as a sequence of edits. The sequence of words approach is the standard encoder-decoder setup. For the example parent Hillary Clinton for president 2020 and FTFY Bernie Sanders for president we would use the FTFY without modification. Schmaltz et al. (2017) found success modeling error correction using sequence-to-sequence models by representing the target input as a sequence of edits. We apply a similar approach to our problem, generating a target sequence by following the best path in the matrix created by the edit distance algorithm. The new target sequence is the original parent interleaved with "DELETE-N tokens" that specify how many previous words to delete, followed by the newly generated content. For the same example, Hillary Clinton for president 2020, the modified target sequence would be Hillary Clinton DELETE-2 Bernie Sanders for president 2020 DELETE-1.

Counter: Kikuchi et al. (2016) found that by using an embedding for a length variable they were able to control output length via a learned mechanism. In our work, we compute a counter variable which is initially set to the number of new content words the model should generate. During decoding, the counter is decremented if a word is generated that is not in the source input (I) or in the set of stop words (S) defined in Section 3.2. The model uses an embedding e(c_t) for each count, which is parameterized by a count embedding matrix. The input to the decoder state in this scenario is x_t = e(w_t, c_t). At each time step, the count is computed by:

    c_0 = |O \ (S ∪ I)|  or the desired count

    c_{t+1} = c_t - 1,  if w_t ∉ S ∪ I and c_t > 0
              c_t,      otherwise

where O is the set of gold output words in training.

For the parent comment Hillary Clinton for president 2020 and FTFY Bernie Sanders for president, the decoder input is presented with the time t in the first row of Table 3 and the inputs w_t and c_t in the second and third rows, respectively. At the start of decoding, the model expects to generate two new content words, which in this example it generates immediately and decrements the counter. When the counter reaches 0, it only generates stop or input words.

    t    0   1       2        3    4
    w_t  -   Bernie  Sanders  for  president
    c_t  2   1       0        0    0

Table 3: Example of Counter

Unlike the controlled-length scenario, at test time we do not know the number of new content words to generate. However, the count for most FTFYs is between 1 and 5, inclusive, so we can exhaustively search this range during decoding. We experimented with predicting the count but found it to be inaccurate so we leave this for future work.

Subreddit Information: As the model often needs to disambiguate polysemous words, additional context can be useful. Consider the parent comment this is a strange bug. In a programming subreddit, a sarcastic FTFY might be this is a strange feature. However, in a Pokémon subreddit, an FTFY might be this is a strange dinosaur in an argument over whether Armaldo is a bug or a dinosaur. We thus include additional features to be passed to the encoder at each time step, in the form of an embedding g for each of the three category levels obtained in Section 3.3. These embeddings are concatenated to the input word w_t at each timestep, i.e. x_t = e(w_t, g_t^1, g_t^2, g_t^3).

4.3 Objective Function

We use a negative log likelihood objective function L_NLL = - Σ_{t=1..T} log P(w*_t), where w*_t is the gold token at time t, normalized by each batch. We also include an additional loss term that uses the encoder hidden states to make a binary prediction over the input for whether a token will be
copied or inserted/deleted. For the example from Section 4.2, the target for Hillary Clinton for president 2020 would be 0 0 1 1 0. This encourages the model to select features that indicate whether the encoder input will be copied to the output. We use a 2-layer multi-layer perceptron and a binary cross-entropy loss L_BCE. The joint loss is then:

    L = L_NLL + λ L_BCE

4.4 Decoding

We use beam search for generation, as this method has proven effective for many neural language generation tasks. For the settings of the model that require a counter, we expand the beam by count m so that for a beam size k we calculate k * m states.

Filtering: We optionally include a constrained decoding mode where we filter the output based on the counter; when c_t > 1 the end-of-sentence (EOS) score is set to -∞ and when c_t = 0 the score of any word w ∈ V \ (S ∪ I) is set to -∞. The counter c_t is decremented at every time step as in Section 4.2. In other words, when the counter is zero, we only allow the model to copy or generate stop words. When the counter is positive, we prevent the model from ending the sentence before it generates new content words and decrements the counter. The constrained decoding is possible with any combination of settings, with or without the counter embedding.

4.5 Hyper-parameters and Optimization

We used Pytorch (Paszke et al., 2017) for all experiments. We used 300-dimensional vectors for the word embedding and GRU layers. The count embedding dimension was set to 5 with m = 5 and k = 10 for decoding. The category embedding dimensions were set to 5, 10, and 25 for each of the non-subreddit categories. We also set λ = 1 for multi-task learning. We used the Adam optimizer (Kingma and Ba, 2015) with settings of β1 = 0.9, β2 = 0.999, ε = 10^-8 and a learning rate of 10^-3 decaying by γ = 0.1 every epoch. We used dropout (Srivastava et al., 2014) on the embeddings with a probability of 0.2 and teacher forcing with 0.5. We used a batch size of 100 with 10 epochs, selecting the best model on the development set based on perplexity. We set the minimum frequency of a word in the vocabulary to 4.

5 Results

For training, development, and testing we use the data described in Section 3.3. The test reference data consists of the Reddit FTFYs and the FTFYs generated from AMT. We evaluate our models using automated metrics and human judgments.

Automated metrics should reflect our joint goals of 1) copying necessary context and 2) making appropriate substitutions. To address point 1, we use BLEU-4 as a measure of similarity between the gold FTFY and the model output. As the FTFY may contain significant overlap with the parent, BLEU indicates how well the model copies the appropriate context. As BLEU reflects mostly span selection rather than the insertion of new content, we need alternative metrics to address point 2. However, addressing point 2 is more difficult due to the variety of possible substitutions, including named entities. For example, if the parent comment is jaguars for the win! and the gold FTFY is chiefs for the win! but the model produces cowboys for the win! (or any of 29 other NFL teams), most metrics would judge this response incorrectly even though it would be an acceptable response. Thus we present results using both automated metrics and human evaluation. As an approximation to address point 2, we attempt to measure when the model is making changes rather than just copying the input. To this end, we present two additional metrics - novelty, a measure of whether novel content (non-stop word) tokens are generated relative to the parent comment, and partial match, a measure of whether the novel tokens in the gold FTFY match any of the novel tokens in the generated FTFY. To provide a reference point, we find that the partial match between two different gold FTFYs (Reddit and AMT) was 11.4% and BLEU was 47.28, which shows the difficulty of automatic evaluation. The scores are lower than expected because the Reddit FTFYs are noisy due to the process in Section 3.2. This also justifies obtaining the AMT FTFYs.

Results are presented in Table 4. The baseline is a sequence-to-sequence model with attention. For other components, the counter embedding is referred to as "COUNT," the category/subreddit embeddings as "SUB," the sequence of edits as "EDIT," and the multi-task copy loss as "COPY." The models in the top half of the table use constrained decoding and those in the bottom half are unconstrained, to show the learning capabil-
                                       Reddit            AMT
Model                       Novelty  BLEU-4  % Match  BLEU-4  % Match
Constrained
  Baseline                   79.88   18.81    4.67    40.14   10.06
  COUNT                      89.69   22.61    4.72    47.55   12.55
  COUNT + SUB + COPY         90.45   23.13    4.83    50.05   14.92
  EDIT                       64.64   16.12    3.37    35.48    7.33
  EDIT + COUNT + SUB + COPY  82.96   19.37    4.23    42.69   11.62
Unconstrained
  Baseline                    3.34    7.31    0.73    25.83    0.68
  COUNT                      16.19    8.51    1.95    27.68    2.36
  COUNT + SUB + COPY         16.26    9.62    1.93    31.23    3.81
  EDIT                        7.97   35.41    1.57    74.24    1.56
  EDIT + COUNT + SUB + COPY  39.99   32.59    3.25    67.56    6.25

Table 4: Automatic Evaluation
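The novelty and partial-match numbers in Table 4 can be sketched in a few lines. This is not the authors' released implementation: tokenization here is plain whitespace splitting, and STOP_WORDS is a tiny placeholder for the Spacy-derived stop-word list described in Section 3.2.

```python
# Rough sketch of the "novelty" and "partial match" metrics (per pair).
# STOP_WORDS is a small placeholder, not the paper's full stop-word set.
STOP_WORDS = {"the", "a", "an", "and", "for", "to", "of", "is"}

def novel_tokens(text, parent_tokens):
    """Lower-cased content tokens of `text` that do not appear in the parent."""
    return {w for w in text.lower().split()
            if w not in STOP_WORDS and w not in parent_tokens}

def novelty(parent, generated):
    """1 if the generated FTFY introduces any content token absent from the parent."""
    return int(bool(novel_tokens(generated, set(parent.lower().split()))))

def partial_match(parent, gold, generated):
    """1 if any novel content token of the gold FTFY is also a novel
    content token of the generated FTFY."""
    parent_tokens = set(parent.lower().split())
    return int(bool(novel_tokens(gold, parent_tokens)
                    & novel_tokens(generated, parent_tokens)))

parent = "hillary clinton for president 2020"
print(novelty(parent, "bernie sanders for president"))  # -> 1
print(partial_match(parent, "bernie sanders for president",
                    "donald trump for president"))      # -> 0
```

Corpus-level scores like those in the table would presumably be averages of these per-pair indicators over the test set; note how the second call illustrates the "29 other NFL teams" problem, since a plausible substitution that differs from the gold scores zero.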
ities of the models. For each model we compute statistical significance with bootstrap resampling (Koehn, 2004) for the constrained or unconstrained baseline as appropriate and we find the COUNT and EDIT models to be significantly better for constrained and unconstrained decoding, respectively (p < 0.005).

Under constrained decoding, we see that the "COUNT + SUB + COPY" model performs the best in all metrics, although most of the performance can be attributed to the count embedding. When we allow the model to determine its own output, we find that "EDIT + COUNT" performs the best. In particular, this model does well at understanding which part of the context to select, and even does better than other unconstrained models at selecting appropriate substitutions. However, when we combine this model with constrained decoding, the improvement is smaller than for the other settings. We suspect that because the EDIT model often needs to generate a DELETE-N token before a new response, these longer-term dependencies are hard to capture with constrained decoding but easier if included in training.

We also conducted a human evaluation of the model output on the same subset of 2,625 examples described in Section 3.3. We performed an additional experiment on AMT where we asked annotators to rate responses on fluency, coherence, and contrast. Fluency is a measure of the quality of the grammar and syntax and the likelihood that a native English speaker would utter that statement. Coherence is a measure of whether the response makes sense, is semantically meaningful, and would be usable as a response to a claim. Contrast is a measure of how much the response contradicts the original comment. We specified that if the response is different but does not provide a contrasting view it should receive a low rating. Previous work (Bilu et al., 2015) used fluency, clarity/usability (which we combine into coherence), and opposition (where we use contrast).

Model        Fluency  Coherence  Contrast
Gold          4.34     4.26       3.01
Baseline      3.49     3.19       1.94
Constrained   3.46     3.32       2.53
Best          3.52     3.46       2.87

Table 5: Human Evaluation

We used a Likert scale where 5 is strongly agree and 1 is strongly disagree. We used the same data and qualification test from Section 3.3 for each category and used three annotators per example. We asked annotators to judge 4 different pairs: 3 model outputs and the gold Reddit[8] FTFYs for comparison. We include the baseline, the baseline with constrained decoding, and the best constrained model ("COUNT + SUB + COPY") according to BLEU and partial match. We verified that the annotators understood how to rate contrast by examining the distribution of responses: the annotators selected option 3 (neither) 15% of the time and preferred to select either extreme, 5 (21%) or 1 (27%). Results are presented in Table 5, showing a clear preference for the best model. Note the degradation in fluency for the constrained baseline, as the model is prevented from generating the EOS token and may repeat tokens up to the

[8] We did not evaluate the AMT FTFYs as these were generated by the same pool of annotators.
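The bootstrap resampling significance test used for the automatic results (in the style of Koehn, 2004) can be sketched as follows. This is a simplified paired bootstrap over per-example scores; the paper's exact test statistic, number of resamples, and seed are not specified, so those details here are illustrative assumptions.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling over per-example scores (Koehn, 2004 style).

    Returns the fraction of resamples in which system A's mean score does not
    exceed system B's; a small value suggests A is significantly better.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    losses = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement; the SAME indices are
        # used for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a <= mean_b:
            losses += 1
    return losses / n_resamples

# Toy per-example scores (e.g. sentence-level BLEU): system A is higher
# on every example, so every resample favors A.
a = [0.4, 0.5, 0.45, 0.6, 0.55, 0.5, 0.48, 0.52]
b = [0.3, 0.35, 0.4, 0.45, 0.42, 0.38, 0.33, 0.41]
p = paired_bootstrap(a, b)
print(p < 0.05)  # True for these clearly separated scores
```

Koehn's original procedure compares aggregate corpus BLEU on each resample rather than a mean of per-sentence scores; the per-example version above keeps the sketch self-contained.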
maximum length.

6 Qualitative Analysis

We provide three examples of the model output in Table 6 with the first and third from the News and Politics category, demonstrating how the model handles different types of input. In the first example, the contrast is between allowing markets to regulate themselves versus an increased role of government. In the second example, the contradiction is due to the choice of operating system. In the third (invalid) example, the model responds to a sarcastic claim with another right-wing news organization; this response is not a contradiction since it is plausible the original speaker would also utter this statement.

Parent: ah yes the wonders of the free market
Model: ah yes the wonders of government intervention

Parent: i know that this is an unofficial mod, but xp is the best os for this machine
Model: linux is the best os for this machine

Parent: that 's why it 's important to get all your propaganda from infowars and brietbart
Model: propaganda from fox news outlets

Table 6: Model Output

6.1 Error Analysis

We conduct an error analysis by selecting 100 responses where the model did not partially match any of the 6 gold responses and we found 6 main types of errors. One error is the model identifying an incorrect substitution span while the human responses all selected a different span to replace. We noticed that this occurred 5 times and may require world knowledge to understand which tokens to select. For example, in response to the claim Hillary Clinton could have been president if not for robots, the model generates Donald Trump in place of Hillary Clinton, whereas the gold responses generate humans / votes / Trump's tweets in place of robots. Another type of error is when the responses are not coherent with the parent and the language model instead determines the token selection based on the most recent context (11

clue what i 'm talking about . We also found examples where the model chose poorly due to unfiltered jokes or errors in the training data (12 in total). In 15 cases, due to the constrained decoding the model repeated a word until the maximum length or appended an incoherent phrase. For the most common error, the model made a substitution that was not contrasting as in Table 6 (19 examples). Finally, we found 38 of the samples were valid responses, but did not match the gold, indicating the difficulty of automatic evaluation. For example, in response to the claim Nintendo is the only company that puts customers over profits, the model replaces Nintendo with Rockstar (both video game companies) while the gold FTFYs had other video game companies.

7 Related Work

Understanding contrast and contradiction is key to argumentation as it requires an understanding of differing points-of-view. Recent work examined the negation of claims via explicit negation (Bilu et al., 2015). Other work investigated the detection of different points-of-view in opinionated text (Al Khatib et al., 2012; Paul et al., 2010). Wachsmuth et al. (2017; 2018) retrieved arguments for and against a particular stance using online debate forums. In non-argumentative text, researchers predicted contradictions for types such as negation, antonyms, phrasal, or structural (de Marneffe et al., 2008) or those that can be expressed with functional relations (Ritter et al., 2008). Other researchers have incorporated entailment models (Kloetzer et al., 2013) or crowdsourcing methods (Takabatake et al., 2015). Contradiction has also become a part of the natural language inference (NLI) paradigm, with datasets labeling contradiction, entailment, or neutral (Bowman et al., 2015a; Williams et al., 2018). The increase in resources with contrast and contradiction has resulted in new representations with contrastive meaning (Chen et al., 2015; Nguyen et al., 2016; Vulić, 2018; Conneau et al., 2017). Most of this work has focused on identifying contrast or contradiction while we aim to generate contrast. Furthermore, while contradiction and contrast are present in these corpora, we obtain distant-labeled
cases). For example, given the claim bb-8 gets a alignments for contrast at the word and phrase
girlfriend and poe still does n’t have a girlf :’) the level. Our dataset also includes contrastive con-
Reddit FTFY has boyf instead of girlf whereas the cepts and entities while other corpora primarily
model generates ... and poe still does n’t have a contain antonyms and explicit negation.
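The "partial match" evaluation used in the error analysis above can be made concrete. The paper does not spell out the exact matching criterion, so the following is a minimal sketch under the assumption that matching compares only the tokens each response substitutes relative to the parent claim; the Rockstar example mirrors the one discussed, but the gold responses here are invented for illustration.

```python
# Sketch (not the authors' actual criterion) of a "partial match" check:
# treat the tokens a response introduces relative to the parent claim as its
# substitution, and count a match only if the model's substitution overlaps
# the substitution of at least one gold FTFY response.

def novel_tokens(parent, response):
    """Tokens in the response that do not appear in the parent claim."""
    return set(response.lower().split()) - set(parent.lower().split())

def matches_any_gold(parent, response, gold_responses):
    """True if the response's substituted tokens overlap any gold substitution."""
    novel = novel_tokens(parent, response)
    return any(novel & novel_tokens(parent, g) for g in gold_responses)

parent = "nintendo is the only company that puts customers over profits"
model_out = "rockstar is the only company that puts customers over profits"
# Hypothetical gold FTFYs naming other companies (the real gold responses
# are not listed in the excerpt):
gold = [
    "sony is the only company that puts customers over profits",
    "valve is the only company that puts customers over profits",
]

# The model's substitution ("rockstar") is a plausible contrast but never
# appears in the gold responses, so automatic matching counts it as an error:
print(matches_any_gold(parent, model_out, gold))  # False
```

Under a criterion like this, a perfectly valid contrastive substitution still counts as an error whenever the gold FTFYs happen to pick different entities, which is exactly the difficulty of automatic evaluation that the analysis points out.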
Contrast also appears in the study of stance, where the opinion towards a target may vary. The SemEval 2016 Stance Detection for Twitter task (Mohammad et al., 2016) involved predicting if a tweet favors a target entity. The Interpretable Semantic Similarity task (Agirre et al., 2016) called to identify semantic relation types (including opposition) between headlines or captions. Target-specific stance prediction in debates is addressed by Hasan and Ng (2014) and Walker et al. (2012). Fact checking can be viewed as stance toward an event, resulting in research on social media (Lendvai and Reichel, 2016; Mihaylova et al., 2018), politician statements (Vlachos and Riedel, 2014), news articles (Pomerleau and Rao, 2017), and Wikipedia (Thorne et al., 2018).

In computational argumentation mining, identifying claims and other argumentative components is a well-studied task (Stab and Gurevych, 2014). Daxenberger et al. (2017) and Schulz et al. (2018) developed approaches to detect claims across diverse claim detection datasets. Recently, a shared task was developed for argument reasoning comprehension (Habernal et al., 2018). The best system (Choi and Lee, 2018) used models pre-trained on NLI data (Bowman et al., 2015b), which contains contradictions. While this work is concerned with identification of argumentative components, we propose to generate new claims.

In the field of argument generation, Wang and Ling (2016) train neural abstractive summarizers for opinions and arguments. Additional work involved generating opinions given a product rating (Wang and Zhang, 2017). Bilu and Slonim (2016) combine topics and predicates via a template-based classifier. This work involves the generation of claims but in relation to a topic. Other researchers generated political counter-arguments supported by external evidence (Hua and Wang, 2018) and generated argumentative dialogue by maximizing mutual information (Le et al., 2018). This research considers end-to-end argument generation, which may not be coherent, whereas we focus specifically on contrastive claims.

8 Conclusion

We presented a new source of over 1 million contrastive claim pairs that can be mined from social media sites such as Reddit. We provided an analysis and models to filter noisy training data from 49% down to 25%. We created neural models for generating contrastive claims and obtained significant improvement in automated metrics and human evaluations for Reddit and AMT test data.

Our goal is to incorporate this model into an argumentative dialogue system. In addition to generating claims with a contrasting view, we can also retrieve supporting evidence for the newly-generated claims. Additionally, we plan to experiment with using our model to improve claim detection (Daxenberger et al., 2017) and stance prediction (Bar-Haim et al., 2017). Our model could be used to generate artificial data to enhance classification performance on these tasks.

To improve our model, we plan to experiment with retrieval-based approaches to handle low-frequency terms and named entities, as sequence-to-sequence models are likely to have trouble in this environment. One possibility is to incorporate external knowledge with entity linking over Wikipedia articles to find semantically-relevant substitutions.

Another way to improve the model is by introducing controllable generation. One aspect of controllability is intention; our model produces contrastive claims without understanding the view of the original claim. Category embeddings partially address this issue (some labels are "Liberal" or "Conservative"), but labels are not available for all views. Going forward, we hope to classify the viewpoint of the original claim and then generate a claim with a desired orientation. Furthermore, we hope to improve on the generation task by identifying the types of claims we encounter. For example, we may want to change the target in some claims but in others change the polarity.

We also plan to improve the dataset by improving our models for contrastive pair prediction to reduce noise. Finally, we hope that this dataset proves useful for related tasks such as textual entailment (providing examples of contradiction) and argument comprehension (providing counter-examples of arguments) or even unrelated tasks like humor or error correction.

Acknowledgments

We would like to thank the AMT annotators for their work and the anonymous reviewers for their valuable feedback. The authors also thank Alyssa Hwang for providing annotations and Smaranda Muresan, Chris Kedzie, Jessica Ouyang, and Elsbeth Turcan for their helpful comments.
References

Eneko Agirre, Aitor Gonzalez-Agirre, Inigo Lopez-Gazpio, Montse Maritxalar, German Rigau, and Larraitz Uria. 2016. SemEval-2016 task 2: Interpretable semantic textual similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 512–524. Association for Computational Linguistics.

Khalid Al Khatib, Hinrich Schütze, and Cathleen Kantner. 2012. Automatic detection of point of view differences in Wikipedia. In Proceedings of COLING 2012, pages 33–50. The COLING 2012 Organizing Committee.

Denis Apothloz, Pierre-Yves Brandt, and Gustavo Quiroz. 1993. The function of negation in argumentation. Journal of Pragmatics, 19:23–38.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations.

Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261, Valencia, Spain. Association for Computational Linguistics.

Yonatan Bilu, Daniel Hershcovich, and Noam Slonim. 2015. Automatic claim negation: Why, how and when. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 84–93. Association for Computational Linguistics.

Yonatan Bilu and Noam Slonim. 2016. Claim synthesis via predicate recycling. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 525–530, Berlin, Germany. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015. Revisiting word embedding for contrasting meaning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 106–115. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

HongSeok Choi and Hyunju Lee. 2018. GIST at SemEval-2018 task 12: A network transferring inference knowledge to argument reasoning comprehension task. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 773–777. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics.

Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2055–2066. Association for Computational Linguistics.

Trudy Govier. 2010. A Practical Study of Argument. Cengage Learning, Wadsworth.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1930–1940. Association for Computational Linguistics.

Kazi Saidul Hasan and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762. Association for Computational Linguistics.

Xinyu Hua and Lu Wang. 2018. Neural argument generation augmented with externally retrieved evidence. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10, Melbourne, Australia. Association for Computational Linguistics.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Julien Kloetzer, Stijn De Saeger, Kentaro Torisawa, Chikara Hashimoto, Jong-Hoon Oh, Motoki Sano, and Kiyonori Ohtake. 2013. Two-stage method for large-scale acquisition of contradiction pattern pairs using entailment. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 693–703. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Dieu-Thu Le, Cam Tu Nguyen, and Kim Anh Nguyen. 2018. Dave the debater: A retrieval-based and generative argumentative dialogue agent. In Proceedings of the 5th Workshop on Argument Mining, pages 121–130. Association for Computational Linguistics.

Piroska Lendvai and Uwe Reichel. 2016. Contradiction detection for rumorous claims. In Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM), pages 31–40. The COLING 2016 Organizing Committee.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Anna N. Rafferty, and Christopher D. Manning. 2008. Finding contradictions in text. In Proceedings of ACL-08: HLT, pages 1039–1047, Columbus, Ohio. Association for Computational Linguistics.

Tsvetomila Mihaylova, Preslav Nakov, Lluís Màrquez, Alberto Barrón-Cedeño, Mitra Mohtarami, Georgi Karadzhov, and James R. Glass. 2018. Fact checking in community forums. CoRR, abs/1803.03178.

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41. Association for Computational Linguistics.

Elena Musi. 2018. How did you change my view? A corpus-based study of concessions argumentative role. Discourse Studies, 20(2):270–288.

Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 454–459. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Michael Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 66–76. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Dean Pomerleau and Delip Rao. 2017. Fake news challenge.

Alan Ritter, Stephen Soderland, Doug Downey, and Oren Etzioni. 2008. It's a contradiction – no, it's not: A case study using functional relations. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 11–20. Association for Computational Linguistics.

E. Schiappa and J.P. Nordin. 2013. Argumentation: Keeping Faith with Reason. Pearson Education.

Allen Schmaltz, Yoon Kim, Alexander Rush, and Stuart Shieber. 2017. Adapting sequence models for sentence correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2807–2813, Copenhagen, Denmark. Association for Computational Linguistics.
Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, and Iryna Gurevych. 2018. Multi-task learning for argumentation mining in low-resource settings. CoRR, abs/1804.04083.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958.

Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 46–56, Doha, Qatar. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Yu Takabatake, Hajime Morita, Daisuke Kawahara, Sadao Kurohashi, Ryuichiro Higashinaka, and Yoshihiro Matsuo. 2015. Classification and acquisition of contradictory event pairs using crowdsourcing. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pages 99–107. Association for Computational Linguistics.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. Association for Computational Linguistics.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge At the University Press.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22. Association for Computational Linguistics.

Ivan Vulić. 2018. Injecting lexical contrast into word vectors by guiding vector space specialisation. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 137–143. Association for Computational Linguistics.

Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of the 4th Workshop on Argument Mining, pages 49–59, Copenhagen, Denmark. Association for Computational Linguistics.

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251. Association for Computational Linguistics.

Marilyn Walker, Pranav Anand, Rob Abbott, and Ricky Grant. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 592–596. Association for Computational Linguistics.

Lu Wang and Wang Ling. 2016. Neural network-based abstract generation for opinions and arguments. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 47–57, San Diego, California. Association for Computational Linguistics.

Zhongqing Wang and Yue Zhang. 2017. Opinion recommendation using a neural model. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1627–1638. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comput., 1(2):270–280.