Reactive Supervision: A New Method for Collecting Sarcasm Data


Boaz Shmueli¹,²,³,*, Lun-Wei Ku² and Soumya Ray³
¹ Social Networks and Human-Centered Computing, Taiwan International Graduate Program
² Institute of Information Science, Academia Sinica
³ Institute of Service Science, National Tsing Hua University
* Corresponding author: shmueli@iis.sinica.edu.tw

Abstract

Sarcasm detection is an important task in affective computing, requiring large amounts of labeled data. We introduce reactive supervision, a novel data collection method that utilizes the dynamics of online conversations to overcome the limitations of existing data collection techniques. We use the new method to create and release a first-of-its-kind large dataset of tweets with sarcasm perspective labels and new contextual features. The dataset is expected to advance sarcasm detection research. Our method can be adapted to other affective computing domains, thus opening up new research opportunities.

1 Introduction

Sarcasm is ubiquitous in human conversations. As a form of insincere speech, the intent behind a sarcastic utterance is integral to its meaning. Perceiving a sarcastic utterance as genuine will often result in a complete reversal of the intended meaning, and vice versa (Gibbs, 1986). It is therefore crucial for affective computing systems and tasks, such as sentiment analysis and dialogue systems, to automatically detect sarcasm from the perspective of the author as well as the reader in order to avoid misunderstandings. Oprea and Magdy (2019) recently pioneered the study of intended sarcasm (by the author) vs. perceived sarcasm (by the reader) in the context of sarcasm detection tasks. The training of models for these tasks requires large amounts of labeled sarcasm data, with Twitter becoming a major source due to its popularity as a social network as well as the huge amounts of conversational text its users generate. Previous works describe three methods for collecting sarcasm data: distant supervision, manual annotation, and manual collection.

Distant supervision automatically collects "in-the-wild" sarcastic tweets by leveraging author-generated labels such as the #sarcasm hashtag (Davidov et al., 2010; Ptáček et al., 2014). This method generates large amounts of data at low cost, but labels are often noisy and biased (Bamman and Smith, 2015).

To improve quality, manual annotation asks humans to label given tweets as sarcastic or not. Since finding sarcasm in a large corpus is "a needle-in-a-haystack problem" (Liebrecht et al., 2013), manual annotation can be combined with distant supervision (Riloff et al., 2013). Still, low inter-annotator reliability is often reported (Swanson et al., 2014), resulting not only from the subjective nature of sarcasm but also the lack of cultural context (Joshi et al., 2016). Moreover, neither method collects both sarcasm perspectives: distant supervision collects intended sarcasm, while manual annotation can only collect perceived sarcasm.

Lastly, in manual collection, humans are asked to gather and report sarcastic texts, either their own (Oprea and Magdy, 2020) or by others (Filatova, 2012). However, both manual methods are slower and more expensive than distant supervision, resulting in smaller datasets.

To overcome the above limitations, we propose reactive supervision, a novel conversation-based method that offers automated, high-volume, "in-the-wild" collection of high-quality intended and perceived sarcasm data. We use our method to create and release the SPIRS sarcasm dataset¹.

¹ github.com/bshmueli/SPIRS

2 Reactive Supervision

Reactive supervision exploits the frequent use in online conversations of a cue tweet: a reply that highlights sarcasm in a prior tweet. Figure 1 (left panel) shows a typical exchange on Twitter: C posts a sarcastic tweet. Unaware of C's sarcastic intent, B replies with an oblivious tweet. Lastly, A alerts B by replying with a cue tweet (She was just being sarcastic!). Since A replies to B but refers to the sarcastic author in the 3rd person (She), C is necessarily the author of the perceived sarcastic tweet. Similarly, Figure 1 (right panel) shows how a 1st-person cue (I was just being sarcastic!) can be used to unequivocally label intended sarcasm.

To capture sarcastic tweets, we thus first search for cue tweets (using the query phrase "being sarcastic", often used in responses to sarcastic tweets), then carefully examine each cue tweet to identify the corresponding sarcastic tweet. The following formalizes our method.

Figure 1: Conversation threads. Left panel: 3rd-person cue with author sequence ABC (User_C: "The app we use for work emails is not working. I feel terrible about this!"; User_B: "Not your fault. Do not feel guilty!"; User_A, replying to @User_B: "She was just being sarcastic!"). Right panel: 1st-person cue with author sequence ABAC (User_C: "Just watched Forrest Gump. Great film!"; User_A: "So Tom Hanks can act! Who knew???"; User_B: "Literally everyone!!!"; User_A, replying to @User_B: "I was being sarcastic lol").
Person   Example Cue                        Regular Expression     Example Author Sequences
1st      I was only being sarcastic lol     ^A[^A]*(A)[^A]*$       ABA, ABAC, ABAB
2nd      Why are you being sarcastic?       ^AA*(B)A*$             AB, ABA, ABAA
3rd      She was just being sarcastic!      ^AA*B[AB]*(C)[AB]*$    ABC, ABCB, ABAC

Table 1: The three grammatical person classes, with example cue tweets, corresponding regular expressions, and examples of matching author sequences. The bold author letter corresponds to the position of the sarcastic tweet.

2.1 Method

Definitions  We define a thread to be a sequence of tweets {t_n, t_{n-1}, ..., t_1}, where t_{i+1} is a reply to t_i, i = 1, ..., n-1. Tweets are listed in reverse chronological order, with t_1 being the root tweet. The corresponding author sequence is a_n a_{n-1} ... a_1, where we replace the original author names with consecutive capital letters (A, B, C, ...), starting with a_n = A. For example, Figure 1 (right panel) depicts a thread of length n = 4 with author sequence ABAC. Here a_4 = a_2 = A, a_3 = B, and a_1 = C is the author of the root tweet.

Algorithm  Given a thread {t_n, t_{n-1}, ..., t_1} with cue tweet t_n by a_n = A, our aim is to identify the sarcastic tweet among {t_{n-1}, ..., t_1}. We first examine the personal subject pronoun used in the cue (I, you, s/he) and map it to a grammatical person class (1st, 2nd, 3rd). This informs us whether the sarcastic author is also the author of the cue (1st), its addressee (2nd), or another party (3rd). For each person class we then apply a heuristic to identify the sarcastic tweet.

For example, for a 1st-person cue tweet (e.g., I was just being sarcastic!), the sarcastic tweet must also be authored by A. If the earlier tweets in the thread contain exactly one tweet from A, it is unambiguously the sarcastic tweet. Otherwise, if there are two or more earlier tweets from A (or none), the sarcastic tweet cannot be unambiguously pinpointed and the entire thread is discarded. We formalize this rule by requiring the author sequence to match the regular expression /^A[^A]*(A)[^A]*$/, where the capturing group (A) corresponds to the sarcastic tweet². We are able to use regular expressions because we use a string of letters to represent the author sequence. 2nd- and 3rd-person cues produce corresponding rules and patterns. Table 1 lists the three person classes, corresponding regular expressions, and example author sequences.

² We use Perl-Compatible Regular Expressions (PCRE).
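To make this matching step concrete, the following is a minimal Python sketch (not the released SPIRS code): it maps a thread's author list (cue author first) to a letter string and applies a person-class pattern from Table 1 to locate the sarcastic tweet. The helper names and the use of Python's re module in place of PCRE are our own illustrative assumptions.

    import re

    # Person-class patterns from Table 1; the capturing group marks the sarcastic tweet.
    PATTERNS = {
        "1st": r"^A[^A]*(A)[^A]*$",
        "2nd": r"^AA*(B)A*$",
        "3rd": r"^AA*B[AB]*(C)[AB]*$",
    }

    def author_sequence(authors):
        """Map authors (cue author first) to consecutive capital letters, e.g. [a, b, a, c] -> 'ABAC'."""
        mapping, letters = {}, []
        for author in authors:
            if author not in mapping:
                mapping[author] = chr(ord("A") + len(mapping))
            letters.append(mapping[author])
        return "".join(letters)

    def find_sarcastic_index(authors, person_class):
        """Return the index (0 = cue tweet, reverse chronological) of the sarcastic tweet, or None."""
        pattern = PATTERNS.get(person_class)
        if pattern is None:
            return None
        match = re.match(pattern, author_sequence(authors))
        return match.start(1) if match else None

    # Figure 1, right panel: author sequence ABAC with a 1st-person cue -> sarcastic tweet at index 2.
    print(find_sarcastic_index(["user_a", "user_b", "user_a", "user_c"], "1st"))  # 2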
2.2 Advantages

Additional Tweet Types  Along with each sarcastic tweet, we collect the oblivious tweet (the unsuspecting reply to the sarcastic tweet) when available. As far as we know, this is the first work that identifies and collects oblivious texts, a new type of data that can improve research on the (mis)understanding of sarcasm, with applications such as automated assistive systems for people with emotional or cognitive disabilities. If the sarcastic tweet is a reply, we also capture the eliciting tweet, which is the tweet that evoked the sarcastic reply. We provide more details in Appendix A.
Extraction of Semantic Relations  Being able to identify the various tweet types (cue, oblivious, sarcastic, eliciting), reactive supervision can be understood more abstractly as capturing semantic dependency relations between utterances³. Reactive supervision can thus be useful in the context of discourse analysis.

Context-Aware Annotation  Our method uses cues from thread participants, who therefore serve as de facto annotators. As participants are familiar with the conversation's context, we overcome some quality issues of using external annotators, who are often unfamiliar with the conversation context due to cultural and social gaps (Joshi et al., 2016).

Sarcasm Perspective  Previous datasets contain either intended or perceived sarcasm, but not both (Oprea and Magdy, 2019). Our method identifies and labels both intended and perceived sarcasm within the same data context: by their essence, 1st-person cue tweets capture intended sarcasm, while 2nd- and 3rd-person cues capture perceived sarcasm. We label a tweet as perceived sarcasm when at least one reader perceives the tweet as sarcastic and posts a cue tweet. Detecting perceived sarcasm is useful, for example, for training algorithms that flag sensitive texts which might be (mis)perceived as sarcastic (even by a single reader).

Faster Data Collection  We tested González-Ibáñez et al. (2011)'s distant supervision method of collecting tweets ending with #sarcasm and related hashtags, fetching 171 tweets/day on average. During the same period, our method collected 312 tweets/day on average, an 82% rate improvement.

Summary of Advantages  Table 2 summarizes the advantages of our best-of-all-worlds method over other approaches. Reactive supervision offers automated, in-the-wild, and context-aware collection of intended and perceived sarcasm data.

Feature \ Method   Distant Supervision   Manual Annotation   Manual Collection   Reactive Supervision
Automatic          Yes                   No                  No                  Yes
In-the-wild        Yes                   No                  No                  Yes
Oblivious Tweet    No                    No                  No                  Yes
Context-Aware      Yes                   Maybe               Maybe               Yes
Perspective        Intended              Perceived           Either              Both
Samples/Day        171                   Manual              Manual              312

Table 2: Comparison of data collection methods.

³ It is worth noting that Hearst (1992) uses patterns to automatically extract lexical relations between words.

3 SPIRS Dataset

We implemented reactive supervision using a 4-step pipeline (see Algorithm 1):

1. Fetch calls the Twitter Search API to collect cue tweets, using "being sarcastic" as the query.
2. Classify is a rule-based, precision-oriented classifier that classifies cues as 1st-, 2nd-, or 3rd-person according to the referred pronoun (I, you, s/he). If the cue cannot be accurately classified (e.g., a pronoun cannot be found, the cue contains multiple pronouns, negation words are present), the cue is classified as unknown and discarded (see the sketch below).
3. Traverse calls the Twitter Lookup API to retrieve the thread by starting from the cue tweet and repeatedly fetching the parent tweet up to the root tweet.
4. Finally, Match matches the thread's author sequence with the corresponding regular expression. Unmatched sequences are discarded. Otherwise, the sarcastic tweet is identified and saved along with the cue tweet, as well as the eliciting and oblivious tweets when available.

Algorithm 1: Data collection pipeline.
    Result: set S of sarcastic tweets
    S ← {}
    candidates ← Fetch("being sarcastic")
    for cue in candidates do
        switch Classify(cue) do
            case 1st person do  regexp ← ^A[^A]*(A)[^A]*$
            case 2nd person do  regexp ← ^AA*(B)A*$
            case 3rd person do  regexp ← ^AA*B[AB]*(C)[AB]*$
            case unknown do  continue
        end
        {t_n (= cue), t_{n-1}, ..., t_1} ← Traverse(cue)
        a_n a_{n-1} ... a_1 ← authors({t_n, t_{n-1}, ..., t_1})
        if i ← Match(regexp, a_n a_{n-1} ... a_1) then
            S ← S ∪ {t_i}
        end
    end
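The following is a minimal Python sketch of a rule-based cue classifier in the spirit of the Classify step (step 2). The pronoun and negation word lists are illustrative assumptions; the actual rules used to build SPIRS are not reproduced here.

    import re

    # Hypothetical keyword sets; illustrative only, not the rules used for SPIRS.
    PRONOUNS = {
        "1st": {"i", "we"},
        "2nd": {"you"},
        "3rd": {"he", "she", "they"},
    }
    NEGATIONS = {"not", "no", "never"}

    def classify_cue(cue_text):
        """Map a cue tweet to a person class, or 'unknown' if it cannot be classified reliably."""
        tokens = re.findall(r"[a-z']+", cue_text.lower())
        if any(t in NEGATIONS or t.endswith("n't") for t in tokens):
            return "unknown"  # negated cues (e.g., "I'm not being sarcastic") are discarded
        classes = {cls for cls, prons in PRONOUNS.items() if prons & set(tokens)}
        return classes.pop() if len(classes) == 1 else "unknown"  # no pronoun, or several

    print(classify_cue("She was just being sarcastic!"))                   # 3rd
    print(classify_cue("Why are you being sarcastic?"))                    # 2nd
    print(classify_cue("I think you were being sarcastic, or maybe not"))  # unknown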
The pipeline collected 65K cue tweets containing the phrase "being sarcastic" and corresponding threads during 48 days in October and November 2019. 77% of the cues were classified as unknown and discarded, leaving 15 000 English sarcastic tweets. In addition, 10 648 oblivious and 9 156 eliciting tweets were automatically captured. Table 3 summarizes the SPIRS dataset. We added 15 000 negative instances by sampling random English tweets captured during the same period, discarding tweets with sarcasm-related words or hashtags.
                                # Tweets
Person   Perspective   Sarcastic   Oblivious   Eliciting
1st      Intended         10 300       9 065       8 075
2nd      Perceived         3 000          —          842
3rd      Perceived         1 700       1 583         239
Total                     15 000      10 648       9 156

Table 3: SPIRS data breakdown by person class.

Sarcastic tweets can be either root tweets or replies. We found that the majority of intended sarcasm tweets are replies (78.4%), while the majority of perceived sarcasm tweets are root tweets (77.0%). Further dataset statistics on author sequence and tweet position distributions are available in Appendices B and C.

Reliability  To assess our method's reliability in capturing sarcastic tweets, we manually inspected 200 random sarcastic tweets, along with their cue tweets, from each person class. The accuracy of sarcastic tweet labeling was high: 98.5%, 98%, and 97% for 1st-, 2nd-, and 3rd-person cue tweets, respectively. Table 4 shows samples of correct and incorrect cue tweet classifications.

Cue Tweet                                               Pers.   Correct?
Shudda been more clear...I was being sarcastic          1st     Yes
I'm almost always being sarcastic, but this was real    1st     No
Take it you are being sarcastic                         2nd     Yes
You do realize @user was being sarcastic right?         2nd     No
She was being sarcastic. You missed the joke            3rd     Yes
Mind blown. Had no idea he was being sarcastic          3rd     No

Table 4: Correctly and incorrectly classified cue tweets.
4 Experiments and Analysis

We present dataset baselines for three tasks: sarcasm detection, sarcasm detection with conversation context, and sarcasm perspective classification, a new task enabled by our dataset.

4.1 Sarcasm Detection

The first experiment is sarcasm detection. We trained three models: a CNN (100 filters with kernel size 3) and a BiLSTM (100 units), both max-pooled and Adam-optimized with a learning rate of 0.0005; data was preprocessed as described in Tay et al. (2018), and the embedding layer was preloaded with GloVe embeddings (Twitter data, 100 dimensions) (Pennington et al., 2014). We also fine-tuned a pre-trained base uncased BERT model (Devlin et al., 2019). For all three models, we used 5-fold cross-validation for training, holding out 20% of the data for testing.

Results are shown in Table 5 (top panel). BERT is the best performing model, with 70.3% accuracy. We compared SPIRS's classification results to the Ptáček et al. (2014) dataset, commonly used in sarcasm benchmarks. We found that Ptáček's accuracy is significantly higher (86.6%). We posit that this is because sarcasm is confounded with locale in the Ptáček dataset (sarcastic tweets are from worldwide users; non-sarcastic tweets are from users near Prague), and thus classifiers learn features correlated with locale. We tested our hypothesis by replacing our negative samples with Ptáček's, which indeed boosted the accuracy by 19.1%.
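For readers who want to reproduce a comparable baseline, here is a minimal Keras sketch of a BiLSTM model in the spirit of the one described above (100-unit BiLSTM, max pooling, Adam with learning rate 0.0005, 100-dimensional GloVe Twitter embeddings). The vocabulary size, sequence length, and output head are our own assumptions, not values taken from the paper.

    import tensorflow as tf

    VOCAB_SIZE = 20_000  # assumed vocabulary size
    MAX_LEN = 60         # assumed maximum tweet length (tokens)
    EMBED_DIM = 100      # GloVe Twitter embeddings, 100 dimensions

    def build_bilstm_baseline(embedding_matrix=None):
        """Binary sarcasm classifier: embedding -> 100-unit BiLSTM -> max pooling -> sigmoid."""
        inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
        x = tf.keras.layers.Embedding(
            VOCAB_SIZE, EMBED_DIM,
            weights=[embedding_matrix] if embedding_matrix is not None else None)(inputs)
        x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True))(x)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model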

Task                           Dataset              Model                      P           R            F1          Acc           MCC
Sarcasm                        SPIRS                CNN                    67.2 (1.8)   73.6 (5.1)   65.0 (1.2)   65.8 (0.5)   0.308 (0.011)
Detection                      (our dataset)        BiLSTM                 68.9 (2.1)   75.4 (5.5)   67.1 (0.9)   67.9 (0.3)   0.350 (0.008)
                               N =19 384            BERT                   70.1 (1.1)   77.4 (1.2)   69.9 (0.5)   70.3 (0.5)   0.402 (0.008)
                               Ptáček             CNN                    79.1 (0.8)   87.5 (1.3)   77.9 (0.6)   79.2 (0.6)   0.566 (0.012)
                               N =49 766            BiLSTM                 82.4 (1.6)   87.6 (2.9)   80.9 (0.1)   81.7 (0.2)   0.622 (0.002)
                                                    BERT                   87.0 (0.6)   90.9 (0.6)   86.0 (0.2)   86.6 (0.2)   0.721 (0.004)
                               Ptáček (−)         CNN                    84.3 (1.6)   82.6 (2.5)   83.6 (0.8)   83.6 (0.8)   0.673 (0.017)
                               SPIRS (+)            BiLSTM                 86.2 (2.8)   86.7 (2.8)   86.4 (0.7)   86.4 (0.7)   0.729 (0.012)
                               N =21 138∗           BERT                   89.8 (0.7)   89.1 (0.7)   89.4 (0.2)   89.4 (0.2)   0.788 (0.004)
Sarcasm                        SPIRS                3 X BiLSTM             77.7 (1.1)   87.9 (3.5)   68.9 (0.7)   74.8 (0.6)   0.398 (0.007)
Detection                      (our dataset)        w/o eliciting          75.6 (1.1)   91.4 (2.8)   66.3 (1.4)   74.3 (0.3)   0.372 (0.005)
w/ Conversation                N =7 810∗            w/o oblivious          72.4 (2.4)   93.3 (4.5)   58.8 (6.2)   71.4 (1.4)   0.275 (0.053)
Context                                             w/o both               73.2 (2.7)   90.8 (6.6)   60.3 (4.6)   71.2 (0.4)   0.282 (0.033)
Sarcasm                        SPIRS                CNN                    65.5 (1.2)   61.7 (3.3)   64.4 (0.5)   64.5 (0.5)   0.291 (0.009)
Perspective                    (our dataset)        BiLSTM                 66.8 (2.3)   63.1 (5.8)   65.5 (0.7)   65.6 (0.7)   0.315 (0.015)
Classification                 N =6 324∗            BERT                   70.0 (2.9)   63.8 (5.7)   68.0 (1.7)   68.2 (1.6)   0.366 (0.032)

Table 5: Baselines. We report precision, recall, macro-F1, accuracy, and MCC (Matthews correlation coefficient).
Mean and standard deviation were calculated using 5-fold cross-validation. N is the number of instances after
preprocessing. ∗ Dataset classes were balanced using majority class downsampling.

4.2 Detection with Conversation Context

Our second sarcasm classification experiment uses conversation context by adding eliciting and oblivious tweets to the model. As far as we know, this is the first sarcasm-related task that uses oblivious texts. Our model concatenated the outputs of three identical 100-unit BiLSTMs (one per tweet: sarcastic, oblivious, eliciting) before feeding them into dense layers for classification. Tweets without surrounding context were not used in this task. Results are shown in Table 5 (middle panel). Accuracy for the full-context model was 74.7% (MCC 0.398).
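A minimal Keras sketch of such a three-encoder model follows (one 100-unit BiLSTM per tweet, outputs concatenated, then dense layers). The dense layer size and the shared preprocessing constants are assumptions for illustration, not values reported in the paper.

    import tensorflow as tf

    VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20_000, 60, 100  # assumed preprocessing constants

    def encoder():
        """One 100-unit BiLSTM encoder with max pooling; built once per input tweet."""
        inp = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
        x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)
        x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True))(x)
        return inp, tf.keras.layers.GlobalMaxPooling1D()(x)

    # Three identical encoders: sarcastic, oblivious, and eliciting tweets.
    (in_sar, enc_sar), (in_obl, enc_obl), (in_eli, enc_eli) = encoder(), encoder(), encoder()
    merged = tf.keras.layers.Concatenate()([enc_sar, enc_obl, enc_eli])
    hidden = tf.keras.layers.Dense(64, activation="relu")(merged)  # dense layer size is an assumption
    output = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
    model = tf.keras.Model([in_sar, in_obl, in_eli], output)
    model.compile(optimizer=tf.keras.optimizers.Adam(0.0005),
                  loss="binary_crossentropy", metrics=["accuracy"])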
Ablation Study  We conducted context ablation experiments to identify the contribution of each tweet type. We found that removing the eliciting tweets reduces accuracy by 0.5% and MCC by 0.026. Removing the oblivious tweets, however, lowered accuracy by 3.4% to 71.4%, and the MCC dropped significantly by 31%, from 0.398 to 0.275. This illustrates the importance of the new oblivious text data provided in the dataset and suggests its usefulness in sarcasm-related tasks.

4.3 Perspective Classification

Taking advantage of the new labels in our dataset, we propose a new task to classify a sarcastic text's perspective: intended vs. perceived. Our results are displayed in Table 5 (bottom panel), demonstrating the superiority of BERT over the other models, with an accuracy of 68.2% and MCC of 0.366.

Error Analysis  We carefully examined the errors to analyze the causes of perspective misclassification. We observed that misclassified-as-intended tweets (e.g., "You're lost!", "Omg that was so funny") had, on average, almost half the word count of misclassified-as-perceived tweets (17.2 vs. 27.8). We posit that longer, more informative texts make sarcasm easier to perceive; hence, short perceived sarcasm or long intended sarcasm might introduce errors. Analysis of the dataset's word count distribution supports our hypothesis (see Figure 2).

Figure 2: Word count distribution in SPIRS (probability vs. word count, for intended and perceived sarcasm).

Looking for further error sources, we inspected short intended tweets that were misclassified, for example "great friends i have!" and "My mom is so beautiful". These tweets can be read as root tweets and not as replies, yet most intended sarcasm tweets are replies while most perceived sarcasm tweets are root tweets (see Section 3). We hypothesize that the classifier learns discourse-related features (original tweet vs. reply tweet), which can lead to these errors. Further analysis of sarcasm perspective and its interplay with sarcasm pragmatics is a promising avenue for future research.

5 Conclusion

We present an innovative method for collecting sarcasm data that exploits the natural dynamics of online conversations. Our approach has multiple advantages over all existing methods. We used it to create and release SPIRS, a large sarcasm dataset with multiple novel features. These new features, including labels for sarcasm perspective and unique context (e.g., oblivious texts), offer opportunities for advances in sarcasm detection.

Reactive supervision is generalizable. By modifying the cue tweet selection criteria, our method can be adapted to related domains such as sentiment analysis and emotion detection, thereby advancing the quality and quantity of data collection and offering new research directions in affective computing.

Acknowledgements

This research was partially supported by the Ministry of Science and Technology of Taiwan under contracts MOST 108-2221-E-001-012-MY3 and MOST 108-2321-B-009-006-MY2.
References

David Bamman and Noah A. Smith. 2015. Contextualized Sarcasm Detection on Twitter. In Ninth International AAAI Conference on Web and Social Media.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised Recognition of Sarcastic Sentences in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 107–116, Uppsala, Sweden. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Elena Filatova. 2012. Irony and Sarcasm: Corpus Generation and Analysis Using Crowdsourcing. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 392–398, Istanbul, Turkey. European Language Resources Association (ELRA).

Raymond W. Gibbs. 1986. On the psycholinguistics of sarcasm. Journal of Experimental Psychology: General, 115(1):3.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying Sarcasm in Twitter: A Closer Look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 581–586, Portland, Oregon. Association for Computational Linguistics.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.

Aditya Joshi, Pushpak Bhattacharyya, Mark Carman, Jaya Saraswati, and Rajita Shukla. 2016. How Do Cultural Differences Impact the Quality of Sarcasm Annotation?: A Case Study of Indian Annotators and American Text. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 95–99, Berlin, Germany. Association for Computational Linguistics.

Christine Liebrecht, Florian Kunneman, and Antal van den Bosch. 2013. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 29–37, Atlanta, Georgia. Association for Computational Linguistics.

Silviu Oprea and Walid Magdy. 2019. Exploring author context for detecting intended vs perceived sarcasm. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2854–2859, Florence, Italy. Association for Computational Linguistics.

Silviu Oprea and Walid Magdy. 2020. iSarcasm: A Dataset of Intended Sarcasm. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm Detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213–223, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as Contrast between a Positive Sentiment and Negative Situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 704–714, Seattle, Washington, USA. Association for Computational Linguistics.

Reid Swanson, Stephanie Lukin, Luke Eisenberg, Thomas Corcoran, and Marilyn Walker. 2014. Getting reliable annotations for sarcasm in online dialogues. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4250–4257, Reykjavik, Iceland. European Language Resources Association (ELRA).

Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. Reasoning with Sarcasm by Reading In-Between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1010–1020, Melbourne, Australia. Association for Computational Linguistics.
A Search Pattern Production

We construct the regular expression for capturing all tweet types (sarcastic, oblivious, and eliciting) given a 3rd-person cue tweet. Similar logic produces the patterns for 1st- and 2nd-person cues.

The cue tweet author (A) refers to the sarcastic tweet author in the 3rd person (e.g., She was being sarcastic!); we thus assume that A's tweet is a response to a second author B, but refers to a third author C (the sarcastic author). To unambiguously pinpoint the sarcastic tweet, C can only appear once in the author sequence. Moreover, only A, B, and C can participate in the thread. Finally, C's tweet can either be a root tweet or a reply to another tweet. The combination of these constraints leads to the regular expression /^(A)(A*B[AB]*)(C)([AB]*)$/.

(A) is the cue tweet. (A*B[AB]*) forces at least one tweet from B (to which A responded). (C) is the sarcastic tweet. Finally, ([AB]*) represents optional tweets from A or B. If the author sequence matches the regular expression, we can unambiguously identify the sarcastic author and the corresponding sarcastic tweet. We also use the search pattern to find the oblivious and eliciting tweets. We assume that the cue tweet (A) is triggered by an oblivious tweet from B. Thus, if (A*B[AB]*) contains exactly one B, we designate the corresponding tweet as oblivious. Likewise, ([AB]*) contains the eliciting tweet.

Table 6 lists the search patterns for the three person classes. Note that the 2nd-person pattern does not include an oblivious tweet because A's cue tweet is a response to a sarcastic tweet from B, i.e., it is not triggered by an oblivious tweet.

Person   Regular Expression
1st      ^(A)([^A]*)(A)([^A]*)$
2nd      ^(A)A*(B)(A*)$
3rd      ^(A)(A*B[AB]*)(C)([AB]*)$

Table 6: Person classes and their search patterns. The capturing groups correspond to the locations of the cue, oblivious, sarcastic, and eliciting tweets.
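As an illustration of how the capturing groups can be used in practice, here is a minimal Python sketch (our own illustration, not the authors' released code) that applies the 3rd-person pattern to an author sequence and recovers the positions of the cue, oblivious, sarcastic, and eliciting tweets; Python's re module is used in place of PCRE.

    import re

    # 3rd-person search pattern: (A) cue, (A*B[AB]*) segment containing the oblivious
    # tweet, (C) sarcastic tweet, ([AB]*) segment containing the eliciting tweet.
    THIRD_PERSON = re.compile(r"^(A)(A*B[AB]*)(C)([AB]*)$")

    def extract_positions(author_sequence):
        """Return indices (0 = cue tweet, reverse chronological) of the labeled tweets, or None."""
        m = THIRD_PERSON.match(author_sequence)
        if not m:
            return None
        positions = {"cue": m.start(1), "sarcastic": m.start(3)}
        # The oblivious tweet is the single B in group 2, if exactly one is present.
        if m.group(2).count("B") == 1:
            positions["oblivious"] = m.start(2) + m.group(2).index("B")
        # The eliciting tweet is the tweet the sarcastic tweet replies to, if any.
        if m.group(4):
            positions["eliciting"] = m.start(4)
        return positions

    # Example: ABCB, a 3rd-person thread with one extra reply after the sarcastic tweet.
    print(extract_positions("ABCB"))  # {'cue': 0, 'sarcastic': 2, 'oblivious': 1, 'eliciting': 3}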
                                                          Interestingly, only 25% of cue tweets are direct
                                                          replies to their sarcastic targets (lag = 1), while an
B    Author Sequence Distribution                         overwhelming 71% have a lag of 2, mostly reflect-
Table 7 shows the most common author sequences            ing a response to an intermediate oblivious tweet.
in SPIRS. The different colors correspond to the          We further find that the average thread length is 3.9
different tweet types. The most common pattern            tweets, while the average lag is 1.8 tweets.
for 1st-person cues is ABAC (as in Figure 1, right
panel). AB is the most common pattern for 2nd-                              Distance from the root tweet
person cues, which denote a sarcastic root tweet          Cue lag       0        1       2      3       4    5+     Total
followed immediately by a cue tweet (e.g., Why                1      16.5       7.2    0.9     0.3     0.1   0.2       25.1
are you being sarcastic?). For 3rd-person cues, the           2      20.6      30.6   11.4     3.8     1.7   2.3       70.4
most common pattern is ABC (as in Figure 1, left              3+      1.9       1.3    0.7     0.3     0.1   0.2        4.5
panel). Note that some patterns appear in more               Total   39.0      39.1   13.0     4.3     1.9   2.7     100.0
than one person class. For example, ABA appears
in both 1st- and 2nd-person classes, while ABAC           Table 8: % of sarcastic tweets by position (distance
appears in both 1st- and 3rd-person.                      from the root tweet) and cue lag.
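As a worked example of these two definitions, the following short Python snippet computes the position and cue lag for the Figure 1 right-panel thread (author sequence ABAC); the specific indices are read off the figure.

    # Figure 1, right panel: thread of length n = 4 (reverse chronological order ABAC),
    # cue tweet t_4, sarcastic tweet t_2 (i = 2).
    n, i = 4, 2
    position = i - 1  # distance from the root tweet -> 1 (the sarcastic tweet replies to the root)
    cue_lag = n - i   # distance from the cue tweet  -> 2 (an oblivious tweet sits in between)
    print(position, cue_lag)  # 1 2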
