An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog

Page created by Geraldine Myers

Hobbies & Interests

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog

An animated picture says at least a thousand words: Selecting Gif-based
                                                              Replies in Multimodal Dialog

                                                             Xingyao Wang                                     David Jurgens
                                                          University of Michigan                           University of Michigan
                                                         xingyaow@umich.edu                               jurgens@umich.edu

                                                                Abstract                           PizzaMagic: Ahhhhh!!! The EMNLP deadline
                                                                                                   is in 24 hours!!
                                             Online conversations include more than just                x CasualModel:
                                             text. Increasingly, image-based responses
                                             such as memes and animated gifs serve as
arXiv:2109.12212v1 [cs.CL] 24 Sep 2021

                                             culturally recognized and often humorous re-
                                             sponses in conversation. However, while NLP
                                             has broadened to multimodal models, conver-
                                             sational dialog systems have largely focused
                                             only on generating text replies. Here, we intro-
                                             duce a new dataset of 1.56M text-gif conversa-
                                             tion turns and introduce a new multimodal con-
                                             versational model P EPE THE K ING P RAWN
                                                                                                   Figure 1: Gif responses in conversation like the one
                                             for selecting gif-based replies. We demon-
                                                                                                   shown above are embodied dialog that use visual im-
                                             strate that our model produces relevant and
                                                                                                   agery to convey reactions and emotions. This paper de-
                                             high-quality gif responses and, in a large ran-
                                                                                                   velops a system to select the appropriate gif response
                                             domized control trial of multiple models re-
                                                                                                   to messages. (PDF best viewed with Adobe Acrobat)
                                             plying to real users, we show that our model
                                             replies with gifs that are significantly better re-
                                             ceived by the community.
                                                                                                      Conversation analysis is central to NLP and mul-
                                         1   Introduction                                          tiple approaches have analyzed this dialog struc-
                                         Conversations are central to many online social           ture (Jurafsky et al., 1998; Pareti and Lando, 2018;
                                         platforms. While most conversations are text-             Cohn et al., 2019) and developed conversational
                                         based, computer mediated dialog also affords al-          agents to engage with people (e.g., Fang et al.,
                                         ternative forms of communication, such as emoji           2018; Xu et al., 2020; Hong et al., 2020). Recent
                                         or stickers like bitmoji, that allow users to ex-         work has focused on generating open domain social
                                         press themselves (Tang and Hew, 2019; Konrad              chatbots that engage in sustained conversations in
                                         et al., 2020). Increasingly, these visual forms of        a natural way (Ram et al., 2018). Because many
                                         communication have become common in social                of these systems are designed to support voice-
                                         media (Bourlai and Herring, 2014; Highfield and           based dialog, they overlook non-textual forms of
                                         Leaver, 2016), with a notable use of the reaction gif     interaction used in social media conversations. In
                                         (Bakhshi et al., 2016; Miltner and Highfield, 2017).      parallel, multimodal NLP systems have been devel-
                                         These gifs are short video sequences that depict a        oped for image data, often focusing on image-to-
                                         particular scene and sometimes contain text that          text tasks such as image captioning (Melas-Kyriazi
                                         acts as a meta-commentary (Eppink, 2014). As a            et al., 2018; Sharma et al., 2018) and visual ques-
                                         result, conversations become multimodal where in-         tion answering (Antol et al., 2015; Huang et al.,
                                         dividuals reply to one another using combinations         2019; Khademi, 2020). More recent work has fo-
                                         of text and gifs (Figure 1). While conversational         cused on the reverse text-to-image dimension, such
                                         AI systems have been developed in a purely text-          as generating an image from a description (Niu
                                         based setting, such systems do not capture the full       et al., 2020; Ramesh et al., 2021). Our work unites
                                         multimodal behavior seen online. Here, we study           these two strands of research by integrating image-
                                         multimodal conversation by introducing new dialog         based communication into conversational agents.
                                         models for selecting gif replies in conversation.           Our paper offers three main contributions. First,

we propose the new task of selecting gif responses though platforms like Twitter support gif replies,
in multimodal conversation analysis and introduce these gifs are not canonicalized to identify which re-
a new dataset of 1,562,701 real-world conversa- sponses correspond to the same gif. Therefore, we
tion turns with gif replies. Second, we introduce construct a new dataset for this task by collecting
a new model P EPE THE K ING P RAWN that fuses responses, matching their images, and augmenting
image and text-based features to select a relevant this data with metadata about the gif, where possi-
gif response. In in-house experiments, we show ble. A visual description of the whole procedure
that our model substantially outperforms strong can be found in Appendix Figure 7.
baseline models at selecting the exact gif used in
real data and, in a manual test of the quality of the 3.1 Gif Response Data
best responses, achieves an nDCG of 0.8145 on the
annotated test set. Third, in a real-world test, we Gifs have many uses (Miltner and Highfield, 2017)
deploy our model as a part of a large-scale random- and so we use a two-step approach to collect data
ized controlled trial and show that the gif replies that focus specifically on those likely to be used
produced by our model are more highly voted by in conversation. First, gif responses are collected
the community. Data, code, and models are avail- from Twitter by identifying all replies to English-
able at https://github.com/xingyaoww/gif-reply. language tweets containing animated_gif as
embedded media. Tweets were collected from a
2 GIF Communications ∼10% sample of Twitter from March 13th, 2019
to Jan 24th, 2020, totaling 42,096,566 tweets with
Gifs have been widely adopted in communication a gif that we were able to retrieve. Twitter does not
as a natural form of embodied speech where the vi- canonicalize its gifs so two separate gif files may
sual imagery conveys emotions or a reaction as a re- actually have the same imagery. Further, these files
sponse (Bakhshi et al., 2016; Tolins and Samermit, may not be identical due to small differences such
2016). These gifs commonly come from widely- as color variations or aspect ratios. To identify uses
known cultural products, such as movies or televi- of the reference gifs, we use Average Hash from
sion shows, which provides common knowledge the imagehash library to create low-dimensional
for how they could be interpreted (Eppink, 2014; representations of each gif where hash distance
Miltner and Highfield, 2017). However, a single gif corresponds to perceptual distance. Since gifs are
may have multiple interpretations, depending on animated and may contain varying scenes, we com-
the context, cultural knowledge of its content, and pute the hash for the first, middle, and final frames,
the viewer (Jiang et al., 2017). As a result, a single concatenating these into a single hash. Two gifs
gif can serve multiple functions in communication are considered the same if (i) they have identical
(Tolins and Samermit, 2016). hashes or (ii) their hamming distance is < 10 and
Gifs have grown in their use through increas- gifs with that hash have been used more than 500
ing affordances by platforms like Tumblr, Reddit, times in Twitter. This latter condition was selected
Imgur, and Twitter that allow gifs to be natively after manual evaluation of thresholds to trade-off
displayed like text in conversation threads (Jiang between increasing the size of the training data and
et al., 2018). Further, gif-based keyboards have reducing potential noise caused by matching error.
been introduced that allow users to search for gifs A visual example of this process can be found in
that have been tagged with keywords or other meta- Appendix Figure 8.
data (Griggio et al., 2019). Yet, these technologies
Not all gif responses in the Twitter data are con-
require that gif data be prepared with sufficient tags
versational or appropriate for wider re-use. There-
to be searchable or to have sufficient data to use
fore, we filter these responses to only those gifs
collaborative filtering techniques for recommenda-
whose imagery matches gifs hosted by the Giphy
tions (Jiang et al., 2018, p.9). As a result, there is a
website, which is the backend for many gif-based
clear gap in identifying appropriate response gifs
keyboards. Giphy contains a wide collection of gifs
directly from the text, which this work fills.
that are curated to remove content inappropriate for
general use (e.g., violent or sexual imagery). Gifs
3 Data
on the platform are categorized (e.g., “reaction” or
Despite the widespread use of gifs, no standard “celebrities”) and we identify 28 categories con-
dataset exists for text and gif replies. Further, al- taining 972 keywords likely to contain gifs used

used with at least five gifs and where those gifs
have been used at least 1000 times in total; this
process removes many low-frequency tags that are
either overly-specific or idiosyncratic in their use.
Finally, we performed a manual inspection of all
remaining tags to remove tags that are too general
(e.g., “emotion”) and retain only noun, adjective,
and verb tags (words or multi-word expressions)
that describe specific emotions or actions. A total
of 241 unique tags were retained (Appendix C).
6.0% of gifs have at least one tag associated with
Figure 2: The frequency distribution of gifs in our data
them (mean 1.9 tags). However, these tagged gifs
roughly follows a log-normal distribution, with a few account for 38.7% of the replies in our dataset,
gifs used often, while a long tail of gifs are used rela- suggesting tags are only available for more-popular
tively infrequently. gifs. Our dataset represents roughly an order of
magnitude more data and more tags than the closest
related dataset of Chen et al. (2017) that contained
in conversation. A total of 2,095,993 gifs linked 23K gifs with 17 manually-curated emotions.
to those keywords were ultimately retrieved and
stored as image hashes. Additional details of cate- 4 Gif Reply Models
gories and keywords are in Appendix B.
After the matching image hashes to filter replies, We introduce a series of models for producing a gif
we identify 115,586 unique gifs, referred to as ref- response in conversation. Each model will select a
erence gifs, and 1,562,701 tweet replies using one gif from the 115K gifs in our dataset as a response
of these gifs, which forms our official dataset. Fig- to a text-based message. This task is related to but
ure 2 shows these gifs’ frequency in the data; much distinct from work on image-text matching (Lee
like words, a handful of gifs receive widespread et al., 2018), which aims to find an image describ-
use, while a long tail of gifs are rarely used. ing a piece of text, or text-to-image (e.g., Wen et al.,
2015; Xu et al., 2018), which generates an image
3.2 Gif Metadata from a text description. Here, we aim to select gifs
We augment our gif data with information about that reflect natural continuations or reactions to a
their content. Some gifs have text that transcribes message in a dialog, akin to how gifs are used in
what a person is saying in the gif’s scene or is a social media. For all models, additional details on
meta-commentary on the content. This text is ex- the training procedures and hyperparameters are
tracted using paddleOCR (Du et al., 2020). Since provided in Appendix A. The three models that
some gifs are long enough to contain multiple utter- follow use varying degrees of information about
ances, we run OCR on four frames sampled from the gifs and text to select a response.
each quartile of the gif’s length. Roughly 50%
(58,020) of gifs contain at least one extracted word 4.1 Tag-based Predictions
from the selected frames, with an mean of 5.5 ex- The first model uses tags as a shared representation
tracted words per gif across the dataset. for characterizing gifs and text. Analogous to how
Second, some gif repositories like Giphy allow object tags are used as anchor points for image-text
users to tag gifs with information on their content matching (Li et al., 2020) and pivot languages are
or theme, e.g., “face palm” or “movie.” We collect used in machine translation (Cheng et al., 2017),
tags for the 115K reference gifs used in Twitter, ob- we use tags to bridge information between the text
taining 39,651 unique tags. These user-generated in a tweet and the visual content of a gif. Here,
tags were moderately noisy due to orthographic each gif becomes associated with a set of tags de-
variations like spelling, capitalization, and spacing. scribing its conversational functions and for each
Therefore, we merge tags by (i) lower-casing the text, we predict the set of tags for gifs responses to
text and (ii) performing a manual merge for similar it—in essence, predicting what types of responses
word forms (e.g., “excited” and “exciting”). To are most appropriate. We describe both of these
minimize noise, we retain only tags that have been processes next and how gifs are ultimately selected.

Estimating Gif Tags Only 6.0% of the gifs in our given a tweet, we use the trained tweet encoder to
data have associated tags. Therefore we train a extract its representation and compute its cosine
neural model to predict tags using known tags as similarity with each encoded representation for our
training data. To capture any changes in emotion gifs. The gif with the highest cosine similarity is
or imagery across the gif, we make separate pre- returned as the best response.
dictions for four frames sampled across the gif
(the same used in §3.2). Each frame is passed 4.3 P EPE THE K ING P RAWN
through an EfficientNet-based (Tan and Le, 2019) Our final model, K ING P RAWN1 (referred to as
GIF encoder, shown in Figure 3, to extract a low- “P EPE”.) selects gif responses by using a richer
dimensional feature vector from each frame. These set of multimodal features to create a gif represen-
frame embeddings are fused using the attention tation. Rather than encode the gif solely from its
mechanism from a transformer encoder layer. The image content, we use a multimodal encoder that
output of the transformer feeds into a fully con- captures (i) any text it might have, (ii) the types of
nected layer, which is trained as a multi-label clas- objects present in the gif, and (iii) object regions as
sifier using binary cross-entropy to predict which visual features. We encode these gif aspects using
tags should be present. an O SCAR transformer (Li et al., 2020) to create
Predicting Response Tags for Text For each mes- a unified representation, shown in Figure 3 (bot-
sage, we predict the k-hot distribution of tags tom). Object names and regions of interest feature
for a gif response by training a BERTweet model vectors are extracted using a pre-trained bottom-up
(Nguyen et al., 2020), which has been pre-trained attention model (Anderson et al., 2018).
on a large corpus of Twitter data (shown as “Tweet As input to the O SCAR encoder, the captions to
Encoder" in Figure 3). The model with an addi- each of the gif’s four frames are concatenated to-
tional fully connected layer is trained as a multi- gether with an “[INTER_FRAME_SEP]" separator
label classifier using binary cross-entropy, using token. We filter object areas detected by the bottom-
the tags for the gifs used in reply (if known). up attention model (Anderson et al., 2018) and we
Tag-based Gif Selection At inference time, given keep all objects with probability >0.5. We then
a message, we use the text-to-tag model to predict concatenate object names together with the same
a k-hot distribution over tags. Then, we select the inter-frame separator between names of different
gif whose estimated tag distribution is closest in frames. Together, the caption text, object names,
Euclidean distance. and image-region features are fed into the O SCAR
transformer encoder to generate a GIF feature vec-
4.2 CLIP variant tor; the transformer is initialized with the default
O SCAR weights. We use BERTweet to encode text.
The second model uses an end-to-end training ap- The entire P EPE model is trained end-to-end using
proach based on the architecture of OpenAI CLIP contrastive loss, similar to the CLIP model.
(Radford et al., 2021). The architecture features
two encoders, one for text and one for images. Dur- 5 Evaluation
ing training, the encoders are updated using con-
We initially evaluate the methods in two ways.
trastive loss that maximizes the cosine similarity of
First, we use traditional classification-based eval-
paired image-text representations and minimizes
uation, testing whether the models can reproduce
the cosine similarity of random pairs of images and
the observed gif replies. However, some messages
texts. We replicate the CLIP architecture and train-
could have multiple valid gif responses. Therefore,
ing procedure, using BERTweet to encode text and
as a second test, we evaluate the model in a retrieval
EfficientNet (Tan and Le, 2019) to encode a com-
setting, measuring whether its most-probable re-
posite image of four frames from the gif (compared
sponses are good quality for a message.
with BERT and ResNet in their implementation).
Experimental Setup Models are trained and
While originally designed to select an image for a
tested on a dataset containing 1,562,701 Tweet-
text description, our model is trained to select a gif
reply for a text message—a more challenging task 1
K ING P RAWN refers to “selecKting INteresting Gifs for
than the image retrieval task used in the original Personal RespAWNses.” In this crazy muppet-name-land-
grab world we live in, our only regret is that we couldn’t
CLIP setup, as the message may not contain words get “Pepino Rodrigo Serrano Gonzales” to fit as a bacronym,
describing elements of the gif. At inference time, which we leave to future work.

EfficientNet GIF Encoder                                                                    Tweet Encoder

                                                     EfficientNet                     Transformer                              Tweet
                                                      EfficientNet                                                                              BERTweet-base
                                                        EfficientNet
                                                       EfficientNet-b0                  Encoder                            We 're getting
                                                                                         Layer                                                   Transformer
                                                      (shared weight)                                                      married today !

                                                                                                    feature vector for                                          feature vector for
            4 selected frames
                                                                                                    downstream task                                             downstream task
       4 x 224 x 224 (after reshape)
                                                                                                         1 x 512                                                     1 x 512

                                                                                   Oscar GIF Encoder

                                                                    OCR                        extracted caption
                                                                                                Aww , thank you

                                                                                             extracted object names                               Oscar
                                                             Bottom-up Attention                                                                Multimodal
                                                                                   face woman [INTER_FRAME_SEP] face woman
                                                                                                                                               Transformer

                                                                                        extracted object feature vectors                                        feature vector for
                                                                                                                                                                downstream task
                                                                                                                                                                     1 x 512
                      4 selected frames
                 4 x 224 x 224 (after reshape)

                             Figure 3: The different encoder modules used to construct the models in §4.

GIF pairs associated with 115,586 unique gifs,                                                 sufficient content to judge appropriateness indepen-
where 605,063 tweet-gif pairs are associated                                                   dent of the larger social context.
with at least one tag. Using the finalized 241                                                    Two annotators (the authors) were shown a list
unique tags as classes for multi-label classifica-                                             of potential gif responses for a tweet and asked to
tion, we split the dataset by stratify on tags us-                                             judge whether this is an appropriate gif response
ing the iterative train-test split method provided by                                          (a binary rating). Gifs were selected from the ten
scikit-multilearn library (Sechidis et al.,                                                    most-probable replies for each system and collec-
2011; Szymański and Kajdanowicz, 2017) to cre-                                                tively shown in random order to prevent knowing
ate a 80:10:10 train, dev, and test split which                                                which system generated each reply. A total of 2,500
is finalized to train the models described in §4.                                              gif-tweet pairings were annotated. Annotators at-
Following BERTweet (Nguyen et al., 2020), we                                                   tained a Krippendorf’s α of 0.462; while moderate
preprocess tweets in our dataset using NLTK                                                    agreement, this value is expected given known dif-
TweetTokenizer for tokenization, emoji                                                         ferences in how people interpret and value gif re-
package to translate emotion icons, and converted                                              sponses based on their familiarity with its content,
mentions and links to special “@USER" and                                                      message interpretation, and life-experience (Jiang
“HTTPURL" tokens.                                                                              et al., 2018). We follow the evaluation setup from
Annotated Data To test whether each model’s pre-                                               other retrieval-based dialog systems (e.g. Yu et al.,
dictions are valid responses, we annotate the ten                                              2021; Kumar and Callan, 2020) and use normal-
most-probable gif predictions for a subset of the                                              ized Discounted Cumulative Gain (nDCG), which
tweets in our test data. Many tweets in our test set                                           measures whether more appropriate gif responses
require substantial context to understand due to hav-                                          are ranked higher. A gif’s appropriateness score is
ing few tokens, linking to URLs that provide extra                                             the sum of annotators’ ratings.
knowledge, mentioning other users in directed com-                                             Results The P EPE model was able to identify rele-
munication. These factors suggest social context                                               vant and good-quality gif responses, as shown by
or general knowledge aids in the recipient’s under-                                            its performances on the test data (Table 1) and an-
standing of the gif’s intentions. While the model                                              notated data (Table 2). Performance on the test set
can still benefit from training on such examples,                                              is expected to be low, given the challenge of identi-
judging the appropriateness of response is difficult                                           fying the exact gif used for a tweet when multiple
without access to the social context. Therefore, to                                            possible gifs are likely to be equally valid. How-
reduce interpretation ambiguity, we annotate only                                              ever, the P EPE model is still able to identify the
tweets without URLs or user mentions and having                                                exact gif (out of 115K) in its top 10 predictions
at least 10 tokens. This process selects tweets with                                           for 3% of the data, substantially outperforming all

Model                      Top-1        Top-5      Top-10            Model                                  nDCG
  Tag-based               0.000000     0.000092    0.000119            P EPE                                  0.8145
  Random                  0.000020     0.000059    0.000158
  CLIP variant            0.000488     0.001669    0.002783
                                                                       P EPE without object names             0.7665
  Distribution sampling   0.000996     0.005098    0.009780            P EPE without caption                  0.7559
  P EPE                   0.005375     0.018723    0.030918            P EPE without object features          0.7533

Table 1: Models’ precision-at-k on selecting the exact          Table 3: Results for ablated versions of P EPE where
gif used as a response for a tweet in the test set; this per-   specific input is removed (cf. Table 2) show that all in-
formance is an underestimate of each model, as many             put forms contribute to the ability to select replies.
model-predicted gifs may be appropriate.

           Model                        nDCG                    and evaluating the model’s resulting gifs on the
                                                                same test instances.
           Random                       0.3273
           Tag-based                    0.4526                     The ablated model performances, shown in Ta-
           Distribution sampling        0.4969                  ble 3, reveal that each input is useful for selecting
           CLIP variant                 0.5934                  gifs.2 Object features capture visual information
           P EPE                        0.8145                  about what specifically is present in the gif (beyond
                                                                the discrete names of what is present, e.g., “person”
Table 2: Models’ nDCG scores at proposing appropri-             or “building”) and show that multimodality is im-
ate gif replies, measured from annotations on the top           portant for high performance—predicting replies
10 most probable gif replies of each model.                     just from a gif’s caption and categorized content
                                                                are insufficient. Similarly, the caption of a gif (if
                                                                present) is important, as the text can help make
other models.                                                   explicit the intended interpretation of a gif.
   Performance on the annotated data (Table 2) pro-
vides a more realistic assessment of whether mod-               6     Field Experiment
els can generate high-quality replies, as it measures
whether the models’ replies themselves were good.               To test the generalizability of our models and qual-
The P EPE model attains substantially higher per-               ity of their responses, we conduct a large-scale ran-
formance (p

Parent Tweets Tag-based CLIP variant P EPE Dist. Samp. Random
That wonderful feeling
you get when you ar-
rive to a business din-
ner that you’re suppos-
edly paying for...and re-
alize you’ve forgotten
your credit card

I’m convinced some of
y’all don’t get laid

Table 4: Model-selected replies to messages (paraphrased for privacy). Click an image to view the gif on Giphy.

this score in our experiments to evaluate quality. we use Thompson sampling (Russo et al., 2018)
Our experiment focuses on generating Gif-based to randomly select which arm of the trial to use.
replies to top-level text comments (comments made Thompson sampling builds a probability model for
directly to the post). This setup mirrors the conver- the estimated reward of each arm (here, the score a
sational data our models were trained on. Imgur reply receives) and samples from the model such
supports several ways of filtering its stream of posts. that higher-rewarding arms are sampled more fre-
To ensure that our replies have sufficient visibility, quently. As a result, this method can provide tighter
we select posts that have already receive 10 com- estimates for the reward of the most useful arms.
ments and appear in the “most viral” sorting. From Scores in Imgur have a skewed distribution, with
these posts, we reply to the top-rated text comment. few comments receiving very high scores and most
The RCT runs from 8 AM to 8 PM (local time), receiving near the default score (1). Therefore, we
making at most 10 replies per hour. use Poisson Thompson sampling. Some comments
Not all topics or comments are suitable for auto- may be downvoted to receive scores below zero, so
mated responses and great care was taken to pre- for simplicity, we truncate these scores to 0.
vent potential harm to the community. Through We initialize the reward estimates for our ex-
multiple rounds of testing which replies would be periment by selecting one of the five models in
responded to, we curated a list of keywords that a round-robin manner to reply to an Imgur com-
could lead to potential controversial replies, such ment for 3 days. These initial scores act as priors
as terms about religion or race (full list in Ap- for Thompson sampling to update Poisson distri-
pendix D). Any comment containing a token or butions for each model. In the trial, we choose a
lemma matching a word on this list is excluded model by sampling from the up distributions using
and not replied to. As a further safeguard, exper- all previous days’ scores as the prior. The exper-
imenters monitored all replies to remove any that iment ran from April 15th, 2021 to August 30th,
were deemed inappropriate. See the Ethics Section 2021, and models generated a total of 8,369 replies.
(§9) for a longer discussion of safeguards. To evaluate the results of the RCT, we construct
The field experiment consists of five arms, cor- a Negative Binomial regression on the dependent
responding to the three trained models and the two variable of the score received for a model’s reply,
baseline models. During each trial, one model is se- truncating negative scores to zero. The Negative
lected and generates a response; the trained model binomial was chosen instead of Poisson due to
replies with the most probable gif.4 over-dispersion in the score variable. The models
Not all models are equally likely to perform well are treated as a categorical variable, using the ran-
and so to make the most use of our trial budget, dom model as a reference. Since the score will
4
Due to a bug, early experimental trials for the CLIP and depend, in part, on the attention received by the
P EPE models used the tenth most-probable gif; however, using parent post and comment (higher-rated comments
the ratings in the annotated data, a t-test of the difference in are displayed first), we include linear effects for
quality for most- and tenth-most probable gifs showed no
statistically-significant difference in quality for both models the post and parent comment. Finally, we include
(p>0.1). Therefore, we include this data in our results. five text-related variables to control for the con-

CLIP variant           ***                                                   2gG2xiMTtFwsg
                                                                         fnjxvV295sWEJjvwXU
                                                                               BAPSj0xM1cFe8
   Dist. Samp.                                                            iJsvRxNTAcup6DVfLP

                                                           GIPHY GIF ID
                                                                      3oEjHLcg4QMU5umb9m
                                                                               aKrTvuOv4hlKM
          Pepe                                   ***                    3oKIPllDN24q8Awtwc
                                                                         jIu44mYwUItSHTW3tj
      Tag-based                                                            jTrWAzlFGfvVY34PSJ
                                                                          l396L17pwHWOIJrTG
                   0.2         0.0         0.2                                                         60     40    20           0    20
                  Coefficient for GIF Reply Score                                                           GIF Reply Score
                                                                                             (a) Tag-based
Figure 4: Negative Binomial regression coefficients for
each model on predicting a gif reply’s score, using the                   lfesfEtobCSbsHzC8d
random-gif model as the reference category; bars show                  m9d3Xif3ShZ42CxlWP
standard error and *** denotes significance at 0.01.                    f9k1tV7HyORcngKF8v
                                                                            loitbnzQ1JQ8Iizx8w

                                                           GIPHY GIF ID
                                                                                 bfrlODgSLqXxS
                                                                      4HmjGg306HiLHWlm2f
tent of the parent comment: the topic distribution                     7J26CGAahos6d5S1A6
                                                                        8hZ9FMolyKc0X8BSr7
(Appendix Table 9) from a 10-topic model (drop-                       iqkHA3DmB8GjORY030
ping one topic due to collinearity), the sentiment                    OOzcnk3PzLDHqWs6Tb
                                                                                                       30   20     10        0       10    20
and subjectivity of the message estimated using                                                             GIF Reply Score
TextBlob library, the length of the comment, and                                           (b) CLIP variant
whether the comment contained a question.
                                                                                  tnYri4n2Frnig
                                                                      5wWf7GR2nhgamhRnEuA
6.2    Results                                                              5gw0VWGbgNm8w
                                                                              iXTrbbYMQBCMM
                                                           GIPHY GIF ID

The field experiment demonstrates that the P EPE                       65ODCwM00NVmEyLsX3
model is able to generate significantly higher-                           26AHLBZUC1n53ozi8
                                                                         3o8doT9BL7dgtolp7O
scoring responses. Figure 4 shows the Negative                                 Fq6Bdki3coEWQ
Binomial regression coefficients for the three mod-                      3oEjHAUOqG3lSS0f1C
                                                                                KzyMcEfDh4Jiw
els and empirical distribution baseline, with the                                                 40        20           0           20
                                                                                                              GIF Reply Score
random gif model as a reference; full regression
                                                                                                  (c) P EPE
results are shown in Appendix Table 6. The P EPE
model substantially outperforms all other models          Figure 5: Score distributions for most-frequently used
(p0.73 for all models). The most-used gifs
and the reply (Madden, 2018, p.29). Further, we           for each model had average scores that were pos-
observed that, when the model’s reply truly seemed        itive, but the distributions for each gif show that
random, some users replied say they upvoted solely        some uses were occasionally downvoted. This high
because they enjoyed the gif.                             variance in scores indicates that a gif’s intrinsic
   As a follow-up experiment, we tested whether           qualities are not solely responsible for the received
models could be getting higher (or lower) scores by       score and, instead, appropriate use in context is
repeatedly picking the same gifs that are skewed          plays a significant part in community reception.
towards a positive or negative reaction. Figure 5            We examined whether models relied on the same
shows the score distribution for the top ten most fre-    set of gifs. Figure 6 shows the distribution of gif

isting messages from a large social media corpus as
                              2.5                                         Tag-based            potential replies and rank these to select a response.
 log10(frequency of unique gif)

                                                                          CLIP variant         Our work mirrors models that use neural networks
                              2.0                                         Pepe
                                                                          Dist. Samp.          for ranking (Yan et al., 2016; Inaba and Takahashi,
                              1.5                                         random               2016; Penha and Hauff, 2021, e.g.,); however, we
                                                                                               note that many recent knowledge-grounded and
                              1.0                                                              open domain models use encoder-decoder meth-
                              0.5                                                              ods to improve versatility and applicability (e.g.,
                                                                                               Ghazvininejad et al., 2018; Gao et al., 2019; Zhou
                              0.0
                                                                                               et al., 2020). Generative approaches are likely in-
                                  0.0   0.5     1.0      1.5     2.0        2.5          3.0   appropriate for gif-based conversation as gifs are
                                              log10(rank of unique gif)
                                                                                               more akin to mimetic artifacts that build on cultural
Figure 6: Gif use frequency by each model, shown                                               knowledge (Eppink, 2014), making synthesizing a
as frequency-vs-rank log-scaled with first-order line fit                                      new gif from scratch likely less effective.
(jitter added for separation).
                                                                                                  All three models used here rely on joint embed-
                                                                                               ding spaces for gif and text. Multiple works in
uses by each model, indicating that the tag-based                                              NLP have been proposed to align these representa-
model relied frequently on a small set of gifs. How-                                           tions (Kiros et al., 2014; Wang et al., 2016), often
ever, the P EPE and CLIP variant models were sub-                                              for particular applications such as visual question
stantially more varied, indicating they draw from                                              answering (Antol et al., 2015). Recent work has
the long-tail of possible gifs.                                                                focused on embeddings these media with a single
   Do any of our models spark more subsequent                                                  encoder that takes both text and images as input
conversation? We fit a separate Negative Binomial                                              (e.g., Wang et al., 2019; Chen et al., 2020), in con-
regression on the total number of comments made                                                trast to our model that uses separate image and text
to our reply, using the same IVs as the score regres-                                          encoders (Figure 3); these multimodal encoders
sion and include the reply’s score itself as another                                           are prohibitively computationally expensive to use
IV. This model (Appendix Table 8) shows that both                                              in our setting during inference time, as the model
the distributional-sampling baseline and P EPE mod-                                            would need to be run on each gif (and message) to
els produced replies that led to fewer subsequent                                              rank replies, compared with our model that only
comments (p

9 Ethics ate gif, which is mitigated by the use of Giphy to
seed our initial gifs. As this platform is curated
The interactive nature of the RCT necessitated and does not host objectively offensive gifs (e.g.,
a close consideration of ethical issues (Thieltges overly-violent content), our initial gif set is rela-
et al., 2016). Prior to beginning the RCT, the study tively free of objectionable gifs. Because our model
team obtained IRB approval to interact with users. learns directly from gifs’ frequency of use, unless
While necessary in the legal sense, IRB approval objectively offensive gifs are widely used, they are
is not sufficient to justify the ethical grounds of unlikely to be deployed from our RCT; we specu-
the study. The primary risks of the study are if the late that few objectively offensive gifs are widely
automated models respond with an inappropriate used and, in practice, we have not identified any
gif or respond to a message that is not suitable for during the study period or when examining hun-
automated response (e.g., discussing the death of a dreds of random gifs in our data (or used in the
loved one or making an offensive statement). These RCT).
risks were mitigated in multiple ways throughout Finally, one risk is that by learning gif responses
the dataset construction and field experiment. from observed data, our models may reinforce
First, the selection criteria for which comments cultural stereotypes that are encoded in the gifs
we reply to was designed to only reply to content themselves (Erinn, 2019), e.g., the association of
that was already deemed appropriate by the com- African American individuals with strong emotions.
munity. By selecting only posts that had received While our gif data is relatively clean of overtly of-
sufficient upvotes to be called “viral” and were fensive gifs, we acknowledge that our model likely
already receiving comments, we mitigate the risk does inadvertently perpetuate some of these latent
of engaging in topics or conversations that are in- biases in the data. However, the success of our
appropriate according to the norms of the Imgur model suggests a future mitigation strategy for plat-
community, as these posts would be removed by forms suggesting gifs: as biases become known,
moderators or would have received sufficient down- our approach can be used to suggest less-biased
votes to stay in obscurity. gifs as potential responses to mitigate future harm.
Second, by focusing on the top-voted comment
to these posts, we again reply to content that has Acknowledgments
already been deemed high-quality by the comment. We thank the reviewers, area chairs, and senior area
This comment-level criteria substantially lowers chairs for their thoughtful comments and feedback.
the risk of our models commenting on inappropri- We also thank the Blablablab for helpful feedback
ate comments (e.g., a comment insulting another and letting us deploy P EPE to the group’s Slack and
user), as these comments are readily downvoted by putting up with the ridiculous gif replies and Imgur
the community prior to our intervention. for being a wonderful community. This material is
Third, we employed extensive filtering to avoid based upon work supported by the National Science
replying to any comment containing a potentially Foundation under Grant No. 2007251.
sensitive topic, e.g., a discussion of race or trauma
(keywords are listed in Appendix D). The initial set
of keywords was developed through examining po- References
tentially sensitive topics and then iteratively added Peter Anderson, Xiaodong He, Chris Buehler, Damien
to by simulating which messages our RCT would Teney, Mark Johnson, Stephen Gould, and Lei
reply to and examining whether it would be appro- Zhang. 2018. Bottom-up and top-down attention for
priate. During the field RCT, experimenters contin- image captioning and visual question answering. In
2018 IEEE Conference on Computer Vision and Pat-
uously monitored the comments to ensure no harm tern Recognition, CVPR 2018, Salt Lake City, UT,
was being done. Ultimately, only three comments USA, June 18-22, 2018, pages 6077–6086. IEEE
were removed during the initial two days, which Computer Society.
was due to a bug in the lemmatization and these Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-
comments should have been filtered out by our ear- garet Mitchell, Dhruv Batra, C. Lawrence Zitnick,
lier criteria; these comments were removed quickly and Devi Parikh. 2015. VQA: visual question an-
swering. In 2015 IEEE International Conference on
and we did not observe any notable response from
Computer Vision, ICCV 2015, Santiago, Chile, De-
the community. cember 7-13, 2015, pages 2425–2433. IEEE Com-
Fourth, one risk is replying with an inappropri- puter Society.

Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark,
Yale Song, Paloma de Juan, and Joseph Jofish Kaye. Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari
2016. Fast, cheap, and good: Why animated gifs en- Ostendorf. 2018. Sounding board: A user-centric
gage us. In Proceedings of the 2016 CHI Conference and content-driven social chatbot. In Proceedings of
on Human Factors in Computing Systems, San Jose, the 2018 Conference of the North American Chap-
CA, USA, May 7-12, 2016, pages 575–586. ACM. ter of the Association for Computational Linguis-
tics: Demonstrations, pages 96–100, New Orleans,
Michael S Bernstein, Margaret Levi, David Magnus, Louisiana. Association for Computational Linguis-
Betsy Rajala, Debra Satz, and Charla Waeiss. 2021. tics.
Esr: Ethics and society review of artificial intelli-
gence research. ArXiv preprint, abs/2106.11521. Jianfeng Gao, Michel Galley, and Lihong Li. 2019.
Neural Approaches to Conversational AI: Ques-
Fumihiro Bessho, Tatsuya Harada, and Yasuo Ku- tion Answering, Task-oriented Dialogues and Social
niyoshi. 2012. Dialog system using real-time crowd- Chatbots. Now Foundations and Trends.
sourcing and Twitter large-scale corpus. In Proceed-
ings of the 13th Annual Meeting of the Special Inter- Marjan Ghazvininejad, Chris Brockett, Ming-Wei
est Group on Discourse and Dialogue, pages 227– Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and
231, Seoul, South Korea. Association for Computa- Michel Galley. 2018. A knowledge-grounded neural
tional Linguistics. conversation model. In Proceedings of the Thirty-
Second AAAI Conference on Artificial Intelligence,
Elli Bourlai and Susan C Herring. 2014. Multimodal (AAAI-18), the 30th innovative Applications of Arti-
communication on tumblr: “i have so many feels!”. ficial Intelligence (IAAI-18), and the 8th AAAI Sym-
In Proceedings of the 2014 ACM conference on Web posium on Educational Advances in Artificial Intel-
science, pages 171–175. ligence (EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 5110–5117. AAAI Press.
Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind W
Picard. 2017. Gifgif+: Collecting emotional ani- Carla F Griggio, Joanna Mcgrenere, and Wendy E
mated gifs with clustered multi-task learning. In Mackay. 2019. Customizations and expression
2017 Seventh International Conference on Affective breakdowns in ecosystems of communication apps.
Computing and Intelligent Interaction (ACII), pages Proceedings of the ACM on Human-Computer Inter-
510–517. IEEE. action, 3(CSCW):1–26.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Tim Highfield and Tama Leaver. 2016. Instagrammat-
Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and ics and digital methods: Studying visual social me-
Jingjing Liu. 2020. UNITER: Universal Image- dia, from selfies and gifs to memes and emoji. Com-
TExt representation learning. In ECCV. munication research and practice, 2(1):47–62.

Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Chung Hoon Hong, Yuan Liang, Sagnik Sinha Roy,
Wei Xu. 2017. Joint training for pivot-based neural Arushi Jain, Vihang Agarwal, Ryan Draves, Zhizhuo
machine translation. In Proceedings of the Twenty- Zhou, William Chen, Yujian Liu, Martha Miracky,
Sixth International Joint Conference on Artificial In- et al. 2020. Audrey: A personalized open-domain
telligence, IJCAI 2017, Melbourne, Australia, Au- conversational bot. In Alexa Prize Proceedings.
gust 19-25, 2017, pages 3974–3980. ijcai.org.
Pingping Huang, Jianhui Huang, Yuqing Guo, Min
Michelle Cohn, Chun-Yen Chen, and Zhou Yu. 2019. Qiao, and Yong Zhu. 2019. Multi-grained attention
A large-scale user study of an Alexa Prize chatbot: with object-level grounding for visual question an-
Effect of TTS dynamism on perceived quality of so- swering. In Proceedings of the 57th Annual Meet-
cial dialog. In Proceedings of the 20th Annual SIG- ing of the Association for Computational Linguis-
dial Meeting on Discourse and Dialogue, pages 293– tics, pages 3595–3600, Florence, Italy. Association
306, Stockholm, Sweden. Association for Computa- for Computational Linguistics.
tional Linguistics.
Michimasa Inaba and Kenichi Takahashi. 2016. Neural
Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, utterance ranking model for conversational dialogue
Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua systems. In Proceedings of the 17th Annual Meeting
Yang, Qingqing Dang, et al. 2020. PP-OCR: A Prac- of the Special Interest Group on Discourse and Dia-
tical Ultra lightweight OCR system. ArXiv preprint, logue, pages 393–403, Los Angeles. Association for
abs/2009.09941. Computational Linguistics.

Jason Eppink. 2014. A brief history of the gif (so far). Jialun “Aaron" Jiang, Jed R Brubaker, and Casey
Journal of visual culture, 13(3):298–306. Fiesler. 2017. Understanding diverse interpretations
of animated GIFs. In Proceedings of the 2017 CHI
Wong Erinn. 2019. Digital blackface: How 21st cen- Conference Extended Abstracts on Human Factors
tury internet language reinforces racism. in Computing Systems, pages 1726–1732.

Jialun “Aaron" Jiang, Casey Fiesler, and Jed R Conference on Empirical Methods in Natural Lan-
Brubaker. 2018. “The Perfect One” Understanding guage Processing: System Demonstrations, pages 9–
Communication Practices and Challenges with An- 14, Online. Association for Computational Linguis-
imated GIFs. Proceedings of the ACM on human- tics.
computer interaction, 2(CSCW):1–20.
Tianrui Niu, Fangxiang Feng, Lingxuan Li, and Xiaojie
Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, and Wang. 2020. Image synthesis from locally related
Traci Curl. 1998. Lexical, prosodic, and syntactic texts. In Proceedings of the 2020 International Con-
cues for dialog acts. In Discourse Relations and Dis- ference on Multimedia Retrieval, pages 145–153.
course Markers.
Mahmoud Khademi. 2020. Multimodal neural graph Silvia Pareti and Tatiana Lando. 2018. Dialog intent
memory networks for visual question answering. In structure: A hierarchical schema of linked dialog
Proceedings of the 58th Annual Meeting of the Asso- acts. In Proceedings of the Eleventh International
ciation for Computational Linguistics, pages 7177– Conference on Language Resources and Evaluation
7188, Online. Association for Computational Lin- (LREC 2018), Miyazaki, Japan. European Language
guistics. Resources Association (ELRA).

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Gustavo Penha and Claudia Hauff. 2021. On the cal-
Zemel. 2014. Unifying visual-semantic embeddings ibration and uncertainty of neural learning to rank
with multimodal neural language models. ArXiv models for conversational search. In Proceedings
preprint, abs/1411.2539. of the 16th Conference of the European Chapter
of the Association for Computational Linguistics:
Artie Konrad, Susan C Herring, and David Choi. 2020. Main Volume, pages 160–170, Online. Association
Sticker and emoji use in facebook messenger: impli- for Computational Linguistics.
cations for graphicon change. Journal of Computer-
Mediated Communication, 25(3):217–235. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Vaibhav Kumar and Jamie Callan. 2020. Making in- Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
formation seeking easier: An improved pipeline for et al. 2021. Learning transferable visual models
conversational search. In Findings of the Associa- from natural language supervision. ArXiv preprint,
tion for Computational Linguistics: EMNLP 2020, abs/2103.00020.
pages 3971–3980, Online. Association for Compu-
tational Linguistics. Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu
Kuang-Huei Lee, X. Chen, G. Hua, H. Hu, and Xi- Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn,
aodong He. 2018. Stacked cross attention for image- Behnam Hedayatnia, Ming Cheng, Ashish Nagar,
text matching. ArXiv preprint, abs/1803.08024. et al. 2018. Conversational ai: The science behind
the alexa prize. ArXiv preprint, abs/1801.03604.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xi-
aowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott
Li Dong, Furu Wei, et al. 2020. Oscar: Object- Gray, Mark Chen, Rewon Child, Vedant Misra,
semantics aligned pre-training for vision-language Pamela Mishkin, Gretchen Kruegerand Sandhini
tasks. In European Conference on Computer Vision, Agarwal, and Ilya Sutskever. 2021. DALL-E: Cre-
pages 121–137. Springer. ating images from text. https://openai.com/
blog/dall-e/.
John Savery Madden. 2018. The Phenomenological
Exploration of Animated GIF Use in Computer- Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni,
Mediated Communication. Ph.D. thesis, University Ian Osband, and Zheng Wen. 2018. A tutorial on
of Oklahoma. thompson sampling. Foundations and Trends® in
Machine Learning, 11(1):1–96.
Luke Melas-Kyriazi, Alexander Rush, and George Han.
2018. Training for diversity in image paragraph cap-
Konstantinos Sechidis, Grigorios Tsoumakas, and Ioan-
tioning. In Proceedings of the 2018 Conference on
nis Vlahavas. 2011. On the stratification of multi-
Empirical Methods in Natural Language Processing,
label data. Machine Learning and Knowledge Dis-
pages 757–761, Brussels, Belgium. Association for
covery in Databases, pages 145–158.
Computational Linguistics.
Kate M Miltner and Tim Highfield. 2017. Never Piyush Sharma, Nan Ding, Sebastian Goodman, and
gonna GIF you up: Analyzing the cultural signifi- Radu Soricut. 2018. Conceptual captions: A
cance of the animated GIF. Social Media+ Society, cleaned, hypernymed, image alt-text dataset for au-
3(3):2056305117725223. tomatic image captioning. In Proceedings of the
56th Annual Meeting of the Association for Compu-
Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. tational Linguistics (Volume 1: Long Papers), pages
2020. BERTweet: A pre-trained language model 2556–2565, Melbourne, Australia. Association for
for English tweets. In Proceedings of the 2020 Computational Linguistics.

Piotr Szymański and Tomasz Kajdanowicz. 2017. A           Rui Yan, Yiping Song, and Hua Wu. 2016. Learning
   network perspective on stratification of multi-label      to respond with deep neural networks for retrieval-
   data. In First International Workshop on Learn-           based human-computer conversation system. In Pro-
   ing with Imbalanced Domains: Theory and Appli-            ceedings of the 39th International ACM SIGIR con-
   cations, pages 22–35. PMLR.                               ference on Research and Development in Informa-
                                                             tion Retrieval, SIGIR 2016, Pisa, Italy, July 17-21,
Mingxing Tan and Quoc V. Le. 2019. Efficientnet:             2016, pages 55–64. ACM.
  Rethinking model scaling for convolutional neural
  networks. In Proceedings of the 36th International       Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and
  Conference on Machine Learning, ICML 2019, 9-              Zhiyuan Liu. 2021. Few-shot conversational dense
 15 June 2019, Long Beach, California, USA, vol-             retrieval. ArXiv preprint, abs/2105.04166.
  ume 97 of Proceedings of Machine Learning Re-
  search, pages 6105–6114. PMLR.                           Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum.
                                                              2020. The design and implementation of XiaoIce,
Ying Tang and Khe Foon Hew. 2019. Emoticon, emoji,            an empathetic social chatbot. Computational Lin-
  and sticker use in computer-mediated communica-             guistics, 46(1):53–93.
  tion: A review of theories and research findings. In-
  ternational Journal of Communication, 13:27.

Andree Thieltges, Florian Schmidt, and Simon
  Hegelich. 2016. The devil’s triangle: Ethical con-
  siderations on developing bot detection methods. In
  2016 AAAI Spring Symposium Series.

Jackson Tolins and Patrawat Samermit. 2016. Gifs
   as embodied enactments in text-mediated conversa-
   tion. Research on Language and Social Interaction,
   49(2):75–91.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016.
  Learning deep structure-preserving image-text em-
  beddings. In 2016 IEEE Conference on Computer
  Vision and Pattern Recognition, CVPR 2016, Las Ve-
  gas, NV, USA, June 27-30, 2016, pages 5005–5013.
  IEEE Computer Society.

Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng,
  Junjie Yan, Xiaogang Wang, and Jing Shao. 2019.
  CAMP: cross-modal adaptive message passing for
  text-image retrieval. In 2019 IEEE/CVF Interna-
  tional Conference on Computer Vision, ICCV 2019,
  Seoul, Korea (South), October 27 - November 2,
  2019, pages 5763–5772. IEEE.

Miaomiao Wen, Nancy Baym, Omer Tamuz, Jaime
  Teevan, Susan T Dumais, and Adam Kalai. 2015.
  Omg ur funny! computer-aided humor with an ap-
  plication to chat. In ICCC, pages 86–93.

Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanx-
  iang Che, and Ting Liu. 2020. Conversational graph
  grounded policy learning for open-domain conversa-
  tion generation. In Proceedings of the 58th Annual
  Meeting of the Association for Computational Lin-
  guistics, pages 1835–1845, Online. Association for
  Computational Linguistics.

Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han
  Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He.
  2018. Attngan: Fine-grained text to image genera-
  tion with attentional generative adversarial networks.
  In 2018 IEEE Conference on Computer Vision and
  Pattern Recognition, CVPR 2018, Salt Lake City, UT,
  USA, June 18-22, 2018, pages 1316–1324. IEEE
  Computer Society.

Category Subcategory validation macro-f1 was 0.07 achieved at the 70th
epoch.
Cartoons & Comics aqua teen hunger force
Celebrities richard pryor A.2 CLIP variant
Reactions angry The evaluation performance for model selection
Emotions happy is measured by nDCG. For every tweet-gif pair in
Anime bleach the validation set, we measure the top 30 predicted
Art & Design psychedelic GIFs from the model using the tweet as input. The
Nature sunrise relevance of an occurring ground truth gif in the
Transportation bicycle top 30 predictions given a tweet is set 1 for the
nDCG calculation.
Table 5: Examples of GIF categories on GIPHY
CLIP variant is trained on the same finalized
dataset using contrastive loss. It was trained for
A Additional Details on Model Training 16 epochs with a batch size of 16 using AdamW
optimizer of learning rate 1e-5 and weight decay
Following, we provide additional details on how 1e-3. Best validation performance is achieved at
each of the three models was trained. epoch 6 with an nDCG value of 0.015.
We replace the Transformer encoder layer with
A.1 Tag-based Model
a linear Layer on Efficient GIF Encoder from Fig-
EfficientNet-based Tag Classifier Gifs are re- ure 3, and use this as our GIF Encoder for the
shaped to 224 by 224 pixel while keeping the as- CLIP variant. Image inputs to the GIF encoder are
pect ratio by padding and normalized to a mean of normalized following the official CLIP implemen-
0.5 and standard deviation of 0.5 for each channel tation.
before feeding into the EfficientNet-based model.
We selected unique GIFs from the finalized dataset A.3 P EPE
that has at least one associated tag and using the The P EPE model follows most configurations from
iterative train test split on k-hot tag representation the CLIP variant model, but replace the EffcientNet
to select 5% of those GIFs for validation. The Effi- GIF encoder with an Oscar GIF encoder based
cientNet tag classifier was trained for 100 epochs on Oscar pre-trained multi-modal transformer (Li
on a batch size of 32, using AdamW optimizer et al., 2020).
with learning rate 1e-5 and weight decay 1e-3. The Extra metadata are extracted from GIFs in the fi-
best validation performance was achieved at the nalized dataset for further training. Captions within
40th epoch with macro-f1 of 0.30 in predicting 241 the GIF are extracted using PaddleOCR (Du et al.,
multi-label classes. Early experiment shows that 2020), and only extracted text with probability
transformer encoder layer (macro-f1 of 0.30) out greater than 0.9 are kept as caption metadata.
performs linear layer (macro-f1 of 0.19) in fusing Object tags and their corresponding features are
multi-frame gif features on the development set, extracted with bottom-up attention (Anderson et al.,
therefore transformer encoder layer is used to fuse 2018) using py-bottom-up-attention
features of different frames in our implementation. package. Object instances are filtered to only
Tweet-to-tag classifier Using the finalized dataset keep instances that have a score higher than
mentioned in §3, we use tweet as input, and the 0.5, then object tags and their corresponding
k-hot tag representation of that tweet instance as features are extracted from these instances. Final
ground truth label to train the multi-label classifier object features of dimension 2054 are obtained
along with the tweet encoder for 241 classes. Ad- by concatenating feature output with dimension
ditionally, we filter out tweets from the finalized 2048 from Faster-RCNN with scaled box position
dataset that do not have corresponding twitter tags coordinates of the object following (Li et al.,
before training. The model with the best valida- 2020).
tion performance is selected to perform subsequent The P EPE model is trained on the finalized
evaluation and field experiments. The tweet en- dataset with extracted caption and object metadata.
coder was trained for 100 epochs with a batch size It was trained for 16 epochs with a batch size of
of 32. The learning rate was set to 1e-5 with 1e-3 8 using AdamW optimizer of learning rate 1e-6
weight decay using AdamW optimizer. The best and weight decay 1e-3. Preprocessing for GIFs is

Tweet-GIF Pairs
                                     42,096,566 pairs

                               Parent Tweet
                               Having a super great day .

                               Child animated GIF (reply)
                               File: tweet_video/EEqo71PWkAAqGab.mp4
                               AverageHash: 00080c2ce4f0f46410383828c262626300040424b4f4e061

                                           ......
                                 39,401,680 unique GIF files

                                 GIPHY GIF Collections

                                                                                                             Match Animated
                              GIPHY GIF ID: 3ornjLd54I3eQYmfpC                                                GIF with Hash
                              Tags: #high five #late night with seth meyers                      Matched
                              AverageHash: 00080c2ce4f0f46410383828c262726300040424b4f4e061

                              GIPHY GIF ID: 3o6fIQSs4BcsEbDE7S
                              Tags: #real housewives #bravo tv #slice #rhonj                   Not Matched
                              AverageHash: fbf9f35141008c8bfbf9f35151008c89fbf9f35151008c89

                                            ......
                                  2,095,993 unique GIF IDs

                                    Finalized Dataset
               1,562,701 pairs (605,063 pairs associated with selected tags)

                              Parent Tweet: Having a super great day .
                              GIPHY GIF ID: 3ornjLd54I3eQYmfpC
                              Selected Tags: #high five

                                                                                                              GIPHY Tags
                                                                                                               Selection

                              Parent Tweet: We 're getting married today !
                              GIPHY GIF ID: g9582DNuQppxC
                              Selected Tags: #cheers, #drink, #congratulations

                                              ......
                                     115,586 unique GIF IDs

Figure 7: A diagram of the pipeline used to collect, canonicalize, and filter gif-reply data from Twitter.

GIPHY GIF (ID: Y9pd1baXUIJTW)

               Tweet Animated GIF (D0mIkeXX0AIQt3N.mp4)                        Hamming distance = 9
                                                                                   Matched

                                                                                                        Image Average Hash: 6864f6ded2d282001f180b3d1c18f07c083e2f1fdfd08c8e

                                                                                                                       GIPHY GIF (ID: 2zR6G1VD9a1ws)

                                                                              Hamming distance = 51.0
                                                                                  Not Matched

       Image Average Hash: 686cfcfef2d282021f181b3d1c18f07c083e2e1f9fd08c8e

                                                                                                        Image Average Hash: f0f4fcfcf8e060001018383818183018001a1b1b1a18180c

           Figure 8: Matching Animated GIFs from Twitter with GIPHY gifs using Image Average Hash

the same as the Tag-based model. Max sequence                                                 Reactions              thumbs down
length is set to 256 tokens for the Oscar transformer.                                        Reactions              frustrated
Best evaluation performance is achieved at epoch                                              Reactions              oh snap
12 with an nDCG score of 0.007.                                                               Reactions              disgusted
                                                                                              Reactions              rejected
B    GIF categories on GIPHY                                                                  Reactions              embarrassed
                                                                                              Reactions              hug
 Category           Subcategory                                                               Reactions              yolo
 Reactions          what                                                                      Reactions              interested
 Reactions          hair flip                                                                 Reactions              thank you
 Reactions          bored                                                                     Reactions              sarcastic
 Reactions          frown                                                                     Reactions              shocked
 Reactions          slow clap                                                                 Reactions              cool story bro
 Reactions          mic drop                                                                  Reactions              middle finger
 Reactions          goodbye                                                                   Reactions              you got this
 Reactions          meh                                                                       Reactions              whatever
 Reactions          scared                                                                    Reactions              omg
 Reactions          do not want                                                               Reactions              deal with it
 Reactions          confused                                                                  Reactions              sigh
 Reactions          drunk                                                                     Reactions              oops
 Reactions          wow                                                                       Reactions              angry
 Reactions          mad                                                                       Reactions              finger guns
 Reactions          awesome                                                                   Reactions              good luck
 Reactions          please

You can also read