An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
An animated picture says at least a thousand words: Selecting Gif-based
Replies in Multimodal Dialog
Xingyao Wang David Jurgens
University of Michigan University of Michigan
xingyaow@umich.edu jurgens@umich.edu
Abstract PizzaMagic: Ahhhhh!!! The EMNLP deadline
is in 24 hours!!
Online conversations include more than just x CasualModel:
text. Increasingly, image-based responses
such as memes and animated gifs serve as
arXiv:2109.12212v1 [cs.CL] 24 Sep 2021
culturally recognized and often humorous re-
sponses in conversation. However, while NLP
has broadened to multimodal models, conver-
sational dialog systems have largely focused
only on generating text replies. Here, we intro-
duce a new dataset of 1.56M text-gif conversa-
tion turns and introduce a new multimodal con-
versational model P EPE THE K ING P RAWN
Figure 1: Gif responses in conversation like the one
for selecting gif-based replies. We demon-
shown above are embodied dialog that use visual im-
strate that our model produces relevant and
agery to convey reactions and emotions. This paper de-
high-quality gif responses and, in a large ran-
velops a system to select the appropriate gif response
domized control trial of multiple models re-
to messages. (PDF best viewed with Adobe Acrobat)
plying to real users, we show that our model
replies with gifs that are significantly better re-
ceived by the community.
Conversation analysis is central to NLP and mul-
1 Introduction tiple approaches have analyzed this dialog struc-
Conversations are central to many online social ture (Jurafsky et al., 1998; Pareti and Lando, 2018;
platforms. While most conversations are text- Cohn et al., 2019) and developed conversational
based, computer mediated dialog also affords al- agents to engage with people (e.g., Fang et al.,
ternative forms of communication, such as emoji 2018; Xu et al., 2020; Hong et al., 2020). Recent
or stickers like bitmoji, that allow users to ex- work has focused on generating open domain social
press themselves (Tang and Hew, 2019; Konrad chatbots that engage in sustained conversations in
et al., 2020). Increasingly, these visual forms of a natural way (Ram et al., 2018). Because many
communication have become common in social of these systems are designed to support voice-
media (Bourlai and Herring, 2014; Highfield and based dialog, they overlook non-textual forms of
Leaver, 2016), with a notable use of the reaction gif interaction used in social media conversations. In
(Bakhshi et al., 2016; Miltner and Highfield, 2017). parallel, multimodal NLP systems have been devel-
These gifs are short video sequences that depict a oped for image data, often focusing on image-to-
particular scene and sometimes contain text that text tasks such as image captioning (Melas-Kyriazi
acts as a meta-commentary (Eppink, 2014). As a et al., 2018; Sharma et al., 2018) and visual ques-
result, conversations become multimodal where in- tion answering (Antol et al., 2015; Huang et al.,
dividuals reply to one another using combinations 2019; Khademi, 2020). More recent work has fo-
of text and gifs (Figure 1). While conversational cused on the reverse text-to-image dimension, such
AI systems have been developed in a purely text- as generating an image from a description (Niu
based setting, such systems do not capture the full et al., 2020; Ramesh et al., 2021). Our work unites
multimodal behavior seen online. Here, we study these two strands of research by integrating image-
multimodal conversation by introducing new dialog based communication into conversational agents.
models for selecting gif replies in conversation. Our paper offers three main contributions. First,we propose the new task of selecting gif responses though platforms like Twitter support gif replies,
in multimodal conversation analysis and introduce these gifs are not canonicalized to identify which re-
a new dataset of 1,562,701 real-world conversa- sponses correspond to the same gif. Therefore, we
tion turns with gif replies. Second, we introduce construct a new dataset for this task by collecting
a new model P EPE THE K ING P RAWN that fuses responses, matching their images, and augmenting
image and text-based features to select a relevant this data with metadata about the gif, where possi-
gif response. In in-house experiments, we show ble. A visual description of the whole procedure
that our model substantially outperforms strong can be found in Appendix Figure 7.
baseline models at selecting the exact gif used in
real data and, in a manual test of the quality of the 3.1 Gif Response Data
best responses, achieves an nDCG of 0.8145 on the
annotated test set. Third, in a real-world test, we Gifs have many uses (Miltner and Highfield, 2017)
deploy our model as a part of a large-scale random- and so we use a two-step approach to collect data
ized controlled trial and show that the gif replies that focus specifically on those likely to be used
produced by our model are more highly voted by in conversation. First, gif responses are collected
the community. Data, code, and models are avail- from Twitter by identifying all replies to English-
able at https://github.com/xingyaoww/gif-reply. language tweets containing animated_gif as
embedded media. Tweets were collected from a
2 GIF Communications ∼10% sample of Twitter from March 13th, 2019
to Jan 24th, 2020, totaling 42,096,566 tweets with
Gifs have been widely adopted in communication a gif that we were able to retrieve. Twitter does not
as a natural form of embodied speech where the vi- canonicalize its gifs so two separate gif files may
sual imagery conveys emotions or a reaction as a re- actually have the same imagery. Further, these files
sponse (Bakhshi et al., 2016; Tolins and Samermit, may not be identical due to small differences such
2016). These gifs commonly come from widely- as color variations or aspect ratios. To identify uses
known cultural products, such as movies or televi- of the reference gifs, we use Average Hash from
sion shows, which provides common knowledge the imagehash library to create low-dimensional
for how they could be interpreted (Eppink, 2014; representations of each gif where hash distance
Miltner and Highfield, 2017). However, a single gif corresponds to perceptual distance. Since gifs are
may have multiple interpretations, depending on animated and may contain varying scenes, we com-
the context, cultural knowledge of its content, and pute the hash for the first, middle, and final frames,
the viewer (Jiang et al., 2017). As a result, a single concatenating these into a single hash. Two gifs
gif can serve multiple functions in communication are considered the same if (i) they have identical
(Tolins and Samermit, 2016). hashes or (ii) their hamming distance is < 10 and
Gifs have grown in their use through increas- gifs with that hash have been used more than 500
ing affordances by platforms like Tumblr, Reddit, times in Twitter. This latter condition was selected
Imgur, and Twitter that allow gifs to be natively after manual evaluation of thresholds to trade-off
displayed like text in conversation threads (Jiang between increasing the size of the training data and
et al., 2018). Further, gif-based keyboards have reducing potential noise caused by matching error.
been introduced that allow users to search for gifs A visual example of this process can be found in
that have been tagged with keywords or other meta- Appendix Figure 8.
data (Griggio et al., 2019). Yet, these technologies
Not all gif responses in the Twitter data are con-
require that gif data be prepared with sufficient tags
versational or appropriate for wider re-use. There-
to be searchable or to have sufficient data to use
fore, we filter these responses to only those gifs
collaborative filtering techniques for recommenda-
whose imagery matches gifs hosted by the Giphy
tions (Jiang et al., 2018, p.9). As a result, there is a
website, which is the backend for many gif-based
clear gap in identifying appropriate response gifs
keyboards. Giphy contains a wide collection of gifs
directly from the text, which this work fills.
that are curated to remove content inappropriate for
general use (e.g., violent or sexual imagery). Gifs
3 Data
on the platform are categorized (e.g., “reaction” or
Despite the widespread use of gifs, no standard “celebrities”) and we identify 28 categories con-
dataset exists for text and gif replies. Further, al- taining 972 keywords likely to contain gifs usedused with at least five gifs and where those gifs
have been used at least 1000 times in total; this
process removes many low-frequency tags that are
either overly-specific or idiosyncratic in their use.
Finally, we performed a manual inspection of all
remaining tags to remove tags that are too general
(e.g., “emotion”) and retain only noun, adjective,
and verb tags (words or multi-word expressions)
that describe specific emotions or actions. A total
of 241 unique tags were retained (Appendix C).
6.0% of gifs have at least one tag associated with
Figure 2: The frequency distribution of gifs in our data
them (mean 1.9 tags). However, these tagged gifs
roughly follows a log-normal distribution, with a few account for 38.7% of the replies in our dataset,
gifs used often, while a long tail of gifs are used rela- suggesting tags are only available for more-popular
tively infrequently. gifs. Our dataset represents roughly an order of
magnitude more data and more tags than the closest
related dataset of Chen et al. (2017) that contained
in conversation. A total of 2,095,993 gifs linked 23K gifs with 17 manually-curated emotions.
to those keywords were ultimately retrieved and
stored as image hashes. Additional details of cate- 4 Gif Reply Models
gories and keywords are in Appendix B.
After the matching image hashes to filter replies, We introduce a series of models for producing a gif
we identify 115,586 unique gifs, referred to as ref- response in conversation. Each model will select a
erence gifs, and 1,562,701 tweet replies using one gif from the 115K gifs in our dataset as a response
of these gifs, which forms our official dataset. Fig- to a text-based message. This task is related to but
ure 2 shows these gifs’ frequency in the data; much distinct from work on image-text matching (Lee
like words, a handful of gifs receive widespread et al., 2018), which aims to find an image describ-
use, while a long tail of gifs are rarely used. ing a piece of text, or text-to-image (e.g., Wen et al.,
2015; Xu et al., 2018), which generates an image
3.2 Gif Metadata from a text description. Here, we aim to select gifs
We augment our gif data with information about that reflect natural continuations or reactions to a
their content. Some gifs have text that transcribes message in a dialog, akin to how gifs are used in
what a person is saying in the gif’s scene or is a social media. For all models, additional details on
meta-commentary on the content. This text is ex- the training procedures and hyperparameters are
tracted using paddleOCR (Du et al., 2020). Since provided in Appendix A. The three models that
some gifs are long enough to contain multiple utter- follow use varying degrees of information about
ances, we run OCR on four frames sampled from the gifs and text to select a response.
each quartile of the gif’s length. Roughly 50%
(58,020) of gifs contain at least one extracted word 4.1 Tag-based Predictions
from the selected frames, with an mean of 5.5 ex- The first model uses tags as a shared representation
tracted words per gif across the dataset. for characterizing gifs and text. Analogous to how
Second, some gif repositories like Giphy allow object tags are used as anchor points for image-text
users to tag gifs with information on their content matching (Li et al., 2020) and pivot languages are
or theme, e.g., “face palm” or “movie.” We collect used in machine translation (Cheng et al., 2017),
tags for the 115K reference gifs used in Twitter, ob- we use tags to bridge information between the text
taining 39,651 unique tags. These user-generated in a tweet and the visual content of a gif. Here,
tags were moderately noisy due to orthographic each gif becomes associated with a set of tags de-
variations like spelling, capitalization, and spacing. scribing its conversational functions and for each
Therefore, we merge tags by (i) lower-casing the text, we predict the set of tags for gifs responses to
text and (ii) performing a manual merge for similar it—in essence, predicting what types of responses
word forms (e.g., “excited” and “exciting”). To are most appropriate. We describe both of these
minimize noise, we retain only tags that have been processes next and how gifs are ultimately selected.Estimating Gif Tags Only 6.0% of the gifs in our given a tweet, we use the trained tweet encoder to
data have associated tags. Therefore we train a extract its representation and compute its cosine
neural model to predict tags using known tags as similarity with each encoded representation for our
training data. To capture any changes in emotion gifs. The gif with the highest cosine similarity is
or imagery across the gif, we make separate pre- returned as the best response.
dictions for four frames sampled across the gif
(the same used in §3.2). Each frame is passed 4.3 P EPE THE K ING P RAWN
through an EfficientNet-based (Tan and Le, 2019) Our final model, K ING P RAWN1 (referred to as
GIF encoder, shown in Figure 3, to extract a low- “P EPE”.) selects gif responses by using a richer
dimensional feature vector from each frame. These set of multimodal features to create a gif represen-
frame embeddings are fused using the attention tation. Rather than encode the gif solely from its
mechanism from a transformer encoder layer. The image content, we use a multimodal encoder that
output of the transformer feeds into a fully con- captures (i) any text it might have, (ii) the types of
nected layer, which is trained as a multi-label clas- objects present in the gif, and (iii) object regions as
sifier using binary cross-entropy to predict which visual features. We encode these gif aspects using
tags should be present. an O SCAR transformer (Li et al., 2020) to create
Predicting Response Tags for Text For each mes- a unified representation, shown in Figure 3 (bot-
sage, we predict the k-hot distribution of tags tom). Object names and regions of interest feature
for a gif response by training a BERTweet model vectors are extracted using a pre-trained bottom-up
(Nguyen et al., 2020), which has been pre-trained attention model (Anderson et al., 2018).
on a large corpus of Twitter data (shown as “Tweet As input to the O SCAR encoder, the captions to
Encoder" in Figure 3). The model with an addi- each of the gif’s four frames are concatenated to-
tional fully connected layer is trained as a multi- gether with an “[INTER_FRAME_SEP]" separator
label classifier using binary cross-entropy, using token. We filter object areas detected by the bottom-
the tags for the gifs used in reply (if known). up attention model (Anderson et al., 2018) and we
Tag-based Gif Selection At inference time, given keep all objects with probability >0.5. We then
a message, we use the text-to-tag model to predict concatenate object names together with the same
a k-hot distribution over tags. Then, we select the inter-frame separator between names of different
gif whose estimated tag distribution is closest in frames. Together, the caption text, object names,
Euclidean distance. and image-region features are fed into the O SCAR
transformer encoder to generate a GIF feature vec-
4.2 CLIP variant tor; the transformer is initialized with the default
O SCAR weights. We use BERTweet to encode text.
The second model uses an end-to-end training ap- The entire P EPE model is trained end-to-end using
proach based on the architecture of OpenAI CLIP contrastive loss, similar to the CLIP model.
(Radford et al., 2021). The architecture features
two encoders, one for text and one for images. Dur- 5 Evaluation
ing training, the encoders are updated using con-
We initially evaluate the methods in two ways.
trastive loss that maximizes the cosine similarity of
First, we use traditional classification-based eval-
paired image-text representations and minimizes
uation, testing whether the models can reproduce
the cosine similarity of random pairs of images and
the observed gif replies. However, some messages
texts. We replicate the CLIP architecture and train-
could have multiple valid gif responses. Therefore,
ing procedure, using BERTweet to encode text and
as a second test, we evaluate the model in a retrieval
EfficientNet (Tan and Le, 2019) to encode a com-
setting, measuring whether its most-probable re-
posite image of four frames from the gif (compared
sponses are good quality for a message.
with BERT and ResNet in their implementation).
Experimental Setup Models are trained and
While originally designed to select an image for a
tested on a dataset containing 1,562,701 Tweet-
text description, our model is trained to select a gif
reply for a text message—a more challenging task 1
K ING P RAWN refers to “selecKting INteresting Gifs for
than the image retrieval task used in the original Personal RespAWNses.” In this crazy muppet-name-land-
grab world we live in, our only regret is that we couldn’t
CLIP setup, as the message may not contain words get “Pepino Rodrigo Serrano Gonzales” to fit as a bacronym,
describing elements of the gif. At inference time, which we leave to future work.EfficientNet GIF Encoder Tweet Encoder
EfficientNet Transformer Tweet
EfficientNet BERTweet-base
EfficientNet
EfficientNet-b0 Encoder We 're getting
Layer Transformer
(shared weight) married today !
feature vector for feature vector for
4 selected frames
downstream task downstream task
4 x 224 x 224 (after reshape)
1 x 512 1 x 512
Oscar GIF Encoder
OCR extracted caption
Aww , thank you
extracted object names Oscar
Bottom-up Attention Multimodal
face woman [INTER_FRAME_SEP] face woman
Transformer
extracted object feature vectors feature vector for
downstream task
1 x 512
4 selected frames
4 x 224 x 224 (after reshape)
Figure 3: The different encoder modules used to construct the models in §4.
GIF pairs associated with 115,586 unique gifs, sufficient content to judge appropriateness indepen-
where 605,063 tweet-gif pairs are associated dent of the larger social context.
with at least one tag. Using the finalized 241 Two annotators (the authors) were shown a list
unique tags as classes for multi-label classifica- of potential gif responses for a tweet and asked to
tion, we split the dataset by stratify on tags us- judge whether this is an appropriate gif response
ing the iterative train-test split method provided by (a binary rating). Gifs were selected from the ten
scikit-multilearn library (Sechidis et al., most-probable replies for each system and collec-
2011; Szymański and Kajdanowicz, 2017) to cre- tively shown in random order to prevent knowing
ate a 80:10:10 train, dev, and test split which which system generated each reply. A total of 2,500
is finalized to train the models described in §4. gif-tweet pairings were annotated. Annotators at-
Following BERTweet (Nguyen et al., 2020), we tained a Krippendorf’s α of 0.462; while moderate
preprocess tweets in our dataset using NLTK agreement, this value is expected given known dif-
TweetTokenizer for tokenization, emoji ferences in how people interpret and value gif re-
package to translate emotion icons, and converted sponses based on their familiarity with its content,
mentions and links to special “@USER" and message interpretation, and life-experience (Jiang
“HTTPURL" tokens. et al., 2018). We follow the evaluation setup from
Annotated Data To test whether each model’s pre- other retrieval-based dialog systems (e.g. Yu et al.,
dictions are valid responses, we annotate the ten 2021; Kumar and Callan, 2020) and use normal-
most-probable gif predictions for a subset of the ized Discounted Cumulative Gain (nDCG), which
tweets in our test data. Many tweets in our test set measures whether more appropriate gif responses
require substantial context to understand due to hav- are ranked higher. A gif’s appropriateness score is
ing few tokens, linking to URLs that provide extra the sum of annotators’ ratings.
knowledge, mentioning other users in directed com- Results The P EPE model was able to identify rele-
munication. These factors suggest social context vant and good-quality gif responses, as shown by
or general knowledge aids in the recipient’s under- its performances on the test data (Table 1) and an-
standing of the gif’s intentions. While the model notated data (Table 2). Performance on the test set
can still benefit from training on such examples, is expected to be low, given the challenge of identi-
judging the appropriateness of response is difficult fying the exact gif used for a tweet when multiple
without access to the social context. Therefore, to possible gifs are likely to be equally valid. How-
reduce interpretation ambiguity, we annotate only ever, the P EPE model is still able to identify the
tweets without URLs or user mentions and having exact gif (out of 115K) in its top 10 predictions
at least 10 tokens. This process selects tweets with for 3% of the data, substantially outperforming allModel Top-1 Top-5 Top-10 Model nDCG
Tag-based 0.000000 0.000092 0.000119 P EPE 0.8145
Random 0.000020 0.000059 0.000158
CLIP variant 0.000488 0.001669 0.002783
P EPE without object names 0.7665
Distribution sampling 0.000996 0.005098 0.009780 P EPE without caption 0.7559
P EPE 0.005375 0.018723 0.030918 P EPE without object features 0.7533
Table 1: Models’ precision-at-k on selecting the exact Table 3: Results for ablated versions of P EPE where
gif used as a response for a tweet in the test set; this per- specific input is removed (cf. Table 2) show that all in-
formance is an underestimate of each model, as many put forms contribute to the ability to select replies.
model-predicted gifs may be appropriate.
Model nDCG and evaluating the model’s resulting gifs on the
same test instances.
Random 0.3273
Tag-based 0.4526 The ablated model performances, shown in Ta-
Distribution sampling 0.4969 ble 3, reveal that each input is useful for selecting
CLIP variant 0.5934 gifs.2 Object features capture visual information
P EPE 0.8145 about what specifically is present in the gif (beyond
the discrete names of what is present, e.g., “person”
Table 2: Models’ nDCG scores at proposing appropri- or “building”) and show that multimodality is im-
ate gif replies, measured from annotations on the top portant for high performance—predicting replies
10 most probable gif replies of each model. just from a gif’s caption and categorized content
are insufficient. Similarly, the caption of a gif (if
present) is important, as the text can help make
other models. explicit the intended interpretation of a gif.
Performance on the annotated data (Table 2) pro-
vides a more realistic assessment of whether mod- 6 Field Experiment
els can generate high-quality replies, as it measures
whether the models’ replies themselves were good. To test the generalizability of our models and qual-
The P EPE model attains substantially higher per- ity of their responses, we conduct a large-scale ran-
formance (pParent Tweets Tag-based CLIP variant P EPE Dist. Samp. Random
That wonderful feeling
you get when you ar-
rive to a business din-
ner that you’re suppos-
edly paying for...and re-
alize you’ve forgotten
your credit card
I’m convinced some of
y’all don’t get laid
Table 4: Model-selected replies to messages (paraphrased for privacy). Click an image to view the gif on Giphy.
this score in our experiments to evaluate quality. we use Thompson sampling (Russo et al., 2018)
Our experiment focuses on generating Gif-based to randomly select which arm of the trial to use.
replies to top-level text comments (comments made Thompson sampling builds a probability model for
directly to the post). This setup mirrors the conver- the estimated reward of each arm (here, the score a
sational data our models were trained on. Imgur reply receives) and samples from the model such
supports several ways of filtering its stream of posts. that higher-rewarding arms are sampled more fre-
To ensure that our replies have sufficient visibility, quently. As a result, this method can provide tighter
we select posts that have already receive 10 com- estimates for the reward of the most useful arms.
ments and appear in the “most viral” sorting. From Scores in Imgur have a skewed distribution, with
these posts, we reply to the top-rated text comment. few comments receiving very high scores and most
The RCT runs from 8 AM to 8 PM (local time), receiving near the default score (1). Therefore, we
making at most 10 replies per hour. use Poisson Thompson sampling. Some comments
Not all topics or comments are suitable for auto- may be downvoted to receive scores below zero, so
mated responses and great care was taken to pre- for simplicity, we truncate these scores to 0.
vent potential harm to the community. Through We initialize the reward estimates for our ex-
multiple rounds of testing which replies would be periment by selecting one of the five models in
responded to, we curated a list of keywords that a round-robin manner to reply to an Imgur com-
could lead to potential controversial replies, such ment for 3 days. These initial scores act as priors
as terms about religion or race (full list in Ap- for Thompson sampling to update Poisson distri-
pendix D). Any comment containing a token or butions for each model. In the trial, we choose a
lemma matching a word on this list is excluded model by sampling from the up distributions using
and not replied to. As a further safeguard, exper- all previous days’ scores as the prior. The exper-
imenters monitored all replies to remove any that iment ran from April 15th, 2021 to August 30th,
were deemed inappropriate. See the Ethics Section 2021, and models generated a total of 8,369 replies.
(§9) for a longer discussion of safeguards. To evaluate the results of the RCT, we construct
The field experiment consists of five arms, cor- a Negative Binomial regression on the dependent
responding to the three trained models and the two variable of the score received for a model’s reply,
baseline models. During each trial, one model is se- truncating negative scores to zero. The Negative
lected and generates a response; the trained model binomial was chosen instead of Poisson due to
replies with the most probable gif.4 over-dispersion in the score variable. The models
Not all models are equally likely to perform well are treated as a categorical variable, using the ran-
and so to make the most use of our trial budget, dom model as a reference. Since the score will
4
Due to a bug, early experimental trials for the CLIP and depend, in part, on the attention received by the
P EPE models used the tenth most-probable gif; however, using parent post and comment (higher-rated comments
the ratings in the annotated data, a t-test of the difference in are displayed first), we include linear effects for
quality for most- and tenth-most probable gifs showed no
statistically-significant difference in quality for both models the post and parent comment. Finally, we include
(p>0.1). Therefore, we include this data in our results. five text-related variables to control for the con-CLIP variant *** 2gG2xiMTtFwsg
fnjxvV295sWEJjvwXU
BAPSj0xM1cFe8
Dist. Samp. iJsvRxNTAcup6DVfLP
GIPHY GIF ID
3oEjHLcg4QMU5umb9m
aKrTvuOv4hlKM
Pepe *** 3oKIPllDN24q8Awtwc
jIu44mYwUItSHTW3tj
Tag-based jTrWAzlFGfvVY34PSJ
l396L17pwHWOIJrTG
0.2 0.0 0.2 60 40 20 0 20
Coefficient for GIF Reply Score GIF Reply Score
(a) Tag-based
Figure 4: Negative Binomial regression coefficients for
each model on predicting a gif reply’s score, using the lfesfEtobCSbsHzC8d
random-gif model as the reference category; bars show m9d3Xif3ShZ42CxlWP
standard error and *** denotes significance at 0.01. f9k1tV7HyORcngKF8v
loitbnzQ1JQ8Iizx8w
GIPHY GIF ID
bfrlODgSLqXxS
4HmjGg306HiLHWlm2f
tent of the parent comment: the topic distribution 7J26CGAahos6d5S1A6
8hZ9FMolyKc0X8BSr7
(Appendix Table 9) from a 10-topic model (drop- iqkHA3DmB8GjORY030
ping one topic due to collinearity), the sentiment OOzcnk3PzLDHqWs6Tb
30 20 10 0 10 20
and subjectivity of the message estimated using GIF Reply Score
TextBlob library, the length of the comment, and (b) CLIP variant
whether the comment contained a question.
tnYri4n2Frnig
5wWf7GR2nhgamhRnEuA
6.2 Results 5gw0VWGbgNm8w
iXTrbbYMQBCMM
GIPHY GIF ID
The field experiment demonstrates that the P EPE 65ODCwM00NVmEyLsX3
model is able to generate significantly higher- 26AHLBZUC1n53ozi8
3o8doT9BL7dgtolp7O
scoring responses. Figure 4 shows the Negative Fq6Bdki3coEWQ
Binomial regression coefficients for the three mod- 3oEjHAUOqG3lSS0f1C
KzyMcEfDh4Jiw
els and empirical distribution baseline, with the 40 20 0 20
GIF Reply Score
random gif model as a reference; full regression
(c) P EPE
results are shown in Appendix Table 6. The P EPE
model substantially outperforms all other models Figure 5: Score distributions for most-frequently used
(p0.73 for all models). The most-used gifs
and the reply (Madden, 2018, p.29). Further, we for each model had average scores that were pos-
observed that, when the model’s reply truly seemed itive, but the distributions for each gif show that
random, some users replied say they upvoted solely some uses were occasionally downvoted. This high
because they enjoyed the gif. variance in scores indicates that a gif’s intrinsic
As a follow-up experiment, we tested whether qualities are not solely responsible for the received
models could be getting higher (or lower) scores by score and, instead, appropriate use in context is
repeatedly picking the same gifs that are skewed plays a significant part in community reception.
towards a positive or negative reaction. Figure 5 We examined whether models relied on the same
shows the score distribution for the top ten most fre- set of gifs. Figure 6 shows the distribution of gifisting messages from a large social media corpus as
2.5 Tag-based potential replies and rank these to select a response.
log10(frequency of unique gif)
CLIP variant Our work mirrors models that use neural networks
2.0 Pepe
Dist. Samp. for ranking (Yan et al., 2016; Inaba and Takahashi,
1.5 random 2016; Penha and Hauff, 2021, e.g.,); however, we
note that many recent knowledge-grounded and
1.0 open domain models use encoder-decoder meth-
0.5 ods to improve versatility and applicability (e.g.,
Ghazvininejad et al., 2018; Gao et al., 2019; Zhou
0.0
et al., 2020). Generative approaches are likely in-
0.0 0.5 1.0 1.5 2.0 2.5 3.0 appropriate for gif-based conversation as gifs are
log10(rank of unique gif)
more akin to mimetic artifacts that build on cultural
Figure 6: Gif use frequency by each model, shown knowledge (Eppink, 2014), making synthesizing a
as frequency-vs-rank log-scaled with first-order line fit new gif from scratch likely less effective.
(jitter added for separation).
All three models used here rely on joint embed-
ding spaces for gif and text. Multiple works in
uses by each model, indicating that the tag-based NLP have been proposed to align these representa-
model relied frequently on a small set of gifs. How- tions (Kiros et al., 2014; Wang et al., 2016), often
ever, the P EPE and CLIP variant models were sub- for particular applications such as visual question
stantially more varied, indicating they draw from answering (Antol et al., 2015). Recent work has
the long-tail of possible gifs. focused on embeddings these media with a single
Do any of our models spark more subsequent encoder that takes both text and images as input
conversation? We fit a separate Negative Binomial (e.g., Wang et al., 2019; Chen et al., 2020), in con-
regression on the total number of comments made trast to our model that uses separate image and text
to our reply, using the same IVs as the score regres- encoders (Figure 3); these multimodal encoders
sion and include the reply’s score itself as another are prohibitively computationally expensive to use
IV. This model (Appendix Table 8) shows that both in our setting during inference time, as the model
the distributional-sampling baseline and P EPE mod- would need to be run on each gif (and message) to
els produced replies that led to fewer subsequent rank replies, compared with our model that only
comments (p9 Ethics ate gif, which is mitigated by the use of Giphy to
seed our initial gifs. As this platform is curated
The interactive nature of the RCT necessitated and does not host objectively offensive gifs (e.g.,
a close consideration of ethical issues (Thieltges overly-violent content), our initial gif set is rela-
et al., 2016). Prior to beginning the RCT, the study tively free of objectionable gifs. Because our model
team obtained IRB approval to interact with users. learns directly from gifs’ frequency of use, unless
While necessary in the legal sense, IRB approval objectively offensive gifs are widely used, they are
is not sufficient to justify the ethical grounds of unlikely to be deployed from our RCT; we specu-
the study. The primary risks of the study are if the late that few objectively offensive gifs are widely
automated models respond with an inappropriate used and, in practice, we have not identified any
gif or respond to a message that is not suitable for during the study period or when examining hun-
automated response (e.g., discussing the death of a dreds of random gifs in our data (or used in the
loved one or making an offensive statement). These RCT).
risks were mitigated in multiple ways throughout Finally, one risk is that by learning gif responses
the dataset construction and field experiment. from observed data, our models may reinforce
First, the selection criteria for which comments cultural stereotypes that are encoded in the gifs
we reply to was designed to only reply to content themselves (Erinn, 2019), e.g., the association of
that was already deemed appropriate by the com- African American individuals with strong emotions.
munity. By selecting only posts that had received While our gif data is relatively clean of overtly of-
sufficient upvotes to be called “viral” and were fensive gifs, we acknowledge that our model likely
already receiving comments, we mitigate the risk does inadvertently perpetuate some of these latent
of engaging in topics or conversations that are in- biases in the data. However, the success of our
appropriate according to the norms of the Imgur model suggests a future mitigation strategy for plat-
community, as these posts would be removed by forms suggesting gifs: as biases become known,
moderators or would have received sufficient down- our approach can be used to suggest less-biased
votes to stay in obscurity. gifs as potential responses to mitigate future harm.
Second, by focusing on the top-voted comment
to these posts, we again reply to content that has Acknowledgments
already been deemed high-quality by the comment. We thank the reviewers, area chairs, and senior area
This comment-level criteria substantially lowers chairs for their thoughtful comments and feedback.
the risk of our models commenting on inappropri- We also thank the Blablablab for helpful feedback
ate comments (e.g., a comment insulting another and letting us deploy P EPE to the group’s Slack and
user), as these comments are readily downvoted by putting up with the ridiculous gif replies and Imgur
the community prior to our intervention. for being a wonderful community. This material is
Third, we employed extensive filtering to avoid based upon work supported by the National Science
replying to any comment containing a potentially Foundation under Grant No. 2007251.
sensitive topic, e.g., a discussion of race or trauma
(keywords are listed in Appendix D). The initial set
of keywords was developed through examining po- References
tentially sensitive topics and then iteratively added Peter Anderson, Xiaodong He, Chris Buehler, Damien
to by simulating which messages our RCT would Teney, Mark Johnson, Stephen Gould, and Lei
reply to and examining whether it would be appro- Zhang. 2018. Bottom-up and top-down attention for
priate. During the field RCT, experimenters contin- image captioning and visual question answering. In
2018 IEEE Conference on Computer Vision and Pat-
uously monitored the comments to ensure no harm tern Recognition, CVPR 2018, Salt Lake City, UT,
was being done. Ultimately, only three comments USA, June 18-22, 2018, pages 6077–6086. IEEE
were removed during the initial two days, which Computer Society.
was due to a bug in the lemmatization and these Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-
comments should have been filtered out by our ear- garet Mitchell, Dhruv Batra, C. Lawrence Zitnick,
lier criteria; these comments were removed quickly and Devi Parikh. 2015. VQA: visual question an-
swering. In 2015 IEEE International Conference on
and we did not observe any notable response from
Computer Vision, ICCV 2015, Santiago, Chile, De-
the community. cember 7-13, 2015, pages 2425–2433. IEEE Com-
Fourth, one risk is replying with an inappropri- puter Society.Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark,
Yale Song, Paloma de Juan, and Joseph Jofish Kaye. Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari
2016. Fast, cheap, and good: Why animated gifs en- Ostendorf. 2018. Sounding board: A user-centric
gage us. In Proceedings of the 2016 CHI Conference and content-driven social chatbot. In Proceedings of
on Human Factors in Computing Systems, San Jose, the 2018 Conference of the North American Chap-
CA, USA, May 7-12, 2016, pages 575–586. ACM. ter of the Association for Computational Linguis-
tics: Demonstrations, pages 96–100, New Orleans,
Michael S Bernstein, Margaret Levi, David Magnus, Louisiana. Association for Computational Linguis-
Betsy Rajala, Debra Satz, and Charla Waeiss. 2021. tics.
Esr: Ethics and society review of artificial intelli-
gence research. ArXiv preprint, abs/2106.11521. Jianfeng Gao, Michel Galley, and Lihong Li. 2019.
Neural Approaches to Conversational AI: Ques-
Fumihiro Bessho, Tatsuya Harada, and Yasuo Ku- tion Answering, Task-oriented Dialogues and Social
niyoshi. 2012. Dialog system using real-time crowd- Chatbots. Now Foundations and Trends.
sourcing and Twitter large-scale corpus. In Proceed-
ings of the 13th Annual Meeting of the Special Inter- Marjan Ghazvininejad, Chris Brockett, Ming-Wei
est Group on Discourse and Dialogue, pages 227– Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and
231, Seoul, South Korea. Association for Computa- Michel Galley. 2018. A knowledge-grounded neural
tional Linguistics. conversation model. In Proceedings of the Thirty-
Second AAAI Conference on Artificial Intelligence,
Elli Bourlai and Susan C Herring. 2014. Multimodal (AAAI-18), the 30th innovative Applications of Arti-
communication on tumblr: “i have so many feels!”. ficial Intelligence (IAAI-18), and the 8th AAAI Sym-
In Proceedings of the 2014 ACM conference on Web posium on Educational Advances in Artificial Intel-
science, pages 171–175. ligence (EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 5110–5117. AAAI Press.
Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind W
Picard. 2017. Gifgif+: Collecting emotional ani- Carla F Griggio, Joanna Mcgrenere, and Wendy E
mated gifs with clustered multi-task learning. In Mackay. 2019. Customizations and expression
2017 Seventh International Conference on Affective breakdowns in ecosystems of communication apps.
Computing and Intelligent Interaction (ACII), pages Proceedings of the ACM on Human-Computer Inter-
510–517. IEEE. action, 3(CSCW):1–26.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Tim Highfield and Tama Leaver. 2016. Instagrammat-
Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and ics and digital methods: Studying visual social me-
Jingjing Liu. 2020. UNITER: Universal Image- dia, from selfies and gifs to memes and emoji. Com-
TExt representation learning. In ECCV. munication research and practice, 2(1):47–62.
Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Chung Hoon Hong, Yuan Liang, Sagnik Sinha Roy,
Wei Xu. 2017. Joint training for pivot-based neural Arushi Jain, Vihang Agarwal, Ryan Draves, Zhizhuo
machine translation. In Proceedings of the Twenty- Zhou, William Chen, Yujian Liu, Martha Miracky,
Sixth International Joint Conference on Artificial In- et al. 2020. Audrey: A personalized open-domain
telligence, IJCAI 2017, Melbourne, Australia, Au- conversational bot. In Alexa Prize Proceedings.
gust 19-25, 2017, pages 3974–3980. ijcai.org.
Pingping Huang, Jianhui Huang, Yuqing Guo, Min
Michelle Cohn, Chun-Yen Chen, and Zhou Yu. 2019. Qiao, and Yong Zhu. 2019. Multi-grained attention
A large-scale user study of an Alexa Prize chatbot: with object-level grounding for visual question an-
Effect of TTS dynamism on perceived quality of so- swering. In Proceedings of the 57th Annual Meet-
cial dialog. In Proceedings of the 20th Annual SIG- ing of the Association for Computational Linguis-
dial Meeting on Discourse and Dialogue, pages 293– tics, pages 3595–3600, Florence, Italy. Association
306, Stockholm, Sweden. Association for Computa- for Computational Linguistics.
tional Linguistics.
Michimasa Inaba and Kenichi Takahashi. 2016. Neural
Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, utterance ranking model for conversational dialogue
Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua systems. In Proceedings of the 17th Annual Meeting
Yang, Qingqing Dang, et al. 2020. PP-OCR: A Prac- of the Special Interest Group on Discourse and Dia-
tical Ultra lightweight OCR system. ArXiv preprint, logue, pages 393–403, Los Angeles. Association for
abs/2009.09941. Computational Linguistics.
Jason Eppink. 2014. A brief history of the gif (so far). Jialun “Aaron" Jiang, Jed R Brubaker, and Casey
Journal of visual culture, 13(3):298–306. Fiesler. 2017. Understanding diverse interpretations
of animated GIFs. In Proceedings of the 2017 CHI
Wong Erinn. 2019. Digital blackface: How 21st cen- Conference Extended Abstracts on Human Factors
tury internet language reinforces racism. in Computing Systems, pages 1726–1732.Jialun “Aaron" Jiang, Casey Fiesler, and Jed R Conference on Empirical Methods in Natural Lan-
Brubaker. 2018. “The Perfect One” Understanding guage Processing: System Demonstrations, pages 9–
Communication Practices and Challenges with An- 14, Online. Association for Computational Linguis-
imated GIFs. Proceedings of the ACM on human- tics.
computer interaction, 2(CSCW):1–20.
Tianrui Niu, Fangxiang Feng, Lingxuan Li, and Xiaojie
Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, and Wang. 2020. Image synthesis from locally related
Traci Curl. 1998. Lexical, prosodic, and syntactic texts. In Proceedings of the 2020 International Con-
cues for dialog acts. In Discourse Relations and Dis- ference on Multimedia Retrieval, pages 145–153.
course Markers.
Mahmoud Khademi. 2020. Multimodal neural graph Silvia Pareti and Tatiana Lando. 2018. Dialog intent
memory networks for visual question answering. In structure: A hierarchical schema of linked dialog
Proceedings of the 58th Annual Meeting of the Asso- acts. In Proceedings of the Eleventh International
ciation for Computational Linguistics, pages 7177– Conference on Language Resources and Evaluation
7188, Online. Association for Computational Lin- (LREC 2018), Miyazaki, Japan. European Language
guistics. Resources Association (ELRA).
Ryan Kiros, Ruslan Salakhutdinov, and Richard S Gustavo Penha and Claudia Hauff. 2021. On the cal-
Zemel. 2014. Unifying visual-semantic embeddings ibration and uncertainty of neural learning to rank
with multimodal neural language models. ArXiv models for conversational search. In Proceedings
preprint, abs/1411.2539. of the 16th Conference of the European Chapter
of the Association for Computational Linguistics:
Artie Konrad, Susan C Herring, and David Choi. 2020. Main Volume, pages 160–170, Online. Association
Sticker and emoji use in facebook messenger: impli- for Computational Linguistics.
cations for graphicon change. Journal of Computer-
Mediated Communication, 25(3):217–235. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Vaibhav Kumar and Jamie Callan. 2020. Making in- Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
formation seeking easier: An improved pipeline for et al. 2021. Learning transferable visual models
conversational search. In Findings of the Associa- from natural language supervision. ArXiv preprint,
tion for Computational Linguistics: EMNLP 2020, abs/2103.00020.
pages 3971–3980, Online. Association for Compu-
tational Linguistics. Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu
Kuang-Huei Lee, X. Chen, G. Hua, H. Hu, and Xi- Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn,
aodong He. 2018. Stacked cross attention for image- Behnam Hedayatnia, Ming Cheng, Ashish Nagar,
text matching. ArXiv preprint, abs/1803.08024. et al. 2018. Conversational ai: The science behind
the alexa prize. ArXiv preprint, abs/1801.03604.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xi-
aowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott
Li Dong, Furu Wei, et al. 2020. Oscar: Object- Gray, Mark Chen, Rewon Child, Vedant Misra,
semantics aligned pre-training for vision-language Pamela Mishkin, Gretchen Kruegerand Sandhini
tasks. In European Conference on Computer Vision, Agarwal, and Ilya Sutskever. 2021. DALL-E: Cre-
pages 121–137. Springer. ating images from text. https://openai.com/
blog/dall-e/.
John Savery Madden. 2018. The Phenomenological
Exploration of Animated GIF Use in Computer- Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni,
Mediated Communication. Ph.D. thesis, University Ian Osband, and Zheng Wen. 2018. A tutorial on
of Oklahoma. thompson sampling. Foundations and Trends® in
Machine Learning, 11(1):1–96.
Luke Melas-Kyriazi, Alexander Rush, and George Han.
2018. Training for diversity in image paragraph cap-
Konstantinos Sechidis, Grigorios Tsoumakas, and Ioan-
tioning. In Proceedings of the 2018 Conference on
nis Vlahavas. 2011. On the stratification of multi-
Empirical Methods in Natural Language Processing,
label data. Machine Learning and Knowledge Dis-
pages 757–761, Brussels, Belgium. Association for
covery in Databases, pages 145–158.
Computational Linguistics.
Kate M Miltner and Tim Highfield. 2017. Never Piyush Sharma, Nan Ding, Sebastian Goodman, and
gonna GIF you up: Analyzing the cultural signifi- Radu Soricut. 2018. Conceptual captions: A
cance of the animated GIF. Social Media+ Society, cleaned, hypernymed, image alt-text dataset for au-
3(3):2056305117725223. tomatic image captioning. In Proceedings of the
56th Annual Meeting of the Association for Compu-
Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. tational Linguistics (Volume 1: Long Papers), pages
2020. BERTweet: A pre-trained language model 2556–2565, Melbourne, Australia. Association for
for English tweets. In Proceedings of the 2020 Computational Linguistics.Piotr Szymański and Tomasz Kajdanowicz. 2017. A Rui Yan, Yiping Song, and Hua Wu. 2016. Learning
network perspective on stratification of multi-label to respond with deep neural networks for retrieval-
data. In First International Workshop on Learn- based human-computer conversation system. In Pro-
ing with Imbalanced Domains: Theory and Appli- ceedings of the 39th International ACM SIGIR con-
cations, pages 22–35. PMLR. ference on Research and Development in Informa-
tion Retrieval, SIGIR 2016, Pisa, Italy, July 17-21,
Mingxing Tan and Quoc V. Le. 2019. Efficientnet: 2016, pages 55–64. ACM.
Rethinking model scaling for convolutional neural
networks. In Proceedings of the 36th International Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and
Conference on Machine Learning, ICML 2019, 9- Zhiyuan Liu. 2021. Few-shot conversational dense
15 June 2019, Long Beach, California, USA, vol- retrieval. ArXiv preprint, abs/2105.04166.
ume 97 of Proceedings of Machine Learning Re-
search, pages 6105–6114. PMLR. Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum.
2020. The design and implementation of XiaoIce,
Ying Tang and Khe Foon Hew. 2019. Emoticon, emoji, an empathetic social chatbot. Computational Lin-
and sticker use in computer-mediated communica- guistics, 46(1):53–93.
tion: A review of theories and research findings. In-
ternational Journal of Communication, 13:27.
Andree Thieltges, Florian Schmidt, and Simon
Hegelich. 2016. The devil’s triangle: Ethical con-
siderations on developing bot detection methods. In
2016 AAAI Spring Symposium Series.
Jackson Tolins and Patrawat Samermit. 2016. Gifs
as embodied enactments in text-mediated conversa-
tion. Research on Language and Social Interaction,
49(2):75–91.
Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016.
Learning deep structure-preserving image-text em-
beddings. In 2016 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2016, Las Ve-
gas, NV, USA, June 27-30, 2016, pages 5005–5013.
IEEE Computer Society.
Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng,
Junjie Yan, Xiaogang Wang, and Jing Shao. 2019.
CAMP: cross-modal adaptive message passing for
text-image retrieval. In 2019 IEEE/CVF Interna-
tional Conference on Computer Vision, ICCV 2019,
Seoul, Korea (South), October 27 - November 2,
2019, pages 5763–5772. IEEE.
Miaomiao Wen, Nancy Baym, Omer Tamuz, Jaime
Teevan, Susan T Dumais, and Adam Kalai. 2015.
Omg ur funny! computer-aided humor with an ap-
plication to chat. In ICCC, pages 86–93.
Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanx-
iang Che, and Ting Liu. 2020. Conversational graph
grounded policy learning for open-domain conversa-
tion generation. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 1835–1845, Online. Association for
Computational Linguistics.
Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han
Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He.
2018. Attngan: Fine-grained text to image genera-
tion with attentional generative adversarial networks.
In 2018 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2018, Salt Lake City, UT,
USA, June 18-22, 2018, pages 1316–1324. IEEE
Computer Society.Category Subcategory validation macro-f1 was 0.07 achieved at the 70th
epoch.
Cartoons & Comics aqua teen hunger force
Celebrities richard pryor A.2 CLIP variant
Reactions angry The evaluation performance for model selection
Emotions happy is measured by nDCG. For every tweet-gif pair in
Anime bleach the validation set, we measure the top 30 predicted
Art & Design psychedelic GIFs from the model using the tweet as input. The
Nature sunrise relevance of an occurring ground truth gif in the
Transportation bicycle top 30 predictions given a tweet is set 1 for the
nDCG calculation.
Table 5: Examples of GIF categories on GIPHY
CLIP variant is trained on the same finalized
dataset using contrastive loss. It was trained for
A Additional Details on Model Training 16 epochs with a batch size of 16 using AdamW
optimizer of learning rate 1e-5 and weight decay
Following, we provide additional details on how 1e-3. Best validation performance is achieved at
each of the three models was trained. epoch 6 with an nDCG value of 0.015.
We replace the Transformer encoder layer with
A.1 Tag-based Model
a linear Layer on Efficient GIF Encoder from Fig-
EfficientNet-based Tag Classifier Gifs are re- ure 3, and use this as our GIF Encoder for the
shaped to 224 by 224 pixel while keeping the as- CLIP variant. Image inputs to the GIF encoder are
pect ratio by padding and normalized to a mean of normalized following the official CLIP implemen-
0.5 and standard deviation of 0.5 for each channel tation.
before feeding into the EfficientNet-based model.
We selected unique GIFs from the finalized dataset A.3 P EPE
that has at least one associated tag and using the The P EPE model follows most configurations from
iterative train test split on k-hot tag representation the CLIP variant model, but replace the EffcientNet
to select 5% of those GIFs for validation. The Effi- GIF encoder with an Oscar GIF encoder based
cientNet tag classifier was trained for 100 epochs on Oscar pre-trained multi-modal transformer (Li
on a batch size of 32, using AdamW optimizer et al., 2020).
with learning rate 1e-5 and weight decay 1e-3. The Extra metadata are extracted from GIFs in the fi-
best validation performance was achieved at the nalized dataset for further training. Captions within
40th epoch with macro-f1 of 0.30 in predicting 241 the GIF are extracted using PaddleOCR (Du et al.,
multi-label classes. Early experiment shows that 2020), and only extracted text with probability
transformer encoder layer (macro-f1 of 0.30) out greater than 0.9 are kept as caption metadata.
performs linear layer (macro-f1 of 0.19) in fusing Object tags and their corresponding features are
multi-frame gif features on the development set, extracted with bottom-up attention (Anderson et al.,
therefore transformer encoder layer is used to fuse 2018) using py-bottom-up-attention
features of different frames in our implementation. package. Object instances are filtered to only
Tweet-to-tag classifier Using the finalized dataset keep instances that have a score higher than
mentioned in §3, we use tweet as input, and the 0.5, then object tags and their corresponding
k-hot tag representation of that tweet instance as features are extracted from these instances. Final
ground truth label to train the multi-label classifier object features of dimension 2054 are obtained
along with the tweet encoder for 241 classes. Ad- by concatenating feature output with dimension
ditionally, we filter out tweets from the finalized 2048 from Faster-RCNN with scaled box position
dataset that do not have corresponding twitter tags coordinates of the object following (Li et al.,
before training. The model with the best valida- 2020).
tion performance is selected to perform subsequent The P EPE model is trained on the finalized
evaluation and field experiments. The tweet en- dataset with extracted caption and object metadata.
coder was trained for 100 epochs with a batch size It was trained for 16 epochs with a batch size of
of 32. The learning rate was set to 1e-5 with 1e-3 8 using AdamW optimizer of learning rate 1e-6
weight decay using AdamW optimizer. The best and weight decay 1e-3. Preprocessing for GIFs isTweet-GIF Pairs
42,096,566 pairs
Parent Tweet
Having a super great day .
Child animated GIF (reply)
File: tweet_video/EEqo71PWkAAqGab.mp4
AverageHash: 00080c2ce4f0f46410383828c262626300040424b4f4e061
......
39,401,680 unique GIF files
GIPHY GIF Collections
Match Animated
GIPHY GIF ID: 3ornjLd54I3eQYmfpC GIF with Hash
Tags: #high five #late night with seth meyers Matched
AverageHash: 00080c2ce4f0f46410383828c262726300040424b4f4e061
GIPHY GIF ID: 3o6fIQSs4BcsEbDE7S
Tags: #real housewives #bravo tv #slice #rhonj Not Matched
AverageHash: fbf9f35141008c8bfbf9f35151008c89fbf9f35151008c89
......
2,095,993 unique GIF IDs
Finalized Dataset
1,562,701 pairs (605,063 pairs associated with selected tags)
Parent Tweet: Having a super great day .
GIPHY GIF ID: 3ornjLd54I3eQYmfpC
Selected Tags: #high five
GIPHY Tags
Selection
Parent Tweet: We 're getting married today !
GIPHY GIF ID: g9582DNuQppxC
Selected Tags: #cheers, #drink, #congratulations
......
115,586 unique GIF IDs
Figure 7: A diagram of the pipeline used to collect, canonicalize, and filter gif-reply data from Twitter.GIPHY GIF (ID: Y9pd1baXUIJTW)
Tweet Animated GIF (D0mIkeXX0AIQt3N.mp4) Hamming distance = 9
Matched
Image Average Hash: 6864f6ded2d282001f180b3d1c18f07c083e2f1fdfd08c8e
GIPHY GIF (ID: 2zR6G1VD9a1ws)
Hamming distance = 51.0
Not Matched
Image Average Hash: 686cfcfef2d282021f181b3d1c18f07c083e2e1f9fd08c8e
Image Average Hash: f0f4fcfcf8e060001018383818183018001a1b1b1a18180c
Figure 8: Matching Animated GIFs from Twitter with GIPHY gifs using Image Average Hash
the same as the Tag-based model. Max sequence Reactions thumbs down
length is set to 256 tokens for the Oscar transformer. Reactions frustrated
Best evaluation performance is achieved at epoch Reactions oh snap
12 with an nDCG score of 0.007. Reactions disgusted
Reactions rejected
B GIF categories on GIPHY Reactions embarrassed
Reactions hug
Category Subcategory Reactions yolo
Reactions what Reactions interested
Reactions hair flip Reactions thank you
Reactions bored Reactions sarcastic
Reactions frown Reactions shocked
Reactions slow clap Reactions cool story bro
Reactions mic drop Reactions middle finger
Reactions goodbye Reactions you got this
Reactions meh Reactions whatever
Reactions scared Reactions omg
Reactions do not want Reactions deal with it
Reactions confused Reactions sigh
Reactions drunk Reactions oops
Reactions wow Reactions angry
Reactions mad Reactions finger guns
Reactions awesome Reactions good luck
Reactions pleaseYou can also read