An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
An animated picture says at least a thousand words: Selecting Gif-based Replies in Multimodal Dialog Xingyao Wang David Jurgens University of Michigan University of Michigan xingyaow@umich.edu jurgens@umich.edu Abstract PizzaMagic: Ahhhhh!!! The EMNLP deadline is in 24 hours!! Online conversations include more than just x CasualModel: text. Increasingly, image-based responses such as memes and animated gifs serve as arXiv:2109.12212v1 [cs.CL] 24 Sep 2021 culturally recognized and often humorous re- sponses in conversation. However, while NLP has broadened to multimodal models, conver- sational dialog systems have largely focused only on generating text replies. Here, we intro- duce a new dataset of 1.56M text-gif conversa- tion turns and introduce a new multimodal con- versational model P EPE THE K ING P RAWN Figure 1: Gif responses in conversation like the one for selecting gif-based replies. We demon- shown above are embodied dialog that use visual im- strate that our model produces relevant and agery to convey reactions and emotions. This paper de- high-quality gif responses and, in a large ran- velops a system to select the appropriate gif response domized control trial of multiple models re- to messages. (PDF best viewed with Adobe Acrobat) plying to real users, we show that our model replies with gifs that are significantly better re- ceived by the community. Conversation analysis is central to NLP and mul- 1 Introduction tiple approaches have analyzed this dialog struc- Conversations are central to many online social ture (Jurafsky et al., 1998; Pareti and Lando, 2018; platforms. While most conversations are text- Cohn et al., 2019) and developed conversational based, computer mediated dialog also affords al- agents to engage with people (e.g., Fang et al., ternative forms of communication, such as emoji 2018; Xu et al., 2020; Hong et al., 2020). Recent or stickers like bitmoji, that allow users to ex- work has focused on generating open domain social press themselves (Tang and Hew, 2019; Konrad chatbots that engage in sustained conversations in et al., 2020). Increasingly, these visual forms of a natural way (Ram et al., 2018). Because many communication have become common in social of these systems are designed to support voice- media (Bourlai and Herring, 2014; Highfield and based dialog, they overlook non-textual forms of Leaver, 2016), with a notable use of the reaction gif interaction used in social media conversations. In (Bakhshi et al., 2016; Miltner and Highfield, 2017). parallel, multimodal NLP systems have been devel- These gifs are short video sequences that depict a oped for image data, often focusing on image-to- particular scene and sometimes contain text that text tasks such as image captioning (Melas-Kyriazi acts as a meta-commentary (Eppink, 2014). As a et al., 2018; Sharma et al., 2018) and visual ques- result, conversations become multimodal where in- tion answering (Antol et al., 2015; Huang et al., dividuals reply to one another using combinations 2019; Khademi, 2020). More recent work has fo- of text and gifs (Figure 1). While conversational cused on the reverse text-to-image dimension, such AI systems have been developed in a purely text- as generating an image from a description (Niu based setting, such systems do not capture the full et al., 2020; Ramesh et al., 2021). Our work unites multimodal behavior seen online. Here, we study these two strands of research by integrating image- multimodal conversation by introducing new dialog based communication into conversational agents. models for selecting gif replies in conversation. Our paper offers three main contributions. First,
we propose the new task of selecting gif responses though platforms like Twitter support gif replies, in multimodal conversation analysis and introduce these gifs are not canonicalized to identify which re- a new dataset of 1,562,701 real-world conversa- sponses correspond to the same gif. Therefore, we tion turns with gif replies. Second, we introduce construct a new dataset for this task by collecting a new model P EPE THE K ING P RAWN that fuses responses, matching their images, and augmenting image and text-based features to select a relevant this data with metadata about the gif, where possi- gif response. In in-house experiments, we show ble. A visual description of the whole procedure that our model substantially outperforms strong can be found in Appendix Figure 7. baseline models at selecting the exact gif used in real data and, in a manual test of the quality of the 3.1 Gif Response Data best responses, achieves an nDCG of 0.8145 on the annotated test set. Third, in a real-world test, we Gifs have many uses (Miltner and Highfield, 2017) deploy our model as a part of a large-scale random- and so we use a two-step approach to collect data ized controlled trial and show that the gif replies that focus specifically on those likely to be used produced by our model are more highly voted by in conversation. First, gif responses are collected the community. Data, code, and models are avail- from Twitter by identifying all replies to English- able at https://github.com/xingyaoww/gif-reply. language tweets containing animated_gif as embedded media. Tweets were collected from a 2 GIF Communications ∼10% sample of Twitter from March 13th, 2019 to Jan 24th, 2020, totaling 42,096,566 tweets with Gifs have been widely adopted in communication a gif that we were able to retrieve. Twitter does not as a natural form of embodied speech where the vi- canonicalize its gifs so two separate gif files may sual imagery conveys emotions or a reaction as a re- actually have the same imagery. Further, these files sponse (Bakhshi et al., 2016; Tolins and Samermit, may not be identical due to small differences such 2016). These gifs commonly come from widely- as color variations or aspect ratios. To identify uses known cultural products, such as movies or televi- of the reference gifs, we use Average Hash from sion shows, which provides common knowledge the imagehash library to create low-dimensional for how they could be interpreted (Eppink, 2014; representations of each gif where hash distance Miltner and Highfield, 2017). However, a single gif corresponds to perceptual distance. Since gifs are may have multiple interpretations, depending on animated and may contain varying scenes, we com- the context, cultural knowledge of its content, and pute the hash for the first, middle, and final frames, the viewer (Jiang et al., 2017). As a result, a single concatenating these into a single hash. Two gifs gif can serve multiple functions in communication are considered the same if (i) they have identical (Tolins and Samermit, 2016). hashes or (ii) their hamming distance is < 10 and Gifs have grown in their use through increas- gifs with that hash have been used more than 500 ing affordances by platforms like Tumblr, Reddit, times in Twitter. This latter condition was selected Imgur, and Twitter that allow gifs to be natively after manual evaluation of thresholds to trade-off displayed like text in conversation threads (Jiang between increasing the size of the training data and et al., 2018). Further, gif-based keyboards have reducing potential noise caused by matching error. been introduced that allow users to search for gifs A visual example of this process can be found in that have been tagged with keywords or other meta- Appendix Figure 8. data (Griggio et al., 2019). Yet, these technologies Not all gif responses in the Twitter data are con- require that gif data be prepared with sufficient tags versational or appropriate for wider re-use. There- to be searchable or to have sufficient data to use fore, we filter these responses to only those gifs collaborative filtering techniques for recommenda- whose imagery matches gifs hosted by the Giphy tions (Jiang et al., 2018, p.9). As a result, there is a website, which is the backend for many gif-based clear gap in identifying appropriate response gifs keyboards. Giphy contains a wide collection of gifs directly from the text, which this work fills. that are curated to remove content inappropriate for general use (e.g., violent or sexual imagery). Gifs 3 Data on the platform are categorized (e.g., “reaction” or Despite the widespread use of gifs, no standard “celebrities”) and we identify 28 categories con- dataset exists for text and gif replies. Further, al- taining 972 keywords likely to contain gifs used
used with at least five gifs and where those gifs have been used at least 1000 times in total; this process removes many low-frequency tags that are either overly-specific or idiosyncratic in their use. Finally, we performed a manual inspection of all remaining tags to remove tags that are too general (e.g., “emotion”) and retain only noun, adjective, and verb tags (words or multi-word expressions) that describe specific emotions or actions. A total of 241 unique tags were retained (Appendix C). 6.0% of gifs have at least one tag associated with Figure 2: The frequency distribution of gifs in our data them (mean 1.9 tags). However, these tagged gifs roughly follows a log-normal distribution, with a few account for 38.7% of the replies in our dataset, gifs used often, while a long tail of gifs are used rela- suggesting tags are only available for more-popular tively infrequently. gifs. Our dataset represents roughly an order of magnitude more data and more tags than the closest related dataset of Chen et al. (2017) that contained in conversation. A total of 2,095,993 gifs linked 23K gifs with 17 manually-curated emotions. to those keywords were ultimately retrieved and stored as image hashes. Additional details of cate- 4 Gif Reply Models gories and keywords are in Appendix B. After the matching image hashes to filter replies, We introduce a series of models for producing a gif we identify 115,586 unique gifs, referred to as ref- response in conversation. Each model will select a erence gifs, and 1,562,701 tweet replies using one gif from the 115K gifs in our dataset as a response of these gifs, which forms our official dataset. Fig- to a text-based message. This task is related to but ure 2 shows these gifs’ frequency in the data; much distinct from work on image-text matching (Lee like words, a handful of gifs receive widespread et al., 2018), which aims to find an image describ- use, while a long tail of gifs are rarely used. ing a piece of text, or text-to-image (e.g., Wen et al., 2015; Xu et al., 2018), which generates an image 3.2 Gif Metadata from a text description. Here, we aim to select gifs We augment our gif data with information about that reflect natural continuations or reactions to a their content. Some gifs have text that transcribes message in a dialog, akin to how gifs are used in what a person is saying in the gif’s scene or is a social media. For all models, additional details on meta-commentary on the content. This text is ex- the training procedures and hyperparameters are tracted using paddleOCR (Du et al., 2020). Since provided in Appendix A. The three models that some gifs are long enough to contain multiple utter- follow use varying degrees of information about ances, we run OCR on four frames sampled from the gifs and text to select a response. each quartile of the gif’s length. Roughly 50% (58,020) of gifs contain at least one extracted word 4.1 Tag-based Predictions from the selected frames, with an mean of 5.5 ex- The first model uses tags as a shared representation tracted words per gif across the dataset. for characterizing gifs and text. Analogous to how Second, some gif repositories like Giphy allow object tags are used as anchor points for image-text users to tag gifs with information on their content matching (Li et al., 2020) and pivot languages are or theme, e.g., “face palm” or “movie.” We collect used in machine translation (Cheng et al., 2017), tags for the 115K reference gifs used in Twitter, ob- we use tags to bridge information between the text taining 39,651 unique tags. These user-generated in a tweet and the visual content of a gif. Here, tags were moderately noisy due to orthographic each gif becomes associated with a set of tags de- variations like spelling, capitalization, and spacing. scribing its conversational functions and for each Therefore, we merge tags by (i) lower-casing the text, we predict the set of tags for gifs responses to text and (ii) performing a manual merge for similar it—in essence, predicting what types of responses word forms (e.g., “excited” and “exciting”). To are most appropriate. We describe both of these minimize noise, we retain only tags that have been processes next and how gifs are ultimately selected.
Estimating Gif Tags Only 6.0% of the gifs in our given a tweet, we use the trained tweet encoder to data have associated tags. Therefore we train a extract its representation and compute its cosine neural model to predict tags using known tags as similarity with each encoded representation for our training data. To capture any changes in emotion gifs. The gif with the highest cosine similarity is or imagery across the gif, we make separate pre- returned as the best response. dictions for four frames sampled across the gif (the same used in §3.2). Each frame is passed 4.3 P EPE THE K ING P RAWN through an EfficientNet-based (Tan and Le, 2019) Our final model, K ING P RAWN1 (referred to as GIF encoder, shown in Figure 3, to extract a low- “P EPE”.) selects gif responses by using a richer dimensional feature vector from each frame. These set of multimodal features to create a gif represen- frame embeddings are fused using the attention tation. Rather than encode the gif solely from its mechanism from a transformer encoder layer. The image content, we use a multimodal encoder that output of the transformer feeds into a fully con- captures (i) any text it might have, (ii) the types of nected layer, which is trained as a multi-label clas- objects present in the gif, and (iii) object regions as sifier using binary cross-entropy to predict which visual features. We encode these gif aspects using tags should be present. an O SCAR transformer (Li et al., 2020) to create Predicting Response Tags for Text For each mes- a unified representation, shown in Figure 3 (bot- sage, we predict the k-hot distribution of tags tom). Object names and regions of interest feature for a gif response by training a BERTweet model vectors are extracted using a pre-trained bottom-up (Nguyen et al., 2020), which has been pre-trained attention model (Anderson et al., 2018). on a large corpus of Twitter data (shown as “Tweet As input to the O SCAR encoder, the captions to Encoder" in Figure 3). The model with an addi- each of the gif’s four frames are concatenated to- tional fully connected layer is trained as a multi- gether with an “[INTER_FRAME_SEP]" separator label classifier using binary cross-entropy, using token. We filter object areas detected by the bottom- the tags for the gifs used in reply (if known). up attention model (Anderson et al., 2018) and we Tag-based Gif Selection At inference time, given keep all objects with probability >0.5. We then a message, we use the text-to-tag model to predict concatenate object names together with the same a k-hot distribution over tags. Then, we select the inter-frame separator between names of different gif whose estimated tag distribution is closest in frames. Together, the caption text, object names, Euclidean distance. and image-region features are fed into the O SCAR transformer encoder to generate a GIF feature vec- 4.2 CLIP variant tor; the transformer is initialized with the default O SCAR weights. We use BERTweet to encode text. The second model uses an end-to-end training ap- The entire P EPE model is trained end-to-end using proach based on the architecture of OpenAI CLIP contrastive loss, similar to the CLIP model. (Radford et al., 2021). The architecture features two encoders, one for text and one for images. Dur- 5 Evaluation ing training, the encoders are updated using con- We initially evaluate the methods in two ways. trastive loss that maximizes the cosine similarity of First, we use traditional classification-based eval- paired image-text representations and minimizes uation, testing whether the models can reproduce the cosine similarity of random pairs of images and the observed gif replies. However, some messages texts. We replicate the CLIP architecture and train- could have multiple valid gif responses. Therefore, ing procedure, using BERTweet to encode text and as a second test, we evaluate the model in a retrieval EfficientNet (Tan and Le, 2019) to encode a com- setting, measuring whether its most-probable re- posite image of four frames from the gif (compared sponses are good quality for a message. with BERT and ResNet in their implementation). Experimental Setup Models are trained and While originally designed to select an image for a tested on a dataset containing 1,562,701 Tweet- text description, our model is trained to select a gif reply for a text message—a more challenging task 1 K ING P RAWN refers to “selecKting INteresting Gifs for than the image retrieval task used in the original Personal RespAWNses.” In this crazy muppet-name-land- grab world we live in, our only regret is that we couldn’t CLIP setup, as the message may not contain words get “Pepino Rodrigo Serrano Gonzales” to fit as a bacronym, describing elements of the gif. At inference time, which we leave to future work.
EfficientNet GIF Encoder Tweet Encoder EfficientNet Transformer Tweet EfficientNet BERTweet-base EfficientNet EfficientNet-b0 Encoder We 're getting Layer Transformer (shared weight) married today ! feature vector for feature vector for 4 selected frames downstream task downstream task 4 x 224 x 224 (after reshape) 1 x 512 1 x 512 Oscar GIF Encoder OCR extracted caption Aww , thank you extracted object names Oscar Bottom-up Attention Multimodal face woman [INTER_FRAME_SEP] face woman Transformer extracted object feature vectors feature vector for downstream task 1 x 512 4 selected frames 4 x 224 x 224 (after reshape) Figure 3: The different encoder modules used to construct the models in §4. GIF pairs associated with 115,586 unique gifs, sufficient content to judge appropriateness indepen- where 605,063 tweet-gif pairs are associated dent of the larger social context. with at least one tag. Using the finalized 241 Two annotators (the authors) were shown a list unique tags as classes for multi-label classifica- of potential gif responses for a tweet and asked to tion, we split the dataset by stratify on tags us- judge whether this is an appropriate gif response ing the iterative train-test split method provided by (a binary rating). Gifs were selected from the ten scikit-multilearn library (Sechidis et al., most-probable replies for each system and collec- 2011; Szymański and Kajdanowicz, 2017) to cre- tively shown in random order to prevent knowing ate a 80:10:10 train, dev, and test split which which system generated each reply. A total of 2,500 is finalized to train the models described in §4. gif-tweet pairings were annotated. Annotators at- Following BERTweet (Nguyen et al., 2020), we tained a Krippendorf’s α of 0.462; while moderate preprocess tweets in our dataset using NLTK agreement, this value is expected given known dif- TweetTokenizer for tokenization, emoji ferences in how people interpret and value gif re- package to translate emotion icons, and converted sponses based on their familiarity with its content, mentions and links to special “@USER" and message interpretation, and life-experience (Jiang “HTTPURL" tokens. et al., 2018). We follow the evaluation setup from Annotated Data To test whether each model’s pre- other retrieval-based dialog systems (e.g. Yu et al., dictions are valid responses, we annotate the ten 2021; Kumar and Callan, 2020) and use normal- most-probable gif predictions for a subset of the ized Discounted Cumulative Gain (nDCG), which tweets in our test data. Many tweets in our test set measures whether more appropriate gif responses require substantial context to understand due to hav- are ranked higher. A gif’s appropriateness score is ing few tokens, linking to URLs that provide extra the sum of annotators’ ratings. knowledge, mentioning other users in directed com- Results The P EPE model was able to identify rele- munication. These factors suggest social context vant and good-quality gif responses, as shown by or general knowledge aids in the recipient’s under- its performances on the test data (Table 1) and an- standing of the gif’s intentions. While the model notated data (Table 2). Performance on the test set can still benefit from training on such examples, is expected to be low, given the challenge of identi- judging the appropriateness of response is difficult fying the exact gif used for a tweet when multiple without access to the social context. Therefore, to possible gifs are likely to be equally valid. How- reduce interpretation ambiguity, we annotate only ever, the P EPE model is still able to identify the tweets without URLs or user mentions and having exact gif (out of 115K) in its top 10 predictions at least 10 tokens. This process selects tweets with for 3% of the data, substantially outperforming all
Model Top-1 Top-5 Top-10 Model nDCG Tag-based 0.000000 0.000092 0.000119 P EPE 0.8145 Random 0.000020 0.000059 0.000158 CLIP variant 0.000488 0.001669 0.002783 P EPE without object names 0.7665 Distribution sampling 0.000996 0.005098 0.009780 P EPE without caption 0.7559 P EPE 0.005375 0.018723 0.030918 P EPE without object features 0.7533 Table 1: Models’ precision-at-k on selecting the exact Table 3: Results for ablated versions of P EPE where gif used as a response for a tweet in the test set; this per- specific input is removed (cf. Table 2) show that all in- formance is an underestimate of each model, as many put forms contribute to the ability to select replies. model-predicted gifs may be appropriate. Model nDCG and evaluating the model’s resulting gifs on the same test instances. Random 0.3273 Tag-based 0.4526 The ablated model performances, shown in Ta- Distribution sampling 0.4969 ble 3, reveal that each input is useful for selecting CLIP variant 0.5934 gifs.2 Object features capture visual information P EPE 0.8145 about what specifically is present in the gif (beyond the discrete names of what is present, e.g., “person” Table 2: Models’ nDCG scores at proposing appropri- or “building”) and show that multimodality is im- ate gif replies, measured from annotations on the top portant for high performance—predicting replies 10 most probable gif replies of each model. just from a gif’s caption and categorized content are insufficient. Similarly, the caption of a gif (if present) is important, as the text can help make other models. explicit the intended interpretation of a gif. Performance on the annotated data (Table 2) pro- vides a more realistic assessment of whether mod- 6 Field Experiment els can generate high-quality replies, as it measures whether the models’ replies themselves were good. To test the generalizability of our models and qual- The P EPE model attains substantially higher per- ity of their responses, we conduct a large-scale ran- formance (p
Parent Tweets Tag-based CLIP variant P EPE Dist. Samp. Random That wonderful feeling you get when you ar- rive to a business din- ner that you’re suppos- edly paying for...and re- alize you’ve forgotten your credit card I’m convinced some of y’all don’t get laid Table 4: Model-selected replies to messages (paraphrased for privacy). Click an image to view the gif on Giphy. this score in our experiments to evaluate quality. we use Thompson sampling (Russo et al., 2018) Our experiment focuses on generating Gif-based to randomly select which arm of the trial to use. replies to top-level text comments (comments made Thompson sampling builds a probability model for directly to the post). This setup mirrors the conver- the estimated reward of each arm (here, the score a sational data our models were trained on. Imgur reply receives) and samples from the model such supports several ways of filtering its stream of posts. that higher-rewarding arms are sampled more fre- To ensure that our replies have sufficient visibility, quently. As a result, this method can provide tighter we select posts that have already receive 10 com- estimates for the reward of the most useful arms. ments and appear in the “most viral” sorting. From Scores in Imgur have a skewed distribution, with these posts, we reply to the top-rated text comment. few comments receiving very high scores and most The RCT runs from 8 AM to 8 PM (local time), receiving near the default score (1). Therefore, we making at most 10 replies per hour. use Poisson Thompson sampling. Some comments Not all topics or comments are suitable for auto- may be downvoted to receive scores below zero, so mated responses and great care was taken to pre- for simplicity, we truncate these scores to 0. vent potential harm to the community. Through We initialize the reward estimates for our ex- multiple rounds of testing which replies would be periment by selecting one of the five models in responded to, we curated a list of keywords that a round-robin manner to reply to an Imgur com- could lead to potential controversial replies, such ment for 3 days. These initial scores act as priors as terms about religion or race (full list in Ap- for Thompson sampling to update Poisson distri- pendix D). Any comment containing a token or butions for each model. In the trial, we choose a lemma matching a word on this list is excluded model by sampling from the up distributions using and not replied to. As a further safeguard, exper- all previous days’ scores as the prior. The exper- imenters monitored all replies to remove any that iment ran from April 15th, 2021 to August 30th, were deemed inappropriate. See the Ethics Section 2021, and models generated a total of 8,369 replies. (§9) for a longer discussion of safeguards. To evaluate the results of the RCT, we construct The field experiment consists of five arms, cor- a Negative Binomial regression on the dependent responding to the three trained models and the two variable of the score received for a model’s reply, baseline models. During each trial, one model is se- truncating negative scores to zero. The Negative lected and generates a response; the trained model binomial was chosen instead of Poisson due to replies with the most probable gif.4 over-dispersion in the score variable. The models Not all models are equally likely to perform well are treated as a categorical variable, using the ran- and so to make the most use of our trial budget, dom model as a reference. Since the score will 4 Due to a bug, early experimental trials for the CLIP and depend, in part, on the attention received by the P EPE models used the tenth most-probable gif; however, using parent post and comment (higher-rated comments the ratings in the annotated data, a t-test of the difference in are displayed first), we include linear effects for quality for most- and tenth-most probable gifs showed no statistically-significant difference in quality for both models the post and parent comment. Finally, we include (p>0.1). Therefore, we include this data in our results. five text-related variables to control for the con-
CLIP variant *** 2gG2xiMTtFwsg fnjxvV295sWEJjvwXU BAPSj0xM1cFe8 Dist. Samp. iJsvRxNTAcup6DVfLP GIPHY GIF ID 3oEjHLcg4QMU5umb9m aKrTvuOv4hlKM Pepe *** 3oKIPllDN24q8Awtwc jIu44mYwUItSHTW3tj Tag-based jTrWAzlFGfvVY34PSJ l396L17pwHWOIJrTG 0.2 0.0 0.2 60 40 20 0 20 Coefficient for GIF Reply Score GIF Reply Score (a) Tag-based Figure 4: Negative Binomial regression coefficients for each model on predicting a gif reply’s score, using the lfesfEtobCSbsHzC8d random-gif model as the reference category; bars show m9d3Xif3ShZ42CxlWP standard error and *** denotes significance at 0.01. f9k1tV7HyORcngKF8v loitbnzQ1JQ8Iizx8w GIPHY GIF ID bfrlODgSLqXxS 4HmjGg306HiLHWlm2f tent of the parent comment: the topic distribution 7J26CGAahos6d5S1A6 8hZ9FMolyKc0X8BSr7 (Appendix Table 9) from a 10-topic model (drop- iqkHA3DmB8GjORY030 ping one topic due to collinearity), the sentiment OOzcnk3PzLDHqWs6Tb 30 20 10 0 10 20 and subjectivity of the message estimated using GIF Reply Score TextBlob library, the length of the comment, and (b) CLIP variant whether the comment contained a question. tnYri4n2Frnig 5wWf7GR2nhgamhRnEuA 6.2 Results 5gw0VWGbgNm8w iXTrbbYMQBCMM GIPHY GIF ID The field experiment demonstrates that the P EPE 65ODCwM00NVmEyLsX3 model is able to generate significantly higher- 26AHLBZUC1n53ozi8 3o8doT9BL7dgtolp7O scoring responses. Figure 4 shows the Negative Fq6Bdki3coEWQ Binomial regression coefficients for the three mod- 3oEjHAUOqG3lSS0f1C KzyMcEfDh4Jiw els and empirical distribution baseline, with the 40 20 0 20 GIF Reply Score random gif model as a reference; full regression (c) P EPE results are shown in Appendix Table 6. The P EPE model substantially outperforms all other models Figure 5: Score distributions for most-frequently used (p0.73 for all models). The most-used gifs and the reply (Madden, 2018, p.29). Further, we for each model had average scores that were pos- observed that, when the model’s reply truly seemed itive, but the distributions for each gif show that random, some users replied say they upvoted solely some uses were occasionally downvoted. This high because they enjoyed the gif. variance in scores indicates that a gif’s intrinsic As a follow-up experiment, we tested whether qualities are not solely responsible for the received models could be getting higher (or lower) scores by score and, instead, appropriate use in context is repeatedly picking the same gifs that are skewed plays a significant part in community reception. towards a positive or negative reaction. Figure 5 We examined whether models relied on the same shows the score distribution for the top ten most fre- set of gifs. Figure 6 shows the distribution of gif
isting messages from a large social media corpus as 2.5 Tag-based potential replies and rank these to select a response. log10(frequency of unique gif) CLIP variant Our work mirrors models that use neural networks 2.0 Pepe Dist. Samp. for ranking (Yan et al., 2016; Inaba and Takahashi, 1.5 random 2016; Penha and Hauff, 2021, e.g.,); however, we note that many recent knowledge-grounded and 1.0 open domain models use encoder-decoder meth- 0.5 ods to improve versatility and applicability (e.g., Ghazvininejad et al., 2018; Gao et al., 2019; Zhou 0.0 et al., 2020). Generative approaches are likely in- 0.0 0.5 1.0 1.5 2.0 2.5 3.0 appropriate for gif-based conversation as gifs are log10(rank of unique gif) more akin to mimetic artifacts that build on cultural Figure 6: Gif use frequency by each model, shown knowledge (Eppink, 2014), making synthesizing a as frequency-vs-rank log-scaled with first-order line fit new gif from scratch likely less effective. (jitter added for separation). All three models used here rely on joint embed- ding spaces for gif and text. Multiple works in uses by each model, indicating that the tag-based NLP have been proposed to align these representa- model relied frequently on a small set of gifs. How- tions (Kiros et al., 2014; Wang et al., 2016), often ever, the P EPE and CLIP variant models were sub- for particular applications such as visual question stantially more varied, indicating they draw from answering (Antol et al., 2015). Recent work has the long-tail of possible gifs. focused on embeddings these media with a single Do any of our models spark more subsequent encoder that takes both text and images as input conversation? We fit a separate Negative Binomial (e.g., Wang et al., 2019; Chen et al., 2020), in con- regression on the total number of comments made trast to our model that uses separate image and text to our reply, using the same IVs as the score regres- encoders (Figure 3); these multimodal encoders sion and include the reply’s score itself as another are prohibitively computationally expensive to use IV. This model (Appendix Table 8) shows that both in our setting during inference time, as the model the distributional-sampling baseline and P EPE mod- would need to be run on each gif (and message) to els produced replies that led to fewer subsequent rank replies, compared with our model that only comments (p
9 Ethics ate gif, which is mitigated by the use of Giphy to seed our initial gifs. As this platform is curated The interactive nature of the RCT necessitated and does not host objectively offensive gifs (e.g., a close consideration of ethical issues (Thieltges overly-violent content), our initial gif set is rela- et al., 2016). Prior to beginning the RCT, the study tively free of objectionable gifs. Because our model team obtained IRB approval to interact with users. learns directly from gifs’ frequency of use, unless While necessary in the legal sense, IRB approval objectively offensive gifs are widely used, they are is not sufficient to justify the ethical grounds of unlikely to be deployed from our RCT; we specu- the study. The primary risks of the study are if the late that few objectively offensive gifs are widely automated models respond with an inappropriate used and, in practice, we have not identified any gif or respond to a message that is not suitable for during the study period or when examining hun- automated response (e.g., discussing the death of a dreds of random gifs in our data (or used in the loved one or making an offensive statement). These RCT). risks were mitigated in multiple ways throughout Finally, one risk is that by learning gif responses the dataset construction and field experiment. from observed data, our models may reinforce First, the selection criteria for which comments cultural stereotypes that are encoded in the gifs we reply to was designed to only reply to content themselves (Erinn, 2019), e.g., the association of that was already deemed appropriate by the com- African American individuals with strong emotions. munity. By selecting only posts that had received While our gif data is relatively clean of overtly of- sufficient upvotes to be called “viral” and were fensive gifs, we acknowledge that our model likely already receiving comments, we mitigate the risk does inadvertently perpetuate some of these latent of engaging in topics or conversations that are in- biases in the data. However, the success of our appropriate according to the norms of the Imgur model suggests a future mitigation strategy for plat- community, as these posts would be removed by forms suggesting gifs: as biases become known, moderators or would have received sufficient down- our approach can be used to suggest less-biased votes to stay in obscurity. gifs as potential responses to mitigate future harm. Second, by focusing on the top-voted comment to these posts, we again reply to content that has Acknowledgments already been deemed high-quality by the comment. We thank the reviewers, area chairs, and senior area This comment-level criteria substantially lowers chairs for their thoughtful comments and feedback. the risk of our models commenting on inappropri- We also thank the Blablablab for helpful feedback ate comments (e.g., a comment insulting another and letting us deploy P EPE to the group’s Slack and user), as these comments are readily downvoted by putting up with the ridiculous gif replies and Imgur the community prior to our intervention. for being a wonderful community. This material is Third, we employed extensive filtering to avoid based upon work supported by the National Science replying to any comment containing a potentially Foundation under Grant No. 2007251. sensitive topic, e.g., a discussion of race or trauma (keywords are listed in Appendix D). The initial set of keywords was developed through examining po- References tentially sensitive topics and then iteratively added Peter Anderson, Xiaodong He, Chris Buehler, Damien to by simulating which messages our RCT would Teney, Mark Johnson, Stephen Gould, and Lei reply to and examining whether it would be appro- Zhang. 2018. Bottom-up and top-down attention for priate. During the field RCT, experimenters contin- image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pat- uously monitored the comments to ensure no harm tern Recognition, CVPR 2018, Salt Lake City, UT, was being done. Ultimately, only three comments USA, June 18-22, 2018, pages 6077–6086. IEEE were removed during the initial two days, which Computer Society. was due to a bug in the lemmatization and these Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- comments should have been filtered out by our ear- garet Mitchell, Dhruv Batra, C. Lawrence Zitnick, lier criteria; these comments were removed quickly and Devi Parikh. 2015. VQA: visual question an- swering. In 2015 IEEE International Conference on and we did not observe any notable response from Computer Vision, ICCV 2015, Santiago, Chile, De- the community. cember 7-13, 2015, pages 2425–2433. IEEE Com- Fourth, one risk is replying with an inappropri- puter Society.
Saeideh Bakhshi, David A. Shamma, Lyndon Kennedy, Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Yale Song, Paloma de Juan, and Joseph Jofish Kaye. Ari Holtzman, Yejin Choi, Noah A. Smith, and Mari 2016. Fast, cheap, and good: Why animated gifs en- Ostendorf. 2018. Sounding board: A user-centric gage us. In Proceedings of the 2016 CHI Conference and content-driven social chatbot. In Proceedings of on Human Factors in Computing Systems, San Jose, the 2018 Conference of the North American Chap- CA, USA, May 7-12, 2016, pages 575–586. ACM. ter of the Association for Computational Linguis- tics: Demonstrations, pages 96–100, New Orleans, Michael S Bernstein, Margaret Levi, David Magnus, Louisiana. Association for Computational Linguis- Betsy Rajala, Debra Satz, and Charla Waeiss. 2021. tics. Esr: Ethics and society review of artificial intelli- gence research. ArXiv preprint, abs/2106.11521. Jianfeng Gao, Michel Galley, and Lihong Li. 2019. Neural Approaches to Conversational AI: Ques- Fumihiro Bessho, Tatsuya Harada, and Yasuo Ku- tion Answering, Task-oriented Dialogues and Social niyoshi. 2012. Dialog system using real-time crowd- Chatbots. Now Foundations and Trends. sourcing and Twitter large-scale corpus. In Proceed- ings of the 13th Annual Meeting of the Special Inter- Marjan Ghazvininejad, Chris Brockett, Ming-Wei est Group on Discourse and Dialogue, pages 227– Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and 231, Seoul, South Korea. Association for Computa- Michel Galley. 2018. A knowledge-grounded neural tional Linguistics. conversation model. In Proceedings of the Thirty- Second AAAI Conference on Artificial Intelligence, Elli Bourlai and Susan C Herring. 2014. Multimodal (AAAI-18), the 30th innovative Applications of Arti- communication on tumblr: “i have so many feels!”. ficial Intelligence (IAAI-18), and the 8th AAAI Sym- In Proceedings of the 2014 ACM conference on Web posium on Educational Advances in Artificial Intel- science, pages 171–175. ligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5110–5117. AAAI Press. Weixuan Chen, Ognjen Oggi Rudovic, and Rosalind W Picard. 2017. Gifgif+: Collecting emotional ani- Carla F Griggio, Joanna Mcgrenere, and Wendy E mated gifs with clustered multi-task learning. In Mackay. 2019. Customizations and expression 2017 Seventh International Conference on Affective breakdowns in ecosystems of communication apps. Computing and Intelligent Interaction (ACII), pages Proceedings of the ACM on Human-Computer Inter- 510–517. IEEE. action, 3(CSCW):1–26. Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Tim Highfield and Tama Leaver. 2016. Instagrammat- Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and ics and digital methods: Studying visual social me- Jingjing Liu. 2020. UNITER: Universal Image- dia, from selfies and gifs to memes and emoji. Com- TExt representation learning. In ECCV. munication research and practice, 2(1):47–62. Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Chung Hoon Hong, Yuan Liang, Sagnik Sinha Roy, Wei Xu. 2017. Joint training for pivot-based neural Arushi Jain, Vihang Agarwal, Ryan Draves, Zhizhuo machine translation. In Proceedings of the Twenty- Zhou, William Chen, Yujian Liu, Martha Miracky, Sixth International Joint Conference on Artificial In- et al. 2020. Audrey: A personalized open-domain telligence, IJCAI 2017, Melbourne, Australia, Au- conversational bot. In Alexa Prize Proceedings. gust 19-25, 2017, pages 3974–3980. ijcai.org. Pingping Huang, Jianhui Huang, Yuqing Guo, Min Michelle Cohn, Chun-Yen Chen, and Zhou Yu. 2019. Qiao, and Yong Zhu. 2019. Multi-grained attention A large-scale user study of an Alexa Prize chatbot: with object-level grounding for visual question an- Effect of TTS dynamism on perceived quality of so- swering. In Proceedings of the 57th Annual Meet- cial dialog. In Proceedings of the 20th Annual SIG- ing of the Association for Computational Linguis- dial Meeting on Discourse and Dialogue, pages 293– tics, pages 3595–3600, Florence, Italy. Association 306, Stockholm, Sweden. Association for Computa- for Computational Linguistics. tional Linguistics. Michimasa Inaba and Kenichi Takahashi. 2016. Neural Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, utterance ranking model for conversational dialogue Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua systems. In Proceedings of the 17th Annual Meeting Yang, Qingqing Dang, et al. 2020. PP-OCR: A Prac- of the Special Interest Group on Discourse and Dia- tical Ultra lightweight OCR system. ArXiv preprint, logue, pages 393–403, Los Angeles. Association for abs/2009.09941. Computational Linguistics. Jason Eppink. 2014. A brief history of the gif (so far). Jialun “Aaron" Jiang, Jed R Brubaker, and Casey Journal of visual culture, 13(3):298–306. Fiesler. 2017. Understanding diverse interpretations of animated GIFs. In Proceedings of the 2017 CHI Wong Erinn. 2019. Digital blackface: How 21st cen- Conference Extended Abstracts on Human Factors tury internet language reinforces racism. in Computing Systems, pages 1726–1732.
Jialun “Aaron" Jiang, Casey Fiesler, and Jed R Conference on Empirical Methods in Natural Lan- Brubaker. 2018. “The Perfect One” Understanding guage Processing: System Demonstrations, pages 9– Communication Practices and Challenges with An- 14, Online. Association for Computational Linguis- imated GIFs. Proceedings of the ACM on human- tics. computer interaction, 2(CSCW):1–20. Tianrui Niu, Fangxiang Feng, Lingxuan Li, and Xiaojie Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, and Wang. 2020. Image synthesis from locally related Traci Curl. 1998. Lexical, prosodic, and syntactic texts. In Proceedings of the 2020 International Con- cues for dialog acts. In Discourse Relations and Dis- ference on Multimedia Retrieval, pages 145–153. course Markers. Mahmoud Khademi. 2020. Multimodal neural graph Silvia Pareti and Tatiana Lando. 2018. Dialog intent memory networks for visual question answering. In structure: A hierarchical schema of linked dialog Proceedings of the 58th Annual Meeting of the Asso- acts. In Proceedings of the Eleventh International ciation for Computational Linguistics, pages 7177– Conference on Language Resources and Evaluation 7188, Online. Association for Computational Lin- (LREC 2018), Miyazaki, Japan. European Language guistics. Resources Association (ELRA). Ryan Kiros, Ruslan Salakhutdinov, and Richard S Gustavo Penha and Claudia Hauff. 2021. On the cal- Zemel. 2014. Unifying visual-semantic embeddings ibration and uncertainty of neural learning to rank with multimodal neural language models. ArXiv models for conversational search. In Proceedings preprint, abs/1411.2539. of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Artie Konrad, Susan C Herring, and David Choi. 2020. Main Volume, pages 160–170, Online. Association Sticker and emoji use in facebook messenger: impli- for Computational Linguistics. cations for graphicon change. Journal of Computer- Mediated Communication, 25(3):217–235. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Vaibhav Kumar and Jamie Callan. 2020. Making in- Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, formation seeking easier: An improved pipeline for et al. 2021. Learning transferable visual models conversational search. In Findings of the Associa- from natural language supervision. ArXiv preprint, tion for Computational Linguistics: EMNLP 2020, abs/2103.00020. pages 3971–3980, Online. Association for Compu- tational Linguistics. Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Kuang-Huei Lee, X. Chen, G. Hua, H. Hu, and Xi- Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, aodong He. 2018. Stacked cross attention for image- Behnam Hedayatnia, Ming Cheng, Ashish Nagar, text matching. ArXiv preprint, abs/1803.08024. et al. 2018. Conversational ai: The science behind the alexa prize. ArXiv preprint, abs/1801.03604. Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xi- aowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Li Dong, Furu Wei, et al. 2020. Oscar: Object- Gray, Mark Chen, Rewon Child, Vedant Misra, semantics aligned pre-training for vision-language Pamela Mishkin, Gretchen Kruegerand Sandhini tasks. In European Conference on Computer Vision, Agarwal, and Ilya Sutskever. 2021. DALL-E: Cre- pages 121–137. Springer. ating images from text. https://openai.com/ blog/dall-e/. John Savery Madden. 2018. The Phenomenological Exploration of Animated GIF Use in Computer- Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Mediated Communication. Ph.D. thesis, University Ian Osband, and Zheng Wen. 2018. A tutorial on of Oklahoma. thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96. Luke Melas-Kyriazi, Alexander Rush, and George Han. 2018. Training for diversity in image paragraph cap- Konstantinos Sechidis, Grigorios Tsoumakas, and Ioan- tioning. In Proceedings of the 2018 Conference on nis Vlahavas. 2011. On the stratification of multi- Empirical Methods in Natural Language Processing, label data. Machine Learning and Knowledge Dis- pages 757–761, Brussels, Belgium. Association for covery in Databases, pages 145–158. Computational Linguistics. Kate M Miltner and Tim Highfield. 2017. Never Piyush Sharma, Nan Ding, Sebastian Goodman, and gonna GIF you up: Analyzing the cultural signifi- Radu Soricut. 2018. Conceptual captions: A cance of the animated GIF. Social Media+ Society, cleaned, hypernymed, image alt-text dataset for au- 3(3):2056305117725223. tomatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Compu- Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. tational Linguistics (Volume 1: Long Papers), pages 2020. BERTweet: A pre-trained language model 2556–2565, Melbourne, Australia. Association for for English tweets. In Proceedings of the 2020 Computational Linguistics.
Piotr Szymański and Tomasz Kajdanowicz. 2017. A Rui Yan, Yiping Song, and Hua Wu. 2016. Learning network perspective on stratification of multi-label to respond with deep neural networks for retrieval- data. In First International Workshop on Learn- based human-computer conversation system. In Pro- ing with Imbalanced Domains: Theory and Appli- ceedings of the 39th International ACM SIGIR con- cations, pages 22–35. PMLR. ference on Research and Development in Informa- tion Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, Mingxing Tan and Quoc V. Le. 2019. Efficientnet: 2016, pages 55–64. ACM. Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Conference on Machine Learning, ICML 2019, 9- Zhiyuan Liu. 2021. Few-shot conversational dense 15 June 2019, Long Beach, California, USA, vol- retrieval. ArXiv preprint, abs/2105.04166. ume 97 of Proceedings of Machine Learning Re- search, pages 6105–6114. PMLR. Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, Ying Tang and Khe Foon Hew. 2019. Emoticon, emoji, an empathetic social chatbot. Computational Lin- and sticker use in computer-mediated communica- guistics, 46(1):53–93. tion: A review of theories and research findings. In- ternational Journal of Communication, 13:27. Andree Thieltges, Florian Schmidt, and Simon Hegelich. 2016. The devil’s triangle: Ethical con- siderations on developing bot detection methods. In 2016 AAAI Spring Symposium Series. Jackson Tolins and Patrawat Samermit. 2016. Gifs as embodied enactments in text-mediated conversa- tion. Research on Language and Social Interaction, 49(2):75–91. Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning deep structure-preserving image-text em- beddings. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Ve- gas, NV, USA, June 27-30, 2016, pages 5005–5013. IEEE Computer Society. Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. 2019. CAMP: cross-modal adaptive message passing for text-image retrieval. In 2019 IEEE/CVF Interna- tional Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 5763–5772. IEEE. Miaomiao Wen, Nancy Baym, Omer Tamuz, Jaime Teevan, Susan T Dumais, and Adam Kalai. 2015. Omg ur funny! computer-aided humor with an ap- plication to chat. In ICCC, pages 86–93. Jun Xu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanx- iang Che, and Ting Liu. 2020. Conversational graph grounded policy learning for open-domain conversa- tion generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, pages 1835–1845, Online. Association for Computational Linguistics. Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image genera- tion with attentional generative adversarial networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1316–1324. IEEE Computer Society.
Category Subcategory validation macro-f1 was 0.07 achieved at the 70th epoch. Cartoons & Comics aqua teen hunger force Celebrities richard pryor A.2 CLIP variant Reactions angry The evaluation performance for model selection Emotions happy is measured by nDCG. For every tweet-gif pair in Anime bleach the validation set, we measure the top 30 predicted Art & Design psychedelic GIFs from the model using the tweet as input. The Nature sunrise relevance of an occurring ground truth gif in the Transportation bicycle top 30 predictions given a tweet is set 1 for the nDCG calculation. Table 5: Examples of GIF categories on GIPHY CLIP variant is trained on the same finalized dataset using contrastive loss. It was trained for A Additional Details on Model Training 16 epochs with a batch size of 16 using AdamW optimizer of learning rate 1e-5 and weight decay Following, we provide additional details on how 1e-3. Best validation performance is achieved at each of the three models was trained. epoch 6 with an nDCG value of 0.015. We replace the Transformer encoder layer with A.1 Tag-based Model a linear Layer on Efficient GIF Encoder from Fig- EfficientNet-based Tag Classifier Gifs are re- ure 3, and use this as our GIF Encoder for the shaped to 224 by 224 pixel while keeping the as- CLIP variant. Image inputs to the GIF encoder are pect ratio by padding and normalized to a mean of normalized following the official CLIP implemen- 0.5 and standard deviation of 0.5 for each channel tation. before feeding into the EfficientNet-based model. We selected unique GIFs from the finalized dataset A.3 P EPE that has at least one associated tag and using the The P EPE model follows most configurations from iterative train test split on k-hot tag representation the CLIP variant model, but replace the EffcientNet to select 5% of those GIFs for validation. The Effi- GIF encoder with an Oscar GIF encoder based cientNet tag classifier was trained for 100 epochs on Oscar pre-trained multi-modal transformer (Li on a batch size of 32, using AdamW optimizer et al., 2020). with learning rate 1e-5 and weight decay 1e-3. The Extra metadata are extracted from GIFs in the fi- best validation performance was achieved at the nalized dataset for further training. Captions within 40th epoch with macro-f1 of 0.30 in predicting 241 the GIF are extracted using PaddleOCR (Du et al., multi-label classes. Early experiment shows that 2020), and only extracted text with probability transformer encoder layer (macro-f1 of 0.30) out greater than 0.9 are kept as caption metadata. performs linear layer (macro-f1 of 0.19) in fusing Object tags and their corresponding features are multi-frame gif features on the development set, extracted with bottom-up attention (Anderson et al., therefore transformer encoder layer is used to fuse 2018) using py-bottom-up-attention features of different frames in our implementation. package. Object instances are filtered to only Tweet-to-tag classifier Using the finalized dataset keep instances that have a score higher than mentioned in §3, we use tweet as input, and the 0.5, then object tags and their corresponding k-hot tag representation of that tweet instance as features are extracted from these instances. Final ground truth label to train the multi-label classifier object features of dimension 2054 are obtained along with the tweet encoder for 241 classes. Ad- by concatenating feature output with dimension ditionally, we filter out tweets from the finalized 2048 from Faster-RCNN with scaled box position dataset that do not have corresponding twitter tags coordinates of the object following (Li et al., before training. The model with the best valida- 2020). tion performance is selected to perform subsequent The P EPE model is trained on the finalized evaluation and field experiments. The tweet en- dataset with extracted caption and object metadata. coder was trained for 100 epochs with a batch size It was trained for 16 epochs with a batch size of of 32. The learning rate was set to 1e-5 with 1e-3 8 using AdamW optimizer of learning rate 1e-6 weight decay using AdamW optimizer. The best and weight decay 1e-3. Preprocessing for GIFs is
Tweet-GIF Pairs 42,096,566 pairs Parent Tweet Having a super great day . Child animated GIF (reply) File: tweet_video/EEqo71PWkAAqGab.mp4 AverageHash: 00080c2ce4f0f46410383828c262626300040424b4f4e061 ...... 39,401,680 unique GIF files GIPHY GIF Collections Match Animated GIPHY GIF ID: 3ornjLd54I3eQYmfpC GIF with Hash Tags: #high five #late night with seth meyers Matched AverageHash: 00080c2ce4f0f46410383828c262726300040424b4f4e061 GIPHY GIF ID: 3o6fIQSs4BcsEbDE7S Tags: #real housewives #bravo tv #slice #rhonj Not Matched AverageHash: fbf9f35141008c8bfbf9f35151008c89fbf9f35151008c89 ...... 2,095,993 unique GIF IDs Finalized Dataset 1,562,701 pairs (605,063 pairs associated with selected tags) Parent Tweet: Having a super great day . GIPHY GIF ID: 3ornjLd54I3eQYmfpC Selected Tags: #high five GIPHY Tags Selection Parent Tweet: We 're getting married today ! GIPHY GIF ID: g9582DNuQppxC Selected Tags: #cheers, #drink, #congratulations ...... 115,586 unique GIF IDs Figure 7: A diagram of the pipeline used to collect, canonicalize, and filter gif-reply data from Twitter.
GIPHY GIF (ID: Y9pd1baXUIJTW) Tweet Animated GIF (D0mIkeXX0AIQt3N.mp4) Hamming distance = 9 Matched Image Average Hash: 6864f6ded2d282001f180b3d1c18f07c083e2f1fdfd08c8e GIPHY GIF (ID: 2zR6G1VD9a1ws) Hamming distance = 51.0 Not Matched Image Average Hash: 686cfcfef2d282021f181b3d1c18f07c083e2e1f9fd08c8e Image Average Hash: f0f4fcfcf8e060001018383818183018001a1b1b1a18180c Figure 8: Matching Animated GIFs from Twitter with GIPHY gifs using Image Average Hash the same as the Tag-based model. Max sequence Reactions thumbs down length is set to 256 tokens for the Oscar transformer. Reactions frustrated Best evaluation performance is achieved at epoch Reactions oh snap 12 with an nDCG score of 0.007. Reactions disgusted Reactions rejected B GIF categories on GIPHY Reactions embarrassed Reactions hug Category Subcategory Reactions yolo Reactions what Reactions interested Reactions hair flip Reactions thank you Reactions bored Reactions sarcastic Reactions frown Reactions shocked Reactions slow clap Reactions cool story bro Reactions mic drop Reactions middle finger Reactions goodbye Reactions you got this Reactions meh Reactions whatever Reactions scared Reactions omg Reactions do not want Reactions deal with it Reactions confused Reactions sigh Reactions drunk Reactions oops Reactions wow Reactions angry Reactions mad Reactions finger guns Reactions awesome Reactions good luck Reactions please
You can also read