I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions William Yang Wang and Miaomiao Wen School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract The advent of social media has brought Inter- net memes, a unique social phenomenon, to the front stage of the Web. Embodied in the form of images with text descriptions, little do we know about the “language of memes”. In this paper, we statistically study the correla- tions among popular memes and their word- ings, and generate meme descriptions from Figure 1: An example of the LOL cat memes. raw images. To do this, we take a multi- modal approach—we propose a robust non- include superimposed text with broken grammars paranormal model to learn the stochastic de- and/or spellings. pendencies among the image, the candidate Even though the memes are popular over the In- descriptions, and the popular votes. In experi- ments, we show that combining text and vision ternet, the “language of memes” is still not well- helps identifying popular meme descriptions; understood: there are no systematic studies on pre- that our nonparanormal model is able to learn dicting and generating popular Internet memes from dense and continuous vision features jointly the Natural Language Processing (NLP) and Com- with sparse and discrete text features in a prin- puter Vision (CV) perspectives. cipled manner, outperforming various com- In this paper, we take a multimodal approach to petitive baselines; that our system can gener- predict and generate popular meme descriptions. To ate meme descriptions using a simple pipeline. do this, we collect a set of original meme images, 1 Introduction a list of candidate descriptions, and the correspond- ing votes. We propose a robust nonparanormal ap- In the past few years, Internet memes become a new, proach (Liu et al., 2009) to model the multimodal contagious social phenomenon: it all starts with an stochastic dependencies among images, text, and image with a witty, catchy, or sarcastic sentence, and votes. We then introduce a simple pipeline for gen- people circulate it from friends to friends, colleagues erating meme descriptions combining reverse im- to colleagues, and families to families. Eventually, age search and traditional information retrieval ap- some of them go viral on the Internet. proaches. In empirical experiments, we show that Meme is not only about the funny picture, the our model outperforms strong discriminative base- Internet culture, or the emotion that passes along, lines by very large margins in the regression/ranking but also about the richness and uniqueness of its experiments, and that in the generation experiment, language: it is often highly structured with special the nonparanormal outperforms the second-best su- written style, and forms interesting and subtle con- pervised baseline by 4.35 BLEU points, and obtains notations that resonate among the readers. For ex- a BLEU score improvement of 4.48 over an unsu- ample, the LOL cat memes (e.g., Figure 1) often pervised recurrent neural network language model
trained on a large meme corpus that is almost 90 and Zitnick, 2014; Donahue et al., 2014; Fang et times larger. Our contributions are three-fold: al., 2014; Karpathy and Fei-Fei, 2014) using neural network models. Although the above studies have • We are the first to study the “language of shown interesting results, our task is arguably more memes” combining NLP, CV, and machine complex than generating text descriptions: in ad- learning techniques, and show that combining dition to the visual and textual signals, we have to the visual and textual signals helps identifying model the popular votes as a third dimension for popular meme descriptions; learning. For example, we cannot simply train a con- • Our approach empowers Internet users to select volutional neural network image parser on billions better wordings and generate new memes auto- of images, and use recurrent neural networks to gen- matically; erate texts such as “There is a white cat sitting next • Our proposed robust nonparanormal model to a laptop.” for Figure 1. Additionally, since not outperforms competitive baselines for predict- all images are suitable as meme images, collecting ing and generating popular meme descriptions. training images is also more challenging in our task. In contrast to prior work, we take a very In the next section, we outline related work. In different approach: we investigate copula meth- Section 3, we introduce the theory of copula, and ods (Schweizer and Sklar, 1983; Nelsen, 1999), in our nonparanormal approach. In Section 4, we de- particular, the nonparanormals (Liu et al., 2009), for scribe the datasets. We show the prediction and gen- joint modeling of raw images, text descriptions, and eration results in Section 5 and Section 6. Finally, popular votes. Copula is a statistical framework for we conclude in Section 7. analyzing random variables from Statistics (Liu et 2 Related Work al., 2012), and often used in Economics (Chen and Fan, 2006). Only until very recently, researchers Although the language of Internet memes is a rel- from the machine learning and information retrieval atively new research topic, our work is broadly re- communities (Ghahramani et al., 2012; Han et al., lated to studies on predicting popular social media 2012; Eickhoff et al., 2013). start to understand the messages (Hong et al., 2011; Bakshy et al., 2011; theory and the predictive power of copula models. Artzi et al., 2012). Most recently, Tan et al. (2014) Wang and Hua (2014) are the first to introduce semi- study the effect on wordings for Tweets. However, parametric Gaussian copula (a.k.a. nonparanormals) none of the above studies have investigated multi- for text prediction. However, their approach may modal approaches that combine text and vision. be prone to overfitting. In this work, we generalize Recently, there has been growing interests in Wang and Hua’s method to jointly model text and inter-disciplinary research on generating image de- vision features with popular votes, while scaling up scriptions. Gupta el al. (2009) have studied the prob- the model using effective dropout regularization. lem of constructing plots from video understand- ing. The work by Farhadi et al. (2010) is among 3 Our Approach the first to generate sentences from images. Kulka- rni et al. (2011) use linguistic constraints and a con- A key challenge for joint modeling of text and vision ditional random field model for the task, whereas is that, because textual features are often relatively Mitchell et al. (2012) leverage syntactic information sparse and discrete, while visual features are typi- and co-occurrence statistics and Dodge et al. (2012) cally dense and continuous, it is difficult to model use a large text corpus and CV algorithms for detect- them jointly in a principled way. ing visual text. With the surge of interests in deep To avoid comparing “apple and oranges” in the learning techniques in NLP (Socher et al., 2013; De- same probabilistic space, we propose the non- vlin et al., 2014) and CV (Krizhevsky et al., 2012; paranormal approach, which extends the Gaussian Oquab et al., 2013), there have been several unref- graphical model by transforming its variables by ereed manuscripts on parsing images and generating smooth functions. More specifically, for each di- text descriptions lately (Vinyals et al., 2014; Chen mension of textual and visual features, instead of
Figure 2: Our nonparanormal method extends Gaussian by transforming each dimension with a smooth function, Figure 3: An example of the standard SIFT keypoints de- and jointly models the stochastic dependencies among tected on the “doge” meme. textual and visual features, as well as the popular votes memes, we have also extracted typed depen- by the crowd. dency triples (e.g., subj(I,are)) using the Malt- using raw counts or histograms, we first use prob- Parser (Nivre et al., 2007). ability integral transform to generate empirical cu- • Named Entity Features: after browsing the mulative density functions (ECDF): now instead of dataset, we notice that certain names are of- the probability density function (PDF) space, we are ten mentioned in memes (e.g. “Drake”, “Kenye working in the ECDF space where the value of each West”, and “Justin Bieber”), so we utilize the feature is based on the rank, and is strictly restricted Stanford named entity recognizer (Finkel et al., between 0 and 1. Then, we use kernel density esti- 2005) to extract lexicalized named entities. mation to smooth out the zeroing features1 . Finally, • Frame-Semantics Features: SEMAFOR (Das now textual and visual features are compatible, and et al., 2010) is a state-of-the-art frame- we then build a parametric Gaussian copula model semantics parser that produces FrameNet-style to estimate the pair-wise correlations among the co- semantic annotation. We use SEMAFOR to ex- variate and the dependent variable. tract frame-level semantic features. In this section, we first explain the visual and tex- tual features used in this study. Then, we introduce Visual Features A key insight on viral memes is the theory of copula, and describe the robust non- that the images producing a shared social signal are paranormal. Finally, we show a simple pipeline for typically inter-related in style. For example, LOL- generating meme descriptions. cats are an early series of memes involving funny cat photos. Similarly, “Bieber memes” involve modified 3.1 Features pictures of Bieber. Textual Features To model the meme descriptions, Therefore, we hypothesize that, by extracting vi- we take a broad range of textual features into con- sual features, it is of crucial importance to capture siderations: the entities, objects, and styles as visual words in • Lexical Features: we extract unigrams and bi- these inter-related meme images. The popular vi- grams from meme descriptions as surface-level sual bag-of-words representation (Sivic and Zisser- lexical features. man, 2003) is used to describe images: • Part-of-Speech Features: to model shallow 1. PHOW Features Extraction: unlike text fea- syntactic cues, we extract lexicalized part-of- tures, SIFT first detects the Harris keypoints speech features using the Stanford part-of- from an image, and then describes each key- speech tagger (Toutanova et al., 2003). point with a vector. An example of the SIFT frames are shown in Figure 3. PHOW (Bosch • Dependency Triples: to better understand the et al., 2007) is a dense and multi-scale vari- deeper syntactic dependencies of keywords in ant of the Scale Invariant Feature Transform 1 This is necessary for the normal inversion of the ECDFs, (SIFT) descriptors. Using PHOW, we obtain which we will describe in Section 3.2. about 20K keypoints for each image.
2. Elkan K-means Clustering is the clustering C[F1 (x1 ), ..., Fn (xn )] defines a multivariate method (Elkan, 2003) that we use to obtain cumulative distribution function. the vocabulary for visual words. Compar- ing to other variants of K-means, this method 3.3 The Nonparanormal quickly constructs the codebook from PHOW To model multivariate text and vision variables, keypoints. we choose the nonparanormal (NPN) as the copula 3. Bag-of-Words Histograms are used to repre- function in this study, which can be explained in the sent each image. We match the PHOW key- following two parts. points of each image with the vocabulary that The Nonparametric Estimation we extract from the previous step, and generate a 1 × 200 sized visual bag-of-words vector. Assume we have n random variables of vision and 3.2 The Theory of Copula text features X1 , X2 , ..., Xn . The problem is that text features are sparse, so we need to perform non- In the Statistics literature, copula is widely known parametric kernel density estimation to smooth out as a family of distribution function. The idea be- the distribution of each variable. Let f1 , f2 , ..., fn hind copula theory is that the cumulative distribu- be the unknown density, we are interested in deriv- tion function (CDF) of a random vector can be rep- ing the shape of these functions. Assume we have m resented in the form of uniform marginal cumula- samples, the kernel density estimator can be defined tive distribution functions, and a copula that con- as: nects these marginal CDFs, which describes the cor- m relations among the input random variables. How- 1 X ever, in order to have a valid multivariate distribution fˆh (x) = Kh (x − xi ) (2) m i=1 function regardless of n-dimensional covariates, not m ! every function can be used as a copula function. The 1 X x − xi = K (3) central idea behind copula, therefore, can be sum- mh h i=1 marize by the Sklar’s theorem and the corollary. Theorem 1 (Sklar’s Theorem (1959)) Let F be Here, K(·) is the kernel function, where in our case, the joint cumulative distribution function of n ran- we use the Box kernel2 K(z): dom variables X1 , X2 , ..., Xn . Let the correspond- 1 ing marginal cumulative distribution functions of K(z) = , |z| ≤ 1, (4) 2 the random variable be F1 (x1 ), F2 (x2 ), ..., Fn (xn ). = 0, |z| > 1. (5) Then, if the marginal functions are continuous, there exists a unique copula C, such that Comparing to the Gaussian kernel and other kernels, F (x1 , ..., xn ) = C[F1 (x1 ), ..., Fn (xn )]. (1) the Box kernel is simple, and computationally in- expensive. The parameter h is the bandwidth for Furthermore, if the distributions are continuous, the smoothing3 . multivariate dependency structure and the marginals Now, we can derive the empirical cumulative dis- might be separated, and the copula can be consid- tribution functions ered independent of the marginals (Joe, 1997; Parsa and Klugman, 2011). Therefore, the copula does not F̂X1 (fˆ1 (X1 )), F̂X2 (fˆ2 (X2 )), ..., F̂Xn (fˆn (Xn )) have requirements on the marginal distributions, and any arbitrary marginals can be combined and their of the smoothed covariates, as well as the dependent dependency structure can be modeled using the cop- variable y (which is the reciprocal rank of the pop- ula. The inverse of Sklar’s Theorem is also true in ular votes of a meme) and its CDF F̂y (fˆ(y)). The the following: 2 It is also known as the original Parzen windows (Parzen, Corollary 1 If there exists a copula C : (0, 1)n 1962). and marginal cumulative distribution func- 3 In our implementation, we use the default h of the Box tions F1 (x1 ), F2 (x2 ), ..., Fn (xn ), then kernel in the ksdensity function in Matlab.
empirical cumulative distribution functions are de- shirani, 1996). While Lasso is widely used, the non- fined as: differentiable nature of the L1 norm often make the m objective function difficult to optimize. In this work, 1 X we propose dropout training (Hinton et al., 2012) F̂ (ν) = I{xi ≤ ν} (6) m as copula regularization. Dropout was proposed by i=1 Hinton et al. as a method to prevent feature co- where I{·} is the indicator function, and ν indicates adaptation in the deep learning framework, but re- the current value that we are evaluating. Note that cently studies (Wager et al., 2013) also show that its the above step is also known as probability integral behaviour is similar to L2 regularization, and can be transform (Diebold et al., 1997), which allows us to approximated efficiently (Wang and Manning, 2013) convert any given continuous distribution to random in many other machine learning tasks. Another ad- variables having a uniform distribution. This is cru- vantage of dropout training is that, unlike Lasso, it cial for text: instead of using the raw counts, we are does not require all the features for training, and now working with uniform marginal CDFs, which training is “embarrassingly” parallelizable. helps coping with the overfitting issue due to noise In Gaussian copula estimation context, we can in- and data sparsity. We also use the same procedure to troduce another dimension `: the number of dropout transform the vision features into CDF space to be learners, to extend the Σ into a dropout tensor. Es- compatible with text features. sentially, the task becomes the estimation of The Robust Estimation of Copula Σ1 , Σ2 , ..., Σ` Now that we have obtained the marginals, and where the input feature space for each dropout com- then the joint distribution can be constructed by ap- ponent is randomly corrupted by (1 − δ) percent of plying the copula function that models the stochastic the original dimension. In the inference time, we dependencies among marginal CDFs: use geometric mean to average the predictions from each dropout learner, and generate the final predic- F̂ (fˆ1 (X1 ), ..., fˆ1 (Xn ), fˆ(y)) tion. Note that the final Σ matrix has to be symmet- = C[F̂X1 fˆ1 (X1 ) , ..., F̂Xn fˆn (Xn ) , F̂y fˆy (y) ] ric and positive definite, so we apply tiny random (7) Gaussian noise to maintain the property. In this work, we apply the parametric Gaussian cop- Computational Complexity ula to model the correlations among the text features One important question regarding the proposed and the label. Assume xi is the smoothed version of nonparanormal model is the corresponding compu- random variable Xi , and y is the smoothed label, we tational complexity. This boils down to the es- have: timation of the Σ̂ matrix (Liu et al., 2012): one F (x1 , ..., xn , y) only needs to calculate the correlation coefficients of n(n − 1)/2 pairs of random variables. Chris- = ΦΣ Φ−1 [Fx1 (x1 )], ..., , Φ−1 [Fxn (xn )], Φ−1 [Fy (y)] (8) tensen (2005) shows that sorting and balanced bi- where ΦΣ is the joint cumulative distribution func- nary trees can be used to calculate the correlation tion of a multivariate Gaussian with zero mean and coefficients with complexity of O(n log n). There- Σ variance. Φ−1 is the inverse CDF of a standard fore, the computational complexity of MLE for the Gaussian. In this parametric part of the model, the proposed model is O(n log n). parameter estimation boils down to the problem of Efficient Approximate Inference learning the covariance matrix Σ of this Gaussian In this prediction task, in order to perform copula. In this work, we perform standard maxi- the exact inference of the conditional probabil- mum likelihood estimation (MLE) for the Σ matrix, ity distribution p(Fy (y)|Fx1 (x1 ), ..., Fxn (xn )), where we follow the details from prior work (Wang one needs to solve the mean response and Hua, 2014). Ê(Fy (y)|Fx1 (x1 ), ..., Fx1 (x1 )) from a joint To avoid overfitting, traditionally, one resorts to distribution of high-dimensional Gaussian cop- classic regularization techniques such as Lasso (Tib- ula. Unfortunately, the exact inference can be
query image with all possible images with their cap- tions in Google’s database, a “Best Guess” of the keywords in the image is then revealed. Using the extracted image keywords, we further query a TF-IDF based Lucene5 meme search en- gine, which we indexed with a large number of Web- crawled meme descriptions. After we obtain the candidate generations, we then extract all the text and vision features that we described in Section 3.1. Finally, our nonparanormal model ranks all possible candidates, and selects the final generation with the highest posterior. Figure 4: Our pipeline for generating memes from raw 4 Datasets images. We collected meme images and text descriptions6 intractable in the multivariate case, and approximate from two popular meme websites7 . In the predic- inference, such as Markov Chain Monte Carlo tion experiment, we use 3,008 image-description sampling (Gelfand and Smith, 1990; Pitt et al., pairs for training, and 526 image-description pairs 2006) is often used for posterior inference. In this for testing. In the generation experiment, we use work, we propose an efficient sampling method 269,473 meme descriptions to index the meme to derive y given the text features — we sample search engine, and 50 randomly selected images for Fyˆ(y) s.t. it maximizes the joint high-dimensional testing. During training, we convert the raw counts Gaussian copula density: of popular votes into reciprocal ranks (e.g., the most 1 1 popular text descriptions will all have a reciprocal exp − ∆T · Σ−1 − I · ∆ arg max √ rank of 1, and n-th popular one will have a score of Fyˆ(y)∈(0,1) det Σ 2 1/n). (9) where 5 Prediction Experiments Φ−1 (F x1 (x1 )) In the first experiment, we compare the proposed .. NPN with various baselines in a prediction task, ∆= . Φ−1 (Fxn (xn )) since prior literature (Hodosh et al., 2013) also sug- Φ−1 (Fy (y)) gests using ranking based evaluation for associating images with text descriptions. Throughout the ex- This approximate inference scheme using max- periment sections, we set ` = 10, and δ = 80 as the imum density sampling from the Gaussian copula dropout hyperparameters. significantly relaxes the complexity of inference. Fi- nally, to derive ŷ, the last step is to compute the Baselines: inverse CDF of Fyˆ(y). A detailed description of The baselines are standard squared-loss lin- the inference algorithm can be found in our prior ear regression, linear kernel SVM, and non-linear work (Wang and Hua, 2014). (Gaussian) kernel SVM. In a recent empirical study (Fernández-Delgado et al., 2014) that evalu- 3.4 A Simple Meme Generation Pipeline ates 179 classifiers from 17 families on 121 UCI Now after we train a nonparanormal model for rank- datasets, the authors find that Gaussian SVM is one ing meme descriptions, we show the simple meme of the top performing classifiers. We use the Sta- generation pipeline in Figure 4. tistical Toolbox’s linear regression implementation Given a test image, we disguise as the Internet in Matlab, and LibSVM (Chang and Lin, 2011) for Explorer, and query Google’s “Search By Image” 5 http://lucene.apache.org/ inverse image search service4 . By comparing the 6 http://www.cs.cmu.edu/˜yww/data/meme dataset.zip. 4 7 http://www.google.com/imghp/ memegenerator.net and cheezburger.com
training and testing the SVM models. The hyperpa- Feature Sets LR LSVM GSVM NPN rameter C in linear SVM, and the γ and C hyperpa- Unigrams 0.152 0.158 0.176 0.241* rameters in Gaussian SVM are tuned on the training + Bigrams 0.163 0.248 0.279 0.318* + Named Entities 0.188 0.296 0.312 0.339* set using 10-fold cross-validation. + Part-of-Speech 0.184 0.318 0.337 0.343 Evaluation Metrics: + Dependency 0.191 0.322 0.348 0.350 + Semantics 0.183 0.368 0.388 0.367 Spearman’s correlation (Hogg and Craig, 1994) All Text + Vision 0.413 0.415 0.451 0.754* and Kendall’s tau (Kendall, 1938) have been widely used in many real-valued prediction (regression) Unigrams 0.102 0.105 0.118 0.181* problems in NLP (Albrecht and Hwa, 2007; Yo- + Bigrams 0.115 0.164 0.187 0.237* gatama et al., 2011), and here we use them to mea- + Named Entities 0.127 0.202 0.213 0.248* sure the quality of predicted values ŷ by comparing + Part-of-Speech 0.125 0.218 0.232 0.239 + Dependency 0.130 0.223 0.242 0.255 to the vector of ground truth y. Kendall’s tau is a + Semantics 0.124 0.257 0.270 0.270 nonparametric statistical metric that have shown to All Text + Vision 0.284 0.288 0.314 0.580* be inexpensive, robust, and representation indepen- dent (Lapata, 2006). We use paired two-tailed t-test Table 1: The Spearman correlation (top table) and to measure the statistical significance. Kendall’s τ (bottom table) for comparing various text fea- tures and combining with vision features. The best results 5.1 Comparison with Various Baselines of each row are highlighted in bold. * indicates p < .001 The first two figures in Figure 5 show the learn- comparing to the second best result. ing curve of our system, comparing other baselines. We see that when increasing the amount of training mance for associating popular votes, meme images, data, our approach clearly dominates all other meth- and text descriptions. ods by a large margin. Linear and Gaussian SVMs 5.3 The Effects of Dropout Training for perform similarly, and have good performances with Nonparanormals only 25% of the training data, but the improvements are not large when increasing the amount of training As we mentioned before, because NPNs model the data. complex network of random variables, a key issue In the last two figures in Figure 5, we increase for training NPN is to prevent the model from over- the amount of features, and compare various mod- fitting to the training data. So far, none of the prior els. We see that the linear regression model overfits work have investigated dropout training for regular- with 600 features, and Gaussian SVM outperforms izing the nonparanormals or even copula in general. the linear SVM. We see that our NPN model clearly To empirical test the effects of dropout training for outperforms all baselines by a big gap, and does not nonparanormals, in addition to our datasets, we also overfit. compare with the unregularized copula from Wang and Hua (2014) on predicting financial risks from 5.2 Combination of Text and Vision earnings calls. Table 2 clearly suggests that dropout In Table 1, we systematically compare the contribu- training for NPNs significant improves the perfor- tions of each feature set. First, we see that bigram mances on various datasets. features clearly improve the performance on top of unigram features. Second, named entities are crucial 5.4 Qualitative Analysis for further boosting the performance. Third, adding Table 3 shows the top ranked text features that are the shallow part-of-speech features does not benefit highly correlated with popular votes. We see that the all models, but the dependency triples are shown to named entity features are useful: Paul Walker, UPS, be useful for all methods. Finally, we see that using Bruce Willis, Pencil Guy, Amy Winehouse are rec- semantic features helps increasing the performances ognized as entities in the meme dataset. Dependency for most of the cases, and combining text and vision triples, as a less-understood feature set, also perform features in our NPN framework doubles the perfor- well in this task. For example, xcomp(tell,mean)
Figure 5: Two figures on the left: varying the amount of training data. L(1): Spearman. L(2): Kendall. Two figures on the right: varying the amount of features. R(1): Spearman. R(2): Kendall. Datasets No Dropout With Dropout Top 1-10 Top 11-20 Top 21-30 Meme 0.625 0.754* paul/PER FE party you new Finance (pre2009) 0.416 0.482* xcomp(tell,mean) dep(when,but) FE Entity it Finance (2009) 0.412 0.445* possessive(’s,it) ... : bruce/PER Finance (post2009) 0.377 0.409* yo daw FE Theme i FE party we Meme 0.491 0.580* pobj(vegas,in) on a FE Food fat Finance (pre2009) 0.307 0.349* ups/ORG FE Exp. they make Finance (2009) 0.302 0.318* into FE Entity you so you Finance (post2009) 0.282 0.297* so you’re how penci/PER FE Cognizer i of the y Table 2: The effects of dropout training for NPNs on yo . pobj(life,of) winehouse/PER meme and other datasets. The best results of each row are highlighted in bold. * indicates p < .001 comparing Table 3: Top-30 linguistic features that are highly corre- to the no dropout setting. lated with the popular votes. captures the dependency relation of the popular of memes, and Web dialect such as “y” (why) also meme series “You mean to tell me...”. Interestingly, exhibits high correlation with the popular votes. the transitional dependency feature dep(when,but) 6 Generation Experiments plays an important role in the language of memes. In this section, we investigate the performance of The object of a preposition, such as pobj(vegas,in) our meme generation system using 50 test meme and pobj(life,of), also made the list. images. To quantitatively evaluate our system, we Bigrams are shown to be important features as compare with both unsupervised and supervised usual. For example, “Yo daw” is a popular meme baselines. For the unsupervised baselines, we com- based on rapper Xzibit’s famous reality car show pare with a compact recurrent neural network lan- “Pimp My Ride”, where the rapper customizes peo- guage model (RNNLM) (Mikolov, 2012) trained on ple’s car according to personal preferences. This vi- the 3,008 text descriptions of our meme training set, ral meme follows the pattern8 of “Yo daw(g), I herd as well as a full model of RNNLM trained on a large you like X (noun), so I put an X in your Y (noun) meme corpus of 269K sentences9 . For the super- so you can W (verb) while you Z (verb).” vised baselines, all models are trained on the 3,008 The use of pronouns, captured by frame semantics training image-description pairs with labels. All features, is associated with popular memes. We hy- these models can be viewed as different re-ranking pothesize that by using pronouns such as “i”, “you”, methods for the retrieved candidate descriptions. We “we”, and “they”, the meme recalls personal expe- use BLEU score (Papineni et al., 2002) as the evalu- riences and emotions, thus connects better with the ation metric, since the generation task can be viewed audience. Finally, we see that the punctuation bi- as translating raw images into sentences, and it is gram “... :” is an important feature in the language 9 Note that there are no image features feeding to the unsu- 8 http://knowyourmeme.com/memes/xzibit-yo-dawg pervised RNN models.
BLEU points over the full RNNLM, which is trained on a corpus that is ∼90 times larger, in an unsuper- vised fashion. When breaking down the results, we see that our NPN’s advantage is on generating longer phrases, typically trigrams and four-grams, compar- ing to the other models. This is very interesting, be- cause generating high-quality long phrases is diffi- cult, since the memes are often short. We show some generation examples in Figure 6. We see that on the left column, the reference memes are the ones with top votes by the crowd. The first chemistry cat meme includes puns, the second for- ever alone meme includes reference to the life sim- ulation video game, while the last Batman meme has interesting conversations. In the second col- umn, we see that the memes generated by the full RNNLM model are short, which corresponds to the quantitative results in Table 4. In the third col- umn, our NPN meme generator was able to gen- Figure 6: Examples from the meme generation exper- erate longer descriptions. Interestingly, it also cre- iment. First row: the chemistry cat meme. Second row: the forever alone meme. Third row: the Batman ates a pun for the chemistry cat meme. Our genera- slaps Robin meme. Left column: human generated top- tion on the forever alone meme is also accurate. In voted meme descriptions on memegenerator.net at the the Batman example, we show that the NPN model time of writing. Middle column: generated output from makes a sentence-image-mismatch type of error: al- RNNLM. Right column: generated output from NPNs. though the generated sentence includes the entities Batman and Robin, as well as their slapping activ- used in many caption generation studies (Vinyals ity, it was originally created for the “overly attached et al., 2014; Chen and Zitnick, 2014; Donahue et girlfriend” meme10 . al., 2014; Fang et al., 2014; Karpathy and Fei-Fei, 7 Conclusions 2014). The generation result is shown in Table 4. Note In this paper, we study the language of memes that when combining B-1 to B-4 scores, BLEU in- by jointly learning the image, the description, and cludes a brevity penalty as described in the original the popular votes. In particular, we propose a ro- BLEU paper. We see that our NPN model outper- bust nonparanormal approach to transform all vi- forms the best supervised baseline by 4.35 BLEU sion and text features into the cumulative density points, while also obtaining an advantage of 4.48 function space. By learning the stochastic depen- dencies, we show that our model significantly out- performs various competitive baselines in the pre- Systems BLEU B-1 B-2 B-3 B-4 diction experiments. In addition, we also propose RNN-C 19.52 62.2 21.2 12.1 9.0 a simple pipeline for generating memes from raw RNN-F 23.76 72.2 31.4* 16.2 8.7 LR 23.89 72.3 28.3 15.0 10.6 images, drawing the wisdom from reverse image LSVM 21.06 65.0 24.8 13.1 9.3 search and traditional information retrieval perspec- GSVM 20.63 66.2 22.8 12.8 9.3 tives. Finally, we show that our model obtains sig- NPN 28.24* 66.9 29.0 19.7* 16.6* nificant BLEU point improvements over an unsuper- vised RNNLM baseline trained on a larger corpus, Table 4: The BLEU scores for generating memes from as well as other strong supervised baselines. images. B-1 to B-4: BLEU unigram to four-grams. The 10 best BLEU results are highlighted in bold. * indicates http://www.overlyattachedgirlfriend.com p < .001 comparing to the second best system.
References Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Sri- vastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xi- Joshua Albrecht and Rebecca Hwa. 2007. Regression for aodong He, Margaret Mitchell, John Platt, et al. 2014. sentence-level mt evaluation with pseudo references. From captions to visual concepts and back. arXiv In Proceedings of ACL. preprint arXiv:1411.4952. Yoav Artzi, Patrick Pantel, and Michael Gamon. 2012. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Predicting responses to microblog posts. In Proceed- Peter Young, Cyrus Rashtchian, Julia Hockenmaier, ings of NAACL-HLT. and David Forsyth. 2010. Every picture tells a Eytan Bakshy, Jake M Hofman, Winter A Mason, and story: Generating sentences from images. In Com- Duncan J Watts. 2011. Everyone’s an influencer: puter Vision–ECCV 2010, pages 15–29. Springer. quantifying influence on twitter. In Proceedings of Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, WSDM, pages 65–74. ACM. and Dinani Amorim. 2014. Do we need hun- Anna Bosch, Andrew Zisserman, and Xavier Munoz. dreds of classifiers to solve real world classification 2007. Image classification using random forests and problems? Journal of Machine Learning Research, ferns. 15:3133–3181. Chih-Chung Chang and Chih-Jen Lin. 2011. Libsvm: a Jenny Rose Finkel, Trond Grenager, and Christopher library for support vector machines. ACM TIST. Manning. 2005. Incorporating non-local information Xiaohong Chen and Yanqin Fan. 2006. Estimation into information extraction systems by gibbs sampling. of copula-based semiparametric time series models. In Proceedings of the 43rd Annual Meeting on Associ- Journal of Econometrics. ation for Computational Linguistics, pages 363–370. Association for Computational Linguistics. Xinlei Chen and C Lawrence Zitnick. 2014. Learning a recurrent visual representation for image caption gen- Alan Gelfand and Adrian Smith. 1990. Sampling-based eration. arXiv preprint arXiv:1411.5654. approaches to calculating marginal densities. Journal of the American statistical association. David Christensen. 2005. Fast algorithms for the calcu- Zoubin Ghahramani, Barnabás Póczos, and Jeff Schnei- lation of kendalls τ . Computational Statistics. der. 2012. Copula-based kernel dependency mea- Dipanjan Das, Nathan Schneider, Desai Chen, and sures. In Proceedings of the 29th International Con- Noah A Smith. 2010. Probabilistic frame-semantic ference on Machine Learning. parsing. In Proceedings of NAACL-HLT. Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Larry S Davis. 2009. Understanding videos, con- Lamar, Richard Schwartz, and John Makhoul. 2014. structing plots learning a visually grounded storyline Fast and robust neural network joint models for statis- model from annotated videos. In Computer Vision and tical machine translation. In Proceedings of ACL. Pattern Recognition, 2009. CVPR 2009. IEEE Confer- Francis X Diebold, Todd A Gunther, and Anthony S Tay. ence on, pages 2012–2019. IEEE. 1997. Evaluating density forecasts. Fang Han, Tuo Zhao, and Han Liu. 2012. Coda: High Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Men- dimensional copula discriminant analysis. Journal of sch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Machine Learning Research. Yejin Choi, Hal Daumé III, Alexander C Berg, et al. Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, 2012. Detecting visual text. In Proceedings of the Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. NAACL-HLT. Improving neural networks by preventing co- Jeff Donahue, Lisa Anne Hendricks, Sergio Guadar- adaptation of feature detectors. arXiv preprint rama, Marcus Rohrbach, Subhashini Venugopalan, arXiv:1207.0580. Kate Saenko, and Trevor Darrell. 2014. Long-term re- Micah Hodosh, Peter Young, and Julia Hockenmaier. current convolutional networks for visual recognition 2013. Framing image description as a ranking task: and description. arXiv preprint arXiv:1411.4389. Data, models and evaluation metrics. J. Artif. Intell. Carsten Eickhoff, Arjen P. de Vries, and Kevyn Collins- Res.(JAIR), 47:853–899. Thompson. 2013. Copulas for information retrieval. Robert V Hogg and Allen Craig. 1994. Introduction to In Proceedings of the 36th International ACM SIGIR mathematical statistics. Conference on Research and Development in Informa- Liangjie Hong, Ovidiu Dan, and Brian D Davison. 2011. tion Retrieval. Predicting popular messages in twitter. In Proceedings Charles Elkan. 2003. Using the triangle inequality to of WWW. accelerate k-means. In ICML, volume 3, pages 147– Harry Joe. 1997. Multivariate models and dependence 153. concepts.
Andrej Karpathy and Li Fei-Fei. 2014. Deep visual- Michael Pitt, David Chan, and Robert Kohn. 2006. Effi- semantic alignments for generating image descrip- cient bayesian inference for gaussian copula regression tions. Stanford University Technical Report. models. Biometrika. Maurice Kendall. 1938. A new measure of rank correla- Berthold Schweizer and Abe Sklar. 1983. Probabilistic tion. Biometrika. metric spaces. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Josef Sivic and Andrew Zisserman. 2003. Video google: 2012. Imagenet classification with deep convolutional A text retrieval approach to object matching in videos. neural networks. In Advances in neural information In Proceedings of ICCV, pages 1470–1477. IEEE. processing systems, pages 1097–1105. Abe Sklar. 1959. Fonctions de répartition à n dimen- Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming sions et leurs marges. Université Paris 8. Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Richard Socher, Alex Perelygin, Jean Y Wu, Jason 2011. Baby talk: Understanding and generating im- Chuang, Christopher D Manning, Andrew Y Ng, and age descriptions. In Proceedings of the 24th CVPR. Christopher Potts. 2013. Recursive deep models for Citeseer. semantic compositionality over a sentiment treebank. Mirella Lapata. 2006. Automatic evaluation of informa- In Proceedings of EMNLP, pages 1631–1642. Cite- tion ordering: Kendall’s tau. Computational Linguis- seer. tics. Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The ef- Han Liu, John Lafferty, and Larry Wasserman. 2009. fect of wording on message propagation: Topic- and The nonparanormal: Semiparametric estimation of author-controlled natural experiments on twitter. In high dimensional undirected graphs. The Journal of Proceedings of ACL. Machine Learning Research, 10:2295–2328. Robert Tibshirani. 1996. Regression shrinkage and se- Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry lection via the lasso. Journal of the Royal Statistical Wasserman. 2012. High-dimensional semiparamet- Society. Series B (Methodological), pages 267–288. ric gaussian copula graphical models. The Annals of Kristina Toutanova, Dan Klein, Christopher D Manning, Statistics. and Yoram Singer. 2003. Feature-rich part-of-speech Tomáš Mikolov. 2012. Statistical language models tagging with a cyclic dependency network. In Pro- based on neural networks. Ph.D. thesis, Ph. D. the- ceedings of NAACL-HLT. sis, Brno University of Technology. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du- Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa mitru Erhan. 2014. Show and tell: A neural image Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, caption generator. arXiv preprint arXiv:1411.4555. Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Stefan Wager, Sida Wang, and Percy Liang. 2013. Midge: Generating image descriptions from computer Dropout training as adaptive regularization. In Ad- vision detections. In Proceedings of EACL. vances in Neural Information Processing Systems, Roger B Nelsen. 1999. An introduction to copulas. pages 351–359. Springer Verlag. William Yang Wang and Zhenhao Hua. 2014. A semi- Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, parametric gaussian copula regression model for pre- Gülsen Eryigit, Sandra Kübler, Svetoslav Marinov, dicting financial risks from earnings calls. In Proceed- and Erwin Marsi. 2007. Maltparser: A language- ings of ACL. independent system for data-driven dependency pars- Sida Wang and Christopher Manning. 2013. Fast ing. Natural Language Engineering, 13(02):95–135. dropout training. In Proceedings of ICML. Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic, Dani Yogatama, Michael Heilman, Brendan O’Connor, et al. 2013. Learning and transferring mid-level image Chris Dyer, Bryan R Routledge, and Noah A Smith. representations using convolutional neural networks. 2011. Predicting a scientific community’s response to Kishore Papineni, Salim Roukos, Todd Ward, and Wei- an article. In Proceedings of EMNLP. Jing Zhu. 2002. Bleu: a method for automatic evalua- tion of machine translation. In Proceedings of ACL, pages 311–318. Association for Computational Lin- guistics. Rahul A Parsa and Stuart A Klugman. 2011. Copula regression. Variance Advancing and Science of Risk. Emanuel Parzen. 1962. On estimation of a probability density function and mode. The annals of mathemati- cal statistics.
You can also read