Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model
Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model

Yitao Cai, Huiyu Cai and Xiaojun Wan
Institute of Computer Science and Technology, Peking University
The MOE Key Laboratory of Computational Linguistics, Peking University
Center for Data Science, Peking University
{caiyitao, hy_cai, wanxiaojun}@pku.edu.cn

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2506–2515, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics

Abstract

Sarcasm is a subtle form of language in which people express the opposite of what is implied. Previous work on sarcasm detection has focused on texts. However, more and more social media platforms like Twitter allow users to create multi-modal messages, including texts, images, and videos. It is insufficient to detect sarcasm in multi-modal messages based only on texts. In this paper, we focus on multi-modal sarcasm detection for tweets consisting of texts and images in Twitter. We treat text features, image features and image attributes as three modalities and propose a multi-modal hierarchical fusion model to address this task. Our model first extracts image features and attribute features, and then leverages the attribute features and a bidirectional LSTM network to extract text features. Features of the three modalities are then reconstructed and fused into one feature vector for prediction. We create a multi-modal sarcasm detection dataset based on Twitter. Evaluation results on the dataset demonstrate the efficacy of our proposed model and the usefulness of the three modalities.

1 Introduction

Merriam Webster defines sarcasm as "a mode of satirical wit depending for its effect on bitter, caustic, and often ironic language that is usually directed against an individual". It has the magical power to disguise the hostility of the speaker (Dews and Winner, 1995) while enhancing the effect of mockery or humor on the listener. Sarcasm is prevalent on today's social media platforms, and its automatic detection bears great significance in customer service, opinion mining, online harassment detection and all sorts of tasks that require knowledge of people's real sentiment.

Twitter has become a focus of sarcasm detection research due to its ample resources of publicly available sarcastic posts. Previous works on Twitter sarcasm detection focus on the text modality and propose many supervised approaches, including conventional machine learning methods with lexical features (Bouazizi and Ohtsuki, 2015; Ptáček et al., 2014) and deep learning methods (Wu et al., 2018; Baziotis et al., 2018).

However, a model that uses only the text modality can never be certain of the true intention of the simple tweet "What a wonderful weather!" until the dark clouds in the attached picture (Figure 1(a)) are seen. Images, while ubiquitous on social platforms, can help reveal (Figure 1(a)), affirm (Figure 1(b)) or disprove the sarcastic nature of tweets, and are thus intuitively crucial to Twitter sarcasm detection.

In this work, we propose a multi-modal hierarchical fusion model for detecting sarcasm in Twitter. We leverage three types of features, namely text, image and image attribute features, and fuse them in a novel way. During early fusion, the attribute features are used to initialize a bi-directional LSTM network (Bi-LSTM), which is then used to extract the text features. The three types of features then undergo representation fusion, where they are transformed into reconstructed representation vectors. A modality fusion layer performs a weighted average of the vectors and feeds the result to a classification layer to yield the final prediction. Our results show that all three types of features contribute to the model performance. Furthermore, our fusion strategy successfully refines the representation of each modality and is significantly more effective than simply concatenating the three types of features.

Our main contributions are summarized as follows:
• We propose a novel hierarchical fusion model to address the challenging multi-modal sarcasm detection task in Twitter. To the best of our knowledge, we are the first to deeply fuse the three modalities of image, attribute and text, rather than using naïve concatenation, for Twitter sarcasm detection.

• We create a new dataset for multi-modal Twitter sarcasm detection and release it¹.

• We quantitatively show the significance of each modality in Twitter sarcasm detection. We further show that to fully unleash the potential of images, we need to consider image attributes - high-level abstract information bridging the gap between texts and images.

¹ https://github.com/headacheboy/data-of-multimodal-sarcasm-detection

Figure 1: Examples of the image modality aiding sarcasm detection. (a) "What a wonderful weather!" - the image is necessary for the sarcasm to be spotted due to the contradiction between the dark clouds in the image and "wonderful weather" in the text; (b) "Yep, totally normal. Nothing is off about this. Nothing at all. #itstoohotalready #climatechangeisreal" - the image affirms the sarcastic nature of the tweet by showing that the weather is actually very "hot" and not at all "totally normal".

2 Related Works

2.1 Sarcasm Detection

Various methods have been proposed for sarcasm detection from texts. Earlier methods extract carefully engineered discrete features from texts (Davidov et al., 2010; Riloff et al., 2013; Ptáček et al., 2014; Bouazizi and Ohtsuki, 2015), including n-grams, word sentiment, punctuation, emoticons, part-of-speech tags, etc. More recently, researchers leverage the powerful techniques of deep learning to obtain more precise semantic representations of tweet texts. Ghosh and Veale (2016) propose a model with CNN and RNN layers. Besides the tweet content in question, contextual features such as the historical behaviors of the author and the audience serve as good indicators of sarcasm. Bamman and Smith (2015) make use of human-engineered author, audience and response features to promote sarcasm detection. Zhang, Zhang and Fu (2016) concatenate target tweet embeddings (obtained by a Bi-GRU model) with manually engineered contextual features, and show fair improvement compared to completely feature-based systems. Amir et al. (2016) exploit trainable user embeddings to enhance the performance of a CNN classification model. Poria et al. (2016) use the concatenated output of CNNs trained on tweets and pre-trained on emotion, sentiment and personality as the input of the final SVM classifier. Tay et al. (2018) come up with a novel multi-dimensional intra-attention mechanism to explicitly model contrast and incongruity. Wu et al. (2018) construct a multi-task model with densely connected LSTMs based on embeddings, sentiment features and syntactic features. Baziotis et al. (2018) ensemble a word-based bidirectional LSTM and a character-based bidirectional LSTM to capture both semantic and syntactic features.

However, little has been revealed so far on how to effectively combine textual and visual information to boost the performance of Twitter sarcasm detection. Schifanella et al. (2016) simply concatenate manually designed features or deep learning based features of texts and images to make predictions with two modalities. Different from this work, we propose a hierarchical fusion model to deeply fuse three modalities.

2.2 Other Multi-Modal Tasks
Sentiment analysis is a task related to sarcasm detection. Many studies on multi-modal sentiment analysis deal with video data (Wang et al., 2016; Zadeh et al., 2017), where text, image and audio data can usually be aligned and support each other. Though the inputs are different, their fusion mechanisms can be inspiring to our task. Poria, Cambria, and Gelbukh (2015) use multiple kernel learning to fuse different modalities. Zadeh et al. (2017) build their fusion layer with the outer product instead of simple concatenation in order to get more features. Gu et al. (2018b) align text and audio at the word level and apply several attention mechanisms. Gu et al. (2018a) first introduce a modality fusion structure attempting to reveal the actual importance of multiple modalities, but their methods are quite different from our hierarchical fusion techniques.

Inspiration can also be drawn from other multi-modal tasks, such as visual question answering (VQA), where an image and a query sentence are provided as model inputs. A question-guided attention mechanism has been proposed for VQA (Chen et al., 2015) and can boost model performance compared to models using global image features. An attribute prediction layer is introduced (Wu et al., 2016) as a way to incorporate high-level concepts into the CNN-LSTM framework. Wang et al. (2017) exploit a handful of off-the-shelf algorithms, gluing them with a co-attention model, and achieve generalizability as well as scalability. Yang et al. (2014) tackle image emotion extraction with image comments and propose a model to bridge images and comment information by learning Bernoulli parameters.

3 Proposed Hierarchical Fusion Model

Figure 2 shows the architecture of our proposed hierarchical fusion model. In this work, we treat text, image and image attribute as three modalities. The image attribute modality has been shown to boost model performance by adding high-level concepts of the image content (Wu et al., 2016). Modality fusion techniques are proposed to make full use of the three modalities. In the following paragraphs, we first define raw vectors and guidance vectors, and then briefly introduce our hierarchical fusion techniques.

For the image modality, we use a pre-trained and fine-tuned ResNet model to obtain 14 × 14 regional vectors of the tweet image, which are defined as the raw image vectors, and average them to get our image guidance vector. For the (image) attribute modality, we use another pre-trained and fine-tuned ResNet model to predict 5 attributes for each image, the GloVe embeddings of which are taken as the raw attribute vectors. Our attribute guidance vector is a weighted average of the raw attribute vectors. We use a Bi-LSTM to obtain our text vectors. The raw text vectors are the concatenated forward and backward hidden states at each time step of the Bi-LSTM, while the text guidance vector is the average of these raw vectors. In the belief that the attached image can aid the model's understanding of the tweet text, we apply non-linear transformations to the attribute guidance vector and feed the result to the Bi-LSTM as its initial hidden state. This process is named early fusion. In order to utilize multi-modal information to refine the representations of all modalities, representation fusion is proposed, in which the feature vectors of the three modalities are reconstructed using raw vectors and guidance vectors. The refined vectors of the three modalities are then combined into one vector with a weighted average, instead of simple concatenation, in the process of modality fusion. Lastly, the fused vector is fed into a two-layer fully-connected neural network to obtain the classification result. More details of our model are provided below.
3.1 Image Feature Representation

We use ResNet-50 V2 (He et al., 2016) to obtain representations of tweet images. We chop off the last fully-connected (FC) layer of the pre-trained model and replace it with a new one for the sake of model fine-tuning. Following Wang et al. (2017), an input image I is re-sized to 448 × 448 and divided into 14 × 14 regions. Each region I_i (i = 1, 2, ..., 196) is then sent through the ResNet model to obtain a regional feature representation v_{region_i}, a.k.a. a raw image vector:

v_{region_i} = ResNet(I_i)

As described before, the image guidance vector v_{image} is the average of all regional image vectors:

v_{image} = (Σ_{i=1}^{N_r} v_{region_i}) / N_r

where N_r is the number of regions and is 196 in this work.
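This regional feature extraction can be illustrated with a short PyTorch sketch. It is not the authors' released code: it uses torchvision's ResNet-50 (v1) as a stand-in for ResNet-50 V2, keeps the backbone frozen as described in Section 5.1, and assumes 448 × 448 inputs so that the final feature map has 14 × 14 = 196 regions.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """Regional raw image vectors and the image guidance vector (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)  # stand-in for ResNet-50 V2
        # Drop global pooling and the FC layer so the 14x14 spatial map is kept
        # for a 448x448 input (448 / 32 = 14, i.e. 196 regions).
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.features.parameters():  # the paper keeps ResNet parameters fixed
            p.requires_grad = False

    def forward(self, images):                  # images: (B, 3, 448, 448)
        fmap = self.features(images)            # (B, 2048, 14, 14)
        raw = fmap.flatten(2).transpose(1, 2)   # (B, 196, 2048): raw image vectors
        guidance = raw.mean(dim=1)              # (B, 2048): image guidance vector
        return raw, guidance
```

Note that the sketch feeds the whole image through the backbone once and reads the 196 spatial positions as regions, which is equivalent in effect to sending each region through the network.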
Figure 2: Overview of our proposed model.

3.2 Attribute Feature Representation

Previous work (Wu et al., 2016) in image captioning and visual question answering introduces attributes as high-level concepts of images. In that work, single-label and multi-label losses are proposed to train the attribute prediction CNN, whose parameters are transferred to generate the final image representation. While they use parameter sharing with attribute-labeling tasks for better image representations, we take a more explicit approach: we treat attributes as an extra modality bridging the tweet text and image, by directly using the word embeddings of the five predicted attributes of each tweet image as the raw attribute vectors.

We first train an attribute predictor with ResNet-101 and the COCO image captioning dataset (Lin et al., 2014). We build a multi-label dataset by extracting 1000 attributes from sentences of the COCO dataset. We use a ResNet model pre-trained on ImageNet (Russakovsky et al., 2015) and fine-tune it on the multi-label dataset. The attribute predictor is then used to predict five attributes a_i (i = 1, ..., 5) for each image.

We generate the attribute guidance vector by weighted average. The raw attribute vectors e(a_i) are passed through a two-layer neural network to obtain the attention weights α_i for constructing the attribute guidance vector v_{attr}. The related equations are as follows:

α_i = W_2 · tanh(W_1 · e(a_i) + b_1) + b_2
α = softmax(α)
v_{attr} = Σ_{i=1}^{N_a} α_i e(a_i)

where a_i is the ith image attribute, literally a word out of a vocabulary of 1000; e is the GloVe embedding operation; W_1 and W_2 are weight matrices; b_1 and b_2 are biases; N_a is the number of attributes, which is 5 in our settings.
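The attention over the five attribute embeddings can be sketched as follows. The module and dimension names are our own (200-dimensional GloVe embeddings as in Table 2, hidden size set to half the input size as stated in Section 5.1); this is an illustrative reading of the equations above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttributeGuidance(nn.Module):
    """Two-layer attention over the five attribute embeddings (illustrative sketch)."""

    def __init__(self, emb_dim=200, hidden=100):    # hidden = half the input size
        super().__init__()
        self.w1 = nn.Linear(emb_dim, hidden)        # W_1, b_1
        self.w2 = nn.Linear(hidden, 1)              # W_2, b_2

    def forward(self, attr_emb):                    # attr_emb: (B, 5, emb_dim), the e(a_i)
        scores = self.w2(torch.tanh(self.w1(attr_emb))).squeeze(-1)  # (B, 5)
        alpha = torch.softmax(scores, dim=-1)                         # attention weights alpha_i
        v_attr = (alpha.unsqueeze(-1) * attr_emb).sum(dim=1)          # (B, emb_dim): v_attr
        return v_attr, alpha
```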
3.3 Text Feature Representation

A bidirectional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) is used to obtain the representation of the tweet text. The operations performed by the LSTM at time step t are as follows:

i_t = σ(W_i · x_t + U_i · h_{t-1})
f_t = σ(W_f · x_t + U_f · h_{t-1})
o_t = σ(W_o · x_t + U_o · h_{t-1})
c̃_t = tanh(W_c · x_t + U_c · h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where W_i, W_f, W_o, U_i, U_f, U_o are weight matrices; x_t and h_t are the input state and hidden state at time step t, respectively; σ is the sigmoid function; ⊙ denotes element-wise product. The text guidance vector is the arithmetic average of the hidden states over all time steps:

v_{text} = (Σ_{t=1}^{L} h_t) / L

where L is the length of the tweet text.

3.4 Early Fusion

The Bi-LSTM initial states are usually set to zeros in text classification tasks, but they are a potential spot where multi-modal information can be infused to promote the model's comprehension of the text. In the proposed model, we apply the non-linearly transformed attribute guidance vector as the initial states of the Bi-LSTM:

[h_{f0}; h_{b0}; c_{f0}; c_{b0}] = ReLU(W · v_{attr} + b)

where h_{f0}, c_{f0} are the forward LSTM initial states and h_{b0}, c_{b0} are the backward LSTM initial states; [;] is vector concatenation; ReLU denotes element-wise application of the Rectified Linear Unit activation function; W and b are a weight matrix and a bias.

We also try using the image guidance vector for early fusion, in which the LSTM initial states are obtained in a manner similar to the one described above, but it does not perform very well, as will be discussed in the experiments.
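Early fusion thus amounts to predicting the four initial LSTM states from the attribute guidance vector. A minimal PyTorch sketch, assuming 200-dimensional embeddings and the LSTM hidden size of 256 from Table 2; the exact reshaping of the four states into PyTorch's (direction, batch, hidden) layout is our assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class EarlyFusionTextEncoder(nn.Module):
    """Bi-LSTM whose initial states are predicted from the attribute guidance vector."""

    def __init__(self, emb_dim=200, hidden=256):
        super().__init__()
        self.hidden = hidden
        # One projection yields the concatenation [h_f0; h_b0; c_f0; c_b0].
        self.init_proj = nn.Linear(emb_dim, 4 * hidden)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_emb, v_attr):            # word_emb: (B, L, emb_dim)
        states = torch.relu(self.init_proj(v_attr))                    # (B, 4*hidden)
        h0, c0 = states[:, :2 * self.hidden], states[:, 2 * self.hidden:]
        h0 = h0.view(-1, 2, self.hidden).transpose(0, 1).contiguous()  # (2, B, hidden)
        c0 = c0.view(-1, 2, self.hidden).transpose(0, 1).contiguous()
        raw_text, _ = self.bilstm(word_emb, (h0, c0))   # (B, L, 2*hidden): raw text vectors
        v_text = raw_text.mean(dim=1)                   # text guidance vector
        return raw_text, v_text
```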
3.5 Representation Fusion

Inspired by the attention mechanism in VQA tasks, representation fusion aims at reconstructing the feature vectors v_{image}, v_{text}, v_{attr} with the help of the low-level raw vectors (namely, the hidden states {h_t} for the text modality, the 196 regional vectors for the image modality, and the five attribute embeddings for the attribute modality) and the high-level guidance vectors from the different modalities.

We denote X_m^(i) as the ith raw vector from modality m (which may be text, image or attribute). The key in this stage is to calculate the weight for each X_m^(i). The weighted average then becomes the new representation of modality m.

To leverage as much information as possible and to model the relationship between multiple modalities more accurately, we exploit information from all three modalities - more explicitly, the guidance vectors v_n, where n can be text, image or attribute - when calculating the weights of the raw vectors in each modality. For the ith raw vector of each modality m, we calculate three guided weights α_mn^(i) from the guidance vectors of the different modalities n. The final reconstruction weight for the raw vector is the average of the normalized guided weights:

α_mn^(i) = W_{mn2} · tanh(W_{mn1} · [X_m^(i); v_n] + b_{mn1}) + b_{mn2}
α_mn = softmax(α_mn)
α_m^(i) = (Σ_{n∈{text, image, attr}} α_mn^(i)) / 3
v_m = Σ_{i=1}^{L_m} α_m^(i) X_m^(i)

where m, n ∈ {text, image, attr} denote modalities; α_mn^(i) is the guided weight for the ith raw vector of modality m under the guidance of modality n, and α_mn contains the α_mn^(i) of all raw vectors of modality m under the guidance of modality n; α_m^(i) is the final reconstruction weight for the ith raw vector of modality m; L_m is the length of the sequence {X_m^(i)}; W_{mn1}, W_{mn2} are weight matrices and b_{mn1}, b_{mn2} are biases.

After representation fusion, v_{image}, v_{text} and v_{attr}, previously denoted as guidance vectors, are now considered the feature vectors of each modality and are ready to serve as inputs of the next layer.

3.6 Modality Fusion

Instead of simply concatenating the feature vectors of the different modalities to form a longer vector, we perform modality fusion motivated by the work of Gu et al. (2018a). The feature vector of each modality m, denoted as v_m, is first transformed into a fixed-length form v'_m. A two-layer feed-forward neural network is used to calculate the attention weight for each modality m, which is then used in a weighted average of the transformed feature vectors v'_m. The result is a single, fixed-length vector v_{fused}:

α̃_m = W_{m2} · tanh(W_{m1} · v_m + b_{m1}) + b_{m2}
α̃ = softmax(α̃)
v'_m = tanh(W_{m3} · v_m + b_{m3})
v_{fused} = Σ_{m∈{text, image, attr}} α̃_m v'_m

where m is one of the three modalities and α̃ is a vector containing the α̃_m; W_{m1}, W_{m2}, W_{m3} are weight matrices; b_{m1}, b_{m2}, b_{m3} are biases; v_m represents the reconstructed feature vectors from the representation fusion process.
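Representation fusion and modality fusion can be sketched together. The code below is our illustrative reading of the equations above, not the released implementation: GuidedAttention is one (m, n) scoring head, representation_fusion averages the guided weights and re-weights the raw vectors, and ModalityFusion projects the reconstructed vectors to the common fusion size (512, as in Table 2) and averages them with learned modality weights.

```python
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    """One (m, n) scoring head of representation fusion: raw vectors of modality m
    are scored under the guidance vector of modality n."""

    def __init__(self, raw_dim, guide_dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(raw_dim + guide_dim, hidden)  # W_mn1, b_mn1
        self.w2 = nn.Linear(hidden, 1)                     # W_mn2, b_mn2

    def forward(self, raw, guide):            # raw: (B, L_m, raw_dim), guide: (B, guide_dim)
        guide = guide.unsqueeze(1).expand(-1, raw.size(1), -1)
        scores = self.w2(torch.tanh(self.w1(torch.cat([raw, guide], dim=-1)))).squeeze(-1)
        return torch.softmax(scores, dim=-1)   # normalized guided weights over the L_m raw vectors


def representation_fusion(raw_m, guides, heads):
    """Reconstruct modality m: average the guided weights obtained under all three
    guidance vectors, then take the weighted sum of the raw vectors."""
    weights = torch.stack([heads[n](raw_m, guides[n]) for n in guides], dim=0).mean(dim=0)
    return (weights.unsqueeze(-1) * raw_m).sum(dim=1)      # (B, raw_dim): reconstructed v_m


class ModalityFusion(nn.Module):
    """Score each reconstructed vector, project it to the common fusion size, and
    return the weighted average over modalities."""

    def __init__(self, dims, fused_dim=512):               # dims: {"text": ..., "image": ..., "attr": ...}
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, fused_dim) for m, d in dims.items()})
        self.score = nn.ModuleDict({m: nn.Sequential(
            nn.Linear(d, d // 2), nn.Tanh(), nn.Linear(d // 2, 1)) for m, d in dims.items()})

    def forward(self, feats):                               # feats: {modality: (B, dim)}
        scores = torch.cat([self.score[m](v) for m, v in feats.items()], dim=-1)   # (B, 3)
        alpha = torch.softmax(scores, dim=-1)
        projected = torch.stack([torch.tanh(self.proj[m](v)) for m, v in feats.items()], dim=1)
        return (alpha.unsqueeze(-1) * projected).sum(dim=1)  # (B, fused_dim): v_fused
```

In this sketch, guides would hold the three guidance vectors and heads the three GuidedAttention modules for a given target modality; a full model would instantiate one such set per modality.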
3.7 Classification Layer

We use a two-layer fully-connected neural network as our classification layer. The activation functions of the hidden layer and the output layer are element-wise ReLU and sigmoid functions, respectively. The loss function is cross entropy.

4 Dataset and Preprocessing

There is no publicly available dataset for evaluating the multi-modal sarcasm detection task, so we build our own dataset, which will be released later. We collect and preprocess our data similarly to Schifanella et al. (2016). We collect English tweets containing a picture and certain special hashtags (e.g., #sarcasm) as positive examples (i.e., sarcastic) and collect English tweets with images but without such hashtags as negative examples (i.e., not sarcastic). We further clean up the data as follows. First, we discard tweets containing sarcasm, sarcastic, irony or ironic as regular words. We also discard tweets containing URLs in order to avoid introducing additional information. Furthermore, we discard tweets with words that frequently co-occur with sarcastic tweets and may therefore express sarcasm, for instance jokes, humor and exgag. We divide the data into a training set, a development set and a test set with a ratio of 80%:10%:10%. In order to evaluate models more accurately, we manually check the development set and the test set to ensure the accuracy of the labels. The statistics of our final dataset are listed in Table 1.

Table 1: Statistics of our dataset
            Training   Development   Test
sentences   19816      2410          2409
positive    8642       959           959
negative    11174      1451          1450

For preprocessing, we first replace mentions with a certain symbol ⟨user⟩. We then separate words, emoticons and hashtags with the NLTK toolkit. We also separate the hashtag sign # from hashtags and replace capitals with their lowercase forms. Finally, words appearing only once in the training set, and words not appearing in the training set but appearing in the development set or test set, are replaced with a certain symbol ⟨unk⟩.
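The preprocessing steps can be approximated with NLTK as follows. The tokenizer choice (TweetTokenizer), the <user>/<unk> spellings and the helper names are our assumptions; only the steps themselves (mention replacement, hashtag splitting, lowercasing, rare-word replacement) come from the description above.

```python
import re
from collections import Counter
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False)   # lower-cases while tokenizing

def preprocess(tweet):
    """Tokenize one tweet roughly as described above."""
    tweet = re.sub(r"@\w+", "<user>", tweet)      # replace mentions
    tokens = []
    for tok in tokenizer.tokenize(tweet):
        if tok.startswith("#") and len(tok) > 1:  # split the hashtag sign from the tag
            tokens.extend(["#", tok[1:]])
        else:
            tokens.append(tok)
    return tokens

def build_vocab(train_tweets, min_count=2):
    """Words seen fewer than min_count times in the training set map to <unk>."""
    counts = Counter(tok for t in train_tweets for tok in preprocess(t))
    return {w for w, c in counts.items() if c >= min_count}
```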
5 Experiments

5.1 Training Details

Pre-trained models. The pre-trained ResNet model is available online. The word embeddings and attribute embeddings are trained on the Twitter dataset using GloVe (Pennington et al., 2014).

Fine-tuning. The parameters of the pre-trained ResNet model are fixed during training. The parameters of the word and attribute embeddings are updated during training.

Optimization. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the loss function.

Hyper-parameters. The hidden layer size of the neural networks described in the fusion techniques is half of their input size. Other hyper-parameters are listed in Table 2.

Table 2: Hyper-parameters
Hyper-parameter                       Value
LSTM hidden size                      256
Batch size                            32
Learning rate                         0.001
Gradient clipping                     5
Early stop patience                   5
Word and attribute embedding size     200
ResNet FC size                        1024
Modality fusion size                  512
LSTM dropout rate                     0.2
Classification layer l2 parameter     1e-7
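The classification layer of Section 3.7 and the optimization settings of Table 2 can be wired up roughly as below. The two-layer classifier over a 512-dimensional fused vector is a stand-in for the output of the full hierarchical fusion network, and the use of norm-based gradient clipping is our assumption.

```python
import torch
import torch.nn as nn

# Stand-in classifier over a 512-dimensional fused vector: two FC layers with
# element-wise ReLU and a sigmoid output, as in Section 3.7.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
criterion = nn.BCELoss()   # cross-entropy loss on the sigmoid output

def train_step(fused_vector, label):
    """One update with the settings of Table 2; clipping at 5 is applied to the gradient norm."""
    optimizer.zero_grad()
    loss = criterion(model(fused_vector).squeeze(-1), label)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()

# e.g. train_step(torch.randn(32, 512), torch.randint(0, 2, (32,)).float())
```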
5.2 Comparison Results

Table 3 shows the comparison results (F-score and accuracy) of the baseline models and our proposed model. We implement models with one or multiple modalities as baseline models. We also present the results of naïve solutions (all negative, random) for this task.

Random. It randomly predicts whether a tweet is sarcastic or not.

Text(Bi-LSTM). Bi-LSTM is one of the most popular methods for text classification problems. It leverages a bidirectional LSTM network to learn text representations and then uses a classification layer to make the prediction.

Text(CNN). CNN is also one of the state-of-the-art methods for text classification problems. We implement text CNN (Kim, 2014) as a baseline model.

Image. Image vectors after the pooling layer of ResNet are the inputs of the classification layer. We only update the parameters of the classification layer.

Attr. Since the image attribute is one of the modalities in our proposed model, we also try to use only attribute features to make the prediction. The attribute feature vectors are the inputs of the classification layer.

Concat. Previous work (Schifanella et al., 2016) concatenates different feature vectors of different modalities as the input of the classification layer. We implement this concatenation model with our feature vectors of the different modalities and apply it for classification. The number in parentheses is the number of modalities we use: (2) means concatenating text features and image features, while (3) means concatenating all text, image and attribute features.

Table 3: Comparison results
Model           F-score   Pre      Rec      Acc
All negative    -         -        -        0.6019
Random          0.4470    0.4005   0.5057   0.5027
Text(Bi-LSTM)   0.7753    0.7666   0.7842   0.8190
Text(CNN)       0.7532    0.7429   0.7639   0.8003
Image           0.6153    0.5441   0.7080   0.6476
Attr            0.6334    0.5606   0.7278   0.6646
Concat(2)       0.7799    0.7388   0.8259   0.8103
Concat(3)       0.7874    0.7336   0.8498   0.8174
Our model       0.8018    0.7657   0.8415   0.8344

We can see that the models based only on the image or attribute modality do not perform well, while the Text(Bi-LSTM) and Text(CNN) models perform much better, indicating the important role of the text modality. The Concat(3) model outperforms Concat(2), because adding attributes as a new modality introduces external semantic information about the images and helps the model when it fails to extract valid image features. Our proposed hierarchical fusion model further improves the performance and achieves the state-of-the-art scores, revealing that our fusion model leverages the features of the three modalities in a more effective way.

We further apply sign tests between our proposed model and the Text(Bi-LSTM), Concat(2) and Concat(3) models. The null hypotheses are that our proposed model does not perform better than each baseline model. The statistics of the sign tests are listed in Table 4. All significance levels are less than 0.05. Therefore, all of the null hypotheses are rejected and our proposed model performs significantly better than the baseline models.

Table 4: Statistics of sign tests
      Concat(3)   Concat(2)   Text(Bi-LSTM)
t+    106         149         120
t-    65          91          83
p     0.0011      0.0001      0.0057
(t+ is the number of tweets that our proposed model predicts correctly but the baseline model does not; t- is the number of tweets that the baseline model predicts correctly but our proposed model does not; p is the significance value.)
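The p-values in Table 4 are consistent with a one-sided sign test that treats t+ as a draw from Binomial(t+ + t-, 0.5). A small SciPy sketch (our reconstruction of the test, not the authors' script) reproduces them up to rounding:

```python
from scipy.stats import binomtest   # scipy >= 1.7; older versions expose binom_test

# Under the null hypothesis the two models are equally likely to win on a tweet
# where they disagree, so t+ follows Binomial(t+ + t-, 0.5); test one-sided.
for baseline, t_plus, t_minus in [("Concat(3)", 106, 65),
                                  ("Concat(2)", 149, 91),
                                  ("Text(Bi-LSTM)", 120, 83)]:
    p = binomtest(t_plus, t_plus + t_minus, p=0.5, alternative="greater").pvalue
    print(f"{baseline}: p = {p:.4f}")
```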
5.3 Component Analysis of Our Model

We further evaluate the influence of early fusion, representation fusion, and of the modality representation used in early fusion on the final performance. The evaluation results are listed in Table 5.

Table 5: Ablation study
Model       F-score   Pre      Rec      Acc
w/o EF      0.7880    0.7570   0.8217   0.8240
w/o RF      0.7902    0.7456   0.8405   0.8223
EF(img)     0.7787    0.7099   0.8624   0.8049
Our model   0.8018    0.7657   0.8415   0.8344
('w/o' means removal of the component. EF denotes early fusion. RF denotes representation fusion. EF(img) means using image guidance vectors for early fusion.)

We can see that the removal of early fusion decreases the performance, which shows that early fusion can improve the text representation. Early fusion with the attribute representation performs better than early fusion with the image representation, indicating the gap between the text representation and the image representation. If representation fusion is removed, the performance also decreases, which indicates that representation fusion is necessary and that it can refine the feature representation of each modality.

6 Visualization Analysis

6.1 Running Examples

Figure 3 shows some sarcastic examples that our proposed model predicts correctly while the model with only the text modality fails to label them right. It shows that, with our model, images and attributes can contribute to sarcasm detection. For example, an image of a dangerous tackle and a text saying 'not dangerous' convey strong sarcasm in example (a).
'Respectful customers' is contradicted by the messy parcels as well as by the attribute 'messy' in example (b). Without images, successfully detecting these sarcasm instances is almost impossible. The model with only the text modality also fails to detect the sarcasm in example (c), even though the word 'so' is repeated several times there. However, with the image and attribute modalities, our proposed model correctly detects the sarcasm in these tweets.

Figure 3: Examples of sarcastic tweets. (a) "this isn't dangerous. going to teach my players to tackle like this at practice first thing tomorrow morning." (attributes: field, players, playing, soccer, men); (b) "i love respectful customers" (attributes: pile, stack, messy, sitting, boxes); (c) "your counselor is so cute. glad you're staffing up so well." (attributes: teddy, bear, wearing, hat, brown).

6.2 Attention Visualization

Figure 4 shows the attention of some examples at the representation fusion stage. Our model can successfully focus on the appropriate parts of the image, the essential words in the sentence and the important attributes. In example (a), our model pays more attention to the unamused-face emoji and the word 'amazing' in the text, and to the gloomy sky in the image, so the tweet is predicted as sarcastic because of the inconsistency between these two modalities. In example (b), our model focuses on the word 'serious' in the text and on the simple meal in the picture that contradicts the 'good breakfast', revealing that this tweet should be sarcastic. In example (c), the word 'yum', the attribute 'meat' and the food in the image indicate the sarcastic meaning of the tweet.

Figure 4: Attention visualization of sarcastic tweets. (a) "Weather's lookin amazing today ..."; (b) "happy testing monday! eat a good breakfast #serious"; (c) "YUM hospital cafeteria food – at Lake Charles Memorial Hospital".

6.3 Error Analysis

Figure 5 shows an example that our model fails to label correctly. In this example, the insulting gesture in the picture contrasts with the phrase 'thanks for'. However, the model lacks the common sense that this gesture is insulting, so the attention over the picture does not focus on the gesture. Moreover, the attributes do not reveal the insulting meaning of the picture either, and thus our model fails to predict this tweet as sarcastic.

Figure 5: Example of a misclassified sample: "yo thanks for the yearly fee reminder! Here's to you! #planetfitness #hiddenfee #mrmet" (attributes: ball, holding, shoes, little, white).

7 Conclusion and Future Work

In this paper we propose a new hierarchical fusion model to make full use of three modalities (images, texts and image attributes) to address the challenging multi-modal sarcasm detection task. Evaluation results demonstrate the effectiveness of our proposed model and the usefulness of the three modalities. In future work, we will incorporate other modalities such as audio into the sarcasm detection task, and we will also investigate making use of common sense knowledge in our model.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (61772036) and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). We thank the anonymous reviewers for their helpful comments. Xiaojun Wan is the corresponding author.

References

Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. 2016. Modelling context with user embeddings for sarcasm detection in social media. CoRR, abs/1607.00976.

David Bamman and Noah A. Smith. 2015. Contextualized sarcasm detection on twitter. In ICWSM.

Christos Baziotis, Nikos Athanasiou, Pinelopi Papalampidi, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, and Alexandros Potamianos. 2018. Ntua-slp at semeval-2018 task 3: Tracking ironic tweets using ensembles of word and character level attentive rnns. arXiv preprint arXiv:1804.06659.

M. Bouazizi and T. Ohtsuki. 2015. Sarcasm detection in twitter: "all your products are incredibly amazing!!!" - are they really? In 2015 IEEE Global Communications Conference (GLOBECOM), pages 1–6.

Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2015. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in twitter and amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 107–116, Stroudsburg, PA, USA. Association for Computational Linguistics.

Shelly Dews and Ellen Winner. 1995. Muting the meaning a social function of irony. Metaphor and Symbol, 10(1):3–19.

Aniruddha Ghosh and Tony Veale. 2016. Fracking sarcasm using neural network. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 161–169. Association for Computational Linguistics.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018a. Hybrid attention based multimodal network for spoken language classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2379–2390. Association for Computational Linguistics.

Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. 2018b. Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2225–2235. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. CoRR, abs/1603.05027.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544.

Soujanya Poria, Erik Cambria, Devamanyu Hazarika, and Prateek Vij. 2016. A deeper look into sarcastic tweets using deep convolutional neural networks. CoRR, abs/1610.08815.

Tomáš Ptáček, Ivan Habernal, and Jun Hong. 2014. Sarcasm detection on czech and english twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 213–223. Dublin City University and Association for Computational Linguistics.

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In EMNLP 2013 - 2013 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 704–714. Association for Computational Linguistics (ACL).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Rossano Schifanella, Paloma de Juan, Joel R. Tetreault, and Liangliang Cao. 2016. Detecting sarcasm in multimodal social platforms. CoRR, abs/1608.02289.

Yi Tay, Luu Anh Tuan, Siu Cheung Hui, and Jian Su. 2018. Reasoning with sarcasm by reading in-between. CoRR, abs/1805.02856.

Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P. Xing. 2016. Select-additive learning: Improving cross-individual generalization in multimodal sentiment analysis. CoRR, abs/1609.05244.

Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. 2017. The vqa-machine: Learning how to use existing vision algorithms to answer new questions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 4.

Chuhan Wu, Fangzhao Wu, Sixing Wu, Junxin Liu, Zhigang Yuan, and Yongfeng Huang. 2018. Thu ngn at semeval-2018 task 3: Tweet irony detection with densely connected lstm and multi-task learning. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 51–56.

Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 203–212.

Yang Yang, Jia Jia, Shumei Zhang, Boya Wu, Qicong Chen, Juanzi Li, Chunxiao Xing, and Jie Tang. 2014. How do your friends on social media disclose your emotions? In Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 306–312.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. CoRR, abs/1707.07250.

Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Tweet sarcasm detection using deep neural network. In COLING.