SeerNet at SemEval-2018 Task 1: Domain Adaptation for Affect in Tweets

Venkatesh Duppada, Royal Jain and Sushant Hiray
Seernet Technologies, LLC
{venkatesh.duppada, royal.jain, sushant.hiray}@seernet.io

Abstract

This paper describes the best performing system for the SemEval-2018 Affect in Tweets (English) sub-tasks. The system focuses on the ordinal classification and regression sub-tasks for valence and emotion. For ordinal classification, valence is classified into 7 classes ranging from -3 to 3, whereas emotion intensity is classified into 4 classes (0 to 3) separately for each of the emotions anger, fear, joy and sadness. The regression sub-tasks estimate the intensity of valence and of each emotion. The system performs domain adaptation of 4 different models and creates an ensemble to give the final prediction. The proposed system achieved 1st position out of the 75 teams which participated in the aforementioned sub-tasks. We outperform the baseline model by margins ranging from 49.2% to 76.4%, thus pushing the state of the art significantly.

1 Introduction

Twitter is one of the most popular micro-blogging platforms, attracting over 300M daily users [1] who send over 500M tweets every day [2]. Tweet data has attracted NLP researchers because it offers easy access to a large source of people expressing themselves online. Tweets are micro-texts comprising emoticons, hashtags and location data, which makes them feature-rich for many kinds of analysis. At the same time, tweets pose an interesting challenge, since users tend to write grammatically incorrect text and use informal and slang words.

In natural language processing, emotion recognition is the task of associating words, phrases or documents with emotions from categories predefined using psychological models. The classification of emotions has mainly been researched from two fundamental viewpoints. Ekman (1992) and Plutchik (2001) proposed that emotions are discrete, with each emotion being a distinct entity. On the contrary, Mehrabian (1980) and Russell (1980) propose that emotions can be categorized into dimensional groupings.

The Affect in Tweets shared task (Mohammad et al., 2018) in SemEval-2018 focuses on extracting affect from tweets conforming to both variants of the emotion models: valence (dimensional) and emotion (discrete). The previous version of the task (Mohammad and Bravo-Marquez, 2017) focused on estimating emotion intensity in tweets. We participated in 4 sub-tasks of Affect in Tweets, all dealing with English tweets. The sub-tasks were: EI-oc, ordinal classification of the intensity of 4 emotions (anger, joy, sadness, fear); EI-reg, determining the intensity of those emotions on a real-valued scale of 0-1; V-oc, ordinal classification of valence into one of 7 ordinal classes [-3, 3]; and V-reg, determining the intensity of valence on a scale of 0-1.

Prior work on extracting Valence, Arousal and Dominance (VAD) from text primarily relied on using and extending lexicons (Bestgen and Vincze, 2012; Turney et al., 2011). More recently, advances in deep learning have been applied to detecting sentiment in tweets (Tang et al., 2014; Liu et al., 2012; Mohammad et al., 2013).

In this work, we use various state-of-the-art machine learning models and perform domain adaptation (Pan and Yang, 2010) from their source task to the target task. We use a multi-view ensemble learning technique (Kumar and Minz, 2016) to produce the optimal feature-set partitioning for the classifier. Finally, results from multiple such classifiers are stacked together to create an ensemble (Polikar, 2012).

[1] https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
[2] http://www.internetlivestats.com/twitter-statistics/
In this paper, we describe our approach and experiments to solve this problem. The rest of the paper is laid out as follows: Section 2 describes the system architecture, Section 3 reports results and inferences from different experiments, and Section 4 concludes with a discussion of future work.

2 System Description

2.1 Pipeline

Figure 1 details the system architecture; we now describe how the different modules are tied together. The input raw tweet is pre-processed as described in Section 2.2. The processed tweet is passed through all the feature extractors described in Section 2.3, yielding 5 different feature vectors for each tweet. Each feature vector is passed through the model zoo, where classifiers with different hyper-parameters are tuned; the models are described in Section 2.4. For each vector, the results of the top-2 performing models (based on cross-validation) are retained, giving 10 different results for each tweet. All these results are ensembled together via stacking, as described in Section 2.4.3. Finally, the output of the ensembler is the output returned by the system.

(Figure 1: System Architecture)

2.2 Pre-processing

The pre-processing step modifies the raw tweets to prepare them for the feature extraction step. Tweets are pre-processed using the tweetokenize tool [3]. Twitter-specific keywords, namely usernames, phone numbers, URLs and timestamps, are replaced with placeholder tokens (USERNAME, PHONENUMBER, URL, TIME). All characters are converted to lowercase. A contiguous sequence of emojis is first split into individual emojis; we then replace each emoji with its textual description, scraped from EmojiPedia [4].

[3] https://github.com/jaredks/tweetokenize
[4] https://emojipedia.org/
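To make this concrete, here is a minimal sketch of such a pre-processing step. It uses plain regular expressions and a tiny hand-made emoji lookup in place of the tweetokenize tool and the scraped EmojiPedia descriptions, so every helper below is an illustrative stand-in rather than the system's actual code:

```python
import re

# Hypothetical emoji lookup standing in for the descriptions scraped
# from EmojiPedia; the real table covers the full emoji inventory.
EMOJI_DESCRIPTIONS = {"😂": "face with tears of joy"}

def preprocess(tweet: str) -> str:
    """Normalize a raw tweet before feature extraction (simplified:
    the real system uses the tweetokenize tool for these steps)."""
    tweet = re.sub(r"@\w+", "USERNAME", tweet)          # user mentions
    tweet = re.sub(r"https?://\S+", "URL", tweet)       # links
    tweet = re.sub(r"\+?\d[\d\s().-]{7,}\d", "PHONENUMBER", tweet)
    # Split contiguous emoji runs and swap each emoji for its description.
    tweet = "".join(
        f" {EMOJI_DESCRIPTIONS[ch]} " if ch in EMOJI_DESCRIPTIONS else ch
        for ch in tweet
    )
    tweet = re.sub(r"\s+", " ", tweet)                  # collapse whitespace
    return tweet.lower().strip()

print(preprocess("@fan Great game 😂😂 https://t.co/xyz"))
# -> username great game face with tears of joy face with tears of joy url
```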
2.3 Feature Extraction

As mentioned in Section 1, we perform transfer learning from various state-of-the-art deep learning techniques. The following sub-sections describe these models in detail.

2.3.1 DeepMoji

DeepMoji (Felbo et al., 2017) performs distant supervision on a very large dataset (1,246 million tweets) with noisy labels (emojis). DeepMoji obtained state-of-the-art results on various downstream tasks using transfer learning, which makes it an ideal candidate for domain adaptation to related target tasks. We extract 2 different feature sets from the pre-trained DeepMoji model: the embedding from the softmax layer, of dimension 64, and the embedding from the attention layer, of dimension 2304.

2.3.2 Skip-Thought Vectors

Skip-Thought vectors (Kiros et al., 2015) is an off-the-shelf encoder that produces highly generic sentence representations. Since tweets are restricted by a character limit, skip-thought vectors can create a good semantic representation, which is then passed to the classifier. The representation is of dimension 4800.

2.3.3 Unsupervised Sentiment Neuron

Radford et al. (2017) developed an unsupervised system that learned an excellent representation of sentiment. The original model was trained to generate Amazon reviews, which makes the sentiment neuron an ideal candidate for transfer learning. The representation extracted from the Sentiment Neuron is of size 4096.

2.3.4 EmoInt

Apart from the pre-trained embeddings, we also include various lexical features bundled in the EmoInt package [5] (Duppada and Hiray, 2017). The lexical features include AFINN (Nielsen, 2011), NRC Affect Intensities (Mohammad, 2017), the NRC Word-Affect Emotion Lexicon (Mohammad and Turney, 2010), the NRC Hashtag Sentiment Lexicon and the Sentiment140 Lexicon (Mohammad et al., 2013). The final feature vector is the concatenation of all the individual features and is of size 141.

This gives us five different feature vector variants. All of these feature vectors are passed individually to the underlying models; the pipeline is explained in detail in Section 2.1.

[5] https://github.com/SEERNET/EmoInt
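To illustrate the resulting multi-view setup, the sketch below assembles the five feature views per tweet. Every featurizer here is a hypothetical stand-in (a lambda returning a zero vector of the documented size); the real system would call the pre-trained DeepMoji, Skip-Thought, Sentiment Neuron and EmoInt models instead:

```python
from typing import Callable, Dict
import numpy as np

# Hypothetical wrappers around the pre-trained featurizers: each maps a
# pre-processed tweet to a fixed-size vector of the documented dimension.
# Zero vectors are used only to keep this sketch runnable.
FEATURIZERS: Dict[str, Callable[[str], np.ndarray]] = {
    "deepmoji_softmax":   lambda t: np.zeros(64),
    "deepmoji_attention": lambda t: np.zeros(2304),
    "skip_thought":       lambda t: np.zeros(4800),
    "sentiment_neuron":   lambda t: np.zeros(4096),
    "emoint_lexical":     lambda t: np.zeros(141),
}

def extract_views(tweets):
    """Build one feature matrix per view (n_tweets x view_dimension)."""
    return {name: np.vstack([fn(t) for t in tweets])
            for name, fn in FEATURIZERS.items()}

views = extract_views(["username great game face with tears of joy url"])
for name, X in views.items():
    print(name, X.shape)  # e.g. deepmoji_softmax (1, 64)
```

Keeping the views separate, rather than concatenating them into one vector, is what allows the top-2 models per view to be selected independently before stacking.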
2.4 Machine Learning Models

We participated in 4 sub-tasks, namely EI-oc, EI-reg, V-oc and V-reg. Two of the sub-tasks are ordinal classification and the remaining two are regression. We describe our approach for building ML models for both variants in the upcoming sections.

2.4.1 Ordinal Classification

We participated in the emotion intensity ordinal classification sub-task, where the goal was to predict the intensity of the emotions anger, fear, joy and sadness; a separate dataset was provided for each emotion class. The goal of the valence ordinal classification sub-task was to classify the tweet into one of 7 ordinal classes [-3, 3]. We experimented with the XGBoost classifier and sklearn's Random Forest classifier (Pedregosa et al., 2011).

2.4.2 Regression

For the regression tasks (EI-reg, V-reg), the goal was to predict the intensity on a scale of 0-1. We experimented with the XGBoost regressor and sklearn's Random Forest regressor (Pedregosa et al., 2011). The hyper-parameters of each model were tuned separately for each sub-task. The top-2 models corresponding to each feature vector type were chosen after performing 7-fold cross-validation.

2.4.3 Stacking

Once we have the results from all the classifiers/regressors for a given tweet, we use the stacking ensemble technique to combine them: the results from the base models are passed as input to a meta classifier/regressor, and the output of this meta model is treated as the final output of the system.

We observed that ordinal regressors gave better performance than classifiers which treat each output class as disjoint. Ordinal regression is a family of statistical learning methods for output variables that are discrete and ordered. We use ordinal logistic classification with squared error (Rennie and Srebro, 2005) from the Python library Mord [6] (Pedregosa-Izquierdo, 2015). For the regression sub-tasks we observed the best cross-validation results with Ridge regression, hence we chose Ridge regression as the meta regressor.

[6] https://github.com/fabianp/mord
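As a sketch of how such a stacked ensemble can be wired up, the snippet below feeds out-of-fold predictions from two base models into mord's LogisticSE, which we take to be the squared-error ordinal logistic model the paper refers to (an assumption about the library's class name); the data, base models and hyper-parameter values are placeholders, not the tuned configuration used in the system:

```python
import numpy as np
import mord
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

# Toy stand-ins for two of the five feature views; valence classes
# [-3, 3] are coded as the integers 0..6.
rng = np.random.RandomState(0)
X_view1, X_view2 = rng.randn(200, 64), rng.randn(200, 141)
y = rng.randint(0, 7, size=200)

# One tuned model per (view, model) pair in the real system; here just
# one model per view. Out-of-fold predictions keep the meta-model from
# seeing base-model outputs that were fit on the same labels.
base = [
    (X_view1, RandomForestClassifier(n_estimators=100, random_state=0)),
    (X_view2, XGBClassifier(n_estimators=100)),
]
meta_features = np.column_stack(
    [cross_val_predict(model, X, y, cv=7) for X, model in base]  # 7-fold CV
)

# Meta-model: ordinal logistic regression with squared-error loss
# (Rennie and Srebro, 2005).
meta = mord.LogisticSE(alpha=1.0)
meta.fit(meta_features, y)
print(meta.predict(meta_features)[:5])

# For the regression sub-tasks (EI-reg, V-reg) the same wiring applies,
# with regressors as base models and sklearn's Ridge as meta-regressor.
```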
3 Results and Analysis

3.1 Task Results

3.1.1 Primary Metrics

The Pearson correlation with the gold labels was used as the primary metric for ranking the systems. For the EI-reg and EI-oc tasks, the Pearson correlation is macro-averaged (MA Pearson) over the four emotion categories.

Task     Baseline   2nd best   Our Results
EI-reg   0.520      0.776      0.799
EI-oc    0.394      0.659      0.695
V-reg    0.585      0.861      0.873
V-oc     0.509      0.833      0.836

Table 1: Primary metrics across various sub-tasks

Table 1 shows the results based on the primary metrics for the various sub-tasks in English, alongside the baseline and the second-best performing system for comparison. Our system achieved the best performance in each of the four sub-tasks. As we can observe, it vastly outperforms the baseline and is a significant improvement over the second-best system, especially in the emotion sub-tasks.

3.1.2 Secondary Metrics

The competition also uses some secondary metrics to provide a different perspective on the results. For the regression tasks, the secondary metric is the Pearson correlation on the subset of the test set containing only tweets with an intensity score greater than or equal to 0.5. For the ordinal classification tasks, the following secondary metrics were used:

• Pearson correlation on the subset of the test set containing only tweets with intensity classes low X, moderate X, or high X (where X is an emotion). The organizers refer to this set of tweets as the some-emotion subset (SE).

• Weighted quadratic kappa on the full test set.

• Weighted quadratic kappa on the some-emotion subset of the test set.

Task    Pearson (SE)   Kappa      Kappa (SE)
V-oc    0.884 (1)      0.831 (1)  0.873 (1)
EI-oc   0.547 (1)      0.669 (1)  0.503 (1)

Table 2: Secondary metrics for the ordinal classification sub-tasks. System rank is given in brackets.

Task     Pearson (gold in 0.5-1)
V-reg    0.697 (1)
EI-reg   0.638 (1)

Table 3: Secondary metrics for the regression sub-tasks. System rank is given in brackets.

The results for the secondary metrics are listed in Tables 2 and 3, with our ranking in brackets next to each score. Our system achieves the top rank according to all the secondary metrics as well, demonstrating its robustness.

3.2 Feature Importance

The performance of the system is highly dependent on the discriminative ability of the tweet representations generated by the featurizers. We measure the predictive power of each featurizer by computing the Pearson correlation of the system when using only that featurizer. Tables 4-7 report the results for each sub-task.

Feature Set                      Pearson
DeepMoji (softmax layer)         0.808
DeepMoji (attention layer)       0.843
EmoInt                           0.823
Unsupervised Sentiment Neuron    0.714
Skip-Thought Vectors             0.777
Combined                         0.873

Table 4: Pearson correlation for the V-reg task

Feature Set                      Pearson
DeepMoji (softmax layer)         0.780
DeepMoji (attention layer)       0.813
EmoInt                           0.785
Unsupervised Sentiment Neuron    0.685
Skip-Thought Vectors             0.748
Combined                         0.836

Table 5: Pearson correlation for the V-oc task

Feature Set                      Pearson
DeepMoji (softmax layer)         0.703
DeepMoji (attention layer)       0.756
EmoInt                           0.694
Unsupervised Sentiment Neuron    0.548
Skip-Thought Vectors             0.656
Combined                         0.799

Table 6: Macro-averaged Pearson correlation for the EI-reg task

Feature Set                      Pearson
DeepMoji (softmax layer)         0.611
DeepMoji (attention layer)       0.664
EmoInt                           0.596
Unsupervised Sentiment Neuron    0.445
Skip-Thought Vectors             0.557
Combined                         0.695

Table 7: Macro-averaged Pearson correlation for the EI-oc task

We observe that the DeepMoji featurizer is the most powerful of all the featurizers we used. We can also see that a stacking ensemble of models trained on the outputs of multiple featurizers gives a significant improvement in performance.
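For reference, the primary metric for the emotion tasks can be computed as in this minimal sketch; the gold and predicted intensity arrays are synthetic placeholders:

```python
import numpy as np
from scipy.stats import pearsonr

EMOTIONS = ["anger", "fear", "joy", "sadness"]

def macro_avg_pearson(gold, pred):
    """MA Pearson: the mean of per-emotion Pearson correlations,
    used as the primary metric for the EI-reg and EI-oc rankings."""
    return float(np.mean([pearsonr(gold[e], pred[e])[0] for e in EMOTIONS]))

# Synthetic gold/predicted intensities, one array per emotion.
rng = np.random.RandomState(0)
gold = {e: rng.rand(50) for e in EMOTIONS}
pred = {e: g + 0.1 * rng.randn(50) for e, g in gold.items()}
print(round(macro_avg_pearson(gold, pred), 3))
```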
3.3 System Limitations

We analyzed the data points where our model's predictions are far from the ground truth and observed some limitations of the system. Sometimes understanding a tweet requires contextual knowledge about the world, and such examples can be very confusing for the model. Also, the pre-trained DeepMoji model uses emojis as a proxy for labels, but, partly due to the nature of Twitter conversations, the same emoji can be used for multiple emotions: joy emojis, for example, can be used to express joy, sarcasm, or an insult, as in 'Your club is a laughing stock'. Such cases are sometimes incorrectly predicted by our system.

4 Future Work & Conclusion

This paper studies the effectiveness of various representations of tweets and proposes ways to combine them to obtain state-of-the-art results. We also show that a stacking ensemble of various classifiers learnt on different representations can vastly improve the robustness of the system.

Further improvements can be made in the pre-processing stage: instead of discarding tokens such as punctuation and incorrectly spelled words, we could utilize this information by learning their semantic representations. We could also improve the system's performance by employing multi-task learning techniques, since emotions are not independent of each other and information about one emotion can aid in predicting another. Furthermore, more robust distant-supervision techniques that are less prone to noisy labels could be employed to obtain better quality training data.

References

Yves Bestgen and Nadja Vincze. 2012. Checking and bootstrapping lexical norms by means of word similarity indexes. Behavior Research Methods, 44(4):998–1006.

Venkatesh Duppada and Sushant Hiray. 2017. Seernet at EmoInt-2017: Tweet emotion intensity estimator. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 205–211.

Paul Ekman. 1992. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Vipin Kumar and Sonajharia Minz. 2016. Multi-view ensemble learning: an optimal feature set partitioning for high-dimensional data classification. Knowledge and Information Systems, 49(1):1–59.

Kun-Lin Liu, Wu-Jun Li, and Minyi Guo. 2012. Emoticon smoothed language models for Twitter sentiment analysis. In AAAI.

Albert Mehrabian. 1980. Basic dimensions for a general psychological theory: implications for personality, social, environmental, and developmental studies.

Saif M Mohammad. 2017. Word affect intensities. arXiv preprint arXiv:1704.08798.

Saif M Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. arXiv preprint arXiv:1708.03700.

Saif M Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA.

Saif M Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242.

Saif M Mohammad and Peter D Turney. 2010. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34. Association for Computational Linguistics.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Fabian Pedregosa-Izquierdo. 2015. Feature extraction and supervised learning on fMRI: from practice to theory. Ph.D. thesis, Université Pierre et Marie Curie-Paris VI.

Robert Plutchik. 2001. The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4):344–350.

Robi Polikar. 2012. Ensemble learning. In Ensemble Machine Learning, pages 1–34. Springer.

Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. 2017. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.

Jason DM Rennie and Nathan Srebro. 2005. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186. Kluwer, Norwell, MA.

James A Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565.

Peter D Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 680–690. Association for Computational Linguistics.