IIT Gandhinagar at SemEval-2019 Task 3: Contextual Emotion Detection Using Deep Learning
Arik Pamnani∗, Rajat Goel∗, Jayesh Choudhari, Mayank Singh
IIT Gandhinagar, Gujarat, India
{arik.pamnani,rajat.goel,choudhari.jayesh,singh.mayank}@iitgn.ac.in

∗ Equal contribution.

Abstract

Recent advancements in Internet and mobile infrastructure have resulted in the development of faster and more efficient platforms of communication. These platforms include speech-, facial- and text-based conversational media, the majority of which are text-based messaging platforms. Developing chatbots that automatically understand the latent emotions in a textual message is a challenging task. In this paper, we present an automatic emotion detection system that aims to detect the emotion of a person textually conversing with a chatbot. We explore deep learning techniques such as CNN- and LSTM-based neural networks and outperform the baseline score by 14%. The trained model and code are kept in the public domain.

1 Introduction

In recent times, text has become a preferred mode of communication (Reporter, 2012; Ortutay, 2018) over phone/video calling or face-to-face communication. New challenges and opportunities accompany this change, and identifying sentiment from text has become a sought-after research topic. Applications include detecting depression (Wang et al., 2013) or teaching empathy to chatbots (Wilson, 2016). These applications leverage NLP to extract sentiment from text. Along this line, SemEval-2019 Task 3: EmoContext (Chatterjee et al., 2019) challenges participants to identify contextual emotions in text.

Challenges: The challenges in extracting sentiment from text are not limited to the use of slang, sarcasm or multiple languages in a sentence. A further challenge is the use of non-standard acronyms, specific to individuals, which are present in the task's dataset (Section 2).

Existing work: For sentiment analysis, most of the previous year's submissions focused on neural networks (Nakov et al., 2016). Teams experimented with Recurrent Neural Network (RNN) based models (Yadav, 2016) as well as Convolutional Neural Network (CNN) based models (Ruder et al., 2016). However, some top-ranking teams also used classic machine learning models (Giorgis et al., 2016). Aiming for the best system, we started with classical machine learning algorithms such as Support Vector Machines (SVM) and Logistic Regression (LR). Based on the findings from these, we moved to more complex models using Long Short-Term Memory (LSTM) networks, and finally we experimented with a CNN in search of the right system.

Our contribution: In this paper, we present models to extract emotions from text. All our models are trained using only the dataset provided by the EmoContext organizers. The evaluation metric set by the organizers is the micro F1 score (referred to as "score" in the rest of the paper) on three of the four labels: {Happy, Sad, Angry}. We experimented with simpler models such as SVM and Logistic Regression, but their score on the dev set stayed below 0.45. We then worked with a CNN model and two LSTM-based models, with which we were able to beat the baseline and achieve a maximum score of 0.667 on the test set.

Outline: Section 2 describes the dataset and preprocessing steps. Section 3 presents the model descriptions and system information. Section 4 discusses experimental results and comparisons against state-of-the-art baselines. Towards the end, Sections 5 and 6 conclude
this work with current limitations and a proposal for future extension.

2 Dataset

We used the dataset provided by Task 3 of SemEval-2019, titled 'EmoContext: Contextual Emotion Detection in Text'. The dataset consists of textual dialogues, i.e., a user utterance along with two turns of context. Each dialogue is labelled with one of four emotion classes: Happy, Sad, Angry or Others. Figure 1 shows an example dialogue.

    Turn 1: N u
    Turn 2: Im fine, and you?
    Turn 3: I am fabulous

Figure 1: Example textual dialogue from the EmoContext dataset.

Table 1 shows the distribution of classes in the EmoContext dataset. The dataset is further subdivided into train, dev and test sets. In this work, we use the training set for model training and the dev set for validation and hyper-parameter tuning.

            Others   Happy   Sad    Angry
    Train   14948    4243    5463   5506
    Dev     2338     142     125    150
    Test    4677     284     250    298

Table 1: Dataset statistics.

Preprocessing: We leverage two sets of pretrained word embeddings: (i) GloVe (Pennington et al., 2014) and (ii) sentiment-specific word embeddings (SSWE) (Tang et al., 2014). However, several classes of words are not present in these embeddings:

• Emojis
• Elongated words: Wowwww, noooo, etc.
• Misspelled words: ofcorse, activ, doin, etc.
• Non-English words: Chalo, kyun, etc.

We follow a standard preprocessing pipeline, shown in Figure 2, to address the above limitations:

1. Tokenize sentences using NLTK's TweetTokenizer†.
2. If a token is an emoji, convert the Unicode character to an equivalent string such as ":)".
3. Otherwise, use a regex to remove repetition of letters at the end of a token ("heyyyyy" → "hey").
4. Correct spelling errors in tokens‡.
5. Lemmatize each token using NLTK's WordNetLemmatizer.

Figure 2: Data processing pipeline.

By using this pipeline, we reduced the number of words not found in the GloVe embeddings from 4154 to 813, and in SSWE from 3188 to 1089.

† NLTK library (Bird et al., 2009).
‡ For spell checking we used the PyPI package pypi.org/project/pyspellchecker/.
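For concreteness, the pipeline of Figure 2 can be sketched in Python roughly as follows. This is a minimal illustration rather than our exact implementation: the emoji table, the de-elongation regex and the handling of failed spelling corrections are assumptions.

```python
import re
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker  # pypi.org/project/pyspellchecker

tokenizer = TweetTokenizer()
lemmatizer = WordNetLemmatizer()       # requires nltk.download("wordnet")
spell = SpellChecker()

# Illustrative emoji-to-string table; the real mapping would be much larger.
EMOJI_MAP = {"\U0001F642": ":)", "\U0001F622": ":("}

def preprocess(sentence):
    tokens = []
    for token in tokenizer.tokenize(sentence):
        if token in EMOJI_MAP:
            tokens.append(EMOJI_MAP[token])           # emoji -> ASCII string
            continue
        token = re.sub(r"(\w)\1{2,}$", r"\1", token)  # "heyyyyy" -> "hey"
        token = spell.correction(token) or token      # spelling correction
        tokens.append(lemmatizer.lemmatize(token))    # lemmatization
    return tokens

print(preprocess("heyyyyy \U0001F642 how are you doinggg"))
```

As an aside, TweetTokenizer also offers a reduce_len option that caps character repetitions; the explicit regex above instead matches the rule stated in the figure.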
3 Experiments

We experiment with several classification systems. In the following subsections, we first explain the classical ML based models, followed by the deep neural network based models.

3.1 Classical Machine Learning Methods

We learn§ two classical ML models: (i) a Support Vector Machine (SVM) and (ii) Logistic Regression (LR). The input consists of a single feature vector formed by combining all sentences (turns); we term this combination of sentences a 'dialogue'. We create the feature vector by averaging over the d-dimensional GloVe representations of all the words present in the dialogue. Apart from standard averaging, we also experimented with tf-idf weighted averaging, in which the dialogue vector is constructed as

\[ \mathrm{Vector}_{dialogue} = \frac{\sum_{i=1}^{N} \text{tf-idf}_{w_i} \times \mathrm{GloVe}_{w_i}}{N} \]

where N is the total number of tokens in the dialogue and w_i is its i-th token. Empirically, we found that standard averaging yields better prediction accuracy than tf-idf weighted averaging.

§ We leverage the scikit-learn (Pedregosa et al., 2011) implementations.
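A sketch of this dialogue-vector construction is shown below. It is illustrative only: the GloVe file path, the toy dialogues and the use of scikit-learn's TfidfVectorizer (with an identity analyzer over pre-tokenized input) to obtain the tf-idf weights are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

d = 100  # GloVe dimensionality (illustrative)

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dict (assumed format)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")  # hypothetical path

# Dialogues are lists of preprocessed tokens (Section 2).
train_dialogues = [["hey", "how", "are", "you"], ["i", "am", "sad", "today"]]
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)  # identity analyzer
tfidf.fit(train_dialogues)

def dialogue_vector(tokens, weighted=True):
    """Average of (optionally tf-idf weighted) GloVe vectors of a dialogue."""
    total = np.zeros(d, dtype=np.float32)
    weights = tfidf.transform([tokens]) if weighted else None
    for tok in tokens:
        vec = glove.get(tok, np.zeros(d, dtype=np.float32))  # OOV -> zero vector
        if weighted:
            col = tfidf.vocabulary_.get(tok)
            vec = (weights[0, col] if col is not None else 0.0) * vec
        total += vec
    return total / max(len(tokens), 1)

# The stacked vectors can be fed to sklearn's SVC or LogisticRegression.
X_train = np.stack([dialogue_vector(toks) for toks in train_dialogues])
```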
3.2 Deep Neural Networks

In this subsection, we describe three deep neural architectures that provide better prediction accuracy than the classical ML models.

3.2.1 Convolutional Neural Network (CNN)

We explore a CNN model analogous to that of Kim (2014) for classification. Figure 3 describes our architecture: the model consists of an embedding layer, two convolution layers, a max-pooling layer, a hidden layer, and a softmax layer. For each dialogue, the input to the model is a sequence of token indices, padded with zeros so that every sequence has equal length n.

The embedding layer maps the input sequence to a matrix of shape n × d, where n is the number of words in the dialogue and d is the dimension of the embedding. The rows of the matrix correspond to the GloVe embeddings of the words in the sequence; a zero vector represents words that are not present in the embedding.

At the convolution layer, filters of shape m × d slide over the input matrix to create feature maps of length n − m + 1, where m is the 'region size'. For each region size we use k filters, so the total number of feature maps is k times the number of region sizes. We use two convolution layers, one after the other. Next, we apply a max-pooling operation over each feature map, which yields a vector with one entry per feature map. At the end, we add a hidden layer followed by a softmax layer to obtain the class probabilities. We used Keras for this model.

Figure 3: Architecture of the CNN model (embedding matrix of size n × d; filters of size 2 × d, 3 × d, 4 × d; feature maps; max pooling; concatenation; fully connected layers; Leaky ReLU; softmax).
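A minimal Keras sketch of this architecture follows. The vocabulary size, the filter count k, the hidden width, the convolution activations and the stacking of the two convolution layers within each region-size branch are assumptions; the region sizes (2, 3, 4), the Leaky ReLU hidden layer and the 4-way softmax follow Figure 3.

```python
from tensorflow.keras import layers, Model

n, d, k = 100, 100, 64   # padded length, embedding dim, filters per region size
vocab_size = 20000       # illustrative
num_classes = 4

inputs = layers.Input(shape=(n,), dtype="int32")
# In the real system the embedding matrix is initialized from GloVe, with a
# zero row for out-of-vocabulary words; random initialization here for brevity.
x = layers.Embedding(vocab_size, d)(inputs)

branches = []
for m in (2, 3, 4):                                  # region sizes
    b = layers.Conv1D(k, m, activation="relu")(x)    # first convolution layer
    b = layers.Conv1D(k, m, activation="relu")(b)    # second convolution layer
    b = layers.GlobalMaxPooling1D()(b)               # max-pool each feature map
    branches.append(b)

h = layers.Concatenate()(branches)                   # concatenate pooled features
h = layers.Dense(128)(h)
h = layers.LeakyReLU()(h)                            # hidden layer + Leaky ReLU
outputs = layers.Dense(num_classes, activation="softmax")(h)

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```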
3.2.2 Long Short-Term Memory-I (LSTM-I)

We experiment with two Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) based approaches. In the first approach, we use an architecture similar to that of Gupta et al. (2017). Here, as in the CNN model, the input contains an entire dialogue, given as word indices and padded with zeros so that each sequence has length n. We experiment with two embedding layers, one with SSWE embeddings and the other with GloVe embeddings; Figure 4 presents a detailed description. Gupta et al. showed that, in the continuous representation of words, SSWE embeddings capture sentiment information while GloVe embeddings capture semantic information.

The architecture consists of two LSTM layers after each embedding layer. Each LSTM branch outputs a vector of shape 128 × 1, and concatenating these outputs results in a vector of shape 256 × 1. At the end, we have a hidden layer followed by a softmax layer, whose output is a vector of shape 4 × 1 holding the class probabilities for the four classes. We used Keras for this model.

Figure 4: LSTM-I architecture (Turns 1–3 pass through SSWE and GloVe embedding branches, each followed by LSTM layers with 128-dimensional outputs; the concatenated outputs feed a fully connected layer with Leaky ReLU and a softmax).
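The following Keras sketch mirrors Figure 4 under stated assumptions: the vocabulary size, sequence length and hidden width are illustrative, the embedding matrices would in practice be initialized from the pretrained SSWE and GloVe weights, and we read "two LSTM layers after each embedding layer" as two stacked LSTMs per branch.

```python
from tensorflow.keras import layers, Model

n, d = 100, 100          # padded dialogue length, embedding dim (illustrative)
vocab_size = 20000
num_classes = 4

inputs = layers.Input(shape=(n,), dtype="int32")

# Two parallel embedding branches; in the real system these are loaded
# from the pretrained SSWE and GloVe matrices.
sswe = layers.Embedding(vocab_size, d, name="sswe")(inputs)
glove = layers.Embedding(vocab_size, d, name="glove")(inputs)

def lstm_branch(x):
    x = layers.LSTM(128, return_sequences=True)(x)   # first LSTM layer
    return layers.LSTM(128)(x)                       # second LSTM layer -> 128-dim

h = layers.Concatenate()([lstm_branch(sswe), lstm_branch(glove)])  # 256-dim
h = layers.Dense(128)(h)
h = layers.LeakyReLU()(h)                                          # hidden layer
outputs = layers.Dense(num_classes, activation="softmax")(h)       # 4 classes

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```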
3.2.3 Long Short-Term Memory-II (LSTM-II)

In the second approach, we use the architecture shown in Figure 5. This model consists of embedding layers, LSTM layers, a dense layer and a softmax layer. Here, the entire dialogue is not passed at once. Turn 1 is passed through an embedding layer followed by an LSTM layer; the output is a vector of shape 256 × 1. Turn 2 is also passed through an embedding layer followed by an LSTM layer, and its output is concatenated with the output of Turn 1 to form a vector of shape 512 × 1. The concatenated vector is passed through a dense layer which reduces it to 256 × 1. Turn 3 is passed through an embedding layer followed by an LSTM layer, and its output is concatenated with the reduced output of Turns 1 & 2, giving a resultant vector of shape 512 × 1. The resultant vector is passed through a dense layer and then a softmax layer to obtain the probability distribution over the classes. We used PyTorch for this model.

The motivation for this architecture derives from the task's focus on identifying the emotion of Turn 3. Hence, the architecture gives more weight to Turn 3 while making a prediction, and conditions the result on Turns 1 & 2 by concatenating their output vectors; the concatenated vector of Turns 1 & 2 accounts for the context of the conversation.

Figure 5: LSTM-II architecture (each turn passes through a GloVe embedding and an LSTM layer with 256-dimensional output; the Turn 1 and Turn 2 outputs are concatenated to 512 and reduced to 256 by a fully connected layer, then concatenated with the Turn 3 output and passed through a fully connected layer with Leaky ReLU and a softmax).
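A minimal PyTorch sketch of LSTM-II is given below. The 256/512-dimensional shapes follow the description above; the shared embedding layer, the vocabulary size, the padded turn length and the exact placement of the Leaky ReLU are assumptions.

```python
import torch
import torch.nn as nn

class LSTMII(nn.Module):
    """Sketch of LSTM-II: encode each turn separately, condition on Turns 1 & 2."""
    def __init__(self, vocab_size=20000, d=100, hidden=256, num_classes=4):
        super().__init__()
        # A single shared GloVe-initialized embedding is an assumption;
        # per-turn embedding layers are equally consistent with Figure 5.
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True) for _ in range(3)])
        self.reduce = nn.Linear(2 * hidden, hidden)         # 512 -> 256
        self.act = nn.LeakyReLU()
        self.classify = nn.Linear(2 * hidden, num_classes)  # 512 -> 4

    def encode(self, i, turn):
        _, (h_n, _) = self.lstm[i](self.embed(turn))  # final hidden state
        return h_n.squeeze(0)                         # (batch, 256)

    def forward(self, turn1, turn2, turn3):
        context = torch.cat([self.encode(0, turn1),
                             self.encode(1, turn2)], dim=1)  # (batch, 512)
        context = self.act(self.reduce(context))            # (batch, 256)
        h3 = self.encode(2, turn3)                           # (batch, 256)
        logits = self.classify(torch.cat([context, h3], dim=1))
        return torch.softmax(logits, dim=1)                  # class probabilities

# Toy usage: a batch of 8 dialogues, each turn padded to 50 token indices.
model = LSTMII()
turn = torch.randint(0, 20000, (8, 50))
probs = model(turn, turn, turn)   # shape (8, 4)
```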
4 Results

In Table 2, we report the performance of all the models described in the previous section. We train each model multiple (=5) times and report the mean of the scores.

    Algorithm                         Score_dev   Score_test
    SVM                               0.46        0.41
    LR                                0.44        0.40
    SVM (tf-idf weighted averaging)   0.42        0.38
    LR (tf-idf weighted averaging)    0.37        0.34
    CNN                               0.632       0.612
    LSTM-I                            0.677       0.667
    LSTM-II                           0.684       0.661

Table 2: Model performance on the dev & test datasets.

The SVM and Logistic Regression models did not yield very good results. We attribute this to the dialogue features we use for these models: the tf-idf weighted average of GloVe vectors performed worse than the simple average, and hand-crafted features might have performed better than our current implementation. The neural network based models produced much better results; the CNN outperformed the classical ML models but lagged behind the LSTM-based models, and on the test set our LSTM-I model performed slightly better than LSTM-II.

Hyper-parameter selection for the CNN was difficult, so we restricted ourselves to the LSTM models for Phase 2 (the test phase). We also noticed that the LSTM models overfit early in the training process (4–5 epochs), which was a challenge when searching for optimal hyper-parameters. We used grid search over the dropout rate (Srivastava et al., 2014), the number of LSTM layers, the learning rate and the number of epochs to find the right set of hyper-parameters; for the CNN model, the number of filters was an additional hyper-parameter. We used an Nvidia GeForce GTX 1080 for training our models.

5 Conclusion

In this paper, we experimented with multiple machine learning models. We see that the LSTM and CNN models perform far better than the classical ML methods. In Phase 1 of the competition (dev dataset), we achieved a score of 0.71, while the scoreboard leader had 0.77; in Phase 2 (test dataset), our best score was only 0.634, while the scoreboard leader had 0.79. After Phase 2 ended, we experimented further with hyper-parameters and achieved an increase in scores on the test set (reported in Table 2).

The full code for this paper can be found on GitHub∗.

6 Future Work

Our scores on the test dataset suggest room for improvement. We are now narrowing down on transfer learning, where the starting point for our model will be a network pre-trained on a similar task. Our assumption is that this will aid convergence on the EmoContext dataset, given that the dataset is not too large.

∗ https://github.com/lingo-iitgn/emocontext-19
References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Ankush Chatterjee, Kedhar Nath Narahari, Meghana Joshi, and Puneet Agrawal. 2019. SemEval-2019 Task 3: EmoContext: Contextual emotion detection in text. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), Minneapolis, Minnesota.

Stavros Giorgis, Apostolos Rousas, John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2016. aueb.twitter.sentiment at SemEval-2016 Task 4: A weighted ensemble of SVMs for Twitter sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 96–99.

Umang Gupta, Ankush Chatterjee, Radhakrishnan Srikanth, and Puneet Agrawal. 2017. A sentiment-and-semantics-based approach for emotion detection in textual conversations. CoRR, abs/1707.06996.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1746–1751, Doha, Qatar.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18.

Barbara Ortutay. 2018. Poll: Teens prefer texting over face-to-face communication.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Daily Telegraph Reporter. 2012. Texting more popular than face-to-face conversation.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. INSIGHT-1 at SemEval-2016 Task 4: Convolutional neural networks for sentiment classification and quantification. arXiv preprint arXiv:1609.02746.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. Volume 1, pages 1555–1565.

Xinyu Wang, Chunhong Zhang, Yang Ji, Li Sun, Leijia Wu, and Zhana Bao. 2013. A depression detection model based on sentiment analysis in micro-blog social network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 201–213. Springer.

Mark Wilson. 2016. This startup is teaching chatbots real empathy.

Vikrant Yadav. 2016. thecerealkiller at SemEval-2016 Task 4: Deep learning based system for classifying sentiment of tweets on two point scale. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 100–102.