IIT Gandhinagar at SemEval-2019 Task 3: Contextual Emotion Detection
Using Deep Learning
Arik Pamnani∗, Rajat Goel∗, Jayesh Choudhari, Mayank Singh
IIT Gandhinagar
Gujarat, India
{arik.pamnani,rajat.goel,choudhari.jayesh,singh.mayank}@iitgn.ac.in
∗ Equal contribution.

Abstract

Recent advancements in Internet and mobile infrastructure have resulted in the development of faster and more efficient communication platforms. These platforms include speech-, facial- and text-based conversational mediums, the majority of which are text-based messaging platforms. Developing chatbots that automatically understand the latent emotions in a textual message is a challenging task. In this paper, we present an automatic emotion detection system that aims to detect the emotion of a person textually conversing with a chatbot. We explore deep learning techniques such as CNN- and LSTM-based neural networks and outperform the baseline score by 14%. The trained model and code are kept in the public domain.

1 Introduction

In recent times, text has become a preferred mode of communication over phone/video calling or face-to-face communication (Reporter, 2012; Ortutay, 2018). New challenges and opportunities accompany this change, and identifying sentiment from text has become a sought-after research topic. Applications include detecting depression (Wang et al., 2013) and teaching empathy to chatbots (Wilson, 2016). These applications leverage NLP to extract sentiments from text. Along this line, SemEval-2019 Task 3: EmoContext (Chatterjee et al., 2019) challenges participants to identify contextual emotions in text.

Challenges: The challenges in extracting sentiments from text are not limited to the use of slang, sarcasm or multiple languages in a sentence. A further challenge is posed by non-standard acronyms specific to individuals, many of which are present in the task's dataset (Section 2).

Existing work: For sentiment analysis, most of the previous year's submissions focused on neural networks (Nakov et al., 2016). Teams experimented with Recurrent Neural Network (RNN) based models (Yadav, 2016) as well as Convolutional Neural Network (CNN) based models (Ruder et al., 2016). However, some top-ranking teams also used classic machine learning models (Giorgis et al., 2016). Aiming for the best system, we started with classical machine learning algorithms such as Support Vector Machine (SVM) and Logistic Regression (LR). Based on the findings from these, we moved to more complex models using Long Short-Term Memory (LSTM), and finally we experimented with a CNN in search of the right system.

Our contribution: In this paper, we present models to extract emotions from text. All our models are trained using only the dataset provided by the EmoContext organizers. The evaluation metric set by the organizers is the micro F1 score (referred to as score in the rest of the paper) on three of the four labels: {Happy, Sad, Angry}. We experimented with simpler models such as SVM and Logistic Regression, but their score on the dev set stayed below 0.45. We then worked with a CNN model and two LSTM-based models, with which we were able to beat the baseline and achieve a maximum score of 0.667 on the test set.

Outline: Section 2 describes the dataset and preprocessing steps. Section 3 presents the model descriptions and system information. Section 4 discusses experimental results and comparisons against state-of-the-art baselines. Finally, Sections 5 and 6 conclude this work with current limitations and proposals for future extension.
2 Dataset

We use the dataset provided by Task 3 of SemEval-2019, titled 'EmoContext: Contextual Emotion Detection in Text'. The dataset consists of textual dialogues, i.e., a user utterance along with two turns of context. Each dialogue is labelled with one of four emotion classes: Happy, Sad, Angry or Others. Figure 1 shows an example dialogue.

    Turn 1: N u
    Turn 2: Im fine, and you?
    Turn 3: I am fabulous

Figure 1: Example textual dialogue from the EmoContext dataset.

Table 1 shows the distribution of classes in the EmoContext dataset. The dataset is subdivided into train, dev and test sets. In this work, we use the training set for model training and the dev set for validation and hyper-parameter tuning.

            Others   Happy   Sad    Angry
    Train   14948    4243    5463   5506
    Dev     2338     142     125    150
    Test    4677     284     250    298

Table 1: Dataset statistics.

Preprocessing: We leverage two pretrained word embeddings: (i) GloVe (Pennington et al., 2014) and (ii) sentiment-specific word embeddings (SSWE) (Tang et al., 2014). However, several classes of words are not present in these embeddings:

• Emojis
• Elongated words: Wowwww, noooo, etc.
• Misspelled words: ofcorse, activ, doin, etc.
• Non-English words: Chalo, kyun, etc.

We follow a standard preprocessing pipeline to address these limitations. As Figure 2 shows, the pipeline (i) tokenizes sentences with NLTK's TweetTokenizer (Bird et al., 2009), (ii) converts emoji tokens from Unicode to string equivalents (e.g., ":)"), (iii) uses a regular expression to remove repeated letters at the end of a token ("heyyyyy" becomes "hey"), (iv) corrects spelling errors using the pyspellchecker package (pypi.org/project/pyspellchecker), and (v) lemmatizes each token using NLTK's WordNetLemmatizer.

Figure 2: Data processing pipeline.

By applying this pipeline, we reduced the number of words not found in the GloVe embedding from 4154 to 813, and in SSWE from 3188 to 1089.
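A minimal sketch of how such a pipeline could be wired together is shown below; the emoji mapping table and the helper structure are assumptions for illustration, not the authors' exact implementation.

```python
import re
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker  # pyspellchecker package

# Illustrative emoji-to-emoticon table; the full mapping is an assumption.
EMOJI_TO_TEXT = {"\U0001F600": ":)", "\U0001F622": ":("}

tokenizer = TweetTokenizer()
lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet')
spell = SpellChecker()

def preprocess(sentence):
    """Tokenize, normalize emojis, de-elongate, spell-correct, lemmatize."""
    cleaned = []
    for token in tokenizer.tokenize(sentence):
        if token in EMOJI_TO_TEXT:                    # emoji -> emoticon string
            cleaned.append(EMOJI_TO_TEXT[token])
            continue
        token = re.sub(r"(\w)\1{2,}$", r"\1", token)  # "heyyyyy" -> "hey"
        token = spell.correction(token) or token      # fix spelling errors
        cleaned.append(lemmatizer.lemmatize(token))   # lemmatize the token
    return cleaned

print(preprocess("heyyyyy how are u doin"))
```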
3 Experiments

We experiment with several classification systems. In the following subsections, we first explain the classical ML-based models, followed by the deep neural network based models.

3.1 Classical Machine Learning Methods

We learn two classical ML models, (i) Support Vector Machine (SVM) and (ii) Logistic Regression (LR), using the Scikit-learn implementations (Pedregosa et al., 2011). The input consists of a single feature vector formed by combining all sentences (turns); we term this combination of sentences a 'dialogue'. We create the feature vector by averaging the d-dimensional GloVe representations of all the words present in the dialogue. Apart from standard averaging, we also experimented with tf-idf weighted averaging, where the dialogue vector is constructed as

\[
\mathit{Vector}_{\mathit{dialogue}} = \frac{\sum_{i=1}^{N} \text{tf-idf}_{w_i} \times \mathit{GloVe}_{w_i}}{N}
\]
Here, N is the total number of tokens in the dialogue and w_i is the i-th token. Empirically, we found that standard averaging gives better prediction accuracy than tf-idf weighted averaging.
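A minimal sketch of this feature construction, assuming a preloaded `glove` dictionary mapping tokens to d-dimensional NumPy vectors and an optional `tfidf` weight dictionary (both hypothetical names):

```python
import numpy as np

def dialogue_vector(tokens, glove, tfidf=None, d=100):
    """Average (optionally tf-idf weighted) GloVe vectors over a dialogue."""
    total = np.zeros(d)
    for w in tokens:
        vec = glove.get(w, np.zeros(d))       # OOV words contribute zeros
        weight = 1.0 if tfidf is None else tfidf.get(w, 0.0)
        total += weight * vec
    return total / max(len(tokens), 1)        # divide by N

# Usage: build features for the SVM / LR models, e.g.
# X = np.stack([dialogue_vector(toks, glove) for toks in dialogues])
# clf = sklearn.svm.SVC().fit(X, y)
```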
3.2 Deep Neural Networks

In this subsection, we describe three deep neural architectures that provide better prediction accuracy than the classical ML models.

3.2.1 Convolutional Neural Network (CNN)

We explore a CNN model analogous to (Kim, 2014) for classification. Figure 3 describes our CNN architecture. The model consists of an embedding layer, two convolution layers, a max-pooling layer, a hidden layer, and a softmax layer. For each dialogue, the input to this model is a sequence of token indices. Input sequences are padded with zeros so that each sequence has equal length n.

Figure 3: Architecture of the CNN model. An n × d embedding matrix is convolved with filters of sizes 2×d, 3×d and 4×d; the resulting feature maps are max-pooled, concatenated, and passed through a Leaky ReLU hidden layer and a softmax layer (d = embedding dimension).

The embedding layer maps the input sequence to a matrix of shape n × d, where n is the number of words in the dialogue and d is the dimension of the embedding. Rows of the matrix correspond to the GloVe embeddings of the words in the sequence; a zero vector represents words that are not present in the embedding.

At the convolution layer, filters of shape m × d slide over the input matrix to create feature maps of length n − m + 1, where m is the 'region size'. For each region size, we use k filters; with the three region sizes this gives 3k feature maps in total. We use two convolution layers, one after the other.

Next, we apply a max-pooling operation over each feature map to get a vector of length 3k. At the end, we add a hidden layer followed by a softmax layer to obtain probabilities for classification. We used Keras for this model.
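A compact Keras sketch of this kind of architecture is given below; the vocabulary size, sequence length, filter count k = 64 and region sizes {2, 3, 4} are illustrative assumptions, since the paper does not report its exact settings.

```python
from tensorflow.keras import layers, models

VOCAB, D, N, K, NUM_CLASSES = 20000, 100, 64, 64, 4  # assumed sizes

inp = layers.Input(shape=(N,))                        # padded token indices
emb = layers.Embedding(VOCAB, D)(inp)                 # n x d embedding matrix

branches = []
for m in (2, 3, 4):                                   # region sizes
    x = layers.Conv1D(K, m, activation="relu")(emb)   # feature maps, length n-m+1
    x = layers.Conv1D(K, m, activation="relu")(x)     # second stacked convolution
    x = layers.GlobalMaxPooling1D()(x)                # max-pool each feature map
    branches.append(x)

x = layers.Concatenate()(branches)                    # concatenated pooled features
x = layers.Dense(128)(x)
x = layers.LeakyReLU()(x)                             # hidden layer
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```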
3.2.2 Long Short-Term Memory-I (LSTM-I)

We experiment with two Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) based approaches. In the first approach, we use an architecture similar to (Gupta et al., 2017). Here, as in the CNN model, the input contains an entire dialogue. We experiment with two embedding layers, one with SSWE embeddings and the other with GloVe embeddings; Figure 4 presents a detailed description. Gupta et al. showed that SSWE embeddings capture sentiment information while GloVe embeddings capture semantic information in the continuous representation of words. As with the CNN model, we input the word indices of the dialogue and pad input sequences with zeros so that each sequence has length n.

The architecture consists of two LSTM layers after each embedding layer. Each LSTM branch outputs a vector of shape 128 × 1, and concatenating these output vectors results in a vector of shape 256 × 1. At the end, we have a hidden layer followed by a softmax layer, whose output is a vector of shape 4 × 1 giving the class probabilities for the four classes. We used Keras for this model.

Figure 4: LSTM-I architecture. The dialogue (Turns 1-3) is fed in parallel through an SSWE embedding and a GloVe embedding, each followed by stacked LSTM layers with 128 units; the concatenated outputs pass through fully connected layers with Leaky ReLU and a softmax layer.
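A hedged Keras sketch of this two-branch design follows; the embedding matrices, vocabulary size and sequence length are placeholders that would in practice be loaded from pretrained GloVe and SSWE files.

```python
import numpy as np
from tensorflow.keras import layers, models, initializers

VOCAB, N, NUM_CLASSES = 20000, 64, 4       # assumed sizes
glove_matrix = np.zeros((VOCAB, 100))      # placeholder for GloVe weights
sswe_matrix = np.zeros((VOCAB, 50))        # placeholder for SSWE weights

inp = layers.Input(shape=(N,))

branches = []
for matrix in (sswe_matrix, glove_matrix):            # one branch per embedding
    emb = layers.Embedding(matrix.shape[0], matrix.shape[1],
                           embeddings_initializer=initializers.Constant(matrix),
                           trainable=False)(inp)
    x = layers.LSTM(128, return_sequences=True)(emb)  # first LSTM layer
    x = layers.LSTM(128)(x)                           # second LSTM layer -> 128-dim
    branches.append(x)

x = layers.Concatenate()(branches)                    # 256-dim joint vector
x = layers.Dense(128)(x)
x = layers.LeakyReLU()(x)                             # hidden layer
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # 4 class probabilities

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```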
3.2.3 Long Short-Term Memory-II (LSTM-II)

In the second approach, we use the architecture shown in Figure 5. This model consists of embedding layers, LSTM layers, a dense layer and a softmax layer. Here, the entire dialogue is not passed at once. Turn 1 is passed through an embedding layer followed by an LSTM layer, whose output is a vector of shape 256 × 1. Turn 2 is likewise passed through an embedding layer followed by an LSTM layer, and its output is concatenated with the output for Turn 1 to form a vector of shape 512 × 1. The concatenated vector is passed through a dense layer which reduces it to 256 × 1. Turn 3 is passed through an embedding layer followed by an LSTM layer, and its output is concatenated with the reduced output of Turns 1 & 2, giving a resultant vector of shape 512 × 1. This vector is passed through a dense layer and then a softmax layer to find the probability distribution across the classes. We used PyTorch for this model.

The motivation for this architecture derives from the task's focus on identifying the emotion of Turn 3. Hence, the architecture gives more weight to Turn 3 while making a prediction and conditions the result on Turns 1 & 2 by concatenating their output vectors; the concatenated vector of Turns 1 & 2 accounts for the context of the conversation.

Figure 5: LSTM-II architecture. Each turn passes through a GloVe embedding layer and an LSTM; the Turn 1 and Turn 2 outputs are concatenated and reduced by a fully connected layer, then concatenated with the Turn 3 output and passed through a fully connected layer with Leaky ReLU and a softmax layer.
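A PyTorch sketch of the turn-wise design is given below. The layer sizes follow the text above, while the vocabulary size and embedding dimension are illustrative assumptions; in practice the embeddings would be initialised from GloVe.

```python
import torch
import torch.nn as nn

class LSTMII(nn.Module):
    """Turn-wise sketch: encode each turn, condition Turn 3 on Turns 1 & 2."""

    def __init__(self, vocab=20000, d=100, hidden=256, classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)          # GloVe-initialised in practice
        # one LSTM per turn, as in Figure 5
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True) for _ in range(3)])
        self.reduce = nn.Linear(2 * hidden, hidden)  # 512 -> 256 (Turns 1 & 2)
        self.out = nn.Linear(2 * hidden, classes)    # 512 -> 4

    def encode(self, turn, i):
        _, (h, _) = self.lstms[i](self.embed(turn))  # final hidden state
        return h[-1]                                 # shape (batch, 256)

    def forward(self, turn1, turn2, turn3):
        context = torch.cat([self.encode(turn1, 0), self.encode(turn2, 1)], dim=1)
        context = self.reduce(context)               # context of Turns 1 & 2
        joint = torch.cat([context, self.encode(turn3, 2)], dim=1)  # 512-dim
        return torch.softmax(self.out(joint), dim=1) # class probabilities

# Usage with dummy batches of padded token indices:
model = LSTMII()
t = torch.zeros(8, 20, dtype=torch.long)
print(model(t, t, t).shape)  # torch.Size([8, 4])
```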
4 Results phase-2 ended, we experimented more with hyper-
In Table 2, we report the performance of all the models described in the previous section. We train each model five times and report the mean score.

    Algorithm                          Dev score   Test score
    SVM                                0.46        0.41
    LR                                 0.44        0.40
    SVM (tf-idf weighted averaging)    0.42        0.38
    LR (tf-idf weighted averaging)     0.37        0.34
    CNN                                0.632       0.612
    LSTM-I                             0.677       0.667
    LSTM-II                            0.684       0.661

Table 2: Model performance on the dev and test datasets.

The SVM and Logistic Regression models did not yield very good results, which we attribute to the dialogue features that we use for these models. The tf-idf weighted average of GloVe vectors performed worse than the simple average of vectors; hand-crafted features might have performed better than our current implementation. The neural network based models gave very good results: the CNN performed better than the classical ML models but lagged behind the LSTM-based models, and on the test set our LSTM-I model performed slightly better than LSTM-II.

Hyper-parameter selection for the CNN was difficult, so we restricted ourselves to the LSTM models for Phase 2 (i.e., the test phase). We also noticed that the LSTM models overfit early in the training process (4-5 epochs), which was a challenge when searching for optimal hyper-parameters. We used grid search to find the right set of hyper-parameters, searching over dropout (Srivastava et al., 2014), the number of LSTM layers, the learning rate and the number of epochs; for the CNN model, the number of filters was an additional hyper-parameter. We used an Nvidia GeForce GTX 1080 to train our models.
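A minimal sketch of such a grid search loop is shown below; the value grids and the train_and_score helper are hypothetical stand-ins, since the paper does not report its exact search ranges.

```python
from itertools import product

# Hypothetical grids; the paper's exact search ranges are not reported.
GRID = {
    "dropout": [0.2, 0.4, 0.6],
    "num_lstm_layers": [1, 2],
    "learning_rate": [1e-3, 1e-4],
    "epochs": [3, 4, 5],  # kept small: the model overfits at 4-5 epochs
}

def train_and_score(params):
    """Hypothetical stand-in: train the model with `params`, return dev score."""
    return 0.0  # replace with real training and evaluation

best_score, best_params = -1.0, None
for values in product(*GRID.values()):
    params = dict(zip(GRID.keys(), values))
    score = train_and_score(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```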
5 Conclusion

In this paper, we experimented with multiple machine learning models. We see that the LSTM and CNN models perform far better than classical ML methods. In Phase 1 of the competition (dev dataset), we achieved a score of 0.71, while the scoreboard leader had 0.77. In Phase 2 (test dataset), our best score was only 0.634, while the scoreboard leader had 0.79. After Phase 2 ended, we experimented further with hyper-parameters and improved our scores on the test set (reported in Table 2).

The full code for this paper can be found on GitHub: https://github.com/lingo-iitgn/emocontext-19.

6 Future Work

Our scores on the test dataset suggest room for improvement. We are now narrowing our focus to transfer learning, where the starting point for our model will be a network pre-trained on a similar task. Our assumption is that this will help the model converge better on the EmoContext dataset, given that the dataset is not very large.
References

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

Ankush Chatterjee, Kedhar Nath Narahari, Meghana Joshi, and Puneet Agrawal. 2019. SemEval-2019 task 3: EmoContext: Contextual emotion detection in text. In Proceedings of the 13th International Workshop on Semantic Evaluation (SemEval-2019), Minneapolis, Minnesota.

Stavros Giorgis, Apostolos Rousas, John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2016. aueb.twitter.sentiment at SemEval-2016 task 4: A weighted ensemble of SVMs for Twitter sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 96–99.

Umang Gupta, Ankush Chatterjee, Radhakrishnan Srikanth, and Puneet Agrawal. 2017. A sentiment-and-semantics-based approach for emotion detection in textual conversations. CoRR, abs/1707.06996.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1746–1751, Doha, Qatar.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18.

Barbara Ortutay. 2018. Poll: Teens prefer texting over face-to-face communication.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014.

Daily Telegraph Reporter. 2012. Texting more popular than face-to-face conversation.

Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. INSIGHT-1 at SemEval-2016 task 4: Convolutional neural networks for sentiment classification and quantification. arXiv preprint arXiv:1609.02746.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. Volume 1, pages 1555–1565.

Xinyu Wang, Chunhong Zhang, Yang Ji, Li Sun, Leijia Wu, and Zhana Bao. 2013. A depression detection model based on sentiment analysis in micro-blog social network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 201–213. Springer.

Mark Wilson. 2016. This startup is teaching chatbots real empathy.

Vikrant Yadav. 2016. thecerealkiller at SemEval-2016 task 4: Deep learning based system for classifying sentiment of tweets on two point scale. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 100–102.