IIT Gandhinagar at SemEval-2019 Task 3: Contextual Emotion Detection Using Deep Learning


           Arik Pamnani∗, Rajat Goel∗, Jayesh Choudhari, Mayank Singh
                                 IIT Gandhinagar
                                   Gujarat, India

         ∗ Equal Contribution

                              Abstract

Recent advancements in Internet and mobile infrastructure have resulted in the development of faster and more efficient platforms of communication. These platforms include speech, facial and text-based conversational mediums. The majority of these are text-based messaging platforms. Development of chatbots that automatically understand latent emotions in a textual message is a challenging task. In this paper, we present an automatic emotion detection system that aims to detect the emotion of a person textually conversing with a chatbot. We explore deep learning techniques such as CNN and LSTM based neural networks and outperform the baseline score by 14%. The trained model and code are kept in the public domain.

1   Introduction

In recent times, text has become a preferred mode of communication (Reporter, 2012; ORTUTAY, 2018) over phone/video calling or face-to-face communication. New challenges and opportunities accompany this change. Identifying sentiment from text has become a sought-after research topic. Applications include detecting depression (Wang et al., 2013) or teaching empathy to chatbots (WILSON, 2016). These applications leverage NLP for extracting sentiments from text. Along this line, SemEval Task 3: EmoContext (Chatterjee et al., 2019) challenges participants to identify Contextual Emotions in Text.

Challenges: The challenges with extracting sentiments from text are not limited to the use of slang, sarcasm or multiple languages in a sentence. A further challenge is the use of non-standard acronyms, specific to individuals, and of other irregular word forms which are present in the task's dataset (Section 2).

Existing work: For sentiment analysis, most of the previous year's submissions focused on neural networks (Nakov et al., 2016). Teams experimented with Recurrent Neural Network (RNN) (Yadav, 2016) as well as Convolutional Neural Network (CNN) based models (Ruder et al., 2016). However, some top-ranking teams also used classic machine learning models (Giorgis et al., 2016). Aiming for the best system, we started with classical machine learning algorithms like Support Vector Machine (SVM) and Logistic Regression (LR). Based on the findings from them, we moved to more complex models using Long Short-Term Memory (LSTM), and finally, we experimented with CNN in search of the right system.

Our contribution: In this paper, we present models to extract emotions from text. All our models are trained using only the dataset provided by the EmoContext organizers. The evaluation metric set by the organizers is the micro F1 score (referred to as score in the rest of the paper) on three {Happy, Sad, Angry} of the four labels. We experimented with simpler models like SVM and Logistic Regression, but the score on the dev set was below 0.45. We then worked with a CNN model and two LSTM based models, with which we were able to beat the baseline and achieve a maximum score of 0.667 on the test set.

Outline: Section 2 describes the dataset and preprocessing steps. Section 3 presents model descriptions and system information. In Section 4, we discuss experimental results and comparisons against state-of-the-art baselines. Finally, Section 5 and Section 6 conclude this work with current limitations and proposals for future extension.
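
As an illustration of the evaluation metric mentioned in the introduction, the following is a minimal sketch of a micro F1 score restricted to the three emotion classes. It is not the organizers' scoring script: the function name `micro_f1`, the lowercase label strings and the handling of the zero-true-positive case are assumptions made for this example.

```python
# Minimal sketch (assumption: labels are lowercase strings; "others" is the fourth class).
EMOTIONS = ("happy", "sad", "angry")


def micro_f1(gold, pred, classes=EMOTIONS):
    """Micro-averaged F1 computed only over the given emotion classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p in classes:
            if p == g:
                tp += 1  # correct prediction of an emotion class
            else:
                fp += 1  # predicted an emotion class that is not the gold label
        if g in classes and p != g:
            fn += 1      # gold emotion class that was missed
    if tp == 0:
        return 0.0       # assumption: score is 0 when nothing is predicted correctly
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["happy", "others", "sad", "angry", "others"]
pred = ["happy", "happy", "others", "angry", "others"]
score = micro_f1(gold, pred)
```

Because correct "others" predictions never count as true positives, a system that labels every dialogue as "others" scores 0 under this sketch, even though "others" is the largest class in the dataset.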

2   Dataset

We used the dataset provided by Task 3 of SemEval 2019. This task is titled 'EmoContext: Contextual Emotion Detection in Text'. The dataset consists of textual dialogues, i.e. a user utterance along with two turns of context. Each dialogue is labelled with one of four emotion classes: Happy, Sad, Angry or Others. Figure 1 shows an example dialogue.

    Turn 1: N u
    Turn 2: Im fine, and you?
    Turn 3: I am fabulous

Figure 1: Example textual dialogue from EmoContext dataset.

Table 1 shows the distribution of classes in the EmoContext dataset. The dataset is further subdivided into train, dev and test sets. In this work, we use the training set for model training and the dev set for validation and hyper-parameter tuning.

             Others      Happy      Sad        Angry
    Train    14948       4243       5463       5506
    Dev      2338        142        125        150
    Test     4677        284        250        298

                Table 1: Dataset statistics.

Preprocessing: We leverage two pretrained word embeddings: (i) GloVe (Pennington et al., 2014) and (ii) sentiment specific word embedding (SSWE) (Tang et al., 2014). However, several classes of words are not present in these embeddings. We list these classes below:
   • Emojis
   • Elongated words: Wowwww, noooo, etc.
   • Misspelled words: ofcorse, activ, doin, etc.
   • Non-English words: Chalo, kyun, etc.
We follow a standard preprocessing pipeline to address the above limitations. Figure 2 describes the dataset preprocessing pipeline.

    [Flowchart, boxes in order:]
    • Tokenize sentences using NLTK†.
    • Is the token an emoji? Convert Unicode to string (→ ":)").
    • Use regex to remove repetition of letters at the end of a token ("heyyyyy" → "hey").
    • Correct spelling‡ errors in tokens.
    • Lemmatize the token using NLTK†.

Figure 2: Data processing pipeline.

    † NLTK Library (Bird et al., 2009)
    ‡ For spell check we used the following PyPI package: pypi.org/project/pyspellchecker/.

By using the dataset preprocessing pipeline, we reduced the number of words not found in the GloVe embedding from 4154 to 813 and in SSWE from 3188 to 1089.

3   Experiments

We experiment with several classification systems. In the following subsections we first explain the classical ML based models, followed by Deep Neural Network based models.

3.1   Classical Machine Learning Methods

We train§ two classical ML models: (i) Support Vector Machine (SVM) and (ii) Logistic Regression (LR). The input consists of a single feature vector formed by combining all sentences (turns). We term this combination of sentences a 'Dialogue'.

    § We leverage the Scikit-learn (Pedregosa et al., 2011) implementation.

We create the feature vector by averaging over the d-dimensional GloVe representations of all the words present in the dialogue. Apart from standard averaging, we also experimented with tf-idf weighted averaging. The dialogue vector construction under the tf-idf averaging scheme is described below:

$$\mathrm{Vector}_{\mathrm{dialogue}} = \frac{\sum_{i=1}^{N} \left( \text{tf-idf}_{w_i} \times \mathrm{GloVe}_{w_i} \right)}{N}$$
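
The two averaging schemes of Section 3.1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `dialogue_vector` is hypothetical, the tf-idf weights are taken as a precomputed dictionary, the weighted sum is assumed to be normalised by the token count N, and out-of-vocabulary tokens are assumed to map to a zero vector. The toy two-dimensional "embeddings" stand in for real GloVe vectors.

```python
def dialogue_vector(tokens, embeddings, dim, weights=None):
    """Average the word vectors of a dialogue, optionally weighted per token.

    tokens     -- list of tokens from all turns of the dialogue
    embeddings -- dict: token -> list of `dim` floats (stand-in for GloVe)
    weights    -- dict: token -> tf-idf weight; None means plain averaging
    """
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok, [0.0] * dim)  # assumption: zero vector for unknown words
        w = 1.0 if weights is None else weights.get(tok, 0.0)
        for j in range(dim):
            total[j] += w * vec[j]
    n = len(tokens)
    # Assumption: both schemes divide by the number of tokens N.
    return [x / n for x in total] if n else total


toy_embeddings = {"i": [1.0, 0.0], "am": [0.0, 1.0], "fabulous": [2.0, 2.0]}
toy_tfidf = {"i": 0.5, "am": 0.5, "fabulous": 2.0}  # illustrative weights, not real tf-idf values
tokens = ["i", "am", "fabulous"]

plain = dialogue_vector(tokens, toy_embeddings, dim=2)
weighted = dialogue_vector(tokens, toy_embeddings, dim=2, weights=toy_tfidf)
```

In this toy case the weighted vector is pulled towards "fabulous", the token with the largest weight, which is the intended effect of tf-idf weighting.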

In the tf-idf weighted average above, N is the total number of tokens in the dialogue and w_i is the i-th token in the dialogue. Empirically, we found that standard averaging shows better prediction accuracy than tf-idf weighted averaging.

3.2   Deep Neural Networks

In this subsection, we describe three deep neural architectures that provide better prediction accuracy than classical ML models.

3.2.1   Convolutional Neural Network (CNN)

We explore a CNN model analogous to Kim (2014) for classification. Figure 3 describes our CNN architecture. The model consists of an embedding layer, two convolution layers, a max pooling layer, a hidden layer, and a softmax layer. For each dialogue, the input to this model is a sequence of token indices. Input sequences are padded with zeros so that each sequence has equal length n.

    [Figure 3 (diagram): input matrix of size n × d (d = embedding dimension) → filters of size 2×d, 3×d, 4×d → feature maps → max pooling → concatenation → fully connected layers → softmax.]

Figure 3: Architecture of the CNN model.

The embedding layer maps the input sequence to a matrix of shape n × d, where n is the padded sequence length and d is the dimension of the embedding. Rows of the matrix correspond to the GloVe embeddings of the corresponding words in the sequence. A zero vector represents words which are not present in the embedding.

At the convolution layer, filters of shape m × d slide over the input matrix to create feature maps of length n − m + 1. Here, m is the 'region size'. For each region size, we use k filters. Thus, with r region sizes (Figure 3 shows 2, 3 and 4), the total number of feature maps is r × k. We use two convolution layers, one after the other.

Next, we apply a max-pooling operation over each feature map to get a vector of length r × k. At the end, we add a hidden layer followed by a softmax layer to obtain probabilities for classification. We used Keras for this model.

3.2.2   Long Short-Term Memory-I (LSTM-I)

We experiment with two Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) based approaches. In the first approach, we use an architecture similar to Gupta et al. (2017). Here, similar to the CNN model, the input contains an entire dialogue. We experiment with two embedding layers, one with SSWE embeddings, and the other with GloVe embeddings. Figure 4 presents a detailed description. Gupta et al. (2017) showed that SSWE embeddings capture sentiment information and GloVe embeddings capture semantic information in the continuous representation of words. As in the CNN model, we input the word indices of the dialogue. We pad input sequences with zeros so that each sequence has length n.

The architecture consists of two LSTM layers after each embedding layer. The LSTM layer outputs a vector of shape 128 × 1. Further, concatenation of these output vectors results in a vector of shape 256 × 1. In the end, we have a hidden layer followed by a softmax layer. The output from the softmax layer is a vector of shape 4 × 1 which refers to class probabilities for the four classes. We used Keras for this model.

    [Figure 4 (diagram): Turns 1 to 3 → SSWE embeddings → LSTM (128); Turns 1 to 3 → GloVe embeddings → LSTM (128); both outputs → fully connected layer.]

Figure 4: LSTM-I architecture.

3.2.3   Long Short-Term Memory-II (LSTM-II)

In the second approach, we use the architecture shown in Figure 5. This model consists of embedding layers, LSTM layers, a dense layer and a softmax layer. Here, the entire dialogue is not passed at once. Turn 1 is passed through an embedding layer which is followed by an LSTM layer. The output is a vector of shape 256 × 1. Turn 2 is also
passed through an embedding layer which is followed by an LSTM layer. The output from Turn 2 is concatenated with the output of Turn 1 to form a vector of shape 512 × 1. The concatenated vector is passed through a dense layer which reduces the vector to 256 × 1. Turn 3 is passed through an embedding layer which is followed by an LSTM layer. The output from Turn 3 is concatenated with the reduced output of Turn 1 & 2, and the resultant vector has shape 512 × 1. The resultant vector is passed through a dense layer and then a softmax layer to find the probability distribution across different classes. We used PyTorch for this model.

The motivation for this architecture was derived from the task's focus on identifying the emotion of Turn 3. Hence, this architecture gives more weight to Turn 3 while making a prediction and conditions the result on Turn 1 & 2 by concatenating their output vectors. The concatenated vector of Turn 1 & 2 accounts for the context of the conversation.

    [Figure 5 (diagram): GloVe embeddings of each turn feed LSTM layers with 256-dimensional outputs; concatenated vectors (512) pass through fully connected layers (ReLU), ending in a softmax.]

Figure 5: LSTM-II architecture.

4   Results

In Table 2, we report the performance of all the models described in the previous section. We train each model five times and report the mean score.

    Algorithm                          Score_dev    Score_test
    SVM                                0.46         0.41
    LR                                 0.44         0.40
    SVM (tf-idf weighted averaging)    0.42         0.38
    LR (tf-idf weighted averaging)     0.37         0.34
    CNN                                0.632        0.612
    LSTM-I                             0.677        0.667
    LSTM-II                            0.684        0.661

Table 2: Model performance on the dev and test datasets.

The SVM and Logistic Regression models did not yield very good results. We attribute this to the dialogue features that we use for the models. The tf-idf weighted average of GloVe vectors performed worse than the simple average of vectors. Handcrafted features might have performed better than our current implementation. Neural network based models had much better results: the CNN performed better than the classical ML models but lagged behind the LSTM based models. On the test set, our LSTM-I model performed slightly better than the LSTM-II model.

Hyper-parameter selection for the CNN was difficult, and we restricted ourselves to LSTM models for Phase 2 (i.e. the test phase). We also noticed that the LSTM model was overfitting early in the training process (4-5 epochs), which was a challenge when searching for optimal hyper-parameters. We used grid search to find the right set of hyper-parameters for our models. We grid searched over dropout (Srivastava et al., 2014), number of LSTM layers, learning rate and number of epochs. In the case of the CNN model, the number of filters was an extra hyper-parameter. We used an Nvidia GeForce GTX 1080 for training our models.

5   Conclusion

In this paper, we experimented with multiple machine learning models. We see that LSTM and CNN models perform far better than classical ML methods. In phase-1 of the competition (dev dataset), we were able to achieve a score of 0.71, when the scoreboard leader had 0.77. But in phase-2 (test dataset), our best score was only 0.634, when the scoreboard leader had 0.79. After phase-2 ended, we experimented more with hyper-parameters and achieved an increase in scores on the test set (reported in Table 2).

Full code for the paper can be found on GitHub∗.

    ∗ emocontext-19

6   Future Work

Our scores on the test dataset suggest room for improvement. We are now focusing on transfer learning, where the starting point for our model will be a network pre-trained on a similar task. Our assumption is that this will help in better convergence on the EmoContext dataset, given that the dataset size is not too large.
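
As a supplementary illustration of the grid search over hyper-parameters described in Section 4, the sketch below enumerates every combination of a small grid and keeps the best-scoring one. It is not the code used for the reported results: the grid values, the helper name `grid_search` and the scoring function `toy_dev_score` are hypothetical. In a real run, the scoring function would train a model with the given configuration and return its micro F1 on the dev set.

```python
from itertools import product

# Illustrative grid only; these values are assumptions, not the ones used in the paper.
GRID = {
    "dropout": [0.2, 0.5],
    "lstm_layers": [1, 2],
    "learning_rate": [0.001, 0.01],
    "epochs": [3, 5],
}


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return (best_config, best_score)."""
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:  # keep the first configuration reaching the best score
            best_config, best_score = config, score
    return best_config, best_score


def toy_dev_score(config):
    """Hypothetical stand-in for 'train the model, return the dev-set score'."""
    return -(
        abs(config["dropout"] - 0.5)
        + abs(config["lstm_layers"] - 2)
        + abs(config["epochs"] - 5)
        + 100 * abs(config["learning_rate"] - 0.001)
    )


best_config, best_score = grid_search(GRID, toy_dev_score)
```

The number of training runs grows multiplicatively with each added hyper-parameter (16 combinations here), which is one reason early overfitting makes such a search costly to interpret.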
