An Attention Ensemble Approach for Efficient Text Classification of Indian Languages

Atharva Kulkarni1, Amey Hengle1, and Rutuja Udyawar2
1 Department of Computer Engineering, PVG's COET, Savitribai Phule Pune University, India.
2 Optimum Data Analytics, India.
1 {atharva.j.kulkarni1998, ameyhengle22}@gmail.com
2 rutuja.udyawar@odaml.com

arXiv:2102.10275v1 [cs.CL] 20 Feb 2021

Abstract

The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU AKAH's solution for the TechDOfication 2020 subtask-1f, which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.

1 Introduction

The advent of attention-based neural networks and the availability of large labelled datasets have resulted in great success and state-of-the-art performance for English text classification (Yang et al., 2016; Zhou et al., 2016; Wang et al., 2016; Gao et al., 2018). Such results, however, are far fewer for Indian language text classification tasks, as most of the research employs traditional machine learning and deep learning models (Joshi et al., 2019; Tummalapalli et al., 2018; Bolaj and Govilkar, 2016a,b; Dhar et al., 2018). Apart from being heavily consumed in print, Indian languages have a rapidly growing internet user base, scaling from 234 million users in 2016 to 536 million by 2021 (source: https://home.kpmg/in/en/home/insights/2017/04/indian-language-internet-users.html). Even so, just like most other Indian languages, progress in NLP for Marathi has been relatively constrained, owing to factors such as the unavailability of large-scale training resources, structural dissimilarity with the English language, and a profusion of morphological variations, all of which make it difficult to generalize deep learning architectures to languages like Marathi.

This work posits a solution for the TechDOfication 2020 subtask-1f: coarse-grained domain classification for short Marathi texts. The task provides a large corpus of Marathi text documents spanning four domains: Biochemistry, Communication Technology, Computer Science, and Physics. Efficient domain identification can potentially impact and improve research in downstream NLP applications such as question answering, transliteration, machine translation, and text summarization, to name a few. Inspired by the works of Er et al. (2016), Guo et al. (2018), and Zheng and Zheng (2019), a hybrid CNN-BiLSTM attention ensemble model is proposed in this work. In recent years, convolutional neural networks (Kim, 2014; Conneau et al., 2016; Zhang et al., 2015; Liu et al., 2020; Le et al., 2017) and recurrent neural networks (Liu et al., 2016; Sundermeyer et al., 2015) have been used quite frequently for text classification tasks. Quite different from one another, CNNs and RNNs show different capabilities in generating intermediate text representations. A CNN models an input sentence by utilizing convolutional filters to identify the most influential n-grams of different semantic aspects (Conneau et al., 2016). An RNN can handle variable-length input sentences and is particularly well suited for modeling sequential data, learning important temporal features and long-term dependencies for robust text representation (Hochreiter and Schmidhuber, 1997). However, whilst a CNN can only capture local patterns and fails to incorporate long-term dependencies and sequential features, an RNN cannot distinguish the keywords that contribute more context to the classification task from ordinary stopwords. The proposed model therefore hypothesizes a potent way to subsume the advantages of both the CNN and the RNN using the attention mechanism. The model employs a parallel structure in which both the CNN and the BiLSTM model the input sentences independently. The intermediate representations thus generated are combined using the attention mechanism. The resulting vector therefore carries useful temporal features from the sequences generated by the RNN, weighted according to the context generated by the CNN. Results attest that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy and f1-score.
2 Related Work

Over the past decade, research in NLP has shifted from a traditional statistical standpoint to complex neural network architectures. CNN- and RNN-based architectures have proven greatly successful for the text classification task. Yoon Kim was the first to apply a CNN model to English text classification, conducting a series of experiments with single-channel as well as multi-channel convolutional neural networks built on top of randomly generated, pretrained, and fine-tuned word vectors (Kim, 2014). This success of CNNs for text classification led to the emergence of more complex CNN models (Conneau et al., 2016) as well as CNN models with character-level inputs (Zhang et al., 2015). RNNs are capable of generating effective text representations by learning temporal features and long-term dependencies between words (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). However, these methods treat each word in a sentence equally and thus cannot distinguish the keywords that contribute more to the classification from the common words. Hybrid models proposed by Xiao and Cho (2016) and Hassan and Mahmood (2017) succeed in exploiting the advantages of both CNNs and RNNs by using them in combination for text classification.

Since the introduction of the attention mechanism (Vaswani et al., 2017), it has become an effective strategy for dynamically learning the contribution of different features to specific tasks. Needless to say, the attention mechanism has expeditiously found its way into the NLP literature, with many works effectively leveraging it to improve text classification. Guo et al. (2018) proposed a CNN-RNN attention-based neural network (CRAN) for text classification. Their work illustrates the effectiveness of using the CNN layer as the context of the attention model. Results show that this mechanism enables the model to pick the important words from the sequences generated by the RNN layer, helping it to outperform many baselines as well as hybrid attention-based models on the text classification task. Er et al. (2016) proposed an attention pooling strategy that focuses on making a model learn better sentence representations for improved text classification. The authors use the intermediate sentence representations produced by a BiLSTM layer in reference to the local representations produced by a CNN layer to obtain the attention weights. Experimental results demonstrate that their model outperforms state-of-the-art approaches on a number of benchmark datasets for text classification. Zheng and Zheng (2019) combine the BiLSTM and CNN with the attention mechanism for fine-grained text classification tasks. The authors employ a method in which intermediate sentence representations generated by a BiLSTM are passed to a CNN layer and then max-pooled to capture the local features of a sentence. The local feature representations are further combined using an attention layer to calculate the attention weights. In this way, the attention layer can assign different weights to features according to their importance to the text classification task.

The NLP literature focusing on resource-constrained Indian languages has been fairly restrained. Tummalapalli et al. (2018) evaluated the performance of vanilla CNN, LSTM, and multi-input CNN architectures for the text classification of Hindi and Telugu texts. The results indicate that CNN-based models perform surprisingly well compared to LSTMs and SVMs using n-gram features. Joshi et al. (2019) compared different deep learning approaches for Hindi sentence classification. The authors evaluated the effect of using pretrained fasttext Hindi embeddings on the sentence classification task. The finest classification performance is achieved by the vanilla CNN model when initialized with fasttext word embeddings fine-tuned on the specific dataset.
3 Dataset

The TechDOfication-2020 subtask-1f dataset consists of labelled Marathi text documents, each belonging to one of four classes, namely: Biochemistry (bioche), Communication Technology (com tech), Computer Science (cse), and Physics (phy). The training documents have a mean length of 26.89 words with a standard deviation of 25.12. Table 1 provides an overview of the distribution of the corpus across the four labels for the training and validation data. From the table, it is evident that the dataset is imbalanced, with the classes Communication Technology and Biochemistry having the most and the fewest documents, respectively. It is, therefore, reasonable to postulate that this data imbalance may lead to the overfitting of the model on some classes. This is further articulated in the Results section.

Label      Training Data   Validation Data
bioche     5,002           420
com tech   17,995          1,505
cse        9,344           885
phy        9,656           970
Total      41,997          3,780

Table 1: Data distribution.

4 Proposed Model

This section describes the proposed multi-input attention-based parallel CNN-BiLSTM. Figure 1 depicts the model architecture. Each component is explained in detail as follows.

[Figure 1: Model Architecture.]

4.1 Word Embedding Layer

The proposed model uses fasttext word embeddings, trained with the unsupervised skip-gram model, to map each word in the corpus vocabulary to a corresponding dense vector. Fasttext embeddings are preferred over the word2vec (Mikolov et al., 2013) or GloVe variants (Pennington et al., 2014), as fasttext represents each word as a sequence of character n-grams, which in turn helps to capture the morphological richness of languages like Marathi. The embedding layer converts each word w_i in the document T = {w_1, w_2, ..., w_n} of n words into a real-valued dense vector e_i using the following matrix-vector product:

e_i = W v_i    (1)

where W ∈ R^(d×|V|) is the embedding matrix, |V| is the fixed-size vocabulary of the corpus, and d is the word embedding dimension. v_i is the one-hot encoded vector with the element corresponding to w_i set to 1 and all other elements set to 0. Thus, the document can be represented as the real-valued vector sequence e = {e_1, e_2, ..., e_n}.

4.2 Bi-LSTM Layer

The word embeddings generated by the embedding layer are fed to the BiLSTM unit step by step. A bidirectional long short-term memory (Bi-LSTM) (Graves and Schmidhuber, 2005) layer is simply a combination of two LSTMs (Hochreiter and Schmidhuber, 1997) running in opposite directions. This allows the network to have both forward and backward information about the sequence at every time step, resulting in better understanding and preservation of the context. It is also able to counter the problem of vanishing gradients to a certain extent by utilizing the input, output, and forget gates. The intermediate sentence representation generated by the Bi-LSTM is denoted as h.

4.3 CNN Layer

The discrete convolutions performed by the CNN layer on the input word embeddings help to extract the most influential n-grams in the sentence. Three parallel convolutional layers with three different window sizes are used so that the model can learn multiple types of embeddings of local regions, which complement one another. Finally, the sentence representations from all the different convolutions are concatenated and max-pooled to retain the most dominant features. The output is denoted as c.

4.4 Attention Layer

The linchpin of the model is the attention block, which effectively combines the intermediate sentence feature representation generated by the BiLSTM with the local feature representation generated by the CNN. At each time step t, taking the output h_t of the BiLSTM and c_t of the CNN, the attention weights α_t are calculated as:

u_t = tanh(W_1 h_t + W_2 c_t + b)    (2)
α_t = softmax(u_t)    (3)

where W_1 and W_2 are the attention weights and b is the attention bias, learned via backpropagation. The final sentence representation s is calculated as the weighted arithmetic mean based on the weights α = {α_1, α_2, ..., α_n} and the BiLSTM outputs h = {h_1, h_2, ..., h_n}. It is given as:

s = \sum_{t=1}^{n} α_t · h_t    (4)

Thus, the model is able to retain the merits of both the BiLSTM and the CNN, leading to a more robust sentence representation. This representation is then fed to a fully connected layer for dimensionality reduction.

4.5 Classification Layer

The output of the fully connected attention layer is passed to a dense layer with softmax activation to predict one of the four labels in the given task.
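The paper does not release an implementation, but the parallel architecture of Sections 4.1-4.5 can be sketched compactly. The following is a minimal, hypothetical Keras (TensorFlow 2.x) sketch: the framework choice, the vocabulary size, the width of the dimensionality-reduction layer, and the scalar projection used to turn u_t into a single attention score per time step are assumptions, since the paper leaves those details implicit.

```python
# Hypothetical sketch of the parallel CNN-BiLSTM attention ensemble.
# Framework (TensorFlow 2.x / Keras), vocabulary size, reduction width,
# and the scalar scoring projection are assumptions, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 100       # padding length (Section 5)
EMB_DIM = 300       # fasttext dimension (Section 5)
VOCAB_SIZE = 50000  # placeholder; |V| depends on the corpus


def build_model(num_classes: int = 4) -> tf.keras.Model:
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")

    # 4.1: embedding layer; in practice the weights would be initialised
    # from the pretrained fasttext vectors.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)

    # 4.2: BiLSTM branch -> per-time-step states h_t (2 x 128 = 256 dims).
    h = layers.Bidirectional(
        layers.LSTM(128, return_sequences=True, dropout=0.3))(emb)

    # 4.3: three parallel convolutions (window sizes 3, 4, 5), each
    # max-pooled and concatenated into the context vector c.
    pooled = [layers.GlobalMaxPooling1D()(
        layers.Conv1D(256, k, activation="relu")(emb)) for k in (3, 4, 5)]
    c = layers.Dropout(0.3)(layers.Concatenate()(pooled))

    # 4.4: attention over BiLSTM states, conditioned on the CNN context:
    # u_t = tanh(W1 h_t + W2 c + b), alpha = softmax over t, s = sum_t alpha_t h_t.
    w1h = layers.Dense(256, use_bias=False)(h)                 # W1 h_t
    w2c = layers.RepeatVector(MAX_LEN)(layers.Dense(256)(c))   # W2 c + b, broadcast over t
    u = layers.Activation("tanh")(layers.Add()([w1h, w2c]))
    scores = layers.Dense(1)(u)                                # assumed scalar score per step
    alpha = layers.Softmax(axis=1)(scores)
    s = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alpha])

    # Fully connected reduction (width assumed), then 4.5: softmax classifier.
    s = layers.Dense(128, activation="relu")(s)
    out = layers.Dense(num_classes, activation="softmax")(s)
    return tf.keras.Model(tokens, out)
```

build_model() mirrors the parallel structure described above; treating the pooled CNN output as a single context vector broadcast over time is one reading of Equation (2) and is a design choice of this sketch.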
5 Experimental Setup

Each text document is tokenized and padded to a maximum length of 100; longer documents are truncated. The work of Kakwani et al. (2020) is consulted for selecting the optimal set of hyperparameters for training the fasttext skip-gram model. The 300-dimensional fasttext word embeddings are trained on the given corpus for 50 epochs, with a minimum token count of 1 and 10 negative examples sampled per instance. The remaining hyperparameter values were kept at their defaults (Bojanowski et al., 2017). After training, an average loss of 0.6338 was obtained over the 50 epochs. The validation dataset is used to tune the hyperparameters of the classification model. The LSTM layer dimension is set to 128 neurons with a dropout rate of 0.3; the BiLSTM thus gives an intermediate representation of 256 dimensions. For the CNN block, we employ three parallel convolutional layers with filter sizes 3, 4, and 5, each having 256 feature maps. A dropout rate of 0.3 is applied to each layer. The local representations generated by the parallel CNNs are then concatenated and max-pooled. All other parameters in the model are initialized randomly. The model is trained end-to-end for 15 epochs with the Adam optimizer (Kingma and Ba, 2014), sparse categorical cross-entropy loss, a learning rate of 0.001, and a minibatch size of 128. The best model is stored, and the learning rate is reduced by a factor of 0.1 if the validation loss does not decline after two successive epochs.
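Assuming the fasttext Python bindings and the Keras training API (neither tool is named in the paper), the setup above roughly corresponds to the following sketch; file and variable names are hypothetical placeholders.

```python
# Hypothetical training setup mirroring Section 5; the fasttext Python
# bindings and Keras callbacks are assumptions about tooling.
import fasttext
import tensorflow as tf

# Skip-gram fasttext embeddings: 300-d, 50 epochs, minCount=1, 10 negatives.
ft = fasttext.train_unsupervised(
    "marathi_corpus.txt",              # hypothetical path to the raw training text
    model="skipgram", dim=300, epoch=50, minCount=1, neg=10)

model = build_model(num_classes=4)     # from the sketch after Section 4
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    # Keep the best model and reduce the LR by 0.1 after 2 stagnant epochs.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),
]

# x_train / y_train and x_val / y_val are hypothetical padded id sequences and labels.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=15, batch_size=128, callbacks=callbacks)
```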
6 Baseline Models

The performance of the proposed model is compared with a host of machine learning and deep learning models, and the results are reported in Table 3. They are as follows:

Feature-based models: Multinomial Naive Bayes with bag-of-words input (MNB + BoW), Multinomial Naive Bayes with tf-idf input (MNB + TF-IDF), Linear SVC with bag-of-words input (LSVC + BoW), and Linear SVC with tf-idf input (LSVC + TF-IDF).

Basic neural networks: feed-forward neural network with max-pooling (FFNN), CNN with max-pooling (CNN), and BiLSTM with max-pooling (BiLSTM).

Complex neural networks: BiLSTM + attention (Zhou et al., 2016), serial BiLSTM-CNN (Chen et al., 2017), and serial BiLSTM-CNN + attention.
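The paper does not state how the feature-based baselines were implemented. A minimal scikit-learn sketch (an assumption about tooling; train_texts, train_labels, val_texts, and val_labels are hypothetical variables holding the corpus splits) reproducing the four feature-based configurations could look as follows.

```python
# Hypothetical scikit-learn pipelines for the four feature-based baselines.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score

baselines = {
    "MNB + BoW":     Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "MNB + TF-IDF":  Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    "LSVC + BoW":    Pipeline([("vec", CountVectorizer()), ("clf", LinearSVC())]),
    "LSVC + TF-IDF": Pipeline([("vec", TfidfVectorizer()), ("clf", LinearSVC())]),
}

# train_texts / train_labels and val_texts / val_labels are hypothetical splits.
for name, pipe in baselines.items():
    pipe.fit(train_texts, train_labels)
    preds = pipe.predict(val_texts)
    print(name,
          accuracy_score(val_labels, preds),
          f1_score(val_labels, preds, average="weighted"))
```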
7 Results and Discussion

The performance of all the models is listed in Table 3. The proposed model outperforms all other models in validation accuracy and weighted f1-score. It achieves better results than the standalone CNN and BiLSTM, thus reasserting the importance of combining both structures. The BiLSTM with attention model is similar to the proposed model, but the context is ignored. As the proposed model outperforms the BiLSTM with attention model, this demonstrates the effectiveness of the CNN layer in providing context. Stacking a convolutional layer over a BiLSTM unit results in lower performance than the standalone BiLSTM. It can thus be inferred that combining the CNN and BiLSTM in parallel is much more effective than serially stacking them. The proposed attention mechanism is therefore able to successfully unify the CNN and the BiLSTM, providing meaningful context to the temporal representation generated by the BiLSTM.

Model                              Validation Accuracy   Validation F1-Score
MNB + BoW                          86.74                 0.8352
MNB + TF-IDF                       77.16                 0.8010
Linear SVC + BoW                   85.76                 0.8432
Linear SVC + TF-IDF                88.17                 0.8681
FFNN                               76.11                 0.7454
CNN                                86.67                 0.8532
BiLSTM                             89.31                 0.8842
BiLSTM + Attention                 88.14                 0.8697
Serial BiLSTM-CNN                  88.99                 0.8807
Serial BiLSTM-CNN + Attention      88.23                 0.8727
Ensemble CNN-BiLSTM + Attention    89.57                 0.8875

Table 3: Performance comparison of different models on the validation data.

Table 2 reports the detailed performance of the proposed model on the validation data. The precision and recall for the Communication Technology (com tech), Computer Science (cse), and Physics (phy) labels are quite consistent. The Biochemistry (bioche) label suffers from a large gap between precision and recall. This can be traced back to the fact that less training data is available for this label, leading to the model overfitting on it.

Metrics     bioche    com tech   cse       phy
Precision   0.9128    0.8831     0.9145    0.8931
Recall      0.7976    0.9342     0.8949    0.8793
F1-Score    0.8513    0.9079     0.9046    0.8862

Table 2: Detailed performance of the proposed model on the validation data.
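Per-class figures such as those in Table 2 can be produced with scikit-learn's classification_report; the variable names below are hypothetical placeholders for the gold and predicted validation labels.

```python
# Hypothetical evaluation snippet; val_labels / val_preds are placeholders
# for the gold and predicted label ids on the validation split.
from sklearn.metrics import classification_report

print(classification_report(
    val_labels, val_preds,
    target_names=["bioche", "com tech", "cse", "phy"],
    digits=4))
```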
8 Conclusion and Future Work

While NLP research in English is achieving new heights, progress in low-resource languages is still in its nascent stage. The TechDOfication task paves the way for research in this field through the task of technical domain identification for texts in Indian languages. This paper proposes a CNN-BiLSTM based attention ensemble model for subtask-1f, Marathi text classification. The parallel CNN-BiLSTM attention-based model successfully unifies the intermediate representations generated by both models using the attention mechanism. It provides a way forward for research in adapting attention-based models to low-resource and morphologically rich languages. The performance of the model can be enhanced by providing additional inputs such as character n-grams and document-topic distributions. More efficient attention mechanisms can also be applied to further consolidate the amalgamation of CNN and RNN.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Pooja Bolaj and Sharvari Govilkar. 2016a. A survey on text categorization techniques for Indian regional languages. International Journal of Computer Science and Information Technologies, 7(2):480-483.

Pooja Bolaj and Sharvari Govilkar. 2016b. Text classification for Marathi documents using supervised learning methods. International Journal of Computer Applications, 155(8):6-10.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221-230.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.
Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, and Kaushik Roy. 2018. Performance of classifiers in Bangla text categorization. In 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pages 168-173. IEEE.

Meng Joo Er, Yong Zhang, Ning Wang, and Mahardhika Pratama. 2016. Attention pooling-based convolutional neural network for sentence modelling. Information Sciences, 373:388-403.

Shang Gao, Arvind Ramanathan, and Georgia Tourassi. 2018. Hierarchical convolutional attention networks for text classification. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 11-23.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610.

Long Guo, Dongxiang Zhang, Lei Wang, Han Wang, and Bin Cui. 2018. CRAN: A hybrid CNN-RNN attention-based model for text classification. In International Conference on Conceptual Modeling, pages 571-585. Springer.

A. Hassan and A. Mahmood. 2017. Efficient deep learning model for text classification based on recurrent and convolutional layers. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1108-1113.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Ramchandra Joshi, Purvi Goel, and Raviraj Joshi. 2019. Deep learning for Hindi text classification: A comparison. In International Conference on Intelligent Human Computer Interaction, pages 94-101. Springer.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4948-4961.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hoa T Le, Christophe Cerisara, and Alexandre Denis. 2017. Do convolutional networks need to be deep for text classification? arXiv preprint arXiv:1707.04108.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Zhenyu Liu, Haiwei Huang, Chaohong Lu, and Shengfei Lyu. 2020. Multichannel CNN with attention for text classification. arXiv preprint arXiv:2006.16174.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517-529.

Madhuri Tummalapalli, Manoj Chinnakotla, and Radhika Mamidi. 2018. Towards better sentence classification for morphologically rich languages. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606-615.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657.

Jin Zheng and Limin Zheng. 2019. A hybrid bidirectional recurrent convolutional neural network attention-based model for text classification. IEEE Access, 7:106673-106685.
Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers), pages 207–212.