An Attention Ensemble Approach for Efficient Text Classification of Indian Languages

Atharva Kulkarni1, Amey Hengle1, and Rutuja Udyawar2
1 Department of Computer Engineering, PVG's COET, Savitribai Phule Pune University, India.
2 Optimum Data Analytics, India.
1 {atharva.j.kulkarni1998, ameyhengle22}@gmail.com
2 rutuja.udyawar@odaml.com

arXiv:2102.10275v1 [cs.CL] 20 Feb 2021

Abstract

The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU AKAH's solution for the TechDOfication 2020 subtask-1f, which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.

1 Introduction

The advent of attention-based neural networks and the availability of large labelled datasets have resulted in great success and state-of-the-art performance for English text classification (Yang et al., 2016; Zhou et al., 2016; Wang et al., 2016; Gao et al., 2018). Such results, however, are far fewer for Indian language text classification tasks, as most of the research employs traditional machine learning and deep learning models (Joshi et al., 2019; Tummalapalli et al., 2018; Bolaj and Govilkar, 2016a,b; Dhar et al., 2018). Apart from being heavily consumed in print, Indian languages have a rapidly growing internet user base, scaling from 234 million users in 2016 to 536 million by 2021 (source: https://home.kpmg/in/en/home/insights/2017/04/indian-language-internet-users.html). Even so, just like most other Indian languages, progress in NLP for Marathi has been relatively constrained, owing to factors such as the unavailability of large-scale training resources, structural dissimilarity with the English language, and a profusion of morphological variations, all of which make it difficult to generalize deep learning architectures to languages like Marathi.

This work posits a solution for the TechDOfication 2020 subtask-1f: coarse-grained domain classification for short Marathi texts. The task provides a large corpus of Marathi text documents spanning four domains: Biochemistry, Communication Technology, Computer Science, and Physics. Efficient domain identification can potentially impact and improve research in downstream NLP applications such as question answering, transliteration, machine translation, and text summarization, to name a few. Inspired by the works of Er et al. (2016), Guo et al. (2018), and Zheng and Zheng (2019), a hybrid CNN-BiLSTM attention ensemble model is proposed in this work. In recent years, convolutional neural networks (Kim, 2014; Conneau et al., 2016; Zhang et al., 2015; Liu et al., 2020; Le et al., 2017) and recurrent neural networks (Liu et al., 2016; Sundermeyer et al., 2015) have been used quite frequently for text classification tasks. Quite different from one another, CNNs and RNNs show different capabilities in generating intermediate text representations. A CNN models an input sentence by utilizing convolutional filters to identify the most influential n-grams of different semantic aspects (Conneau et al., 2016). An RNN can handle variable-length input sentences and is particularly well suited for modeling sequential data, learning important temporal features and long-term dependencies for robust text representation (Hochreiter and Schmidhuber, 1997). However, whilst a CNN can only capture local patterns and fails to incorporate long-term dependencies and sequential features, an RNN cannot distinguish the keywords that contribute more context to the classification task from ordinary stopwords. The proposed model therefore hypothesizes a potent way to subsume the advantages of both the CNN and the RNN using the attention mechanism. The model employs a parallel structure in which both the CNN and the BiLSTM model the input sentences independently. The intermediate representations thus generated are combined using the attention mechanism. The resulting vector therefore carries useful temporal features from the sequences generated by the RNN, weighted according to the context generated by the CNN. Results attest that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy and f1-score.
2 Related Work

Over the past decade, research in NLP has shifted from a traditional statistical standpoint to complex neural network architectures. CNN- and RNN-based architectures have proven greatly successful for the text classification task. Yoon Kim was the first to apply a CNN model to English text classification, conducting a series of experiments with single-channel as well as multi-channel convolutional neural networks built on top of randomly generated, pretrained, and fine-tuned word vectors (Kim, 2014). This success of CNNs for text classification led to the emergence of more complex CNN models (Conneau et al., 2016) as well as CNN models with character-level inputs (Zhang et al., 2015). RNNs are capable of generating effective text representations by learning temporal features and long-term dependencies between words (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). However, these methods treat each word in a sentence equally and thus cannot distinguish the keywords that contribute more to the classification from the common words. Hybrid models proposed by Xiao and Cho (2016) and Hassan and Mahmood (2017) succeed in exploiting the advantages of both CNNs and RNNs by using them in combination for text classification.

Since the introduction of the attention mechanism (Vaswani et al., 2017), it has become an effective strategy for dynamically learning the contribution of different features to specific tasks. Needless to say, the attention mechanism has expeditiously found its way into the NLP literature, with many works effectively leveraging it to improve text classification. Guo et al. (2018) proposed a CNN-RNN attention-based neural network (CRAN) for text classification. Their work illustrates the effectiveness of using the CNN layer as the context of the attention model. Results show that this mechanism enables the model to pick the important words from the sequences generated by the RNN layer, helping it to outperform many baselines as well as hybrid attention-based models on the text classification task. Er et al. (2016) proposed an attention pooling strategy that focuses on making a model learn better sentence representations for improved text classification. The authors use the intermediate sentence representations produced by a BiLSTM layer in reference to the local representations produced by a CNN layer to obtain the attention weights. Experimental results demonstrate that their model outperforms state-of-the-art approaches on a number of benchmark datasets for text classification. Zheng and Zheng (2019) combine the BiLSTM and CNN with the attention mechanism for fine-grained text classification tasks. The authors employ a method in which intermediate sentence representations generated by a BiLSTM are passed to a CNN layer and then max-pooled to capture the local features of a sentence. The local feature representations are further combined using an attention layer to calculate the attention weights. In this way, the attention layer can assign different weights to features according to their importance to the text classification task.

The NLP literature focusing on resource-constrained Indian languages has been fairly restrained. Tummalapalli et al. (2018) evaluated the performance of vanilla CNN, LSTM, and multi-input CNN architectures for the text classification of Hindi and Telugu texts. The results indicate that CNN-based models perform surprisingly well compared to LSTMs and SVMs using n-gram features. Joshi et al. (2019) compared different deep learning approaches for Hindi sentence classification. The authors evaluated the effect of using pretrained fasttext Hindi embeddings on the sentence classification task. The finest classification performance is achieved by the vanilla CNN model when initialized with fasttext word embeddings fine-tuned on the specific dataset.
3 Dataset

The TechDOfication-2020 subtask-1f dataset consists of labelled Marathi text documents, each belonging to one of four classes, namely: Biochemistry (bioche), Communication Technology (com tech), Computer Science (cse), and Physics (phy). The training documents have a mean length of 26.89 words with a standard deviation of 25.12. Table 1 provides an overview of the distribution of the corpus across the four labels for the training and validation data. From the table, it is evident that the dataset is imbalanced, with the classes Communication Technology and Biochemistry having the most and the fewest documents, respectively. It is, therefore, reasonable to postulate that this data imbalance may lead to the overfitting of the model on some classes. This is further articulated in the Results section.

Label      Training Data   Validation Data
bioche     5,002           420
com tech   17,995          1,505
cse        9,344           885
phy        9,656           970
Total      41,997          3,780

Table 1: Data distribution.

4 Proposed Model

This section describes the proposed multi-input attention-based parallel CNN-BiLSTM. Figure 1 depicts the model architecture. Each component is explained in detail as follows.

[Figure 1: Model Architecture.]

4.1 Word Embedding Layer

The proposed model uses fasttext word embeddings, trained with the unsupervised skip-gram model, to map each word in the corpus vocabulary to a corresponding dense vector. Fasttext embeddings are preferred over the word2vec (Mikolov et al., 2013) or GloVe variants (Pennington et al., 2014), as fasttext represents each word as a sequence of character n-grams, which in turn helps to capture the morphological richness of languages like Marathi. The embedding layer converts each word w_i in the document T = {w_1, w_2, ..., w_n} of n words into a real-valued dense vector e_i using the following matrix-vector product:

e_i = W v_i    (1)

where W ∈ R^(d×|V|) is the embedding matrix, |V| is the fixed-size vocabulary of the corpus, and d is the word embedding dimension. v_i is the one-hot encoded vector with the element corresponding to w_i set to 1 and all other elements set to 0. Thus, the document can be represented as the real-valued vector sequence e = {e_1, e_2, ..., e_n}.

4.2 Bi-LSTM Layer

The word embeddings generated by the embedding layer are fed to the BiLSTM unit step by step. A bidirectional long short-term memory (Bi-LSTM) (Graves and Schmidhuber, 2005) layer is simply a combination of two LSTMs (Hochreiter and Schmidhuber, 1997) running in opposite directions. This allows the network to have both forward and backward information about the sequence at every time step, resulting in better understanding and preservation of the context. It is also able to counter the problem of vanishing gradients to a certain extent by utilizing the input, output, and forget gates. The intermediate sentence representation generated by the Bi-LSTM is denoted as h.

4.3 CNN Layer

The discrete convolutions performed by the CNN layer on the input word embeddings help to extract the most influential n-grams in the sentence. Three parallel convolutional layers with three different window sizes are used so that the model can learn multiple types of embeddings of local regions, which complement one another. Finally, the sentence representations from all the different convolutions are concatenated and max-pooled to retain the most dominant features. The output is denoted as c.

4.4 Attention Layer

The linchpin of the model is the attention block, which effectively combines the intermediate sentence feature representation generated by the BiLSTM with the local feature representation generated by the CNN. At each time step t, taking the output h_t of the BiLSTM and c_t of the CNN, the attention weights α_t are calculated as:

u_t = tanh(W_1 h_t + W_2 c_t + b)    (2)
α_t = softmax(u_t)    (3)

where W_1 and W_2 are the attention weights and b is the attention bias, learned via backpropagation. The final sentence representation s is calculated as the weighted arithmetic mean based on the weights α = {α_1, α_2, ..., α_n} and the BiLSTM outputs h = {h_1, h_2, ..., h_n}. It is given as:

s = \sum_{t=1}^{n} α_t · h_t    (4)

Thus, the model is able to retain the merits of both the BiLSTM and the CNN, leading to a more robust sentence representation. This representation is then fed to a fully connected layer for dimensionality reduction.

4.5 Classification Layer

The output of the fully connected attention layer is passed to a dense layer with softmax activation to predict one of the four labels in the given task.
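The paper does not release an implementation, but the parallel architecture of Sections 4.1-4.5 can be sketched compactly. The following is a minimal, hypothetical Keras (TensorFlow 2.x) sketch: the framework choice, the vocabulary size, the width of the dimensionality-reduction layer, and the scalar projection used to turn u_t into a single attention score per time step are assumptions, since the paper leaves those details implicit.

```python
# Hypothetical sketch of the parallel CNN-BiLSTM attention ensemble.
# Framework (TensorFlow 2.x / Keras), vocabulary size, reduction width,
# and the scalar scoring projection are assumptions, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN = 100       # padding length (Section 5)
EMB_DIM = 300       # fasttext dimension (Section 5)
VOCAB_SIZE = 50000  # placeholder; |V| depends on the corpus


def build_model(num_classes: int = 4) -> tf.keras.Model:
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")

    # 4.1: embedding layer; in practice the weights would be initialised
    # from the pretrained fasttext vectors.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)

    # 4.2: BiLSTM branch -> per-time-step states h_t (2 x 128 = 256 dims).
    h = layers.Bidirectional(
        layers.LSTM(128, return_sequences=True, dropout=0.3))(emb)

    # 4.3: three parallel convolutions (window sizes 3, 4, 5), each
    # max-pooled and concatenated into the context vector c.
    pooled = [layers.GlobalMaxPooling1D()(
        layers.Conv1D(256, k, activation="relu")(emb)) for k in (3, 4, 5)]
    c = layers.Dropout(0.3)(layers.Concatenate()(pooled))

    # 4.4: attention over BiLSTM states, conditioned on the CNN context:
    # u_t = tanh(W1 h_t + W2 c + b), alpha = softmax over t, s = sum_t alpha_t h_t.
    w1h = layers.Dense(256, use_bias=False)(h)                 # W1 h_t
    w2c = layers.RepeatVector(MAX_LEN)(layers.Dense(256)(c))   # W2 c + b, broadcast over t
    u = layers.Activation("tanh")(layers.Add()([w1h, w2c]))
    scores = layers.Dense(1)(u)                                # assumed scalar score per step
    alpha = layers.Softmax(axis=1)(scores)
    s = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([h, alpha])

    # Fully connected reduction (width assumed), then 4.5: softmax classifier.
    s = layers.Dense(128, activation="relu")(s)
    out = layers.Dense(num_classes, activation="softmax")(s)
    return tf.keras.Model(tokens, out)
```

build_model() mirrors the parallel structure described above; treating the pooled CNN output as a single context vector broadcast over time is one reading of Equation (2) and is a design choice of this sketch.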
5 Experimental Setup

Each text document is tokenized and padded to a maximum length of 100; longer documents are truncated. The work of Kakwani et al. (2020) is consulted for selecting the optimal set of hyperparameters for training the fasttext skip-gram model. The 300-dimensional fasttext word embeddings are trained on the given corpus for 50 epochs, with a minimum token count of 1 and 10 negative examples sampled per instance. The remaining hyperparameter values were kept at their defaults (Bojanowski et al., 2017). After training, an average loss of 0.6338 was obtained over the 50 epochs. The validation dataset is used to tune the hyperparameters of the classification model. The LSTM layer dimension is set to 128 neurons with a dropout rate of 0.3; the BiLSTM thus gives an intermediate representation of 256 dimensions. For the CNN block, we employ three parallel convolutional layers with filter sizes 3, 4, and 5, each having 256 feature maps. A dropout rate of 0.3 is applied to each layer. The local representations generated by the parallel CNNs are then concatenated and max-pooled. All other parameters in the model are initialized randomly. The model is trained end-to-end for 15 epochs with the Adam optimizer (Kingma and Ba, 2014), sparse categorical cross-entropy loss, a learning rate of 0.001, and a minibatch size of 128. The best model is stored, and the learning rate is reduced by a factor of 0.1 if the validation loss does not decline after two successive epochs.
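Assuming the fasttext Python bindings and the Keras training API (neither tool is named in the paper), the setup above roughly corresponds to the following sketch; file and variable names are hypothetical placeholders.

```python
# Hypothetical training setup mirroring Section 5; the fasttext Python
# bindings and Keras callbacks are assumptions about tooling.
import fasttext
import tensorflow as tf

# Skip-gram fasttext embeddings: 300-d, 50 epochs, minCount=1, 10 negatives.
ft = fasttext.train_unsupervised(
    "marathi_corpus.txt",              # hypothetical path to the raw training text
    model="skipgram", dim=300, epoch=50, minCount=1, neg=10)

model = build_model(num_classes=4)     # from the sketch after Section 4
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

callbacks = [
    # Keep the best model and reduce the LR by 0.1 after 2 stagnant epochs.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),
]

# x_train / y_train and x_val / y_val are hypothetical padded id sequences and labels.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=15, batch_size=128, callbacks=callbacks)
```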
6 Baseline Models

The performance of the proposed model is compared with a host of machine learning and deep learning models, and the results are reported in Table 3. They are as follows:

Feature-based models: Multinomial Naive Bayes with bag-of-words input (MNB + BoW), Multinomial Naive Bayes with tf-idf input (MNB + TF-IDF), Linear SVC with bag-of-words input (LSVC + BoW), and Linear SVC with tf-idf input (LSVC + TF-IDF).

Basic neural networks: feed-forward neural network with max-pooling (FFNN), CNN with max-pooling (CNN), and BiLSTM with max-pooling (BiLSTM).

Complex neural networks: BiLSTM + attention (Zhou et al., 2016), serial BiLSTM-CNN (Chen et al., 2017), and serial BiLSTM-CNN + attention.
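The paper does not state how the feature-based baselines were implemented. A minimal scikit-learn sketch (an assumption about tooling; train_texts, train_labels, val_texts, and val_labels are hypothetical variables holding the corpus splits) reproducing the four feature-based configurations could look as follows.

```python
# Hypothetical scikit-learn pipelines for the four feature-based baselines.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score

baselines = {
    "MNB + BoW":     Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "MNB + TF-IDF":  Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    "LSVC + BoW":    Pipeline([("vec", CountVectorizer()), ("clf", LinearSVC())]),
    "LSVC + TF-IDF": Pipeline([("vec", TfidfVectorizer()), ("clf", LinearSVC())]),
}

# train_texts / train_labels and val_texts / val_labels are hypothetical splits.
for name, pipe in baselines.items():
    pipe.fit(train_texts, train_labels)
    preds = pipe.predict(val_texts)
    print(name,
          accuracy_score(val_labels, preds),
          f1_score(val_labels, preds, average="weighted"))
```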
7 Results and Discussion

The performance of all the models is listed in Table 3. The proposed model outperforms all other models in validation accuracy and weighted f1-score. It achieves better results than the standalone CNN and BiLSTM, thus reasserting the importance of combining both structures. The BiLSTM with attention model is similar to the proposed model, but the context is ignored. As the proposed model outperforms the BiLSTM with attention model, this demonstrates the effectiveness of the CNN layer in providing context. Stacking a convolutional layer over a BiLSTM unit results in lower performance than the standalone BiLSTM. It can thus be inferred that combining the CNN and BiLSTM in parallel is much more effective than serially stacking them. The proposed attention mechanism is therefore able to successfully unify the CNN and the BiLSTM, providing meaningful context to the temporal representation generated by the BiLSTM.

Model                              Validation Accuracy   Validation F1-Score
MNB + BoW                          86.74                 0.8352
MNB + TF-IDF                       77.16                 0.8010
Linear SVC + BoW                   85.76                 0.8432
Linear SVC + TF-IDF                88.17                 0.8681
FFNN                               76.11                 0.7454
CNN                                86.67                 0.8532
BiLSTM                             89.31                 0.8842
BiLSTM + Attention                 88.14                 0.8697
Serial BiLSTM-CNN                  88.99                 0.8807
Serial BiLSTM-CNN + Attention      88.23                 0.8727
Ensemble CNN-BiLSTM + Attention    89.57                 0.8875

Table 3: Performance comparison of different models on the validation data.

Table 2 reports the detailed performance of the proposed model on the validation data. The precision and recall for the Communication Technology (com tech), Computer Science (cse), and Physics (phy) labels are quite consistent. The Biochemistry (bioche) label suffers from a large gap between precision and recall. This can be traced back to the fact that less training data is available for this label, leading to the model overfitting on it.

Metrics     bioche    com tech   cse       phy
Precision   0.9128    0.8831     0.9145    0.8931
Recall      0.7976    0.9342     0.8949    0.8793
F1-Score    0.8513    0.9079     0.9046    0.8862

Table 2: Detailed performance of the proposed model on the validation data.
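Per-class figures such as those in Table 2 can be produced with scikit-learn's classification_report; the variable names below are hypothetical placeholders for the gold and predicted validation labels.

```python
# Hypothetical evaluation snippet; val_labels / val_preds are placeholders
# for the gold and predicted label ids on the validation split.
from sklearn.metrics import classification_report

print(classification_report(
    val_labels, val_preds,
    target_names=["bioche", "com tech", "cse", "phy"],
    digits=4))
```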
8 Conclusion and Future Work

While NLP research in English is achieving new heights, progress in low-resource languages is still in its nascent stage. The TechDOfication task paves the way for research in this field through the task of technical domain identification for texts in Indian languages. This paper proposes a CNN-BiLSTM based attention ensemble model for subtask-1f, Marathi text classification. The parallel CNN-BiLSTM attention-based model successfully unifies the intermediate representations generated by both models using the attention mechanism. It provides a way forward for research in adapting attention-based models to low-resource and morphologically rich languages. The performance of the model can be enhanced by providing additional inputs such as character n-grams and document-topic distributions. More efficient attention mechanisms can also be applied to further consolidate the amalgamation of CNN and RNN.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Pooja Bolaj and Sharvari Govilkar. 2016a. A survey on text categorization techniques for Indian regional languages. International Journal of Computer Science and Information Technologies, 7(2):480-483.

Pooja Bolaj and Sharvari Govilkar. 2016b. Text classification for Marathi documents using supervised learning methods. International Journal of Computer Applications, 155(8):6-10.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221-230.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.
Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, and Kaushik Roy. 2018. Performance of classifiers in Bangla text categorization. In 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pages 168-173. IEEE.

Meng Joo Er, Yong Zhang, Ning Wang, and Mahardhika Pratama. 2016. Attention pooling-based convolutional neural network for sentence modelling. Information Sciences, 373:388-403.

Shang Gao, Arvind Ramanathan, and Georgia Tourassi. 2018. Hierarchical convolutional attention networks for text classification. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 11-23.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610.

Long Guo, Dongxiang Zhang, Lei Wang, Han Wang, and Bin Cui. 2018. CRAN: A hybrid CNN-RNN attention-based model for text classification. In International Conference on Conceptual Modeling, pages 571-585. Springer.

A. Hassan and A. Mahmood. 2017. Efficient deep learning model for text classification based on recurrent and convolutional layers. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1108-1113.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Ramchandra Joshi, Purvi Goel, and Raviraj Joshi. 2019. Deep learning for Hindi text classification: A comparison. In International Conference on Intelligent Human Computer Interaction, pages 94-101. Springer.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4948-4961.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hoa T Le, Christophe Cerisara, and Alexandre Denis. 2017. Do convolutional networks need to be deep for text classification? arXiv preprint arXiv:1707.04108.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Zhenyu Liu, Haiwei Huang, Chaohong Lu, and Shengfei Lyu. 2020. Multichannel CNN with attention for text classification. arXiv preprint arXiv:2006.16174.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517-529.

Madhuri Tummalapalli, Manoj Chinnakotla, and Radhika Mamidi. 2018. Towards better sentence classification for morphologically rich languages. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606-615.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657.

Jin Zheng and Limin Zheng. 2019. A hybrid bidirectional recurrent convolutional neural network attention-based model for text classification. IEEE Access, 7:106673-106685.
Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers), pages 207–212.