Multi-class Text Classification using BERT-based Active Learning
Sumanth Prabhu†    Moosa Mohamed†    Hemant Misra
Applied Research, Swiggy

arXiv:2104.14289v1 [cs.IR] 27 Apr 2021

† Equal contribution.

Abstract

Text Classification finds interesting applications in the pickup and delivery services industry, where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps understand the market needs for different customer segments. Each transaction is accompanied by a text description provided by the customer to describe the products being picked up and delivered, which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by the customers tend to be short, incoherent and code-mixed (Hindi-English) text, which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in multi-class text classification.

1 Introduction

Most of the focus of the machine learning community is on creating better algorithms for learning from data. However, getting useful annotated datasets can prove to be difficult. In many scenarios, we may have access to a large unannotated corpus while it would be infeasible to annotate all instances (Settles, 2009). Thus, weaker forms of supervision (Liang, 2005; Natarajan et al., 2013; Ratner et al., 2017; Mekala and Shang, 2020) were explored to label data cost effectively. In this paper, we focus on Active Learning, a popular technique under Human-in-the-Loop Machine Learning (Monarch, 2019). Active Learning overcomes the labeling bottleneck by choosing a subset of instances from a pool of unlabeled instances to be labeled by human annotators (a.k.a. the oracle). The goal in identifying this subset is to achieve high accuracy while having labeled as few instances as possible.

BERT-based Deep Learning models (Devlin et al., 2019; Sanh et al., 2020; Liu et al., 2019) have been proven to achieve state-of-the-art results on GLUE (Wang et al., 2018), RACE (Lai et al., 2017) and SQuAD (Rajpurkar et al., 2016, 2018). Active Learning has been successfully integrated with various Deep Learning approaches (Gal and Ghahramani, 2016; Gissin and Shalev-Shwartz, 2019). However, the use of Active Learning with BERT-based models for multi-class text classification has not been studied extensively.

In this paper, we consider an industry text classification use-case specific to pickup and delivery services, where customers make use of short text to describe the products to be picked up from a certain location and dropped at a target location. Table 1 shows a few examples of the descriptions used by our customers to describe their transactions. Customers tend to use short, incoherent and code-mixed (using more than one language in the same message¹) textual descriptions of the products in the transaction. These descriptions, if mapped to a fixed set of possible categories, help assist critical business decisions such as demographic-driven prioritization of categories, launch of new product categories, and so on. Furthermore, a transaction may comprise multiple products, which adds to the complexity of the task. In this work, we focus on multi-class classification of transactions, where a single majority category drives the transaction.

¹ In our case, Hindi written in Roman script is mixed with English.
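To make the pool-based loop described above concrete, here is a minimal sketch of the acquire-label-retrain cycle. It is illustrative only: `train`, `acquisition_score` and `oracle_label` are hypothetical placeholders for a model trainer, a selection strategy and the human annotator, and are not part of the paper's implementation.

```python
import numpy as np

def active_learning_loop(pool_texts, seed_idx, oracle_label, train,
                         acquisition_score, batch_size=100, iterations=20):
    """Generic pool-based active learning loop (illustrative sketch)."""
    # Seed set: labeled up front by the oracle.
    labels = {i: oracle_label(pool_texts[i]) for i in seed_idx}
    model = None
    for _ in range(iterations):
        # Retrain on everything labeled so far.
        model = train([pool_texts[i] for i in labels],
                      [labels[i] for i in labels])
        # Score the remaining pool; higher score = more informative.
        unlabeled = [i for i in range(len(pool_texts)) if i not in labels]
        scores = acquisition_score(model, [pool_texts[i] for i in unlabeled])
        chosen = [unlabeled[j] for j in np.argsort(scores)[::-1][:batch_size]]
        # Send the chosen batch to the oracle for annotation.
        for i in chosen:
            labels[i] = oracle_label(pool_texts[i])
    return model, labels
```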
Transaction Description                              Category
"Get me dahi 1.5kg"                                  Grocery
  (Translation: Get me 1.5 kilograms of curd)
"Pick up 1 yellow coloured dress"                    Clothes
"3 plate chole bhature"                              Food
"2254/- pay krke samaan uthana hai"                  Package
  (Translation: Pay 2254/- and pick up the package)

Table 1: Instances of actual transaction descriptions used by our customers along with their corresponding categories.

We explored supervised category classification of transactions, which required labelled data for training. Our experiments with different approaches revealed that BERT-based models performed very well for the task. The training data used in this paper was labeled manually by subject matter experts. However, this was proving to be a very expensive exercise and hence necessitated the exploration of cost-effective strategies for collecting manually labelled training data.

The key contributions of our work are as follows:

• Active Learning with incoherent and code-mixed data: An Active Learning framework to reduce the cost of labelling data for multi-class text classification by 85% in an industry use-case.

• BERT for Active Learning in multi-class text classification: The first work, to the best of our knowledge, to explore and compare multiple advanced Active Learning strategies, like Discriminative Active Learning, using BERT for multi-class text classification on the publicly available TREC-6 and AG's News Corpus benchmark datasets.

2 Related Work

Active Learning has been widely studied and applied in a variety of tasks including classification (Novak et al., 2006; Tong and Koller, 2002), structured output learning (Roth and Small, 2006; Tomanek and Hahn, 2009; Haffari and Sarkar, 2009), clustering (Bodó et al., 2011) and so on.

2.1 Traditional Active Learning

Popular traditional Active Learning approaches include sampling based on model uncertainty using measures such as entropy (Lewis and Gale, 1994; Zhu et al., 2008), margin sampling (Scheffer et al., 2001) and least model confidence (Settles and Craven, 2008; Culotta and McCallum, 2005; Settles, 2009); a sketch of these measures follows below. Researchers have also explored sampling based on the diversity of instances (McCallum and Nigam, 1998; Settles and Craven, 2008; Xu et al., 2016; Dasgupta and Hsu, 2008; Wei et al., 2015) to prevent the selection of redundant instances for labelling. Baram et al. (2004), Hsu and Lin (2015) and Settles et al. (2008) explored hybrid strategies combining the advantages of uncertainty- and diversity-based sampling. Our work focuses on extending such strategies to Active Learning in a Deep Learning setting.
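For reference, the three uncertainty measures named above can all be computed from a classifier's softmax output. The sketch below is a generic illustration under our own conventions (each score is oriented so that higher means more uncertain); it is not the paper's code.

```python
import numpy as np

def uncertainty_scores(probs):
    """Classic uncertainty measures over (n_instances, n_classes) softmax outputs."""
    eps = 1e-12
    # Entropy: high when the predicted distribution is flat.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    # Margin: a small gap between the top two classes signals uncertainty,
    # so the gap is negated to keep "higher = more uncertain".
    sorted_probs = np.sort(probs, axis=1)
    margin = -(sorted_probs[:, -1] - sorted_probs[:, -2])
    # Least confidence: one minus the probability of the predicted class.
    least_confidence = 1.0 - probs.max(axis=1)
    return entropy, margin, least_confidence

# Example: a confident, an uncertain, and a maximally uncertain prediction.
p = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.35, 0.25],
              [1 / 3, 1 / 3, 1 / 3]])
print(uncertainty_scores(p))
```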
2.2 Active Learning for Deep Learning

Deep Learning models can be challenging to use in an Active Learning setting due to their tendency to rarely be uncertain during inference and their need for a relatively large amount of training data. However, Bayesian Deep Learning approaches (Gal and Ghahramani, 2016; Gal et al., 2017; Kirsch et al., 2019) have been leveraged to demonstrate high performance in an active learning setting for image classification. Similar to traditional Active Learning approaches, researchers explored diversity-based sampling (Sener and Savarese, 2018) and hybrid sampling (Zhdanov, 2019) to address the drawbacks of purely uncertainty-based approaches. Moreover, researchers also explored approaches based on Expected Model Change (Settles et al., 2007; Huang et al., 2016; Zhang et al., 2017; Yoo and Kweon, 2019), where the selected instances are expected to result in the greatest change to the current model parameter estimates when their labels are provided. Gissin and Shalev-Shwartz (2019) proposed "Discriminative Active Learning" (DAL), in which active learning for image classification was posed as a binary classification task: instances to label were chosen so that the set of labeled instances and the set of unlabeled instances became indistinguishable. However, there is minimal literature on applying these strategies to Natural Language Understanding (NLU). Our work focuses on extending such Deep Active Learning strategies to text classification using BERT-based models.
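The DAL selection step can be sketched as follows. This is a simplified illustration under our own assumptions: it uses a logistic-regression discriminator over fixed sentence representations, whereas Gissin and Shalev-Shwartz (2019) train a neural discriminator over learned features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dal_select(labeled_reprs, unlabeled_reprs, batch_size):
    """Discriminative Active Learning selection (simplified sketch).

    Train a binary classifier to separate labeled (class 0) from unlabeled
    (class 1) representations, then pick the unlabeled instances the
    discriminator is most confident are "unlabeled", i.e. the ones that
    look least like the current labeled set.
    """
    X = np.vstack([labeled_reprs, unlabeled_reprs])
    y = np.concatenate([np.zeros(len(labeled_reprs)),
                        np.ones(len(unlabeled_reprs))])
    disc = LogisticRegression(max_iter=1000).fit(X, y)
    p_unlabeled = disc.predict_proba(unlabeled_reprs)[:, 1]
    return np.argsort(p_unlabeled)[::-1][:batch_size]  # pool indices to label
```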
2.3 Deep Active Learning for Text Classification in NLP

Siddhant and Lipton (2018) conducted an empirical study of Active Learning in NLP and observed that Bayesian Active Learning outperformed classical uncertainty sampling across all settings. However, they do not explore the performance of BERT-based models. Prabhu et al. (2019) and Zhang (2019) explored Active Learning strategies extensible to BERT-based models. Prabhu et al. (2019) conducted an empirical study to demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices such as BERT. Zhang (2019) applied an ensemble of Active Learning strategies to BERT for the task of intent classification. However, neither of these works accounts for a comparison against advanced strategies like Discriminative Active Learning. Ein-Dor et al. (2020) explored Active Learning with multiple strategies using BERT-based models for binary text classification tasks. Our work is very closely related to theirs, with the difference that we focus on experiments with multi-class text classification.

3 Methodology

We consider multiple Active Learning strategies to understand their impact on the performance of BERT-based models. We attempt to reproduce settings similar to the study conducted by Ein-Dor et al. (2020) for binary text classification. The strategies are listed below; a sketch of the Core-Set selection step follows the list.

• Random: We sample instances at random in each iteration while leveraging them to retrain the model.

• Uncertainty (Entropy): Instances selected to retrain the model have the highest entropy in model predictions.

• Expected Gradient Length (EGL) (Huang et al., 2016): This strategy picks instances that are expected to have the largest gradient norm over all possible labelings of the instances.

• Deep Bayesian Active Learning (DBAL) (Gal et al., 2017): Instances are selected to improve the uncertainty measures of BERT with Monte Carlo dropout, using the max-entropy acquisition function.

• Core-Set (Sener and Savarese, 2018): This strategy selects instances that best cover the dataset in the learned representation space using the farthest-first traversal algorithm.

• Discriminative Active Learning (DAL) (Gissin and Shalev-Shwartz, 2019): This strategy aims to select instances that make the labeled set of instances indistinguishable from the unlabeled pool.
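As an illustration of Core-Set selection, a minimal farthest-first (greedy k-center) traversal over model representations might look as follows. This is our own simplified sketch, not the implementation used in the experiments.

```python
import numpy as np

def coreset_select(labeled_reprs, unlabeled_reprs, batch_size):
    """Greedy farthest-first (k-center) Core-Set selection (sketch).

    Repeatedly pick the unlabeled point whose distance to its nearest
    center (a labeled or already-selected point) is largest, so the
    chosen batch covers the representation space.
    """
    # Distance from each unlabeled point to its nearest labeled point.
    dists = np.min(np.linalg.norm(
        unlabeled_reprs[:, None, :] - labeled_reprs[None, :, :], axis=-1),
        axis=1)
    selected = []
    for _ in range(batch_size):
        idx = int(np.argmax(dists))  # farthest from all current centers
        selected.append(idx)
        # Treat the new point as a center and refresh nearest distances.
        new_d = np.linalg.norm(unlabeled_reprs - unlabeled_reprs[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected
```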
4 Experiments

4.1 Data

Internal Dataset: Our internal dataset comprises 55,038 customer transactions, each with an associated transaction description provided by the customer. The instances were manually annotated by a team of SMEs and mapped to one of 10 pre-defined categories. The dataset is split into 44,030 training samples and 11,008 validation samples. The categories considered are: {'Food', 'Grocery', 'Package', 'Medicines', 'Household Items', 'Cigarettes', 'Clothes', 'Electronics', 'Keys', 'Documents/Books'}.

TREC-6 Dataset: To verify the reproducibility of our observations, we consider the TREC dataset (Li and Roth, 2002), which comprises 5,452 training samples and 500 validation samples with 6 classes.

AG's News Corpus: To verify the reproducibility of our observations, we also consider the AG's corpus of news articles (Zhang et al., 2015), which comprises 120,000 training samples and 7,600 validation samples with 4 classes.

4.2 Experiment Setting

We leverage DistilBERT (Sanh et al., 2020) for the experiments in this paper. We run each strategy with two random seed settings on the TREC-6 Dataset and AG's News Corpus, and with one random seed setting on our Internal Dataset. The results presented here consist of 558 fine-tuning experiments (78 for our Internal Dataset and 240 each for the TREC-6 Dataset and AG's News Corpus). The experiments were run on AWS SageMaker p3.2xlarge Spot Instances. The implementation was based on the code² made available by Gissin and Shalev-Shwartz (2019). Table 2 shows the batch size and number of iterations used for each dataset.

Dataset             Batch Size    # Iterations
Internal Dataset    500           13
TREC-6 Dataset      100           20
AG's News Corpus    100           20

Table 2: Training details for the Internal Dataset, TREC-6 Dataset and AG's News Corpus.

In each iteration, a batch of samples is identified and the model is retrained for 1 epoch with a learning rate of 3 × 10⁻⁵. We retrain the model from scratch in each iteration to prevent overfitting (Siddhant and Lipton, 2018; Hu et al., 2019). In other words, a new model is constructed in each iteration without any dependency on the model learnt in the previous iteration.

² https://github.com/dsgissin/DiscriminativeActiveLearning
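A minimal sketch of this retrain-from-scratch step is shown below, assuming the Hugging Face transformers library and the hyperparameters above (1 epoch, learning rate 3 × 10⁻⁵). The batch size, maximum sequence length and other details here are our assumptions, not the paper's exact configuration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def retrain_from_scratch(texts, labels, num_labels, device="cuda"):
    """Fine-tune a fresh DistilBERT on the current labeled pool (sketch).

    Reloading the pretrained checkpoint every iteration ensures there is
    no dependency on the previous iteration's fine-tuned weights.
    """
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels).to(device)

    enc = tokenizer(texts, truncation=True, padding=True, max_length=64,
                    return_tensors="pt")  # max_length=64 is an assumption
    dataset = list(zip(enc["input_ids"], enc["attention_mask"],
                       torch.tensor(labels)))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
    model.train()
    for input_ids, attention_mask, y in loader:  # single epoch, per the paper
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=y.to(device))
        out.loss.backward()
        optimizer.step()
    return model
```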
5 Results

We report results for the Active Learning strategies using the F1-score as the classification metric. Figure 1 visualizes the performance of the different approaches.

[Figure 1: Comparison of F1 scores of various Active Learning strategies on the Internal Dataset (BatchSize=500, Iterations=13), the TREC-6 Dataset (BatchSize=100, Iterations=20) and AG's News Corpus (BatchSize=100, Iterations=20).]

We observe that the best F1-score that Active Learning using BERT achieves on the Internal Dataset is 0.83. A fully supervised setting with DistilBERT using the entire training data also achieved an F1-score of 0.83. Thus, we reduce the labelling cost by 85% while achieving a similar F1-score. However, we observe that no single strategy significantly outperforms the others. Lowell et al. (2019), who studied Active Learning for text classification and sequence tagging in non-BERT models, demonstrated the brittleness and inconsistency of Active Learning results; Ein-Dor et al. (2020) also observed inconsistent results using BERT in binary classification. Moreover, we observe that the performance improvements over the Random Sampling strategy are also inconsistent. Concretely, on the Internal Dataset and TREC-6, we observe that DBAL and Core-Set perform poorly compared to Random Sampling, whereas on AG's News Corpus we observe the contradictory behaviour that both DBAL and Core-Set outperform Random Sampling. The Uncertainty sampling strategy performs very similarly to Random Sampling. EGL performs consistently well across all three datasets. The DAL strategy, proven to have worked well in image classification, is consistent across the datasets; however, it does not outperform EGL. A deeper analysis of these observations will be conducted in future work.

6 Conclusion

In this paper, we explored multiple Active Learning strategies using BERT. Our goal was to understand whether BERT-based models can prove effective in an Active Learning setting for multi-class text classification. We observed that BERT-based models can be trained cost effectively with less data using Active Learning strategies. We also observed that none of the strategies significantly outperformed the others, while EGL performed reasonably well across datasets. In future work, we plan to investigate the results of the paper in more detail and also explore ensembling approaches that combine the advantages of each strategy to achieve further performance improvements.
References

Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online choice of active learning algorithms. Journal of Machine Learning Research, 5:255–291.

Zalán Bodó, Zsolt Minier, and Lehel Csató. 2011. Active learning with clustering. In Active Learning and Experimental Design workshop in conjunction with AISTATS 2010, volume 16 of Proceedings of Machine Learning Research, pages 127–139, Sardinia, Italy. JMLR Workshop and Conference Proceedings.

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, AAAI'05, pages 746–751. AAAI Press.

Sanjoy Dasgupta and Daniel Hsu. 2008. Hierarchical sampling for active learning. In ICML'08, pages 208–215, New York, NY, USA. Association for Computing Machinery.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, and Noam Slonim. 2020. Active learning for BERT: An empirical study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. Bayesian convolutional neural networks with Bernoulli approximate variational inference.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. 2017. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1183–1192. PMLR.

Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning.

Gholamreza Haffari and Anoop Sarkar. 2009. Active learning for multilingual statistical machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 181–189, Suntec, Singapore. Association for Computational Linguistics.

Wei-Ning Hsu and Hsuan-Tien Lin. 2015. Active learning by learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2659–2665. AAAI Press.

Peiyun Hu, Zachary C. Lipton, Anima Anandkumar, and Deva Ramanan. 2019. Active learning with partial feedback.

Jiaji Huang, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, and Adam Coates. 2016. Active learning for speech recognition: the power of gradients.

Andreas Kirsch, Joost R. van Amersfoort, and Yarin Gal. 2019. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. In NeurIPS.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. CoRR, abs/cmp-lg/9407020.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING '02, pages 1–7, USA. Association for Computational Linguistics.

Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, MIT.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692. Version 1.

David Lowell, Zachary Chase Lipton, and Byron C. Wallace. 2019. Practical obstacles to deploying active learning. In EMNLP/IJCNLP.

Andrew McCallum and Kamal Nigam. 1998. Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 350–358, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dheeraj Mekala and Jingbo Shang. 2020. Contextualized weak supervision for text classification. In Association for Computational Linguistics, ACL'20.

Robert (Munro) Monarch. 2019. Human-in-the-Loop Machine Learning. Manning Publications.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. 2013. Learning with noisy labels. In Neural Information Processing Systems, NIPS'13.

Blaž Novak, Dunja Mladenič, and Marko Grobelnik. 2006. Text classification with active learning. In From Data and Information Analysis to Knowledge Engineering, pages 398–405, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ameya Prabhu, Charles Dognin, and Maneesh Singh. 2019. Sampling bias in deep active classification: An empirical study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4058–4068, Hong Kong, China. Association for Computational Linguistics.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment, VLDB'17.

Dan Roth and Kevin Small. 2006. Margin-based active learning for structured output spaces. In Machine Learning: ECML 2006, pages 413–424, Berlin, Heidelberg. Springer Berlin Heidelberg.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active hidden Markov models for information extraction.

Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In EMNLP '08, pages 1070–1079, USA. Association for Computational Linguistics.

Burr Settles, Mark Craven, and Soumya Ray. 2007. Multiple-instance active learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 1289–1296, Red Hook, NY, USA. Curran Associates Inc.

Burr Settles, Mark Craven, and Soumya Ray. 2008. Multiple-instance active learning. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.

Aditya Siddhant and Zachary C. Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2904–2909, Brussels, Belgium. Association for Computational Linguistics.

Katrin Tomanek and Udo Hahn. 2009. Reducing class imbalance during active learning for named entity annotation. In Proceedings of the Fifth International Conference on Knowledge Capture, K-CAP '09, pages 105–112, New York, NY, USA. Association for Computing Machinery.

Simon Tong and Daphne Koller. 2002. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Empirical Methods in Natural Language Processing.

Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2015. Submodularity in data subset selection and active learning. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 1954–1963. JMLR.org.

Zuobing Xu, Ram Akella, and Yi Zhang. 2016. Incorporating diversity and density in active learning for relevance feedback.

Donggeun Yoo and In So Kweon. 2019. Learning loss for active learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 93–102.

L. Zhang. 2019. An ensemble deep active learning method for intent classification. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 649–657, Cambridge, MA, USA. MIT Press.

Ye Zhang, Matthew Lease, and Byron C. Wallace. 2017. Active discriminative text representation learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 3386–3392. AAAI Press.

Fedor Zhdanov. 2019. Diverse mini-batch active learning.

Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING'08, pages 1137–1144, USA. Association for Computational Linguistics.