Multi-class Text Classification using BERT-based Active Learning


Sumanth Prabhu†, Moosa Mohamed†, Hemant Misra
Applied Research, Swiggy

Abstract

Text Classification finds interesting applications in the pickup and delivery services industry, where customers require one or more items to be picked up from a location and delivered to a certain destination. Classifying these customer transactions into multiple categories helps understand the market needs for different customer segments. Each transaction is accompanied by a text description provided by the customer to describe the products being picked up and delivered, which can be used to classify the transaction. BERT-based models have proven to perform well in Natural Language Understanding. However, the product descriptions provided by the customers tend to be short, incoherent and code-mixed (Hindi-English) text, which demands fine-tuning of such models with manually labelled data to achieve high accuracy. Collecting this labelled data can prove to be expensive. In this paper, we explore Active Learning strategies to label transaction descriptions cost effectively while using BERT to train a transaction classification model. On TREC-6, AG's News Corpus and an internal dataset, we benchmark the performance of BERT across different Active Learning strategies in Multi-Class Text Classification.

1  Introduction

Most of the focus of the machine learning community is on creating better algorithms for learning from data. However, getting useful annotated datasets can prove to be difficult. In many scenarios, we may have access to a large unannotated corpus while it would be infeasible to annotate all instances (Settles, 2009). Thus, weaker forms of supervision (Liang, 2005; Natarajan et al., 2013; Ratner et al., 2017; Mekala and Shang, 2020) were explored to label data cost effectively. In this paper, we focus on Active Learning, a popular technique under Human-in-the-Loop Machine Learning (Monarch, 2019). Active Learning overcomes the labeling bottleneck by choosing a subset of instances from a pool of unlabeled instances to be labeled by the human annotators (a.k.a. the oracle). The goal of identifying this subset is to achieve high accuracy while having labeled as few instances as possible.

BERT-based (Devlin et al., 2019; Sanh et al., 2020; Liu et al., 2019) Deep Learning models have been proven to achieve state-of-the-art results on GLUE (Wang et al., 2018), RACE (Lai et al., 2017) and SQuAD (Rajpurkar et al., 2016, 2018). Active Learning has been successfully integrated with various Deep Learning approaches (Gal and Ghahramani, 2016; Gissin and Shalev-Shwartz, 2019). However, the use of Active Learning with BERT-based models for multi-class text classification has not been studied extensively.

In this paper, we consider an industry text classification use-case specific to pickup and delivery services, where customers use short text to describe the products to be picked up from a certain location and dropped at a target location. Table 1 shows a few examples of the descriptions used by our customers to describe their transactions. Customers tend to describe the products in a transaction using short, incoherent and code-mixed (using more than one language in the same message¹) text. These descriptions, if mapped to a fixed set of possible categories, assist critical business decisions such as demographic-driven prioritization of categories, launch of new product categories and so on. Furthermore, a transaction may comprise multiple products, which adds to the complexity of the task. In this work, we focus on multi-class classification of transactions, where a single majority category drives the transaction.

† Equal contribution.
¹ In our case, Hindi written in Roman script is mixed with English.
Transaction Description                              Category
"Get me dahi 1.5kg"                                  Grocery
  Translation: Get me 1.5 kilograms of curd
"Pick up 1 yellow coloured                           Clothes
"3 plate chole bhature"                              Food
"2254/- pay krke samaan uthana                       Package
  Translation: Pay 2254/- and pick up the package

Table 1: Instances of actual transaction descriptions used by our customers along with their corresponding categories.

We explored supervised category classification of transactions, which required labelled data for training. Our experiments with different approaches revealed that BERT-based models performed really well for the task. The training data used in this paper was labeled manually by subject matter experts. However, this was proving to be a very expensive exercise and hence necessitated exploration of cost-effective strategies for collecting manually labelled training data.

The key contributions of our work are as follows:

    • Active Learning with incoherent and code-mixed data: An Active Learning framework to reduce the cost of labelling data for multi-class text classification by 85% in an industry use-case.

    • BERT for Active Learning in multi-class text classification: The first work, to the best of our knowledge, to explore and compare multiple advanced strategies in Active Learning like Discriminative Active Learning using BERT for multi-class text classification on the publicly available TREC-6 and AG's News Corpus benchmark datasets.

2  Related Work

Active Learning has been widely studied and applied in a variety of tasks including classification (Novak et al., 2006; Tong and Koller, 2002), structured output learning (Roth and Small, 2006; Tomanek and Hahn, 2009; Haffari and Sarkar, 2009), clustering (Bodó et al., 2011) and so on.

2.1  Traditional Active Learning

Popular traditional Active Learning approaches include sampling based on model uncertainty using measures such as entropy (Lewis and Gale, 1994; Zhu et al., 2008), margin sampling (Scheffer et al., 2001) and least model confidence (Settles and Craven, 2008; Culotta and McCallum, 2005; Settles, 2009). Researchers also explored sampling based on diversity of instances (McCallum and Nigam, 1998; Settles and Craven, 2008; Xu et al., 2016; Dasgupta and Hsu, 2008; Wei et al., 2015) to prevent selection of redundant instances for labelling. Baram et al. (2004), Hsu and Lin (2015) and Settles et al. (2008) explored hybrid strategies combining the advantages of uncertainty-based and diversity-based sampling. Our work focuses on extending such strategies to Active Learning in a Deep Learning setting.

2.2  Active Learning for Deep Learning

Using Deep Learning models in an Active Learning setting can be challenging because such models are rarely uncertain during inference and require a relatively large amount of training data. However, Bayesian Deep Learning based approaches (Gal and Ghahramani, 2016; Gal et al., 2017; Kirsch et al., 2019) were leveraged to demonstrate high performance in an active learning setting for image classification. Similar to Traditional Active Learning approaches, researchers explored diversity-based sampling (Sener and Savarese, 2018) and hybrid sampling (Zhdanov, 2019) to address the drawbacks of approaches based purely on model uncertainty. Moreover, researchers also explored approaches based on Expected Model Change (Settles et al., 2007; Huang et al., 2016; Zhang et al., 2017; Yoo and Kweon, 2019), where the selected instances are expected to result in the greatest change to the current model parameter estimates when their labels are provided. Gissin and Shalev-Shwartz (2019) proposed "Discriminative Active Learning (DAL)", where active learning in Image Classification was posed as a binary classification task in which instances to label were chosen such that the set of labeled instances and the set of unlabeled instances became indistinguishable. However, there is minimal literature on applying these strategies to Natural Language Understanding (NLU). Our work focuses on extending such Deep Active Learning strategies to Text Classification using BERT-based models.
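The uncertainty measures listed in Section 2.1 (entropy, margin and least confidence) can be illustrated with a minimal Python sketch. The function names and the toy probabilities below are hypothetical and are not taken from the paper's implementation; the sketch only assumes that a classifier yields one softmax distribution per unlabeled instance.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs: Sequence[float]) -> float:
    """Gap between the two most probable classes; a small gap means high uncertainty."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def least_confidence(probs: Sequence[float]) -> float:
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_most_uncertain(pool_probs: List[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the batch_size pool instances with the highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical softmax outputs for three unlabeled instances over three classes.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform prediction
    [0.60, 0.30, 0.10],  # in between
]
chosen = select_most_uncertain(pool, batch_size=2)
print(chosen)
```

In this toy pool the nearly uniform prediction ranks first and the confident one is left unlabeled, which is the behaviour uncertainty sampling aims for. Margin and least confidence could be swapped in as the ranking key, with margin sorted in ascending rather than descending order.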
2.3  Deep Active Learning for Text Classification in NLP

Siddhant and Lipton (2018) conducted an empirical study of Active Learning in NLP and observed that Bayesian Active Learning outperformed classical uncertainty sampling across all settings. However, they do not explore the performance of BERT-based models. Prabhu et al. (2019) and Zhang (2019) explored Active Learning strategies extensible to BERT-based models. Prabhu et al. (2019) conducted an empirical study to demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices such as BERT. Zhang (2019) applied an ensemble of Active Learning strategies to BERT for the task of intent classification. However, neither work compares against advanced strategies like Discriminative Active Learning. Ein-Dor et al. (2020) explored Active Learning with multiple strategies using BERT-based models for binary text classification tasks. Our work is very closely related to theirs, with the difference that we focus on experiments with multi-class text classification.

3  Methodology

We consider multiple Active Learning strategies to understand their impact on performance when using BERT-based models. We attempt to reproduce settings similar to the study conducted by Ein-Dor et al. (2020) for binary text classification.

    • Random: We sample instances at random in each iteration and use them to retrain the model.

    • Uncertainty (Entropy): Instances selected to retrain the model have the highest entropy in model predictions.

    • Expected Gradient Length (EGL) (Huang et al., 2016): This strategy picks instances that are expected to have the largest gradient norm over all possible labelings of the instances.

    • Deep Bayesian Active Learning (DBAL) (Gal et al., 2017): Instances are selected to improve the uncertainty measures of BERT with Monte Carlo dropout using the max-entropy acquisition function.

    • Core-Set (Sener and Savarese, 2018): Selects instances that best cover the dataset in the learned representation space using the farthest-first traversal algorithm.

    • Discriminative Active Learning (DAL) (Gissin and Shalev-Shwartz, 2019): This strategy aims to select instances that make the labeled set of instances indistinguishable from the unlabeled pool.

4  Experiments

4.1  Data

Internal Dataset: Our internal dataset comprises 55,038 customer transactions, each of which had an associated transaction description provided by the customer. The instances were manually annotated by a team of subject matter experts (SMEs) and mapped to one of 10 pre-defined categories. This dataset is split into 44,030 training samples and 11,008 validation samples. The categories considered are as follows:
   {'Food', 'Grocery', 'Package', 'Medicines', 'Household Items', 'Cigarettes', 'Clothes', 'Electronics', 'Keys', 'Documents/Books'}

TREC-6 Dataset: To verify the reproducibility of our observations, we consider the TREC dataset (Li and Roth, 2002), which comprises 5,452 training samples and 500 validation samples with 6 classes.

AG's News Corpus: To verify the reproducibility of our observations, we also consider the AG's corpus of news articles (Zhang et al., 2015), which comprises 120,000 training samples and 7,600 validation samples with 4 classes.

4.2  Experiment Setting

We leverage DistilBERT (Sanh et al., 2020) for the experiments in this paper. We run each strategy for two random seed settings on the TREC-6 Dataset and AG's News Corpus and one random seed setting for our Internal Dataset. The results presented here consist of 558 fine-tuning experiments (78 for our Internal Dataset and 240 each for the TREC-6 Dataset and AG's News Corpus). The experiments were run on AWS SageMaker's p3.2xlarge Spot Instances. The implementation was based on the code made available by Gissin and Shalev-Shwartz (2019). Table 2 shows the batch size and number of iterations used for the experiments. In each iteration, a batch of samples is identified and the model is retrained for 1 epoch with a learning rate of 3 × 10⁻⁵. We retrain the models from scratch in each iteration to prevent overfitting (Siddhant and Lipton, 2018; Hu et al., 2019). In other words, a new model is constructed in each iteration without dependency on the model learnt in the previous iteration.

    Dataset             Batch Size    # Iterations
    Internal Dataset    500           13
    TREC-6 Dataset      100           20
    AG's News Corpus    100           20

Table 2: Training details for the Internal Dataset, TREC-6 Dataset and AG's News Corpus

Figure 1: Comparison of F1 scores of various Active Learning strategies on the Internal Dataset (BatchSize=500, Iterations=13), TREC-6 Dataset (BatchSize=100, Iterations=20) and AG's News Corpus (BatchSize=100, Iterations=20)

5  Results

We report results for the Active Learning strategies using F1-score as the classification metric. Figure 1 visualizes the performance of the different approaches. We observe that the best F1-score that Active Learning using BERT achieves on the Internal Dataset is 0.83. A fully supervised setting with DistilBERT using the entire training data also achieved an F1-score of 0.83. Thus, we reduce the labelling cost by 85% while achieving similar F1-scores. However, we observe that no single strategy significantly outperforms the others. Lowell et al. (2019) studied Active Learning for text classification and sequence tagging in non-BERT models and demonstrated the brittleness and inconsistency of Active Learning results. Ein-Dor et al. (2020) also observed inconsistency of results using BERT in binary classification. Moreover, we observe that the performance improvements over the Random Sampling strategy are also inconsistent. Concretely, on the Internal Dataset and TREC-6, we observe that DBAL and Core-Set perform poorly when compared to Random Sampling. However, on the AG's News Corpus, we observe the opposite behaviour, where both DBAL and Core-Set outperform the Random Sampling strategy. The Uncertainty sampling strategy performs very similarly to the Random Sampling strategy. EGL performs consistently well across all three datasets. The DAL strategy, proven to have worked well in image classification, is consistent across the datasets. However, it does not outperform the EGL strategy. A deeper analysis of these observations will be conducted in future work.

6  Conclusion

In this paper, we explored multiple Active Learning strategies using BERT. Our goal was to understand whether BERT-based models can prove effective in an Active Learning setting for multi-class text classification. We observed that BERT-based models can be trained cost effectively with less data using Active Learning strategies. We also observed that none of the strategies significantly outperformed the others, while EGL performs reasonably well across datasets. In future work, we plan to investigate the results of the paper in detail and also explore ensembling approaches that combine the advantages of each strategy to achieve extensive performance improvements.
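The 85% labelling reduction reported in Section 5 and the experiment counts in Section 4.2 can be cross-checked with a short calculation. The sketch below rests on two assumptions that the paper does not state explicitly: the labelled pool grows by exactly one batch per iteration with no separate seed set, and each of the six strategies is fine-tuned once per iteration and per random seed. All function names are hypothetical.

```python
def labels_requested(batch_size: int, iterations: int) -> int:
    """Total labels requested if every iteration adds one batch (assumed: no seed set)."""
    return batch_size * iterations


def labelling_reduction(batch_size: int, iterations: int, train_size: int) -> float:
    """Fraction of the training pool that never needs a label."""
    return 1.0 - labels_requested(batch_size, iterations) / train_size


def fine_tuning_runs(strategies: int, iterations: int, seeds: int) -> int:
    """One fine-tuning run per strategy, iteration and seed (assumed)."""
    return strategies * iterations * seeds


# Batch sizes and iteration counts from Table 2, training-set size from Section 4.1.
internal_reduction = labelling_reduction(500, 13, 44_030)
total_runs = (
    fine_tuning_runs(6, 13, 1)    # Internal Dataset, one seed
    + fine_tuning_runs(6, 20, 2)  # TREC-6 Dataset, two seeds
    + fine_tuning_runs(6, 20, 2)  # AG's News Corpus, two seeds
)
print(f"{internal_reduction:.1%}", total_runs)
```

Under these assumptions, 6,500 of the 44,030 training instances are labelled, a reduction of about 85.2%, and the run count comes to 78 + 240 + 240 = 558, which agrees with the figures reported above.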

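The farthest-first traversal used by the Core-Set strategy in Section 3 can be sketched as a greedy k-center selection. This is a simplified illustration, not the implementation used in the experiments: the embeddings are hypothetical 2-D points standing in for learned representations, the already labeled instances are assumed to act as the initial centers, and the function name is invented for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def farthest_first(pool: List[Vector], labeled: List[Vector], batch_size: int) -> List[int]:
    """Greedily pick pool indices that are farthest from every center chosen so far."""
    if not labeled:
        raise ValueError("need at least one labeled embedding as an initial center")
    # Distance from each pool point to its nearest labeled point.
    min_dist = [min(math.dist(x, c) for c in labeled) for x in pool]
    selected: List[int] = []
    for _ in range(min(batch_size, len(pool))):
        idx = max(range(len(pool)), key=min_dist.__getitem__)
        selected.append(idx)
        # The newly chosen point may now be the nearest center for other pool points.
        for i, x in enumerate(pool):
            min_dist[i] = min(min_dist[i], math.dist(x, pool[idx]))
        # Exclude the chosen point from later rounds.
        min_dist[idx] = -1.0
    return selected


# Hypothetical embeddings: one labeled point at the origin and four unlabeled points.
labeled_points = [(0.0, 0.0)]
pool_points = [(1.0, 0.0), (5.0, 0.0), (5.5, 0.0), (0.0, 4.0)]
picked = farthest_first(pool_points, labeled_points, batch_size=2)
print(picked)
```

In this toy example the second pick skips the point at (5.0, 0.0) because it lies close to the first pick, which illustrates how the strategy favours coverage of the representation space over redundant neighbours.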