Accelerating BERT Inference for Sequence Labeling via Early-Exit
Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu∗, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
{lixn20, yfshao19, txsun19, hyan19, xpqiu, xjhuang}@fudan.edu.cn
∗ Corresponding author.

Abstract

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is expensive. To alleviate this problem, we extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks. However, existing early-exit mechanisms are specifically designed for sequence-level tasks, rather than sequence labeling. In this paper, we first propose SENTEE: a simple extension of SENTence-level Early-Exit for sequence labeling tasks. To further reduce the computational cost, we also propose TOKEE: a TOKen-level Early-Exit mechanism that allows partial tokens to exit early at different layers. Considering the local dependency inherent in sequence labeling, we employ a window-based criterion to decide whether or not a token should exit. The token-level early-exit introduces a gap between training and inference, so we add an extra self-sampling fine-tuning stage to alleviate it. Extensive experiments on three popular sequence labeling tasks show that our approach can save up to 66%∼75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under the same speed-up ratios of 2×, 3×, and 4×. Our implementation is publicly available at https://github.com/LeeSureman/Sequence-Labeling-Early-Exit.

1 Introduction

Sequence labeling plays an important role in natural language processing (NLP). Many NLP tasks can be converted to sequence labeling tasks, such as named entity recognition, part-of-speech tagging, Chinese word segmentation, and semantic role labeling. These tasks are usually fundamental and highly time-demanding; therefore, apart from performance, their inference efficiency is also very important.

The past few years have witnessed the prevalence of pre-trained models (PTMs) (Qiu et al., 2020) on various sequence labeling tasks (Nguyen et al., 2020; Ke et al., 2020; Tian et al., 2020; Mengge et al., 2020). Despite their significant improvements on sequence labeling, they are notorious for their enormous computational cost and slow inference speed, which hinders their utility in real-time or mobile-device scenarios.

Recently, the early-exit mechanism (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) has been introduced to accelerate inference for large-scale PTMs. In these methods, each layer of the PTM is coupled with a classifier to predict the label for a given instance. At inference, if the prediction is confident enough (in this paper, a confident prediction is one whose uncertainty is low) at an earlier layer, the instance is allowed to exit without passing through the entire model. Figure 1(a) gives an illustration of the early-exit mechanism for text classification. However, most existing early-exit methods target sequence-level prediction, such as text classification, in which the prediction and its confidence score are calculated over a whole sequence. Therefore, these methods cannot be directly applied to sequence labeling tasks, where the prediction is token-level and a confidence score is required for each token.

In this paper, we aim to extend the early-exit mechanism to sequence labeling tasks. First, we propose SENTence-level Early-Exit (SENTEE), a simple extension of existing early-exit methods. SENTEE allows a sequence of tokens to exit together once the maximum uncertainty
of the tokens is below a threshold. Despite its effectiveness, we find it redundant for most tokens to update their representations at every layer. Thus, we propose a TOKen-level Early-Exit (TOKEE) that allows the tokens with confident predictions to exit earlier. Figure 1(b) and 1(c) illustrate our proposed SENTEE and TOKEE. Considering the local dependency inherent in sequence labeling tasks, we decide whether a token can exit based on the uncertainty of a window of its context instead of the token alone. For tokens that have already exited, we do not update their representations but simply copy them to the upper layers. However, this introduces a train-inference discrepancy. To tackle this problem, we introduce an additional fine-tuning stage that samples each token's halting layer based on its uncertainty and copies its representation to the upper layers during training. We conduct extensive experiments on three sequence labeling tasks: NER, POS tagging, and CWS. Experimental results show that our approach can save up to 66%∼75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under speed-up ratios of 2×, 3×, and 4×.

[Figure 1: Early-exit for text classification and sequence labeling: (a) Early-Exit for Text Classification; (b) Sentence-Level Early-Exit for Sequence Labeling; (c) Token-Level Early-Exit for Sequence Labeling. The darker a hidden state block, the deeper the layer it comes from. Due to the space limit, the window-based uncertainty is not reflected.]

2 BERT for Sequence Labeling

Recently, PTMs (Qiu et al., 2020) have become the mainstream backbone models for various sequence labeling tasks. The typical framework consists of a backbone encoder and a task-specific decoder.

Encoder. In this paper, we use BERT (Devlin et al., 2019) as our backbone encoder. The architecture of BERT consists of multiple stacked Transformer layers (Vaswani et al., 2017). Given a sequence of tokens x_1, ..., x_N, the hidden states of the l-th Transformer layer are denoted by H^(l) = [h_1^(l), ..., h_N^(l)], and H^(0) is the BERT input embedding.

Decoder. Usually, we can predict the label of each token from the hidden states of the top layer. The probability of the labels is predicted by

    P = f(W H^(L)) ∈ R^(N×C),    (1)

where N is the sequence length, C is the number of labels, L is the number of BERT layers, W is a learnable matrix, and f(·) is a simple softmax classifier or a conditional random field (CRF) (Lafferty et al., 2001). Since we focus on inference acceleration and PTMs perform well enough on sequence labeling without a CRF (Devlin et al., 2019), we do not consider using such a recurrent structure.
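To make this framework concrete, the following minimal PyTorch sketch instantiates Eq. (1) with the softmax decoder, assuming the Hugging Face transformers API; the class and variable names are ours for illustration, not the authors' released code.

```python
import torch
from torch import nn
from transformers import BertModel

class BertTagger(nn.Module):
    """BERT encoder plus a linear token-level classifier (Eq. 1, softmax decoder)."""
    def __init__(self, num_labels: int, model_name: str = "bert-base-cased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # H^(L): hidden states of the top layer, shape (batch, N, d)
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # P = softmax(W H^(L)), shape (batch, N, C)
        return torch.softmax(self.classifier(hidden), dim=-1)
```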
3 Early-Exit for Sequence Labeling

The inference speed and computational cost of PTMs are crucial bottlenecks that hinder their application in many real-world scenarios. In many tasks, the representations at an earlier layer of a PTM are usually adequate to make a correct prediction. Therefore, early-exit mechanisms (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) have been proposed to dynamically stop inference
on the backbone model and make predictions with the intermediate representations.

However, these existing early-exit mechanisms are built for sentence-level prediction and are unsuitable for the token-level prediction required in sequence labeling tasks. In this section, we propose two early-exit mechanisms to accelerate inference for sequence labeling.

3.1 Token-Level Off-Ramps

To extend early-exit to sequence labeling, we couple each layer of the PTM with a token-level off-ramp, which can be simply implemented as a linear classifier. Once the off-ramps are trained with the golden labels, an instance has a chance to be predicted and exit at an earlier layer instead of passing through the entire model.

Given a sequence of tokens X = x_1, ..., x_N, we can make predictions with the injected off-ramps at each layer. For the off-ramp at the l-th layer, the label distributions of all tokens are predicted by

    P^(l) = f^(l)(X; θ)              (2)
          = softmax(W H^(l)),        (3)

where W is a learnable matrix and f^(l) is the token-level off-ramp at the l-th layer. P^(l) = [p_1^(l), ..., p_N^(l)], with p_n^(l) ∈ R^C, denotes the predicted label distribution of each token at the l-th off-ramp.

Uncertainty of the Off-Ramp. With the prediction for each token at hand, we can calculate the uncertainty of each token as

    u_n^(l) = -(p_n^(l) · log p_n^(l)) / log C,    (4)

where p_n^(l) is the label probability distribution of the n-th token.
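As a hedged illustration of Eq. (4), the normalized entropy can be computed from an off-ramp's output distribution as follows (a PyTorch sketch; the function name and the small epsilon for numerical stability are our additions).

```python
import math
import torch

def token_uncertainty(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Eq. (4): entropy of each token's label distribution, normalized by log C.

    probs: off-ramp output P^(l) of shape (batch, N, C).
    Returns (batch, N) uncertainties in [0, 1].
    """
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    return entropy / math.log(probs.size(-1))
```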
3.2 Early-Exit Strategies

In the following sections, we introduce two early-exit strategies for sequence labeling, at the sentence level and at the token level.

3.2.1 SENTEE: Sentence-Level Early-Exit

Sentence-Level Early-Exit (SENTEE) is a simple extension of existing early-exit approaches to sequence labeling tasks. SENTEE allows a sequence of tokens to exit together if their uncertainty is low enough. Therefore, SENTEE aggregates the per-token uncertainties into an overall uncertainty for the whole sequence. Here we use a straightforward but effective method, i.e., max-pooling over the uncertainties of all tokens:

    u^(l) = max{u_1^(l), ..., u_N^(l)},    (5)

where u^(l) represents the uncertainty of the whole sentence. If u^(l) < δ, where δ is a pre-defined threshold, we let the sentence exit at layer l. The intuition is that the whole sequence should exit only when the model is confident in its prediction for the most difficult token. (We also tried average-pooling, but it brings a drastic performance drop: the average uncertainty over the sequence is often overwhelmed by the many easy tokens, which causes many wrong exits of difficult tokens.)

3.2.2 TOKEE: Token-Level Early-Exit

Despite the effectiveness of SENTEE (see Table 1), we find it redundant for most simple tokens to be fed into the deep layers. Simple tokens that have already been correctly predicted at a shallow layer cannot exit under SENTEE, because the uncertainty of a small number of difficult tokens is still above the threshold. Thus, to further accelerate inference for sequence labeling tasks, we propose a token-level early-exit (TOKEE) method that allows simple tokens with confident predictions to exit early.

Window-Based Uncertainty. A prevalent problem in sequence labeling tasks is local dependency (or label dependency): the label of a token heavily depends on the tokens around it. Therefore, the uncertainty of a given token should be calculated not only from the token itself but also from its context. Motivated by this, we propose a window-based uncertainty criterion to decide whether or not a token should exit at the current layer. In particular, the uncertainty of token x_n at the l-th layer is defined as

    u'_n^(l) = max{u_{n-k}^(l), ..., u_{n+k}^(l)},    (6)

where k is a pre-defined window size. We then use u'_n^(l), instead of u_n^(l), to decide whether the n-th token can exit at layer l. Note that window-based uncertainty is equivalent to sentence-level uncertainty when k equals the sentence length.
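The two exit criteria of Eqs. (5) and (6) can be sketched as below (our illustrative PyTorch code, not the released implementation; it assumes padded batches with a 0/1 token mask and relies on max-pooling's implicit negative-infinity padding for the window maximum).

```python
import torch
import torch.nn.functional as F

def sentence_exit(uncert: torch.Tensor, mask: torch.Tensor, delta: float) -> torch.Tensor:
    """SENTEE (Eq. 5): a sentence exits when its max token uncertainty is below delta.

    uncert: (batch, N) token uncertainties; mask: (batch, N), 1 for real tokens.
    Returns a (batch,) boolean exit decision per sentence.
    """
    masked = uncert.masked_fill(mask == 0, 0.0)   # padding never blocks an exit
    return masked.max(dim=-1).values < delta

def window_uncertainty(uncert: torch.Tensor, k: int) -> torch.Tensor:
    """TOKEE (Eq. 6): max uncertainty over a window of radius k around each token."""
    return F.max_pool1d(uncert.unsqueeze(1), kernel_size=2 * k + 1,
                        stride=1, padding=k).squeeze(1)

def token_exit(uncert: torch.Tensor, k: int, delta: float) -> torch.Tensor:
    """A token exits at the current layer when its window-based uncertainty is below delta."""
    return window_uncertainty(uncert, k) < delta
```

With k = 0 the criterion reduces to token-independent uncertainty, and a very large k approaches SENTEE's sentence-level criterion, matching the limiting cases discussed above.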
Halt-and-Copy. For tokens that have exited, their representations are not updated in the upper layers; the hidden states of exited tokens are directly copied to the upper layers. (For English sequence labeling, we use first-pooling to get the representation of a word; if a word exits, we halt-and-copy all of its wordpieces.) Such a halt-and-copy mechanism is intuitive in two respects:

• Halt. If the uncertainty of a token is very small, there is little chance that its prediction will change in the following layers, so it is redundant to keep updating its representation.

• Copy. If the representation of a token can be classified into a label with a high degree of confidence, then its representation already contains the label information, so we can directly copy it into the upper layers to help predict the labels of the other tokens.

Exited tokens do not attend to other tokens at the upper layers but can still be attended to by other tokens, so part of the layer-specific query projections in the upper layers can be omitted. By this, the computational complexity of self-attention is reduced from O(N²d) to O(NMd), where M ≪ N is the number of tokens that have not exited. Besides, the computational complexity of the point-wise FFN is also reduced from O(Nd²) to O(Md²).
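The following schematic sketch shows how a single self-attention layer might skip exited tokens while still letting them be attended to, which is where the O(NMd) term comes from. It is a single-head simplification under our own assumptions (no residuals, layer norm, multi-head splitting, or output projection), not the authors' implementation.

```python
import torch

def attention_with_halted_tokens(q_proj, k_proj, v_proj, hidden, active):
    """Schematic single-head self-attention under halt-and-copy.

    q_proj/k_proj/v_proj: linear projections (e.g., torch.nn.Linear modules).
    hidden: (N, d) layer input; active: (N,) bool, False for exited tokens.
    Only the M active tokens form queries (O(N*M*d) attention); exited tokens
    remain visible as keys/values, and their hidden states are copied through.
    """
    queries = q_proj(hidden[active])                 # (M, d): only non-exited tokens query
    keys, values = k_proj(hidden), v_proj(hidden)    # (N, d): exited tokens can be attended to
    scores = queries @ keys.t() / hidden.size(-1) ** 0.5
    context = torch.softmax(scores, dim=-1) @ values # (M, d)

    out = hidden.clone()      # Copy: exited tokens keep their previous representation
    out[active] = context     # Halt: only still-active tokens get updated
    return out
```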
The halt-and-copy mechanism is also similar to the multi-pass sequence labeling paradigm, in which tokens are labeled in order of difficulty (easiest first). However, the copy mechanism results in a train-inference discrepancy: during standard training, a layer never processes representations coming from its non-adjacent previous layers. To alleviate this discrepancy, we further propose an additional fine-tuning stage, which is discussed in Section 3.3.2.

3.3 Model Training

In this section, we describe the training process of our proposed early-exit mechanisms.

3.3.1 Fine-Tuning for SENTEE

For sentence-level early-exit, we follow prior early-exit work for text classification and jointly train the added off-ramps. For each off-ramp, the loss function is

    L_l = Σ_{n=1}^{N} H(y_n, f^(l)(X; θ)_n),    (7)

where H is the cross-entropy loss function and N is the sequence length. The total loss for each sample is a weighted sum of the losses of all off-ramps,

    L_total = (Σ_{l=1}^{L} w_l L_l) / (Σ_{l=1}^{L} w_l),    (8)

where w_l is the weight of the l-th off-ramp and L is the number of backbone layers. Following Zhou et al. (2020), we simply set w_l = l. In this way, the deeper an off-ramp is, the bigger the weight of its loss, so all off-ramps can be trained jointly in a relatively balanced way.

3.3.2 Fine-Tuning for TOKEE

Since TOKEE is equipped with halt-and-copy, the common joint training of off-ramps is not enough, because the model never conducts halt-and-copy during training but does at inference. In this stage, we aim to train the model to use hidden states from different previous layers, not only the adjacent previous layer, just as at inference.

Random Sampling. A direct way is to uniformly sample the halting layers of tokens. However, halting layers at inference are not random but depend on the difficulty of each token in the sequence, so randomly sampling halting layers still leaves a gap between training and inference.

Self-Sampling. Instead, we use the fine-tuned model itself to sample the halting layers. For every sample in each training epoch, we randomly sample a window size and a threshold, and then conduct TOKEE with the trained model under that window size and threshold, without halt-and-copy. This gives us the exiting layer of each token, which we use to re-forward the sample, halting and copying each token at the corresponding layer. In this way, the exiting layer of a token corresponds to its difficulty: the deeper a token's exiting layer, the more difficult the token. Because we sample the exiting layers using the model itself, the gap between training and inference can be further shrunk. To avoid over-fitting during this further fine-tuning, we prevent the training loss from decreasing further, similar to the flooding mechanism of Ishida et al. (2020). We also employ the sandwich rule (Yu and Huang, 2019) to stabilize this training stage. We compare self-sampling with random sampling in Section 4.4.4.
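A possible shape of one self-sampling fine-tuning step, combining the sampled exit layers with the weighted off-ramp loss of Eqs. (7)-(8), is sketched below. The three model methods it calls are hypothetical interfaces we name only for illustration, and the flooding and sandwich-rule tricks mentioned above are omitted.

```python
import random
import torch
import torch.nn.functional as F

def self_sampling_step(model, batch, optimizer,
                       k_choices=(0, 1, 2, 3), deltas=(0.1, 0.2, 0.3, 0.4)):
    """One self-sampling fine-tuning step (Section 3.3.2), sketched.

    Assumed (illustrative, not the released API):
      model.forward_all_offramps(batch)                 -> list of per-layer logits (B, N, C)
      model.exit_layers_from_uncertainty(logits, k, d)  -> (B, N) exit layer per token
      model.forward_with_halt_and_copy(batch, layers)   -> same list, with halt-and-copy
    """
    k, delta = random.choice(k_choices), random.choice(deltas)

    # Pass 1: run without halt-and-copy and record where each token would exit (TOKEE).
    with torch.no_grad():
        logits = model.forward_all_offramps(batch)
        exit_layers = model.exit_layers_from_uncertainty(logits, k, delta)

    # Pass 2: re-forward the same sample, halting and copying tokens at those layers.
    logits = model.forward_with_halt_and_copy(batch, exit_layers)

    # Weighted off-ramp loss, Eqs. (7)-(8), with w_l = l; labels of padding are -100.
    layer_losses = torch.stack([
        F.cross_entropy(lg.transpose(1, 2), batch["labels"], ignore_index=-100)
        for lg in logits])
    weights = torch.arange(1, len(logits) + 1, dtype=torch.float,
                           device=layer_losses.device)
    loss = (weights * layer_losses).sum() / weights.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```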
Table 1: Comparison on the BERT backbone. Each cell shows the score delta relative to BERT (F1, or accuracy for ARK Twitter) and the speedup ratio. The LSTM-CRF results are from previous papers (Tian et al., 2020; Mengge et al., 2020; Yan et al., 2019; Gui et al., 2018); the others are implemented by ourselves.

NER:
Model              | CoNLL2003      | Twitter        | Ontonotes 4.0  | Weibo          | CLUE NER
BERT               | 91.20 / 1.00×  | 77.97 / 1.00×  | 79.18 / 1.00×  | 66.15 / 1.00×  | 79.11 / 1.00×
LSTM-CRF           | -0.29 / -      | -12.65 / -     | -7.37 / -      | -9.40 / -      | -7.75 / -
∼2× DistilBERT-6L  | -1.42 / 2.00×  | -1.79 / 2.00×  | -3.76 / 2.00×  | -1.95 / 2.00×  | -2.05 / 2.00×
∼2× SENTEE         | +0.01 / 2.07×  | -0.82 / 2.03×  | -1.70 / 1.94×  | -0.23 / 2.11×  | -0.22 / 2.04×
∼2× TOKEE          | -0.09 / 2.02×  | -0.20 / 2.00×  | -0.20 / 2.00×  | -0.15 / 2.10×  | -0.24 / 1.99×
∼3× DistilBERT-4L  | -1.57 / 3.00×  | -3.37 / 3.00×  | -8.79 / 3.00×  | -5.92 / 3.00×  | -5.67 / 3.00×
∼3× SENTEE         | -0.75 / 3.02×  | -5.54 / 3.03×  | -8.90 / 2.95×  | -2.51 / 2.87×  | -6.17 / 2.98×
∼3× TOKEE          | -0.32 / 3.03×  | -1.32 / 2.98×  | -1.48 / 3.02×  | -0.70 / 3.04×  | -0.92 / 2.91×
∼4× DistilBERT-3L  | -2.78 / 4.00×  | -5.08 / 4.00×  | -12.77 / 4.00× | -9.05 / 4.00×  | -8.22 / 4.00×
∼4× SENTEE         | -6.51 / 4.00×  | -11.50 / 4.14× | -14.40 / 3.78× | -8.31 / 3.86×  | -14.95 / 3.91×
∼4× TOKEE          | -1.44 / 3.94×  | -3.96 / 3.86×  | -3.81 / 3.81×  | -2.04 / 3.92×  | -3.16 / 4.19×

POS tagging and CWS:
Model              | ARK Twitter (Acc) | CTB5 POS     | UD POS         | CTB5 Seg       | UD Seg
BERT               | 91.51 / 1.00×  | 96.20 / 1.00×  | 95.04 / 1.00×  | 98.48 / 1.00×  | 98.04 / 1.00×
LSTM-CRF           | -0.92 / -      | -2.48 / -      | -6.30 / -      | -0.79 / -      | -3.03 / -
∼2× DistilBERT-6L  | -0.17 / 2.00×  | -0.27 / 2.00×  | -0.19 / 2.00×  | -0.12 / 2.00×  | -0.15 / 2.00×
∼2× SENTEE         | -0.49 / 2.01×  | -0.19 / 1.95×  | -0.35 / 1.87×  | -0.02 / 1.99×  | -0.11 / 2.13×
∼2× TOKEE          | -0.13 / 1.96×  | -0.07 / 2.04×  | -0.17 / 1.93×  | -0.05 / 2.02×  | -0.07 / 2.10×
∼3× DistilBERT-4L  | -0.93 / 3.00×  | -0.74 / 3.00×  | -1.21 / 3.00×  | -0.19 / 3.00×  | -0.43 / 3.00×
∼3× SENTEE         | -1.98 / 3.03×  | -1.63 / 3.12×  | -2.47 / 2.95×  | -0.11 / 3.03×  | -0.47 / 2.99×
∼3× TOKEE          | -0.67 / 2.95×  | -0.30 / 3.01×  | -1.04 / 2.89×  | -0.06 / 3.01×  | -0.20 / 3.01×
∼4× DistilBERT-3L  | -1.04 / 4.00×  | -1.34 / 4.00×  | -4.16 / 4.00×  | -0.21 / 4.00×  | -1.01 / 4.00×
∼4× SENTEE         | -2.96 / 3.98×  | -2.37 / 3.91×  | -4.78 / 3.96×  | -0.83 / 4.02×  | -1.23 / 3.98×
∼4× TOKEE          | -2.17 / 3.89×  | -1.29 / 3.98×  | -4.03 / 3.89×  | -0.14 / 3.85×  | -0.53 / 3.97×

Table 2: Results on RoBERTa and ALBERT (F1(Δ) / speedup) on one representative dataset per task: Ontonotes 4.0 (NER), CTB5 POS (POS tagging), and UD Seg (CWS).

Model        | NER: Ontonotes 4.0 | POS: CTB5 POS | CWS: UD Seg
RoBERTa      | 79.32 / 1.00×      | 96.29 / 1.00× | 97.91 / 1.00×
RoBERTa 6L   | -4.03 / 2.00×      | -0.11 / 2.00× | -0.28 / 2.00×
SENTEE       | -0.55 / 1.98×      | -0.08 / 1.85× | -0.08 / 2.01×
TOKEE        | -0.19 / 2.01×      | -0.06 / 1.98× | -0.02 / 1.97×
RoBERTa 4L   | -11.97 / 3.00×     | -1.59 / 3.00× | -0.76 / 3.00×
SENTEE       | -5.61 / 3.03×      | -1.02 / 2.80× | -0.51 / 2.99×
TOKEE        | -0.58 / 3.05×      | -0.51 / 3.01× | -0.12 / 3.00×
ALBERT       | 76.24 / 1.00×      | 95.73 / 1.00× | 97.00 / 1.00×
ALBERT 6L    | -5.78 / 2.00×      | -0.42 / 2.00× | -0.20 / 2.00×
SENTEE       | -1.25 / 2.01×      | -0.23 / 1.97× | -0.02 / 1.98×
TOKEE        | -0.43 / 1.97×      | -0.08 / 2.00× | -0.03 / 2.01×
ALBERT 4L    | -12.93 / 3.00×     | -2.15 / 3.00× | -1.38 / 3.00×
SENTEE       | -6.90 / 2.83×      | -1.47 / 3.11× | -0.41 / 2.97×
TOKEE        | -0.87 / 2.87×      | -0.58 / 2.89× | -0.06 / 2.98×

4 Experiment

4.1 Computational Cost Measure

We use the average number of floating-point operations (FLOPs) as the measure of computational cost, i.e., how many floating-point operations the model performs for a single sample. FLOPs is a universal measure since it does not depend on the running environment (CPU, GPU, or TPU) and it reflects the theoretical running time of the model. In general, the lower a model's FLOPs, the faster its inference.
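As a rough, hedged illustration of how exit layers translate into a theoretical speedup, the per-layer cost model of Section 3.2.2 (attention ~NMd, point-wise FFN ~Md²) can be turned into an estimate like the one below; this is our own approximation, not the FLOPs-counting script used for the reported numbers.

```python
def tokee_speedup_estimate(exit_layers, num_layers=12, hidden=768):
    """Rough FLOPs-based speedup estimate for one sentence (our approximation).

    exit_layers: list with the exit layer (1..num_layers) of every token.
    Per layer, self-attention costs ~N*M*d and the point-wise FFN ~M*d^2,
    where M is the number of tokens still active at that layer.
    """
    n = len(exit_layers)
    full_cost = num_layers * (n * n * hidden + n * hidden ** 2)
    cost = 0
    for layer in range(1, num_layers + 1):
        m = sum(1 for e in exit_layers if e >= layer)  # tokens not yet exited
        cost += n * m * hidden + m * hidden ** 2
    return full_cost / max(cost, 1)
```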
4.2 Experimental Setup

4.2.1 Dataset

To verify the effectiveness of our methods, we conduct experiments on ten English and Chinese sequence labeling datasets, covering NER: CoNLL2003 (Tjong Kim Sang and De Meulder, 2003), Twitter NER (Zhang et al., 2018), Ontonotes 4.0 (Chinese) (Weischedel et al., 2011), Weibo (Peng and Dredze, 2015; He and Sun, 2017), and CLUE NER (Xu et al., 2020); POS tagging: ARK Twitter (Gimpel et al., 2011; Owoputi et al., 2013), CTB5 POS (Xue et al., 2005), and UD POS (Nivre et al., 2016); CWS: CTB5 Seg (Xue et al., 2005) and UD Seg (Nivre et al., 2016). Besides standard benchmark datasets like CoNLL2003 and Ontonotes 4.0, we also choose datasets closer to real-world applications, such as Twitter NER and Weibo in the social media domain, to verify the actual utility of our methods. We use the same dataset preprocessing and splits as in previous work (Huang et al., 2015; Mengge et al., 2020; Jia et al., 2020; Tian et al., 2020; Nguyen et al., 2020).
4.2.2 Baseline

We compare our methods with three baselines:

• BiLSTM-CRF (Huang et al., 2015; Ma and Hovy, 2016): the most widely used model for sequence labeling before pre-trained language models prevailed in NLP.

• BERT: the powerful stacked Transformer encoder, pre-trained on a large-scale corpus, which we use as the backbone of our methods.

• DistilBERT: the most well-known distilled version of BERT. Huggingface released a 6-layer DistilBERT for English (Sanh et al., 2019). For comparison, we distill {3, 4}-layer models for English and {3, 4, 6}-layer models for Chinese using the same method.

4.2.3 Hyper-Parameters

For all datasets, we use a batch size of 10 and perform grid search over the learning rate in {5e-6, 1e-5, 2e-5}. We choose the learning rate and the model based on the development set. We use the AdamW optimizer (Loshchilov and Hutter, 2019). The warmup step and weight decay are set to 0.05 and 0.01, respectively.
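For concreteness, the optimization setup can be sketched as follows; the linear-warmup scheduler and the reading of 0.05 as a warmup ratio over the total training steps are our assumptions, since the paper only lists the hyper-parameter values.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, num_training_steps, lr=1e-5, weight_decay=0.01, warmup=0.05):
    """AdamW with weight decay 0.01 and linear warmup; the learning rate is
    grid-searched over {5e-6, 1e-5, 2e-5} on the development set (Section 4.2.3)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup * num_training_steps),
        num_training_steps=num_training_steps)
    return optimizer, scheduler
```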
4.3 Main Results

For English datasets, we use the 'BERT-base-cased' model released by Google (Devlin et al., 2019) as the backbone. For Chinese datasets, we use 'BERT-wwm' released by Cui et al. (2019). The DistilBERT baselines are distilled from the corresponding backbone BERT.

To fairly compare our methods with the baselines, we tune the speedup ratio of our methods to be consistent with the corresponding static baseline. We report the average performance over 5 runs with different random seeds. The overall results are shown in Table 1, where the speedup is measured relative to the backbone. Both SENTEE and TOKEE bring little performance drop and outperform DistilBERT at a speedup ratio of 2×, achieving an effect similar to existing early-exit methods for text classification. Under higher speedups, 3× and 4×, SENTEE shows its weakness, but TOKEE can still maintain a certain level of performance, and under 2∼4× speedup ratios TOKEE has a lower performance drop than DistilBERT. What is more, on datasets where BERT clearly outperforms LSTM-CRF, e.g., Chinese NER, TOKEE (4×) on BERT can still outperform LSTM-CRF significantly. This indicates its potential utility in complicated real-world scenarios.

To explore the fine-grained performance change under different speedup ratios, we visualize the speedup-performance trade-off curves on six datasets in Figure 2. We observe that:

• Before the speedup ratio rises to a certain turning point, there is almost no drop in performance; after that, the performance drops gradually. This shows that our methods keep the superiority of existing early-exit methods (Xin et al., 2020).

• As the speedup rises, TOKEE encounters the turning point later than SENTEE, and after both methods reach their turning points, SENTEE's performance degradation is more drastic than TOKEE's. Both observations indicate the higher speedup ceiling of TOKEE.

• On some datasets, such as CoNLL2003, we observe a small performance improvement under low speedup ratios. We attribute this to the potential regularization brought by early-exit, such as alleviating overthinking (Kaya et al., 2019).

To verify the versatility of our method over different PTMs, we also conduct experiments on two well-known BERT variants, RoBERTa (Liu et al., 2019; we use https://github.com/ymcui/Chinese-BERT-wwm) and ALBERT (Lan et al., 2020; https://github.com/brightmart/albert_zh), as shown in Table 2. SENTEE and TOKEE also significantly outperform the static internal layers of the backbone on three representative datasets of the corresponding tasks. For RoBERTa and ALBERT, we again observe that TOKEE performs better than SENTEE under high speedup ratios.

4.4 Analysis

In this section, we conduct a set of detailed analyses of our methods.

4.4.1 The Effect of Window Size

We show the performance change under different window sizes k in Figure 3, keeping the speedup ratio consistent. We observe that: (1) when k is 0, i.e., when using token-independent uncertainty instead of window-based uncertainty, the performance is almost the lowest across the different speedup ratios, because local dependency is not considered at all; this shows the necessity of window-based uncertainty. (2) When k is relatively large, it brings a significant performance drop under high speedup ratios (3× and 4×), similar to SENTEE. (3) It is necessary to choose an appropriate k under high speedup ratios, where the effect of different k has a high variance.
[Figure 2: The performance-speedup trade-off curves of SENTEE and TOKEE on six datasets: CoNLL2003, Ontonotes 4.0, CTB5 POS, UD POS, CTB5 Seg, and UD Seg.]

[Figure 3: The performance change over different window sizes k under the same speedup ratios (∼2×, ∼3×, ∼4×).]

[Figure 4: The relation between off-ramp accuracy and (a) window-based uncertainty and (b) window size k, for a shallow off-ramp (layer 4) and a deep off-ramp (layer 8) on CoNLL2003.]

4.4.2 Accuracy vs. Uncertainty

Liu et al. (2020) verified 'the lower the uncertainty, the higher the accuracy' for text classification. Here, we verify our window-based uncertainty on sequence labeling. In detail, we examine both the window-based uncertainty itself and its specific hyper-parameter k on CoNLL2003, as shown in Figure 4. For the uncertainty, we intercept the 4th and 8th off-ramps and calculate their accuracy in each uncertainty interval, with k = 2. The result in Figure 4(a) indicates that 'the lower the window-based uncertainty, the higher the accuracy', as in text classification. For k, we set a fixed threshold of 0.3 and calculate the accuracy of the tokens whose window-based uncertainty is smaller than the threshold, under different k, as shown in Figure 4(b). The result shows that, as k increases: (1) the accuracy of the screened tokens is higher, which shows that the wider a token's low-uncertainty neighborhood is, the more accurate the token's prediction is; this also verifies the validity of the window-based uncertainty strategy. (2) The accuracy improvement slows down, which shows the low relevance of distant tokens' uncertainty and explains why a large k does not perform well under high speedup ratios: it does not make exits more accurate, but it does slow exiting down.

4.4.3 Influence of Sequence Length

Transformer-based PTMs, e.g., BERT, face a challenge in processing long text, due to the O(N²d) computational complexity brought by self-attention.
[Figure 5: Influence of sentence length. Highest speedup ratio with performance drop < 1 on Ontonotes 4.0, for window sizes k = 2 and k = +∞, across sentence-length buckets (10, 30, 50, 70, > 70).]

[Figure 6: Comparison between different training strategies for TOKEE (self-sampling, random sampling, and no extra fine-tuning) on CoNLL2003.]

[Figure 7: TOKEE exit-layer distribution within a sentence under speedup ratios of ∼2×, ∼3×, and ∼4× (ΔF1 of -0.02, -0.54, -3.12, versus -0.11, -2.28, -8.20 for SENTEE at the same speedups).]

Since TOKEE reduces the layer-wise computational complexity from O(N²d + Nd²) to O(NMd + Md²) while SENTEE does not, we would like to explore their behavior over different sentence lengths. We compare the highest speedup ratio of TOKEE and SENTEE under a performance drop < 1 on Ontonotes 4.0, as shown in Figure 5. We observe that TOKEE keeps a stable computational cost saving as the sentence length increases, whereas SENTEE's speedup ratio gradually decreases. For this, we give an intuitive explanation: a longer sentence has more tokens, so it is more difficult for the model to give all of them confident predictions at the same layer. This comparison reveals the potential of TOKEE for accelerating long-text inference.

4.4.4 Effects of Self-Sampling Fine-Tuning

To verify the effect of the self-sampling fine-tuning in Section 3.3.2, we compare it with random sampling and with no extra fine-tuning on CoNLL2003. The performance-speedup trade-off curves of TOKEE are shown in Figure 6, which shows that self-sampling is always better than random sampling for TOKEE. As the speedup ratio rises, this trend becomes more significant, which shows that self-sampling helps more in reducing the gap between training and inference. As for no extra fine-tuning, it deteriorates drastically at high speedup ratios, but it can roughly keep a certain capability at low speedup ratios, which we attribute to the residual connections of the PTM; similar results were reported by Veit et al. (2016).

4.4.5 Layer Distribution of Early-Exit

In TOKEE, through the halt-and-copy mechanism, each token goes through a different number of PTM layers according to its difficulty. We show the average distribution of a sentence's token exiting layers under different speedup ratios on CoNLL2003 in Figure 7, and we also report the average exiting layer of SENTEE under the same speedup ratios. We observe that, as the speedup ratio rises, more tokens exit at earlier layers, but a small portion of tokens still go through the deeper layers even at 4×; meanwhile, SENTEE's average exiting layer drops to 2.5, where the PTM's encoding power is severely cut down. This gives an intuitive explanation of why TOKEE is more effective than SENTEE under high speedup ratios: although both SENTEE and TOKEE can dynamically adjust the computational cost at the sample level, TOKEE does so in a more fine-grained way.

5 Related Work

PTMs are powerful but have a high computational cost, and many attempts have been made to accelerate them. One kind of method reduces the model size, such as distillation (Sanh et al., 2019; Jiao et al., 2020), structural pruning (Michel et al., 2019; Fan et al., 2020), and quantization (Shen et al., 2020).

Another kind of method is early-exit, which dynamically adjusts the number of encoding layers for different samples (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020; Li et al., 2020).
While these works introduced the early-exit mechanism for simple classification tasks, our methods are proposed for a more complicated scenario, sequence labeling, where there is not just one prediction probability per instance and the dependency between token exits must be considered. Elbayad et al. (2020) proposed the Depth-Adaptive Transformer to accelerate machine translation. However, their early-exit mechanism is designed for auto-regressive sequence generation, in which tokens must exit in left-to-right order; it is therefore unsuitable for language understanding tasks. Different from their method, our early-exit mechanism can consider the exit of all tokens simultaneously.
6 Conclusion and Future Work

In this work, we propose two early-exit mechanisms for sequence labeling: SENTEE and TOKEE. The former is a simple extension of sequence-level early-exit, while the latter is specially designed for sequence labeling and can conduct more fine-grained computational cost allocation. We equip TOKEE with window-based uncertainty and self-sampling fine-tuning to make it more robust and faster. The detailed analysis verifies their effectiveness. SENTEE and TOKEE can achieve 2× and 3∼4× speedup, respectively, with minimal performance drop. For future work, we wish to explore: (1) leveraging the label information of exited tokens to help decide the exit of the remaining tokens; (2) introducing CRF or other global decoding methods into early-exit for sequence labeling.

Acknowledgments

We thank the anonymous reviewers for their detailed reviews and great suggestions. This work was supported by the National Key Research and Development Program of China (No. 2020AAA0106700), the National Natural Science Foundation of China (No. 62022027), and the Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AD01).

References

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL-HLT 2011, Short Papers, pages 42-47. Association for Computational Linguistics.

Tao Gui, Qi Zhang, Jingjing Gong, Minlong Peng, Di Liang, Keyu Ding, and Xuanjing Huang. 2018. Transferring from formal newswire domain with hypernet for Twitter POS tagging. In Proceedings of EMNLP 2018, pages 2540-2549, Brussels, Belgium. Association for Computational Linguistics.

Hangfeng He and Xu Sun. 2017. F-score driven max margin neural network for named entity recognition in Chinese social media. In Proceedings of EACL 2017, Volume 2: Short Papers, pages 713-718. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi Sugiyama. 2020. Do we need zero training loss after achieving zero training error? In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4604-4614. PMLR.

Chen Jia, Yuefeng Shi, Qinrong Yang, and Yue Zhang. 2020. Entity enhanced BERT pre-training for Chinese NER. In Proceedings of EMNLP 2020, pages 6384-6396, Online. Association for Computational Linguistics.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP 2020, pages 4163-4174. Association for Computational Linguistics.

Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3301-3310. PMLR.

Zhen Ke, Liang Shi, Erli Meng, Bin Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Unified multi-criteria Chinese word segmentation with BERT.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282-289, San Francisco, CA, USA. Morgan Kaufmann.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Lei Li, Yankai Lin, Shuhuai Ren, Deli Chen, Xuancheng Ren, Peng Li, Jie Zhou, and Xu Sun. 2020. Accelerating pre-trained language models via calibrated cascade. CoRR, abs/2012.14682.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: A self-distilling BERT with adaptive inference time. In Proceedings of ACL 2020, pages 6035-6044, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019. OpenReview.net.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL 2016 (Volume 1: Long Papers), pages 1064-1074, Berlin, Germany. Association for Computational Linguistics.

Xue Mengge, Bowen Yu, Zhenyu Zhang, Tingwen Liu, Yue Zhang, and Bin Wang. 2020. Coarse-to-fine pre-training for named entity recognition. In Proceedings of EMNLP 2020, pages 6345-6354, Online. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 14014-14024.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English tweets. In Proceedings of EMNLP 2020: System Demonstrations, pages 9-14, Online. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of LREC 2016, Portorož, Slovenia. European Language Resources Association (ELRA).

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT 2013, pages 380-390. Association for Computational Linguistics.

Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of EMNLP 2015, pages 548-554, Lisbon, Portugal. Association for Computational Linguistics.

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. SCIENCE CHINA Technological Sciences, 63(10):1872-1897.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In Proceedings of ACL 2020, pages 6640-6651, Online. Association for Computational Linguistics.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8815-8821.

Yuanhe Tian, Yan Song, Xiang Ao, Fei Xia, Xiaojun Quan, Tong Zhang, and Yonggang Wang. 2020. Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In Proceedings of ACL 2020, pages 8286-8296, Online. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998-6008. Curran Associates, Inc.

Andreas Veit, Michael J. Wilber, and Serge J. Belongie. 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems 29 (NIPS 2016), pages 550-558.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. 2011. OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of ACL 2020, pages 2246-2251, Online. Association for Computational Linguistics.

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4762-4772, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207-238.

Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: Adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.

Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1803-1811. IEEE.

Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets.

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian J. McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).