Accelerating BERT Inference for Sequence Labeling via Early-Exit

 Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu∗, Xuanjing Huang
 Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
 School of Computer Science, Fudan University
 {lixn20, yfshao19, txsun19, hyan19, xpqiu, xjhuang}@fudan.edu.cn

Abstract

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is expensive. To alleviate this problem, we extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks. However, existing early-exit mechanisms are specifically designed for sequence-level tasks rather than sequence labeling. In this paper, we first propose SENTEE: a simple extension of SENTence-level Early-Exit for sequence labeling tasks. To further reduce the computational cost, we also propose TOKEE: a TOKen-level Early-Exit mechanism that allows part of the tokens to exit early at different layers. Considering the local dependency inherent in sequence labeling, we employ a window-based criterion to decide whether a token can exit. The token-level early-exit introduces a gap between training and inference, so we add an extra self-sampling fine-tuning stage to alleviate it. Extensive experiments on three popular sequence labeling tasks show that our approach can save up to 66%∼75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under speed-up ratios of 2×, 3×, and 4×.[1]

1 Introduction

Sequence labeling plays an important role in natural language processing (NLP). Many NLP tasks can be converted to sequence labeling tasks, such as named entity recognition, part-of-speech tagging, Chinese word segmentation, and semantic role labeling. These tasks are usually fundamental and highly time-demanding; therefore, apart from performance, their inference efficiency is also very important.

The past few years have witnessed the prevalence of pre-trained models (PTMs) (Qiu et al., 2020) on various sequence labeling tasks (Nguyen et al., 2020; Ke et al., 2020; Tian et al., 2020; Mengge et al., 2020). Despite their significant improvements on sequence labeling, they are notorious for enormous computational cost and slow inference speed, which hinders their utility in real-time or mobile-device scenarios.

Recently, the early-exit mechanism (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) has been introduced to accelerate inference for large-scale PTMs. In these methods, each layer of the PTM is coupled with a classifier to predict the label for a given instance. At inference time, if the prediction is confident[2] enough at an earlier layer, the instance is allowed to exit without passing through the entire model. Figure 1(a) gives an illustration of the early-exit mechanism for text classification. However, most existing early-exit methods are targeted at sequence-level prediction, such as text classification, in which the prediction and its confidence score are calculated over the whole sequence. Therefore, these methods cannot be directly applied to sequence labeling tasks, where the prediction is token-level and a confidence score is required for each token.

In this paper, we aim to extend the early-exit mechanism to sequence labeling tasks. First, we propose SENTence-level Early-Exit (SENTEE), a simple extension of existing early-exit methods. SENTEE allows a sequence of tokens to exit together once the maximum uncertainty of the tokens is below a threshold. Despite its effectiveness, we find it redundant for most tokens to update their representation at each layer. Thus, we propose a TOKen-level Early-Exit (TOKEE) that allows the tokens that already have confident predictions to exit earlier. Figure 1(b) and 1(c) illustrate our proposed SENTEE and TOKEE. Considering the local dependency inherent in sequence labeling tasks, we decide whether a token can exit based on the uncertainty of a window of its context instead of the token alone. For tokens that have already exited, we do not update their representation but simply copy it to the upper layers. However, this introduces a train-inference discrepancy. To tackle this problem, we introduce an additional fine-tuning stage that samples each token's halting layer based on its uncertainty and copies its representation to the upper layers during training. We conduct extensive experiments on three sequence labeling tasks: NER, POS tagging, and CWS. Experimental results show that our approach can save up to 66%∼75% of the inference cost with minimal performance degradation. Compared with competitive compressed models such as DistilBERT, our approach achieves better performance under speed-up ratios of 2×, 3×, and 4×.

∗ Corresponding author.
[1] Our implementation is publicly available at https://github.com/LeeSureman/Sequence-Labeling-Early-Exit.
[2] In this paper, a confident prediction is one whose uncertainty is low.
[Figure 1: Early-Exit for Text Classification and Sequence Labeling. (a) Early-Exit for Text Classification; (b) Sentence-Level Early-Exit for Sequence Labeling; (c) Token-Level Early-Exit for Sequence Labeling. Arrows denote self-attention connections and simple copying of exited representations; confident and uncertain predictions are marked at each off-ramp. The darker a hidden-state block, the deeper the layer it comes from. Due to the space limit, the window-based uncertainty is not reflected.]

2 BERT for Sequence Labeling

Recently, PTMs (Qiu et al., 2020) have become the mainstream backbone model for various sequence labeling tasks. The typical framework consists of a backbone encoder and a task-specific decoder.

Encoder. In this paper, we use BERT (Devlin et al., 2019) as our backbone encoder. The architecture of BERT consists of multiple stacked Transformer layers (Vaswani et al., 2017). Given a sequence of tokens $x_1, \cdots, x_N$, the hidden state of the $l$-th Transformer layer is denoted by $H^{(l)} = [h_1^{(l)}, \cdots, h_N^{(l)}]$, and $H^{(0)}$ is the BERT input embedding.

Decoder. Usually, we can predict the label of each token according to the hidden state of the top layer. The probability of the labels is predicted by

    $P = f(W H^{(L)}) \in \mathbb{R}^{N \times C}$,    (1)

where $N$ is the sequence length, $C$ is the number of labels, $L$ is the number of BERT layers, $W$ is a learnable matrix, and $f(\cdot)$ is a simple softmax classifier or a conditional random field (CRF) (Lafferty et al., 2001). Since we focus on inference acceleration and PTMs perform well enough on sequence labeling without CRF (Devlin et al., 2019), we do not consider using such a recurrent structure.
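To make Eq. (1) concrete, the decoder is simply a linear layer followed by a softmax over the encoder's top-layer hidden states. Below is a minimal PyTorch sketch assuming the Hugging Face `transformers` BERT implementation; the class name and wiring are illustrative and not taken from the released code.

```python
# Minimal sketch of the backbone + softmax decoder of Eq. (1).
# Assumes the Hugging Face `transformers` BERT implementation; names are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel

class BertTagger(nn.Module):
    def __init__(self, num_labels: int, model_name: str = "bert-base-cased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        # W in Eq. (1): maps d-dimensional hidden states to C label logits.
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # H^(L): top-layer hidden states, shape (batch, N, d).
        h_top = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(h_top)       # (batch, N, C)
        return torch.softmax(logits, dim=-1)  # P in Eq. (1)
```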
3 Early-Exit for Sequence Labeling

The inference speed and computational cost of PTMs are crucial bottlenecks that hinder their application in many real-world scenarios. In many tasks, the representations at an earlier layer of a PTM are usually adequate to make a correct prediction. Therefore, early-exit mechanisms (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020) have been proposed to dynamically stop inference on the backbone model and make the prediction with an intermediate representation.
However, these existing early-exit mechanisms are built on sentence-level prediction and are unsuitable for token-level prediction in sequence labeling tasks. In this section, we propose two early-exit mechanisms to accelerate inference for sequence labeling tasks.

3.1 Token-Level Off-Ramps

To extend early-exit to sequence labeling, we couple each layer of the PTM with a token-level off-ramp, which can be simply implemented as a linear classifier. Once the off-ramps are trained with the golden labels, an instance has a chance to be predicted and exit at an earlier layer instead of passing through the entire model.

Given a sequence of tokens $X = x_1, \cdots, x_N$, we can make predictions with the injected off-ramp at each layer. For the off-ramp at the $l$-th layer, the label distribution of all tokens is predicted by

    $P^{(l)} = f^{(l)}(X; \theta) = \mathrm{softmax}(W H^{(l)})$,    (2, 3)

where $W$ is a learnable matrix, $f^{(l)}$ is the token-level off-ramp at the $l$-th layer, and $P^{(l)} = [p_1^{(l)}, \cdots, p_N^{(l)}]$ with $p_n^{(l)} \in \mathbb{R}^{C}$ is the predicted label distribution of each token at the $l$-th off-ramp.

Uncertainty of the Off-Ramp. With the prediction for each token at hand, we calculate the uncertainty of each token as

    $u_n^{(l)} = \dfrac{-p_n^{(l)} \cdot \log p_n^{(l)}}{\log C}$,    (4)

where $p_n^{(l)}$ is the label probability distribution of the $n$-th token.
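Eq. (4) is the Shannon entropy of a token's predicted label distribution, normalized by $\log C$ so the value lies in $[0, 1]$. A short sketch of this computation (PyTorch; the function name is illustrative):

```python
import math
import torch

def token_uncertainty(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Normalized entropy of Eq. (4).

    probs: (batch, N, C) label distributions from an off-ramp.
    Returns a (batch, N) tensor of uncertainties in [0, 1].
    """
    num_labels = probs.size(-1)
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)  # -p · log p
    return entropy / math.log(num_labels)                 # divide by log C
```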
3.2 Early-Exit Strategies

In the following sections, we introduce two early-exit strategies for sequence labeling: sentence-level and token-level.

3.2.1 SENTEE: Sentence-Level Early-Exit

Sentence-Level Early-Exit (SENTEE) is a simple extension of existing early-exit approaches to sequence labeling tasks. SENTEE allows a sequence of tokens to exit together if their uncertainty is low enough. Therefore, SENTEE aggregates the uncertainty of each token to obtain an overall uncertainty for the whole sequence. Here we adopt a straightforward but effective method, i.e., max-pooling[3] over the uncertainties of all the tokens,

    $u^{(l)} = \max\{u_1^{(l)}, \cdots, u_N^{(l)}\}$,    (5)

where $u^{(l)}$ represents the uncertainty of the whole sentence. If $u^{(l)} < \delta$, where $\delta$ is a pre-defined threshold, we let the sentence exit at layer $l$. The intuition is that the whole sequence can exit only when the model is confident in its prediction for the most difficult token.

[3] We also tried average-pooling, but it brings a drastic performance drop. We find that the average uncertainty over the sequence is often overwhelmed by the many easy tokens, which causes many wrong exits of difficult tokens.
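In code, the SENTEE criterion of Eq. (5) is a masked max-pooling over the per-token uncertainties followed by a threshold test. A sketch (the padding handling is our own assumption, not spelled out in the paper):

```python
import torch

def sentence_should_exit(token_unc: torch.Tensor,
                         attention_mask: torch.Tensor,
                         delta: float) -> torch.Tensor:
    """SENTEE (Eq. 5): exit when the max token uncertainty is below delta.

    token_unc: (batch, N) uncertainties from Eq. (4).
    attention_mask: (batch, N), 1 for real tokens, 0 for padding.
    Returns a (batch,) boolean tensor of exit decisions.
    """
    # Mask padding so it cannot dominate the max-pooling.
    masked = token_unc.masked_fill(attention_mask == 0, float("-inf"))
    return masked.max(dim=-1).values < delta
```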
3.2.2 TOKEE: Token-Level Early-Exit

Despite the effectiveness of SENTEE (see Table 1), we find it redundant for most simple tokens to be fed into the deep layers. The simple tokens that have already been correctly predicted at a shallow layer cannot exit (under SENTEE) because the uncertainty of a small number of difficult tokens is still above the threshold. Thus, to further accelerate inference for sequence labeling tasks, we propose a token-level early-exit (TOKEE) method that allows simple tokens with confident predictions to exit early.

Window-Based Uncertainty. A prevalent problem in sequence labeling tasks is the local dependency (or label dependency): the label of a token heavily depends on the tokens around it. Consequently, the uncertainty of a given token should be calculated not only from the token itself but also from its context. Motivated by this, we propose a window-based uncertainty criterion to decide whether a token exits at the current layer. In particular, the uncertainty of token $x_n$ at the $l$-th layer is defined as

    $u_n^{\prime(l)} = \max\{u_{n-k}^{(l)}, \cdots, u_{n+k}^{(l)}\}$,    (6)

where $k$ is a pre-defined window size. We then use $u_n^{\prime(l)}$, instead of $u_n^{(l)}$, to decide whether the $n$-th token can exit at layer $l$. Note that window-based uncertainty is equivalent to sentence-level uncertainty when $k$ equals the sentence length.
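The window-based uncertainty of Eq. (6) is a sliding-window max over the per-token uncertainties, which can be written as a 1-d max-pooling. A sketch (illustrative; it assumes the whole batch shares the same window size k):

```python
import torch
import torch.nn.functional as F

def window_uncertainty(token_unc: torch.Tensor, k: int) -> torch.Tensor:
    """Window-based uncertainty of Eq. (6): max over positions [n-k, n+k].

    token_unc: (batch, N) per-token uncertainties; returns a tensor of the same shape.
    """
    if k == 0:
        return token_unc  # degenerates to the token-independent uncertainty
    # A max-pool with kernel 2k+1, stride 1 and padding k is a sliding-window max;
    # max_pool1d uses implicit -inf padding, so boundary tokens only see their
    # existing neighbours.
    return F.max_pool1d(token_unc.unsqueeze(1),
                        kernel_size=2 * k + 1, stride=1, padding=k).squeeze(1)
```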
Halt-and-Copy. For tokens that have exited, their representation is not updated in the upper layers; i.e., the hidden states of exited tokens are directly copied to the upper layers.[4] Such a halt-and-copy mechanism is intuitive in two respects:

• Halt. If the uncertainty of a token is very small, there is little chance that its prediction will change in the following layers, so it is redundant to keep updating its representation.

• Copy. If the representation of a token can be classified into a label with high confidence, then the representation already contains the label information, so we can directly copy it to the upper layers to help predict the labels of other tokens.

These exited tokens do not attend to other tokens at upper layers but can still be attended to by other tokens, so part of the layer-specific query projections in the upper layers can be omitted. By this, the computational complexity of self-attention is reduced from O(N²d) to O(NMd), where M ≪ N is the number of tokens that have not exited. Besides, the computational complexity of the position-wise FFN is also reduced from O(Nd²) to O(Md²).

[4] For English sequence labeling, we use first-pooling to get the representation of a word. If a word exits, we halt-and-copy all of its wordpieces.
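One way to realize this asymmetry (exited tokens are attended to but no longer act as queries) is to compute attention only for the M unexited queries while keeping the full sequence as keys and values. The sketch below is our own simplified single-layer illustration, not the authors' implementation; `layer` is assumed to be an `nn.MultiheadAttention` module created with `batch_first=True`.

```python
import torch
import torch.nn as nn

def attend_with_halted_tokens(layer: nn.MultiheadAttention,
                              hidden: torch.Tensor,
                              exited: torch.Tensor) -> torch.Tensor:
    """Single-layer sketch of halt-and-copy.

    hidden: (batch, N, d) current hidden states.
    exited: (batch, N) boolean, True for tokens that already exited.
    Exited rows are copied unchanged; only unexited tokens act as queries,
    but every token remains visible as a key/value, so the attention cost
    drops from O(N^2 d) to O(N M d).
    """
    new_hidden = hidden.clone()          # exited tokens: representation copied as-is
    active = ~exited
    for b in range(hidden.size(0)):      # per-sentence loop kept for clarity
        q = hidden[b, active[b]].unsqueeze(0)   # (1, M, d) queries
        kv = hidden[b].unsqueeze(0)             # (1, N, d) keys/values
        out, _ = layer(q, kv, kv)
        new_hidden[b, active[b]] = out.squeeze(0)
    return new_hidden
```

In a full Transformer layer, the position-wise FFN would likewise be applied only to the M active rows, which gives the O(Md²) term above.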
The halt-and-copy mechanism is also similar to the multi-pass sequence labeling paradigm, in which tokens are labeled in order of difficulty (easiest first). However, the copy mechanism results in a train-inference discrepancy: during training, a layer never processes representations coming from non-adjacent previous layers. To alleviate this discrepancy, we propose an additional fine-tuning stage, which will be discussed in Section 3.3.2.

3.3 Model Training

In this section, we describe the training process of our proposed early-exit mechanisms.

3.3.1 Fine-Tuning for SENTEE

For sentence-level early-exit, we follow prior early-exit work on text classification and jointly train the added off-ramps. For each off-ramp, the loss function is

    $\mathcal{L}_l = \sum_{n=1}^{N} H\big(y_n, f^{(l)}(X; \theta)_n\big)$,    (7)

where $H$ is the cross-entropy loss function and $N$ is the sequence length. The total loss for each sample is a weighted sum of the losses of all the off-ramps,

    $\mathcal{L}_{\mathrm{total}} = \dfrac{\sum_{l=1}^{L} w_l \mathcal{L}_l}{\sum_{l=1}^{L} w_l}$,    (8)

where $w_l$ is the weight of the $l$-th off-ramp and $L$ is the number of backbone layers. Following Zhou et al. (2020), we simply set $w_l = l$. In this way, the deeper an off-ramp is, the bigger the weight of its loss, so all off-ramps can be trained jointly in a relatively balanced way.

3.3.2 Fine-Tuning for TOKEE

Since TOKEE is equipped with halt-and-copy, the common joint training of off-ramps is not enough, because the model never conducts halt-and-copy during training but does so at inference. In this stage, we aim to train the model to use hidden states from different previous layers, not only from the adjacent previous layer, just as at inference.

Random Sampling. A direct way is to uniformly sample the halting layer of each token. However, halting layers at inference are not random but depend on the difficulty of each token in the sequence, so randomly sampling halting layers also causes a gap between training and inference.

Self-Sampling. Instead, we use the fine-tuned model itself to sample the halting layers. For every sample in each training epoch, we randomly sample a window size and a threshold, and then conduct TOKEE on the trained model under this window size and threshold, but without halt-and-copy. This gives the exiting layer of each token, which we then use to re-forward the sample, halting and copying each token at the corresponding layer. In this way, the exiting layer of a token corresponds to its difficulty: the deeper a token's exiting layer is, the more difficult it is. Because we sample the exiting layers using the model itself, the gap between training and inference can be further shrunk. To avoid over-fitting during this further training, we prevent the training loss from decreasing further, similar to the flooding mechanism of Ishida et al. (2020). We also employ the sandwich rule (Yu and Huang, 2019) to stabilize this training stage. We compare self-sampling with random sampling in Section 4.4.4.
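As a concrete, necessarily simplified reference for Section 3.3, the sketch below wires together the joint off-ramp loss of Eqs. (7)-(8) with w_l = l and one self-sampling training step. Everything here is our own illustration: `model.forward_all_offramps` and `model.forward_with_exits` are hypothetical methods of an early-exit tagger (the first returns each token's first exit layer without halt-and-copy, the second re-runs the batch while halting and copying at those layers), and the sampling ranges and flooding level are placeholders.

```python
import random
import torch
import torch.nn.functional as F

def joint_offramp_loss(all_logits, labels, pad_mask):
    """Weighted joint loss of Eqs. (7)-(8) with w_l = l.

    all_logits: list of L tensors, each (batch, N, C), one per off-ramp.
    labels:     (batch, N) gold label ids; pad_mask: (batch, N), 1 for real tokens.
    """
    layer_losses = []
    for logits in all_logits:
        # Eq. (7): token-level cross-entropy for one off-ramp (padding masked out).
        ce = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
        layer_losses.append((ce * pad_mask).sum() / pad_mask.sum())
    layer_losses = torch.stack(layer_losses)
    weights = torch.arange(1, len(all_logits) + 1,
                           dtype=layer_losses.dtype, device=layer_losses.device)
    return (weights * layer_losses).sum() / weights.sum()          # Eq. (8)

def self_sampling_step(model, batch, optimizer, flood_level=0.0):
    """One step of the self-sampling stage for TOKEE (Section 3.3.2)."""
    k = random.choice([0, 1, 2, 3])        # randomly sampled window size
    delta = random.uniform(0.1, 0.6)       # randomly sampled threshold
    with torch.no_grad():
        # Pass 1: no halt-and-copy; record each token's first layer whose
        # window-based uncertainty falls below delta.
        exit_layers = model.forward_all_offramps(batch, window=k, threshold=delta)
    # Pass 2: re-forward with halt-and-copy at the sampled exit layers and
    # train all off-ramps jointly with the loss above.
    loss = model.forward_with_exits(batch, exit_layers, loss_fn=joint_offramp_loss)
    # Flooding (Ishida et al., 2020): keep the loss from dropping below flood_level.
    loss = (loss - flood_level).abs() + flood_level
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```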
Task: NER
Dataset              CoNLL2003        Twitter          Ontonotes 4.0    Weibo            CLUE NER
Model                F1(∆)   Speedup  F1(∆)   Speedup  F1(∆)   Speedup  F1(∆)   Speedup  F1(∆)   Speedup
BERT                 91.20   1.00×    77.97   1.00×    79.18   1.00×    66.15   1.00×    79.11   1.00×
LSTM-CRF             -0.29   -        -12.65  -        -7.37   -        -9.40   -        -7.75   -
∼2×  DistilBERT-6L   -1.42   2.00×    -1.79   2.00×    -3.76   2.00×    -1.95   2.00×    -2.05   2.00×
∼2×  SENTEE          +0.01   2.07×    -0.82   2.03×    -1.70   1.94×    -0.23   2.11×    -0.22   2.04×
∼2×  TOKEE           -0.09   2.02×    -0.20   2.00×    -0.20   2.00×    -0.15   2.10×    -0.24   1.99×
∼3×  DistilBERT-4L   -1.57   3.00×    -3.37   3.00×    -8.79   3.00×    -5.92   3.00×    -5.67   3.00×
∼3×  SENTEE          -0.75   3.02×    -5.54   3.03×    -8.90   2.95×    -2.51   2.87×    -6.17   2.98×
∼3×  TOKEE           -0.32   3.03×    -1.32   2.98×    -1.48   3.02×    -0.70   3.04×    -0.92   2.91×
∼4×  DistilBERT-3L   -2.78   4.00×    -5.08   4.00×    -12.77  4.00×    -9.05   4.00×    -8.22   4.00×
∼4×  SENTEE          -6.51   4.00×    -11.50  4.14×    -14.40  3.78×    -8.31   3.86×    -14.95  3.91×
∼4×  TOKEE            -1.44   3.94×    -3.96   3.86×    -3.81   3.81×    -2.04   3.92×    -3.16   4.19×

Task: POS Tagging (ARK Twitter, CTB5 POS, UD POS) and CWS (CTB5 Seg, UD Seg)
Dataset              ARK Twitter      CTB5 POS         UD POS           CTB5 Seg         UD Seg
Model                Acc(∆)  Speedup  F1(∆)   Speedup  F1(∆)   Speedup  F1(∆)   Speedup  F1(∆)   Speedup
BERT                 91.51   1.00×    96.20   1.00×    95.04   1.00×    98.48   1.00×    98.04   1.00×
LSTM-CRF             -0.92   -        -2.48   -        -6.30   -        -0.79   -        -3.03   -
∼2×  DistilBERT-6L   -0.17   2.00×    -0.27   2.00×    -0.19   2.00×    -0.12   2.00×    -0.15   2.00×
∼2×  SENTEE          -0.49   2.01×    -0.19   1.95×    -0.35   1.87×    -0.02   1.99×    -0.11   2.13×
∼2×  TOKEE           -0.13   1.96×    -0.07   2.04×    -0.17   1.93×    -0.05   2.02×    -0.07   2.10×
∼3×  DistilBERT-4L   -0.93   3.00×    -0.74   3.00×    -1.21   3.00×    -0.19   3.00×    -0.43   3.00×
∼3×  SENTEE          -1.98   3.03×    -1.63   3.12×    -2.47   2.95×    -0.11   3.03×    -0.47   2.99×
∼3×  TOKEE           -0.67   2.95×    -0.30   3.01×    -1.04   2.89×    -0.06   3.01×    -0.20   3.01×
∼4×  DistilBERT-3L   -1.04   4.00×    -1.34   4.00×    -4.16   4.00×    -0.21   4.00×    -1.01   4.00×
∼4×  SENTEE          -2.96   3.98×    -2.37   3.91×    -4.78   3.96×    -0.83   4.02×    -1.23   3.98×
∼4×  TOKEE           -2.17   3.89×    -1.29   3.98×    -4.03   3.89×    -0.14   3.85×    -0.53   3.97×

Table 1: Comparison on the BERT backbone. The performance of LSTM-CRF is taken from previous papers (Tian et al., 2020; Mengge et al., 2020; Yan et al., 2019; Gui et al., 2018); the other results are implemented by ourselves.

Task           NER              POS              CWS
Dataset        Ontonotes 4.0    CTB5 POS         UD Seg
Model          F1(∆)/Speedup    F1(∆)/Speedup    F1(∆)/Speedup

RoBERTa-base
RoBERTa        79.32 / 1.00×    96.29 / 1.00×    97.91 / 1.00×
RoBERTa 6L     -4.03 / 2.00×    -0.11 / 2.00×    -0.28 / 2.00×
SENTEE         -0.55 / 1.98×    -0.08 / 1.85×    -0.08 / 2.01×
TOKEE          -0.19 / 2.01×    -0.06 / 1.98×    -0.02 / 1.97×
RoBERTa 4L     -11.97 / 3.00×   -1.59 / 3.00×    -0.76 / 3.00×
SENTEE         -5.61 / 3.03×    -1.02 / 2.80×    -0.51 / 2.99×
TOKEE          -0.58 / 3.05×    -0.51 / 3.01×    -0.12 / 3.00×

ALBERT-base
ALBERT         76.24 / 1.00×    95.73 / 1.00×    97.00 / 1.00×
ALBERT 6L      -5.78 / 2.00×    -0.42 / 2.00×    -0.20 / 2.00×
SENTEE         -1.25 / 2.01×    -0.23 / 1.97×    -0.02 / 1.98×
TOKEE          -0.43 / 1.97×    -0.08 / 2.00×    -0.03 / 2.01×
ALBERT 4L      -12.93 / 3.00×   -2.15 / 3.00×    -1.38 / 3.00×
SENTEE         -6.90 / 2.83×    -1.47 / 3.11×    -0.41 / 2.97×
TOKEE          -0.87 / 2.87×    -0.58 / 2.89×    -0.06 / 2.98×

Table 2: Results on RoBERTa and ALBERT.

4 Experiment

4.1 Computational Cost Measure

We use the average number of floating-point operations (FLOPs) as the measure of computational cost, which denotes how many floating-point operations the model performs for a single sample. FLOPs is a universal measure since it is independent of the model's running environment (CPU, GPU or TPU) and reflects the theoretical running time of the model. In general, the lower a model's FLOPs, the faster its inference.

4.2 Experimental Setup

4.2.1 Dataset

To verify the effectiveness of our methods, we conduct experiments on ten English and Chinese sequence labeling datasets, covering NER: CoNLL2003 (Tjong Kim Sang and De Meulder, 2003), Twitter NER (Zhang et al., 2018), Ontonotes 4.0 (Chinese) (Weischedel et al., 2011), Weibo (Peng and Dredze, 2015; He and Sun, 2017) and CLUE NER (Xu et al., 2020); POS: ARK Twitter (Gimpel et al., 2011; Owoputi et al., 2013), CTB5 POS (Xue et al., 2005) and UD POS (Nivre et al., 2016); CWS: CTB5 Seg (Xue et al., 2005) and UD Seg (Nivre et al., 2016). Besides standard benchmark datasets like CoNLL2003 and Ontonotes 4.0, we also choose datasets closer to real-world applications, such as Twitter NER and Weibo from the social media domain, to verify the practical utility of our methods. We use the same dataset preprocessing and splits as in previous work (Huang et al., 2015; Mengge et al., 2020; Jia et al., 2020; Tian et al., 2020; Nguyen et al., 2020).

4.2.2 Baseline

We compare our methods with three baselines:

• BiLSTM-CRF (Huang et al., 2015; Ma and Hovy, 2016): the most widely used model for sequence labeling tasks before pre-trained language models prevailed in NLP.

• BERT: the powerful stacked Transformer encoder pre-trained on large-scale corpora, which we use as the backbone of our methods.

• DistilBERT: the most well-known distilled version of BERT. Huggingface released a 6-layer DistilBERT for English (Sanh et al., 2019). For comparison, we distill {3, 4}-layer and {3, 4, 6}-layer DistilBERT models for English and Chinese, respectively, using the same method.

4.2.3 Hyper-Parameters

For all datasets, we use a batch size of 10. We perform grid search over the learning rate in {5e-6, 1e-5, 2e-5} and choose the learning rate and the model based on the development set. We use the AdamW optimizer (Loshchilov and Hutter, 2019). The warmup step and weight decay are set to 0.05 and 0.01, respectively.
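For reference, a sketch of this optimization setup, assuming the standard PyTorch AdamW implementation and a linear warmup schedule from `transformers`; reading the 0.05 warmup value as a proportion of training steps is our interpretation.

```python
# Sketch of the fine-tuning setup of Section 4.2.3; the wiring is illustrative.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, lr: float, num_training_steps: int):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.05 * num_training_steps),  # warmup 0.05, read as a proportion
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Learning rates searched over, with the best model chosen on the development set.
candidate_lrs = [5e-6, 1e-5, 2e-5]
```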
4.3 Main Results

For English datasets, we use the 'BERT-base-cased' model released by Google (Devlin et al., 2019) as the backbone. For Chinese datasets, we use 'BERT-wwm' released by Cui et al. (2019). The DistilBERT models are distilled from the backbone BERT.

To compare our methods fairly with the baselines, we tune the speedup ratio of our methods to be consistent with the corresponding static baseline. We report the average performance over 5 runs with different random seeds. The overall results are shown in Table 1, where the speedup is measured relative to the backbone. Both SENTEE and TOKEE bring little performance drop and outperform DistilBERT at a speedup ratio of 2×, matching the effect of existing early-exit methods for text classification. Under higher speedups of 3× and 4×, SENTEE shows its weakness, while TOKEE still maintains a certain level of performance, and under 2∼4× speedup ratios TOKEE has a lower performance drop than DistilBERT. What is more, on datasets where BERT clearly outperforms LSTM-CRF, e.g., Chinese NER, TOKEE (4×) on BERT still outperforms LSTM-CRF significantly. This indicates its potential utility in complicated real-world scenarios.

To explore the fine-grained performance change under different speedup ratios, we visualize the speedup-performance trade-off curves on six datasets in Figure 2. We observe that:

• Before the speedup ratio rises to a certain turning point, there is almost no drop in performance. After that, the performance drops gradually. This shows that our methods keep the strength of existing early-exit methods (Xin et al., 2020).

• As the speedup rises, TOKEE encounters the turning point later than SENTEE, and after both methods pass the turning point, SENTEE's performance degrades more drastically than TOKEE's. Both observations indicate the higher speedup ceiling of TOKEE.

• On some datasets, such as CoNLL2003, we observe a small performance improvement under low speedup ratios. We attribute this to the potential regularization brought by early-exit, such as alleviating overthinking (Kaya et al., 2019).

To verify the versatility of our method over different PTMs, we also conduct experiments on two well-known BERT variants, RoBERTa (Liu et al., 2019)[5] and ALBERT (Lan et al., 2020)[6], as shown in Table 2. SENTEE and TOKEE again significantly outperform exiting at a static internal layer of the backbone on three representative datasets of the corresponding tasks. For RoBERTa and ALBERT, we also observe that TOKEE performs better than SENTEE under high speedup ratios.

[5] https://github.com/ymcui/Chinese-BERT-wwm.
[6] https://github.com/brightmart/albert_zh.

4.4 Analysis

In this section, we conduct a set of detailed analyses of our methods.

4.4.1 The Effect of Window Size

We show the performance change under different k in Figure 3, keeping the speedup ratio consistent. We observe that: (1) When k is 0, i.e., using token-independent uncertainty instead of window-based uncertainty, the performance is almost the lowest across the different speedup ratios, because local dependency is not considered at all.
This shows the necessity of the window-based uncertainty. (2) When k is relatively large, it brings a significant performance drop under high speedup ratios (3× and 4×), similar to SENTEE. (3) It is necessary to choose an appropriate k under high speedup ratios, where the effect of different k has a high variance.

[Figure 2: The performance-speedup trade-off curves of SENTEE and TOKEE on six datasets (CoNLL2003, Ontonotes 4.0, CTB5 POS, UD POS, CTB5 Seg, UD Seg).]

[Figure 3: The performance change over different window sizes k under the same speedup ratios (∼2×, ∼3×, ∼4×).]

4.4.2 Accuracy vs. Uncertainty

Liu et al. (2020) verified that 'the lower the uncertainty, the higher the accuracy' on text classification. Here, we verify our window-based uncertainty on sequence labeling. In detail, we examine the entire window-based uncertainty and its specific hyper-parameter, k, on CoNLL2003, as shown in Figure 4. For the uncertainty, we take the 4th and 8th off-ramps and calculate their accuracy in each uncertainty interval, with k = 2. The result shown in Figure 4(a) indicates that 'the lower the window-based uncertainty, the higher the accuracy', similar to text classification. For k, we set a fixed threshold of 0.3 and calculate the accuracy of tokens whose window-based uncertainty is smaller than the threshold under different k, as shown in Figure 4(b). The result shows that, as k increases: (1) The accuracy of the screened tokens is higher. This shows that the wider a token's low-uncertainty neighborhood, the more accurate the token's prediction, which also verifies the validity of the window-based uncertainty strategy. (2) The accuracy improvement slows down. This shows the low relevance of distant tokens' uncertainty and explains why a large k does not perform well under high speedup ratios: it does not make exits more accurate but only slows them down.

[Figure 4: The relation of off-ramp accuracy to (a) window-based uncertainty and (b) window size k. Two off-ramps, from a shallow layer (4) and a deep layer (8), are analyzed on CoNLL2003.]

4.4.3 Influence of Sequence Length

Transformer-based PTMs, e.g., BERT, face a challenge in processing long text due to the O(N²d) computational complexity brought by self-attention.
Since TOKEE reduces the per-layer computational complexity from O(N²d + Nd²) to O(NMd + Md²) while SENTEE does not, we explore their effect over different sentence lengths. We compare the highest speedup ratio of TOKEE and SENTEE at which the performance drop is < 1 on Ontonotes 4.0, as shown in Figure 5. We observe that TOKEE has a stable computational cost saving as the sentence length increases, whereas SENTEE's speedup ratio gradually decreases. We give an intuitive explanation: in general, a longer sentence has more tokens, so it is more difficult for the model to give all of them confident predictions at the same layer. This comparison reveals the potential of TOKEE for accelerating long-text inference.

[Figure 5: Influence of sentence length on the achievable speedup ratio of TOKEE (k = 2) and SENTEE (k = +∞) on Ontonotes 4.0.]

4.4.4 Effects of Self-Sampling Fine-Tuning

To verify the effect of the self-sampling fine-tuning described in Section 3.3.2, we compare it with random sampling and with no extra fine-tuning on CoNLL2003. The performance-speedup trade-off curves of TOKEE are shown in Figure 6: self-sampling is always better than random sampling for TOKEE, and the gap grows as the speedup ratio rises. This shows that self-sampling helps more in reducing the gap between training and inference. As for no extra fine-tuning, it deteriorates drastically at high speedup ratios, but it roughly keeps a certain capability at low speedup ratios, which we attribute to the residual connections of the PTM; similar results were reported by Veit et al. (2016).

[Figure 6: Comparison between different training strategies for TOKEE (self-sampling, random sampling, and no extra fine-tuning) on CoNLL2003.]

4.4.5 Layer Distribution of Early-Exit

In TOKEE, because of the halt-and-copy mechanism, each token goes through a different number of PTM layers according to its difficulty. We show the average distribution of the exiting layers of a sentence's tokens under different speedup ratios on CoNLL2003 in Figure 7, together with the average exiting layer of SENTEE under the same speedup ratios. We observe that as the speedup ratio rises, more tokens exit at earlier layers, but a few tokens still go through the deeper layers even at 4×; meanwhile, SENTEE's average exiting layer drops to 2.5, where the PTM's encoding power is severely cut down. This gives an intuitive explanation of why TOKEE is more effective than SENTEE under high speedup ratios: although both SENTEE and TOKEE can dynamically adjust the computational cost at the sample level, TOKEE can do so in a more fine-grained way.

[Figure 7: The distribution of TOKEE exiting layers within a sentence under ∼2×, ∼3×, and ∼4× speedup on CoNLL2003, together with the corresponding ∆F1 of Sent-EE and Token-EE.]

5 Related Work

PTMs are powerful but have a high computational cost. Many attempts have been made to accelerate them. One line of work reduces the model size, such as distillation (Sanh et al., 2019; Jiao et al., 2020), structural pruning (Michel et al., 2019; Fan et al., 2020), and quantization (Shen et al., 2020).

Another line of work is early-exit, which dynamically adjusts the number of encoding layers for different samples (Liu et al., 2020; Xin et al., 2020; Schwartz et al., 2020; Zhou et al., 2020; Li et al., 2020).
While these works introduced the early-exit mechanism for simple classification tasks, our methods are proposed for the more complicated scenario of sequence labeling, where there is more than one prediction probability and the dependency among token exits must be considered. Elbayad et al. (2020) proposed the Depth-Adaptive Transformer to accelerate machine translation. However, their early-exit mechanism is designed for auto-regressive sequence generation, in which tokens must exit in left-to-right order, so it is unsuitable for language understanding tasks. Different from their method, our early-exit mechanism considers the exit of all tokens simultaneously.

6 Conclusion and Future Work

In this work, we propose two early-exit mechanisms for sequence labeling: SENTEE and TOKEE. The former is a simple extension of sequence-level early-exit, while the latter is specially designed for sequence labeling and can allocate computational cost in a more fine-grained way. We equip TOKEE with window-based uncertainty and self-sampling fine-tuning to make it more robust and faster. Detailed analysis verifies their effectiveness: SENTEE and TOKEE can achieve 2× and 3∼4× speedup, respectively, with minimal performance drop.

For future work, we wish to explore: (1) leveraging the label information of exited tokens to help the exiting of the remaining tokens; (2) introducing CRF or other global decoding methods into early-exit for sequence labeling.

Acknowledgments

We thank the anonymous reviewers for their detailed reviews and great suggestions. This work was supported by the National Key Research and Development Program of China (No. 2020AAA0106700), the National Natural Science Foundation of China (No. 62022027) and the Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AD01).
References

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. 2019. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA - Short Papers, pages 42–47. The Association for Computer Linguistics.

Tao Gui, Qi Zhang, Jingjing Gong, Minlong Peng, Di Liang, Keyu Ding, and Xuanjing Huang. 2018. Transferring from formal newswire domain with hypernet for Twitter POS tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2540–2549, Brussels, Belgium. Association for Computational Linguistics.

Hangfeng He and Xu Sun. 2017. F-score driven max margin neural network for named entity recognition in Chinese social media. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 713–718. Association for Computational Linguistics.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, and Masashi Sugiyama. 2020. Do we need zero training loss after achieving zero training error? In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4604–4614. PMLR.

Chen Jia, Yuefeng Shi, Qinrong Yang, and Yue Zhang. 2020. Entity enhanced BERT pre-training for Chinese NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6384–6396, Online. Association for Computational Linguistics.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 2020, pages 4163–4174. Association for Computational Linguistics.

Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3301–3310. PMLR.

Zhen Ke, Liang Shi, Erli Meng, Bin Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Unified multi-criteria Chinese word segmentation with BERT.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Lei Li, Yankai Lin, Shuhuai Ren, Deli Chen, Xuancheng Ren, Peng Li, Jie Zhou, and Xu Sun. 2020. Accelerating pre-trained language models via calibrated cascade. CoRR, abs/2012.14682.

Weijie Liu, Peng Zhou, Zhiruo Wang, Zhe Zhao, Haotang Deng, and Qi Ju. 2020. FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6035–6044, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Xue Mengge, Bowen Yu, Zhenyu Zhang, Tingwen Liu, Yue Zhang, and Bin Wang. 2020. Coarse-to-fine pre-training for named entity recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6345–6354, Online. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 14014–14024.

Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9–14, Online. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016. European Language Resources Association (ELRA).

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 380–390. The Association for Computational Linguistics.

Nanyun Peng and Mark Dredze. 2015. Named entity recognition for Chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 548–554, Lisbon, Portugal. Association for Computational Linguistics.

Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. SCIENCE CHINA Technological Sciences, 63(10):1872–1897.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop.

Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Online. Association for Computational Linguistics.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8815–8821.

Yuanhe Tian, Yan Song, Xiang Ao, Fei Xia, Xiaojun Quan, Tong Zhang, and Yonggang Wang. 2020. Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8286–8296, Online. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Andreas Veit, Michael J. Wilber, and Serge J. Belongie. 2016. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 550–558.

Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg, Eduard Hovy, Robert Belvin, et al. 2011. OntoNotes release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.

Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2246–2251, Online. Association for Computational Linguistics.

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4762–4772, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Nat. Lang. Eng., 11(2):207–238.

Hang Yan, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. TENER: Adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.

Jiahui Yu and Thomas S. Huang. 2019. Universally slimmable networks and improved training techniques. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 1803–1811. IEEE.

Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets.

Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian J. McAuley, Ke Xu, and Furu Wei. 2020. BERT loses patience: Fast and robust inference with early exit. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.