DISTILHUBERT: SPEECH REPRESENTATION LEARNING BY LAYER-WISE DISTILLATION OF HIDDEN-UNIT BERT


Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee

College of Electrical Engineering and Computer Science, National Taiwan University

ABSTRACT

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks. Despite the success of these methods, they require large memory and high pre-training costs, making them inaccessible for researchers in academia and small companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations from a HuBERT model directly. This method reduces HuBERT's size by 75% and makes it 73% faster while retaining most of its performance on ten different tasks. Moreover, DistilHuBERT requires little training time and data, opening the possibility of pre-training personal and on-device SSL models for speech.

Index Terms— Self-supervised learning, speech representation learning, knowledge distillation, model compression

1. INTRODUCTION

Recently, self-supervised learning (SSL) methods for speech representation learning have succeeded in many tasks [1–18]. They learn from targets derived from unlabeled speech data and can be roughly categorized into two classes by learning objective: generative and discriminative. Generative methods either reconstruct masked acoustic features [1–5] or generate future acoustic features [6–8]. Discriminative methods learn either by contrastive learning [9–14] or by classifying pseudo labels [15].

Many previous SSL methods for speech representation learning are designed for a few tasks like speech recognition. Therefore, the Speech processing Universal PERformance Benchmark (SUPERB) [19] and LeBenchmark [20] were developed to evaluate the effectiveness of these methods on general speech processing applications. These benchmarks comprehensively assess the capability of pre-trained models by applying them to multiple downstream tasks covering recognition, detection, semantics, and speaker identification.

Although recent speech SSL methods show excellent performance across various tasks, many require large memory and high pre-training costs, including wav2vec 2.0 [13] and HuBERT [15]. These limitations make the models unsuitable for on-device computation and difficult for researchers in academia and small corporations to use [21]. Knowledge distillation is a common method for compressing models [22], in which a small student model learns to generate the teacher model's outputs or hidden representations. Distilling knowledge has been shown to be effective in NLP, with DistilBERT [23] and TinyBERT [24] as good examples. However, we found these approaches ineffective for distilling speech SSL models, and few studies have investigated this problem [25, 26]. Therefore, we propose a novel multi-task learning framework to layer-wise distill hidden representations of SSL speech models.

In this work, we distill HuBERT to obtain DistilHuBERT. DistilHuBERT uses three prediction heads to respectively predict the outputs of the 4th, 8th, and 12th HuBERT hidden layers. After training, the heads are removed because the multi-task learning paradigm forces the DistilHuBERT model to learn representations containing rich information. DistilHuBERT reduces HuBERT's size by 75% and speeds it up by 73%, retaining most of its performance while requiring less training time. Moreover, we offer comprehensive analyses of the proposed methods and show DistilHuBERT's ability to distill knowledge with little data. To our knowledge, this is the first attempt to distill speech representations from SSL pre-trained models directly.

We thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources.

2. METHODS

2.1. HuBERT

The proposed distillation approach can be applied to any self-supervised model. This paper takes HuBERT [15] as the teacher because it outperforms other methods in almost all tasks in SUPERB [19], showing that it offers good speech representations. HuBERT consists of a CNN and a transformer encoder [27] trained to classify randomly masked frames into pseudo labels. The labels are obtained by clustering MFCCs or another model's hidden units. Despite HuBERT's success, it has two major disadvantages. First, HuBERT models have 95M to 1B parameters, consuming large memory and yielding slow inference. Second, the training costs are high, e.g., HuBERT Base pre-training requires 32 GPUs and 2k GPU hours. These issues make academia and small corporations unable to pre-train their own models or apply these technologies to products. Hence, this paper aims to alleviate these problems.

2.2. DistilHuBERT

In a typical knowledge distillation method, a student learns to generate the teacher's output. However, different layers in self-supervised speech models contain different information, such as speaker identity or semantics [13, 28, 29]. An SSL model's output layer does not always offer rich information. For instance, wav2vec 2.0 stores phonetic information in its middle layers [28, 30]. Therefore, a student that only learns the teacher's last layer might benefit little. One possible way to solve this problem is to let different hidden layers of the student learn from different layers of the teacher [24], but the student must then be deep enough to have multiple layers, which works against keeping the model small.

This paper proposes a novel teacher-student framework for speech representation learning by multi-task knowledge distillation, as shown in Fig. 1.
[Figure 1: Diagram of the teacher (frozen HuBERT) and the student (DistilHuBERT). Prediction heads 1–3 of the student learn to match the outputs of HuBERT encoder layers 4, 8, and 12; the student's CNN and first two encoder layers are initialized from the teacher. The pre-trained student is then frozen and applied to downstream tasks such as speech recognition, keyword spotting, intent classification, slot filling, query-by-example, emotion recognition, speaker identification, speaker verification, and speaker diarization.]

Fig. 1. The proposed DistilHuBERT framework. (I) Before pre-training, the student is initialized with the teacher's parameters. Then, the student's prediction heads learn to generate the teacher's hidden representations by minimizing the loss function L = L^(4) + L^(8) + L^(12). (II) The pre-trained DistilHuBERT model is frozen to generate speech representations for downstream tasks.

We distill HuBERT to obtain DistilHuBERT, consisting of a CNN feature extractor and a small transformer encoder. The basic idea of our approach to knowledge distillation is to learn to generate multiple teacher hidden representations from a shared representation. Therefore, we propose predicting the teacher's hidden representations with separate prediction heads, as shown in Fig. 1(I). This objective is a multi-task learning paradigm that encourages the transformer encoder to produce compact representations for multiple prediction heads. After pre-training, we remove the heads. The model parameters are frozen, and the output representations are used for various downstream tasks, as shown in Fig. 1(II).

Objective Function. Here, we describe the objective function of DistilHuBERT. We denote the D-dimensional feature vectors at time step t produced by the teacher's l-th layer and the student's corresponding prediction head respectively as $\hat{\mathbf{h}}_t^{(l)}$ and $\mathbf{h}_t^{(l)}$. The loss function is

$$
\mathcal{L}^{(l)} = \mathcal{L}_{\ell_1}^{(l)} + \lambda \, \mathcal{L}_{\cos}^{(l)}
= \sum_{t=1}^{T} \left[ \frac{1}{D} \left\| \mathbf{h}_t^{(l)} - \hat{\mathbf{h}}_t^{(l)} \right\|_1 - \lambda \log \sigma\!\left( \cos\!\left( \mathbf{h}_t^{(l)}, \hat{\mathbf{h}}_t^{(l)} \right) \right) \right],
\tag{1}
$$

where T is the number of time steps, and σ and cos(·, ·) respectively denote the sigmoid activation and cosine similarity. Minimizing L^(l) is equivalent to simultaneously minimizing the ℓ1 distance and maximizing the cosine similarity between the hidden representations. We empirically found that considering both the ℓ1 and cosine terms yielded better performance than considering only one of them. λ > 0 controls the contribution of the cosine similarity loss.
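For concreteness, the per-layer loss in Eq. (1) can be sketched in PyTorch as below. This is a minimal illustration rather than the authors' released code: the names `heads` and `shared` in the usage comment are hypothetical, the prediction head is assumed to be a simple projection, and batch averaging is an assumption; tensors are shaped (batch, time, dim).

```python
import torch
import torch.nn.functional as F

def layer_distill_loss(student_pred: torch.Tensor,
                       teacher_hidden: torch.Tensor,
                       lam: float = 1.0) -> torch.Tensor:
    """Eq. (1): per-dimension L1 distance plus a log-sigmoid cosine term.

    student_pred, teacher_hidden: (batch, T, D) tensors, where student_pred is
    the output of one prediction head and teacher_hidden is the matching
    HuBERT layer's output.
    """
    # (1/D) * ||h_t - h_hat_t||_1, summed over the T time steps
    l1 = (student_pred - teacher_hidden).abs().mean(dim=-1).sum(dim=1)
    # -log sigmoid(cos(h_t, h_hat_t)), summed over the T time steps
    cos = F.cosine_similarity(student_pred, teacher_hidden, dim=-1)
    cos_term = -F.logsigmoid(cos).sum(dim=1)
    # Average over the batch (an assumption; Eq. (1) is written per utterance).
    return (l1 + lam * cos_term).mean()

# Total objective: sum the per-layer losses over the predicted teacher layers,
# e.g. layers 4, 8, and 12 (illustrative names):
# loss = sum(layer_distill_loss(heads[l](shared), teacher_h[l]) for l in (4, 8, 12))
```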
Parameter Initialization. Following DistilBERT [23], we initialized DistilHuBERT with HuBERT's CNN extractor and its first two transformer layers.
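A minimal sketch of this initialization step is shown below, assuming the teacher and student are fairseq/S3PRL-style modules exposing `feature_extractor` and `encoder.layers` attributes; these attribute names are illustrative, not prescribed by the paper.

```python
def init_student_from_teacher(student, teacher, num_layers: int = 2):
    """Copy the teacher's CNN extractor and its first transformer layers
    into the student, as in DistilHuBERT's parameter initialization.

    Assumes both models expose `feature_extractor` and `encoder.layers`
    (illustrative attribute names).
    """
    # Copy the convolutional waveform feature extractor.
    student.feature_extractor.load_state_dict(
        teacher.feature_extractor.state_dict())
    # Copy the first `num_layers` transformer encoder layers.
    for s_layer, t_layer in zip(student.encoder.layers,
                                teacher.encoder.layers[:num_layers]):
        s_layer.load_state_dict(t_layer.state_dict())
```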
Reducing Computation for Distillation. We note that the number of prediction heads can range from 1 to L, where L is the number of hidden layers in the self-supervised speech model to be distilled. Because neighboring layers' representations might contain similar information, in our implementation the student model only predicts some specific layers of the teacher and skips the others. For example, in Fig. 1, only the 4th, 8th, and 12th HuBERT hidden layers' outputs are predicted. In the experiments, we show that predicting these three layers is sufficient.

3. EXPERIMENTS

3.1. Experimental Setup

Experiments were implemented with S3PRL [1, 2] and fairseq [31].
Model. The HuBERT model had a 7-layer CNN and a 12-layer transformer encoder, while DistilHuBERT had a similar architecture, but its transformer encoder had only two layers. DistilHuBERT had three prediction heads respectively predicting the 4th, 8th, and 12th HuBERT hidden layers. We applied past distillation methods to DistilHuBERT for comparison [23, 24]. As shown in Table 1, "predict last layer" predicted only the final output of HuBERT. "Predict w/ hidden" was trained to predict the three chosen layers with its own hidden layers, i.e., the CNN and the two transformer encoder layers respectively predicted the 4th, 8th, and 12th HuBERT layers. "W/ prediction heads" preserved the three prediction heads after pre-training.
Data. We used the 960-hour LibriSpeech dataset [32] for knowledge distillation, except that Sec. 3.6 used the English Wall Street Journal (WSJ) [33] and the Mandarin AISHELL-1 [34] speech corpora.
Pre-training. The default DistilHuBERT was trained on a 32GB V100 GPU for 200k updates with a batch size of 24 utterances, taking roughly 55 hours. The learning rate was linearly increased to 2e-4 over the first 7% of updates and then linearly decreased to zero, with λ = 1 in Eq. (1). Note that training DistilHuBERT was far cheaper than HuBERT, since the HuBERT Base model required 2k GPU hours.
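The warmup-then-decay schedule described above can be sketched with a standard PyTorch LambdaLR; the optimizer choice (Adam) and the `student` module name in the usage comment are assumptions for illustration, while the peak learning rate, warmup fraction, and number of updates follow the settings reported here.

```python
import torch

def linear_warmup_decay(total_steps: int, warmup_frac: float = 0.07):
    """Linear warmup over the first `warmup_frac` of updates, then linear
    decay to zero, matching the schedule described in Sec. 3.1."""
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return lr_lambda

# Example usage (optimizer choice is an assumption):
# optimizer = torch.optim.Adam(student.parameters(), lr=2e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lr_lambda=linear_warmup_decay(200_000))
# Call scheduler.step() after each of the 200k updates.
```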
3.2. SUPERB

DistilHuBERT was evaluated on SUPERB [19]. Each SSL model was frozen, and a trainable weighted sum summarized its hidden representations. Each downstream task used a simple model trained on the summarized representations with minimal labeled data. The tasks included phoneme recognition (PR), keyword spotting (KS), intent classification (IC), speaker identification (SID), emotion recognition (ER), automatic speech recognition (ASR), query-by-example spoken term detection (QbE), slot filling (SF), automatic speaker verification (ASV), and speaker diarization (SD).

Results on SUPERB are shown in Table 1.¹

¹ We only show some results because of the paper length limitation. A complete comparison is available at https://superbbenchmark.org/.
Table 1. Results on SUPERB [19]. The metrics include accuracy (Acc%), phoneme error rate (PER%), word error rate (WER%), maximum term weighted value (MTWV), F1 score (F1%), concept error rate (CER%), equal error rate (EER%), and diarization error rate (DER%).

Method | # param. (Millions) | PR PER↓ | KS Acc↑ | IC Acc↑ | SID Acc↑ | ER Acc↑ | ASR WER↓ (+LM) | QbE MTWV↑ | SF F1↑ / CER↓ | ASV EER↓ | SD DER↓ | Rank↓

(I) Baselines [19]
FBANK | 0 | 82.01 | 8.63 | 9.10 | 8.5E-4 | 35.39 | 23.18 / 15.21 | 0.0058 | 69.64 / 52.94 | 9.56 | 10.05 | 8.8
TERA | 21.33 | 49.17 | 89.48 | 58.42 | 57.57 | 56.27 | 18.17 / 12.16 | 0.0013 | 67.50 / 54.17 | 15.89 | 9.96 | 7.9
wav2vec | 32.54 | 31.58 | 95.59 | 84.92 | 56.56 | 59.79 | 15.86 / 11.00 | 0.0485 | 76.37 / 43.71 | 7.99 | 9.90 | 6.5
DeCoAR 2.0 | 85.12 | 14.93 | 94.48 | 90.80 | 74.42 | 62.47 | 13.02 / 9.07 | 0.0406 | 83.28 / 34.73 | 7.16 | 6.59 | 4.1
HuBERT (teacher) | 94.68 | 5.41 | 96.30 | 98.34 | 81.42 | 64.92 | 6.42 / 4.79 | 0.0736 | 88.53 / 25.20 | 5.11 | 5.88 | 1.0

(II) Distillation Baselines
predict last layer | 24.67 | 16.71 | 96.07 | 95.57 | 43.44 | 62.81 | 13.67 / 9.43 | 0.0423 | 81.52 / 37.31 | 7.29 | 6.45 | 4.6
predict w/ hidden | 23.49 | 17.43 | 95.46 | 94.91 | 54.90 | 61.49 | 14.95 / 10.27 | 0.0529 | 83.73 / 35.14 | 7.41 | 6.86 | 5.0

(III) Proposed
w/ prediction heads | 27.03 | 14.35 | 96.20 | 94.17 | 72.83 | 62.73 | 13.26 / 8.99 | 0.0451 | 83.14 / 35.53 | 6.85 | 5.97 | 3.3
DistilHuBERT | 23.49 | 16.27 | 95.98 | 94.99 | 73.54 | 63.02 | 13.37 / 9.21 | 0.0511 | 82.57 / 35.59 | 8.55 | 6.19 | 3.8

[Figure 2: Three scatter plots of downstream performance versus number of pre-trained model parameters (M) for (a) IC, (b) SID, and (c) ASR (w/o LM), comparing FBANK, PASE+, Mockingjay, APC, NPC, CPC, TERA, vq-wav2vec, wav2vec, DeCoAR 2.0, wav2vec 2.0, HuBERT, and DistilHuBERT.]

Fig. 2. Comparisons of pre-trained model size vs. performance on three tasks: IC, SID, and ASR.

The rightmost column of Table 1 shows each method's ranking score, obtained by averaging its ranks over the ten tasks and offering an easy way to compare the methods. First, the ranking scores revealed that DistilHuBERT outperformed all methods except HuBERT (its teacher), indicating that DistilHuBERT retained most of HuBERT's performance (sec. (III) vs. (I)). Next, the distillation baselines in section (II) performed worse than DistilHuBERT, especially in the PR and SID tasks, showing that the proposed framework distilled better representations using separate prediction heads. Moreover, DistilHuBERT degraded only slightly without the prediction heads while reducing the model size by 13%, implying the prediction heads were redundant after pre-training.

Furthermore, Fig. 2 visualizes the relation between model size and performance on IC, SID, and ASR. We chose IC, SID, and ASR for comparison because performance differences are easier to observe on these tasks. In Fig. 2(a)(b), DistilHuBERT showed competitive performance compared with HuBERT and wav2vec 2.0 and surpassed all other similar-sized methods in IC and SID. In Fig. 2(c), DistilHuBERT degraded more in ASR yet was still better than most other SSL methods, showing a better trade-off between model size and performance.

Table 2. Comparison of different SSL model sizes and inference time. The inference was performed on 4 CPUs, extracting all features from the LibriSpeech dev-clean set with a batch size of one. Results were averaged over three runs.

Model | # param. (Millions) | Inf. time (seconds)
HuBERT | 94.68 (100%) | 992 (1.00X)
DeCoAR 2.0 | 85.12 (90%) | 924 (1.07X)
DistilHuBERT | 23.49 (25%) | 574 (1.73X)

3.3. Model Size and Inference Speed

We compared the sizes and inference speed of several methods, as shown in Table 2. DistilHuBERT had a significantly smaller model size than HuBERT and offered a 73% speedup. DistilHuBERT performed as well as DeCoAR 2.0 but was smaller and faster. Combining the results in Tables 1 and 2, DistilHuBERT was fast, small, and powerful. Note that DistilHuBERT can be further compressed by pruning or quantization for on-device computation.
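A rough sketch of the kind of CPU timing behind Table 2 is shown below; it assumes a generic `model(wav)` feature-extraction interface and a list of dev-clean waveform tensors, neither of which is specified in the paper.

```python
import time
import torch

@torch.no_grad()
def time_feature_extraction(model, waveforms, num_threads: int = 4,
                            num_runs: int = 3) -> float:
    """Average wall-clock time (seconds) to extract features for all
    utterances with batch size one, mirroring the Table 2 setup.

    `model` is any callable mapping a (1, samples) waveform tensor to
    features; `waveforms` is a list of 1-D float tensors (assumptions).
    """
    torch.set_num_threads(num_threads)  # restrict inference to 4 CPU threads
    model.eval()
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for wav in waveforms:
            model(wav.unsqueeze(0))  # batch size of one
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```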
Table 3. Ablation study of the techniques used.

Method | IC Acc↑ | SID Acc↑ | ASR WER↓
DistilHuBERT | 94.99 | 73.54 | 13.34
w/o L_cos | 93.36 | 72.54 | 13.74
w/o teacher init | 94.20 | 73.68 | 13.41

Table 4. Results of predicting different HuBERT hidden layers.

Predicted Layers | IC Acc↑ | SID Acc↑ | ASR WER↓
4, 8, 12 | 94.99 | 73.54 | 13.34
4 | 79.09 | 76.85 | 14.90
8 | 96.89 | 61.52 | 12.28
12 | 95.52 | 65.14 | 13.48
4, 8 | 91.67 | 75.36 | 13.73
4, 12 | 91.88 | 74.48 | 14.03
8, 12 | 96.34 | 65.30 | 13.03

Table 5. Pre-training DistilHuBERT with different datasets.

Dataset | Data Size (hours) | IC Acc↑ | SID Acc↑ | ASR WER↓
LibriSpeech | 960 | 94.99 | 73.54 | 13.34
LibriSpeech | 100 | 93.17 | 69.46 | 14.77
WSJ | 81 | 90.22 | 64.14 | 15.59
AISHELL-1 | 150 | 87.29 | 67.65 | 16.42

[Figure 3: Bar chart of the learned summarization weights (0.0–1.0) over the CNN output (feat), the transformer output (hid), and prediction heads 1–12, for the ASR, KS, IC, and SID tasks.]

Fig. 3. Weights of the feature summarization in several SUPERB tasks after training on DistilHuBERT representations (predicting all hidden layers in HuBERT). feat and hid respectively denote the CNN's and the transformer's outputs.

3.4. Ablation Study

Table 3 shows the effectiveness of the cosine similarity loss in Eq. (1) and of teacher initialization. Both techniques benefited DistilHuBERT training, and we thus used them in the rest of the experiments. Removing the cosine similarity loss degraded performance more than removing teacher initialization, showing that the former had more impact on improving DistilHuBERT, while the latter was less effective.

3.5. Layer Selection

This section discusses the selection of HuBERT's hidden layers for DistilHuBERT to learn. Because each HuBERT layer carries different information, we first inspected each layer's importance for several downstream tasks. We trained a DistilHuBERT model with 12 prediction heads to predict all HuBERT layers. The outputs of DistilHuBERT's hidden layers and all prediction heads were weighted and summed by a set of learnable weights, where the weights were jointly trained with the downstream model. The summarized outputs were used for all downstream tasks, each with its own set of weights. The weights were normalized by the averaged ℓ2 norm of each layer's output. We visualized the learned weights in Fig. 3, where larger values indicate higher importance. We only show ASR, KS, IC, and SID because they represent recognition, detection, semantics, and speaker identity tasks; the others showed similar trends.
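The probing setup just described (a learnable weighted sum over the student's internal features and all head outputs, with per-layer ℓ2-norm normalization) can be sketched as follows. This is an illustration of the described procedure, not code from the paper; in particular, the softmax over the layer weights follows the common SUPERB-style convention and is an assumption.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over a list of layer representations,
    trained jointly with the downstream model (Sec. 3.5 probe)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats):
        # layer_feats: list of (batch, T, D) tensors, e.g. the CNN output,
        # the transformer output, and the 12 prediction-head outputs.
        stacked = torch.stack(layer_feats, dim=0)                      # (L, B, T, D)
        # Normalize each layer by its average L2 norm so the learned weights
        # reflect importance rather than feature magnitude.
        norms = stacked.norm(dim=-1).mean(dim=(1, 2), keepdim=True)    # (L, 1, 1)
        stacked = stacked / norms.unsqueeze(-1)
        w = torch.softmax(self.weights, dim=0)                         # (L,)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)              # (B, T, D)
```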
Fig. 3 shows that the features before the prediction heads (hid) were useful, especially for SID. These results corroborated our hypothesis that the shared representation for multiple heads carries rich information. Among the head outputs, the heads for the 6th to 12th layers were also crucial to ASR, KS, and IC because they carry content and semantic information. Based on these findings, we made DistilHuBERT predict the 8th and 12th HuBERT layers. DistilHuBERT additionally predicted the 4th layer because we hypothesized that the bottom layers preserve speaker identity.

To justify our selection method, we conducted experiments predicting different HuBERT layers, as shown in Table 4. First, when predicting only one HuBERT layer, the evaluation scores across the three tasks were imbalanced. Predicting the 4th layer performed the worst in IC and ASR but better in SID than predicting three layers, corroborating our hypothesis that the bottom layers offer speaker identity. Then, when predicting two of the selected layers, the downstream tasks' scores were also imbalanced. Therefore, the three selected layers offered a good combination for learning representations with less biased information.

3.6. Knowledge Distillation with Different Datasets

In practice, the pre-training data for a self-supervised teacher model might be inaccessible or too large to reuse for distillation. This section inspected the generalizability and flexibility of DistilHuBERT by training it with small or out-of-domain datasets. We used the 100-hour subset of LibriSpeech, the 81-hour WSJ, and the Mandarin AISHELL-1 corpora to train DistilHuBERT with the proposed knowledge distillation approach. The results are shown in Table 5.

First, when training with the 100-hour LibriSpeech subset, DistilHuBERT degraded slightly but retained better performance than many other methods (in Table 1), indicating that a smaller dataset was sufficient for distilling knowledge. Next, the smaller WSJ corpus showed more degradation, perhaps because WSJ's average utterance duration is shorter than LibriSpeech's [35] or simply because it is small. Furthermore, training with the Mandarin AISHELL-1 corpus offered better SID than WSJ because speaker identity is independent of language and AISHELL-1 has more data than WSJ. In contrast, AISHELL-1 performed the worst in IC and ASR because of the language mismatch. Although performance was affected by the training data, DistilHuBERT offered better performance than many other SSL methods, even with smaller datasets for distillation.

4. CONCLUSION

This paper proposes DistilHuBERT, a novel framework to layer-wise distill knowledge from HuBERT. DistilHuBERT retained most of HuBERT's performance with only 25% of its size. We provided comprehensive analyses of the proposed methods and demonstrated DistilHuBERT's flexibility and generalizability. Our method can be easily applied to more powerful SSL models.
5. REFERENCES

[1] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP, 2020.
[2] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," TASLP, vol. 29, 2021.
[3] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, "Deep contextualized acoustic representations for semi-supervised speech recognition," in ICASSP, 2020.
[4] S. Ling and Y. Liu, "DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization," arXiv preprint arXiv:2012.06659, 2020.
[5] A. H. Liu, Y.-A. Chung, and J. Glass, "Non-autoregressive predictive coding for learning speech representations from local dependencies," in Interspeech, 2021.
[6] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.
[7] Y.-A. Chung, H. Tang, and J. Glass, "Vector-quantized autoregressive predictive coding," in Interspeech, 2020.
[8] Y.-A. Chung and J. Glass, "Improved speech representations with multi-target autoregressive predictive coding," in ACL, 2020.
[9] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[10] M. Riviere, A. Joulin, P.-E. Mazaré, and E. Dupoux, "Unsupervised pretraining transfers well across languages," in ICASSP, 2020.
[11] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Interspeech, 2019.
[12] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in ICLR, 2020.
[13] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in NeurIPS, 2020.
[14] S. Sadhu, D. He, C.-W. Huang, S. H. Mallidi, M. Wu, A. Rastrow, A. Stolcke, J. Droppo, and R. Maas, "wav2vec-C: A self-supervised model for speech representation learning," in Interspeech, 2021.
[15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," arXiv preprint arXiv:2106.07447, 2021.
[16] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Interspeech, 2019.
[17] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP, 2020.
[18] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[19] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., "SUPERB: Speech processing universal performance benchmark," in Interspeech, 2021.
[20] S. Evain, H. Nguyen, H. Le, M. Z. Boito, S. Mdhaffar, S. Alisamir, Z. Tong, N. Tomashenko, M. Dinarelli, T. Parcollet, A. Allauzen, Y. Estève, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, "LeBenchmark: A reproducible framework for assessing self-supervised representation learning from speech," in Interspeech, 2021.
[21] A. Hannun, "The history of speech recognition to the year 2030," arXiv preprint arXiv:2108.00084, 2021.
[22] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[23] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[24] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, "TinyBERT: Distilling BERT for natural language understanding," in EMNLP, 2020.
[25] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, "Shrinking bigfoot: Reducing wav2vec 2.0 footprint," arXiv preprint arXiv:2103.15760, 2021.
[26] C.-I. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y.-L. Liao, Y.-S. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass, "PARP: Prune, adjust and re-prune for self-supervised speech recognition," in NeurIPS, 2021.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[28] A. Pasad, J.-C. Chou, and K. Livescu, "Layer-wise analysis of a self-supervised speech representation model," arXiv preprint arXiv:2107.04734, 2021.
[29] X. Chang, T. Maekaku, P. Guo, J. Shi, Y.-J. Lu, A. S. Subramanian, T. Wang, S.-w. Yang, Y. Tsao, H.-y. Lee, and S. Watanabe, "An exploration of self-supervised pretrained representations for end-to-end speech recognition," in ASRU, 2021.
[30] A. Baevski, W.-N. Hsu, A. Conneau, and M. Auli, "Unsupervised speech recognition," arXiv preprint arXiv:2105.11084, 2021.
[31] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in NAACL-HLT: Demonstrations, 2019.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[33] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in HLT, 1992.
[34] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[35] H.-J. Chang, H.-y. Lee, and L.-s. Lee, "Towards lifelong learning of end-to-end ASR," in Interspeech, 2021.