DISTILHUBERT: SPEECH REPRESENTATION LEARNING BY LAYER-WISE DISTILLATION OF HIDDEN-UNIT BERT


Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee

College of Electrical Engineering and Computer Science, National Taiwan University

ABSTRACT

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks. Despite the success of these methods, they require large memory and high pre-training costs, making them inaccessible for researchers in academia and small companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations from a HuBERT model directly. This method reduces HuBERT's size by 75% and makes it 73% faster while retaining most of its performance on ten different tasks. Moreover, DistilHuBERT requires little training time and data, opening the possibility of pre-training personal and on-device SSL models for speech.

Index Terms— Self-supervised learning, speech representation learning, knowledge distillation, model compression

1. INTRODUCTION

Recently, self-supervised learning (SSL) methods for speech representation learning have succeeded in many tasks [1–18]. They learn from targets derived from unlabeled speech data and can be roughly categorized into two classes by learning objective: generative and discriminative. Generative methods either reconstruct masked acoustic features [1–5] or generate future acoustic features [6–8]. Discriminative methods learn either by contrastive learning [9–14] or by classifying pseudo labels [15].

Many previous SSL methods for speech representation learning are designed for a few tasks like speech recognition. Therefore, the Speech processing Universal PERformance Benchmark (SUPERB) [19] and LeBenchmark [20] were developed to evaluate the effectiveness of these methods on general speech processing applications. These benchmarks comprehensively assess the capability of pre-trained models by applying them to multiple downstream tasks covering recognition, detection, semantics, and speaker identification.

Although recent speech SSL methods show excellent performance across various tasks, many require large memory and high pre-training costs, including wav2vec 2.0 [13] and HuBERT [15]. These limitations make the models unsuitable for on-device computation and difficult for researchers in academia and small corporations to use [21]. Knowledge distillation is a common method for compressing models [22], in which a small student model learns to generate the teacher model's outputs or hidden representations. Distilling knowledge has been shown to be effective in NLP, with DistilBERT [23] and TinyBERT [24] as good examples. However, we found these approaches ineffective for distilling speech SSL models, and few studies have investigated this problem [25, 26]. Therefore, we propose a novel multi-task learning framework to layer-wise distill hidden representations of SSL speech models.

In this work, we distill HuBERT to obtain DistilHuBERT. DistilHuBERT uses three prediction heads to respectively predict the outputs of the 4th, 8th, and 12th HuBERT hidden layers. After training, the heads are removed because the multi-task learning paradigm forces the DistilHuBERT model to learn representations containing rich information. DistilHuBERT reduces HuBERT's size by 75% and speeds it up by 73%, retaining most of its performance while requiring less training time. Moreover, we offer comprehensive analyses of the proposed methods and show DistilHuBERT's ability to distill knowledge with little data. To our knowledge, this is the first attempt to distill speech representations from SSL pre-trained models directly.

We thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources.

2. METHODS

2.1. HuBERT

The proposed distillation approach can be applied to any self-supervised model. This paper takes HuBERT [15] as the teacher because it outperforms other methods in almost all tasks in SUPERB [19], showing that it offers good speech representations. HuBERT consists of a CNN and a transformer encoder [27] trained to classify randomly masked frames into pseudo labels. The labels are obtained by clustering MFCCs or another model's hidden units. Despite HuBERT's success, it has two major disadvantages. First, HuBERT models have 95M to 1B parameters, consuming large memory and yielding slow inference. Second, the training costs are high, e.g., HuBERT Base pre-training requires 32 GPUs and 2k GPU hours. These issues make academia and small corporations unable to pre-train their own models or apply these technologies to products. Hence, this paper aims to alleviate these problems.

2.2. DistilHuBERT

In a typical knowledge distillation method, a student learns to generate the teacher's output. However, different layers in self-supervised speech models contain different information, such as speaker identity or semantics [13, 28, 29]. An SSL model's output layer does not always offer rich information. For instance, wav2vec 2.0 stores phonetic information in its middle layers [28, 30]. Therefore, a student that only learns the teacher's last layer might benefit little. One possible way to solve this problem is to let different hidden layers of the student learn from different layers of the teacher [24], but the student must then be deep enough to have multiple layers, which works against keeping the model small.

This paper proposes a novel teacher-student framework for speech representation learning by multi-task knowledge distillation, as shown in Fig. 1.
[Figure 1: Diagram of the teacher (frozen HuBERT) and the student (DistilHuBERT). Prediction heads 1–3 of the student learn to match the outputs of HuBERT encoder layers 4, 8, and 12; the student's CNN and first two encoder layers are initialized from the teacher. The pre-trained student is then frozen and applied to downstream tasks such as speech recognition, keyword spotting, intent classification, slot filling, query-by-example, emotion recognition, speaker identification, speaker verification, and speaker diarization.]

Fig. 1. The proposed DistilHuBERT framework. (I) Before pre-training, the student is initialized with the teacher's parameters. Then, the student's prediction heads learn to generate the teacher's hidden representations by minimizing the loss function L = L^(4) + L^(8) + L^(12). (II) The pre-trained DistilHuBERT model is frozen to generate speech representations for downstream tasks.

We distill HuBERT to obtain DistilHuBERT, consisting of a CNN feature extractor and a small transformer encoder. The basic idea of our approach to knowledge distillation is to learn to generate multiple teacher hidden representations from a shared representation. Therefore, we propose predicting the teacher's hidden representations with separate prediction heads, as shown in Fig. 1(I). This objective is a multi-task learning paradigm that encourages the transformer encoder to produce compact representations for multiple prediction heads. After pre-training, we remove the heads. The model parameters are frozen, and the output representations are used for various downstream tasks, as shown in Fig. 1(II).

Objective Function. Here, we describe the objective function of DistilHuBERT. We denote the D-dimensional feature vectors at time step t produced by the teacher's l-th layer and the student's corresponding prediction head respectively as $\hat{\mathbf{h}}_t^{(l)}$ and $\mathbf{h}_t^{(l)}$. The loss function is

$$
\mathcal{L}^{(l)} = \mathcal{L}_{\ell_1}^{(l)} + \lambda \, \mathcal{L}_{\cos}^{(l)}
= \sum_{t=1}^{T} \left[ \frac{1}{D} \left\| \mathbf{h}_t^{(l)} - \hat{\mathbf{h}}_t^{(l)} \right\|_1 - \lambda \log \sigma\!\left( \cos\!\left( \mathbf{h}_t^{(l)}, \hat{\mathbf{h}}_t^{(l)} \right) \right) \right],
\tag{1}
$$

where T is the number of time steps, and σ and cos(·, ·) respectively denote the sigmoid activation and cosine similarity. Minimizing L^(l) is equivalent to simultaneously minimizing the ℓ1 distance and maximizing the cosine similarity between the hidden representations. We empirically found that considering both the ℓ1 and cosine terms yielded better performance than considering only one of them. λ > 0 controls the contribution of the cosine similarity loss.
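For concreteness, the per-layer loss in Eq. (1) can be sketched in PyTorch as below. This is a minimal illustration rather than the authors' released code: the names `heads` and `shared` in the usage comment are hypothetical, the prediction head is assumed to be a simple projection, and batch averaging is an assumption; tensors are shaped (batch, time, dim).

```python
import torch
import torch.nn.functional as F

def layer_distill_loss(student_pred: torch.Tensor,
                       teacher_hidden: torch.Tensor,
                       lam: float = 1.0) -> torch.Tensor:
    """Eq. (1): per-dimension L1 distance plus a log-sigmoid cosine term.

    student_pred, teacher_hidden: (batch, T, D) tensors, where student_pred is
    the output of one prediction head and teacher_hidden is the matching
    HuBERT layer's output.
    """
    # (1/D) * ||h_t - h_hat_t||_1, summed over the T time steps
    l1 = (student_pred - teacher_hidden).abs().mean(dim=-1).sum(dim=1)
    # -log sigmoid(cos(h_t, h_hat_t)), summed over the T time steps
    cos = F.cosine_similarity(student_pred, teacher_hidden, dim=-1)
    cos_term = -F.logsigmoid(cos).sum(dim=1)
    # Average over the batch (an assumption; Eq. (1) is written per utterance).
    return (l1 + lam * cos_term).mean()

# Total objective: sum the per-layer losses over the predicted teacher layers,
# e.g. layers 4, 8, and 12 (illustrative names):
# loss = sum(layer_distill_loss(heads[l](shared), teacher_h[l]) for l in (4, 8, 12))
```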
Parameter Initialization. Following DistilBERT [23], we initialized DistilHuBERT with HuBERT's CNN extractor and its first two transformer layers.
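A minimal sketch of this initialization step is shown below, assuming the teacher and student are fairseq/S3PRL-style modules exposing `feature_extractor` and `encoder.layers` attributes; these attribute names are illustrative, not prescribed by the paper.

```python
def init_student_from_teacher(student, teacher, num_layers: int = 2):
    """Copy the teacher's CNN extractor and its first transformer layers
    into the student, as in DistilHuBERT's parameter initialization.

    Assumes both models expose `feature_extractor` and `encoder.layers`
    (illustrative attribute names).
    """
    # Copy the convolutional waveform feature extractor.
    student.feature_extractor.load_state_dict(
        teacher.feature_extractor.state_dict())
    # Copy the first `num_layers` transformer encoder layers.
    for s_layer, t_layer in zip(student.encoder.layers,
                                teacher.encoder.layers[:num_layers]):
        s_layer.load_state_dict(t_layer.state_dict())
```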
Reducing Computation for Distillation. We note that the number of prediction heads can range from 1 to L, where L is the number of hidden layers in the self-supervised speech model to be distilled. Because neighboring layers' representations might contain similar information, in our implementation the student model only predicts some specific layers of the teacher and skips the others. For example, in Fig. 1, only the 4th, 8th, and 12th HuBERT hidden layers' outputs are predicted. In the experiments, we show that predicting these three layers is sufficient.

3. EXPERIMENTS

3.1. Experimental Setup

Experiments were implemented with S3PRL [1, 2] and fairseq [31].
Model. The HuBERT model had a 7-layer CNN and a 12-layer transformer encoder, while DistilHuBERT had a similar architecture, but its transformer encoder had only two layers. DistilHuBERT had three prediction heads respectively predicting the 4th, 8th, and 12th HuBERT hidden layers. We applied past distillation methods to DistilHuBERT for comparison [23, 24]. As shown in Table 1, "predict last layer" predicted only the final output of HuBERT. "Predict w/ hidden" was trained to predict the three chosen layers with its own hidden layers, i.e., the CNN and the two transformer encoder layers respectively predicted the 4th, 8th, and 12th HuBERT layers. "W/ prediction heads" preserved the three prediction heads after pre-training.
Data. We used the 960-hour LibriSpeech dataset [32] for knowledge distillation, except that Sec. 3.6 used the English Wall Street Journal (WSJ) [33] and the Mandarin AISHELL-1 [34] speech corpora.
Pre-training. The default DistilHuBERT was trained on a 32GB V100 GPU for 200k updates with a batch size of 24 utterances, taking roughly 55 hours. The learning rate was linearly increased to 2e-4 over the first 7% of updates and then linearly decreased to zero, with λ = 1 in Eq. (1). Note that training DistilHuBERT was far cheaper than HuBERT, since the HuBERT Base model required 2k GPU hours.
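The warmup-then-decay schedule described above can be sketched with a standard PyTorch LambdaLR; the optimizer choice (Adam) and the `student` module name in the usage comment are assumptions for illustration, while the peak learning rate, warmup fraction, and number of updates follow the settings reported here.

```python
import torch

def linear_warmup_decay(total_steps: int, warmup_frac: float = 0.07):
    """Linear warmup over the first `warmup_frac` of updates, then linear
    decay to zero, matching the schedule described in Sec. 3.1."""
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return lr_lambda

# Example usage (optimizer choice is an assumption):
# optimizer = torch.optim.Adam(student.parameters(), lr=2e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lr_lambda=linear_warmup_decay(200_000))
# Call scheduler.step() after each of the 200k updates.
```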
3.2. SUPERB

DistilHuBERT was evaluated on SUPERB [19]. Each SSL model was frozen, and a trainable weighted sum summarized its hidden representations. Each downstream task used a simple model trained on the summarized representations with minimal labeled data. The tasks included phoneme recognition (PR), keyword spotting (KS), intent classification (IC), speaker identification (SID), emotion recognition (ER), automatic speech recognition (ASR), query-by-example spoken term detection (QbE), slot filling (SF), automatic speaker verification (ASV), and speaker diarization (SD).

Results on SUPERB are shown in Table 1.¹

¹ We only show some results because of the paper length limitation. A complete comparison is available at https://superbbenchmark.org/.
Table 1. Results on SUPERB [19]. The metrics include accuracy (Acc%), phoneme error rate (PER%), word error rate (WER%), maximum term weighted value (MTWV), F1 score (F1%), concept error rate (CER%), equal error rate (EER%), and diarization error rate (DER%).

Method | # param. (Millions) | PR PER↓ | KS Acc↑ | IC Acc↑ | SID Acc↑ | ER Acc↑ | ASR WER↓ (+LM) | QbE MTWV↑ | SF F1↑ / CER↓ | ASV EER↓ | SD DER↓ | Rank↓

(I) Baselines [19]
FBANK | 0 | 82.01 | 8.63 | 9.10 | 8.5E-4 | 35.39 | 23.18 / 15.21 | 0.0058 | 69.64 / 52.94 | 9.56 | 10.05 | 8.8
TERA | 21.33 | 49.17 | 89.48 | 58.42 | 57.57 | 56.27 | 18.17 / 12.16 | 0.0013 | 67.50 / 54.17 | 15.89 | 9.96 | 7.9
wav2vec | 32.54 | 31.58 | 95.59 | 84.92 | 56.56 | 59.79 | 15.86 / 11.00 | 0.0485 | 76.37 / 43.71 | 7.99 | 9.90 | 6.5
DeCoAR 2.0 | 85.12 | 14.93 | 94.48 | 90.80 | 74.42 | 62.47 | 13.02 / 9.07 | 0.0406 | 83.28 / 34.73 | 7.16 | 6.59 | 4.1
HuBERT (teacher) | 94.68 | 5.41 | 96.30 | 98.34 | 81.42 | 64.92 | 6.42 / 4.79 | 0.0736 | 88.53 / 25.20 | 5.11 | 5.88 | 1.0

(II) Distillation Baselines
predict last layer | 24.67 | 16.71 | 96.07 | 95.57 | 43.44 | 62.81 | 13.67 / 9.43 | 0.0423 | 81.52 / 37.31 | 7.29 | 6.45 | 4.6
predict w/ hidden | 23.49 | 17.43 | 95.46 | 94.91 | 54.90 | 61.49 | 14.95 / 10.27 | 0.0529 | 83.73 / 35.14 | 7.41 | 6.86 | 5.0

(III) Proposed
w/ prediction heads | 27.03 | 14.35 | 96.20 | 94.17 | 72.83 | 62.73 | 13.26 / 8.99 | 0.0451 | 83.14 / 35.53 | 6.85 | 5.97 | 3.3
DistilHuBERT | 23.49 | 16.27 | 95.98 | 94.99 | 73.54 | 63.02 | 13.37 / 9.21 | 0.0511 | 82.57 / 35.59 | 8.55 | 6.19 | 3.8

[Figure 2: Three scatter plots of downstream performance versus number of pre-trained model parameters (M) for (a) IC, (b) SID, and (c) ASR (w/o LM), comparing FBANK, PASE+, Mockingjay, APC, NPC, CPC, TERA, vq-wav2vec, wav2vec, DeCoAR 2.0, wav2vec 2.0, HuBERT, and DistilHuBERT.]

Fig. 2. Comparisons of pre-trained model size vs. performance on three tasks: IC, SID, and ASR.

The rightmost column of Table 1 shows each method's ranking score, obtained by averaging its ranks over the ten tasks and offering an easy way to compare the methods. First, the ranking scores revealed that DistilHuBERT outperformed all methods except HuBERT (its teacher), indicating that DistilHuBERT retained most of HuBERT's performance (sec. (III) vs. (I)). Next, the distillation baselines in section (II) performed worse than DistilHuBERT, especially in the PR and SID tasks, showing that the proposed framework distilled better representations using separate prediction heads. Moreover, DistilHuBERT degraded only slightly without the prediction heads while reducing the model size by 13%, implying the prediction heads were redundant after pre-training.

Furthermore, Fig. 2 visualizes the relation between model size and performance on IC, SID, and ASR. We chose IC, SID, and ASR for comparison because performance differences are easier to observe on these tasks. In Fig. 2(a)(b), DistilHuBERT showed competitive performance compared with HuBERT and wav2vec 2.0 and surpassed all other similar-sized methods in IC and SID. In Fig. 2(c), DistilHuBERT degraded more in ASR yet was still better than most other SSL methods, showing a better trade-off between model size and performance.

Table 2. Comparison of different SSL model sizes and inference time. The inference was performed on 4 CPUs, extracting all features from the LibriSpeech dev-clean set with a batch size of one. Results were averaged over three runs.

Model | # param. (Millions) | Inf. time (seconds)
HuBERT | 94.68 (100%) | 992 (1.00X)
DeCoAR 2.0 | 85.12 (90%) | 924 (1.07X)
DistilHuBERT | 23.49 (25%) | 574 (1.73X)

3.3. Model Size and Inference Speed

We compared the sizes and inference speed of several methods, as shown in Table 2. DistilHuBERT had a significantly smaller model size than HuBERT and offered a 73% speedup. DistilHuBERT performed as well as DeCoAR 2.0 but was smaller and faster. Combining the results in Tables 1 and 2, DistilHuBERT was fast, small, and powerful. Note that DistilHuBERT can be further compressed by pruning or quantization for on-device computation.
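A rough sketch of the kind of CPU timing behind Table 2 is shown below; it assumes a generic `model(wav)` feature-extraction interface and a list of dev-clean waveform tensors, neither of which is specified in the paper.

```python
import time
import torch

@torch.no_grad()
def time_feature_extraction(model, waveforms, num_threads: int = 4,
                            num_runs: int = 3) -> float:
    """Average wall-clock time (seconds) to extract features for all
    utterances with batch size one, mirroring the Table 2 setup.

    `model` is any callable mapping a (1, samples) waveform tensor to
    features; `waveforms` is a list of 1-D float tensors (assumptions).
    """
    torch.set_num_threads(num_threads)  # restrict inference to 4 CPU threads
    model.eval()
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        for wav in waveforms:
            model(wav.unsqueeze(0))  # batch size of one
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```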
Table 3. Ablation study of the techniques used.

Method | IC Acc↑ | SID Acc↑ | ASR WER↓
DistilHuBERT | 94.99 | 73.54 | 13.34
w/o L_cos | 93.36 | 72.54 | 13.74
w/o teacher init | 94.20 | 73.68 | 13.41

Table 4. Results of predicting different HuBERT hidden layers.

Predicted Layers | IC Acc↑ | SID Acc↑ | ASR WER↓
4, 8, 12 | 94.99 | 73.54 | 13.34
4 | 79.09 | 76.85 | 14.90
8 | 96.89 | 61.52 | 12.28
12 | 95.52 | 65.14 | 13.48
4, 8 | 91.67 | 75.36 | 13.73
4, 12 | 91.88 | 74.48 | 14.03
8, 12 | 96.34 | 65.30 | 13.03

Table 5. Pre-training DistilHuBERT with different datasets.

Dataset | Data Size (hours) | IC Acc↑ | SID Acc↑ | ASR WER↓
LibriSpeech | 960 | 94.99 | 73.54 | 13.34
LibriSpeech | 100 | 93.17 | 69.46 | 14.77
WSJ | 81 | 90.22 | 64.14 | 15.59
AISHELL-1 | 150 | 87.29 | 67.65 | 16.42

[Figure 3: Bar chart of the learned summarization weights (0.0–1.0) over the CNN output (feat), the transformer output (hid), and prediction heads 1–12, for the ASR, KS, IC, and SID tasks.]

Fig. 3. Weights of the feature summarization in several SUPERB tasks after training on DistilHuBERT representations (predicting all hidden layers in HuBERT). feat and hid respectively denote the CNN's and the transformer's outputs.

3.4. Ablation Study

Table 3 shows the effectiveness of the cosine similarity loss in Eq. (1) and of teacher initialization. Both techniques benefited DistilHuBERT training, and we thus used them in the rest of the experiments. Removing the cosine similarity loss degraded performance more than removing teacher initialization, showing that the former had more impact on improving DistilHuBERT, while the latter was less effective.

3.5. Layer Selection

This section discusses the selection of HuBERT's hidden layers for DistilHuBERT to learn. Because each HuBERT layer carries different information, we first inspected each layer's importance for several downstream tasks. We trained a DistilHuBERT model with 12 prediction heads to predict all HuBERT layers. The outputs of DistilHuBERT's hidden layers and all prediction heads were weighted and summed by a set of learnable weights, where the weights were jointly trained with the downstream model. The summarized outputs were used for all downstream tasks, each with its own set of weights. The weights were normalized by the averaged ℓ2 norm of each layer's output. We visualized the learned weights in Fig. 3, where larger values indicate higher importance. We only show ASR, KS, IC, and SID because they represent recognition, detection, semantics, and speaker identity tasks; the others showed similar trends.
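The probing setup just described (a learnable weighted sum over the student's internal features and all head outputs, with per-layer ℓ2-norm normalization) can be sketched as follows. This is an illustration of the described procedure, not code from the paper; in particular, the softmax over the layer weights follows the common SUPERB-style convention and is an assumption.

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learnable weighted sum over a list of layer representations,
    trained jointly with the downstream model (Sec. 3.5 probe)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats):
        # layer_feats: list of (batch, T, D) tensors, e.g. the CNN output,
        # the transformer output, and the 12 prediction-head outputs.
        stacked = torch.stack(layer_feats, dim=0)                      # (L, B, T, D)
        # Normalize each layer by its average L2 norm so the learned weights
        # reflect importance rather than feature magnitude.
        norms = stacked.norm(dim=-1).mean(dim=(1, 2), keepdim=True)    # (L, 1, 1)
        stacked = stacked / norms.unsqueeze(-1)
        w = torch.softmax(self.weights, dim=0)                         # (L,)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)              # (B, T, D)
```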
Fig. 3 shows that the features before the prediction heads (hid) were useful, especially for SID. These results corroborated our hypothesis that the shared representation for multiple heads carries rich information. Among the head outputs, the heads for the 6th to 12th layers were also crucial to ASR, KS, and IC because they carry content and semantic information. Based on these findings, we made DistilHuBERT predict the 8th and 12th HuBERT layers. DistilHuBERT additionally predicted the 4th layer because we hypothesized that the bottom layers preserve speaker identity.

To justify our selection method, we conducted experiments predicting different HuBERT layers, as shown in Table 4. First, when predicting only one HuBERT layer, the evaluation scores across the three tasks were imbalanced. Predicting the 4th layer performed the worst in IC and ASR but better in SID than predicting three layers, corroborating our hypothesis that the bottom layers offer speaker identity. Then, when predicting two of the selected layers, the downstream tasks' scores were also imbalanced. Therefore, the three selected layers offered a good combination for learning representations with less biased information.

3.6. Knowledge Distillation with Different Datasets

In practice, the pre-training data for a self-supervised teacher model might be inaccessible or too large to reuse for distillation. This section inspected the generalizability and flexibility of DistilHuBERT by training it with small or out-of-domain datasets. We used the 100-hour subset of LibriSpeech, the 81-hour WSJ, and the Mandarin AISHELL-1 corpora to train DistilHuBERT with the proposed knowledge distillation approach. The results are shown in Table 5.

First, when training with the 100-hour LibriSpeech subset, DistilHuBERT degraded slightly but retained better performance than many other methods (in Table 1), indicating that a smaller dataset was sufficient for distilling knowledge. Next, the smaller WSJ corpus showed more degradation, perhaps because WSJ's average utterance duration is shorter than LibriSpeech's [35] or simply because it is small. Furthermore, training with the Mandarin AISHELL-1 corpus offered better SID than WSJ because speaker identity is independent of language and AISHELL-1 has more data than WSJ. In contrast, AISHELL-1 performed the worst in IC and ASR because of the language mismatch. Although performance was affected by the training data, DistilHuBERT offered better performance than many other SSL methods, even with smaller datasets for distillation.

4. CONCLUSION

This paper proposes DistilHuBERT, a novel framework to layer-wise distill knowledge from HuBERT. DistilHuBERT retained most of HuBERT's performance with only 25% of its size. We provided comprehensive analyses of the proposed methods and demonstrated DistilHuBERT's flexibility and generalizability. Our method can be easily applied to more powerful SSL models.
5. REFERENCES

[1] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP, 2020.
[2] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," TASLP, vol. 29, 2021.
[3] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, "Deep contextualized acoustic representations for semi-supervised speech recognition," in ICASSP, 2020.
[4] S. Ling and Y. Liu, "DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization," arXiv preprint arXiv:2012.06659, 2020.
[5] A. H. Liu, Y.-A. Chung, and J. Glass, "Non-autoregressive predictive coding for learning speech representations from local dependencies," in Interspeech, 2021.
[6] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.
[7] Y.-A. Chung, H. Tang, and J. Glass, "Vector-quantized autoregressive predictive coding," in Interspeech, 2020.
[8] Y.-A. Chung and J. Glass, "Improved speech representations with multi-target autoregressive predictive coding," in ACL, 2020.
[9] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[10] M. Riviere, A. Joulin, P.-E. Mazaré, and E. Dupoux, "Unsupervised pretraining transfers well across languages," in ICASSP, 2020.
[11] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Interspeech, 2019.
[12] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in ICLR, 2020.
[13] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in NeurIPS, 2020.
[14] S. Sadhu, D. He, C.-W. Huang, S. H. Mallidi, M. Wu, A. Rastrow, A. Stolcke, J. Droppo, and R. Maas, "wav2vec-C: A self-supervised model for speech representation learning," in Interspeech, 2021.
[15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," arXiv preprint arXiv:2106.07447, 2021.
[16] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Interspeech, 2019.
[17] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP, 2020.
[18] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[19] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., "SUPERB: Speech processing universal performance benchmark," in Interspeech, 2021.
[20] S. Evain, H. Nguyen, H. Le, M. Z. Boito, S. Mdhaffar, S. Alisamir, Z. Tong, N. Tomashenko, M. Dinarelli, T. Parcollet, A. Allauzen, Y. Estève, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, "LeBenchmark: A reproducible framework for assessing self-supervised representation learning from speech," in Interspeech, 2021.
[21] A. Hannun, "The history of speech recognition to the year 2030," arXiv preprint arXiv:2108.00084, 2021.
[22] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[23] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[24] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, "TinyBERT: Distilling BERT for natural language understanding," in EMNLP, 2020.
[25] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, "Shrinking bigfoot: Reducing wav2vec 2.0 footprint," arXiv preprint arXiv:2103.15760, 2021.
[26] C.-I. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y.-L. Liao, Y.-S. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass, "PARP: Prune, adjust and re-prune for self-supervised speech recognition," in NeurIPS, 2021.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[28] A. Pasad, J.-C. Chou, and K. Livescu, "Layer-wise analysis of a self-supervised speech representation model," arXiv preprint arXiv:2107.04734, 2021.
[29] X. Chang, T. Maekaku, P. Guo, J. Shi, Y.-J. Lu, A. S. Subramanian, T. Wang, S.-w. Yang, Y. Tsao, H.-y. Lee, and S. Watanabe, "An exploration of self-supervised pretrained representations for end-to-end speech recognition," in ASRU, 2021.
[30] A. Baevski, W.-N. Hsu, A. Conneau, and M. Auli, "Unsupervised speech recognition," arXiv preprint arXiv:2105.11084, 2021.
[31] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in NAACL-HLT: Demonstrations, 2019.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[33] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in HLT, 1992.
[34] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[35] H.-J. Chang, H.-y. Lee, and L.-s. Lee, "Towards lifelong learning of end-to-end ASR," in Interspeech, 2021.