DISTILHUBERT: SPEECH REPRESENTATION LEARNING BY LAYER-WISE DISTILLATION OF HIDDEN-UNIT BERT
Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee
College of Electrical Engineering and Computer Science, National Taiwan University
arXiv:2110.01900v4 [cs.CL] 28 Apr 2022

ABSTRACT

Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and offer good representations for numerous speech processing tasks. Despite the success of these methods, they require large memory and high pre-training costs, making them inaccessible for researchers in academia and small companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations from a HuBERT model directly. This method reduces HuBERT's size by 75% and makes it 73% faster while retaining most of its performance in ten different tasks. Moreover, DistilHuBERT requires little training time and data, opening the possibility of pre-training personal and on-device SSL models for speech.

Index Terms: Self-supervised learning, speech representation learning, knowledge distillation, model compression

1. INTRODUCTION

Recently, self-supervised learning (SSL) methods for speech representation learning have succeeded in many tasks [1-18]. They learn from targets derived from unlabeled speech data and can be roughly categorized into two classes by learning objective: generative and discriminative. Generative methods either reconstruct masked acoustic features [1-5] or generate future acoustic features [6-8]. Discriminative methods learn either by contrastive learning [9-14] or by classifying pseudo labels [15].

Many previous SSL methods for speech representation learning are designed for a few tasks like speech recognition. Therefore, the Speech processing Universal PERformance Benchmark (SUPERB) [19] and LeBenchmark [20] were developed to evaluate the effectiveness of these methods on general speech processing applications. These benchmarks comprehensively assess the capability of pre-trained models by applying them to multiple downstream tasks covering recognition, detection, semantics, and speaker identification.

Although recent speech SSL methods show excellent performance across various tasks, many require large memory and high pre-training costs, including wav2vec 2.0 [13] and HuBERT [15]. These limitations make such models unsuitable for on-device computation and difficult for researchers in academia and small corporations to use [21]. Knowledge distillation is a common method for compressing models [22], in which a small student model learns to generate the teacher model's outputs or hidden representations. Knowledge distillation has been shown to be effective in NLP; DistilBERT [23] and TinyBERT [24] are good examples. However, we found these approaches ineffective for distilling speech SSL models, and few studies have investigated this problem [25, 26]. Therefore, we propose a novel multi-task learning framework to layer-wise distill hidden representations of SSL speech models.

In this work, we distill HuBERT and obtain DistilHuBERT. DistilHuBERT uses three prediction heads to respectively predict the 4th, 8th, and 12th HuBERT hidden layers' outputs. After training, the heads are removed because the multi-task learning paradigm forces the DistilHuBERT model to learn representations containing rich information. DistilHuBERT reduces HuBERT's size by 75% and speeds it up by 73%, retaining most of the performance and requiring less training time. Moreover, we offer comprehensive analyses of the proposed methods and show DistilHuBERT's ability to distill knowledge with little data. To our knowledge, this is the first attempt to distill speech representations from SSL pre-trained models directly.

We thank the National Center for High-performance Computing (NCHC) of National Applied Research Laboratories (NARLabs) in Taiwan for providing computational and storage resources.

2. METHODS

2.1. HuBERT

The proposed distillation approach can be applied to any self-supervised model. This paper takes HuBERT [15] as the teacher because it outperforms other methods in almost all tasks in SUPERB [19], showing that it offers good speech representations. HuBERT consists of a CNN and a transformer encoder [27] trained to classify randomly masked frames into pseudo labels. The labels are obtained by clustering MFCCs or another model's hidden units. Despite HuBERT's success, it has two major disadvantages. First, HuBERT models have 95M to 1B parameters, consuming large amounts of memory and making inference slow. Second, the training costs are high, e.g., HuBERT Base pre-training requires 32 GPUs and 2k GPU hours. These issues prevent academia and small corporations from pre-training their own models or applying these technologies to products. Hence, this paper aims to alleviate these problems.

2.2. DistilHuBERT

In a typical knowledge distillation method, a student learns to generate the teacher's output. However, different layers in self-supervised speech models contain different information, such as speaker identity or semantics [13, 28, 29]. An SSL model's output layer does not always offer the richest information; for instance, wav2vec 2.0 stores phonetic information in its middle layers [28, 30]. Therefore, a student that only learns the teacher's last layer might benefit little. One possible way to solve this problem is to let different hidden layers of the student learn from different layers of the teacher [24], but the student must then be deep enough to have multiple layers, which works against having a small model.

This paper proposes a novel teacher-student framework for speech representation learning by multi-task knowledge distillation, as shown in Fig. 1. We distill HuBERT to obtain DistilHuBERT, consisting of a CNN feature extractor and a small transformer encoder. The basic idea of our approach is to learn to generate multiple teacher hidden representations from a shared representation. Therefore, we propose predicting the teacher's hidden representations with separate prediction heads, as shown in Fig. 1(I). This objective is a multi-task learning paradigm and encourages the transformer encoder to produce compact representations for multiple prediction heads. After pre-training, we remove the heads. The model parameters are frozen, and the output representations are used for various downstream tasks, as shown in Fig. 1(II).

Fig. 1. The proposed DistilHuBERT framework. (I) Pre-training: the student is initialized with the teacher's parameters, and the student's prediction heads learn to generate the teacher's hidden representations by minimizing the loss function L = L^{(4)} + L^{(8)} + L^{(12)}. (II) Downstream: the pre-trained DistilHuBERT model is frozen to generate speech representations for downstream tasks.
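To make the shared-representation idea concrete, the following is a minimal PyTorch-style sketch of a student with a shared encoder and separate prediction heads. The module and argument names (cnn_extractor, encoder_layers, the linear heads, the assumed (batch, frames, dim) interface) are illustrative assumptions rather than the actual fairseq/S3PRL implementation; the real head architecture and tensor layouts may differ.

```python
import torch.nn as nn

class DistilStudent(nn.Module):
    """Sketch: shared CNN + small transformer encoder, with one prediction
    head per distilled teacher layer (e.g., HuBERT layers 4, 8, and 12)."""
    def __init__(self, cnn_extractor, encoder_layers, dim=768, target_layers=(4, 8, 12)):
        super().__init__()
        self.cnn = cnn_extractor                      # copied from the teacher's feature extractor
        self.encoder = nn.ModuleList(encoder_layers)  # e.g., the teacher's first two transformer layers
        self.heads = nn.ModuleDict(                   # separate head per predicted teacher layer
            {str(l): nn.Linear(dim, dim) for l in target_layers}
        )

    def forward(self, wav):                           # wav: (batch, samples)
        x = self.cnn(wav)                             # assumed to return (batch, frames, dim)
        for layer in self.encoder:
            x = layer(x)                              # shared representation used downstream
        preds = {int(l): head(x) for l, head in self.heads.items()}
        return x, preds                               # heads are discarded after pre-training
```

After pre-training, only the CNN and the small encoder are kept; the frozen forward pass then yields the representation fed to the downstream models of Fig. 1(II).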
Objective Function. Here, we describe the objective function of DistilHuBERT. We denote the D-dimensional feature vectors at time step t produced by the teacher's l-th layer and by the student's corresponding prediction head as \hat{h}_t^{(l)} and h_t^{(l)}, respectively. The loss function is

L^{(l)} = L_{\ell_1}^{(l)} + \lambda L_{\cos}^{(l)} = \sum_{t=1}^{T} \left[ \frac{1}{D} \left\| h_t^{(l)} - \hat{h}_t^{(l)} \right\|_1 - \lambda \log \sigma\left( \cos\left( h_t^{(l)}, \hat{h}_t^{(l)} \right) \right) \right],   (1)

where T is the number of time steps, and σ and cos(·, ·) respectively denote the sigmoid activation and cosine similarity. Minimizing L^{(l)} is equivalent to simultaneously minimizing the ℓ1 distance and maximizing the cosine similarity between the hidden representations. We empirically found that considering both L_{\ell_1} and L_{\cos} yielded better performance than considering only one of them. λ > 0 controls the contribution of the cosine similarity loss.

Parameter Initialization. Following DistilBERT [23], we initialized DistilHuBERT with HuBERT's CNN extractor and its first two transformer layers.

Reducing Computation for Distillation. The number of prediction heads can range from 1 to L, where L is the number of hidden layers in the self-supervised speech model being distilled. Because neighboring layers' representations might contain similar information, in our implementation the student only predicts some specific layers of the teacher and skips the others. For example, in Fig. 1, only the 4th, 8th, and 12th HuBERT hidden layers' outputs are predicted. In the experiments, we show that predicting these three layers is sufficient.
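For reference, a small PyTorch-style sketch of the per-layer loss in Eq. (1) is given below, assuming the student prediction and the teacher hidden state are (batch, time, dim) tensors; averaging over the batch is an assumption, since the paper only specifies the sum over time steps.

```python
import torch
import torch.nn.functional as F

def layer_distill_loss(h_student, h_teacher, lam=1.0):
    """Eq. (1) for one predicted layer: per-frame L1 distance (scaled by 1/D)
    minus a log-sigmoid of the cosine similarity, summed over time."""
    d = h_student.size(-1)
    l1 = (h_student - h_teacher).abs().sum(dim=-1) / d        # (batch, time)
    cos = F.cosine_similarity(h_student, h_teacher, dim=-1)   # (batch, time)
    per_frame = l1 - lam * F.logsigmoid(cos)                  # -log sigma(cos) maximizes similarity
    return per_frame.sum(dim=-1).mean()                       # sum over T, mean over the batch

# Total objective over the selected teacher layers, as in Fig. 1:
# loss = sum(layer_distill_loss(preds[l], teacher_hidden[l]) for l in (4, 8, 12))
```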
3. EXPERIMENTS

3.1. Experimental Setup

Experiments were implemented with S3PRL [1, 2] and fairseq [31].

Model. The HuBERT model had a 7-layer CNN and a 12-layer transformer encoder, while DistilHuBERT had a similar architecture, but its transformer encoder had only two layers. DistilHuBERT had three prediction heads respectively predicting the 4th, 8th, and 12th HuBERT hidden layers. We applied past distillation methods to DistilHuBERT for comparison [23, 24]. As shown in Table 1, "predict last layer" predicted only the final output of HuBERT. "Predict w/ hidden" was trained to predict the three chosen layers with its own hidden layers, i.e., the CNN and the two transformer encoder layers respectively predicted the 4th, 8th, and 12th HuBERT layers. "W/ prediction heads" preserved the three prediction heads after pre-training.

Data. We used the 960-hour LibriSpeech dataset [32] for knowledge distillation, except that Sec. 3.6 used the English Wall Street Journal (WSJ) [33] and the Mandarin AISHELL-1 [34] speech corpora.

Pre-training. The default DistilHuBERT was trained on a 32GB V100 GPU for 200k updates with a batch size of 24 utterances, taking roughly 55 hours. The learning rate was linearly increased to 2e-4 over the first 7% of updates, then linearly decreased to zero. λ = 1 in Eq. (1). Note that training DistilHuBERT was much cheaper than training HuBERT, since the HuBERT Base model required 2k GPU hours.

3.2. SUPERB

DistilHuBERT was evaluated on SUPERB [19]. Each SSL model was frozen, and a trainable weighted sum summarized its hidden representations. Each downstream task used a simple model trained with minimal labeled data on the summarized representations. The tasks included phoneme recognition (PR), keyword spotting (KS), intent classification (IC), speaker identification (SID), emotion recognition (ER), automatic speech recognition (ASR), query by example spoken term detection (QbE), slot filling (SF), automatic speaker verification (ASV), and speaker diarization (SD).

Table 1. Results on SUPERB [19]. The metrics include accuracy (Acc%), phoneme error rate (PER%), word error rate (WER%), maximum term weighted value (MTWV), F1 score (F1%), concept error rate (CER%), equal error rate (EER%), and diarization error rate (DER%).

| Method | # param. (Millions) | PR PER↓ | KS Acc↑ | IC Acc↑ | SID Acc↑ | ER Acc↑ | ASR (+LM) WER↓ | QbE MTWV↑ | SF F1↑ / CER↓ | ASV EER↓ | SD DER↓ | Rank↓ |
(I) Baselines [19]
| FBANK | 0 | 82.01 | 8.63 | 9.10 | 8.5E-4 | 35.39 | 23.18 / 15.21 | 0.0058 | 69.64 / 52.94 | 9.56 | 10.05 | 8.8 |
| TERA | 21.33 | 49.17 | 89.48 | 58.42 | 57.57 | 56.27 | 18.17 / 12.16 | 0.0013 | 67.50 / 54.17 | 15.89 | 9.96 | 7.9 |
| wav2vec | 32.54 | 31.58 | 95.59 | 84.92 | 56.56 | 59.79 | 15.86 / 11.00 | 0.0485 | 76.37 / 43.71 | 7.99 | 9.90 | 6.5 |
| DeCoAR 2.0 | 85.12 | 14.93 | 94.48 | 90.80 | 74.42 | 62.47 | 13.02 / 9.07 | 0.0406 | 83.28 / 34.73 | 7.16 | 6.59 | 4.1 |
| HuBERT (teacher) | 94.68 | 5.41 | 96.30 | 98.34 | 81.42 | 64.92 | 6.42 / 4.79 | 0.0736 | 88.53 / 25.20 | 5.11 | 5.88 | 1.0 |
(II) Distillation Baselines
| predict last layer | 24.67 | 16.71 | 96.07 | 95.57 | 43.44 | 62.81 | 13.67 / 9.43 | 0.0423 | 81.52 / 37.31 | 7.29 | 6.45 | 4.6 |
| predict w/ hidden | 23.49 | 17.43 | 95.46 | 94.91 | 54.90 | 61.49 | 14.95 / 10.27 | 0.0529 | 83.73 / 35.14 | 7.41 | 6.86 | 5.0 |
(III) Proposed
| w/ prediction heads | 27.03 | 14.35 | 96.20 | 94.17 | 72.83 | 62.73 | 13.26 / 8.99 | 0.0451 | 83.14 / 35.53 | 6.85 | 5.97 | 3.3 |
| DistilHuBERT | 23.49 | 16.27 | 95.98 | 94.99 | 73.54 | 63.02 | 13.37 / 9.21 | 0.0511 | 82.57 / 35.59 | 8.55 | 6.19 | 3.8 |

Results on SUPERB are shown in Table 1 (we only show some results because of the page limit; a complete comparison is available at https://superbbenchmark.org/). The rightmost column was each method's ranking score, obtained by averaging its ranks over the ten tasks and offering an easy way to compare the methods. First, the ranking scores revealed that DistilHuBERT outperformed all methods except HuBERT (its teacher), indicating that DistilHuBERT retained most of HuBERT's performance (sec. (III) vs. (I)). Next, the distillation baselines in section (II) performed worse than DistilHuBERT, especially on the PR and SID tasks, showing that the proposed framework distilled better representations using separate prediction heads. Moreover, DistilHuBERT degraded only slightly without the prediction heads while reducing the model size by 13%, implying the prediction heads were redundant after pre-training.

Fig. 2. Comparisons of pre-trained model size vs. performance on three tasks: (a) IC, (b) SID, and (c) ASR (w/o LM).

Furthermore, Fig. 2 visualizes the relation between model size and performance on IC, SID, and ASR. We chose these three tasks because the performance differences were easier to observe. In Fig. 2(a)(b), DistilHuBERT showed competitive performance compared with HuBERT and wav2vec 2.0 and surpassed all other similar-sized methods in IC and SID. In Fig. 2(c), DistilHuBERT degraded more in ASR yet was still better than most other SSL methods, showing a better trade-off between model size and performance.

3.3. Model Size and Inference Speed

We compared the sizes and inference speeds of several methods, as shown in Table 2. DistilHuBERT had a significantly smaller model size than HuBERT and offered a 73% speedup. DistilHuBERT performed as well as DeCoAR 2.0 but was smaller and faster. Combining the results in Tables 1 and 2, DistilHuBERT was fast, small, and powerful. Note that DistilHuBERT can be further compressed by pruning or quantization for on-device computation.

Table 2. Comparison of SSL model sizes and inference time. Inference was performed on 4 CPUs, extracting all features from the LibriSpeech dev-clean set with a batch size of one. Results were averaged over three runs.

| Model | # param. (Millions) | Inf. time (seconds) |
| HuBERT | 94.68 (100%) | 992 (1.00X) |
| DeCoAR 2.0 | 85.12 (90%) | 924 (1.07X) |
| DistilHuBERT | 23.49 (25%) | 574 (1.73X) |
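For reference, the trainable weighted sum used to summarize each frozen model's hidden representations in the SUPERB evaluation (Sec. 3.2) can be sketched as below. The softmax normalization follows the common S3PRL-style featurizer and is an assumption about the exact implementation; the weights are trained jointly with each downstream model.

```python
import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    """Sketch: one learnable scalar weight per hidden representation,
    softmax-normalized, producing a single summarized feature sequence."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hiddens):                     # list of (batch, time, dim) tensors
        stacked = torch.stack(hiddens, dim=0)       # (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)      # normalized layer weights
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```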
3.4. Ablation Study

Table 3 shows the effectiveness of the cosine similarity loss in Eq. (1) and of teacher initialization. Both techniques benefited DistilHuBERT training, so we used them in the rest of the experiments. Removing the cosine similarity loss degraded performance more than removing teacher initialization, showing that the former had more impact on improving DistilHuBERT, while the latter was less effective.

Table 3. Ablation study of the techniques used.

| Method | IC Acc↑ | SID Acc↑ | ASR WER↓ |
| DistilHuBERT | 94.99 | 73.54 | 13.34 |
| w/o L_cos | 93.36 | 72.54 | 13.74 |
| w/o teacher init | 94.20 | 73.68 | 13.41 |

3.5. Layer Selection

This section discusses the selection of HuBERT's hidden layers for DistilHuBERT to learn. Because each HuBERT layer carries different information, we first inspected each layer's importance for several downstream tasks. We trained a DistilHuBERT model with 12 prediction heads to predict all HuBERT layers. The outputs of DistilHuBERT's hidden layers and of all prediction heads were summed with a set of learnable weights that were jointly trained with the downstream model; the summarized outputs were used for all downstream tasks with separate weights. The weights were normalized by the averaged ℓ2 norm of each layer's output. We visualize the learned weights in Fig. 3, where larger values indicate higher importance. We only show ASR, KS, IC, and SID because they represent recognition, detection, semantics, and speaker identity tasks; the others showed similar trends.

Fig. 3. Weights of the feature summarization in several SUPERB tasks (ASR, KS, IC, SID) after training on DistilHuBERT representations (predicting all hidden layers in HuBERT). feat and hid respectively denote the CNN's and the transformer's outputs.

Fig. 3 shows that the features before the prediction heads (hid) were useful, especially for SID. These results corroborated our hypothesis that the shared representation for multiple heads carries rich information. Among the head outputs, the heads for the 6th to 12th layers were also crucial to ASR, KS, and IC because they carry content and semantic information. With the above findings, we made DistilHuBERT predict the 8th and 12th HuBERT layers. DistilHuBERT additionally predicted the 4th layer because we hypothesized that the bottom layers preserve speaker identity.

To justify our selection, we conducted experiments predicting different HuBERT layers, as shown in Table 4. First, when predicting only one HuBERT layer, the evaluation scores among the three tasks were imbalanced: predicting the 4th layer performed the worst in IC and ASR but better in SID than predicting three layers, corroborating our hypothesis that the bottom layers offer speaker identity. When predicting two of the selected layers, the downstream tasks' scores were also imbalanced. Therefore, the three selected layers offered a good combination for learning representations with less biased information.

Table 4. Results of predicting different HuBERT hidden layers.

| Predicted Layers | IC Acc↑ | SID Acc↑ | ASR WER↓ |
| 4, 8, 12 | 94.99 | 73.54 | 13.34 |
| 4 | 79.09 | 76.85 | 14.90 |
| 8 | 96.89 | 61.52 | 12.28 |
| 12 | 95.52 | 65.14 | 13.48 |
| 4, 8 | 91.67 | 75.36 | 13.73 |
| 4, 12 | 91.88 | 74.48 | 14.03 |
| 8, 12 | 96.34 | 65.30 | 13.03 |

3.6. Knowledge Distillation with Different Datasets

In practice, the pre-training data for a self-supervised teacher model might be inaccessible or too large to reuse for distillation. This section inspects the generalizability and flexibility of DistilHuBERT by training it with small or out-of-domain datasets. We used the 100-hour subset of LibriSpeech, the 81-hour WSJ corpus, and the Mandarin AISHELL-1 corpus to train DistilHuBERT with the proposed knowledge distillation approach. The results are shown in Table 5.

Table 5. Pre-training DistilHuBERT with different datasets.

| Dataset | Data Size (hours) | IC Acc↑ | SID Acc↑ | ASR WER↓ |
| LibriSpeech | 960 | 94.99 | 73.54 | 13.34 |
| LibriSpeech | 100 | 93.17 | 69.46 | 14.77 |
| WSJ | 81 | 90.22 | 64.14 | 15.59 |
| AISHELL-1 | 150 | 87.29 | 67.65 | 16.42 |

First, when training with the 100-hour LibriSpeech subset, DistilHuBERT degraded slightly but still performed better than many other methods (see Table 1), indicating that a smaller dataset is sufficient for distilling knowledge. Next, the smaller WSJ corpus showed more degradation, perhaps because WSJ's average utterance duration is shorter than LibriSpeech's [35] or simply because the corpus is small. Furthermore, training with the Mandarin AISHELL-1 corpus offered better SID than WSJ because speaker identity is independent of language and AISHELL-1 has more data than WSJ. In contrast, AISHELL-1 performed the worst in IC and ASR because of the language mismatch. Although performance was affected by the training data, DistilHuBERT offered better performance than many other SSL methods, even with smaller distillation datasets.
4. CONCLUSION

This paper proposes DistilHuBERT, a novel framework to layer-wise distill knowledge from HuBERT. DistilHuBERT retained most of HuBERT's performance while having only 25% of its size. We provided comprehensive analyses of the proposed methods and demonstrated DistilHuBERT's flexibility and generalizability. Our method can be easily applied to more powerful SSL models.

5. REFERENCES

[1] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP, 2020.
[2] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," TASLP, vol. 29, 2021.
[3] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, "Deep contextualized acoustic representations for semi-supervised speech recognition," in ICASSP, 2020.
[4] S. Ling and Y. Liu, "DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization," arXiv preprint arXiv:2012.06659, 2020.
[5] A. H. Liu, Y.-A. Chung, and J. Glass, "Non-autoregressive predictive coding for learning speech representations from local dependencies," in Interspeech, 2021.
[6] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.
[7] Y.-A. Chung, H. Tang, and J. Glass, "Vector-quantized autoregressive predictive coding," in Interspeech, 2020.
[8] Y.-A. Chung and J. Glass, "Improved speech representations with multi-target autoregressive predictive coding," in ACL, 2020.
[9] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[10] M. Riviere, A. Joulin, P.-E. Mazaré, and E. Dupoux, "Unsupervised pretraining transfers well across languages," in ICASSP, 2020.
[11] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Interspeech, 2019.
[12] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in ICLR, 2020.
[13] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in NeurIPS, 2020.
[14] S. Sadhu, D. He, C.-W. Huang, S. H. Mallidi, M. Wu, A. Rastrow, A. Stolcke, J. Droppo, and R. Maas, "wav2vec-C: A self-supervised model for speech representation learning," in Interspeech, 2021.
[15] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," arXiv preprint arXiv:2106.07447, 2021.
[16] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Interspeech, 2019.
[17] M. Ravanelli, J. Zhong, S. Pascual, P. Swietojanski, J. Monteiro, J. Trmal, and Y. Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP, 2020.
[18] A. Baevski, M. Auli, and A. Mohamed, "Effectiveness of self-supervised pre-training for speech recognition," arXiv preprint arXiv:1911.03912, 2019.
[19] S.-w. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin et al., "SUPERB: Speech processing universal performance benchmark," in Interspeech, 2021.
[20] S. Evain, H. Nguyen, H. Le, M. Z. Boito, S. Mdhaffar, S. Alisamir, Z. Tong, N. Tomashenko, M. Dinarelli, T. Parcollet, A. Allauzen, Y. Estève, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, and L. Besacier, "LeBenchmark: A reproducible framework for assessing self-supervised representation learning from speech," in Interspeech, 2021.
[21] A. Hannun, "The history of speech recognition to the year 2030," arXiv preprint arXiv:2108.00084, 2021.
[22] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.
[23] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[24] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, "TinyBERT: Distilling BERT for natural language understanding," in EMNLP, 2020.
[25] Z. Peng, A. Budhkar, I. Tuil, J. Levy, P. Sobhani, R. Cohen, and J. Nassour, "Shrinking Bigfoot: Reducing wav2vec 2.0 footprint," arXiv preprint arXiv:2103.15760, 2021.
[26] C.-I. J. Lai, Y. Zhang, A. H. Liu, S. Chang, Y.-L. Liao, Y.-S. Chuang, K. Qian, S. Khurana, D. Cox, and J. Glass, "PARP: Prune, adjust and re-prune for self-supervised speech recognition," in NeurIPS, 2021.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[28] A. Pasad, J.-C. Chou, and K. Livescu, "Layer-wise analysis of a self-supervised speech representation model," arXiv preprint arXiv:2107.04734, 2021.
[29] X. Chang, T. Maekaku, P. Guo, J. Shi, Y.-J. Lu, A. S. Subramanian, T. Wang, S.-w. Yang, Y. Tsao, H.-y. Lee, and S. Watanabe, "An exploration of self-supervised pretrained representations for end-to-end speech recognition," in ASRU, 2021.
[30] A. Baevski, W.-N. Hsu, A. Conneau, and M. Auli, "Unsupervised speech recognition," arXiv preprint arXiv:2105.11084, 2021.
[31] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," in NAACL-HLT: Demonstrations, 2019.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[33] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in HLT, 1992.
[34] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[35] H.-J. Chang, H.-y. Lee, and L.-s. Lee, "Towards lifelong learning of end-to-end ASR," in Interspeech, 2021.