slimIPL: Language-Model-Free Iterative Pseudo-Labeling
A Preprint

Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert
Facebook AI Research, Menlo Park, USA & Paris, France
{antares,qiantong,jacobkahn,gab,locronan}@fb.com

arXiv:2010.11524v1 [cs.CL] 22 Oct 2020
October 23, 2020

Abstract

Recent results in end-to-end ASR have demonstrated the efficacy of simple pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further increase performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard label (most probable token) assignments, that is, without a language model. We call this approach Language-Model-Free IPL (slimIPL) and give the resulting training setup for CTC and seq2seq models. At inference, our experiments show that decoding with a strong language model is more beneficial with slimIPL than with IPL, as IPL exhibits some language model over-fitting issues. Compared to prior work on semi-supervised and unsupervised approaches, slimIPL not only simplifies the training process, but also achieves competitive and state-of-the-art results on LibriSpeech test sets in both standard and low-resource settings.

Index Terms: deep learning, semi-supervised learning, pseudo-labeling, self-training, speech recognition

1 Introduction

Recent work in deep learning has shifted towards methods which can efficiently learn from large amounts of unlabeled data to improve performance and decrease the cost of acquiring labels. Semi-supervised learning [1] combines information from both labeled and unlabeled data; the amount of unlabeled data typically exceeds the amount of labeled data. In automatic speech recognition (ASR), while many of the recent semi-supervised methods outperform a supervised baseline in a low-resource setting, a gap between semi- and fully-supervised training remains. Further, not all of these approaches scale equally well as the amount of labeled and unlabeled data increases, as is the case in recent setups such as the LibriLight benchmark [2].

Some of the earliest and simplest semi-supervised approaches use self-training [3]. Self-training employs a base model trained with labeled data which acts as a "teacher" and is used to label unlabeled data (the resulting labels are referred to as "pseudo-labels", PL). A "student" model is then trained (typically from scratch) with both labeled and pseudo-labeled data to yield a final model. For competitive results in ASR, a language model has been a key component of pseudo-labeling: it is usually combined with the acoustic model via beam-search decoding [4, 5] or through shallow fusion [6, 7, 8] to generate pseudo-labels. However, it has been observed that acoustic models then tend to over-fit to the text training set of the language model used for pseudo-labeling [5, 8]. In this work, we show that competitive pseudo-labeling approaches need to rely neither on beam-search decoding nor on a language model. Instead, pseudo-labels are generated by picking hard labels, i.e., the tokens with the highest acoustic model probability.
Our approach is based on the recently-proposed iterative pseudo-labeling (IPL) algorithm [5]: we continuously train a single model using iteratively re-generated pseudo-labels as the model learns. We call our algorithm language-model-free IPL (slimIPL) and give an overview of it in Section 4. We demonstrate in Section 5 that this approach is effective across different loss functions and token sets in both standard- and low-resource settings. Using the LibriLight benchmark, we also show that slimIPL easily scales to a large amount of unlabeled audio. Ablation experiments in Section 5.6 show that slimIPL overcomes the language model over-fitting issue inherent to the IPL algorithm, and also demonstrate that slimIPL is more stable when training seq2seq models.
2 Related Work

Self-training methods [3] still attract researchers: extensions to self-training are numerous and include (a) selecting particular subsets of pseudo-labeled data for student training, (b) repeating the PL procedure several times to progressively improve the teacher model, (c) introducing different types of noise for student model training, and (d) sampling techniques and schedules over labeled and pseudo-labeled datasets. Many recent works on self-training propose and validate these extensions, including those in computer vision [9, 10], natural language processing [11, 12, 13, 14, 15, 16, 17], ASR [18, 19, 20, 7, 8], and speech translation [21].

An extension of the simple pseudo-labeling method consists in continuously training a single model [22]. At the beginning of training, a model is trained only on labeled data, after which training continues with data selected jointly from both labeled and unlabeled datasets. Pseudo-labels are re-generated after some number of iterations, and a supervised loss is computed on both labeled and pseudo-labeled data for each batch. An additional parameter determines the contribution of pseudo-labeled data to the overall loss. The effectiveness of this iterative training of a single model has been validated on tasks in vision [23], natural language processing [15], and ASR [24, 5].

In addition to self-training, many other semi-supervised algorithms have been proposed in a variety of domains:
• computer vision: graph-based methods [25, 26], generative modelling [27, 28], consistency-based methods [29, 30, 31, 32], and contrastive methods [33, 34, 35, 36];
• machine translation (MT): integration of a language model trained on monolingual data [37, 38, 39], back-translation [40, 41, 42], synthetic data usage [43], and web-scale bitext mining [44];
• automatic speech recognition (ASR): representation learning [45, 2, 46, 47], local prior matching [4], adversarial training [48], back-translation [49], and others [50, 51, 52, 53, 54].

Below, we give an overview of the approaches in ASR that are most recent and relevant to our work.

IPL   The iterative pseudo-labeling (IPL) algorithm [5] follows prior work [22]: it uses data augmentation of both labeled and unlabeled data, and continuously trains a single model with iterative re-generation of pseudo-labels by beam-search decoding with a language model (LM) as the model learns. Compared to self-training with a student network trained each time from scratch [19], the IPL algorithm improves efficiency and performance. Prior work on IPL was applied only to models trained with word-pieces and Connectionist Temporal Classification (CTC) [55].

Noisy self-training   Another recent work on self-training [8] performs five iterations of student network training, each time from scratch, with pseudo-labels generated by a teacher network. It uses a Listen, Attend and Spell (LAS) [56]-style acoustic model (AM). In this approach, as in ours, data augmentation is used for both labeled and unlabeled data.
As is the case with IPL, shallow fusion with a language model is used in the decoding procedure to generate pseudo-labels, while slimIPL does not use a language model. Further, this approach filters teacher network predictions based on a transcription score, whereas slimIPL's filtering criteria are based only on data statistics.

Self-training   The work in [24] is the closest to ours: the authors also continuously train a single model, re-generating pseudo-labels with hard labels after each iteration. That work focuses on studying the impact of noise, and considers only CTC-trained models on the Wall Street Journal dataset. Both SpecAugment [57] and speed perturbation are applied to labeled and unlabeled data during training in [24], whereas slimIPL uses only SpecAugment. That work also lacks a study of over-fitting to the LM and a comparison between hard labels and beam-search decoding.

Wav2vec 2.0   Recent work on unsupervised pre-training [58] shows a significant boost in performance for low-resource settings. wav2vec training has two steps: first, pre-training on unlabeled data by masking the input audio in the latent space and solving a contrastive learning task [59]; second, fine-tuning the model using labeled audio only. Wav2vec learns from the raw waveform, whereas slimIPL uses log-mel filterbanks.
3 Pseudo-Labeling

Let L = {x_i, y_i} be a labeled dataset and U = {x_j} a large unlabeled dataset. We consider a semi-supervised pseudo-labeling approach as outlined in Section 1, where the acoustic model is continuously trained on a combination of the labeled set and an iteratively re-generated pseudo-labeled set. Training minimizes the following loss function:

    L(θ) = L_L(θ) + λ L_U(θ),   λ ∈ R⁺,                                 (1)

where θ are the parameters of the acoustic model, and λ is a tunable parameter controlling the importance of unlabeled data. In Eq. (1) the losses for labeled data L_L and for unlabeled data L_U are defined as:

    L_L(θ) = −E_{x,y ∼ p(x,y)} log p_θ(y|x),   (x, y) ∈ L,              (2)

where p(x, y) is the empirical data distribution of samples from L and p_θ(y|x) is the conditional distribution defined by the acoustic model, and

    L_U(θ) = −E_{x ∼ p(x)} log p_θ(ŷ|x),   x ∈ U,                       (3)

where p(x) is the empirical data distribution of samples from U, and ŷ are the pseudo-labels for utterance x ∈ U.

One key difference between existing pseudo-labeling approaches is how the label assignments ŷ are obtained for unlabeled data x ∈ U. In the general literature, pseudo-labeling refers to hard label generation:

    ŷ = argmax_y log p_θ(y|x).                                          (4)

In machine translation and automatic speech recognition, the model p_θ(y|x) is often a sequence-to-sequence model, and the solution of Eq. (4) may be approximated with a beam-search decoding algorithm [40, 15, 7, 6, 5, 8, 21, 24]. In fact, most recent work on speech recognition relies on a language model p_lm(y) to generate the pseudo-labels, and attempts to find instead:

    ŷ = argmax_y [log p_θ(y|x) + α log p_lm(y)],   x ∈ U,               (5)

where α is a hyper-parameter controlling the amount of language model regularization. More details on decoding can be found in Section 5.3. Pseudo-labeling is also popular in computer vision [22, 60]. Variants exist, such as "soft labels" ŷ = p_θ(y|x) and variations on soft labeling [31, 61]. Sampling [62, 41] is yet another way to generate pseudo-labels ŷ.

4 Language-Model-Free Iterative Pseudo-Labeling

In the original IPL training approach [5], pseudo-labels are generated with a beam-search decoder leveraging a language model, approximating the solution suggested by Eq. (5). While the main motivation is to transfer the knowledge of the language model into the acoustic model, the two main drawbacks of this approach are that (i) generating pseudo-labels is computationally intensive, and (ii) over-fitting to the language model knowledge is easy. Regularization tricks are proposed in [5] to overcome (ii), such that one can still benefit from the language model when decoding at evaluation time.

Algorithm 1: slimIPL
  Data: Labeled data L = {x_i, y_i}, unlabeled data U = {x_j}
  Result: Acoustic model p_θ
  Initialize p_θ by training on labeled data L only;
  repeat
    1. Draw a subset of unpaired data S ⊂ U
    2. Apply p_θ to the subset S and generate L̂ = {(x, ŷ) | x ∈ S, ŷ = argmax_y p_θ(y|x)}
    3. (For seq2seq only) Filter the subset L̂ by removing
       a. samples with n-gram repetitions
       b. outliers with respect to the "beak band" in the x-ŷ sizes plane, (x, ŷ) ∈ L̂
    4. Fine-tune p_θ on L ∪ L̂ with data augmentation
  until convergence or maximum iterations are reached;

We demonstrate in this paper that pseudo-labels do not need to rely on any language model information. Our approach (shown in Algorithm 1) follows the IPL algorithm, but pseudo-labels are simply generated by taking the top prediction according to the acoustic model (see Eq. (4)). For CTC-based acoustic models, this corresponds exactly to choosing the most likely token at each time step. For seq2seq models, we approximate Eq. (4) with the greedy solution of choosing the most likely token at each step of the seq2seq decoder. In addition, a regularization scheme is implemented via data augmentation over the input (acoustic) data, both for labeled and pseudo-labeled samples. In our experiments, we only considered SpecAugment [57] for data augmentation. In our study, we show that this approach (dubbed slimIPL) works for CTC and sequence-to-sequence (seq2seq) models, targeting either letters or word-pieces.
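For concreteness, the sketch below outlines Algorithm 1 in Python-style pseudo-code. It is a minimal illustration under assumptions, not the authors' implementation (which lives in wav2letter++): the `model`, `train_step`, `sample_batch`, and `filter_fn` objects are an assumed interface, and the hard-labeling step follows Eq. (4) for a CTC model by taking the most likely token per frame, collapsing repeats, and removing blanks.

```python
# A minimal sketch of Algorithm 1 (slimIPL). `model` is assumed to expose `fit`,
# `log_probs`, and `loss`; `train_step`, `sample_batch`, and `filter_fn` are
# hypothetical helpers standing in for the training framework.
import random

def ctc_greedy_labels(frame_log_probs, blank=0):
    """Hard labels for a CTC model (Eq. 4): pick the most likely token per frame,
    collapse repeated tokens, and drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_log_probs]
    tokens, prev = [], None
    for t in best:
        if t != prev and t != blank:
            tokens.append(t)
        prev = t
    return tokens

def slimipl(model, labeled, unlabeled, train_step, sample_batch,
            filter_fn=lambda pl: pl, lam=1.0, rounds=25,
            steps_per_round=1000, subset_fraction=0.25):
    model.fit(labeled)  # bootstrap on labeled data only
    for _ in range(rounds):
        # Steps 1-3: draw a subset of unlabeled audio, label it with the current
        # model (no LM, no beam search), and optionally filter (seq2seq only).
        subset = random.sample(unlabeled, int(subset_fraction * len(unlabeled)))
        pseudo = filter_fn([(x, ctc_greedy_labels(model.log_probs(x))) for x in subset])
        # Step 4: fine-tune on labeled + pseudo-labeled data with data augmentation;
        # the pseudo-labeled loss is weighted by lambda, as in Eq. (1).
        for _ in range(steps_per_round):
            loss = model.loss(sample_batch(labeled)) + lam * model.loss(sample_batch(pseudo))
            train_step(model, loss)
    return model
```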
4.1 Seq2seq Filtering

In text generation tasks, seq2seq model decoders tend to generate short transcriptions and also suffer from looping issues (generating repeated n-grams) [63, 7, 6]. Compared to the regular IPL approach, the problem is less pronounced with slimIPL. We speculate that language-model-based decoding for seq2seq models is a rather fragile process. We nevertheless found in our experiments that filtering generated transcriptions was still valuable for slimIPL (see Section 5.6.2).

Figure 1: Dependence between audio duration (ms) and token transcription length (including spaces between words) for LS-100 with validation sets (left) and LS-960 with validation sets (right).

Figure 2: Beak band regions for LS-100 with validation sets (left) and LS-960 with validation sets (right). Samples with pseudo-labels outside the red lines are filtered during training, while those between the red lines (grey zone) are used.

Recent works [6, 8] in ASR introduce scoring functions to evaluate model confidence in generated transcriptions. These scoring functions estimate the dependence between the acoustic model scores and the token transcription length over a validation set, and filter out samples which are too far from the expected behavior. Instead of relying on model predictions, we propose a filtering technique based only on input data statistics, assuming a strong correlation between audio duration and the length of the corresponding transcription. Figure 1 exhibits this correlation for LibriSpeech train and validation sets (details on the data are in Section 5.1). As most of the labeled data falls into a "beak band" region in the (audio duration, transcription length) plane, we filter out generated samples falling outside this estimated region:

1. Consider the ratio r_i = l_{x_i} / l_{y_i}, where l_{x_i} is the i-th sample duration and l_{y_i} is the i-th sample token transcription length (including spaces between words).
2. Take the 1% (r_down) and 99% (r_up) percentiles of the empirical distribution of {r_i}. These percentile values give the beak band in the l_x-l_y plane, see Figure 2: all samples with either r_i < r_down or r_i > r_up are filtered out from further training.

As some n-gram looping issues in generated transcriptions are still observed after this filtering, we also filter out a sample if its generated transcription contains a 5-gram occurring more than once or a 3-gram occurring more than twice. This filtering criterion was tuned empirically on validation set performance, and follows previous work [6].
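To make the filter concrete, the following sketch implements the two criteria above (beak-band ratio and n-gram repetitions). It is an illustration under the assumption that durations are given in milliseconds and transcriptions are token sequences; the function names are ours, not taken from the paper's code.

```python
# A minimal sketch of the slimIPL pseudo-label filter; names are illustrative only.
import numpy as np

def beak_band(durations_ms, token_lengths, low=1.0, high=99.0):
    """Estimate the accepted band of duration/length ratios from labeled data."""
    ratios = np.asarray(durations_ms, dtype=float) / np.asarray(token_lengths, dtype=float)
    return np.percentile(ratios, low), np.percentile(ratios, high)

def has_ngram_repetitions(tokens):
    """Reject transcriptions with a 5-gram seen more than once or a 3-gram seen more than twice."""
    for n, max_count in ((5, 1), (3, 2)):
        counts = {}
        for i in range(len(tokens) - n + 1):
            g = tuple(tokens[i:i + n])
            counts[g] = counts.get(g, 0) + 1
            if counts[g] > max_count:
                return True
    return False

def keep_pseudo_label(duration_ms, tokens, r_down, r_up):
    if len(tokens) == 0:
        return False
    r = duration_ms / len(tokens)
    return r_down <= r <= r_up and not has_ngram_repetitions(tokens)
```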
5 Experiments

5.1 Data

All experiments are performed on the LibriSpeech dataset [64] (960 hours of training audio with paired transcriptions: the train-clean-100, train-clean-360, and train-other-500 parts), and on audio from LibriVox (54K hours of unlabeled audio) extracted as described in [2]. We consider three scenarios with different amounts of labeled / unlabeled data: LS-100 / LS-860, LS-100 / LV, and LS-960 / LV, defined in Table 1. The standard LibriSpeech validation sets (dev-clean and dev-other) were used to tune all hyper-parameters, as well as to select the best models. The test sets (test-clean and test-other) were used only to report final WER performance. We keep the original 16kHz sampling rate and compute log-mel filterbanks with 80 coefficients for a 25ms sliding window, strided by 10ms. All features are normalized to have zero mean and unit variance per input sequence before being fed into the acoustic model neural network.

Table 1: Different semi-supervised training scenarios.
Setting | Labeled Data | Unlabeled Data
LS-100 / LS-860 | train-clean-100 | train-clean-360, train-other-500
LS-100 / LV | train-clean-100 | train-clean-360, train-other-500, LibriVox
LS-960 / LV | train-clean-100, train-clean-360, train-other-500 | LibriVox

5.2 Acoustic Models

We consider both CTC [55] and seq2seq-based [65] models. Architectures follow exactly [7, 5, 66], where more details can be found. The encoder of our acoustic models is composed of a convolutional frontend (several 1-D convolutions with kernel width 3; kernel size 7 is used for models with stride 3) followed by 36 Transformer blocks [67] with 4 attention heads. The self-attention dimension is D_tr = 768 and the feed-forward network (FFN) dimension is 3072 in each Transformer block. Depending on the experiment, our models have different strides, implemented in the convolution layers. For CTC-trained models, the output of the encoder, H^e, is followed by a linear layer mapping to the output classes. For seq2seq models, we have an additional decoder, which is a stack of 6 Transformers with encoding dimension 256 and 4 attention heads. The probability distribution of the transcription is factorized as:

    p(y_1, ..., y_n) = ∏_{i=1}^{n} p(y_i | y_0, ..., y_{i−1}, H^e),       (6)

where y_0 is a special symbol indicating the beginning of the transcription. We use dropout after the convolutions. For all Transformer layers (encoder and decoder, when present), we use dropout on the self-attention and on the FFN. We also use layer drop [68], dropping entire layers at the FFN level.
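The rough PyTorch sketch below only illustrates the encoder hyper-parameters described above (36 Transformer blocks, 4 heads, D_tr = 768, FFN 3072, strided convolutional frontend); it is an assumption-laden stand-in, not the paper's wav2letter++ model, and it omits details such as positional encoding and layer drop.

```python
# Illustrative encoder dimensions only; the real model is implemented in wav2letter++.
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=768, n_layers=36, n_heads=4,
                 ffn_dim=3072, stride=3, kernel=7, dropout=0.3):
        super().__init__()
        # Convolutional frontend over the 80 log-mel channels; the overall stride
        # (e.g. 2, 3, 4 frames) is implemented here. Kernel 7 assumed for stride 3.
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=kernel, stride=stride,
                      padding=kernel // 2),
            nn.GELU(),
            nn.Dropout(dropout),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=ffn_dim, dropout=dropout)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features):          # features: (batch, n_mels, time)
        x = self.frontend(features)       # (batch, d_model, time / stride)
        x = x.permute(2, 0, 1)            # (time, batch, d_model) for nn.Transformer
        return self.transformer(x)        # encoder states H^e
```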
Tokens   Two families of token sets are investigated: word-pieces and letters. We use 5k word-pieces [69, 70] generated with the SentencePiece toolkit (https://github.com/google/sentencepiece): for the LS-100 / LS-860 and LS-100 / LV scenarios, word-pieces are constructed from the LS-100 transcriptions; for the LS-960 / LV scenario, the entire LibriSpeech training transcriptions are used to generate the word-piece set. The letter set consists of the 26 English alphabet letters, augmented with the apostrophe and a word boundary token.
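As an illustration of the token-set construction, the snippet below trains a 5k word-piece model with the SentencePiece toolkit referenced above. The file names and the unigram model type are assumptions, not taken from the paper.

```python
# Hypothetical file names; `ls100_transcriptions.txt` stands for the LS-100
# training transcriptions (one utterance per line).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="ls100_transcriptions.txt",  # assumed path to the training text
    model_prefix="wp5k",               # writes wp5k.model / wp5k.vocab
    vocab_size=5000,                   # 5k word-pieces as in the paper
    model_type="unigram",              # assumption: unigram segmentation
)

sp = spm.SentencePieceProcessor(model_file="wp5k.model")
print(sp.encode("speech recognition with word pieces", out_type=str))
```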
Data augmentation during training is performed with SpecAugment [57]. The settings are two frequency masks with frequency mask parameter F = 30, and ten time masks with maximum time-mask ratio p = 0.1 and time mask parameter T = 50; time warping is not used.

Training   For all experiments we use the Adagrad optimizer [71] and decay the learning rate by a factor of 2 each time the word error rate reaches a plateau on the validation sets.

Implementation   All model architectures, as well as slimIPL, are implemented within the wav2letter++ framework [72] (https://github.com/facebookresearch/wav2letter).

5.3 Beam-search Decoding and Rescoring

In all our experimental results, we report word error rate (WER) without a language model (LM), as well as WER obtained with a one-pass beam-search decoder leveraging an LM. Following the notation introduced in Section 3, the beam-search decoder aims at maximizing:

    log p_θ(ŷ|x) + α log p_lm(ŷ) + β|ŷ|,

where α and β are hyper-parameters to tune. We rely on the beam-search decoder from the wav2letter++ framework following [73, 74, 7]: the lexicon-based beam-search decoder with a word-level LM for CTC models, and the lexicon-free beam-search decoder with a token-level LM for seq2seq models. The seq2seq beam-search decoder is stabilized by applying an EOS penalty γ to hypotheses that have finished with an end-of-sentence token [7]. γ is tuned together with the other hyper-parameters and tends to prevent the decoder from stopping too early. The LibriSpeech validation sets, dev-clean and dev-other, are used to optimize the beam-search decoder hyper-parameters through random search.

We also report WER obtained by rescoring the beam of hypotheses produced by our one-pass decoder. Rescoring is performed with a strong word-based Transformer LM, following the procedure described in [7]. We use open-sourced language models trained on the LibriSpeech LM corpus from [74, 7] to perform the beam-search decoding and rescoring: a 4-gram word-level LM, a 20-gram letter-level LM, and a word-level Transformer LM. Additionally, we train 6-gram word-piece-level LMs on the LibriSpeech LM corpus with the KenLM toolkit [75]. We apply pruning by removing 5-grams appearing only once and 6-grams appearing once or twice. As word-pieces, we use either the set constructed from the LS-100 or the one constructed from the LS-960 training transcriptions. Word-level perplexities of all language models used for beam-search decoding and rescoring are listed in Table 2.

Table 2: Word-level perplexities of language models (for token-level language models, an upper bound on the word-level perplexity is computed). For the 6-gram word-piece-level language model marked with "*", tokens are constructed on the LS-100 training set.
Data | word 4-gram | char 20-gram | wp 6-gram* | wp 6-gram | Transf.
dev-clean | 148.0 | 177 | 156.7 | 155.8 | 48.2
dev-other | 136.6 | 161 | 150.8 | 149.6 | 50.2
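To illustrate how hypotheses are combined with an LM, the sketch below re-ranks an n-best list with the score used above (acoustic score plus α-weighted LM score plus a length term β|ŷ|). It is a schematic re-implementation, not the wav2letter++ decoder; the hypothesis tuple layout and the toy LM are assumptions.

```python
# Schematic rescoring of an n-best list. Each hypothesis is a pair
# (tokens, acoustic_log_prob); `lm_log_prob` is any callable returning log p_lm(y),
# e.g. backed by a KenLM or Transformer LM (not implemented here).
def rescore(nbest, lm_log_prob, alpha, beta):
    def score(hyp):
        tokens, am_logp = hyp
        return am_logp + alpha * lm_log_prob(tokens) + beta * len(tokens)
    return max(nbest, key=score)

# Toy example with a uniform "LM"; alpha and beta would normally be tuned on
# dev-clean / dev-other via random search, as described above.
best = rescore([("the cat sat".split(), -11.9), ("the cat sad".split(), -12.3)],
               lm_log_prob=lambda toks: -2.0 * len(toks), alpha=0.5, beta=0.1)
print(best)
```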
5.4 Supervised Baselines

We considered different strides for acoustic modeling, looking for the best configuration among strides 1, 2, 3, 4 for letter-based models and 2, 4, 8 for word-piece models. For both dropout and layer drop we use a value of 0.3 for models trained on LS-100 and 0.2 for models trained on LS-960. Performance in WER, as well as the best stride configurations, are reported in Table 3 for LS-100 and Table 4 for LS-960.

Our supervised baseline models trained on LS-100 clearly achieve new state-of-the-art results for both the seq2seq and CTC criteria (Table 3). The seq2seq model reaches a new state of the art at 16.78% WER on test-other without a language model. The CTC model achieves new state-of-the-art results on both test-clean, 4.21% WER, and test-other, 12.15% WER, with beam-search decoding and further rescoring.
Table 3: WER comparison of our supervised baselines trained on train-clean-100 with prior work.
Method | Stride | Tokens | Criterion | LM | dev-clean | dev-other | test-clean | test-other
RWTH [76] | - | - | hybrid | word 4-gram | 5.0 | 19.5 | 5.8 | 18.6
RWTH [76] | - | - | S2S | - | 14.7 | 38.5 | 14.7 | 40.8
DeCoAR [47] | - | phonemes | CTC | - | - | - | 6.10 | 17.43
Word-level [77] | 80ms | 5k wp | CTC | - | 12.4 | 27.7 | 12.8 | 28.7
Word-level [77] | 80ms | 5k wp | CTC | wp 6-gram | 9.7 | 22.9 | 10.3 | 24.0
Word-level [77] | 80ms | 5k wp | S2S | - | 9.0 | 22.8 | 9.5 | 23.3
Word-level [77] | 80ms | 5k wp | S2S | wp 6-gram | 8.3 | 21.2 | 9.2 | 22.0
Word-level [77] | 80ms | words | CTC | - | 8.0 | 21.0 | 7.7 | 21.4
Word-level [77] | 80ms | words | CTC | word 4-gram | 6.3 | 19.1 | 6.8 | 19.4
Word-level [77] | 80ms | words | S2S | - | 7.2 | 21.2 | 8.6 | 21.9
Word-level [77] | 80ms | words | S2S | word 4-gram | 7.3 | 19.5 | 8.0 | 20.4
Improved T/S [8] | - | 16k wp | S2S | - | 5.3 | 16.5 | 5.5 | 16.9
Our | 40ms | 5k wp | S2S | - | 9.60 | 21.40 | 10.38 | 21.67
Our | 40ms | 5k wp | S2S | wp 6-gram | 8.51 | 18.71 | 8.86 | 18.94
Our | 40ms | 5k wp | S2S | wp 6-gram + rescoring | 7.63 | 16.91 | 8.27 | 17.08
Our | 20ms | letters | S2S | - | 6.22 | 16.56 | 6.43 | 16.78
Our | 30ms | letters | CTC | - | 6.32 | 17.24 | 6.57 | 17.75
Our | 30ms | letters | CTC | word 4-gram | 4.35 | 12.78 | 4.68 | 13.42
Our | 30ms | letters | CTC | word 4-gram + rescoring | 3.32 | 10.76 | 3.74 | 11.31

Table 4: WER comparison of our supervised baselines trained on LibriSpeech (960h) with prior work.
Method | Stride | AM tokens | Criterion | LM | dev-clean | dev-other | test-clean | test-other
FullAttn T-T [78] | 30ms | letters | RNN-T | - | - | - | 2.4 | 5.6
FullAttn T-T [78] | 30ms | letters | RNN-T | word Transf. | - | - | 2.0 | 4.6
ContextNet [79] | 80ms | 1k wp | CNN-RNN-T | - | - | - | 2.1 | 4.6
ContextNet [79] | 80ms | 1k wp | CNN-RNN-T | wp LSTM | - | - | 1.9 | 4.1
Conformer [80] | 40ms | 1k wp | Conformer-T | - | - | - | 2.1 | 4.3
Conformer [80] | 40ms | 1k wp | Conformer-T | wp LSTM | - | - | 1.9 | 3.9
wav2vec 2.0 [58] | 20ms | letters | CTC | - | 2.8 | 7.6 | 3.0 | 7.7
wav2vec 2.0 [58] | 20ms | letters | CTC | word Transf. | 1.7 | 4.3 | 2.1 | 4.6
Our | 40ms | 5k wp | S2S | - | 2.31 | 5.52 | 2.74 | 5.79
Our | 40ms | 5k wp | S2S | wp 6-gram | 2.08 | 4.84 | 2.39 | 4.98
Our | 40ms | 5k wp | S2S | wp 6-gram + rescoring | 1.98 | 4.32 | 2.24 | 4.60
Our | 20ms | letters | S2S | - | 2.84 | 5.56 | 2.92 | 5.79
Our | 20ms | letters | CTC | - | 2.58 | 6.71 | 2.71 | 6.77
Our | 20ms | letters | CTC | word 4-gram | 1.95 | 5.04 | 2.46 | 5.44
Our | 20ms | letters | CTC | word 4-gram + rescoring | 1.49 | 4.09 | 2.11 | 4.52

Our models trained on LS-960 are in the same WER ballpark as recently reported state-of-the-art models, as shown in Table 4, reaching 2.24% WER on test-clean and 4.6% WER on test-other with a language model.
Table 5: Comparison of WER with other semi-supervised methods on the LS-100 / LS-860 setting.
Method | Stride | AM tokens | Criterion | LM | dev-clean | dev-other | test-clean | test-other
IPL [5] | 80ms | 5k wp | CTC | - | 5.48 | 9.32 | 5.95 | 10.31
IPL [5] | 80ms | 5k wp | CTC | word 4-gram + rescoring | 4.98 | 7.97 | 5.59 | 8.95
Improved T/S [8] | - | 16k wp | S2S | - | 4.3 | 9.7 | 4.5 | 9.5
Improved T/S [8] | - | 16k wp | S2S | LSTM | 3.9 | 8.8 | 4.2 | 8.6
wav2vec 2.0 [58] | 20ms | letters | CTC | - | 4.6 | 9.3 | 4.7 | 9.0
wav2vec 2.0 [58] | 20ms | letters | CTC | word 4-gram | 2.3 | 5.7 | 2.8 | 6.0
wav2vec 2.0 [58] | 20ms | letters | CTC | word Transf. | 2.1 | 4.8 | 2.3 | 5.0
slimIPL, our | 40ms | 5k wp | S2S | - | 4.66 | 7.78 | 5.07 | 8.52
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram | 4.55 | 7.45 | 5.13 | 8.08
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram + rescoring | 4.34 | 6.8 | 4.76 | 7.6
slimIPL, our | 80ms | 5k wp | CTC | - | 13.69 | 19.25 | 13.52 | 20.37
slimIPL, our | 80ms | 5k wp | CTC | word 4-gram | 12.52 | 17.18 | 12.39 | 18.39
slimIPL, our | 20ms | letters | S2S | - | 4.31 | 7.05 | 4.32 | 7.85
slimIPL, our | 30ms | letters | CTC | - | 4.3 | 8.09 | 4.3 | 8.4
slimIPL, our | 30ms | letters | CTC | word 4-gram | 2.74 | 5.94 | 3.31 | 6.63
slimIPL, our | 30ms | letters | CTC | word 4-gram + rescoring | 2.04 | 4.76 | 2.56 | 5.37

Table 6: Comparison of WER with other semi-supervised methods on the LS-100 / LV setting.
Method | Stride | AM tokens | Criterion | LM | dev-clean | dev-other | test-clean | test-other
IPL [5] | 80ms | 5k wp | CTC | - | 4.35 | 7.90 | 5.07 | 8.84
IPL [5] | 80ms | 5k wp | CTC | word 4-gram + rescoring | 3.19 | 6.14 | 3.72 | 7.11
wav2vec 2.0 [58] | 20ms | letters | CTC | - | 3.3 | 6.5 | 3.1 | 6.3
wav2vec 2.0 [58] | 20ms | letters | CTC | word 4-gram | 1.8 | 4.5 | 2.3 | 4.6
wav2vec 2.0 [58] | 20ms | letters | CTC | word Transf. | 1.9 | 4.0 | 2.0 | 4.0
slimIPL, our | 40ms | 5k wp | S2S | - | 4.27 | 7.53 | 4.38 | 7.84
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram | 4.03 | 6.79 | 4.15 | 7.16
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram + rescoring | 3.51 | 5.89 | 3.63 | 6.42

5.5 Semi-Supervised Experiments with slimIPL

slimIPL architectures are identical to their supervised counterparts, except for their dropout and layer-drop values. These regularization parameters were decreased to "increase" model capacity, as more data is involved during the semi-supervised training process. Acoustic models are bootstrapped on supervised data only, until they reach a reasonable WER (within <100% relative WER of the supervised baseline), after which rounds of the pseudo-labeling procedure are performed on a regular basis during training. Following [5], we generate pseudo-labels at each round for all utterances when using LS-860 as unlabeled data. When LV is used as unlabeled data, around 25% of the unlabeled dataset is sampled each round and used as pseudo-labels.
Table 7: Comparison of WER with other semi-supervised methods on the LS-960 / LV setting.
Method | Stride | AM tokens | Criterion | LM | dev-clean | dev-other | test-clean | test-other
IPL [5] | 80ms | 5k wp | CTC | - | 2.05 | 4.12 | 2.21 | 4.71
IPL [5] | 80ms | 5k wp | CTC | word 4-gram + rescoring | 1.85 | 3.26 | 2.10 | 4.01
wav2vec 2.0 [58] | 20ms | letters | CTC | - | 2.1 | 4.5 | 2.2 | 4.5
wav2vec 2.0 [58] | 20ms | letters | CTC | word 4-gram | 1.4 | 3.5 | 2.0 | 3.7
wav2vec 2.0 [58] | 20ms | letters | CTC | word Transf. | 1.6 | 3.0 | 1.8 | 3.3
Improved T/S [8] | - | 1k wp | S2S | - | 1.6 | 3.7 | 1.7 | 3.7
Improved T/S [8] | - | 1k wp | S2S | LSTM | 1.6 | 3.4 | 1.7 | 3.4
slimIPL, our | 40ms | 5k wp | S2S | - | 1.91 | 3.97 | 2.08 | 4.21
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram | 1.83 | 3.59 | 2.01 | 3.90
slimIPL, our | 40ms | 5k wp | S2S | wp 6-gram + rescoring | 1.71 | 3.25 | 1.86 | 3.63

5.5.1 Experiments on LS-100 / LS-860 and LS-100 / LV

After reaching 25-35% word error rate on dev-other with supervised data, pseudo-label generation is enabled; pseudo-labels are re-generated after 20 epochs for CTC models (10 epochs for seq2seq models), and we then double the number of epochs before each subsequent pseudo-label generation. In total we perform re-labeling around 25-30 times.

slimIPL results can be found in Table 5. Both letter- and word-piece-based seq2seq models outperform (in WER) their supervised baseline counterparts on test-other, as well as previous work on IPL and the noisy teacher-student work [8] (a word-piece seq2seq model which performs 5 rounds of PL). These experiments confirm that we can benefit from bootstrapping the seq2seq models, effectively performing more pseudo-labeling rounds without training a model from scratch at every PL round (for CTC models this was shown for the IPL algorithm in [5]). slimIPL with a CTC model and the letter set reaches a state-of-the-art result with no LM on both test-clean and test-other, and is in the same ballpark as the state-of-the-art result from the wav2vec work [58] obtained with a language model.

Results for the LS-100 / LV setting are shown in Table 6. slimIPL outperforms the original IPL algorithm in WER on this setting, on both test-clean and test-other.

5.5.2 Experiments on LS-960 / LV

As for the LS-100 setups, slimIPL models are first trained on supervised data only until they reach reasonable WER performance (10-15% word error rate on dev-other). For both CTC and seq2seq models we generate pseudo-labels for the unlabeled data every 5 epochs. In total we perform re-labeling around 20 times. At this time we have performed slimIPL experiments only with word-piece-based seq2seq models. As shown in Table 7, slimIPL outperforms the regular IPL algorithm in WER, and approaches recent state-of-the-art results for semi- and unsupervised settings on test-clean and test-other.

5.6 Ablation Experiments

slimIPL differs from the original IPL [5] in two ways: (i) it does not rely on a language model at training time, and (ii) we introduced a filtering approach for seq2seq models. In the following, we investigate how critical these differences are.

5.6.1 Language Model at Training

When the IPL algorithm was introduced [5], over-fitting to the language model used during training was observed. A proposed work-around was to limit the language model weight in the beam-search decoding procedure when generating the pseudo-labels. In contrast to IPL, no language model is used during training with slimIPL, so we verify that slimIPL can take better advantage of a strong language model at inference time. Table 8 shows a larger WER boost with rescoring when using slimIPL instead of IPL.
Table 8: Ablation study for the LS-100 / LS-860 setting: comparison of WER between IPL (with n-gram beam-search decoding and Transformer LM rescoring for PL generation) and slimIPL (Viterbi pseudo-labels). Both models use a 40ms stride, 5k word-pieces, and the S2S criterion.
Method | LM | dev-other | test-other
IPL | - | 8.52 | 9.14
IPL | wp 6-gram | 8.46 | 9.24
IPL | wp 6-gram + rescoring | 7.77 | 8.39
slimIPL | - | 7.78 | 8.52
slimIPL | wp 6-gram | 7.45 | 8.08
slimIPL | wp 6-gram + rescoring | 6.8 | 7.6

5.6.2 Filtering

All our slimIPL results with seq2seq models in Section 5 were obtained with pseudo-label filtering. In our experience, IPL with seq2seq models diverges in the absence of filtering: during training, the number of pseudo-labels with either too-short transcriptions or n-gram repetitions grows quickly. In fact, after 2-3 re-generations more than 50% of pseudo-labels have these issues. In contrast, slimIPL is more robust, and does not blow up without pseudo-label filtering. However, a small WER performance boost is observed with filtering, as shown in Figure 3.

Figure 3: Word error rate on dev-other for two slimIPL seq2seq models with the word-piece set: trained without beak band filtering (grey) and with beak band filtering (orange) for the LS-100 / LS-860 setting.

6 Conclusion

We revisited one of the key components of recent pseudo-labeling successes in ASR, beam-search decoding with a language model, and proposed the slimIPL algorithm, where we iteratively re-generate pseudo-labels with hard labels as a single model learns. We demonstrate that slimIPL performs well across different token sets and loss functions. It substantially simplifies training compared to other semi-supervised and unsupervised approaches, while delivering competitive and state-of-the-art results in both standard and low-resource settings on the LibriSpeech test sets. At inference, slimIPL is less prone to language model over-fitting, and decoding with a strong language model is more beneficial than it is for IPL.
References

[1] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning,” 2006.
[2] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7669–7673.
[3] H. Scudder, “Probability of error of some adaptive pattern-recognition machines,” IEEE Transactions on Information Theory, vol. 11, no. 3, pp. 363–371, 1965.
[4] W.-N. Hsu, A. Lee, G. Synnaeve, and A. Hannun, “Semi-supervised speech recognition via local prior matching,” arXiv preprint arXiv:2002.10336, 2020.
[5] Q. Xu, T. Likhomanenko, J. Kahn, A. Hannun, G. Synnaeve, and R. Collobert, “Iterative pseudo-labeling for speech recognition,” arXiv preprint arXiv:2005.09267, 2020.
[6] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088.
[7] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, “End-to-end asr: from supervised to semi-supervised learning with modern architectures,” arXiv preprint arXiv:1911.08460, 2019.
[8] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le, “Improved noisy student training for automatic speech recognition,” arXiv preprint arXiv:2005.09629, 2020.
[9] I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan, “Billion-scale semi-supervised learning for image classification,” arXiv preprint arXiv:1905.00546, 2019.
[10] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
[11] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in 33rd Annual Meeting of the Association for Computational Linguistics, 1995, pp. 189–196.
[12] D. McClosky, E. Charniak, and M. Johnson, “Effective self-training for parsing,” in Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, 2006, pp. 152–159.
[13] R. Reichart and A. Rappoport, “Self-training for enhancement and domain adaptation of statistical parsers trained on small datasets,” in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007, pp. 616–623.
[14] Z. Huang and M. Harper, “Self-training pcfg grammars with latent annotations across languages,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 832–841.
[15] J. He, J. Gu, J. Shen, and M. Ranzato, “Revisiting self-training for neural sequence generation,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SJgdnAVKDH
[16] N. Ueffing, “Using monolingual source-language data to improve mt performance,” in International Workshop on Spoken Language Translation (IWSLT) 2006, 2006.
[17] J. Zhang and C. Zong, “Exploiting source-side monolingual data in neural machine translation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1535–1545.
[18] S. Novotney and R. Schwartz, “Analysis of low-resource acoustic model self-training,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
[19] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-end speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7084–7088.
[20] S. H. K. Parthasarathi and N. Strom, “Lessons from building acoustic models with a million hours of speech,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6670–6674.
[21] J. Pino, Q. Xu, X. Ma, M. J. Dousti, and Y. Tang, “Self-training for end-to-end speech translation,” arXiv preprint arXiv:2006.02490, 2020.
[22] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, 2013.
[23] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness, “Pseudo-labeling and confirmation bias in deep semi-supervised learning,” arXiv preprint arXiv:1908.02983, 2019.
[24] Y. Chen, W. Wang, and C. Wang, “Semi-supervised asr by end-to-end self-training,” arXiv preprint arXiv:2001.09128, 2020.
[25] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 912–919.
[26] B. Liu, Z. Wu, H. Hu, and S. Lin, “Deep metric transfer for label propagation with limited annotated data,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[27] A. Odena, “Semi-supervised learning with generative adversarial networks,” arXiv preprint arXiv:1606.01583, 2016.
[28] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in Neural Information Processing Systems, 2016, pp. 2352–2360.
[29] X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer, “S4l: Self-supervised semi-supervised learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1476–1485.
[30] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,” arXiv preprint arXiv:1904.12848, 2019.
[31] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” in Advances in Neural Information Processing Systems, 2019, pp. 5049–5059.
[32] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[33] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[34] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
[35] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” arXiv preprint arXiv:2006.09882, 2020.
[36] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
[37] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 858–867.
[38] W. He, Z. He, H. Wu, and H. Wang, “Improved neural machine translation with smt features,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[39] C. Gulcehre, O. Firat, K. Xu, K. Cho, and Y. Bengio, “On integrating a language model into neural machine translation,” Computer Speech & Language, vol. 45, pp. 137–148, 2017.
[40] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 86–96.
[41] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding back-translation at scale,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 489–500.
[42] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-based & neural unsupervised machine translation,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 5039–5049.
[43] A. Currey, A. V. Miceli-Barone, and K. Heafield, “Copied monolingual data improves low-resource neural machine translation,” in Proceedings of the Second Conference on Machine Translation, 2017, pp. 148–156.
[44] H. Schwenk, G. Wenzek, S. Edunov, E. Grave, and A. Joulin, “Ccmatrix: Mining billions of high-quality parallel sentences on the web,” 2020.
[45] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6704–6708.
[46] A. Baevski, M. Auli, and A. Mohamed, “Effectiveness of self-supervised pre-training for speech recognition,” arXiv preprint arXiv:1911.03912, 2019.
[47] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualized acoustic representations for semi-supervised speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6429–6433.
[48] A. H. Liu, H.-y. Lee, and L.-s. Lee, “Adversarial training of end-to-end speech recognition using a criticizing language model,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6176–6180.
[49] M. K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. Černockỳ, “Self-supervised sequence-to-sequence asr using unpaired speech and text,” arXiv preprint arXiv:1905.01152, 2019.
[50] L. Lamel, J.-L. Gauvain, and G. Adda, “Lightly supervised and unsupervised acoustic model training,” Computer Speech & Language, vol. 16, no. 1, pp. 115–129, 2002.
[51] F. Wessel and H. Ney, “Unsupervised training of acoustic models for large vocabulary continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 23–31, 2004.
[52] J. Ma and R. Schwartz, “Unsupervised versus supervised training of acoustic models,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
[53] H. Liao, E. McDermott, and A. Senior, “Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 368–373.
[54] V. Manohar, D. Povey, and S. Khudanpur, “Semi-supervised maximum mutual information training of deep neural network acoustic models,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[55] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[56] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
[57] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” Proc. Interspeech 2019, pp. 2613–2617, 2019.
[58] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
[59] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[60] K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” CoRR, vol. abs/2001.07685, 2020. [Online]. Available: https://arxiv.org/abs/2001.07685
[61] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, “Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring,” arXiv preprint arXiv:1911.09785, 2019.
[62] K. Imamura, A. Fujita, and E. Sumita, “Enhancement of encoder and attention using target monolingual corpora in neural machine translation,” in Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 2018, pp. 55–63.
[63] J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” Proc. Interspeech 2017, pp. 523–527, 2017.
[64] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
[65] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[66] R. Collobert, A. Hannun, and G. Synnaeve, “Word-level speech recognition with a letter to word encoder,” arXiv preprint arXiv:1906.04323, 2019.
[67] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[68] A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SylO2yStDr
[69] M. Schuster and K. Nakajima, “Japanese and korean voice search,” in International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 5149–5152.
[70] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.
[71] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[72] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, “wav2letter++: The fastest open-source speech recognition system,” arXiv preprint arXiv:1812.07625, 2018.
[73] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[74] T. Likhomanenko, G. Synnaeve, and R. Collobert, “Who needs words? lexicon-free speech recognition,” Proc. Interspeech 2019, pp. 3915–3919, 2019.
[75] K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[76] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Rwth asr systems for librispeech: Hybrid vs attention,” Proc. Interspeech 2019, pp. 231–235, 2019.
[77] R. Collobert, A. Hannun, and G. Synnaeve, “Word-level speech recognition with a letter to word encoder.”
[78] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, and S. Kumar, “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7829–7833.
[79] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, “Contextnet: Improving convolutional neural networks for automatic speech recognition with global context,” arXiv preprint arXiv:2005.03191, 2020.
[80] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.