Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization
Shun-Po Chuang†1, Heng-Jui Chang†2, Sung-Feng Huang1, Hung-yi Lee1,2
1 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
2 Department of Electrical Engineering, National Taiwan University, Taiwan
{f04942141, b06901020, f06942045, hungyilee}@ntu.edu.tw
† Equal contribution.

Abstract

Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people. However, the intra-sentence switching between two very different languages makes recognizing CS speech challenging. Meanwhile, the recently successful non-autoregressive (NAR) ASR models remove the need for the left-to-right beam decoding used in autoregressive (AR) models and achieve outstanding performance with fast inference speed. Therefore, in this paper, we take advantage of the Mask-CTC NAR ASR framework to tackle CS speech recognition. We propose changing the encoder's Mandarin output target to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information. Moreover, we propose word embedding label smoothing to regularize the decoder with contextualized information, and projection matrix regularization to bridge the gap between the encoder and decoder. We evaluate the proposed methods on the SEAME corpus and achieve exciting results.

Index Terms: code-switching, end-to-end speech recognition, non-autoregressive

1. Introduction

Code-switching (CS) is the phenomenon of using two or more languages within a sentence in text or speech. This practice is common in regions worldwide, especially in East and Southeast Asia, where people frequently use Mandarin-English CS speech in daily conversations; thus, developing technologies to recognize this type of speech is critical. Although humans can easily recognize CS speech, current automatic speech recognition (ASR) technologies perform poorly on it since these systems are mostly trained with monolingual data. Moreover, very few CS speech corpora are publicly available, making it more challenging to train high-performance ASR models for CS speech.

Various approaches have been studied to tackle the CS speech recognition problem, including language identity recognition [1–4] and data augmentation [5–8]. Many prior works exploited powerful end-to-end ASR technologies like listen, attend and spell [9] and RNN transducers [10]. However, these methods mostly require autoregressive (AR) left-to-right beam decoding, leading to longer processing time. Some studies also demonstrated that processing each language with its own encoder or decoder obtains better performance [11–13], but this multi-encoder-decoder architecture and AR decoding require more computation power and are thus less feasible for real-world applications. Therefore, in this paper, we leverage non-autoregressive (NAR) ASR technology to tackle this issue.

Unlike the currently very successful AR ASR models [14, 15], NAR ASR removes the left-to-right dependency by directly predicting the whole sequence at once or refining the output sequence in a constant number of iterations [16, 17]. NAR ASR can thus exploit parallel computing technologies for faster inference. This paper primarily adopts Mask-CTC [18, 19], a NAR ASR framework with a NAR conditional masked language model (CMLM) for iterative decoding, which achieves performance competitive with AR ASR baselines.

Although Mask-CTC provides exciting results, several shortcomings remain unaddressed. First, the CS task suffers from data scarcity [20, 21], and using Mask-CTC introduces numerous parameters that can lead to overfitting. We therefore propose word embedding label smoothing to incorporate additional textual knowledge, bringing more precise semantic and contextual information for a better regularization effect. Second, the current Mask-CTC framework trains the decoder without considering the encoder's recognition errors, which probably causes error propagation during inference. We suggest constraining the encoder's output projection layer to be similar to the decoder's input embedding layer to bridge the gap between the encoder and decoder. Moreover, Mask-CTC requires longer training time [18]; consequently, we propose changing the encoder's output target from Mandarin characters to Pinyin symbols to reduce training time and complexity, where Pinyin is a standardized way of transliterating Chinese characters into Latin letters. Each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation.
Using Pinyin symbols as the target allows the encoder to focus on acoustic modeling and reduces the vocabulary size.

Before this paper, Naowarat et al. [22] used a contextualized CTC loss to force NAR ASR to learn contextual information in CS speech recognition, whereas we utilize a CMLM to better model linguistic information. Also, Huang et al. [23] developed context-dependent label smoothing that removes impossible labels given the previous output token, while the proposed word embedding label smoothing is more straightforward and leverages knowledge from additional text data.

We conducted extensive experiments on the SEAME CS corpus [24] to evaluate our methods. We show that the proposed Pinyin-based Mask-CTC and the regularization methods benefit CS speech recognition and offer performance comparable to the AR baseline. Moreover, these methods offer exciting performance under the low-resource scenario.

2. Method

2.1. Mask-CTC

The Mask-CTC [18, 19] model is a NAR ASR framework that adopts the transformer encoder-decoder structure, but the decoder is a CMLM for NAR decoding.
[Figure 1: (a) The original Mask-CTC [18] framework and (b) the proposed Mask-CTC + Pinyin-to-Mandarin (P2M) decoder framework. The example "我(wo)很(hen) happy" means "I am very happy". The character "很(hen)" is misspelled as "狠(hen)", which has the same pronunciation but a different meaning; the final transformer decoder can recover the wrong character.]

First, in the training phase, given a transcribed audio-text pair (X, Y), the sequence X is encoded with a conformer encoder [14] and linearly projected to probability distributions over all possible vocabulary tokens V to minimize the CTC loss [25]. Then, some tokens Y_mask in Y are randomly masked with the special token <MASK>. The decoder is trained to predict the masked tokens conditioned on the observed tokens Y_obs = Y \ Y_mask and the encoder output. The Mask-CTC model is thus trained to maximize the log-likelihood

$\log P_{\mathrm{NAR}}(Y|X) = \alpha \log P_{\mathrm{CTC}}(Y|X) + (1-\alpha) \log P_{\mathrm{CMLM}}(Y_{\mathrm{mask}} \mid Y_{\mathrm{obs}}, X)$,  (1)

where $0 \le \alpha \le 1$ is a tunable hyper-parameter.

At the decoding stage, the encoder output sequence is first transformed by CTC decoding, denoted as $\hat{Y}$. Next, tokens in $\hat{Y}$ with probability lower than a specified threshold $P_{\mathrm{thres}}$ are masked with <MASK> for the CMLM to predict, denoted as $\hat{Y}_{\mathrm{mask}}$. Then, the masked tokens are gradually recovered by the CMLM in K iterations. In the k-th iteration, the $\lfloor |\hat{Y}_{\mathrm{mask}}|/K \rfloor$ most probable tokens in the masked sequence are recovered by calculating

$y_u = \arg\max_{y \in V} P_{\mathrm{CMLM}}(y_u = y \mid \hat{Y}^{(k)}_{\mathrm{obs}}, X)$,  (2)

where $y_u$ is the u-th output token and $\hat{Y}^{(k)}_{\mathrm{obs}}$ is the sequence including the unmasked tokens and the masked tokens recovered in the first (k − 1) iterations. The Mask-CTC ASR framework achieves performance close to AR ASR models but with significantly lower computation cost [18, 19].
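For concreteness, the following is a minimal sketch of the mask-predict decoding loop in Eq. (2), assuming a hypothetical `cmlm` callable that returns per-position output distributions and a pre-computed greedy CTC hypothesis with per-token confidences; it is illustrative rather than the ESPnet implementation.

```python
import torch

def mask_ctc_decode(enc_out, ctc_tokens, ctc_probs, cmlm, mask_id, p_thres=0.5, K=10):
    """Sketch of Mask-CTC inference: mask low-confidence CTC tokens with <MASK>,
    then let the CMLM recover floor(|Y_mask|/K) tokens per iteration (Eq. 2)."""
    y = ctc_tokens.clone()                       # greedy CTC output \hat{Y}, shape (T,)
    y[ctc_probs < p_thres] = mask_id             # mask tokens below P_thres
    per_iter = max(1, int((y == mask_id).sum()) // K)
    while (y == mask_id).any():
        probs = cmlm(y, enc_out)                 # (T, |V|) conditional distributions
        conf, pred = probs.max(dim=-1)
        conf[y != mask_id] = float("-inf")       # only consider masked positions
        n = min(per_iter, int((y == mask_id).sum()))
        fill = conf.topk(n).indices              # most confident masked positions
        y[fill] = pred[fill]                     # recover them; repeat until no masks
    return y
```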
2.2. Using Pinyin as Output Target

Here, we introduce the Pinyin-to-Mandarin (P2M) decoder for dealing with CS data and faster training, as shown in Fig. 1. One of the challenges of training Mandarin-English CS speech recognition models is that jointly modeling the two very different languages is tricky. Nearly 5k characters are frequently used in Mandarin, covering only approximately 400 pronunciations, or 1500 with tones involved. Many characters therefore share identical pronunciations, making ASR models prone to predicting incorrect characters with the correct pronunciation. Moreover, a larger vocabulary size makes CTC training difficult. We thus propose changing the encoder's targets from Mandarin characters to Pinyin. This idea could be applied to other languages that can be represented with symbols similar to the Pinyin system.

Each Mandarin character can be mapped to at least one Pinyin symbol representing its pronunciation, similar to phoneme representations. Replacing characters with Pinyin reduces the vocabulary size and allows the encoder to focus on learning acoustic modeling. The Pinyin system's pronunciation rules are similar to English since Pinyin symbols can be represented in Latin letters, providing a more intuitive way to utilize a pre-trained English ASR model for initialization.

P2M Decoder. Next, we propose a P2M decoder to map Pinyin back to character symbols using a single-layer decoder, as shown in the middle of Fig. 1. The P2M model is trained with Pinyin-Mandarin character sequence pairs conditioned on the encoder's output. Based on acoustic-contextualized information, the P2M model can translate a mixed Pinyin-English sequence to Mandarin-English because typical phonetic sequences in Mandarin usually differ from those in English. The original Mask-CTC uses Mandarin-English sentences to train the decoder; our approach instead applies the triplet (Pinyin, Mandarin, English) for training, providing more information to train the final Mandarin-English decoder. We encourage the P2M model to learn the Pinyin-to-Mandarin character translation with a random masking technique.
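As a concrete illustration of the character-to-Pinyin mapping, the snippet below uses the pypinyin package adopted in Sec. 3.1; how targets are normalized in the actual training recipe may differ.

```python
from pypinyin import lazy_pinyin, Style

# The Figure 1 example: "我很" maps to toneless Pinyin tokens.
print(lazy_pinyin("我很"))                       # ['wo', 'hen']

# Homophones such as 很 and 狠 collapse to the same symbol, even with tone
# numbers, so the P2M and CMLM decoders must disambiguate them from context.
print(lazy_pinyin("很狠", style=Style.TONE3))    # ['hen3', 'hen3']
```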
2.3. Word Embedding Label Smoothing Regularization

In this section, we introduce a novel label smoothing [26] method using pre-trained word embeddings.¹ As mentioned previously, the CMLM decoder has little contextualized information involved and is unaware of neighboring predictions during decoding. Hence, we wish to bring in more textual knowledge through label smoothing. Conventionally, label smoothing reduces the ground truth label's probability to 1 − ε for a small constant 0 < ε < 1, while the other possible labels are equally assigned a small constant probability. Although label smoothing is effective for regularization, it is incapable of exploiting the target's expression. Here we propose to distill knowledge from semantic-rich word embeddings. Word embeddings reflect that semantically similar and contextually relevant words have similar representations; leveraging such properties allows the model to learn semantic-contextual information implicitly.

¹ We noticed a related work [27] published right before the submission deadline. However, it mainly focuses on RNN-LM, while we focus on improving NAR CS ASR decoders.

In this paper, we determine the possible labels by calculating the cosine similarity between word embeddings. For a ground truth label $\hat{y}$, we first find the N labels with the highest cosine similarity among the pre-trained word embeddings e as

$D_N = \underset{y \in V \setminus \{\hat{y}\}}{\operatorname{arg\,top}_N} \cos\left(e(\hat{y}), e(y)\right)$,  (3)

where the $\operatorname{arg\,top}_N$ operator returns the set of N indices indicating the labels with the top N highest scores, and $\cos(\cdot, \cdot)$ returns the cosine similarity between its two inputs. The probabilities of each label can thus be written as

$P(y) = \begin{cases} 1 - \epsilon, & y = \hat{y} \\ \epsilon / N, & y \in D_N \\ 0, & \text{otherwise.} \end{cases}$  (4)

With the proposed label smoothing, we expect the decoders to learn more contextualized information. Moreover, the word embeddings can be trained with additional text data, introducing richer information to the decoder.
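A minimal sketch of Eqs. (3)-(4), assuming an embedding table `emb` aligned to the vocabulary (e.g., from fastText as described in Sec. 3.1); the real training code may precompute $D_N$ for every label and batch this construction.

```python
import torch
import torch.nn.functional as F

def embedding_label_smoothing(gt, emb, eps=0.1, N=10):
    """Eqs. (3)-(4): assign 1-eps to the ground truth label, eps/N to each of
    the N labels most cosine-similar to it in embedding space, 0 elsewhere."""
    sim = F.cosine_similarity(emb[gt].unsqueeze(0), emb, dim=-1)  # (|V|,)
    sim[gt] = float("-inf")               # exclude the ground truth itself (Eq. 3)
    top_n = sim.topk(N).indices           # D_N
    p = torch.zeros(emb.size(0))
    p[gt] = 1.0 - eps
    p[top_n] = eps / N
    return p                              # smoothed target distribution (Eq. 4)

# Usage sketch with a hypothetical |V| x d embedding table.
emb = torch.randn(5751, 300)
target = embedding_label_smoothing(gt=42, emb=emb, eps=0.1, N=10)
```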
2.4. Projection Matrix Regularization

Here, we propose regularizing the projection matrices in the ASR model to bridge the gap between the encoder and decoder. Although the encoder and decoder in the Mask-CTC model are trained jointly, the decoder is connected to the encoder solely through the cross-attention mechanism. Mask-CTC uses the greedily decoded sequence from the encoder as the decoder's input during inference, but the decoder is optimized with sequences free of recognition errors. This mismatch might degrade ASR performance; we propose to mitigate this problem by making the encoder output projection matrix and the decoder input embedding matrix similar to each other.

The last layer of the encoder is a matrix $W_{\mathrm{CTC}} \in \mathbb{R}^{d \times |V|}$ that linearly projects encoded features of dimension d to a probability distribution over all |V| possible output labels. The first layer of the decoder is the embedding layer that transforms one-hot vectors representing text tokens into continuous hidden features with a matrix $W_{\mathrm{emb}}^{T} \in \mathbb{R}^{d \times |V|}$. The matrices $W_{\mathrm{CTC}}$ and $W_{\mathrm{emb}}^{T}$ can be written as $[w_{\mathrm{CTC},1} \dots w_{\mathrm{CTC},|V|}]$ and $[w_{\mathrm{emb},1} \dots w_{\mathrm{emb},|V|}]$, respectively, where the w's are column vectors of dimension d. We hope these two matrices behave similarly, and we thus apply a cosine embedding loss to constrain them:

$\mathcal{L}_{\mathrm{ProjMatReg}} = \frac{1}{|V|} \sum_{v=1}^{|V|} \left[ 1 - \cos\left(w_{\mathrm{CTC},v}, w_{\mathrm{emb},v}\right) \right]$.  (5)

This loss function can also be applied to the two decoders to build a relation between the output layer of the P2M decoder and the input embedding layer of the CMLM. An alternative solution is to share the same weights between $W_{\mathrm{CTC}}$ and $W_{\mathrm{emb}}^{T}$; however, we found this severely damaged performance since the two layers have different objectives.

Overall, we can combine all the objective functions in the previous sections as

$\mathcal{L} = -\alpha \log P_{\mathrm{CTC}}(Y^{\mathrm{pyin}}|X) - (1-\alpha) \log P_{\mathrm{P2M}}(Y^{\mathrm{pyin}}_{\mathrm{mask}} \mid Y^{\mathrm{pyin}}_{\mathrm{obs}}, X) - (1-\alpha) \log P_{\mathrm{CMLM}}(Y^{\mathrm{char}}_{\mathrm{mask}} \mid Y^{\mathrm{char}}_{\mathrm{obs}}, X) + \beta \mathcal{L}_{\mathrm{ProjMatReg}}$,  (6)

where $Y^{\mathrm{pyin}}$ and $Y^{\mathrm{char}}$ respectively denote using Pinyin and characters as the Mandarin tokens.
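Eq. (5) translates directly into a few lines of PyTorch; this sketch assumes both weight matrices are stored in the (|V|, d) layout used by nn.Linear and nn.Embedding, so corresponding rows are the label vectors $w_{\mathrm{CTC},v}$ and $w_{\mathrm{emb},v}$.

```python
import torch
import torch.nn.functional as F

def proj_mat_reg(w_ctc: torch.Tensor, w_emb: torch.Tensor) -> torch.Tensor:
    """Eq. (5): average of 1 - cos(w_CTC_v, w_emb_v) over the vocabulary,
    pulling the encoder's output projection toward the decoder's embedding."""
    cos = F.cosine_similarity(w_ctc, w_emb, dim=-1)   # one similarity per label
    return (1.0 - cos).mean()

# Usage sketch with hypothetical sizes (d = 512, |V| = 5751).
ctc_proj = torch.nn.Linear(512, 5751, bias=False)     # weight: (|V|, d)
dec_emb = torch.nn.Embedding(5751, 512)               # weight: (|V|, d)
reg_loss = proj_mat_reg(ctc_proj.weight, dec_emb.weight)  # scaled by beta in Eq. (6)
```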
3. Experiment

3.1. Data

To evaluate the proposed methods for CS speech recognition, we used one CS corpus and three monolingual corpora.

SEAME Corpus: SEAME [24] is a Mandarin-English CS conversational speech corpus collected in Singapore. The two evaluation sets, devman and devsge, are biased toward Mandarin and English, respectively. We excluded all the testing data from the SEAME corpus and used the remainder for the train/validation sets; the statistics are listed in Table 1. The baseline model used subword units [28] for English and Chinese characters for Mandarin, resulting in a vocabulary size of 5751. The vocabulary contains 2704 Chinese characters, which are reduced to 390 Pinyin tokens for training the proposed model. We used the pypinyin package² for the Pinyin-Mandarin mapping.

² https://pypi.org/project/pypinyin/

Table 1: The duration and the composition of Mandarin, English, and code-switching utterances in the training/validation/testing sets of the SEAME corpus.

                    train    val     devman   devsge
  Duration (hours)  114.6    6.0     7.5      3.9
  Mandarin          19.3%    19.1%   19.9%    8.9%
  English           23.8%    24.4%   12.4%    49.8%
  Code-switching    56.9%    56.5%   67.7%    41.3%

Monolingual Datasets: We used the LibriSpeech [29] English corpus with approximately 960 hours of training data for model pre-training. For training the word embeddings, we combined SEAME, the TEDLIUM2 English corpus [30], and the AISHELL Mandarin corpus [31] to train a skip-gram model using the fastText toolkit [32]. We chose TEDLIUM2 rather than LibriSpeech for training word embeddings since it has shorter and more spontaneous utterances, similar to SEAME.

3.2. Model

All experiments were based on the ESPnet toolkit [33] and followed its training recipe. Audio features were extracted as globally mean-variance normalized 83-dimensional log-Mel filterbank and pitch features. Speed perturbation [34] and SpecAugment [35] were applied throughout the training process. Conventional label smoothing was applied in all experiments except those with the proposed label smoothing technique. The encoder architecture for both the AR and NAR ASR models was a 12-layer conformer encoder [14] with a dimension of 512 and 8 attention heads per layer. The P2M decoder was a 1-layer transformer decoder, and the iterative refinement decoder was a 6-layer transformer decoder [36]. All decoders had a feed-forward dimension of 2048. We evaluated the performance of our models with token error rate (TER) and real-time factor (RTF), where the tokens refer to Mandarin characters and English words. RTF is a commonly used indicator for demonstrating the fast-inference advantage of NAR models over AR models. The ASR models in our experiments were all initialized with a pre-trained conformer ASR provided by ESPnet. We set the hyper-parameters in all experiments to α = 0.3 and β = 10⁻⁴, and set ε = 0.1 and N = 10 for the label smoothing loss. We observed that the iterative refinement step provided no gain in ASR performance for either the original or the proposed Mask-CTC framework at the inference stage; therefore, we directly predict the output with a single pass through the decoder, without the iterative decoding described in [18].

3.3. Proposed Pinyin Decoder and Regularization Methods

This section investigates the effectiveness of the proposed model and regularization methods using all training data in SEAME. We tested the models with the best validation loss; the results are listed in Sec. (I) of Table 2.

Table 2: TERs (%) and RTF of the AR and NAR ASR models trained with all data from SEAME. The Reg Methods in row (d) are the word embedding label smoothing and the projection matrix regularization methods applied jointly. Model averaging was applied to rows (e) and (h).

  Method                       devman   devsge   RTF
  (I) Non-autoregressive
  (a) Mask-CTC                 16.5     24.4     0.017
  (b) +P2M (w/o Pinyin)        16.6     24.4     0.017
  (c) +P2M (w/ Pinyin)         16.3     24.0     0.017
  (d)   + Reg Methods          16.0     24.1     0.017
  (e)     + Model Avg          15.3     22.3     0.017
  (II) Autoregressive
  (f) Transformer-T [12]       18.5     26.3     -
  (g) Multi-Enc-Dec [11]       16.7     23.1     -
  (h) Conformer (Ours)         14.3     20.6     1.031

We first set two baselines: one with the original Mask-CTC (row (a)), and the other with the proposed architecture but using Mandarin characters instead of Pinyin tokens as the intermediate representation (row (b)). Switching the encoder's target to Pinyin improved the ASR performance (row (c) vs. (a)(b)), indicating that introducing Pinyin as the intermediate representation effectively transferred the learning burden from the encoder to the decoders. Besides, the proposed regularization methods achieved a better result on the devman set and a comparable result on the devsge set (row (d) vs. (c)). To decrease the variance of the model's prediction, we selected the top five checkpoints with the highest validation accuracy for averaging [37], resulting in the best NAR ASR performance (row (e)).
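The checkpoint averaging used in rows (e) and (h) can be sketched as follows; the file names and state-dict layout here are hypothetical, not ESPnet's exact snapshot format.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints (e.g., the top five by
    validation accuracy) to reduce the variance of the model's predictions."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage sketch with hypothetical checkpoint names.
avg_state = average_checkpoints([f"exp/snapshot.ep.{i}" for i in range(96, 101)])
```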
To highlight the benefits of using NAR ASR models, we list the performance of AR ASR models in Sec. (II) of Table 2 for comparison. The implemented AR baseline used beam search with a beam size of 10 for decoding and surpassed the previous SOTA results [11, 12] (row (h) vs. (f)(g)), providing a solid reference. The best NAR model offered good performance with a small gap to the best AR model (row (e) vs. (h)) but with a significant 60× speedup (the RTF column), implying that the NAR model can recognize CS speech while possessing fast inference speed. Overall, we have shown that NAR ASR models incorporating the proposed methods achieve exciting results.

3.4. Low-resource Scenario

To shed more light on the proposed regularization techniques, we conducted experiments on much lower-resourced CS ASR tasks. In practice, transcribed CS speech data are difficult to obtain; therefore, we simulated the low-resource scenario with different amounts (60/30/10%) of available training data in SEAME to evaluate the proposed methods. The results are shown in Table 3.

Table 3: TERs (%) in the low-resource scenario, where different amounts of data were available for the Pinyin Mask-CTC model w/ and w/o the proposed regularization methods. The model averaging technique was applied to all results.

  Data   Method            devman   devsge
  10%    (a) Pinyin        42.2     50.7
         (b) w/ Proposed   41.7     50.2
  30%    (c) Pinyin        24.8     30.8
         (d) w/ Proposed   24.0     30.2
  60%    (e) Pinyin        18.2     24.4
         (f) w/ Proposed   17.7     23.8

The proposed regularization approaches always offered better performance on the SEAME corpus regardless of the given amount of training data (rows (a) vs. (b); (c) vs. (d); (e) vs. (f)). Note that SpecAugment, a strong data augmentation technique capable of moving ASR models from overfitting to underfitting, was applied in all experiments. Although it already provided good results, the proposed methods further improved ASR performance.

Overall, we demonstrated that the proposed Pinyin model and regularization methods improve low-resource training. These methods could also be applied to other tasks in limited-data situations.

3.5. Ablation Studies

To verify the efficacy of the proposed regularization methods, we conducted ablation studies under the low-resource scenario with only 10% of the training data. The performance of the baseline model is shown in row (a) of Table 4. Applying only the proposed label smoothing method of Sec. 2.3 brought an improvement (row (b) vs. (a)), showing that leveraging textual knowledge from additional text data was beneficial under the low-resource scenario. With only the proposed projection matrix regularization technique of Sec. 2.4 applied, the model obtained limited improvement (row (c) vs. (a)). Nevertheless, more improvement was achieved by applying both proposed methods (row (d)), reaching the best performance. This shows that the projection matrix regularization technique benefited from the proposed label smoothing method and closed the gap between the encoder and decoders. The ablation study showed that the two methods are compatible with each other and both contribute to improving the recognition performance.

Table 4: Ablation studies of the proposed methods, shown in TER (%). EmbLS indicates the proposed word embedding label smoothing method, while MatReg is the projection matrix regularization method. The overall TER is in the All column.

        EmbLS   MatReg   devman   devsge   All
  (a)   ✗       ✗        42.2     50.7     45.3
  (b)   ✓       ✗        42.0     50.1     44.9
  (c)   ✗       ✓        42.2     50.4     45.2
  (d)   ✓       ✓        41.7     50.2     44.8

4. Conclusion

This paper introduces a novel non-autoregressive ASR framework that adds a Pinyin-to-Mandarin decoder to the Mask-CTC ASR model to address the Mandarin-English code-switching speech recognition problem. We also propose word embedding label smoothing to provide contextual information to the conditional masked language model, and a projection matrix regularization method to bridge the gap between the encoder and decoders. We demonstrated the effectiveness of our methods with exciting performance on the SEAME corpus. The new ASR framework and regularization methods have the potential to improve various speech recognition scenarios.
5. References

[1] S. Zhang, J. Yi, Z. Tian, J. Tao, and Y. Bai, "RNN-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition," in ISCSLP, 2021.
[2] K. Li, J. Li, G. Ye, R. Zhao, and Y. Gong, "Towards code-switching ASR for end-to-end CTC models," in ICASSP, 2019.
[3] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, "Investigating end-to-end speech recognition for Mandarin-English code-switching," in ICASSP, 2019.
[4] Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, "On the end-to-end solution to Mandarin-English code-switching speech recognition," in INTERSPEECH, 2019.
[5] C. Du, H. Li, Y. Lu, L. Wang, and Y. Qian, "Data augmentation for end-to-end code-switching speech recognition," in SLT, 2021.
[6] Y. Long, Y. Li, Q. Zhang, S. Wei, H. Ye, and J. Yang, "Acoustic data augmentation for Mandarin-English code-switching speech recognition," Applied Acoustics, vol. 161, 2020.
[7] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020.
[8] C.-T. Chang, S.-P. Chuang, and H.-y. Lee, "Code-switching sentence generation by generative adversarial networks and its application to data augmentation," in INTERSPEECH, 2019.
[9] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP, 2016.
[10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Workshop on Representation Learning, 2012.
[11] X. Zhou, E. Yılmaz, Y. Long, Y. Li, and H. Li, "Multi-encoder-decoder transformer for code-switching speech recognition," in INTERSPEECH, 2020.
[12] S. Dalmia, Y. Liu, S. Ronanki, and K. Kirchhoff, "Transformer-transducers for code-switched speech recognition," in ICASSP, 2021.
[13] Y. Lu, M. Huang, H. Li, J. Guo, and Y. Qian, "Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts," in INTERSPEECH, 2020.
[14] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in INTERSPEECH, 2020.
[15] C. Liu, F. Zhang, D. Le, S. Kim, Y. Saraf, and G. Zweig, "Improving RNN transducer based ASR with auxiliary tasks," in SLT, 2021.
[16] E. A. Chi, J. Salazar, and K. Kirchhoff, "Align-refine: Non-autoregressive speech recognition via iterative realignment," arXiv preprint arXiv:2010.14233, 2020.
[17] N. Chen, S. Watanabe, J. Villalba, and N. Dehak, "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition," arXiv preprint arXiv:1911.04908, 2019.
[18] Y. Higuchi, S. Watanabe, N. Chen, T. Ogawa, and T. Kobayashi, "Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict," in INTERSPEECH, 2020.
[19] Y. Higuchi, H. Inaguma, S. Watanabe, T. Ogawa, and T. Kobayashi, "Improved Mask-CTC for non-autoregressive end-to-end ASR," in ICASSP, 2021.
[20] Y. Sharma, B. Abraham, K. Taneja, and P. Jyothi, "Improving low resource code-switched ASR using augmented code-switched TTS," in INTERSPEECH, 2020, pp. 4771–4775. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-2402
[21] S.-P. Chuang, T.-W. Sung, and H.-y. Lee, "Training code-switching language model with monolingual data," in ICASSP, 2020.
[22] B. Naowarat, T. Kongthaworn, K. Karunratanakul, S. H. Wu, and E. Chuangsuwanich, "Reducing spelling inconsistencies in code-switching ASR using contextualized CTC loss," arXiv preprint arXiv:2005.07920, 2020.
[23] Z. Huang, P. Li, J. Xu, P. Zhang, and Y. Yan, "Context-dependent label smoothing regularization for attention-based end-to-end code-switching speech recognition," in ISCSLP, 2021.
[24] D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, "SEAME: A Mandarin-English code-switching speech corpus in South-East Asia," in INTERSPEECH, 2010.
[25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[27] M. Song, Y. Zhao, S. Wang, and M. Han, "Word similarity based label smoothing in RNNLM training for ASR," in SLT, 2021.
[28] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in EMNLP: System Demonstrations, 2018.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in ICASSP, 2015.
[30] A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks," in LREC, 2014.
[31] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in O-COCOSDA, 2017.
[32] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[33] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018.
[34] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015.
[35] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in INTERSPEECH, 2019.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[37] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., "A comparative study on transformer vs RNN in speech applications," in ASRU, 2019.