Improving Distinction between ASR Errors and Speech Disfluencies with Feature Space Interpolation
Seongmin Park, Dongchan Shin, Sangyoun Paik, Subong Choi, Alena Kazakova, Jihwa Lee
ActionPower, Seoul, Republic of Korea
{seongmin.park, dongchan.shin, sangyoun.paik, subong.choi, alena.kazakova, jihwa.lee}@actionpower.kr

arXiv:2108.01812v1 [cs.CL] 4 Aug 2021

Abstract

Fine-tuning pretrained language models (LMs) is a popular approach to automatic speech recognition (ASR) error detection during post-processing. While error detection systems often take advantage of statistical language archetypes captured by LMs, at times the pretrained knowledge can hinder error detection performance. For instance, the presence of speech disfluencies might confuse the post-processing system into tagging disfluent but accurate transcriptions as ASR errors. Such confusion occurs because both error detection and disfluency detection tasks attempt to identify tokens at statistically unlikely positions. This paper proposes a scheme to improve existing LM-based ASR error detection systems, both in terms of detection scores and resilience to such distracting auxiliary tasks. Our approach adopts the popular mixup method in text feature space and can be utilized with any black-box ASR output. To demonstrate the effectiveness of our method, we conduct post-processing experiments with both traditional and end-to-end ASR systems (for both English and Korean) on 5 different speech corpora. We find that our method both improves ASR error detection F1 scores and reduces the number of correctly transcribed disfluencies wrongly detected as ASR errors. Finally, we suggest methods to utilize the resulting LMs directly in semi-supervised ASR training.

Index Terms: disfluency detection, error detection, post-processing

1. Introduction

Post-processing is the final auditor of an automatic speech recognition (ASR) system before its output is presented to consumers or downstream tasks. Two notable tasks often entrusted to the post-processing step are ASR error detection [1, 2, 3] and speech disfluency detection [4, 5, 6].

A recent trend in post-processing is to fine-tune pretrained transformer-based language models (LMs) such as BERT [7] or GPT-2 [8] for designated tasks. Utilizing language patterns captured by huge neural networks during their unsupervised pretraining yields considerable advantages in downstream language tasks [9, 10, 11]. When using pretrained LMs, both ASR error correction and speech disfluency detection tasks employ a similar setup, where the LM is further fine-tuned with task-specific data and labels, often in a sequence-labeling scheme [12, 13]. Disfluency in speech, by definition, consists of sequences of tokens unlikely to appear together. ASR errors are statistically likely to introduce tokens heavily out of context. Both descriptions resemble the pretraining objective of LMs, which is to learn a language's statistical landscape.

From such observations, a natural question follows: in a situation where the objectives of two post-processing tasks are mutually exclusive (but both attune to the LM pretraining objective), will utilizing a pretrained LM benefit a single task? Such concerns arise especially in production settings, where customer orations tend to contain significantly more disfluencies than the audio found in academic benchmarks. This situation is illustrated in Figure 1. In the figure, ASR output of disfluent speech (ASR output) is compared with its reference transcript (Ground truth). Words like "and" and "um" in the reference are annotated as disfluent. The post-processing system must only detect ASR errors (red and blue), while refraining from detecting merely disfluent speech (green) as ASR errors.

Figure 1: Demonstration of speech disfluency and ASR error.

This paper proposes a regularization scheme that improves ASR error detection performance by weakening an LM's propensity to tag accurate transcriptions of disfluent speech as ASR errors. Our method is applied during the LM fine-tuning stage. Resulting LMs have been applied as drop-in replacements in existing ASR error correction systems in production.

Furthermore, LM artifacts from our method can be utilized beyond post-processing. Recent advancements in semi-supervised ASR training take advantage of deep LMs to create pseudo-labels for unlabeled audio [14, 15]. To ensure the quality of synthetic labels, [14] uses an LM during beam-search decoding; [15] filters created labels by length-normalized language model scores. In both cases, an LM robust to speech disfluencies is essential in improving transcription accuracy.

2. Related work

2.1. ASR post-processing

Post-processing is a task that complements ASR systems by analyzing their output. Compared to tasks that directly manipulate components within the ASR system, the black-box approach of ASR post-processing yields several advantages. It is often intractable for the LM inside an ASR system to account for and rank all hypotheses presented by the acoustic model [?]. A post-processing system is free from such constraints within an ASR system [1]. ASR post-processing also helps improve performance in downstream tasks [16] as well as domain adaptation flexibility [17].
2.1.1. ASR error detection

One common approach to ASR error detection is through machine translation. Machine translation approaches to error detection commonly interpret error detection as a sequence-to-sequence task, and construct a system that translates erroneous transcriptions into correct sentences [3, 11, 18]. [3] suggests a method applicable to off-the-shelf ASR systems. [11] and [18] translate acoustic model outputs directly to grammatical text. In some domains, however, the necessity of huge parallel training data can render translation-based error detection infeasible [19]. Another popular approach models ASR error detection as a sequence labeling problem. This approach requires far fewer training samples compared to sequence-to-sequence systems. Both [9] and [?] use a large, pretrained deep LM to gauge the likelihood of each output token from an ASR system.

2.1.2. Speech disfluency detection

The aim of a disfluency detection task is to separate well-formed sentences from their unnatural counterparts. Disfluencies in speech include deviant language such as hesitation, repetition, correction, and false starts [5]. Speech disfluencies display regularities that can be detected with statistical systems [4]. While various approaches such as semi-supervised [6] and self-supervised [20, 21] methods have been proposed, supervised training on pretrained LMs remains state-of-the-art [10, 12]. Both [10] and [12] fine-tune deep LMs for sequence tagging.

2.2. Data interpolation for regularization

Interpolating existing data for data augmentation is a popular method for regularization. The mixup method suggested in [22] has been particularly effective, especially in the computer vision domain [23, 24, 25]. Given a dataset D = {(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)} consisting of (data, label) pairs, mixup "mixes" two random training samples (x_i, y_i) and (x_j, y_j) to generate a new training sample (x_m, y_m):

    x_m = x_i · λ + x_j · (1 − λ)    (1)
    y_m = y_i · λ + y_j · (1 − λ)    (2)

where λ ∼ Beta(α, α). Here α is a hyperparameter of the mixup operation that characterizes the Beta distribution from which λ is randomly sampled.

[26] suggests manifold mixup to utilize mixup in the feature space after the data has been passed through a vectorizing layer, such as a deep neural network (DNN). Whereas mixup explicitly generates new data by interpolating existing samples in the data space, manifold mixup adds a regularization layer to an existing architecture that mixes data in its vectorized form. In essence, mixup and its variants discourage a model from making extreme decisions. A hidden space visualization in [26] shows that correct targets and non-correct targets are better separated within the mixup model because it learns to handle intermediate values instead of making wild guesses in the face of uncertainty.

Because text is both ordered and discrete, mixup with text data is best applied in feature space. The magnitude of a token id in a tokenized text sequence does not convey any relative meaning, whereas in an image, varying the degrees of continuous pixel values does gradually manipulate its properties.

While mixup is yet to be widely studied in speech-to-text or natural language processing domains, several studies have found success in applying the method to text data. [27] constructs a new token sequence by randomly selecting a token from two different sequences at each index. [28] suggests applying mixup in feature space, after an intermediary layer of a pretrained LM. New input data is then generated by reversing the synthetic feature to find the most similar token in the vocabulary. While these two works involve interpolating text inputs, our method differs significantly in that we do not directly generate augmented training samples; instead, we utilize mixup as a regularizing layer during the training process. Our method also does not require reversing word embeddings or the discriminative filtering using GPT-2 introduced in [28].

[29], the work most similar to ours, applies manifold mixup to text data to calibrate a model to work best on out-of-distribution data. However, that study focuses on sentence-level classification, which requires a substantially different feature extraction strategy from the work in this paper. Our paper suggests a token-level mixup method, and aims to gauge the impact of single-task (ASR error detection) regularization on an auxiliary task (speech disfluency detection).

3. Methodology

3.1. BERT for sequence tagging

To utilize BERT for a sequence labeling problem, a token classification head (TCH) must be added after BERT's final layer. There is no restriction on what constitutes a TCH. In [12], for example, a combination of a conditional random field and a fully connected network was attached to BERT's output as a TCH. In our study, a simple fully-connected layer with the same number of inputs and outputs as the original input token sequence suffices as a TCH, because our aim is to measure the effectiveness of regularization that takes place between BERT and the TCH. We modify a reference BERT implementation [30] with a TCH.

A single data sample during the supervised training stage consists of an input token sequence x = x_0, x_1, ..., x_n and a sequence of labels for each token, y = y_0, y_1, ..., y_n. Each corresponding label has a value of 1 if the token contains a character with erroneous ASR output, and 0 if the token contains no wrong character. All input sequences are tokenized into subword tokens [31] before entering BERT. For the rest of this paper, boldfaced letters denote vectors.

3.2. Feature space interpolation

Our proposed method adds a regularizing layer between BERT's output and the token classification head. BERT produces a hidden state sequence h, with length equal to that of the input token sequence x:

    h = h_0, h_1, ..., h_n = BERT(x)    (3)

From h we obtain h_shuffled, a randomly permuted version of h. Each corresponding label in y is also rearranged accordingly, creating a new label vector y_shuffled. The same index-wise shuffle applied to h applies to y, such that each label y_i originally assigned to an input token x_i follows the new index of that token during the shuffling process:

    h_shuffled = h_{s_0}, h_{s_1}, ..., h_{s_n}
    y_shuffled = y_{s_0}, y_{s_1}, ..., y_{s_n}

where s_0, ..., s_n = RandomPermutation(0, ..., n).

Next, we apply the mixup process element-wise to h and h_shuffled, according to equations (1) and (2):

    h_new = (h · λ) + (h_shuffled · (1 − λ))    (4)

Similarly, we also mix the original label vector y with its permuted counterpart y_shuffled:

    y_new = (y · λ) + (y_shuffled · (1 − λ))    (5)

where λ ∼ Beta(α, α). This process "softens" the originally binary labels into the range [0, 1]. For the mixup hyperparameter α, we empirically chose a value of 0.2. Figure 3 shows the experimentally observed relationship between α and the normalized counts of the fine-tuned system erroneously tagging disfluent speech as ASR errors.

h_new is propagated to the token classification head, whose prediction loss is calculated using binary cross entropy against y_new, instead of the original ground truth label y.

Figure 2: Overview of feature-space regularization for text.

Figure 3: Effect of α and normalized counts at which the system wrongly detects disfluent tokens as ASR errors. Regions within 1 variance of the average are shaded.
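The regularization above amounts to a small layer between the encoder and the TCH. The following is a minimal sketch of that training step, assuming PyTorch and the HuggingFace Transformers library [30]; the model name, the single-logit per-token head, the batch-shared permutation, and the omission of padding handling are our illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of the feature-space interpolation of Section 3.2 (equations (3)-(5)),
# assuming PyTorch and the HuggingFace Transformers BERT encoder [30]. Model name,
# head design, and batch handling are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MixupErrorTagger(nn.Module):
    def __init__(self, model_name="bert-base-uncased", alpha=0.2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # Token classification head (TCH): one error/no-error logit per subword token.
        self.tch = nn.Linear(self.bert.config.hidden_size, 1)
        self.alpha = alpha
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attention_mask, labels=None, apply_mixup=True):
        # Equation (3): h = BERT(x), shape (batch, seq_len, hidden_size).
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

        if self.training and apply_mixup and labels is not None:
            # lambda ~ Beta(alpha, alpha), resampled for every batch.
            lam = torch.distributions.Beta(self.alpha, self.alpha).sample().item()
            # One random permutation of token positions, shared by h and y
            # (and, for brevity, by all sequences in the batch; padding is ignored here).
            perm = torch.randperm(h.size(1), device=h.device)
            h_shuffled = h[:, perm, :]
            y_shuffled = labels[:, perm]
            # Equations (4) and (5): interpolate features and soften the binary labels.
            h = h * lam + h_shuffled * (1.0 - lam)
            labels = labels * lam + y_shuffled * (1.0 - lam)

        logits = self.tch(h).squeeze(-1)  # (batch, seq_len)
        if labels is None:
            return logits
        return self.loss_fn(logits, labels.float()), logits
```

A short usage example under the same assumptions:

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MixupErrorTagger()
batch = tokenizer(["the quick brown focks jumped"], return_tensors="pt")
labels = torch.zeros(batch["input_ids"].shape, dtype=torch.float)  # 1.0 where a subword contains an ASR error
loss, _ = model(batch["input_ids"], batch["attention_mask"], labels)
loss.backward()
```

Passing apply_mixup=False recovers plain fine-tuning with the same head, which corresponds to the "No mixup" condition compared in Table 2.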
4. Experiments

4.1. Experiment setup

To obtain ASR outputs on which to perform error detection, we process multiple speech corpora through black-box ASR systems. Detailed descriptions of the datasets used are provided in the next section. Within the decoded transcript, we tag subword token indices that contain ASR errors to be used as labels during the post-processing fine-tuning. We also tag indices in the transcription with speech disfluencies. To demonstrate the effectiveness of our method across various off-the-shelf ASR environments, we transcribe all available speech corpora with both a traditional ASR system and an end-to-end (E2E) ASR system. For English corpora, we utilize a pretrained LibriSpeech Kaldi [32] model (a factorized time-delay neural network-based chain model) for the traditional system, and a reference wav2letter++ (w2l++) [33] system for the E2E system. For Korean corpora, we use the same respective architectures trained from scratch on internal data.

To create error-labeled decoded transcripts, each decoded transcript is compared with its corresponding ground truth reference transcription. We use the Wagner-Fischer algorithm [34] to build a shortest-distance edit matrix from the decoded transcript to its reference transcription. Traversing the edit matrix from the lowest, rightmost column to the highest, leftmost column yields a sequence of delete, replace, and insert instructions to convert the decoded transcript into the reference transcript. Each character instructed to be deleted or replaced in the decoded transcript is marked as an ASR error.

Disfluency labels in reference transcripts must be transferred over to corresponding positions within the decoded transcripts. Since the introduction of ASR errors causes the decoded transcripts to be mere approximations of their ground truth transcripts, mapping each character index from the reference transcript to the decoded transcript is not a rigorously defined process. Here we again utilize the Wagner-Fischer algorithm, mapping only the character indices in the decoded transcript without any edit instructions to their corresponding indices in the reference transcript.

We judge the effectiveness of our regularization method by two criteria: how much it helps in error detection (Error detection F1), and how much it reduces the number of accurately transcribed disfluent speech tagged as errors (Disfluencies wrongly tagged).
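To make the labeling procedure concrete, the sketch below builds the Wagner-Fischer edit matrix [34] between a decoded transcript and its reference with unit edit costs, backtraces it into edit operations, flags characters that would have to be deleted or replaced as ASR errors, and keeps the index mapping of untouched characters, which is what the disfluency-label transfer relies on. The function names and the unit-cost assumption are ours, not taken from the paper's codebase.

```python
# A simplified sketch of the character-level labeling in Section 4.1, assuming unit edit costs.
def wagner_fischer_ops(decoded: str, reference: str):
    """Return edit ops turning `decoded` into `reference` as (op, i, j) tuples,
    where op is 'equal', 'replace', 'delete', or 'insert'; i indexes `decoded`,
    j indexes `reference` (only i is meaningful for 'delete', only j for 'insert')."""
    n, m = len(decoded), len(reference)
    # dist[i][j] = edit distance between decoded[:i] and reference[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if decoded[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete decoded[i-1]
                             dist[i][j - 1] + 1,        # insert reference[j-1]
                             dist[i - 1][j - 1] + sub)  # keep or replace
    # Backtrace from the bottom-right corner of the matrix.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (decoded[i - 1] != reference[j - 1])):
            ops.append(("equal" if decoded[i - 1] == reference[j - 1] else "replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    return list(reversed(ops))

def label_asr_errors(decoded: str, reference: str):
    """Mark decoded-character indices touched by a delete or replace as ASR errors,
    and map untouched ('equal') indices to their reference positions so that
    reference-side disfluency annotations can be transferred."""
    error_chars = set()
    decoded_to_reference = {}
    for op, i, j in wagner_fischer_ops(decoded, reference):
        if op in ("delete", "replace"):
            error_chars.add(i)
        elif op == "equal":
            decoded_to_reference[i] = j
    return error_chars, decoded_to_reference
```

Character-level error flags can then be projected onto subword tokens (a token is labeled 1 if any of its characters is flagged), matching the token-level labels described in Section 3.1, while the decoded-to-reference mapping carries disfluency annotations over to the decoded side.

The two reported criteria can then be computed directly from per-token flags. The short sketch below assumes binary predictions and reads "Disfluencies wrongly tagged" as correctly transcribed (error-free) tokens inside disfluent spans that the model nevertheless predicts as errors; this reading, the variable names, and the use of scikit-learn are our assumptions.

```python
# A small sketch of the two evaluation criteria over per-subword-token 0/1 flags.
from sklearn.metrics import f1_score

def evaluate(pred_error, gold_error, gold_disfluent):
    """pred_error, gold_error, gold_disfluent: lists of 0/1 flags per subword token."""
    error_f1 = f1_score(gold_error, pred_error)
    # Correctly transcribed disfluencies wrongly tagged as ASR errors:
    # disfluent, error-free tokens that the model still predicts as errors.
    wrongly_tagged = sum(p == 1 and e == 0 and d == 1
                         for p, e, d in zip(pred_error, gold_error, gold_disfluent))
    return error_f1, wrongly_tagged
```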
4.2. Description of datasets

We tested our proposed methodology on five different speech datasets (4 English, 1 Korean). All speech corpora were provided with reference transcripts hand-annotated with speech disfluency positions.

4.2.1. SCOTUS

The SCOTUS corpus [35] consists of seven oral arguments between justices and advocates. Nested disfluency tags suggested in [4] were flattened and hand-annotated.

4.2.2. CallHome

The CallHome corpus [36] is a collection of telephone calls. Each participant in the recording was asked to call a person of their choice. Similar to SCOTUS, nested disfluency tags were flattened and hand-annotated [35].

4.2.3. AMI Meeting Corpus (AMI)

The AMI Meeting Corpus [37] contains recordings of multiple-speaker meetings on a wide range of subjects. Multiple types of crowd-sourced annotations are available, including speech disfluencies, body gestures, and named entities.

4.2.4. ICSI Meeting Corpus (ICSI)

The ICSI Meeting Corpus [38] is a collection of multi-person meeting recordings. It provides the highest count of disfluency-annotated transcripts out of the five corpora used in this paper.

4.2.5. KsponSpeech (Kspon)

KsponSpeech [39] is a collection of spontaneous Korean speech, recorded in segments of under 10 seconds. Each recording is labeled with disfluencies and mispronunciations.

Table 1: Statistics for each corpus.

Corpus      Training samples            Token-level error rate
            E2E        Traditional      E2E        Traditional
SCOTUS      1063       2163             0.12       0.12
CallHome    1420       3058             0.43       0.48
AMI         4062       6963             0.35       0.45
ICSI        16389      24165            0.10       0.22
Kspon       2406       2406             0.06       0.15

ASR output for each corpus was randomly split into 8:1:1 train:validation:test splits. Table 1 describes the characteristics of the ASR outputs with which we fine-tune BERT. Training samples denotes the number of subword-tokenized sequences obtained from each ASR output. Token-level error rate shows the rate at which subword tokens contain errors in transcription. This latter statistic is purely concerned with processing ASR outputs, and thus conveys a slightly different meaning than conventional ASR error metrics such as character error rate. Most importantly, words missing in ASR outputs (compared to ground-truth transcriptions) are not tallied in Table 1's token-level error rate. This omission was introduced on purpose, because at inference time our model is not expected to introduce new edits but merely tag each subword token as correct or incorrect.

The number of training examples obtained from a single corpus transcribed with the English E2E ASR differs from the number obtained from the same corpus transcribed with the English traditional ASR. This discrepancy stems from the utilized E2E system's tendency to refrain from transcribing words with low confidence. Since the purpose of this study is to demonstrate the effectiveness of our proposed regularization method in various ASR post-processing setups, the discrepancy in the number of training examples from different ASR systems was retained on purpose.

4.3. Results

Results of applying our proposed method are presented in Table 2. Corpus names in the first column are each marked with the name of the ASR system used for transcription. The second and third columns of Table 2 show error detection F1 scores with and without applying the proposed regularization, respectively. The fourth and fifth columns report the number of correctly transcribed disfluencies wrongly tagged as ASR errors.

Table 2: All results are subword token-level scores, averaged over 6 repeated experiments with different random seeds.

Corpus               Error detection F1        Disfluencies wrongly tagged
                     Mixup      No mixup       Mixup        No mixup
SCOTUS (w2l++)       0.467      0.462          254          312
SCOTUS (Kaldi)       0.651      0.662          392.5        345.75
CallHome (w2l++)     0.910      0.910          164          164
CallHome (Kaldi)     0.950      0.950          324          344
AMI (w2l++)          0.760      0.746          253.25       291.25
AMI (Kaldi)          0.920      0.915          140          141.25
ICSI (w2l++)         0.347      0.340          1556         1559.75
ICSI (Kaldi)         0.777      0.771          1085         1124.25
Kspon (w2l++)        0.332      0.308          3            2.3
Kspon (Kaldi)        0.465      0.464          8            5

The proposed regularization shows greater influence as the amount of training data increases (AMI and ICSI). One surprising outcome is the unchanging scores on the CallHome corpus. Analysis of the corpus and the high F1 scores in error detection indicate the uniform scores are a result of distinct patterns in ASR errors during transcription; the fine-tuned LM in every setting learns to confidently and consistently pick out only certain tokens as ASR errors.

5. Conclusion

The discrete nature of textual data has stifled the adoption of mixup in natural language processing and speech-to-text domains. In this study we propose a way to effectively apply mixup to improve post-processing error detection.

We extensively demonstrate the effect of our proposed approach on various speech corpora. The impact of feature-space mixup is especially pronounced in high-resource settings. The proposed regularization layer is also computationally inexpensive and can be easily retrofitted to existing systems as a simple module. While we chose the popular BERT LM for our experiments, our method can be applied to fine-tuning any deep LM in a sequence tagging setting. We also highlighted potential uses of the resulting LMs in semi-supervised ASR training as pseudo-label filters. Testing the effectiveness of this approach in other sequence-level tasks is left to further research.

6. References

[1] M. Feld, S. Momtazi, F. Freigang, D. Klakow, and C. Müller, "Mobile texting: can post-asr correction solve the issues? an experimental study on gain vs. costs," in Proceedings of the 2012 ACM International Conference on Intelligent User Interfaces, 2012, pp. 37–40.

[2] M. Ogrodniczuk, Ł. Kobyliński, and P. A. N. I. P. Informatyki, Proceedings of the PolEval 2020 Workshop. Institute of Computer Science, Polish Academy of Sciences, 2020. [Online]. Available: https://books.google.co.kr/books?id=grRSzgEACAAJ

[3] A. Mani, S. Palaskar, N. V. Meripo, S. Konam, and F. Metze, "ASR Error Correction and Domain Adaptation Using Machine Translation," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6344–6348.

[4] E. E. Shriberg, "Preliminaries to a theory of speech disfluencies," PhD Thesis, Citeseer, 1994.
[5] E. Salesky, S. Burger, J. Niehues, and A. Waibel, "Towards fluent translations from disfluent speech," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 921–926.

[6] F. Wang, W. Chen, Z. Yang, Q. Dong, S. Xu, and B. Xu, "Semi-supervised disfluency detection," in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3529–3538.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.

[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[9] Y. Weng, S. S. Miryala, C. Khatri, R. Wang, H. Zheng, P. Molino, M. Namazifar, A. Papangelis, H. Williams, F. Bell et al., "Joint Contextual Modeling for ASR Correction and Language Understanding," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6349–6353.

[10] N. Bach and F. Huang, "Noisy BiLSTM-Based Models for Disfluency Detection," in INTERSPEECH, 2019, pp. 4230–4234.

[11] O. Hrinchuk, M. Popova, and B. Ginsburg, "Correction of Automatic Speech Recognition with Transformer Sequence-To-Sequence Model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7074–7078.

[12] D. Lee, B. Ko, M. C. Shin, T. Whang, D. Lee, E. H. Kim, E. Kim, and J. Jo, "Auxiliary Sequence Labeling Tasks for Disfluency Detection," arXiv preprint arXiv:2011.04512, 2020.

[13] T. Tanaka, R. Masumura, H. Masataki, and Y. Aono, "Neural error corrective language models for automatic speech recognition," in INTERSPEECH, 2018.

[14] D. S. Park, Y. Zhang, Y. Jia, W. Han, C.-C. Chiu, B. Li, Y. Wu, and Q. V. Le, "Improved noisy student training for automatic speech recognition," arXiv preprint arXiv:2005.09629, 2020.

[15] W.-N. Hsu, A. Lee, G. Synnaeve, and A. Hannun, "Semi-supervised speech recognition via local prior matching," arXiv preprint arXiv:2002.10336, 2020.

[16] L. Wang, M. Fazel-Zarandi, A. Tiwari, S. Matsoukas, and L. Polymenakos, "Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors," arXiv preprint arXiv:2006.05635, 2020.

[17] C. Anantaram and S. K. Kopparapu, "Adapting general-purpose speech recognition engine output for domain-specific natural language question answering," arXiv preprint arXiv:1710.06923, 2017.

[18] A. Mani, S. Palaskar, and S. Konam, "Towards Understanding ASR Error Correction for Medical Conversations," ACL 2020, p. 7, 2020.

[19] M. Popel and O. Bojar, "Training tips for the transformer model," The Prague Bulletin of Mathematical Linguistics, vol. 110, no. 1, pp. 43–70, 2018.

[20] S. Wang, W. Che, Q. Liu, P. Qin, T. Liu, and W. Y. Wang, "Multi-task self-supervised learning for disfluency detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 9193–9200.

[21] S. Wang, Z. Wang, W. Che, and T. Liu, "Combining Self-Training and Self-Supervised Learning for Unsupervised Disfluency Detection," arXiv preprint arXiv:2010.15360, 2020.

[22] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond Empirical Risk Minimization," in International Conference on Learning Representations, 2018.

[23] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 558–567.

[24] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel, "MixMatch: A holistic approach to semi-supervised learning," arXiv preprint arXiv:1905.02249, 2019.

[25] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.

[26] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio, "Manifold mixup: Better representations by interpolating hidden states," in International Conference on Machine Learning. PMLR, 2019, pp. 6438–6447.

[27] D. Guo, Y. Kim, and A. M. Rush, "Sequence-level Mixed Sample Data Augmentation," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 5547–5552.

[28] R. Zhang, Y. Yu, and C. Zhang, "SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8566–8579.

[29] L. Kong, H. Jiang, Y. Zhuang, J. Lyu, T. Zhao, and C. Zhang, "Calibrated Language Model Fine-Tuning for In- and Out-of-Distribution Data," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Nov. 2020, pp. 1326–1340. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-main.102

[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "HuggingFace's Transformers: State-of-the-art Natural Language Processing," arXiv preprint arXiv:1910.03771, 2019.

[31] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.

[32] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.

[33] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "wav2letter++: The fastest open-source speech recognition system," arXiv preprint arXiv:1812.07625, 2018.

[34] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys (CSUR), vol. 33, no. 1, pp. 31–88, 2001.

[35] V. Zayats, M. Ostendorf, and H. Hajishirzi, "Multi-domain disfluency and repair detection," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[36] Linguistic Data Consortium, CABank CallHome English Corpus. TalkBank, 2013. [Online]. Available: http://ca.talkbank.org/access/CallHome/eng.html

[37] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos et al., "The AMI meeting corpus," in Proceedings of Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research. Noldus Information Technology, 2005, pp. 137–140.

[38] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, "The ICSI meeting corpus," in Proceedings of ICASSP, 2003, pp. I-364.
[39] J.-U. Bang, S. Yun, S.-H. Kim, M.-Y. Choi, M.-K. Lee, Y.-J. Kim, D.-H. Kim, J. Park, Y.-J. Lee, and S.-H. Kim, “KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition,” Applied Sciences, vol. 10, no. 19, 2020. [Online]. Available: https://www.mdpi.com/2076-3417/10/19/6936