SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
Patrick K. O'Neill¹, Vitaly Lavrukhin², Somshubra Majumdar², Vahid Noroozi², Yuekai Zhang³, Oleksii Kuchaiev², Jagadeesh Balam², Yuliya Dovzhenko¹, Keenan Freyberg¹, Michael D. Shulman¹, Boris Ginsburg², Shinji Watanabe³,⁴, Georg Kucsko¹

¹ Kensho Technologies, Cambridge MA, USA
² NVIDIA, Santa Clara CA, USA
³ Johns Hopkins University, Baltimore MD, USA
⁴ Carnegie Mellon University, Pittsburgh PA, USA

patrick.oneill@kensho.com, georg@kensho.com

1. Abstract

In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels. We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7. As a contribution to the STT research community, we release the corpus free for non-commercial use.¹

2. Introduction

In the English speech-to-text (STT) task, acoustic model output typically lacks orthography, the full set of conventions for expressing the English language in writing (and especially print)² [1]. Acoustic models usually render uncased Latin characters, and standard features of English text such as capitalization, punctuation, denormalization of non-standard words, and other formatting information are omitted. While such output is suitable for certain purposes like closed captioning, it falls short of the standard of English orthography expected by readers and can even pose problems for downstream NLP tasks such as natural language understanding, neural machine translation and summarization [2, 3]. In contexts where standard English orthography is required, therefore, it is typically provided after the fact by a pipeline of post-processing models, each of which infers a single aspect of formatting from the acoustic model output.

This approach suffers several drawbacks. First, rendering orthographically correct text is the more natural task: if asked to transcribe audio and given no further instructions, a normally literate English speaker will tend to generate orthographic text. By contrast, writing in block capitals without punctuation, spelling out numbers in English, and so on, is a far less conventional style. Second, certain types of orthographic judgments are possible only with acoustic information and cannot be reliably inferred from text alone. Consider, for example, the problem of inferring the correct EOS marker for the sentence "the CEO retired" (either a period or a question mark) without the information carried in vocal pitch. Third, the practice of chaining orthography-imputing models into pipelines tends to encumber STT systems. Each orthographic feature may require a model of significant engineering effort in its own right (cf. [4, 5, 6, 7, 8, 9]), and improvements to one model's performance may degrade end-to-end performance as a whole due to distribution shift [10].

Modern STT models require large volumes of high-quality training data, and to our knowledge there is no extant public corpus suitable for the fully formatted, end-to-end STT task. To address this limitation we release SPGISpeech³, a subset of the financial audio corpus described in [11]. SPGISpeech offers:

• 5,000 hours of audio from real corporate presentations
• Fully formatted, professional manual transcription
• Both spontaneous and narrated speech
• Varied teleconference recording conditions
• Approximately 50,000 speakers with a diverse selection of L1 and L2 accents
• Varied topics relevant to business and finance, including a broad range of named entities

¹ https://datasets.kensho.com/datasets/scribe
² We concede that the precise details of English orthography can vary by time, geographical region, and even publication house style. We nevertheless ostensively define orthographic text here to mean "text as it generally appears in U.S. print publications".
³ Rhymes with "squeegee".

3. Prior Work

There are many extant STT corpora, varying in their volumes as well as in the details of their formats. Early corpora include the Wall Street Journal corpus, consisting of 80 hours of narrated news articles [12]; SWITCHBOARD, containing approximately 300 hours of telephone conversations [13]; TIMIT, consisting of a set of ten phonetically balanced sentences read by hundreds of speakers (~50 h) [14]; and the Fisher corpus of transcribed telephone conversations (2,000 h) [15].

The LibriSpeech corpus, a standard benchmark for STT models, consists of approximately 1,000 hours of narrated audiobooks [16]. Although the use of audiobooks as an STT corpus allows researchers to leverage a large pre-existing body of transcription work, this approach poses several limitations. First, LibriSpeech consists entirely of narrated text, and hence lacks many of the acoustic and prosodic features of spontaneous speech.
Second, as the narrated books are public domain texts, there is a bias in the corpus towards older works, and hence against more modern registers of English.

Other corpora include TED-LIUM (450 hours of transcribed TED talks) [17], Common Voice (a multilingual corpus of narrated prompts with ~1,100 validated hours in English) [18], and GigaSpeech (10,000 hours from audiobooks, podcasts and YouTube videos) [19]. We present a comparison to select corpora in Table 1; a more exhaustive catalogue of prior STT corpora can be found in [20].

Within this field of previous work, SPGISpeech is distinctive for being over ten times larger than the next largest corpus with orthographic ground truth labels. It also contains approximately 50,000 speakers, the largest number to our knowledge of any public corpus.

Corpus name    Acoustic condition   Speaking style
Switchboard    Telephone            Spontaneous
Librispeech    Close-talk mic.      Narrated
TedLium-3      Close-talk mic.      Narrated
Common Voice   Teleconference       Narrated
SPGISpeech     Teleconference       Spontaneous, Narrated

Corpus name    Transcription style   Speaker count   Vocabulary size   Amount (h)
Switchboard    Orthographic          550             25,000            300
Librispeech    Non-orthographic      2,400           200,000           960
TedLium-3      Non-orthographic      2,000           160,000           450
Common Voice   Non-orthographic      40,000          220,000           1,400
SPGISpeech     Orthographic          50,000          100,000           5,000

Table 1: Comparison of SPGISpeech to peer corpora. For a select group of comparable STT corpora, we compare the recording format, discursive domain, metadata and quantity of audio. Numeric data are rounded; precise values for SPGISpeech are given in Table 3.

4. Corpus Definition

The SPGISpeech corpus is derived from company earnings calls manually transcribed by S&P Global, Inc. according to a professional style guide detailing conventions for capitalization, punctuation, denormalization of non-standard words and transcription of disfluencies in spontaneous speech. The basic unit of SPGISpeech is a pair consisting of a 5 to 15 second long 16-bit, 16 kHz mono wav audio file and its transcription.

4.1. Alignment and Slicing

Earnings calls last 30 to 60 minutes and are typically transcribed as whole units, without internal timestamps. In order to produce short audio slices suitable for STT training, the files were segmented with Gentle [21], a double-pass forced aligner, with the beginning and end of each slice of audio imputed by voice activity detection with py-webrtcvad [22]. While there is inevitably a potential to introduce certain systematic biases through this process, the fraction of recovered aligned audio per call ranges from approximately 40% in the calls of lowest audio quality to approximately 70% in the highest.
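To make the voice-activity-detection step concrete, the sketch below shows how speech regions can be imputed with py-webrtcvad on 16-bit, 16 kHz mono wav files like those in SPGISpeech. The paper does not publish its segmentation parameters, so the 30 ms frame length, the aggressiveness setting of 2, and the function name are illustrative assumptions rather than the authors' pipeline.

```python
# Hedged sketch of the VAD step; frame length and aggressiveness are assumed.
import contextlib
import wave

import webrtcvad  # pip install webrtcvad


def speech_regions(path, frame_ms=30, aggressiveness=2):
    """Yield (start_s, end_s) spans of contiguous voiced frames in a
    16-bit, 16 kHz mono wav file, as judged by the WebRTC VAD."""
    with contextlib.closing(wave.open(path, "rb")) as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        rate = wf.getframerate()            # 16000 for SPGISpeech audio
        pcm = wf.readframes(wf.getnframes())

    vad = webrtcvad.Vad(aggressiveness)     # 0 (least) to 3 (most aggressive)
    bytes_per_frame = int(rate * frame_ms / 1000) * 2
    start = None
    for i in range(0, len(pcm) - bytes_per_frame + 1, bytes_per_frame):
        t = i / (2 * rate)                  # frame start time in seconds
        voiced = vad.is_speech(pcm[i:i + bytes_per_frame], rate)
        if voiced and start is None:
            start = t
        elif not voiced and start is not None:
            yield (start, t)
            start = None
    if start is not None:
        yield (start, len(pcm) / (2 * rate))
```

In a pipeline like the one described, spans such as these would then be reconciled with Gentle's word-level alignments to cut the 5 to 15 second slices.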
4.2. Exclusion Criteria

Slices in SPGISpeech are not a simple random sample of the available data, but are subject to certain exclusion criteria.

1. We sampled no more than four consecutive slices from any call. We also redacted the corpus out of concern for individual privacy. Though earnings calls are public, we nevertheless identified full names with the spaCy en_core_web_lg model [23], which we selected on grounds of its wall clock performance for scanning the entire corpus. We withheld slices containing names that appeared fewer than ten times (7% of total). Full names appearing ten times or more in the data were considered to be public figures and were retained. This necessarily incomplete approach to named entity recognition was complemented with randomized manual spot checks, which uncovered no false negatives missed by the automated approach. (See the sketch following this list.)

2. We excluded all slices containing currency information (8% of total), on the grounds that currency utterances often have non-trivial denormalizations and misquotation issues that require global context to render correctly. The utterance "one twenty three", for example, might be correctly transcribed as any of $1.23, £1.23, $1.23 million, and so on, depending on context. Moreover, given the potential for material errors in a business setting, misquotations of currency values in spontaneous speech are typically corrected in transcription. Lacking the means to verify the correct spoken form for each currency mention, we simply exclude them.

3. We excluded slices with transcriptions containing non-ASCII characters. In particular, we excluded all slices whose transcripts did not consist entirely of the upper- and lowercase ASCII alphabet; digits; the comma, period, apostrophe, hyphen, question mark, percent sign, and space characters.

4. Remaining slices were randomly subsampled to construct the published corpus, which consists of two splits, train and val, having no call events in common. We also construct a private test split, defined exactly as val and having no calls in common with it, which we do not release.
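As a minimal illustration of criteria 1 and 3, the sketch below counts multi-token PERSON entities with spaCy and checks transcripts against the published character set. The ten-occurrence threshold and the allowed characters come from the paper; the helper names, the multi-token heuristic for identifying "full names", and the counting procedure are assumptions made for the example.

```python
# Illustrative reconstruction of exclusion criteria 1 and 3; helper names
# and the "multi-token PERSON span = full name" heuristic are assumptions.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_lg")  # chosen in the paper for wall-clock speed

# Criterion 3: the published character set for transcripts.
ALLOWED_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789,.'-?% "
)


def is_ascii_clean(transcript):
    """True if the transcript uses only the allowed ASCII characters."""
    return set(transcript) <= ALLOWED_CHARS


def full_names(transcript):
    """Multi-token PERSON entities, taken here as likely full names."""
    return [ent.text for ent in nlp(transcript).ents
            if ent.label_ == "PERSON" and len(ent) > 1]


def redact(transcripts, min_count=10):
    """Keep only slices whose full names occur at least min_count times
    corpus-wide; frequent names are treated as public figures (criterion 1)."""
    names_per_slice = [full_names(t) for t in transcripts]
    counts = Counter(name for names in names_per_slice for name in names)
    return [t for t, names in zip(transcripts, names_per_slice)
            if all(counts[name] >= min_count for name in names)]
```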
Punctuation
  Transcript:   early in April versus what was going on at the beginning of the quarter?
  Verbatim:     early you know in april versus uh what was going on at the beginning of the quarter
  Model output: early in April versus what was going on at the beginning of the quarter?

Non-standard words
  Transcript:   [...] for the first time in our 92-year history, we [...]
  Verbatim:     [...] for the first time in our ninety two year history we [...]
  Model output: [...] for the first time in our 92-year history, we [...]

Disfluency
  Transcript:   As respects our use of insurance to put out -- reinsurance to put out [...]
  Verbatim:     as as respects our use of insurance to put out lim reinsurance to put out [...]
  Model output: As respects our use of insurance to put out -- reinsurance to put out [...]

Abbreviations
  Transcript:   in '15, and got margins back to that kind of mid-teens level [...]
  Verbatim:     in fifteen and got margins back to that kind of mid teens level [...]
  Model output: in '15 and got margins back to kind of that mid-teens level [...]

Table 2: Examples of non-standard text in SPGISpeech. For each textual feature we give an example of the ground truth transcript label, a verbatim transcription without casing or punctuation, and Conformer model output with end-to-end orthographic training.

            Train       Val
Events      55,289      1,114
Slices      1,966,109   39,341
Time (h)    5,000       100
Vocabulary  100,166     19,865
OOV         —           703

Table 3: SPGISpeech Summary Statistics.

5. Corpus Analysis

We briefly characterize the data. Summary statistics are given in Table 3. Several representative examples are given in Table 2, highlighting some of the challenges SPGISpeech presents for end-to-end training. The table contains samples from the validation split where the correct EOS marker is difficult to infer from the transcription of the slice alone, non-standard words that must be denormalized, disfluencies and hesitations arising from self-correction in spontaneous speech, and instances like year abbreviations where semantic information is likely necessary to determine the correct orthography. In each instance we also report characteristic Conformer model output (see Section 6), demonstrating the feasibility of end-to-end orthographic transcription for these examples.

SPGISpeech contains many specialized forms and entity types, such as acronyms (15% of all slices), pauses (10%), organizations (25%), persons (8%), and locations (8%). For each form we estimate its prevalence from a random sample of transcripts. Prevalences of the named entity types person, organization and location were estimated using Flair [24], which we selected on grounds of its accuracy and acceptable wall clock performance for scanning a small sample of the corpus.
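As an illustration of how such prevalence figures can be estimated, the following sketch runs Flair's pretrained four-class NER tagger over a random sample of transcripts. The sample size, the seed, and the mapping of Flair's PER/ORG/LOC tags to the paper's categories are assumptions for the example, not the authors' published settings.

```python
# Hedged sketch of entity-prevalence estimation with Flair; all parameter
# values below are illustrative assumptions.
import random

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # CoNLL-03 tagger: PER, ORG, LOC, MISC


def entity_prevalence(transcripts, sample_size=1000, seed=0):
    """Fraction of sampled slices containing each entity type."""
    random.seed(seed)
    sample = random.sample(transcripts, min(sample_size, len(transcripts)))
    hits = {"PER": 0, "ORG": 0, "LOC": 0}
    for text in sample:
        sentence = Sentence(text)
        tagger.predict(sentence)
        found = {span.tag for span in sentence.get_spans("ner")}
        for label in hits:
            hits[label] += label in found
    return {label: n / len(sample) for label, n in hits.items()}
```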
There are roughly 50,000 speakers in SPGISpeech, drawn from corporate officers and analysts appearing in English earnings calls and spanning a broad cross-section of L1 and L2 accents. We attempted to characterize this linguistic diversity by tabulating the domiciles of the corporate headquarters of the companies in the corpus. In Fig. 1 we report the distribution of the global region associated with each company's headquarters as listed in the S&P Capital IQ database [25].

[Figure 1: Distribution of Speakers by Global Region. Speaker distribution is estimated in a random sample according to reported region of corporate domicile.]

To further refine the picture of accent composition, we next considered the distribution of countries of corporate domicile. We found it to be long-tailed, with the top five most common nations (US, Japan, UK, India, Canada) comprising two-thirds of the data. Although a few of the twenty most frequent countries are reputed for encouraging strategic corporate domiciling, we find that their combined share of the total is low, suggesting that their inclusion does not significantly distort the bulk statistics. While imputation of speaker accent from corporate domicile is a necessarily limited approach, we found the observed statistics to be broadly concordant with direct estimation of accent in a random sample. SPGISpeech offers, to our knowledge, the largest number of speakers and the widest diversity of accents of any public corpus of spontaneous speech, making it especially suitable for training robust STT models.

Lastly, we analyzed the industrial composition of the included companies in order to ensure that they constitute a representative cross-section of business topics. All eleven top-level sectors of the Global Industry Classification Standard (GICS) [26] are reflected in the data at rates roughly comparable to their proportions in the US economy. SPGISpeech, not being constrained to any particular company or industrial sector, therefore offers a fairly synoptic view of modern English business discourse.

6. Transcription Experiments

To illustrate the feasibility of the end-to-end orthographic approach, we train several models on SPGISpeech, including Conformer (ESPnet) and Conformer-CTC (NeMo).

6.1. Model Descriptions and Methods

Both presented models are based on the recently proposed Conformer (Convolution-Augmented Transformer) architecture [27], which combines the local sensitivities of convolutional neural networks [28, 29] with the long-range interactions of transformers [30] in order to capture both local and global dependencies in audio sequences. The unit of prediction for both models is SentencePiece-tokenized text [31], whose detokenized output consists of upper- and lowercase characters, digits, the comma, period, apostrophe, hyphen, question mark, percent sign, and the space character, covering the full range of characters present in SPGISpeech.
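To show what this tokenization looks like in practice, here is a hypothetical SentencePiece setup with a ~5k vocabulary over raw transcripts. Only the vocabulary size and character set come from the paper; the file names, the choice of BPE over unigram, and the other training flags are assumptions.

```python
# Hypothetical SentencePiece setup; only the ~5k vocabulary size is from
# the paper. File names, model type and other flags are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",   # assumed: one transcript per line
    model_prefix="spgi5k",
    vocab_size=5000,
    model_type="bpe",                # assumption; unigram is also plausible
    character_coverage=1.0,          # the character set is small and fixed
)

sp = spm.SentencePieceProcessor(model_file="spgi5k.model")
text = "in '15, and got margins back to that kind of mid-teens level"
pieces = sp.encode(text, out_type=str)   # subword pieces for inspection
ids = sp.encode(text)                    # integer ids for the network
assert sp.decode(ids) == text            # detokenization restores formatting
```

Because the tokenizer's inventory includes capitals, digits and punctuation, the acoustic model can emit fully formatted text directly, with no separate restoration step.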
The ESPnet Conformer model consists of 12 conformer blocks with an output dimension of 512 and a kernel size of 31 in the encoder, and 6 transformer blocks in the decoder. Both encoder and decoder have 8 attention heads with a feed-forward unit dimension of 2048. The vocabulary size was chosen at ~5k. The model was trained on four Titan RTX GPUs (24 GB memory each) for 35 epochs. The Adam optimizer with no weight decay was used, and a Noam learning rate scheduler was applied with 25k warmup steps and a learning rate of 0.0015. SpecAugment was used with 2 frequency masks and 5 time masks. The last 10 best checkpoints were averaged for the final model. Detailed setups can be found in the ESPnet recipe⁴.

⁴ https://github.com/espnet/espnet/tree/master/egs2/spgispeech

The Conformer-CTC model, provided through the NeMo toolkit [32], is a CTC-based variant of the Conformer which has the same encoder but uses CTC loss [33] instead of RNN-T [34]. It also replaces the LSTM decoder with a linear decoder on top of the encoder. This makes Conformer-CTC a non-autoregressive model, unlike the original Conformer, allowing for significantly faster inference. The presented model was trained for 230 epochs at an effective batch size of 4096, with the Adam optimizer and no weight decay. SpecAugment was applied with 2 frequency maskings of width 27 and 5 time maskings at a maximum ratio of 0.05. A Noam learning rate scheduler was used with a warmup of 10k steps and a learning rate of 2.0.

6.2. Results and Discussion

As performance on the orthographic STT task hinges in large part on single-character distinctions such as capitalization and punctuation, we report both WER and CER for the test split of SPGISpeech, which closely tracks performance on the val split. We also estimate the respective error rates for a normalized transcription task by lowercasing the output and filtering out all but letters, the apostrophe and the space character, to conform to the conventional choice of STT vocabulary. Both results are shown in Table 4.

                       WER (%)          CER (%)
                       ortho   norm     ortho   norm
Conformer (ESPnet)     5.7     2.3      1.7     1.0
Conformer-CTC (NeMo)   6.0     2.6      1.8     1.1

Table 4: Model Results. Greedy word error rate and character error rate on the SPGISpeech test set. ortho numbers refer to the unnormalized, fully formatted orthographic output, while norm numbers refer to output lowercased and reduced to a conventional STT vocabulary to remove the effects of text formatting.
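The norm condition can be reproduced with a few lines of text processing. The scoring helpers below are generic reference implementations of the normalization and of edit-distance-based WER/CER, not the authors' evaluation code.

```python
# Sketch of the "norm" scoring condition: lowercase the text and keep only
# letters, apostrophes and spaces before computing WER/CER as usual.
import re


def normalize(text):
    """Reduce fully formatted output to a conventional STT vocabulary."""
    return re.sub(r"[^a-z' ]", "", text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())


def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)


# ortho vs. norm scoring of the same hypothesis:
ref = "early in April versus what was going on at the beginning of the quarter?"
hyp = "early in april versus what was going on at the beginning of the quarter"
print(wer(ref, hyp), cer(ref, hyp))                 # ortho: casing counts
print(wer(normalize(ref), normalize(hyp)),
      cer(normalize(ref), normalize(hyp)))          # norm: 0.0 0.0
```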
Our results demonstrate CERs on the orthographic STT task comparable to those obtained on standard normalized corpora. While we caution against laying undue stress upon any one particular comparison made across studies, the results suggest on the whole that English orthography is within the grasp of modern acoustic architectures, and hence that end-to-end orthographic STT is a feasible task. Both models achieve WERs of less than 6.0 and CERs of less than 2.0, making them broadly comparable to previous results obtained with the Conformer on other corpora [27].

A comparison of the ortho and norm error rates suggests that one-half to two-thirds of the error is due to orthographic issues. One must recall, however, that there is a lower bound imposed by irreducible error: in some cases there is genuine disagreement as to whether a pause merits a comma, or whether a hesitation should be recorded or emended. For the disfluency given in Table 2, for example, the model smoothly emends the first repeated word but records the second correction while omitting the partially pronounced word "limit": some degree of judgment in the training labels, and hence irreducible error, is inevitable.

7. Conclusions

In this work we introduced a new end-to-end task of fully formatted speech recognition, in which the acoustic model learns to predict complete English orthography. This approach is conceptually simpler and can lead to improved accuracy, thanks to acoustic cues that are present only in the audio yet needed for full formatting. We demonstrated the feasibility of this approach by training models on SPGISpeech, a corpus uniquely suited to the task of large-scale, fully formatted end-to-end transcription in English. As a contribution to the STT research community, we also offer SPGISpeech for free academic use.

8. Acknowledgements

The authors wish to thank the S&P Global Market Intelligence Transcripts team for data collection and annotation; Bhavesh Dayalji, Abhishek Tomar, Richard Neale and Gabriela Pereyra for their project support; and Shea Hudson Kerr and Stacey Steele for helpful legal guidance.
9. References

[1] K. H. Albrow, "The English writing system: Notes towards a description," Schools Council Program in Linguistics and English Teaching, Papers Series 2, no. 2, 1972.
[2] A. Ravichander, S. Dalmia, M. Ryskina, F. Metze, E. Hovy, and A. W. Black, "NoiseQA: Challenge set evaluation for user-centric question answering," in Conference of the European Chapter of the Association for Computational Linguistics (EACL), Online, April 2021. Available: https://arxiv.org/abs/2102.08345
[3] S. Peitz, M. Freitag, A. Mauser, and H. Ney, "Modeling punctuation prediction as machine translation," in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2011.
[4] R. Sproat and N. Jaitly, "RNN approaches to text normalization: A challenge," arXiv preprint arXiv:1611.00068, 2016.
[5] W. Salloum, G. Finley, E. Edwards, M. Miller, and D. Suendermann-Oeft, "Deep learning for punctuation restoration in medical reports," in BioNLP 2017, Vancouver, Canada: Association for Computational Linguistics, Aug. 2017, pp. 159–164. Available: https://www.aclweb.org/anthology/W17-2319
[6] E. Pusateri, B. R. Ambati, E. Brooks, O. Plátek, D. McAllaster, and V. Nagesha, "A mostly data-driven approach to inverse text normalization," in INTERSPEECH, 2017.
[7] B. Nguyen, V. B. H. Nguyen, H. Nguyen, P. N. Phuong, T. Nguyen, Q. T. Do, and L. C. Mai, "Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging," in O-COCOSDA 2019, Cebu, Philippines, Oct. 2019, pp. 1–5. Available: https://doi.org/10.1109/O-COCOSDA46868.2019.9041202
[8] O. Hrinchuk, M. Popova, and B. Ginsburg, "Correction of automatic speech recognition with transformer sequence-to-sequence model," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7074–7078.
[9] M. Sunkara, C. Shivade, S. Bodapati, and K. Kirchhoff, "Neural inverse text normalization," 2021.
[10] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, "Hidden technical debt in machine learning systems," in Advances in Neural Information Processing Systems, vol. 28, 2015. Available: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
[11] J. Huang, O. Kuchaiev, P. O'Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, "Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition," in International Conference on Multimedia and Expo, 2021, to appear.
[12] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, Feb. 1992.
[13] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in ICASSP, 1992.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic phonetic continuous speech corpus," 1993.
[15] C. Cieri, D. Miller, and K. Walker, "The Fisher Corpus: a resource for the next generations of speech-to-text," in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 2004. Available: http://www.lrec-conf.org/proceedings/lrec2004/pdf/767.pdf
[16] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP, 2015.
[17] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, "TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation," 2018.
[18] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," 2019.
[19] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, M. Ji, S. Khudanpur, S. Watanabe, S. Zhao, W. Zou, X. Li, X. Yao, Y. Wang, Z. You, and Z. Yan, "GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio," submitted to INTERSPEECH, 2021.
[20] J. Le Roux and E. Vincent, "A categorization of robust speech processing datasets," Mitsubishi Electric Research Labs, Technical Report TR2014-116, 2014.
[21] lowerquality, Gentle Aligner, 2020 (accessed May 4, 2020). Available: https://lowerquality.com/gentle/
[22] J. Wiseman, py-webrtcvad, 2016 (accessed May 4, 2020). Available: https://github.com/wiseman/py-webrtcvad
[23] M. Honnibal and I. Montani, "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing," 2017, to appear.
[24] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in COLING 2018, 27th International Conference on Computational Linguistics, 2018, pp. 1638–1649.
[25] S&P Global, "S&P Capital IQ," www.capitaliq.com, 2012, accessed June 2020.
[26] MSCI, "The Global Industry Classification Standard (GICS)," 2019. Available: https://www.msci.com/gics
[27] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.
[28] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, pp. 193–202, 1980.
[29] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1533–1545, 2014. Available: https://doi.org/10.1109/TASLP.2014.2339736
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017. Available: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[31] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, Nov. 2018, pp. 66–71. Available: https://www.aclweb.org/anthology/D18-2012
[32] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook, P. Castonguay, M. Popova, J. Huang, and J. M. Cohen, "NeMo: a toolkit for building AI applications using neural modules," 2019.
[33] A. Graves, S. Fernández, and F. Gomez, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the International Conference on Machine Learning (ICML), 2006, pp. 369–376.
[34] A. Graves, "Sequence transduction with recurrent neural networks," in ICML, 2012.