Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec
Jiatong Shi¹, Jonathan D. Amith², Rey Castillo García³, Esteban Guadalupe Sierra, Kevin Duh¹, Shinji Watanabe¹
¹ The Johns Hopkins University, Baltimore, Maryland, United States
² Department of Anthropology, Gettysburg College
³ Secretaría de Educación Pública, Estado de Guerrero, Mexico
{jiatong shi@, kevinduh@cs.}jhu.edu, {jonamith, reyyoloxochitl, estebanyoloxochitl}@gmail.com, shinjiw@ieee.org

arXiv:2101.10877v3 [eess.AS] 5 Mar 2021

Abstract

"Transcription bottlenecks", created by a shortage of effective human transcribers, are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which, unlike Hidden Markov Model ASR systems, eschews linguistic resources but is instead more dependent on large-data settings. We open source a Yoloxóchitl Mixtec EL corpus. First, we review our method in building an end-to-end ASR system in a way that would be reproducible by the ASR community. We then propose a novice transcription correction task and demonstrate how ASR systems and novice transcribers can work together to improve EL documentation. We believe this combinatory methodology would mitigate the transcription bottleneck and transcriber shortage that hinder EL documentation.

1 Introduction

Grenoble et al. (2011) warned that half of the world's 7,000 languages would disappear by the end of the 21st century. Consequently, a concern with endangered language documentation has emerged from the convergence of interests of two major groups: (1) native speakers who wish to document their language and cultural knowledge for future generations; (2) linguists who wish to document endangered languages to explore linguistic structures that may soon disappear. Endangered language (EL) documentation aims to mitigate these concerns by developing and archiving corpora, lexicons, and grammars (Lehmann, 1999). There are two major challenges:

(a) Transcription Bottleneck: The creation of EL resources through documentation is extremely challenging, primarily because the traditional method to preserve primary data is not simply with audio recordings but also through time-coded transcriptions. In a best-case scenario, texts are presented in interlinear format with aligned parses and glosses along with a free translation (Anastasopoulos and Chiang, 2017). But interlinear transcriptions are difficult to produce in meaningful quantities: (1) ELs often lack a standardized orthography (if written at all); (2) invariably, few speakers can accurately transcribe recordings. Even a highly skilled native speaker or linguist will require a minimum of 30 to 50 hours simply to transcribe one hour of recording (Michaud et al., 2014; Zahrer et al., 2020). Additional time is needed for parsing, glossing, and translation. This creates what has been called a "transcription bottleneck", a situation in which expert transcribers cannot keep up with the amount of material recorded for documentation.

(b) Transcriber Shortage: It is generally understood that any viable solution to the transcription bottleneck must involve native speaker transcribers. Yet usually few, if any, native speakers have the skills (or time) to transcribe their language. Training new transcribers is one solution, but it is time-consuming, especially with languages that present complicated phonology and morphology. The situation is distinct for major languages, for which transcription can be crowd-sourced to speakers with little need for specialized training (Das and Hasegawa-Johnson, 2016). In Yoloxóchitl Mixtec (YM; Glottocode=yolo1241, ISO 639-3=xty), the focus of this study, training is time-consuming: after one year of part-time transcription training, a proficient native speaker, Esteban Guadalupe Sierra, still has problems with certain phones, particularly tones and glottal stops. Documentation requires accurate transcriptions, a goal yet beyond even the capability of an enthusiastic speaker with many
months of training.

As noted, ASR has been proposed to mitigate the transcription bottleneck and create increasingly extensive EL corpora. Previous studies first investigated HMM-based ASR for EL documentation (Ćavar et al., 2016; Mitra et al., 2016; Adams et al., 2018; Jimerson et al., 2018; Jimerson and Prud'hommeaux, 2018; Michaud et al., 2018; Cruz and Waring, 2019; Thai et al., 2020; Zahrer et al., 2020; Gupta and Boulianne, 2020a). Along with HMM-based ASR, natural language processing and semi-supervised learning have been suggested as a way to produce morphological and syntactic analyses. As HMM-based systems have become more precise, they have been increasingly promoted as a mechanism to bypass the transcription bottleneck.

However, ASR's context for ELs is quite distinct from that of major languages. Endangered languages seldom have sufficient extant language lexicons to train an HMM system and invariably suffer from a dearth of skilled transcribers to create these necessary resources (Gupta and Boulianne, 2020b). As we have confirmed with this present study, end-to-end ASR systems have shown comparable or better results than conventional HMM-based methods (Graves and Jaitly, 2014; Chiu et al., 2018; Pham et al., 2019; Karita et al., 2019a). As end-to-end systems directly predict textual units from acoustic information, they save much effort on lexicon construction. Nevertheless, end-to-end ASR systems still suffer from the limitation of training data. Attempts with resource-scarce languages have relatively high character (CER) or word (WER) error rates (Thai et al., 2020; Matsuura et al., 2020; Hjortnaes et al., 2020). It has nevertheless become possible to utilize ASR with ELs to reduce significantly, but not eliminate, the need for human input and annotation to create acceptable ("archival quality") transcriptions.

This Work: This work represents end-to-end ASR efforts on Yoloxóchitl Mixtec (YM), an endangered language from western Mexico. The YMC¹ corpus comprises two sub-corpora. The first ("YMC-EXP", expert-transcribed corpus) includes 100 hours of transcribed speech that have been carefully checked for accuracy. We built a recipe for ESPNet (Watanabe et al., 2018) that shows the whole process of constructing an end-to-end ASR system using the YMC-EXP corpus.² The second corpus ("YMC-NT", native-trainee corpus) includes 8+ hours of additional recordings not included in the YMC-EXP corpus. This second corpus contains novice transcriptions with subsequent expert corrections that have allowed us to evaluate the skill level of the novice. Both the YMC-EXP and YMC-NT corpora are publicly available at OpenSLR under a CC BY-NC-SA 3.0 License.³

The contributions of our research are:

• A new Yoloxóchitl Mixtec corpus to support ASR efforts in EL documentation.

• A reproducible workflow to build an end-to-end ASR system for EL documentation.

• A comparative study between HMM-based ASR and end-to-end ASR, demonstrating the feasibility of the latter. To test the framework's generalizability, we also experiment with another EL: Highland Puebla Nahuatl (Glottocode=high1278; ISO 639-3=azz).

• An in-depth analysis of errors in novice transcription and ASR. Considering the discrepancies in error types, we propose Novice Transcription Correction (NTC) as a task for the EL documentation community. A rule-based method and a voting-based method are proposed.⁴ In clean speech, the best system reduces the relative word error rate of the novice transcription by 38.9%.

¹ Specifically, we used material from the community of Yoloxóchitl (YMC), one of four in which YM is spoken.
² https://github.com/espnet/espnet/tree/master/egs/yoloxochitl_mixtec/asr1
³ http://www.openslr.org/89/
⁴ A system combination method, Recognizer Output Voting Error Reduction (Fiscus, 1997).

2 Corpus Description

In this section, we first introduce the linguistic specifics for YM and YMC. Then we discuss the recording settings. Since YM is a spoken language without a standardized textual format, we next explain the transcription style designed for this language. Finally, we offer the corpus partition and some statistics regarding corpus size.

2.1 Linguistic Specifics for Yoloxóchitl Mixtec

Yoloxóchitl Mixtec is an endangered, relatively low-resource Mixtecan language. It is mainly spoken in the municipality of San Luis Acatlán, state
of Guerrero, Mexico. It is one of some 50 languages in the Mixtec language family, which is part of a larger unit, Otomanguean, that Suárez (1983) considers "a 'hyper-family' or 'stock'." Mixtec languages (spoken in Oaxaca, Guerrero, and Puebla) are highly varied, resulting from approximately 2,000 years of diversification.

YM is spoken in four communities: Yoloxóchitl, Cuanacaxtitlan, Arroyo Cumiapa, and Buena Vista. Mutual intelligibility among the four YM communities is high despite significant differences in phonology, morphology, and syntax. All villages have a simple segmental inventory but significant though still undocumented variation in tonal phonology. YMC (referring only to the Mixtec of the community of Yoloxóchitl [16.81602, -98.68597]) manifests 28 distinct tonal patterns on 1,451 identified bimoraic lexical stems. The tonal patterns carry a significant functional load in regard to the lexicon and inflection. For example, 24 distinct tonal patterns on the bimoraic segmental sequence [nama] yield 30 words (including six homophones). This ample tonal inventory presents challenges both to a native speaker learning to write and to an ASR system learning to recognize. Notably, it also introduces difficulties in constructing a language lexicon for training HMM-based systems.

2.2 Recording Settings

There are two corpora used in this study. The first (YMC-EXP) was used for ASR training. The second (YMC-NT) was used to train the novice speaker (e.g., to set up a curriculum for him to learn how to transcribe) and for Novice Transcription Correction. The YMC-EXP corpus comprises expert transcriptions used as the gold-standard reference for ASR development. The YMC-NT corpus has paired novice-expert transcriptions, as it was used to train and evaluate the novice writer.

The corpus used for ASR development comprises mostly conversational speech in two-channel recordings (split for training). Each conversation is with two speakers, and each of the two speakers was fitted with a separate head-worn mic (usually a Shure SM10a). Over two dozen speakers (mostly male) contributed to the corpus. The topics and their distribution were varied (plants, animals, hunting/fishing, food preparation, ritual speech).

The YMC-NT corpus comprises single-channel field recordings made with a Zoom H4n at the moment plants were collected during ethnobotanical research. Speakers were interviewed one after another; there is no overlap. However, the recordings often registered background sounds (crickets, birds) that we expected would negatively impact ASR accuracy more than seems to have occurred. The topic was always a discussion of plant knowledge (a theme of only 9% of the YMC-EXP corpus). Expectedly, there were many out-of-vocabulary (OOV) words (e.g., plant names not elsewhere recorded) in this YMC-NT corpus.⁵

⁵ After separating enclitics and prefixes as separate tokens, the OOV rate in YMC-NT is 4.84%.

2.3 Corpus Transcription

(a) Transcription Level: The YMC-EXP corpus presently has two levels of transcription: (1) a practical orthography that represents underlying forms; (2) surface forms. The underlying form marks prefixes (separated from the stem by a hyphen), enclitics (separated by an = sign), and tone elision (with the elided tones in parentheses). All these "breaks" and phonological processes disappear in the surface form. For example, the underlying be03e3=an4 (house=3sgFem; 'her house') surfaces as be03ã4. And be03e(3)=2 ('my house') surfaces as be03e2. Another example is the completive prefix ni1-, which is separated from the stem as in ni1-xi3xi(3)=2 (completive-eat-1sgS; 'I ate'). The surface form would be written nĩ1xi3xi2. Again, processes such as nasalization, vowel harmony, palatalization, and labialization are not represented in the practical (underlying) orthography but are generated in the surface forms. The only phonological process encoded in the underlying orthography is tone elision, for which parentheses are used.

The practical, underlying orthography mentioned above was chosen as the default system for ASR training for three reasons: (1) it is easier than a surface representation for native speakers to write; (2) it represents morphological boundaries and thus serves to teach native speakers the morphology of their language; and (3) for a researcher interested in generating concordances for a corpus-based lexicographic project, it is much easier to discover the root for 'house' in be03e3=an4 and be03e(3)=2 than in the surface forms be03ã4 and be03e2.

(b) "Code-Switching" in YMC: Endangered, colonized Indigenous languages often manifest extensive lexical input from a dominant Western language, and speakers often talk with "code-switching" (for lack of a better term). Yoloxóchitl
Mixtec is no exception. Amith considered how best to write such forms and decided that Spanish-origin words would be written in Spanish and without tone when their phonology and meaning are close to that of Spanish. So Spanish docena appears over a dozen times in the corpus and is written tucena; it always has the meaning of 'dozen'. All month and day names are also written without tones. Note, however, that Spanish camposanto ('cemetery') is also found in the corpus and pronounced as pa3san4tu2. The decision was made to write this with tone markings as it is significantly different in pronunciation from the Spanish-origin word. In effect, words like pa3san4tu2 are considered loans into YM and are treated orthographically as Mixtec. Words such as tucena are considered "code-switching" and written without tones.

(c) Transcription Process: The initial time-aligned transcriptions were made in Transcriber (Barras et al., 1998). However, given that Transcriber cannot handle multiple tiers (e.g., transcription and translation, or underlying and surface orthographies), the Transcriber transcriptions were then imported into ELAN (Wittenburg et al., 2006) for further processing (e.g., correction, surface-form generation, translation).

2.4 Corpus Size and Partition

Though endangered, YMC does not suffer from the same level of resource limitations that affect most ASR work with ELs (Ćavar et al., 2016; Jimerson et al., 2018; Thai et al., 2020). The YMC-EXP corpus, developed over more than ten years, provided 100 hours for the ASR training, validation, and test corpora. There are 505 recordings from 34 speakers in the YMC-EXP corpus, and the transcriptions for the YMC-EXP were all carefully proofed by an expert native-speaker linguist. As shown in Table 1, we offer a train-valid-test split with no overlap in content between the sets. The partition considers the balance between speakers and the relative size of each part.

  Corpus     Subset       UttNum   Dur (h)
  EXP        Train        52763    92.46
             Validation    2470     4.01
             Test          1577     2.52
  EXP(-CS)   Train        35144    58.60
             Validation    1301     2.16
             Test          2603     4.35
  NT         Clean-Dev     2523     3.45
             Clean-Test    2346     3.31
             Noise-Test    1335     1.60

Table 1: YMC Corpus Partition for EXP (corpus with expert transcription), EXP(-CS) (subset of EXP without "code-switching"), and NT (corpus with paired novice and expert transcription)

As introduced in Section 2.2, the YMC-NT corpus has both expert and novice transcriptions. It includes only three speakers for a total of 8.36 hours. In the recordings of two consultants, the environment is relatively clean and free of background noise. The speech of the other individual, however, is frequently affected by background noise. This seems coincidental, as all three were recorded together, one after the other in random order. Given this situation, we split the corpus into three sets: clean-dev (speaker EGS), clean-test (speaker CTB), and noise-test (speaker FEF; see Table 1).

The "code-switching" discussed in Section 2.3 (b) introduces different phonological representations and makes it difficult to train an HMM-based model using language lexicons. Therefore, previous work (Mitra et al., 2016) using an HMM-based system for YMC did not consider phrases with "code-switching". To compare our model with their results, we have used the same experimental corpus in our evaluation. Their corpus (YMC-EXP(-CS)), shown in Table 1, is a subset of the YMC-EXP; the YMC-EXP(-CS) corpus does not contain "code-switching" phrases, i.e., phrases with words that were tagged as Spanish origin and transcribed without tone.

3 ASR Experiments

3.1 End-to-End ASR

As ESPNet (Watanabe et al., 2018) is widely used in open-source end-to-end ASR research, our end-to-end ASR systems are all constructed using ESPNet. For the encoder, we employed the conformer structure (Gulati et al., 2020), while for the decoder we used the transformer structure to condition on the full context, following the work of Karita et al. (2019b). The conformer architecture is a state-of-the-art innovation developed from previous transformer-based encoding methods (Karita et al., 2019a; Guo et al., 2020). A comparison between the conformer and transformer encoders shows the value of applying state-of-the-art end-to-end ASR to ELs.
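The partition constraint described in Section 2.4 — no content overlap between train, validation, and test — amounts to assigning whole recordings, not individual utterances, to splits. A minimal sketch of that idea; the `(recording_id, utt_id, duration)` tuples and the ratio-based assignment are illustrative assumptions, not the procedure actually used to build Table 1:

```python
import random

def split_by_recording(utts, ratios=(0.93, 0.04, 0.03), seed=0):
    """Partition utterances into train/validation/test so that all
    utterances from one recording land in the same set (no content
    overlap across sets)."""
    by_rec = {}
    for rec_id, utt_id, dur in utts:
        by_rec.setdefault(rec_id, []).append((utt_id, dur))
    rec_ids = sorted(by_rec)
    random.Random(seed).shuffle(rec_ids)          # deterministic shuffle
    total = sum(dur for _, _, dur in utts)
    bounds = (ratios[0], ratios[0] + ratios[1])   # cumulative cut points
    splits, acc = ([], [], []), 0.0
    for rec_id in rec_ids:
        frac = acc / total if total else 0.0
        idx = 0 if frac < bounds[0] else (1 if frac < bounds[1] else 2)
        splits[idx].extend(u for u, _ in by_rec[rec_id])
        acc += sum(d for _, d in by_rec[rec_id])
    return splits
```

Grouping by recording before splitting is what prevents two halves of one conversation from landing in both train and test.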
3.2 Experiments and Results

As discussed above, our end-to-end model applied an encoder-decoder architecture with a conformer encoder and a transformer decoder. The architecture of the model follows Gulati et al. (2020), while its configuration follows the aishell conformer recipe from ESPNet.⁶ The experiment is reproducible using ESPNet.

As the end-to-end system models are based on word pieces, we adopted CER and WER as evaluation metrics. They help demonstrate the system's performance at different levels of granularity. But because the HMM-based systems decode with a word-based lexicon, for comparison to HMM systems we use only the WER metric. To examine the model thoroughly, we conducted several comparative experiments, discussed below.

(a) Comparison with HMM-based Methods: We first compared our end-to-end method with the Deep Neural Network-Hidden Markov Model (DNN-HMM) methods proposed in Mitra et al. (2016). In their work, Gammatone Filterbanks (GFB), articulation, and pitch are configured for the DNN-HMM model; a further baseline is a DNN-HMM model using Mel Filterbanks (MFB). In recent unpublished work, Kwon and Kathol developed a state-of-the-art CNN-HMM-based ASR model⁷ for YMC based on the lattice-free Maximum Mutual Information (LF-MMI) approach, also known as the "chain model" (Povey et al., 2016). The experimental data for the above HMM-based models is YMC-EXP(-CS), discussed in Section 2.4. For the comparison, our end-to-end model adopted the same partition to ensure fair comparability with their results.

Table 2 shows the comparison between the HMM-based systems and our end-to-end system on YMC-EXP(-CS). It indicates that even without an external language lexicon the end-to-end system significantly outperforms both the DNN-HMM baseline models and the CNN-HMM-based state-of-the-art model.

  Model            Feature                 WER
  DNN-HMM          MFB                     36.9
  DNN-HMM          GFB + Articu. + Pitch   31.1
  CNN-HMM (Chain)  MFCC                    19.1
  E2E-Conformer    MFB + Pitch             15.4

Table 2: Comparison between HMM-based Models and the End-to-End Conformer (E2E-Conformer) Model on YMC-EXP(-CS), a subset of the YMC-EXP without "code-switching"

In Section 2.3 (b), we note that "code-switching" is invariably present in EL speech (e.g., YMC). Thus, ASR models built on "code-switching"-free corpora (like YMC-EXP[-CS]) are not practical for real-world usage. However, a language lexicon is available only for the YMC-EXP(-CS) corpus, so we cannot conduct HMM-based experiments with either the YMC-EXP or YMC-NT corpora.

(b) Comparison with Different End-to-End ASR Architectures: We also conducted experiments comparing models with different encoders and decoders on the YMC-EXP corpus. For a Recurrent Neural Network-based (E2E-RNN) model, we followed the best hyper-parameter configuration discussed in Zeyer et al. (2018). For a Transformer-based (E2E-Transformer) model, the same configuration from Karita et al. (2019b) was adopted. Both models shared the same data preparation process as the E2E-Conformer model.

Table 3 compares different end-to-end ASR architectures on the YMC-EXP corpus.⁸ The E2E-Conformer obtained the best results, with significant WER improvement over the E2E-RNN and E2E-Transformer models. The E2E-Conformer's WER on YMC-EXP(-CS) is slightly lower than that obtained for the whole YMC-EXP corpus, despite the significantly smaller training set in the YMC-EXP(-CS) corpus. Since that subset excludes Spanish words, "code-switching" may well be a problem to consider in ASR for endangered languages such as YM.

  Model            CER (dev/test)   WER (dev/test)
  E2E-RNN          9.2/9.3          19.1/19.2
  E2E-Transformer  7.8/7.9          16.3/16.7
  E2E-Conformer    7.7/7.7          16.0/16.1

Table 3: End-to-End ASR Results on YMC-EXP (corpus with "code-switching")

⁶ See Appendix for details about the model configuration.
⁷ See Appendix for details about the model configuration.
⁸ The train set in YMC-EXP is significantly larger than that in YMC-EXP(-CS), the subset of YMC-EXP from which all lines containing a Spanish-origin word have been removed.
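The CER and WER metrics used throughout can be computed from a standard Levenshtein edit distance over characters or word tokens; a minimal sketch (not the scoring scripts actually used in the experiments):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit insertion/deletion/substitution
    costs, computed with a rolling row of the DP table."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution
            )
    return d[len(hyp)]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word token."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

Running the same distance over characters or over word tokens is what gives the two "levels of granularity" mentioned above.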
(c) Comparison with Different Transcription Levels: In addition to comparing model architectures, we compared the impact of transcription level on the ASR model. E2E-Conformer models with the same configurations were trained using both the surface and the underlying transcription forms, discussed in Section 2.3. We also trained separate RNN language models for fusion, and unigram language models to extract word pieces, for the different transcription levels.

Table 4 shows the E2E-Conformer results for both underlying and surface transcription levels. As introduced in Section 2.3, the surface form reduces several linguistic and phonological processes compared to the underlying practical form. The results indicate that the end-to-end system is able to infer those morphological and phonological processes automatically and maintain a consistently low error rate.

  Transcription Level   CER (dev/test)   WER (dev/test)
  Surface               8.0/7.6          16.6/16.3
  Underlying            7.7/7.7          16.0/16.1

Table 4: E2E-Conformer Results for Two Transcription Levels (Underlying represents morphological divisions and underlying phonemes before the application of phonological rules; Surface is reflective of spoken forms and lacks morphological parsing)

(d) Comparison with Different Corpus Sizes: As introduced in Section 1, most ELs are considered low-resource for ASR purposes. To measure the impact of resource availability on ASR accuracy, we trained the E2E-Conformer model on 10-, 20-, and 50-hour subsets of YMC-EXP. The results demonstrate the model's performance over different sizes of resources.

Table 5 shows the E2E-Conformer performances on different amounts of training data. It demonstrates how the model consumes data. As corpus size is incrementally increased, WER decreases significantly. It is apparent that the model still has the capacity to improve performance with more data. The result also indicates that our system can get reasonable performance from 50 hours of data. This would be an important guideline when we collect a new EL database.

  Corpus Size   CER (dev/test)   WER (dev/test)
  10h           19.4/19.5        39.1/39.2
  20h           12.6/12.7        26.2/26.2
  50h            8.6/8.7         18.0/18.0
  Whole (92h)    7.7/7.7         16.0/16.1

Table 5: E2E-Conformer Results on Different Corpus Sizes (YMC-EXP)

(e) The Framework's Generalizability: To test the end-to-end ASR systems' generalization ability, we conducted the same end-to-end training and test procedures on another endangered language: Highland Puebla Nahuatl (high1278; azz). This corpus is also open access under the same CC license.⁹ It comprises 954 recordings that total 185 hours 22 minutes, including 120 hours of transcribed data in ELAN and 65 hours still only in Transcriber and not used in ASR training.¹⁰

Table 6 shows the performance of three different end-to-end ASR architectures on Highland Puebla Nahuatl. For this language, the E2E-Conformer again offers better performance than the other models. Table 7 shows the E2E-Conformer performances on different amounts of training data for Highland Puebla Nahuatl. We can observe that 50 hours is a reasonable corpus size for an EL, similar to the experiments in Table 5. These experiments indicate the general ability to consistently apply end-to-end ASR systems across ELs.

  Model            CER (dev/test)   WER (dev/test)
  E2E-RNN          10.3/9.9         26.8/25.4
  E2E-Transformer   9.1/9.1         23.7/21.7
  E2E-Conformer     9.9/8.6         23.5/21.7

Table 6: End-to-End ASR Results on another EL: Highland Puebla Nahuatl

  Corpus Size    CER (dev/test)   WER (dev/test)
  10h            18.3/17.5        44.7/43.3
  20h            14.2/12.9        34.8/33.3
  50h            11.0/10.2        27.0/24.9
  Whole (120h)    9.9/8.6         23.5/21.7

Table 7: E2E-Conformer Results on another EL: Highland Puebla Nahuatl (Different Corpus Sizes)

⁹ http://openslr.org/92
¹⁰ The recordings are almost all with two channels and two speakers in natural conversation.
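The 10-, 20-, and 50-hour training subsets can be drawn by sampling utterances against a duration budget. A minimal sketch; the `(utt_id, duration_seconds)` format is a hypothetical assumption, and the paper does not specify its exact subsetting procedure:

```python
import random

def take_hours(utts, hours, seed=0):
    """Select a subset of utterances totaling at most `hours` of speech.

    utts: list of (utt_id, duration_seconds) pairs (hypothetical format).
    Returns the selected utterance ids.
    """
    pool = list(utts)
    random.Random(seed).shuffle(pool)   # deterministic random sample
    budget = hours * 3600.0
    subset = []
    for utt_id, dur in pool:
        if dur <= budget:               # skip utterances that would overshoot
            subset.append(utt_id)
            budget -= dur
    return subset
```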
4 Novice Transcription Correction

Finally, this paper presents Novice Transcription Correction (NTC) as a task for EL documentation. That is, in this experiment we explore not only the possibility of using ASR to enhance the accuracy of a YM novice transcription, but also of combining novice transcription and ASR to achieve accuracy that surpasses that of either component. Below we first analyze error patterns manifested in novice transcriptions. Next, we introduce two baselines that fuse ASR hypotheses and novice transcription for the NTC task.

4.1 Novice Transcription Error

As mentioned in Section 1, transcriber shortages have been a severe challenge for EL documentation. Before 2019, only the native speaker linguist, Rey Castillo García, could accurately transcribe the segments and tones of YMC. To mitigate the YMC transcriber shortage, in 2019 Castillo began to train another speaker, Esteban Guadalupe Sierra. First, a computer course was designed to incrementally teach Guadalupe segmental and tonal phonology. In the next stage, he was given YMC-NT corpus recordings to transcribe. Compared to the paired expert transcription, the novice achieved a CER of 6.0% on clean-dev, defined in Table 1. However, it is not feasible to spend many months training speakers with no literacy skills to acquire the transcription proficiency achieved by Guadalupe in our project. Moreover, even with a 6.0% CER, there are still enough errors to require significant annotation/correction by the expert, Castillo. The state-of-the-art ASR system (e.g., E2E-Conformer) shown in Table 3 gets an 8.2% CER on the clean-dev set, more errors than the novice. So for YMC, ASR is still not a good enough substitute for a proficient novice.

As Amith and Castillo worked with the novice, they saw a repetition of types of errors that they worked to correct by giving the novice exercises focused on these transcription shortcomings. The end-to-end ASR, however, has demonstrated a different pattern of errors. For example, it developed a fair understanding of the rules for suppleting tones, marked by parentheses around the suppleted tones. Rather than over-specify the NTC correction algorithm, we first analyzed the error-type distribution using the clean-dev set from the YMC-NT corpus, as shown in Table 8.

  Error Types       Novice    ASR
  Enclitics (=)         96    243
  Prefixes (-)         141     62
  Glottal Stop (')     341    209
  Parenthesis         1607    302
  Tone                4144   3241
  Stem-Nasal (n)         0      6
  Others              4263  10175
  Total              10592  14232

Table 8: Character Error-type Distribution of Novice and ASR (by number of errors)

4.2 Novice-ASR Fusion

Rapid comparison of the types of errors in each transcription (novice and ASR) demonstrated consistent patterns and led us to hypothesize that a fusion system might automatically correct many of these errors. Two baseline methods are examined for the fusion: a voting-based system (Fiscus, 1997) and a rule-based system.

The voting-based system follows the definition in Fiscus (1997), combining hypotheses from different ASR models with the novice transcription. The framework of rule-based fusion is shown in Figure 1. The rules are defined on different linguistic units: words, syllables, and characters. They assume a hierarchical alignment between the novice transcription and the ASR hypotheses. The rules are applied to the transcription from the word to the syllable to the character level. The rules were developed based on continual evaluation of the novice's progress. Thus they will be different but discoverable when

[Figure 1: Novice-ASR Fusion Process — novice transcriptions and ASR hypotheses pass through word alignment and word rules, then syllable alignment and syllable rules, then character alignment and character rules, producing a hybrid transcription.]
Clean-Dev Clean-Test Noise-Test Overall Model CER WER CER WER CER WER CER WER A. Novice 6.0 21.5 6.4 22.6 8.4 26.6 6.8 23.1 B. E2E-Transformer 9.8 23.1 8.8 21.2 24.3 47.0 12.9 28.1 C. E2E-Conformer 8.2 19.6 8.2 19.1 23.6 44.1 12.0 25.3 D. E2E-Conformer(50h) 10.5 25.0 9.9 23.7 25.7 50.1 14.0 30.5 E. Fusion1 (A+C) 6.3 20.6 6.9 22.0 13.1 38.6 8.2 25.4 F. Fusion1 (A+D) 7.0 22.9 7.5 24.5 14.0 41.5 8.9 28.0 G. Fusion2 (A+C) 5.1 17.6 5.5 18.7 9.6 30.3 6.3 21.1 H. Fusion2 (A+D) 5.5 19.4 5.9 20.4 10.1 32.6 6.8 23.0 I. ROVER (A+B+C) 4.7 14.6 4.6 13.8 12.4 32.6 6.5 18.5 J. ROVER-Fusion2 (A+B+C+E) 4.5 16.1 4.7 16.7 9.0 28.3 5.7 19.3 Table 9: NTC Results on YMC-NT (the results are evaluated using the expert transcription in YMC-NT). Model D is trained with a 50-hour subset of the YMC-EXP as shown in Table 5. applied to a new language. However, the general setting is computation-efficient. However, it does principle should be applicable to other ELs: Novice not consider how the contents mismatch between trainees will learn certain transcription tasks easier the Ci and Cj . Therefore, we adopt a hierarchical than others. Below we explain the rules for YMC. dynamic alignment. In this method, the character Word Rules: If a word from the novice transcrip- alignment follows the native setting. While the tion is Spanish (i.e., no tones and no linguistic LS (Ci , Cj0 ) for syllable alignment is defined as the indications [-, =, ’] that mark it as Mixtec), keep normalized character-level edit distance between the novice transcription. If the novice has extra Ci and Cj0 as follows: words, not in the ASR hypothesis, keep those extra words. D[Ci , Cj0 ] LS (Ci , Cj0 ) = (1) Syllable Rules: If a novice syllable is tone initial, |Ci | use the corresponding ASR syllable. If the novice and the ASR have identical segments but different where the |Ci | is the lengths of the syllable. Simi- tones, use the ASR tones. 
When an ASR syllable larly, the LS (Ci , Cj0 ) for word alignment is defined has CVV or CV’V, and its corresponding novice based on syllable alignment. syllable has CV,11 use the ASR syllable (CVV or CV’V). If the tone from either transcription system 5 NTC Experiments follows a consonant (except a stem-final n), use the 5.1 Experimental Settings other system’s transcription. Character Rules: If the ASR has a hyphen, equal The novice transcription, the E2E-Transformer sign, parentheses, glottal stop which is absent from model, and the E2E-Conformer model were consid- the novice transcription, then always trust the ASR ered as baselines for the NTC task. To evaluate the and maintain the aforementioned symbols in the system for reduced training data, we also show our final transcription. results of E2E-Conformer trained with a 50-hour We apply the edit distance (Wagner and Fischer, subset. For the end-to-end models, we adopted the 1974) to find the alignment between the ASR model trained model from Section 3 with the same decod- hypothesis {C1 , ..., Cn } and the Novice transcrip- ing set-ups. To test the effectiveness of the hierar- tion {C10 , ..., Cm 0 }. The L , L , L are introduced I D S chical dynamic alignment, we tested the data with in the dynamic function as the insertion, deletion, two fusion systems, namely Fusion1 and Fusion2. and substitution loss, respectively. In the naive set- The Fusion1 system used the naive settings of edit ting, LI , LD are both set to 1. The LS is set to 1 distance, while the Fusion2 system adopted the hi- if Ci is different from Cj0 and 0 otherwise. This erarchical dynamic alignment. Both fusion systems adopt rules defined in Section 4.2. Two configura- 11 A CV syllable can occur in a monomoraic word. But tions for voting-based methods were tested. The novice will often write a CV word when it should be CVV or CV’V. Stem-final syllables can be CV, CVV or CV’V. 
Two configurations of voting-based methods were tested. The first, "ROVER," combined three hypotheses (i.e., the E2E-Transformer, the E2E-Conformer, and the novice transcription). The second, "ROVER-Fusion2," combined the Fusion2 system with the above three.

5.2 Results

As shown in Table 9, the voting-based and rule-based methods all significantly reduce novice errors for clean speech.12 For the noise test, however, the novice transcription is the most robust method. In the overall results, the ROVER system (model I) achieves a lower WER, while the ROVER-Fusion2 system (model J) reaches a lower CER. Model J significantly reduces specific error types, including tone errors (25%), enclitic errors (50%), and parentheses errors (87.5%). In addition, models D, F, and H indicate that the system can still reduce clean-environment novice errors using ASR models trained with a 50-hour subset of the YMC-EXP corpus.

As discussed in Section 4, novice and ASR transcriptions manifest distinct patterns of error and thus can complement each other. Table 9 shows that our proposed rule-based and voting-based fusion methods can potentially eliminate errors introduced by the novice transcriber and can thereby mitigate the transcriber shortage. We should note, however, that noisy recording conditions negatively affect a fusion approach, as ASR performs poorly under such conditions (>23% CER); for practical purposes, the novice transcription alone (8.4% CER on the noise test in Table 9) may then be preferable.

In sum, we propose two correction methods. The first is a rule-based novice-ASR hybridization. The second is a voting-based method that combines hypotheses from the novice and end-to-end ASR systems. Empirical studies on the YMC-NT corpus indicate that both methods significantly reduce the CER/WER of the novice transcription for clean speech.

The above discussion suggests that a useful approach to EL documentation using both human and computational (ASR) resources might focus on training each system for particular transcription tasks. If we know from the start that ASR will be used to correct novice transcriptions in areas of difficulty, we could train an ASR system to maximize accuracy in precisely those areas that challenge novice learning.

References

Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, and Alexis Michaud. 2018. Evaluating phonemic transcription of low-resource tonal languages for language documentation. In LREC 2018 (Language Resources and Evaluation Conference), pages 3356–3365.

Antonios Anastasopoulos and David Chiang. 2017. A case study on using speech-to-translation alignments for language documentation. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 170–178.

Claude Barras, Edouard Geoffrois, Zhibiao Wu, and Mark Liberman. 1998. Transcriber: a free tool for segmenting, labeling and transcribing speech.
… probabilistic transcriptions. In Interspeech, pages 3858–3862.

Jonathan G Fiscus. 1997. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 347–354.

Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. 2014. A pitch extraction algorithm tuned for automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2494–2498.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Lenore A Grenoble, Peter K Austin, and Julia Sallabank. 2011. Handbook of Endangered Languages.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al. 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, et al. 2020. Recent developments on ESPnet toolkit boosted by Conformer. arXiv preprint arXiv:2010.13956.

Vishwa Gupta and Gilles Boulianne. 2020a. Automatic transcription challenges for Inuktitut, a low-resource polysynthetic language. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2521–2527.

Vishwa Gupta and Gilles Boulianne. 2020b. Speech transcription challenges for resource constrained indigenous language Cree. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 362–367.

Nils Hjortnaes, Niko Partanen, Michael Rießler, and Francis M Tyers. 2020. Towards a speech recognizer for Komi, an endangered and low-resource Uralic language. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 31–37.

Robbie Jimerson and Emily Prud'hommeaux. 2018. ASR for documenting acutely under-resourced indigenous languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Robbie Jimerson, Kruthika Simha, Raymond Ptucha, and Emily Prud'hommeaux. 2018. Improving ASR output for endangered language documentation. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pages 187–191.

Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019a. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 449–456.

Shigeki Karita, Nelson Enrique Yalta Soplin, Shinji Watanabe, Marc Delcroix, Atsunori Ogawa, and Tomohiro Nakatani. 2019b. Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of Interspeech 2019, pages 1408–1412.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Christian Lehmann. 1999. Documentation of endangered languages: A priority task for linguistics. ASSIDUE Arbeitspapiere des Seminars für Sprachwissenschaft der Universität Erfurt, 1.

Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara. 2020. Speech corpus of Ainu folklore and end-to-end speech recognition for Ainu language. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2622–2628.

Alexis Michaud, Oliver Adams, Trevor Cohn, Graham Neubig, and Séverine Guillaume. 2018. Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit. Language Documentation & Conservation, 12:393–429.

Alexis Michaud, Eric Castelli, et al. 2014. Towards the automatic processing of Yongning Na (Sino-Tibetan): developing a 'light' acoustic model of the target language and testing 'heavyweight' models from five national languages. In 4th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2014), pages 153–160.

Vikramjit Mitra, Andreas Kathol, Jonathan D Amith, and Rey Castillo García. 2016. Automatic speech transcription for low-resource languages: the case of Yoloxóchitl Mixtec (Mexico). In Proceedings of Interspeech 2016, pages 3076–3080.
Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. SpecAugment: A simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pages 2613–2617.

Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, and Alex Waibel. 2019. Very deep self-attention networks for end-to-end speech recognition. Proceedings of Interspeech 2019, pages 66–70.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pages 2751–2755.

Jorge A Suárez. 1983. The Mesoamerican Indian Languages. Cambridge University Press.

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 4278–4284.

Bao Thai, Robert Jimerson, Raymond Ptucha, and Emily Prud'hommeaux. 2020. Fully convolutional ASR for less-resourced endangered languages. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 126–130.

Robert A Wagner and Michael J Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018. ESPnet: End-to-end speech processing toolkit. Proceedings of Interspeech 2018, pages 2207–2211.

Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: a professional framework for multimodality research. In 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 1556–1559.

Alexander Zahrer, Andrej Zgank, and Barbara Schuppler. 2020. Towards building an automatic transcription system for language documentation: Experiences from Muyu. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2893–2900.

Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Proceedings of Interspeech 2018, pages 7–11.

A Appendices

Experimental Settings for End-to-End ASR: All end-to-end ASR systems adopted the hybrid CTC/attention architecture integrated with an RNN language model. The best model was selected on the basis of performance on the development set. The input acoustic features were 83-dimensional log-Mel filterbank features with pitch features (Ghahremani et al., 2014). The window length and frame shift were set to 25 ms and 10 ms. SpecAugment was adopted for data augmentation (Park et al., 2019). The prediction targets were 150 word pieces trained using unigram language modeling (Kudo and Richardson, 2018), both for the surface and the underlying form. All end-to-end models are fused with RNN language models.13 The CTC ratio for hybrid CTC/attention was set to 0.3, and the decoding beam size was 20. Training and testing are based on PyTorch.

13 Our experiments show that the RNN language model reduces WER by about 1%.

E2E-Conformer Configuration: The E2E-Conformer used 12 encoder blocks and 6 decoder blocks. All blocks adopted a 2048-dimensional feed-forward layer and four-head multi-head attention with 256 dimensions. The kernel size in the Conformer block was set to 15. For training, the batch size was set to 32; the Adam optimizer with a 1.0 learning rate and a Noam scheduler with 25,000 warmup steps were used. We trained for a maximum of 50 epochs. The parameter size is 43M.

E2E-RNN Configuration: The E2E-RNN used 3 encoder blocks and 2 decoder blocks. All blocks adopt 1024 hidden units, and the location-based attention is 1024-dimensional. Adadelta was chosen as the optimizer, and we trained for a maximum of 15 epochs. The parameter size is 108M.

E2E-Transformer Configuration: The E2E-Transformer used 12 encoder blocks and 6 decoder blocks. All blocks adopted a 2048-dimensional feed-forward layer and four-head multi-head attention with 256 dimensions. The Adam optimizer with a 1.0 learning rate and a Noam scheduler with 25,000 warmup steps were used. We trained for a maximum of 100 epochs. The parameter size is 27M.

Experimental Settings for HMM-based ASR: The acoustic feature input for this model is 40-dimensional Mel-frequency cepstral coefficients (MFCCs). The lexicon for the HMM-based models is phone-based. The transcriptions are mapped to surface representations and then to phones (a total of 197 phones, as each tone on a given vowel is a different phone). There are 22,465 total entries in the lexicon. The chain model is trained with a sequence-level objective function and operates with an output frame rate of 30 ms, three times longer than the previous standard. The longer frame rate increases decoding speed, which in turn makes it possible to operate with a significantly deeper DNN architecture for acoustic modeling. The best results were achieved with a neural network based on the ResNet architecture (Szegedy et al., 2017), consisting of an initial layer for a linear discriminant analysis (LDA) transformation and subsequent alternating 160-dimensional bottleneck layers, 45 layers in total. The DNN acoustic model is then compiled with a 4-gram language model into a weighted finite-state transducer for word-sequence decoding.
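The phone-based lexicon design above, in which each vowel-plus-tone combination counts as a distinct phone, can be illustrated with a small sketch. The regex and the example word below are simplifications we introduce for illustration only; they are not the project's actual lexicon-building code, and the real orthography-to-phone mapping is richer than this.

```python
import re


def word_to_phones(word):
    """Split a toned orthographic word into phone symbols.

    Toy illustration: tones are written as digits after the vowel, so each
    vowel followed by its tone digits becomes one toned-vowel phone (e.g.
    'a4'), while any other run of characters is kept as a consonant symbol.
    """
    return re.findall(r"[aeiou][0-9]+|[^aeiou0-9]+", word)


print(word_to_phones("ba4ta3"))  # -> ['b', 'a4', 't', 'a3']
```

Under this scheme, 'a3' and 'a4' are different phones even though they share a vowel, which is how a tone language with four level tones plus contours quickly reaches an inventory on the order of the 197 phones reported above.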