Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization Aidan Pine1 Dan Wells2 Nathan Thanyehténhas Brinklow3 aidan.pine@nrc.ca dan.wells@ed.ac.uk nathan.brinklow@queensu.ca Patrick Littell1 Korin Richmond2 patrick.littell@nrc.ca korin.richmond@ed.ac.uk Abstract class. Text-to-speech synthesis technology (TTS) shows potential for supplementing text-based lan- This paper describes the motivation and devel- guage learning tools with audio in the event that the opment of speech synthesis systems for the pur- domain is too large to be recorded directly, or as poses of language revitalization. By building speech synthesis systems for three Indigenous an interim solution pending recordings from first- languages spoken in Canada, Kanien’kéha, language speakers. Gitksan & SENĆOŦEN, we re-evaluate the Development of TTS systems in this context question of how much data is required to build faces several challenges. Most notable is the usual low-resource speech synthesis systems featur- assumption that neural speech synthesis models re- ing state-of-the-art neural models. For ex- quire at least tens of hours of audio recordings with ample, preliminary results with English data corresponding text transcripts to be trained ade- show that a FastSpeech2 model trained with 1 hour of training data can produce speech with quately. Such a data requirement is far beyond comparable naturalness to a Tacotron2 model what is available for the languages we are con- trained with 10 hours of data. Finally, we mo- cerned with, and is difficult to meet given the lim- tivate future research in evaluation and class- ited time of the relatively small number of speak- room integration in the field of speech synthe- ers of these languages. The limited availability of sis for language revitalization. Indigenous language speakers also hinders the sub- jective evaluation methods often used in TTS stud- 1 Introduction ies, where naturalness of synthetic speech samples There are approximately 70 Indigenous languages is judged by speakers of the language in question. spoken in Canada, from 10 distinct language fam- In this paper, we re-evaluate some of these chal- ilies (Rice, 2008). As a consequence of the resi- lenges for applying TTS in the low-resource con- dential school system and other policies of cultural text of language revitalization. We build TTS sys- suppression, the majority of these languages now tems for three Indigenous languages of Canada, have fewer than 500 fluent speakers remaining, with training data ranging from 25 minutes to 3.5 most of them elderly. Despite this, interest from hours, and confirm that we can produce acceptable students and parents in Indigenous language edu- speech as judged by language teachers and learn- cation continues to grow (Statistics Canada, 2016); ers. Outputs from these systems could be suitable we have heard from teachers that they are over- for use in some classroom applications, for exam- whelmed with interest from potential students, and ple a speaking verb conjugator. the growing trend towards online education means many students who have not previously had access 2 Background to language classes now do. Supporting these growing cohorts of students 2.1 Language Revitalization comes with unique challenges for languages with It is no secret that the majority of the world’s lan- few fluent first-language speakers. A particular guages are in crisis, and in many cases this cri- concern of teachers is to provide their students sis is even more urgent than conservation biolo- with opportunities to hear the language outside of gists’ dire predictions for flora and fauna (Suther- 1 National Research Council Canada land, 2003). However, the ‘doom and gloom’ 2 University of Edinburgh rhetoric that often follows endangered languages 3 Queen’s University over-represents vulnerability and under-represents 7346 Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers, pages 7346 - 7359 May 22-27, 2022 c 2022 Association for Computational Linguistics
the enduring strength of Indigenous communi- ties who have refused to stop speaking their lan- (1) Senòn:wes guages despite over a century of colonial policies you.to.it-like-habitual against their use (Pine and Turin, 2017). Contin- ‘You like it.’ uing to speak Indigenous languages is often seen as a political act of anti-colonial resistance. As (2) Takenòn:wes such, the goals of any given language revitaliza- you.to.me-like-habitual tion effort extend far beyond memorizing verb ‘You like me.’ paradigms to broader goals of nationhood and self-determination (Pitawanakwat, 2009; McCarty, Figure 1: An example of fusional morphology of 2018). Language revitalization programs can also agent/patient pairs in Kanien’kéha transitive verb have immediate and important impacts on factors paradigms (from Kazantseva et al., 2018) including community health and wellness (Whalen et al., 2016; Oster et al., 2014). There is a growing international consensus on iar words. Assuming a rate of 200 forms/hr for 4 the importance of linguistic diversity, from the hours per day, 5 days per week, this would take a Truth & Reconciliation Commission of Canada teacher out of the classroom for approximately a (TRC) report in 2015 which issued nine calls year. Considering Kawennón:nis is anticipated to to action related to language, to 2019 being de- have over 1,000,000 unique forms by the time the clared an International Year of Indigenous Lan- grammar modelling work is finished, recording au- guages by the UN, and 2022-2032 being declared dio manually becomes infeasible. an International Decade of Indigenous Languages. The research question that then emerged was From 1996 to 2016, the number of speakers of ‘what is the smallest amount of data needed in or- Indigenous languages increased by 8% (Statistics der to generate audio for all verb forms in Kawen- Canada, 2016). These efforts have been success- nón:nis’. Beyond Kawennón:nis, we anticipate ful despite a lack of support from digital technolo- that there are many similar language revitalization gies. While opportunities may exist for technol- projects that would want to add supplementary au- ogy to assist and support language revitalization dio to other text-based pedagogical tools. efforts, these technologies must be developed in a 2.3 Speech Synthesis way that does not further marginalize communities (Brinklow et al., 2019; Bird, 2020). The last few years have shown an explosion in research into purely neural network-based ap- 2.2 Why TTS for Language Revitalization? proaches to speech synthesis (Tan et al., 2021). Our interest in speech synthesis for language Similar to their HMM/GMM predecessors, neural revitalization was sparked during user evalua- pipelines typically consist of both a network pre- tions of Kawennón:nis (lit. ‘it makes words’), dicting the acoustic properties of a sequence of a Kanien’kéha verb conjugator (Kazantseva text and a vocoder. The feature prediction net- et al., 2018) developed in collaboration between work must be trained using parallel speech/text the National Research Council Canada and the data where the input is typically a sequence of char- Onkwawenna Kentyohkwa adult immersion pro- acters or phones that make up an utterance, and gram in Six Nations of the Grand River in Ontario, the output is a sequence of fixed-width frames of Canada. Kawennón:nis models a pedagogically- acoustic features. In most cases the predictions important subset of verb conjugations in XFST from the TTS model are log Mel-spectral features (Beesley and Karttunen, 2003), and currently and a vocoder is used to generate the waveform produces 247,450 unique conjugations. The from these acoustic features. pronominal system is largely responsible for much Much of the previous work on low resource of this productivity, since in transitive paradigms, speech synthesis has focused on transfer learning; agent/patient pairs are fused, as illustrated in that is, ‘pre-training’ a network using data from a Figure 1. language that has more data, and then ‘fine-tuning’ In user evaluations of Kawennón:nis, students using data from the low-resource language. One often asked whether it was possible to add audio of the problems with this approach is that the in- to the tool, to model the pronunciation of unfamil- put space often differs between languages. As the 7347
inputs to these systems are sequences of charac- non-trivial: there are limited amounts of text from ters or phones, and as these sequences are typi- which a speaker could read, and there are few peo- cally one-hot encoded, it can be difficult to devise ple available who are sufficiently literate in the lan- a principled method for transferring weights from guages to transcribe recorded audio. Re-focusing the source language network to the target if there speakers’ limited time to these tasks presents a sig- is a difference between the character or phone in- nificant opportunity cost; they are often already ventories of the two languages. Various strategies over-worked and over-burdened in under-funded have emerged for normalizing the input space. For and under-resourced language teaching projects. example, Demirsahin et al. (2018) propose a uni- As mentioned in §2.1, language technology fied inventory for regional multilingual training of projects that aim to assist language revitalization South Asian languages, while Tu et al. (2019) com- and reclamation efforts must be centered around pare various methods to create mappings between the primary goals of those efforts and ensure that source and target input spaces. Another proposal the means of developing the technology do not is to normalize the input space between source and distract or work against the broader sociopolitical target languages by replacing one-hot encodings of goals. A primary stress point for many natural text with multi-hot phonological feature encodings language processing projects involving Indigenous (Gutkin et al., 2018; Wells and Richmond, 2021). communities surrounds issues of data sovereignty. It is important that communities direct the devel- 2.4 Speech Synthesis for Indigenous opment of these tools, and maintain control, own- Languages in Canada ership, and distribution rights for their data, as well There is extremely little published work on speech as for the resulting speech synthesis models (Kee- synthesis for Indigenous languages in Canada (and gan, 2019; Brinklow, 2021). In keeping with this, North America generally). A statistical parametric the datasets described in this paper are not being speech synthesizer using Simple4All was recently released publicly at this time. developed for Plains Cree (Harrigan et al., 2019; To test the feasibility of developing speech Clark, 2014). Although it was unpublished, two synthesis systems for Indigenous languages, we highschool students1 created a statistical paramet- trained models for three unrelated Indigenous lan- ric speech synthesizer for Kanien’kéha by adapting guages, Kanien’kéha (§3.1), Gitksan (§3.2), and eSpeak (Duddington and Dunn, 2007). We know SENĆOŦEN (§3.3). of no other attempts to create speech synthesis sys- 3.1 Kanien’kéha tems for Indigenous languages in Canada. Else- where in North America, a Tacotron2 system has Kanien’kéha2 (a.k.a. Mohawk) is an Iroquoian lan- been built for Cherokee (Conrad, 2020), and some guage spoken by roughly 2,350 people in south- early work on concatenative systems for Navajo ern Ontario, Quebec, and northern New York state was discussed in a technical report (Whitman et al., (Statistics Canada, 2016). In 1979 the first immer- 1997), as well as on Rarámuri (Urrea et al., 2009). sion school of any Indigenous language in Canada was opened for Kanien’kéha, and many other very 3 Indigenous Language Data successful programs have been started since, in- cluding the Onkwawenna Kentyohkwa adult im- Although the term ‘low resource’ is used to de- mersion program in 1999 (Gomashie, 2019). scribe a wide swath of languages, most Indigenous In the late 1990s, a team of five Kanien’kéha languages in Canada would be considered ‘low- translators worked with the Canadian Bible Soci- resource’ in multiple senses of the word, having ety to translate and record parts of the Bible; one of both a low amount of available data (annotated the speakers on these recordings, Satewas, is still or unannotated), and a relatively low number of living. Translation runs in Satewas’s family, with speakers. Most Indigenous languages lack tran- his great-grandfather also working on Bible trans- scribed audio corpora, and fewer still have such lations in the 19th century. Later, a team of four data recorded in a studio context. Due to the lim- speakers and learners, including this paper’s third ited number of speakers, creating these resources is author, aligned the text and audio at the utterance 1 2 https://wiki.laptop.org/go/ As there are different variations of spelling, we use the Instructions_for_implementing_a_new_language_% spelling used in the communities of Kahnawà:ke and Kahne- 22voice%22_for_Speak_on_the_XO setà:ke throughout this paper 7348
level using Praat (Boersma and van Heuven, 2001) an orthography developed by the late SENĆOŦEN and ELAN (Brugman and Russel, 2004). speaker and WSÁNEĆ elder Dave Elliott. While ¯ While a total of 24 hours of audio were recorded, the community of approximately 3,500 has fewer members of the Kanien’kéha-speaking community than 10 fluent speakers, there are hundreds of learn- told us it would be inappropriate to use the voices ers, many of whom have been enrolled in years of speakers who had passed away, leaving only of immersion education in the language (First Peo- recordings of Satewas’s voice. Using a GMM- ples’ Cultural Council, 2018). based speaker ID system (Kumar, 2017), we re- As there were no studio-quality recordings of moved utterances by these speakers, then removed the SENĆOŦEN language publicly available, we utterances that were outliers in duration (less than recorded 25.92 minutes of the language with 0.4s or greater than 11s) and speaking rate (less PENÁĆ David Underwood reading two stories than 4 phones per second or greater than 15), originally spoken by elder Chris Paul. recordings with an unknown phase effect present, and utterances containing non-Kanien’kéha char- 4 Research Questions acters (e.g. proper names like ‘Euphrades’). Han- Given the motivation and context for language dling utterances with non-Kanien’kéha characters revitalization-based speech synthesis, a number of would have required grapheme-to-phoneme pre- research questions follow. Namely, how much diction capable of dealing with multilingual text data is required in order to build a system of rea- and code-switching which we did not have avail- sonable pedagogical quality? How do we evalu- able. The resulting speech corpus comprised 3.46 ate such a system? And, how is the resulting sys- hours of speech. tem best integrated into the classroom? In §4.1, 3.2 Gitksan we discuss the difficulty of evaluating TTS sys- tems in low-resource settings. We then discuss Gitksan3 is one of four languages belonging to preliminary results for English and Indigenous lan- the Tsimshianic language family spoken along guage TTS which show that acceptable speech the Skeena river and its surrounding tributaries quality can be achieved with much less training in the area colonially known as northern British data than usually considered for neural speech syn- Columbia. Traditional Gitksan territory spans thesis (§4.2). Finally, we suggest possible direc- some 33,000 square kilometers and is home to al- tions for pedagogical integration in section §4.4. most 10,000 people, with approximately 10% of the population continuing to speak the language 4.1 Low-Resource Evaluation fluently (First Peoples’ Cultural Council, 2018). As there were no studio-quality recordings of One of the most significant challenges in research- the Gitksan language publicly available, and as an ing speech synthesis for languages with few speak- intermediate speaker of the language, the first au- ers is evaluating the models. For some Indigenous thor recorded a sample set himself. In total, he languages in Canada, the total number of speakers recorded 35.46 minutes of audio reading isolated of the language is less than the number typically re- sentences from published and unpublished stories quired for statistical significance in a listening test (Forbes et al., 2017). (Wester et al., 2015). While the number of speak- ers in these conditions is sub-optimal for statisti- 3.3 SENĆOŦEN cal analysis, we have been told by the communi- ties we work with that the positive assessment of The SENĆOŦEN language is spoken by the a few widely respected and community-engaged WSÁNEĆ people on the southern part of the is- ¯ language speakers would be practically sufficient land colonially known as Vancouver Island. It be- to assess the pedagogical value of speech models longs to the Coastal branch of the Salish language in language revitalization contexts. For the experi- family. The WSÁNEĆ community runs a world- ¯ ments described in this paper, we ran listening tests famous language revitalization program4 , and uses for both Kanien’kéha and Gitksan with speakers, 3 We use Lonnie Hindle and Bruce Rigsby’s spelling of the teachers, and learners, but were not able to run any language, which, with the use of ‘k’ and ‘a’ is a blend of up- such tests for SENĆOŦEN due to very few speak- river (gigeenix) and downriver (gyets) dialects 4 https://wsanecschoolboard.ca/sencoten- ers with already busy schedules. language/ While some objective metrics do exist, such as 7349
Mel cepstral distortion (MCD, Kubichek, 1993), tirely, which could still optimize Tacotron2’s root we do not believe they should be considered reli- mean square error function over predicted acoustic able proxies for listening tests. Future research on features, but result in an untrained or degenerate speech synthesis for languages with few speakers attention network which is unable to properly gen- should prioritize efficient and effective means of eralize to new inputs at inference time when the evaluating results. teacher forcing input is unavailable. Attention fail- In many cases, including in the experiment de- ures represent a characteristic class of errors for scribed in §4.2, artificial data constraints can be models such as Tacotron2, for example skipping placed on a language with more data, like En- or repeating words from the input text (Valentini- glish, to simulate a low-resource scenario. While Botinhao and King, 2021). this technique can be insightful and it is tempt- There have been many proposals to improve ing to draw universal conclusions, English is lin- training of the attention network, for example by guistically very different from many of the other guiding the attention or using a CTC loss function languages spoken in the world. Accordingly, we to respect the monotonic alignment between text should be cautious not to assume that results from inputs and speech outputs (Tachibana et al., 2018; these types of experiments will necessarily transfer Liu et al., 2019; Zheng et al., 2019; Gölge, 2020). or extend to genuinely low-resource languages. As noted by Liu et al. (2019), increasing the so- called ‘reduction factor’ – which applies dropout 4.2 How much data do you really need? to the autoregressive frames – can also help the The first question to answer is whether our Indige- model learn to rely more on the attention network nous language corpora ranging from 25 minutes to than the teacher forcing inputs, but possibly at the 3.46 hours of speech are sufficient for building neu- risk of compromising synthesis quality. ral speech synthesizers. Due to the prominence of FastSpeech2 (Ren et al., 2021), and similar sys- Tacotron2 (Shen et al., 2018), it seems that many tems like FastPitch (Łańcucki, 2021), present an people have assumed that the data requirements for alternative to Tacotron2-type attentive, autoregres- training any neural speech synthesizer of similar sive systems with similar listening test results and quality must be the same as the requirements for without the characteristic errors related to atten- this particular model. As a result, some researchers tion. Instead of modelling duration using atten- still choose to implement either concatenative or tion, they include an explicit duration prediction HMM/GMM-based statistical parametric speech module trained on phone duration targets extracted synthesis systems in low-resource situations based from the training data. For the original FastSpeech, on the assumption that a “sufficiently large corpus target phone durations derived from the attention [for neural TTS] is unavailable” (James et al., 2020, weights of a pre-trained Tacotron2 system were p. 298). We argue that attention-based models such used to provide phone durations (Ren et al., 2019). as Tacotron2 should not be used as a benchmark for In low-resource settings, however, there might not data requirements among all neural TTS methods, be sufficient data to train an initial Tacotron2 in as they are notoriously difficult to train and unnec- the target language in the first place. For Fast- essarily inflate training data requirements. Speech2, phone duration targets are instead ex- 4.2.1 Replacing attention-based weak tracted using the Montreal Forced Aligner (MFA, duration models McAuliffe et al., 2017), trained on the same data as Tacotron2 is an autoregressive model, meaning it used for TTS model training. We have found MFA predicts the speech parameters ŷt from both the can provide suitable alignments for our target lan- input sequence of text x and the previous speech guages, even with alignment models being trained parameters y1 , ..., yt−1 . Typically, the model is on only limited data. trained with ‘teacher-forcing’, where the autore- Faster convergence of text-acoustic feature gressive frame yt−1 passed as input for predict- alignments has been found to speed up overall ing ŷt is taken from the ground truth acoustic fea- encoder-decoder TTS model training, as stable tures and not the prediction network’s output from alignments provide a solid foundation for further the previous frame ŷt−1 . As discussed by Liu training of the decoder. Badlani et al. (2021) show et al. (2019), such a system might learn to copy this by adding a jointly-learned alignment frame- the teacher forcing input or disregard the text en- work to a Tacotron2 architecture, reducing time 7350
0.8 0.8 80 80 70 0.7 70 0.7 60 0.6 60 0.6 Encoder timestep Encoder timestep 50 0.5 50 0.5 40 0.4 40 0.4 30 0.3 30 0.3 20 0.2 20 0.2 10 0.1 10 0.1 0 0 0.0 0 100 200 300 400 500 0 100 200 300 400 500 600 Decoder timestep Decoder timestep (a) 5 Hr LJ Corpus Subset (b) 10 Hr LJ Corpus Subset Figure 2: Visualization of Tacotron2 Attention Network Weights extracted after 100k steps trained on the LJ corpus. The weights of the attention network should be diagonal and monotonic as seen in subfigure (b). Subfigure (a) shows that the network trained on a 5 hour subset of the LJ corpus results in a degenerate attention network. to convergence. In contrast, they found that re- 3, 5, 10 and 24 (full corpus) hours of speech. The placing MFA duration targets in FastSpeech2 train- models were trained for 100k steps and, as seen ing offers no benefit – forced alignment targets al- in Figure 2, using up to 5 hours of data the atten- ready provide enough information for more time- tion mechanism does not learn properly, resulting efficient training compared to an attention-based in degenerate outputs. Tacotron2 system. Relieving the burden of learn- For comparison, we trained seven FastSpeech2 ing an internal alignment model also opens the models with batch size 16 for 200k steps on 15 and door to more data-efficient training. For example, 30 minute, 1, 3, 5, 10 and 24 hour incremental par- Perez-Gonzalez-de-Martos et al. (2021) submitted titions of LJ Speech. Our model6 is based on an a non-attentive model trained from forced align- open-source implementation (Chien, 2021), which ments to the Blizzard Challenge 2021, where their adds learnable speaker embeddings and a decoder system was found to be among the most natural postnet to the original model, as well as predict- and intelligible in subjective listening tests despite ing pitch and energy values at the phone rather only using 5 hours of speech; all other submitted than frame level. We also added learnable lan- systems included often significant amounts of ad- guage embeddings for supplementary experiments ditional training data (up to 100 hours total). in cross-lingual fine-tuning; while not reported in this paper, we refer the interested reader to Pine 4.2.2 Experimental Comparison of Data (2021) for discussion of these experiments. Moti- Requirements for Neural TTS vated by concerns of efficiency in model training To investigate the effects of differing amounts of and inference, and the possibility of overfitting a data on the attention network, and in preparation large model to limited amounts of data, we further for training systems with our limited Indigenous modified the base architecture to match the Light- language data sets, we trained five Tacotron2 mod- Speech model presented in Luo et al. (2021). We els on incremental partitions of the LJ Speech cor- removed the energy adaptor, replaced the convolu- pus of American English (Ito and Johnson, 2017). tional layers in the encoder, decoder and remain- We used the NVIDIA implementation5 with de- ing variance predictors with depthwise separable fault hyperparameters apart from a reduced batch convolutions (Kaiser et al., 2018) and matched en- size of 32 to fit the memory capacity of our GPU coder and decoder convolutional kernel sizes with resources. We artificially constrained the training Luo et al. (2021). This reduced the number of data such that the first model saw only the first hour model parameters from 35M7 to 11.6M without no- of data from the shuffled corpus, the second model ticeable change in voice quality and sped up train- that same first hour plus another two hours (3 to- tal) etc., so that the five models were trained on 1, 6 https://github.com/roedoejet/FastSpeech2 7 In the implementation of Chien (2021); the original Fast- 5 https://github.com/NVIDIA/tacotron2 Speech2 is slightly smaller at 27M parameters. 7351
100 TT2 10hr FS2 Full 75 FS2 10hr MUSHRA score p−value FS2 5hr 0.05 0.04 50 FS2 3hr 0.03 0.02 FS2 1hr 0.01 0.00 25 FS2 30m FS2 15m 0 Ref ef m m r r r hr ll hr ll 1h 3h 5h Fu Fu FS 5m FS m FS hr FS hr FS 5hr FS hr 2 l TT hr ll R 15 30 10 10 TT Ful Fu 30 1 3 10 10 2 2 2 2 2 1 2 2 FS FS FS 2 2 FS TT 2 2 2 2 2 FS FS FS TT 2 2 2 FS Model ID Figure 3: Box plot of survey data from MUSHRA Figure 4: Pairwise Bonferroni-corrected Wilcoxon questions comparing Tacotron2 (TT2) and FastSpeech2 signed rank tests between each pair of voices. Cells (FS2) models with constrained amounts of training data. correspond to the significance of the result of the pair- ‘Ref’ refers to reference recordings of natural speech. wise test between the model on the y-axis and the model on the x-axis. Darker cells show stronger sig- nificance; grey cells did not show a significant differ- ing by 33% on GPU or 64% on CPU. For addi- ence in listening test results. FS2 refers to models tional discussion of the accessibility benefits of built with FastSpeech2, TT2 refers to models built with these changes with respect to Indigenous language Tacotron2, and ‘Ref’ to reference recordings. Sam- communities, see Appendix A. ples available at https://roedoejet.github.io/ msc_listening_tests_data/ 4.2.3 Results We conducted a short (10-15 minute) listening test voices, while showing consistent improvements to compare the two Tacotron2 models that trained in naturalness ratings as more data is added (as properly (10h, full) against the seven FastSpeech2 shown in Figure 3), are not significantly different models. We recruited 30 participants through Pro- from each other. This is a relevant and impor- lific, and presented each with four MUSHRA-style tant finding for low-resource speech synthesis be- questions where they were asked to rank the 9 cause it shows that a FastSpeech2 voice built with voices along with a hidden natural speech refer- 3 hours of data can achieve subjective naturalness ence (ITU-R, 2003). MUSHRA-style questions ratings which are not significantly different from a were used as a practical way to evaluate this large Tacotron2 voice built with 24 hours of data. Simi- number of models. larly, the results of the listening test for our Fast- While it only took 30 minutes to recruit 30 par- Speech2 voice built with 1 hour of data are not ticipants using Prolific, the quality of responses significantly different from our Tacotron2 voice was quite varied. We rejected two outright as they built with 10 hours of data. Additionally, while seemingly did not listen to the stimuli and left the all the FastSpeech2 voices were intelligible, all same rankings for every voice. Even still, there Tacotron2 models trained with less than 10 hours was a lot of variation in responses from the remain- of data produced unintelligible speech. ing participants, as seen in Figure 3. We tested for significant differences between pairs of voices 4.3 Indigenous Language Experiments using Bonferroni-corrected Wilcoxon signed rank tests. Pairwise test results are summarized in the Despite the difficulty in evaluation (§4.1), we heat map of their p-values in Figure 4. built and evaluated a number of TTS systems for In the results from the pairwise analysis, we the Indigenous languages described in §3. We can see that natural speech is rated as significantly had a baseline concatenative model available for more natural than all synthetic speech samples. Kanien’kéha that we had previously built using Naturalness ratings for the FastSpeech2 voices Festival and Multisyn (Taylor et al., 1998; Clark trained on 15m and 30m of data are significantly et al., 2007). Additionally, we trained cold-start lower than all other voices, and significantly differ- FastSpeech2 models for each language, as well as ent from each other. The results for the remaining models fine-tuned for 25k steps from a multilin- 7352
gual, multispeaker FastSpeech2 model pre-trained 5 on a combination of VCTK (Yamagishi et al., 4 2019), Kanien’kéha and Gitksan recordings. A 3 MOS rule-based mapping from orthography to pronunci- 2 ation form was developed for each language using the ‘g2p’ Python library in order to perform align- 1 ment and synthesis at the phone-level instead of 0 character-level (Pine et al., Under Review). ef e on R Ph Model ID 4.3.1 Results We carried out listening test evaluations of Gitk- Figure 5: Box plot of MOS results for Gitksan listen- san and Kanien’kéha models. Participants were ing test. ‘Ref’ is the reference voice and ‘Phone’ is the recruited by contacting teachers, learners and lin- phone-based FastSpeech2 neural model. Variable re- sults for the reference voice are likely due to the natural guists with at least some familiarity with the lan- speech recordings coming from a non-native speaker. guages. For the Kanien’kéha listening test, 6 participants maybe were asked to answer 20 A/B questions comparing 16.7% maybe synthesized utterances from the various models. 41.7% yes We used A/B tests for more targeted comparisons 58.3% between different systems, namely cold-start vs. yes 83.3% fine-tuned and neural vs. concatenative. Results showed that 72.2% of A/B responses from partic- (a) Kanien’kéha (b) Gitksan ipants preferred our FastSpeech2 model over our Figure 6: Responses from qualitative survey asking par- baseline concatenative model. In addition, 81.7% ticipants “Would you be comfortable with any of the of A/B responses from participants preferred the voices you heard being played online, say for a digital cold-start to the model fine-tuned on the multi- dictionary or verb conjugator if no other recording ex- speaker, multi-lingual model, suggesting that the isted?”. No participants responded “no”. transfer learning approach discussed in §2.3 might not be necessary for models with explicit dura- tions such as FastSpeech2 since they are relieved san listening tests directly whether they approved of the burden to learn an implicit model of duration of the synthesis quality. As seen in Figure 6, par- through attention from limited data. ticipant responses were generally positive; full re- For the Gitksan listening test, we did not build sponses are reported in Appendix B. a concatenative model as with Kanien’kéha and 4.4 Integrating TTS in the Classroom so we were not comparing different models, but rather just gathering opinions on the quality of the Satisfying the goal of adding supplementary au- cold-start FastSpeech2 model. Accordingly, 10 dio to a reference tool like Kawennón:nis can be MOS-style questions were presented to 12 partici- straightforwardly implemented by linking entries pants for both natural utterances and samples from in the verb conjugator to pre-generated audio for our FastSpeech2 model. The model received a the domain from a static server. This implementa- 3.56 ± 0.26 MOS compared with a MOS for the tion also limits the potential of out of domain utter- reference recordings of 4.63 ± 0.19 as shown in ances that might be deemed inappropriate, which Figure 5. While both Kanien’kéha and Gitksan re- is an ethical concern in communities with low num- sults seem to corroborate our belief that these mod- bers of speakers where the identity of the ‘model’ els should be of reasonable quality despite limited speaker is easily determined. training data, it is difficult to make any conclusive However, the ability to synthesize novel utter- statement given the low number of eligible partici- ances could be pedagogically useful. Students pants available for evaluation. often come into contact with words or sentences As the main goal of our efforts here is to even- which do not have audio, and teachers often have tually integrate our speech synthesis systems into to prepare new thematic word lists or vocabulary a pedagogical setting, we also asked the 18 peo- lessons that could benefit from a more general pur- ple who participated across Kanien’kéha and Gitk- pose speech synthesis solution. In those cases, 7353
with community and speaker input, we might con- based architectures such as Tacotron2. Given sider what controls would be necessary for the forced alignments of sufficient quality, which we users of this technology. One potential solution is found to be achievable even by training a Mon- the variance adaptor architecture present in Fast- treal Forced Aligner model only on our limited Speech2, allowing for phone-level control of dura- Indigenous language training data, this makes for tion, pitch and energy; an engaging demonstration more data-efficient training of neural TTS sys- of a graphical user interface for the corresponding tems than has generally been explored in previous controls in a FastPitch model is also available.8 We work. These findings show great promise for fu- would like to focus further efforts on designing a ture work in low-resource TTS for language revi- user interface for speech synthesis systems that sat- talization, especially as they come from systems isfies ethical concerns while prioritizing language trained from scratch on such limited data, rather pedagogy as the fundamental use case. than pre-training on a high-resource language and In addition to fine-grained prosodic controls, subsequent fine-tuning on limited target language we would like to explore the synthesis of hyper- data. articulated speech, as often used by language teach- ers when modelling pronunciation of unfamiliar Acknowledgements words or sounds for students. This style of speech typically involves adjustment beyond the param- We would like to gratefully acknowledge the eters of pitch, duration and energy, and is char- many people who worked to record the audio acterized by more careful enunciation of individ- for the speech synthesis systems described in this ual phones than is found in normal speech. This project. In particular, Satewas Harvey Gabriel, problem has parallels to the synthesis of Lombard and PENÁĆ David Underwood. speech (Hu et al., 2021), as used to improve intelli- Much of the text and experimentation related to gibility by speakers who find themselves in noisy this paper was submitted as partial fulfillment of environments. the first author’s M.Sc. dissertation at the Univer- sity of Edinburgh (Pine, 2021). 5 Conclusion This work was supported in part by the UKRI Centre for Doctoral Training in Natural Lan- In this paper, we presented the first neural speech guage Processing, funded by the UKRI (grant synthesis systems for Indigenous languages spo- EP/S022481/1) and the University of Edinburgh, ken in Canada. Subjective listening tests showed School of Informatics and School of Philosophy, encouraging results for the naturalness and accept- Psychology & Language Sciences. ability of voices for two languages, Kanien’kéha and Gitksan, despite limited training data avail- ability (3.5 hours and 35 minutes, respectively). References More extensive evaluation on English shows that the FastSpeech2 architecture can produce speech Valerie Alia. 2009. The New Media Nation: Indigenous Peoples and Global Communication, ned - new edi- with similar quality to a Tacotron2 system using tion, 1 edition. Berghahn Books. a fraction of the amount of speech usually consid- ered for neural speech synthesis. Notably, a Fast- Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Speech2 voice trained on 1 hour of English speech Rafael Valle, Wei Ping, and Bryan Catanzaro. achieved subjective naturalness ratings not signif- 2021. One TTS Alignment To Rule Them All. arXiv:2108.10447. icantly different from a Tacotron2 voice using 10 hours of data, while a 3-hour FastSpeech2 system Kenneth R. Beesley and Lauri Karttunen. 2003. Finite showed no significant difference from a 24-hour State Morphology. CSLI Publications. Tacotron2 voice. We attribute these results to the fact that Fast- Steven Bird. 2020. Decolonising speech and lan- guage technology. In Proceedings of the 28th Inter- Speech2 learns input token durations from forced national Conference on Computational Linguistics, alignments, rather than jointly learning to align lin- pages 3504–3519. guistic inputs to acoustic features alongside the acoustic feature prediction task as in attention- Paul Boersma and Vincent van Heuven. 2001. Speak and unSpeak with PRAAT. Glot International, 8 https://fastpitch.github.io/ 5(9/10):341–347. 7354
Nathan Thanyehténhas Brinklow. 2021. Indigenous Grace A. Gomashie. 2019. Kanien’keha / Mohawk In- language technologies: Anti-colonial oases in a digenous language revitalisation efforts in Canada. colonizing (digital) world. WINHEC: Interna- McGill Journal of Education / Revue des sciences tional Journal of Indigenous Education Scholarship, de l’éducation de McGill, 54(1):151–171. 16(1):239–266. Alexander Gutkin, Martin Jansche, and Tatiana Nathan Thanyehténhas Brinklow, Patrick Littell, De- Merkulova. 2018. FonBund: A Library for laney Lothian, Aidan Pine, and Heather Souter. Combining Cross-lingual Phonological Segment 2019. Indigenous Language Technologies & Lan- Data. In Proceedings of the Eleventh International guage Reclamation in Canada. Proceedings of the Conference on Language Resources and Evalua- 1st International Conference on Language Technolo- tion (LREC 2018), pages 2236–2240. European gies for All, pages 402–406. Language Resources Association (ELRA). Hennie Brugman and Albert Russel. 2004. An- Atticus Harrigan, Antti Arppe, and Timothy Mills. notating Multi-media/Multi-modal Resources with 2019. A Preliminary Plains Cree Speech Synthe- ELAN. In Proceedings of the Fourth International sizer. In Proceedings of the 3rd Workshop on the Conference on Language Resources and Evaluation Use of Computational Methods in the Study of En- (LREC’04), Lisbon, Portugal. European Language dangered Languages Volume 1 (Papers), pages 64– Resources Association (ELRA). 73, Honolulu. Association for Computational Lin- guistics. Chung-Ming Chien. 2021. ming024/FastSpeech2. https://github.com/ming024/FastSpeech2. Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Original-date: 2020-06-25T13:57:53Z. Erik Marchi, and Varun Lakshminarasimhan. 2021. Whispered and Lombard Neural Speech Synthesis. Robert AJ Clark. 2014. Simple4all. In Proc. Inter- In 2021 IEEE Spoken Language Technology Work- speech 2014, pages 1502–1503. shop (SLT), pages 454–461. Robert AJ Clark, Korin Richmond, and Simon King. Innu-Atikamekw-Anishnabeg Coalition. 2020. Export 2007. Multisyn: Open-domain unit selection for the of Canadian Hydropower to the United States - First festival speech synthesis system. Speech Communi- Nations in Québec and Labrador Unite to Oppose cation, 49(4):317–330. Hydro-Québec Project. Michael Conrad. 2020. Tacotron2 and Chero- Keith Ito and Linda Johnson. 2017. The LJ speech kee TTS. https://www.cherokeelessons.com/ dataset. https://keithito.com/LJ-Speech- content/tacotron2-and-cherokee-tts/. Dataset/. Isin Demirsahin, Martin Jansche, and Alexander Gutkin. 2018. A unified phonological representation ITU-R. 2003. Recommendation ITU-R BS.1534-1 - of South Asian languages for multilingual text-to- Method for the subjective assessment of intermedi- speech. In Proc. The 6th Intl. Workshop on Spoken ate quality level of coding systems. Technical Report Language Technologies for Under-Resourced Lan- ITU-R BS.1534-1, International Telecommunication guages (SLTU), pages 80–84. Union. Jonathan Duddington and Reece Dunn. 2007. Jesin James, Isabella Shields, Rebekah Berriman, Pe- eSpeak: Speech Synthesizer. http: ter Keegan, and Catherine Watson. 2020. Develop- //espeak.sourceforge.net/. ing resources for te reo Māori text to speech synthe- sis system. In P. Sojka, I. Kopeček, K. Pala, and EPA. 2019. Emissions & generation resource inte- A. Horák, editors, Text, Speech, and Dialogue, pages grated database (eGRID). https://www.epa.gov/ 294–302. egrid. Lukasz Kaiser, Aidan N. Gomez, and Francois Chollet. First Peoples’ Cultural Council. 2018. Report on 2018. Depthwise Separable Convolutions for Neural the status of B.C. https://fpcc.ca/resource/ Machine Translation. In International Conference fpcc-report-of-the-status-of-b-c-first- on Learning Representations. nations-languages-2018/. Anna Kazantseva, Owennatekha Brian Maracle, Clarissa Forbes, Henry Davis, Michael Schwan, and Ronkwe’tiyóhstha Josiah Maracle, and Aidan Gitksan Research Lab. 2017. Three Gitksan Texts. Pine. 2018. Kawennón:nis: the wordmaker for Papers for the International Conference on Salish Kanyen’kéha. In Proceedings of the Workshop and Neighbouring Languages, 52:47–89. on Computational Modeling of Polysynthetic Lan- guages, pages 53–64, Santa Fe, New Mexico, USA. Eren Gölge. 2020. Solving Attention Problems of Association for Computational Linguistics. TTS models with Double Decoder Consistency. https://erogol.com/solving-attention- Te Taka Keegan. 2019. Issues with Māori sovereignty problems-of-tts-models-with-double- over Māori language data. Let The Languages Live decoder-consistency/. 2019 Conference. 7355
R. Kubichek. 1993. Mel-cepstral distance measure for Aidan Pine and Mark Turin. 2017. Language Revital- objective speech quality assessment. In Proceedings ization. Oxford Research Encyclopedia of Linguis- of IEEE Pacific Rim Conference on Communications tics. Computers and Signal Processing, volume 1, pages 125–128 vol.1. Brock Thorbjorn Pitawanakwat. 2009. Anishi- naabemodaa Pane Oodenang: a qualitative study Abhijeet Kumar. 2017. Spoken Speaker Identification of Anishinaabe language revitalization as self- based on Gaussian Mixture Models : Python Imple- determination in Manitoba and Ontario. Ph.D. the- mentation. sis, University of Victoria. Adrian Łańcucki. 2021. Fastpitch: Parallel Text-to- Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Speech with Pitch Prediction. In ICASSP 2021 Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: - 2021 IEEE International Conference on Acous- Fast and High-Quality End-to-End Text to Speech. tics, Speech and Signal Processing (ICASSP), pages In International Conference on Learning Represen- 6588–6592. tations. A. Levasseur, S. Mercier-Blais, Y. T. Prairie, A. Trem- Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, blay, and C. Turpin. 2021. Improving the accuracy Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: of electricity carbon footprint: Estimation of hydro- Fast, Robust and Controllable Text to Speech. In electric reservoir greenhouse gas emissions. Renew- Advances in Neural Information Processing Systems, able and Sustainable Energy Reviews, 136:110433. volume 32. Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Keren Rice. 2008. Indigenous languages in Canada. In Su, and Dong Yu. 2019. Maximizing Mutual Infor- The Canadian Encyclopedia. mation for Tacotron. arXiv:1909.01145. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Sheng Zhao, Enhong Chen, and Tie-Yan Liu. 2021. Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry- LightSpeech: Lightweight and Fast Text to Speech Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, with Neural Architecture Search. In IEEE Interna- and Yonghui Wu. 2018. Natural TTS Synthesis by tional Conference on Acoustics, Speech and Signal Conditioning WaveNet on Mel Spectrogram Predic- Processing (ICASSP), pages 5699–5703. tions. In IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages Michael McAuliffe, Michaela Socolof, Sarah Mihuc, 4779–4783. Michael Wagner, and Morgan Sonderegger. 2017. Statistics Canada. 2016. Census of population. Montreal Forced Aligner: Trainable Text-Speech https://www12.statcan.gc.ca/census- Alignment Using Kaldi. In Interspeech 2017, pages recensement/2016/dp-pd/index-eng.cfm. 498–502. ISCA. Emma Strubell, Ananya Ganesh, and Andrew McCal- Teresa L McCarty. 2018. Community-based language lum. 2019. Energy and Policy Considerations for planning: Perspectives from Indigenous language re- Deep Learning in NLP. In Proceedings of the 57th vitalization. In The Routledge handbook of language Annual Meeting of the Association for Computa- revitalization, pages 22–35. Routledge. tional Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics. Richard Oster, Angela Grier, Rick Lightning, Maria Mayan, and Ellen Toth. 2014. Cultural continuity, William J. Sutherland. 2003. Parallel extinction risk traditional Indigenous language, and diabetes in Al- and global distribution of languages and species. Na- berta first nations: a mixed methods study. Interna- ture, 423:276–279. tional journal for equity in health, 13:92. Hideyuki Tachibana, Katsuya Uenoyama, and Shun- Alejandro Perez-Gonzalez-de-Martos, Albert Sanchis, suke Aihara. 2018. Efficiently trainable text-to- and Alfons Juan. 2021. VRAIN-UPV MLLP’s sys- speech system based on deep convolutional net- tem for the Blizzard Challenge 2021. In Blizzard works with guided attention. 2018 IEEE Interna- Challenge 2021 Workshop. tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4784–4788. Aidan Pine. 2021. Low Resource Speech Synthesis. M.Sc. dissertation, University of Edinburgh. Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. 2021. A Survey on Neural Speech Synthesis. Aidan Pine, Patrick Littell, Eric Joanis, David arXiv:2106.15561. Huggins-Daines, Christopher Cox, Fineen Davis, Eddie Antonio Santos, Shankhalika Srikanth, De- Paul Taylor, Alan W Black, and Richard Caley. 1998. laisie Torkornoo, and Sabrina Yu. Under Review. The architecture of the Festival speech synthesis Gi 2Pi : Rule-based, index-preserving grapheme-to- system. In The Third ESCA/COCOSDA Workshop phoneme transformations. (ETRW) on Speech Synthesis, pages 147–152. 7356
Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, and Hung- the potential harms. Beyond assessing the ben- yi Lee. 2019. End-to-end Text-to-speech for efits and risks of introducing a new technology Low-resource Languages by Cross-Lingual Transfer into language revitalization efforts, communities Learning. In Interspeech 2019, pages 2075–2079. are concerned with the way the technology is re- A. M. Urrea, José Abel Herrera Camacho, and Mari- searched and developed, as this process has the bel Alvarado García. 2009. Towards the Speech ability to empower or disempower language com- Synthesis of Raramuri: A Unit Selection Approach based on Unsupervised Extraction of Suffix Se- munities in equal measure (Alia, 2009; Brinklow quences. Research in Computing Science, 41:243– et al., 2019). The current model for developing 256. speech synthesis systems is not very equitable – models need to be run on GPUs by people with Cassia Valentini-Botinhao and Simon King. 2021. De- tection and Analysis of Attention Errors in Sequence- specialized training. For Indigenous communities to-Sequence Text-to-Speech. In Interspeech 2021, to create speech synthesis tools for their languages, pages 2746–2750. ISCA. they should not be required to hand over their lan- guage data to a large government or corporate or- Dan Wells and Korin Richmond. 2021. Cross-lingual Transfer of Phonological Features for Low-resource ganization. A pre-training, fine-tuning pipeline Speech Synthesis. In Proc. 11th ISCA Speech Syn- could be attractive for this reason; communities thesis Workshop, pages 160–165. could fine-tune their own models on a laptop if a multilingual/multi-speaker model were pre-trained Mirjam Wester, Cassia Valentini-Botinhao, and Gus- tav Eje Henter. 2015. Are we using enough lis- on GPUs at a larger institution. Reducing the teners? No!—an empirically-supported critique of computational requirements for training and infer- Interspeech 2014 TTS evaluations. In Interspeech ence of these models could help ensure language 2015, pages 3476–3480. communities have greater control over the process D. Whalen, Margaret Moss, and Daryl Baldwin. 2016. of the development of these systems, less depen- Healing through language: Positive physical health dence on governmental organizations or corpora- effects of indigenous language use. F1000Research, tions, and more sovereignty over their data (Kee- 5:852. gan, 2019). Robert Whitman, Richard Sproat, and Chilin Shih. Strubell et al. (2019) present an argument for eq- 1997. A Navajo Language Text-to-Speech Synthe- sizer. AT&T Bell Laboratories. uitable access to computational resources for NLP research; put another way, we might say that sys- Junichi Yamagishi, Christophe Veaux, and Kirsten Mac- tems which require less compute are more accessi- Donald. 2019. CSTR VCTK Corpus: English Multi- ble. Reducing the number of parameters in a neu- speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). University of Edinburgh, The Cen- ral TTS model should translate to increased effi- tre for Speech Technology Research (CSTR). ciency, and might make the model less prone to overfitting when training on limited amounts of Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K. Soong, Zhengqi Wen, and Jianhua Tao. 2019. data. As discussed in §4.2.2, we modified the Forward-Backward Decoding for Regularizing End- base implementation of FastSpeech2 from Chien to-End TTS. In Interspeech 2019, pages 1283–1287. (2021) closely following the lightweight alterna- tive discovered through neural architecture search A Compute, Accessibility, & in Luo et al. (2021). These changes reduced the Environmental Impact size of the model from Chien (2021) from 35M to 11.6M parameters, reduced the size of the stored For reasons of environmental impact and acces- model from 417 MB to 135 MB and significantly sibility, reducing the amount of computation re- improved inference and train times as summarized quired for both training and inference is important in Table 1. We saw a 33% improvement in av- for any neural speech synthesis system, particu- erage batch processing times on the GPU during larly so for Indigenous languages. training, and 64% on the CPU, which may be even A.1 Accessibility, Training & Inference more relevant for Indigenous language communi- Speed ties with limited computational resources. During inference, we saw a 15% speed-up on GPU and While language revitalization efforts are mostly 57% on CPU. encouraging about integrating new technologies into curriculum, there is a growing awareness of Results were timed by running the model for 300 7357
FastSpeech2 Adapted System GPU 90.52 ms (σ 3.31) 60.04 ms (σ 1.70) Training CPU 7561.50 ms (σ 263.55) 2720.88 ms (σ 92.99) GPU 12.00 ms (σ 0.30) 10.23 ms (σ 0.78) Inference CPU 138.73 ms (σ 3.94) 59.50 ms (σ 1.85) Table 1: Mean and standard deviation of training and inference times for a single forward pass of baseline Fast- Speech2 and adapted models. repetitions and taking the mean. The GPU (Tesla 34.5g CO2eq/kWh (Levasseur et al., 2021). This V100-SXM2 16GB) was warmed up for 10 repe- results in a total equivalent carbon consumption titions before timing started, and PyTorch’s built- of 27,821.65 grams, roughly equivalent to driving in GPU synchronization method was used to syn- a single passenger gas-powered vehicle for 110 chronize timing (which occurs on the CPU) with kilometres according to the average rate of 404 the training or inference running on the GPU. CPU grams/mile (EPA, 2019). tests were performed on an Intel(R) Xeon(R) CPU This is a comparatively low CO2 consump- E5-2650 v2 @ 2.60GHz with 4 cores and 16GB tion for over 1500 GPU hours, largely due to the memory reserved. All timings used a batch size of low CO2/kWh output of Quebec electricity when 16. compared with the 2019 USA average of 400g CO2eq/kWh (EPA, 2019). However, CO2 equiva- A.2 CO2 Consumption lents are just a proxy for environmental impact and Strubell et al. (2019) also argue that NLP re- should not be understood to comprehensively ac- searchers should have a responsibility to disclose count for social and environmental impact. Hydro- the environmental footprint of their research, in or- electric dam projects in Quebec, like the ones pow- der for the community to effectively evaluate any ering the GPSC have a sordid and complex history gains and to allow for a more equitable and repro- in the province. Innu Nation Grand Chief Mary ducible field. Ann Nui spoke to this when she commented that All experiments for this paper requiring a GPU “over the past 50 years, vast areas of our ancestral were run on the Canadian General Purpose Science lands were destroyed by the Churchill Falls hydro- Cluster (GPSC) in Dorval, Quebec. Experiments electric project, people lost their land, their liveli- were all run on single Tesla V100-SXM2 16GB hoods, their travel routes, and their personal be- GPUs. Strubell et al. (2019) provide the following longings when the area where the project is located equation for estimating CO2 production: was flooded. Our ancestral burial sites are under water, our way of life was disrupted forever. Innu 1.58t(pc + pr + (g ∗ pg )) of Labrador weren’t informed or consulted about pt = (1) 1000 that project” (Innu-Atikamekw-Anishnabeg Coali- where t is time, pt is total power for training, pc tion, 2020). is average draw of CPU sockets, pr is average B Qualitative Results DRAM memory draw, g is the number of GPUs used in training and pg is the average draw from Question: GPUs. In our case, we estimate t to be equal “Would you be comfortable with any of the to 1,541.989 after summing the time for exper- voices you heard being played online, say for a iments based on their log files, pc is 75 watts, digital dictionary or verb conjugator if no other pr is 6 watts, g is 1, and pg is 250 watts, and recording existed?” the equation for grams of CO2 consumption is CO2 = 34.5pt as the average carbon footprint Kanien’kéha responses: of electricity distributed in Quebec is estimated at 9 • Yes. Note this estimate is based on the total number of hours spent running experiments from the M.Sc. dissertation this pa- per draws its experiments from. There were additional mod- • yes els trained for experiments that are not discussed in this paper. As such, this is a generous overestimation of t. • Yes 7358
You can also read