Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization

Page created by Nathaniel Burton

IT & Technique

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Requirements and Motivations of Low-Resource Speech Synthesis for
                       Language Revitalization
              Aidan Pine1                        Dan Wells2   Nathan Thanyehténhas Brinklow3
          aidan.pine@nrc.ca                 dan.wells@ed.ac.uk nathan.brinklow@queensu.ca

                       Patrick Littell1                                    Korin Richmond2
                  patrick.littell@nrc.ca                               korin.richmond@ed.ac.uk

                         Abstract                                class. Text-to-speech synthesis technology (TTS)
                                                                 shows potential for supplementing text-based lan-
        This paper describes the motivation and devel-           guage learning tools with audio in the event that the
        opment of speech synthesis systems for the pur-
                                                                 domain is too large to be recorded directly, or as
        poses of language revitalization. By building
        speech synthesis systems for three Indigenous            an interim solution pending recordings from first-
        languages spoken in Canada, Kanien’kéha,                 language speakers.
        Gitksan & SENĆOŦEN, we re-evaluate the                      Development of TTS systems in this context
        question of how much data is required to build           faces several challenges. Most notable is the usual
        low-resource speech synthesis systems featur-            assumption that neural speech synthesis models re-
        ing state-of-the-art neural models. For ex-              quire at least tens of hours of audio recordings with
        ample, preliminary results with English data
                                                                 corresponding text transcripts to be trained ade-
        show that a FastSpeech2 model trained with 1
        hour of training data can produce speech with            quately. Such a data requirement is far beyond
        comparable naturalness to a Tacotron2 model              what is available for the languages we are con-
        trained with 10 hours of data. Finally, we mo-           cerned with, and is difficult to meet given the lim-
        tivate future research in evaluation and class-          ited time of the relatively small number of speak-
        room integration in the field of speech synthe-          ers of these languages. The limited availability of
        sis for language revitalization.                         Indigenous language speakers also hinders the sub-
                                                                 jective evaluation methods often used in TTS stud-
1       Introduction                                             ies, where naturalness of synthetic speech samples
There are approximately 70 Indigenous languages                  is judged by speakers of the language in question.
spoken in Canada, from 10 distinct language fam-                    In this paper, we re-evaluate some of these chal-
ilies (Rice, 2008). As a consequence of the resi-                lenges for applying TTS in the low-resource con-
dential school system and other policies of cultural             text of language revitalization. We build TTS sys-
suppression, the majority of these languages now                 tems for three Indigenous languages of Canada,
have fewer than 500 fluent speakers remaining,                   with training data ranging from 25 minutes to 3.5
most of them elderly. Despite this, interest from                hours, and confirm that we can produce acceptable
students and parents in Indigenous language edu-                 speech as judged by language teachers and learn-
cation continues to grow (Statistics Canada, 2016);              ers. Outputs from these systems could be suitable
we have heard from teachers that they are over-                  for use in some classroom applications, for exam-
whelmed with interest from potential students, and               ple a speaking verb conjugator.
the growing trend towards online education means
many students who have not previously had access                 2 Background
to language classes now do.
   Supporting these growing cohorts of students                  2.1 Language Revitalization
comes with unique challenges for languages with                  It is no secret that the majority of the world’s lan-
few fluent first-language speakers. A particular                 guages are in crisis, and in many cases this cri-
concern of teachers is to provide their students                 sis is even more urgent than conservation biolo-
with opportunities to hear the language outside of               gists’ dire predictions for flora and fauna (Suther-
    1
      National Research Council Canada
                                                                 land, 2003). However, the ‘doom and gloom’
    2
      University of Edinburgh                                    rhetoric that often follows endangered languages
    3
      Queen’s University                                         over-represents vulnerability and under-represents
                                                            7346
                    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
                                           Volume 1: Long Papers, pages 7346 - 7359
                             May 22-27, 2022 c 2022 Association for Computational Linguistics

the enduring strength of Indigenous communi-
ties who have refused to stop speaking their lan-      (1) Senòn:wes
guages despite over a century of colonial policies         you.to.it-like-habitual
against their use (Pine and Turin, 2017). Contin-          ‘You like it.’
uing to speak Indigenous languages is often seen
as a political act of anti-colonial resistance. As     (2) Takenòn:wes
such, the goals of any given language revitaliza-          you.to.me-like-habitual
tion effort extend far beyond memorizing verb              ‘You like me.’
paradigms to broader goals of nationhood and
self-determination (Pitawanakwat, 2009; McCarty,
                                                       Figure 1: An example of fusional morphology of
2018). Language revitalization programs can also
                                                       agent/patient pairs in Kanien’kéha transitive verb
have immediate and important impacts on factors        paradigms (from Kazantseva et al., 2018)
including community health and wellness (Whalen
et al., 2016; Oster et al., 2014).
   There is a growing international consensus on       iar words. Assuming a rate of 200 forms/hr for 4
the importance of linguistic diversity, from the       hours per day, 5 days per week, this would take a
Truth & Reconciliation Commission of Canada            teacher out of the classroom for approximately a
(TRC) report in 2015 which issued nine calls           year. Considering Kawennón:nis is anticipated to
to action related to language, to 2019 being de-       have over 1,000,000 unique forms by the time the
clared an International Year of Indigenous Lan-        grammar modelling work is finished, recording au-
guages by the UN, and 2022-2032 being declared         dio manually becomes infeasible.
an International Decade of Indigenous Languages.          The research question that then emerged was
From 1996 to 2016, the number of speakers of           ‘what is the smallest amount of data needed in or-
Indigenous languages increased by 8% (Statistics       der to generate audio for all verb forms in Kawen-
Canada, 2016). These efforts have been success-        nón:nis’. Beyond Kawennón:nis, we anticipate
ful despite a lack of support from digital technolo-   that there are many similar language revitalization
gies. While opportunities may exist for technol-       projects that would want to add supplementary au-
ogy to assist and support language revitalization      dio to other text-based pedagogical tools.
efforts, these technologies must be developed in a
                                                       2.3 Speech Synthesis
way that does not further marginalize communities
(Brinklow et al., 2019; Bird, 2020).                   The last few years have shown an explosion
                                                       in research into purely neural network-based ap-
2.2 Why TTS for Language Revitalization?               proaches to speech synthesis (Tan et al., 2021).
Our interest in speech synthesis for language          Similar to their HMM/GMM predecessors, neural
revitalization was sparked during user evalua-         pipelines typically consist of both a network pre-
tions of Kawennón:nis (lit. ‘it makes words’),         dicting the acoustic properties of a sequence of
a Kanien’kéha verb conjugator (Kazantseva              text and a vocoder. The feature prediction net-
et al., 2018) developed in collaboration between       work must be trained using parallel speech/text
the National Research Council Canada and the           data where the input is typically a sequence of char-
Onkwawenna Kentyohkwa adult immersion pro-             acters or phones that make up an utterance, and
gram in Six Nations of the Grand River in Ontario,     the output is a sequence of fixed-width frames of
Canada. Kawennón:nis models a pedagogically-           acoustic features. In most cases the predictions
important subset of verb conjugations in XFST          from the TTS model are log Mel-spectral features
(Beesley and Karttunen, 2003), and currently           and a vocoder is used to generate the waveform
produces 247,450 unique conjugations.           The    from these acoustic features.
pronominal system is largely responsible for much         Much of the previous work on low resource
of this productivity, since in transitive paradigms,   speech synthesis has focused on transfer learning;
agent/patient pairs are fused, as illustrated in       that is, ‘pre-training’ a network using data from a
Figure 1.                                              language that has more data, and then ‘fine-tuning’
   In user evaluations of Kawennón:nis, students       using data from the low-resource language. One
often asked whether it was possible to add audio       of the problems with this approach is that the in-
to the tool, to model the pronunciation of unfamil-    put space often differs between languages. As the
                                                   7347

inputs to these systems are sequences of charac-       non-trivial: there are limited amounts of text from
ters or phones, and as these sequences are typi-       which a speaker could read, and there are few peo-
cally one-hot encoded, it can be difficult to devise   ple available who are sufficiently literate in the lan-
a principled method for transferring weights from      guages to transcribe recorded audio. Re-focusing
the source language network to the target if there     speakers’ limited time to these tasks presents a sig-
is a difference between the character or phone in-     nificant opportunity cost; they are often already
ventories of the two languages. Various strategies     over-worked and over-burdened in under-funded
have emerged for normalizing the input space. For      and under-resourced language teaching projects.
example, Demirsahin et al. (2018) propose a uni-          As mentioned in §2.1, language technology
fied inventory for regional multilingual training of   projects that aim to assist language revitalization
South Asian languages, while Tu et al. (2019) com-     and reclamation efforts must be centered around
pare various methods to create mappings between        the primary goals of those efforts and ensure that
source and target input spaces. Another proposal       the means of developing the technology do not
is to normalize the input space between source and     distract or work against the broader sociopolitical
target languages by replacing one-hot encodings of     goals. A primary stress point for many natural
text with multi-hot phonological feature encodings     language processing projects involving Indigenous
(Gutkin et al., 2018; Wells and Richmond, 2021).       communities surrounds issues of data sovereignty.
                                                       It is important that communities direct the devel-
2.4 Speech Synthesis for Indigenous                    opment of these tools, and maintain control, own-
    Languages in Canada                                ership, and distribution rights for their data, as well
There is extremely little published work on speech     as for the resulting speech synthesis models (Kee-
synthesis for Indigenous languages in Canada (and      gan, 2019; Brinklow, 2021). In keeping with this,
North America generally). A statistical parametric     the datasets described in this paper are not being
speech synthesizer using Simple4All was recently       released publicly at this time.
developed for Plains Cree (Harrigan et al., 2019;         To test the feasibility of developing speech
Clark, 2014). Although it was unpublished, two         synthesis systems for Indigenous languages, we
highschool students1 created a statistical paramet-    trained models for three unrelated Indigenous lan-
ric speech synthesizer for Kanien’kéha by adapting     guages, Kanien’kéha (§3.1), Gitksan (§3.2), and
eSpeak (Duddington and Dunn, 2007). We know            SENĆOŦEN (§3.3).
of no other attempts to create speech synthesis sys-
                                                       3.1 Kanien’kéha
tems for Indigenous languages in Canada. Else-
where in North America, a Tacotron2 system has         Kanien’kéha2 (a.k.a. Mohawk) is an Iroquoian lan-
been built for Cherokee (Conrad, 2020), and some       guage spoken by roughly 2,350 people in south-
early work on concatenative systems for Navajo         ern Ontario, Quebec, and northern New York state
was discussed in a technical report (Whitman et al.,   (Statistics Canada, 2016). In 1979 the first immer-
1997), as well as on Rarámuri (Urrea et al., 2009).    sion school of any Indigenous language in Canada
                                                       was opened for Kanien’kéha, and many other very
3       Indigenous Language Data                       successful programs have been started since, in-
                                                       cluding the Onkwawenna Kentyohkwa adult im-
Although the term ‘low resource’ is used to de-        mersion program in 1999 (Gomashie, 2019).
scribe a wide swath of languages, most Indigenous         In the late 1990s, a team of five Kanien’kéha
languages in Canada would be considered ‘low-          translators worked with the Canadian Bible Soci-
resource’ in multiple senses of the word, having       ety to translate and record parts of the Bible; one of
both a low amount of available data (annotated         the speakers on these recordings, Satewas, is still
or unannotated), and a relatively low number of        living. Translation runs in Satewas’s family, with
speakers. Most Indigenous languages lack tran-         his great-grandfather also working on Bible trans-
scribed audio corpora, and fewer still have such       lations in the 19th century. Later, a team of four
data recorded in a studio context. Due to the lim-     speakers and learners, including this paper’s third
ited number of speakers, creating these resources is   author, aligned the text and audio at the utterance
    1                                                     2
   https://wiki.laptop.org/go/                              As there are different variations of spelling, we use the
Instructions_for_implementing_a_new_language_%         spelling used in the communities of Kahnawà:ke and Kahne-
22voice%22_for_Speak_on_the_XO                         setà:ke throughout this paper

                                                   7348

level using Praat (Boersma and van Heuven, 2001)                 an orthography developed by the late SENĆOŦEN
and ELAN (Brugman and Russel, 2004).                             speaker and WSÁNEĆ elder Dave Elliott. While
                                                                               ¯
   While a total of 24 hours of audio were recorded,             the community of approximately 3,500 has fewer
members of the Kanien’kéha-speaking community                    than 10 fluent speakers, there are hundreds of learn-
told us it would be inappropriate to use the voices              ers, many of whom have been enrolled in years
of speakers who had passed away, leaving only                    of immersion education in the language (First Peo-
recordings of Satewas’s voice. Using a GMM-                      ples’ Cultural Council, 2018).
based speaker ID system (Kumar, 2017), we re-                       As there were no studio-quality recordings of
moved utterances by these speakers, then removed                 the SENĆOŦEN language publicly available, we
utterances that were outliers in duration (less than             recorded 25.92 minutes of the language with
0.4s or greater than 11s) and speaking rate (less                PENÁĆ David Underwood reading two stories
than 4 phones per second or greater than 15),                    originally spoken by elder Chris Paul.
recordings with an unknown phase effect present,
and utterances containing non-Kanien’kéha char-                  4 Research Questions
acters (e.g. proper names like ‘Euphrades’). Han-
                                                                 Given the motivation and context for language
dling utterances with non-Kanien’kéha characters
                                                                 revitalization-based speech synthesis, a number of
would have required grapheme-to-phoneme pre-
                                                                 research questions follow. Namely, how much
diction capable of dealing with multilingual text
                                                                 data is required in order to build a system of rea-
and code-switching which we did not have avail-
                                                                 sonable pedagogical quality? How do we evalu-
able. The resulting speech corpus comprised 3.46
                                                                 ate such a system? And, how is the resulting sys-
hours of speech.
                                                                 tem best integrated into the classroom? In §4.1,
3.2 Gitksan                                                      we discuss the difficulty of evaluating TTS sys-
                                                                 tems in low-resource settings. We then discuss
Gitksan3 is one of four languages belonging to
                                                                 preliminary results for English and Indigenous lan-
the Tsimshianic language family spoken along
                                                                 guage TTS which show that acceptable speech
the Skeena river and its surrounding tributaries
                                                                 quality can be achieved with much less training
in the area colonially known as northern British
                                                                 data than usually considered for neural speech syn-
Columbia. Traditional Gitksan territory spans
                                                                 thesis (§4.2). Finally, we suggest possible direc-
some 33,000 square kilometers and is home to al-
                                                                 tions for pedagogical integration in section §4.4.
most 10,000 people, with approximately 10% of
the population continuing to speak the language                  4.1 Low-Resource Evaluation
fluently (First Peoples’ Cultural Council, 2018).
   As there were no studio-quality recordings of                 One of the most significant challenges in research-
the Gitksan language publicly available, and as an               ing speech synthesis for languages with few speak-
intermediate speaker of the language, the first au-              ers is evaluating the models. For some Indigenous
thor recorded a sample set himself. In total, he                 languages in Canada, the total number of speakers
recorded 35.46 minutes of audio reading isolated                 of the language is less than the number typically re-
sentences from published and unpublished stories                 quired for statistical significance in a listening test
(Forbes et al., 2017).                                           (Wester et al., 2015). While the number of speak-
                                                                 ers in these conditions is sub-optimal for statisti-
3.3 SENĆOŦEN                                                     cal analysis, we have been told by the communi-
                                                                 ties we work with that the positive assessment of
The SENĆOŦEN language is spoken by the
                                                                 a few widely respected and community-engaged
WSÁNEĆ people on the southern part of the is-
 ¯                                                               language speakers would be practically sufficient
land colonially known as Vancouver Island. It be-
                                                                 to assess the pedagogical value of speech models
longs to the Coastal branch of the Salish language
                                                                 in language revitalization contexts. For the experi-
family. The WSÁNEĆ community runs a world-
              ¯                                                  ments described in this paper, we ran listening tests
famous language revitalization program4 , and uses
                                                                 for both Kanien’kéha and Gitksan with speakers,
   3
      We use Lonnie Hindle and Bruce Rigsby’s spelling of the    teachers, and learners, but were not able to run any
language, which, with the use of ‘k’ and ‘a’ is a blend of up-   such tests for SENĆOŦEN due to very few speak-
river (gigeenix) and downriver (gyets) dialects
    4
      https://wsanecschoolboard.ca/sencoten-                     ers with already busy schedules.
language/                                                           While some objective metrics do exist, such as
                                                             7349

Mel cepstral distortion (MCD, Kubichek, 1993),          tirely, which could still optimize Tacotron2’s root
we do not believe they should be considered reli-       mean square error function over predicted acoustic
able proxies for listening tests. Future research on    features, but result in an untrained or degenerate
speech synthesis for languages with few speakers        attention network which is unable to properly gen-
should prioritize efficient and effective means of      eralize to new inputs at inference time when the
evaluating results.                                     teacher forcing input is unavailable. Attention fail-
   In many cases, including in the experiment de-       ures represent a characteristic class of errors for
scribed in §4.2, artificial data constraints can be     models such as Tacotron2, for example skipping
placed on a language with more data, like En-           or repeating words from the input text (Valentini-
glish, to simulate a low-resource scenario. While       Botinhao and King, 2021).
this technique can be insightful and it is tempt-          There have been many proposals to improve
ing to draw universal conclusions, English is lin-      training of the attention network, for example by
guistically very different from many of the other       guiding the attention or using a CTC loss function
languages spoken in the world. Accordingly, we          to respect the monotonic alignment between text
should be cautious not to assume that results from      inputs and speech outputs (Tachibana et al., 2018;
these types of experiments will necessarily transfer    Liu et al., 2019; Zheng et al., 2019; Gölge, 2020).
or extend to genuinely low-resource languages.          As noted by Liu et al. (2019), increasing the so-
                                                        called ‘reduction factor’ – which applies dropout
4.2 How much data do you really need?
                                                        to the autoregressive frames – can also help the
The first question to answer is whether our Indige-     model learn to rely more on the attention network
nous language corpora ranging from 25 minutes to        than the teacher forcing inputs, but possibly at the
3.46 hours of speech are sufficient for building neu-   risk of compromising synthesis quality.
ral speech synthesizers. Due to the prominence of
                                                           FastSpeech2 (Ren et al., 2021), and similar sys-
Tacotron2 (Shen et al., 2018), it seems that many
                                                        tems like FastPitch (Łańcucki, 2021), present an
people have assumed that the data requirements for
                                                        alternative to Tacotron2-type attentive, autoregres-
training any neural speech synthesizer of similar
                                                        sive systems with similar listening test results and
quality must be the same as the requirements for
                                                        without the characteristic errors related to atten-
this particular model. As a result, some researchers
                                                        tion. Instead of modelling duration using atten-
still choose to implement either concatenative or
                                                        tion, they include an explicit duration prediction
HMM/GMM-based statistical parametric speech
                                                        module trained on phone duration targets extracted
synthesis systems in low-resource situations based
                                                        from the training data. For the original FastSpeech,
on the assumption that a “sufficiently large corpus
                                                        target phone durations derived from the attention
[for neural TTS] is unavailable” (James et al., 2020,
                                                        weights of a pre-trained Tacotron2 system were
p. 298). We argue that attention-based models such
                                                        used to provide phone durations (Ren et al., 2019).
as Tacotron2 should not be used as a benchmark for
                                                        In low-resource settings, however, there might not
data requirements among all neural TTS methods,
                                                        be sufficient data to train an initial Tacotron2 in
as they are notoriously difficult to train and unnec-
                                                        the target language in the first place. For Fast-
essarily inflate training data requirements.
                                                        Speech2, phone duration targets are instead ex-
4.2.1   Replacing attention-based weak                  tracted using the Montreal Forced Aligner (MFA,
        duration models                                 McAuliffe et al., 2017), trained on the same data as
Tacotron2 is an autoregressive model, meaning it        used for TTS model training. We have found MFA
predicts the speech parameters ŷt from both the        can provide suitable alignments for our target lan-
input sequence of text x and the previous speech        guages, even with alignment models being trained
parameters y1 , ..., yt−1 . Typically, the model is     on only limited data.
trained with ‘teacher-forcing’, where the autore-          Faster convergence of text-acoustic feature
gressive frame yt−1 passed as input for predict-        alignments has been found to speed up overall
ing ŷt is taken from the ground truth acoustic fea-    encoder-decoder TTS model training, as stable
tures and not the prediction network’s output from      alignments provide a solid foundation for further
the previous frame ŷt−1 . As discussed by Liu          training of the decoder. Badlani et al. (2021) show
et al. (2019), such a system might learn to copy        this by adding a jointly-learned alignment frame-
the teacher forcing input or disregard the text en-     work to a Tacotron2 architecture, reducing time
                                                    7350

0.8                                                                            0.8
                   80                                                                                80
                   70                                                     0.7                        70                                                  0.7
                   60                                                     0.6                        60                                                  0.6
Encoder timestep

                                                                                  Encoder timestep
                   50                                                     0.5                        50                                                  0.5
                   40                                                     0.4                        40                                                  0.4
                   30                                                     0.3                        30                                                  0.3
                   20                                                     0.2                        20                                                  0.2
                   10                                                     0.1                        10                                                  0.1
                       0                                                                             0                                                   0.0
                           0      100      200        300     400   500                                   0   100   200      300      400    500   600
                                           Decoder timestep                                                               Decoder timestep

                                        (a) 5 Hr LJ Corpus Subset                                                   (b) 10 Hr LJ Corpus Subset

Figure 2: Visualization of Tacotron2 Attention Network Weights extracted after 100k steps trained on the LJ corpus.
The weights of the attention network should be diagonal and monotonic as seen in subfigure (b). Subfigure (a)
shows that the network trained on a 5 hour subset of the LJ corpus results in a degenerate attention network.

to convergence. In contrast, they found that re-                                            3, 5, 10 and 24 (full corpus) hours of speech. The
placing MFA duration targets in FastSpeech2 train-                                          models were trained for 100k steps and, as seen
ing offers no benefit – forced alignment targets al-                                        in Figure 2, using up to 5 hours of data the atten-
ready provide enough information for more time-                                             tion mechanism does not learn properly, resulting
efficient training compared to an attention-based                                           in degenerate outputs.
Tacotron2 system. Relieving the burden of learn-                                                For comparison, we trained seven FastSpeech2
ing an internal alignment model also opens the                                              models with batch size 16 for 200k steps on 15 and
door to more data-efficient training. For example,                                          30 minute, 1, 3, 5, 10 and 24 hour incremental par-
Perez-Gonzalez-de-Martos et al. (2021) submitted                                            titions of LJ Speech. Our model6 is based on an
a non-attentive model trained from forced align-                                            open-source implementation (Chien, 2021), which
ments to the Blizzard Challenge 2021, where their                                           adds learnable speaker embeddings and a decoder
system was found to be among the most natural                                               postnet to the original model, as well as predict-
and intelligible in subjective listening tests despite                                      ing pitch and energy values at the phone rather
only using 5 hours of speech; all other submitted                                           than frame level. We also added learnable lan-
systems included often significant amounts of ad-                                           guage embeddings for supplementary experiments
ditional training data (up to 100 hours total).                                             in cross-lingual fine-tuning; while not reported in
                                                                                            this paper, we refer the interested reader to Pine
4.2.2                          Experimental Comparison of Data
                                                                                            (2021) for discussion of these experiments. Moti-
                               Requirements for Neural TTS
                                                                                            vated by concerns of efficiency in model training
To investigate the effects of differing amounts of                                          and inference, and the possibility of overfitting a
data on the attention network, and in preparation                                           large model to limited amounts of data, we further
for training systems with our limited Indigenous                                            modified the base architecture to match the Light-
language data sets, we trained five Tacotron2 mod-                                          Speech model presented in Luo et al. (2021). We
els on incremental partitions of the LJ Speech cor-                                         removed the energy adaptor, replaced the convolu-
pus of American English (Ito and Johnson, 2017).                                            tional layers in the encoder, decoder and remain-
We used the NVIDIA implementation5 with de-                                                 ing variance predictors with depthwise separable
fault hyperparameters apart from a reduced batch                                            convolutions (Kaiser et al., 2018) and matched en-
size of 32 to fit the memory capacity of our GPU                                            coder and decoder convolutional kernel sizes with
resources. We artificially constrained the training                                         Luo et al. (2021). This reduced the number of
data such that the first model saw only the first hour                                      model parameters from 35M7 to 11.6M without no-
of data from the shuffled corpus, the second model                                          ticeable change in voice quality and sped up train-
that same first hour plus another two hours (3 to-
tal) etc., so that the five models were trained on 1,                                                     6
                                                                                                https://github.com/roedoejet/FastSpeech2
                                                                                                          7
                                                                                                In the implementation of Chien (2021); the original Fast-
                   5
                       https://github.com/NVIDIA/tacotron2                                  Speech2 is slightly smaller at 27M parameters.

                                                                                7351

100
                                                                                                                         TT2 10hr

                                                                                                                         FS2 Full
               75
                                                                                                                         FS2 10hr
MUSHRA score

                                                                                                                                                                    p−value
                                                                                                                          FS2 5hr                                       0.05
                                                                                                                                                                        0.04
               50
                                                                                                                          FS2 3hr                                       0.03
                                                                                                                                                                        0.02

                                                                                                                          FS2 1hr                                       0.01
                                                                                                                                                                        0.00
               25
                                                                                                                         FS2 30m

                                                                                                                         FS2 15m
                 0                                                                                                           Ref
                     ef

                               m

                                         m

                                                   r

                                                             r

                                                                       r

                                                                               hr

                                                                                        ll

                                                                                                 hr

                                                                                                           ll
                                                  1h

                                                            3h

                                                                      5h

                                                                                      Fu

                                                                                                          Fu

                                                                                                                              FS 5m

                                                                                                                               FS m

                                                                                                                               FS hr

                                                                                                                               FS hr
                                                                                                                              FS 5hr

                                                                                                                               FS hr

                                                                                                                                2 l
                                                                                                                               TT hr

                                                                                                                                       ll
                     R

                              15

                                        30

                                                                            10

                                                                                                10

                                                                                                                              TT Ful

                                                                                                                                    Fu
                                                                                                                                   30

                                                                                                                                    1

                                                                                                                                    3

                                                                                                                                  10

                                                                                                                                  10
                                              2

                                                        2

                                                                  2

                                                                                    2

                                                                                                      2

                                                                                                                                   1
                          2

                                    2

                                             FS

                                                       FS

                                                                 FS

                                                                           2

                                                                                             2
                                                                                 FS

                                                                                                     TT

                                                                                                                                  2

                                                                                                                                  2

                                                                                                                                  2

                                                                                                                                 2

                                                                                                                                 2
                         FS

                                   FS

                                                                       FS

                                                                                           TT

                                                                                                                                2

                                                                                                                                2

                                                                                                                                2
                                                                                                                              FS
                                                            Model ID

Figure 3: Box plot of survey data from MUSHRA                                                                      Figure 4: Pairwise Bonferroni-corrected Wilcoxon
questions comparing Tacotron2 (TT2) and FastSpeech2                                                                signed rank tests between each pair of voices. Cells
(FS2) models with constrained amounts of training data.                                                            correspond to the significance of the result of the pair-
‘Ref’ refers to reference recordings of natural speech.                                                            wise test between the model on the y-axis and the
                                                                                                                   model on the x-axis. Darker cells show stronger sig-
                                                                                                                   nificance; grey cells did not show a significant differ-
ing by 33% on GPU or 64% on CPU. For addi-                                                                         ence in listening test results. FS2 refers to models
tional discussion of the accessibility benefits of                                                                 built with FastSpeech2, TT2 refers to models built with
these changes with respect to Indigenous language                                                                  Tacotron2, and ‘Ref’ to reference recordings. Sam-
communities, see Appendix A.                                                                                       ples available at https://roedoejet.github.io/
                                                                                                                   msc_listening_tests_data/
4.2.3                    Results
We conducted a short (10-15 minute) listening test
                                                                                                                   voices, while showing consistent improvements
to compare the two Tacotron2 models that trained
                                                                                                                   in naturalness ratings as more data is added (as
properly (10h, full) against the seven FastSpeech2
                                                                                                                   shown in Figure 3), are not significantly different
models. We recruited 30 participants through Pro-
                                                                                                                   from each other. This is a relevant and impor-
lific, and presented each with four MUSHRA-style
                                                                                                                   tant finding for low-resource speech synthesis be-
questions where they were asked to rank the 9
                                                                                                                   cause it shows that a FastSpeech2 voice built with
voices along with a hidden natural speech refer-
                                                                                                                   3 hours of data can achieve subjective naturalness
ence (ITU-R, 2003). MUSHRA-style questions
                                                                                                                   ratings which are not significantly different from a
were used as a practical way to evaluate this large
                                                                                                                   Tacotron2 voice built with 24 hours of data. Simi-
number of models.
                                                                                                                   larly, the results of the listening test for our Fast-
   While it only took 30 minutes to recruit 30 par-
                                                                                                                   Speech2 voice built with 1 hour of data are not
ticipants using Prolific, the quality of responses
                                                                                                                   significantly different from our Tacotron2 voice
was quite varied. We rejected two outright as they
                                                                                                                   built with 10 hours of data. Additionally, while
seemingly did not listen to the stimuli and left the
                                                                                                                   all the FastSpeech2 voices were intelligible, all
same rankings for every voice. Even still, there
                                                                                                                   Tacotron2 models trained with less than 10 hours
was a lot of variation in responses from the remain-
                                                                                                                   of data produced unintelligible speech.
ing participants, as seen in Figure 3. We tested
for significant differences between pairs of voices
                                                                                                                   4.3 Indigenous Language Experiments
using Bonferroni-corrected Wilcoxon signed rank
tests. Pairwise test results are summarized in the                                                                 Despite the difficulty in evaluation (§4.1), we
heat map of their p-values in Figure 4.                                                                            built and evaluated a number of TTS systems for
   In the results from the pairwise analysis, we                                                                   the Indigenous languages described in §3. We
can see that natural speech is rated as significantly                                                              had a baseline concatenative model available for
more natural than all synthetic speech samples.                                                                    Kanien’kéha that we had previously built using
Naturalness ratings for the FastSpeech2 voices                                                                     Festival and Multisyn (Taylor et al., 1998; Clark
trained on 15m and 30m of data are significantly                                                                   et al., 2007). Additionally, we trained cold-start
lower than all other voices, and significantly differ-                                                             FastSpeech2 models for each language, as well as
ent from each other. The results for the remaining                                                                 models fine-tuned for 25k steps from a multilin-
                                                                                                                7352

gual, multispeaker FastSpeech2 model pre-trained                      5

on a combination of VCTK (Yamagishi et al.,                           4

2019), Kanien’kéha and Gitksan recordings. A                          3

                                                                MOS
rule-based mapping from orthography to pronunci-
                                                                      2
ation form was developed for each language using
the ‘g2p’ Python library in order to perform align-                   1

ment and synthesis at the phone-level instead of                      0

character-level (Pine et al., Under Review).

                                                                             ef

                                                                                                        e
                                                                                                    on
                                                                            R

                                                                                                   Ph
                                                                                       Model ID
4.3.1   Results
We carried out listening test evaluations of Gitk-      Figure 5: Box plot of MOS results for Gitksan listen-
san and Kanien’kéha models. Participants were           ing test. ‘Ref’ is the reference voice and ‘Phone’ is the
recruited by contacting teachers, learners and lin-     phone-based FastSpeech2 neural model. Variable re-
                                                        sults for the reference voice are likely due to the natural
guists with at least some familiarity with the lan-
                                                        speech recordings coming from a non-native speaker.
guages.
   For the Kanien’kéha listening test, 6 participants
                                                        maybe
were asked to answer 20 A/B questions comparing         16.7%
                                                                                           maybe
synthesized utterances from the various models.                                            41.7%
                                                                                                                     yes
We used A/B tests for more targeted comparisons                                                                   58.3%
between different systems, namely cold-start vs.                                     yes
                                                                                  83.3%
fine-tuned and neural vs. concatenative. Results
showed that 72.2% of A/B responses from partic-                 (a) Kanien’kéha                     (b) Gitksan
ipants preferred our FastSpeech2 model over our
                                                        Figure 6: Responses from qualitative survey asking par-
baseline concatenative model. In addition, 81.7%        ticipants “Would you be comfortable with any of the
of A/B responses from participants preferred the        voices you heard being played online, say for a digital
cold-start to the model fine-tuned on the multi-        dictionary or verb conjugator if no other recording ex-
speaker, multi-lingual model, suggesting that the       isted?”. No participants responded “no”.
transfer learning approach discussed in §2.3 might
not be necessary for models with explicit dura-
tions such as FastSpeech2 since they are relieved       san listening tests directly whether they approved
of the burden to learn an implicit model of duration    of the synthesis quality. As seen in Figure 6, par-
through attention from limited data.                    ticipant responses were generally positive; full re-
   For the Gitksan listening test, we did not build     sponses are reported in Appendix B.
a concatenative model as with Kanien’kéha and
                                                        4.4 Integrating TTS in the Classroom
so we were not comparing different models, but
rather just gathering opinions on the quality of the    Satisfying the goal of adding supplementary au-
cold-start FastSpeech2 model. Accordingly, 10           dio to a reference tool like Kawennón:nis can be
MOS-style questions were presented to 12 partici-       straightforwardly implemented by linking entries
pants for both natural utterances and samples from      in the verb conjugator to pre-generated audio for
our FastSpeech2 model. The model received a             the domain from a static server. This implementa-
3.56 ± 0.26 MOS compared with a MOS for the             tion also limits the potential of out of domain utter-
reference recordings of 4.63 ± 0.19 as shown in         ances that might be deemed inappropriate, which
Figure 5. While both Kanien’kéha and Gitksan re-        is an ethical concern in communities with low num-
sults seem to corroborate our belief that these mod-    bers of speakers where the identity of the ‘model’
els should be of reasonable quality despite limited     speaker is easily determined.
training data, it is difficult to make any conclusive      However, the ability to synthesize novel utter-
statement given the low number of eligible partici-     ances could be pedagogically useful. Students
pants available for evaluation.                         often come into contact with words or sentences
   As the main goal of our efforts here is to even-     which do not have audio, and teachers often have
tually integrate our speech synthesis systems into      to prepare new thematic word lists or vocabulary
a pedagogical setting, we also asked the 18 peo-        lessons that could benefit from a more general pur-
ple who participated across Kanien’kéha and Gitk-       pose speech synthesis solution. In those cases,
                                                    7353

with community and speaker input, we might con- based architectures such as Tacotron2. Given
sider what controls would be necessary for the forced alignments of sufficient quality, which we
users of this technology. One potential solution is found to be achievable even by training a Mon-
the variance adaptor architecture present in Fast- treal Forced Aligner model only on our limited
Speech2, allowing for phone-level control of dura- Indigenous language training data, this makes for
tion, pitch and energy; an engaging demonstration more data-efficient training of neural TTS sys-
of a graphical user interface for the corresponding tems than has generally been explored in previous
controls in a FastPitch model is also available.8 We work. These findings show great promise for fu-
would like to focus further efforts on designing a ture work in low-resource TTS for language revi-
user interface for speech synthesis systems that sat- talization, especially as they come from systems
isfies ethical concerns while prioritizing language trained from scratch on such limited data, rather
pedagogy as the fundamental use case. than pre-training on a high-resource language and
In addition to fine-grained prosodic controls, subsequent fine-tuning on limited target language
we would like to explore the synthesis of hyper- data.
articulated speech, as often used by language teach-
ers when modelling pronunciation of unfamiliar Acknowledgements
words or sounds for students. This style of speech
typically involves adjustment beyond the param- We would like to gratefully acknowledge the
eters of pitch, duration and energy, and is char- many people who worked to record the audio
acterized by more careful enunciation of individ- for the speech synthesis systems described in this
ual phones than is found in normal speech. This project. In particular, Satewas Harvey Gabriel,
problem has parallels to the synthesis of Lombard and PENÁĆ David Underwood.
speech (Hu et al., 2021), as used to improve intelli- Much of the text and experimentation related to
gibility by speakers who find themselves in noisy this paper was submitted as partial fulfillment of
environments. the first author’s M.Sc. dissertation at the Univer-
sity of Edinburgh (Pine, 2021).
5 Conclusion This work was supported in part by the UKRI
Centre for Doctoral Training in Natural Lan-
In this paper, we presented the first neural speech guage Processing, funded by the UKRI (grant
synthesis systems for Indigenous languages spo- EP/S022481/1) and the University of Edinburgh,
ken in Canada. Subjective listening tests showed School of Informatics and School of Philosophy,
encouraging results for the naturalness and accept- Psychology & Language Sciences.
ability of voices for two languages, Kanien’kéha
and Gitksan, despite limited training data avail-
ability (3.5 hours and 35 minutes, respectively). References
More extensive evaluation on English shows that
the FastSpeech2 architecture can produce speech Valerie Alia. 2009. The New Media Nation: Indigenous
Peoples and Global Communication, ned - new edi-
with similar quality to a Tacotron2 system using tion, 1 edition. Berghahn Books.
a fraction of the amount of speech usually consid-
ered for neural speech synthesis. Notably, a Fast- Rohan Badlani, Adrian Łancucki, Kevin J. Shih,
Speech2 voice trained on 1 hour of English speech Rafael Valle, Wei Ping, and Bryan Catanzaro.
achieved subjective naturalness ratings not signif- 2021. One TTS Alignment To Rule Them All.
arXiv:2108.10447.
icantly different from a Tacotron2 voice using 10
hours of data, while a 3-hour FastSpeech2 system Kenneth R. Beesley and Lauri Karttunen. 2003. Finite
showed no significant difference from a 24-hour State Morphology. CSLI Publications.
Tacotron2 voice.
We attribute these results to the fact that Fast- Steven Bird. 2020. Decolonising speech and lan-
guage technology. In Proceedings of the 28th Inter-
Speech2 learns input token durations from forced national Conference on Computational Linguistics,
alignments, rather than jointly learning to align lin- pages 3504–3519.
guistic inputs to acoustic features alongside the
acoustic feature prediction task as in attention- Paul Boersma and Vincent van Heuven. 2001. Speak
and unSpeak with PRAAT. Glot International,
8
https://fastpitch.github.io/ 5(9/10):341–347.

7354

Nathan Thanyehténhas Brinklow. 2021. Indigenous Grace A. Gomashie. 2019. Kanien’keha / Mohawk In-
language technologies: Anti-colonial oases in a digenous language revitalisation efforts in Canada.
colonizing (digital) world. WINHEC: Interna- McGill Journal of Education / Revue des sciences
tional Journal of Indigenous Education Scholarship, de l’éducation de McGill, 54(1):151–171.
16(1):239–266.
Alexander Gutkin, Martin Jansche, and Tatiana
Nathan Thanyehténhas Brinklow, Patrick Littell, De- Merkulova. 2018. FonBund: A Library for
laney Lothian, Aidan Pine, and Heather Souter. Combining Cross-lingual Phonological Segment
2019. Indigenous Language Technologies & Lan- Data. In Proceedings of the Eleventh International
guage Reclamation in Canada. Proceedings of the Conference on Language Resources and Evalua-
1st International Conference on Language Technolo- tion (LREC 2018), pages 2236–2240. European
gies for All, pages 402–406. Language Resources Association (ELRA).
Hennie Brugman and Albert Russel. 2004. An- Atticus Harrigan, Antti Arppe, and Timothy Mills.
notating Multi-media/Multi-modal Resources with 2019. A Preliminary Plains Cree Speech Synthe-
ELAN. In Proceedings of the Fourth International sizer. In Proceedings of the 3rd Workshop on the
Conference on Language Resources and Evaluation Use of Computational Methods in the Study of En-
(LREC’04), Lisbon, Portugal. European Language dangered Languages Volume 1 (Papers), pages 64–
Resources Association (ELRA). 73, Honolulu. Association for Computational Lin-
guistics.
Chung-Ming Chien. 2021. ming024/FastSpeech2.
https://github.com/ming024/FastSpeech2. Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio,
Original-date: 2020-06-25T13:57:53Z. Erik Marchi, and Varun Lakshminarasimhan. 2021.
Whispered and Lombard Neural Speech Synthesis.
Robert AJ Clark. 2014. Simple4all. In Proc. Inter- In 2021 IEEE Spoken Language Technology Work-
speech 2014, pages 1502–1503. shop (SLT), pages 454–461.
Robert AJ Clark, Korin Richmond, and Simon King. Innu-Atikamekw-Anishnabeg Coalition. 2020. Export
2007. Multisyn: Open-domain unit selection for the of Canadian Hydropower to the United States - First
festival speech synthesis system. Speech Communi- Nations in Québec and Labrador Unite to Oppose
cation, 49(4):317–330. Hydro-Québec Project.
Michael Conrad. 2020. Tacotron2 and Chero-
Keith Ito and Linda Johnson. 2017. The LJ speech
kee TTS. https://www.cherokeelessons.com/
dataset. https://keithito.com/LJ-Speech-
content/tacotron2-and-cherokee-tts/.
Dataset/.
Isin Demirsahin, Martin Jansche, and Alexander
Gutkin. 2018. A unified phonological representation ITU-R. 2003. Recommendation ITU-R BS.1534-1 -
of South Asian languages for multilingual text-to- Method for the subjective assessment of intermedi-
speech. In Proc. The 6th Intl. Workshop on Spoken ate quality level of coding systems. Technical Report
Language Technologies for Under-Resourced Lan- ITU-R BS.1534-1, International Telecommunication
guages (SLTU), pages 80–84. Union.

Jonathan Duddington and Reece Dunn. 2007. Jesin James, Isabella Shields, Rebekah Berriman, Pe-
eSpeak: Speech Synthesizer. http: ter Keegan, and Catherine Watson. 2020. Develop-
//espeak.sourceforge.net/. ing resources for te reo Māori text to speech synthe-
sis system. In P. Sojka, I. Kopeček, K. Pala, and
EPA. 2019. Emissions & generation resource inte- A. Horák, editors, Text, Speech, and Dialogue, pages
grated database (eGRID). https://www.epa.gov/ 294–302.
egrid.
Lukasz Kaiser, Aidan N. Gomez, and Francois Chollet.
First Peoples’ Cultural Council. 2018. Report on 2018. Depthwise Separable Convolutions for Neural
the status of B.C. https://fpcc.ca/resource/ Machine Translation. In International Conference
fpcc-report-of-the-status-of-b-c-first- on Learning Representations.
nations-languages-2018/.
Anna Kazantseva, Owennatekha Brian Maracle,
Clarissa Forbes, Henry Davis, Michael Schwan, and Ronkwe’tiyóhstha Josiah Maracle, and Aidan
Gitksan Research Lab. 2017. Three Gitksan Texts. Pine. 2018. Kawennón:nis: the wordmaker for
Papers for the International Conference on Salish Kanyen’kéha. In Proceedings of the Workshop
and Neighbouring Languages, 52:47–89. on Computational Modeling of Polysynthetic Lan-
guages, pages 53–64, Santa Fe, New Mexico, USA.
Eren Gölge. 2020. Solving Attention Problems of Association for Computational Linguistics.
TTS models with Double Decoder Consistency.
https://erogol.com/solving-attention- Te Taka Keegan. 2019. Issues with Māori sovereignty
problems-of-tts-models-with-double- over Māori language data. Let The Languages Live
decoder-consistency/. 2019 Conference.

7355

R. Kubichek. 1993. Mel-cepstral distance measure for      Aidan Pine and Mark Turin. 2017. Language Revital-
   objective speech quality assessment. In Proceedings      ization. Oxford Research Encyclopedia of Linguis-
   of IEEE Pacific Rim Conference on Communications         tics.
   Computers and Signal Processing, volume 1, pages
   125–128 vol.1.                                         Brock Thorbjorn Pitawanakwat. 2009.         Anishi-
                                                            naabemodaa Pane Oodenang: a qualitative study
Abhijeet Kumar. 2017. Spoken Speaker Identification         of Anishinaabe language revitalization as self-
  based on Gaussian Mixture Models : Python Imple-          determination in Manitoba and Ontario. Ph.D. the-
  mentation.                                                sis, University of Victoria.

Adrian Łańcucki. 2021. Fastpitch: Parallel Text-to-       Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao,
  Speech with Pitch Prediction. In ICASSP 2021              Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2:
 - 2021 IEEE International Conference on Acous-             Fast and High-Quality End-to-End Text to Speech.
  tics, Speech and Signal Processing (ICASSP), pages        In International Conference on Learning Represen-
  6588–6592.                                                tations.

A. Levasseur, S. Mercier-Blais, Y. T. Prairie, A. Trem-   Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao,
  blay, and C. Turpin. 2021. Improving the accuracy          Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech:
  of electricity carbon footprint: Estimation of hydro-      Fast, Robust and Controllable Text to Speech. In
  electric reservoir greenhouse gas emissions. Renew-       Advances in Neural Information Processing Systems,
  able and Sustainable Energy Reviews, 136:110433.          volume 32.

Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan         Keren Rice. 2008. Indigenous languages in Canada. In
  Su, and Dong Yu. 2019. Maximizing Mutual Infor-           The Canadian Encyclopedia.
  mation for Tacotron. arXiv:1909.01145.
                                                          Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li,          Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng
  Sheng Zhao, Enhong Chen, and Tie-Yan Liu. 2021.           Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-
  LightSpeech: Lightweight and Fast Text to Speech          Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis,
  with Neural Architecture Search. In IEEE Interna-         and Yonghui Wu. 2018. Natural TTS Synthesis by
  tional Conference on Acoustics, Speech and Signal         Conditioning WaveNet on Mel Spectrogram Predic-
  Processing (ICASSP), pages 5699–5703.                     tions. In IEEE International Conference on Acous-
                                                            tics, Speech and Signal Processing (ICASSP), pages
Michael McAuliffe, Michaela Socolof, Sarah Mihuc,           4779–4783.
  Michael Wagner, and Morgan Sonderegger. 2017.
                                                          Statistics Canada. 2016.  Census of population.
  Montreal Forced Aligner: Trainable Text-Speech
                                                             https://www12.statcan.gc.ca/census-
  Alignment Using Kaldi. In Interspeech 2017, pages
                                                             recensement/2016/dp-pd/index-eng.cfm.
  498–502. ISCA.
                                                          Emma Strubell, Ananya Ganesh, and Andrew McCal-
Teresa L McCarty. 2018. Community-based language            lum. 2019. Energy and Policy Considerations for
  planning: Perspectives from Indigenous language re-       Deep Learning in NLP. In Proceedings of the 57th
  vitalization. In The Routledge handbook of language       Annual Meeting of the Association for Computa-
  revitalization, pages 22–35. Routledge.                   tional Linguistics, pages 3645–3650, Florence, Italy.
                                                            Association for Computational Linguistics.
Richard Oster, Angela Grier, Rick Lightning, Maria
  Mayan, and Ellen Toth. 2014. Cultural continuity,       William J. Sutherland. 2003. Parallel extinction risk
  traditional Indigenous language, and diabetes in Al-      and global distribution of languages and species. Na-
  berta first nations: a mixed methods study. Interna-      ture, 423:276–279.
  tional journal for equity in health, 13:92.
                                                          Hideyuki Tachibana, Katsuya Uenoyama, and Shun-
Alejandro Perez-Gonzalez-de-Martos, Albert Sanchis,         suke Aihara. 2018. Efficiently trainable text-to-
  and Alfons Juan. 2021. VRAIN-UPV MLLP’s sys-              speech system based on deep convolutional net-
  tem for the Blizzard Challenge 2021. In Blizzard          works with guided attention. 2018 IEEE Interna-
  Challenge 2021 Workshop.                                  tional Conference on Acoustics, Speech and Signal
                                                            Processing (ICASSP), pages 4784–4788.
Aidan Pine. 2021. Low Resource Speech Synthesis.
  M.Sc. dissertation, University of Edinburgh.            Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu.
                                                            2021.    A Survey on Neural Speech Synthesis.
Aidan Pine, Patrick Littell, Eric Joanis, David             arXiv:2106.15561.
  Huggins-Daines, Christopher Cox, Fineen Davis,
  Eddie Antonio Santos, Shankhalika Srikanth, De-         Paul Taylor, Alan W Black, and Richard Caley. 1998.
  laisie Torkornoo, and Sabrina Yu. Under Review.           The architecture of the Festival speech synthesis
  Gi 2Pi : Rule-based, index-preserving grapheme-to-        system. In The Third ESCA/COCOSDA Workshop
  phoneme transformations.                                  (ETRW) on Speech Synthesis, pages 147–152.

                                                      7356

Tao Tu, Yuan-Jui Chen, Cheng-chieh Yeh, and Hung- the potential harms. Beyond assessing the ben-
yi Lee. 2019. End-to-end Text-to-speech for efits and risks of introducing a new technology
Low-resource Languages by Cross-Lingual Transfer
into language revitalization efforts, communities
Learning. In Interspeech 2019, pages 2075–2079.
are concerned with the way the technology is re-
A. M. Urrea, José Abel Herrera Camacho, and Mari- searched and developed, as this process has the
bel Alvarado García. 2009. Towards the Speech ability to empower or disempower language com-
Synthesis of Raramuri: A Unit Selection Approach
based on Unsupervised Extraction of Suffix Se-
munities in equal measure (Alia, 2009; Brinklow
quences. Research in Computing Science, 41:243– et al., 2019). The current model for developing
256. speech synthesis systems is not very equitable –
models need to be run on GPUs by people with
Cassia Valentini-Botinhao and Simon King. 2021. De-
tection and Analysis of Attention Errors in Sequence- specialized training. For Indigenous communities
to-Sequence Text-to-Speech. In Interspeech 2021, to create speech synthesis tools for their languages,
pages 2746–2750. ISCA. they should not be required to hand over their lan-
guage data to a large government or corporate or-
Dan Wells and Korin Richmond. 2021. Cross-lingual
Transfer of Phonological Features for Low-resource ganization. A pre-training, fine-tuning pipeline
Speech Synthesis. In Proc. 11th ISCA Speech Syn- could be attractive for this reason; communities
thesis Workshop, pages 160–165. could fine-tune their own models on a laptop if a
multilingual/multi-speaker model were pre-trained
Mirjam Wester, Cassia Valentini-Botinhao, and Gus-
tav Eje Henter. 2015. Are we using enough lis- on GPUs at a larger institution. Reducing the
teners? No!—an empirically-supported critique of computational requirements for training and infer-
Interspeech 2014 TTS evaluations. In Interspeech ence of these models could help ensure language
2015, pages 3476–3480. communities have greater control over the process
D. Whalen, Margaret Moss, and Daryl Baldwin. 2016. of the development of these systems, less depen-
Healing through language: Positive physical health dence on governmental organizations or corpora-
effects of indigenous language use. F1000Research, tions, and more sovereignty over their data (Kee-
5:852.
gan, 2019).
Robert Whitman, Richard Sproat, and Chilin Shih. Strubell et al. (2019) present an argument for eq-
1997. A Navajo Language Text-to-Speech Synthe-
sizer. AT&T Bell Laboratories. uitable access to computational resources for NLP
research; put another way, we might say that sys-
Junichi Yamagishi, Christophe Veaux, and Kirsten Mac- tems which require less compute are more accessi-
Donald. 2019. CSTR VCTK Corpus: English Multi-
ble. Reducing the number of parameters in a neu-
speaker Corpus for CSTR Voice Cloning Toolkit
(version 0.92). University of Edinburgh, The Cen- ral TTS model should translate to increased effi-
tre for Speech Technology Research (CSTR). ciency, and might make the model less prone to
overfitting when training on limited amounts of
Yibin Zheng, Xi Wang, Lei He, Shifeng Pan, Frank K.
Soong, Zhengqi Wen, and Jianhua Tao. 2019.
data. As discussed in §4.2.2, we modified the
Forward-Backward Decoding for Regularizing End- base implementation of FastSpeech2 from Chien
to-End TTS. In Interspeech 2019, pages 1283–1287. (2021) closely following the lightweight alterna-
tive discovered through neural architecture search
A Compute, Accessibility, & in Luo et al. (2021). These changes reduced the
Environmental Impact size of the model from Chien (2021) from 35M to
11.6M parameters, reduced the size of the stored
For reasons of environmental impact and acces-
model from 417 MB to 135 MB and significantly
sibility, reducing the amount of computation re-
improved inference and train times as summarized
quired for both training and inference is important
in Table 1. We saw a 33% improvement in av-
for any neural speech synthesis system, particu-
erage batch processing times on the GPU during
larly so for Indigenous languages.
training, and 64% on the CPU, which may be even
A.1 Accessibility, Training & Inference more relevant for Indigenous language communi-
Speed ties with limited computational resources. During
inference, we saw a 15% speed-up on GPU and
While language revitalization efforts are mostly
57% on CPU.
encouraging about integrating new technologies
into curriculum, there is a growing awareness of Results were timed by running the model for 300
7357

FastSpeech2                    Adapted System
                                       GPU       90.52 ms (σ 3.31)              60.04 ms (σ 1.70)
                       Training
                                       CPU       7561.50 ms (σ 263.55)          2720.88 ms (σ 92.99)
                                       GPU       12.00 ms (σ 0.30)              10.23 ms (σ 0.78)
                       Inference
                                       CPU       138.73 ms (σ 3.94)             59.50 ms (σ 1.85)

Table 1: Mean and standard deviation of training and inference times for a single forward pass of baseline Fast-
Speech2 and adapted models.

repetitions and taking the mean. The GPU (Tesla                     34.5g CO2eq/kWh (Levasseur et al., 2021). This
V100-SXM2 16GB) was warmed up for 10 repe-                          results in a total equivalent carbon consumption
titions before timing started, and PyTorch’s built-                 of 27,821.65 grams, roughly equivalent to driving
in GPU synchronization method was used to syn-                      a single passenger gas-powered vehicle for 110
chronize timing (which occurs on the CPU) with                      kilometres according to the average rate of 404
the training or inference running on the GPU. CPU                   grams/mile (EPA, 2019).
tests were performed on an Intel(R) Xeon(R) CPU                        This is a comparatively low CO2 consump-
E5-2650 v2 @ 2.60GHz with 4 cores and 16GB                          tion for over 1500 GPU hours, largely due to the
memory reserved. All timings used a batch size of                   low CO2/kWh output of Quebec electricity when
16.                                                                 compared with the 2019 USA average of 400g
                                                                    CO2eq/kWh (EPA, 2019). However, CO2 equiva-
A.2 CO2 Consumption                                                 lents are just a proxy for environmental impact and
Strubell et al. (2019) also argue that NLP re-                      should not be understood to comprehensively ac-
searchers should have a responsibility to disclose                  count for social and environmental impact. Hydro-
the environmental footprint of their research, in or-               electric dam projects in Quebec, like the ones pow-
der for the community to effectively evaluate any                   ering the GPSC have a sordid and complex history
gains and to allow for a more equitable and repro-                  in the province. Innu Nation Grand Chief Mary
ducible field.                                                      Ann Nui spoke to this when she commented that
  All experiments for this paper requiring a GPU                    “over the past 50 years, vast areas of our ancestral
were run on the Canadian General Purpose Science                    lands were destroyed by the Churchill Falls hydro-
Cluster (GPSC) in Dorval, Quebec. Experiments                       electric project, people lost their land, their liveli-
were all run on single Tesla V100-SXM2 16GB                         hoods, their travel routes, and their personal be-
GPUs. Strubell et al. (2019) provide the following                  longings when the area where the project is located
equation for estimating CO2 production:                             was flooded. Our ancestral burial sites are under
                                                                    water, our way of life was disrupted forever. Innu
              1.58t(pc + pr + (g ∗ pg ))                            of Labrador weren’t informed or consulted about
       pt =                                                 (1)
                        1000                                        that project” (Innu-Atikamekw-Anishnabeg Coali-
where t is time, pt is total power for training, pc                 tion, 2020).
is average draw of CPU sockets, pr is average
                                                                    B Qualitative Results
DRAM memory draw, g is the number of GPUs
used in training and pg is the average draw from                    Question:
GPUs. In our case, we estimate t to be equal                           “Would you be comfortable with any of the
to 1,541.989 after summing the time for exper-                      voices you heard being played online, say for a
iments based on their log files, pc is 75 watts,                    digital dictionary or verb conjugator if no other
pr is 6 watts, g is 1, and pg is 250 watts, and                     recording existed?”
the equation for grams of CO2 consumption is
CO2 = 34.5pt as the average carbon footprint                           Kanien’kéha responses:
of electricity distributed in Quebec is estimated at
   9
                                                                       • Yes.
      Note this estimate is based on the total number of hours
spent running experiments from the M.Sc. dissertation this pa-
per draws its experiments from. There were additional mod-             • yes
els trained for experiments that are not discussed in this paper.
As such, this is a generous overestimation of t.                       • Yes
                                                                7358

You can also read