Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
Minchan Kim1*, Myeonghun Jeong1*, Byoung Jin Choi1, Sunghwan Ahn1, Joun Yeop Lee2, Nam Soo Kim1
1 Department of Electrical and Computer Engineering and INMC, Seoul National University, Seoul, South Korea
2 Samsung Research, Samsung Electronics, Republic of Korea
{mckim, mhjeong, bjchoi, shahn}@hi.snu.ac.kr, jounyeop.lee@samsung.com, nkim@snu.ac.kr
* These authors contributed equally to this work.

Abstract

Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus, which is troublesome to collect. In this paper, we propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech for pre-training. By leveraging the wav2vec2.0 representation, unlabeled speech can substantially improve performance, especially when labeled speech is scarce. We also extend the proposed method to zero-shot multi-speaker TTS (ZS-TTS). The experimental results verify the effectiveness of the proposed method in terms of naturalness, intelligibility, and speaker generalization. We highlight that the single-speaker TTS model fine-tuned on only 10 minutes of labeled data outperforms the other baselines, and that the ZS-TTS model fine-tuned on only 30 minutes of single-speaker data can generate the voice of an arbitrary speaker, by pre-training on an unlabeled multi-speaker speech corpus.

Index Terms: speech synthesis, transfer learning, zero-shot multi-speaker text-to-speech

1. Introduction

In recent years, text-to-speech (TTS) models [1–5] have achieved dramatic improvements in various respects, including naturalness, intelligibility, generation speed, and controllability. However, one of the unresolved problems of TTS is that training these neural TTS models requires large amounts of text-labeled speech for high-quality speech generation. For example, LJSpeech [6], a public single-speaker corpus for TTS, consists of more than 20 hours of carefully recorded speech with transcriptions. The difficulty and cost of collecting such labeled data can limit applications in various fields. In zero-shot multi-speaker TTS (ZS-TTS) [7–10], which synthesizes the voices of new speakers from only a few seconds of reference speech, collecting a labeled dataset is even harder, since the dataset should contain as many speakers as possible for good speaker generalization.

Meanwhile, transfer learning has been widely used in various fields, especially when labeled data are insufficient. In transfer learning, a neural network is first pre-trained on an indirectly related objective, and the pre-trained model is then used for parameter initialization during fine-tuning or for feature extraction. For example, in computer vision (CV), image classification models [11–14] trained on ImageNet [15] are commonly fine-tuned to classify new classes with few examples. In natural language processing (NLP), self-supervised pre-training methods [16–18] make it possible to exploit large-scale unlabeled datasets and significantly improve the generalization performance of downstream tasks such as machine translation and sentence classification. Following these works, several studies have attempted to apply transfer learning to speech data [19–21]. For instance, wav2vec 2.0 [20] is used for low-resource speech recognition, even when there is no labeled dataset [22]. In addition, several models utilize self-supervised representations for speech analysis, disentanglement, and voice conversion [23–25].

Inspired by these works, we propose a novel transfer learning framework for TTS that employs large amounts of unlabeled speech for pre-training. The proposed method operates on the VITS [5] architecture and leverages the wav2vec 2.0 representation to extract pseudo phoneme sequences, which are used for pre-training as a substitute for phoneme sequences. We also carefully designed the fine-tuning procedure in consideration of the role of each component and the training mechanism of the normalizing flow.

The contributions of our work are as follows:
• To the best of our knowledge, this is the first approach that utilizes an unlabeled dataset for TTS within a transfer learning framework.
• For single-speaker TTS, we verify that the proposed pre-training significantly improves the quality of the generated speech, especially when labeled data are scarce. Only 10 minutes of labeled data are required for synthesizing high-fidelity speech.
• We extend the proposed framework to ZS-TTS. Pre-training on a large corpus containing many speakers improves speaker generalization on unseen speakers and even allows the model to generate reference voices with only 30 minutes of a single-speaker TTS dataset.

2. Proposed Method

In this section, we describe the proposed transfer learning framework for TTS. As a base architecture, we use VITS [5] with several modifications. We first pre-train VITS with a large untranscribed speech corpus and then fine-tune the model with a small amount of text-labeled data. A new phonetic token named the pseudo phoneme is used for pre-training as a substitute for the phoneme. We explain the pseudo phoneme in Section 2.1 and describe the detailed method in Section 2.2. Finally, we discuss the operation of the proposed framework in Section 2.3. The overall framework is depicted in Figure 1.
[Figure 1: The entire framework of the proposed method: (a) pre-training procedure, (b) fine-tuning procedure, (c) inference procedure. The discriminator part of the model is omitted for simplicity. The legend marks each module as frozen after pre-training, fine-tuned, trained from scratch, not trainable, not used for inference, or not used for fine-tuning.]

2.1. Pseudo Phoneme

The pseudo phoneme is a proposed token that contains phonetic information, and it is used as a substitute for the phoneme in pre-training. The pseudo phoneme should have characteristics similar to those of a phoneme for effective pre-training, and it has to be obtainable from a speech-only corpus, following our precondition of not using supervision. To satisfy these requirements, we leverage the hidden representations of a pre-trained wav2vec2.0 [20] model, which is trained in a self-supervised manner. According to [22], k-means clustering on the hidden representations of wav2vec2.0 works well for assigning a phonetic token to each speech frame. The pseudo phoneme is extracted as follows. First, we perform k-means clustering to identify K clusters on the block-15 hidden representations over the entire unlabeled speech corpus. We denote the hidden representations of block 15 as h_1, ..., h_T. Then, the cluster index i_t ∈ {1, ..., K} is obtained from each h_t, where we set K = 128 in this work. As shown in Figure 1a, identical consecutive indices are merged to reflect the characteristics of a real phoneme. We refer to these merged indices i′_1, ..., i′_{T′} as the pseudo phoneme sequence and use it for pre-training. In the above notation, T and T′ denote the lengths of the speech frames and the merged frames, respectively.
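As a concrete illustration of this extraction pipeline, the sketch below clusters pre-extracted block-15 features with scikit-learn and merges repeated indices. The feature-dumping step, the choice of k-means backend, and names such as block15_features are our own assumptions for the sketch, not details taken from the paper.

```python
# Minimal sketch of pseudo-phoneme extraction, assuming the block-15 hidden
# states of a pre-trained wav2vec2.0 model have already been dumped to
# per-utterance NumPy arrays of shape (T, D). The k-means backend (scikit-learn)
# and all variable names are illustrative assumptions.
from itertools import groupby

import numpy as np
from sklearn.cluster import KMeans

K = 128  # number of clusters used in the paper


def fit_codebook(block15_features):
    """Fit k-means over block-15 representations of the whole unlabeled corpus."""
    all_frames = np.vstack(block15_features)               # (sum of T, D)
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_frames)


def extract_pseudo_phonemes(utterance_features, codebook):
    """Assign a cluster index i_t to each frame, then merge identical consecutive
    indices into the pseudo-phoneme sequence i'_1, ..., i'_{T'}."""
    frame_ids = codebook.predict(utterance_features)        # i_1, ..., i_T
    pseudo_phonemes, durations = [], []
    for idx, group in groupby(frame_ids):
        pseudo_phonemes.append(int(idx))
        durations.append(len(list(group)))                  # frames merged per token
    return pseudo_phonemes, durations
```

The per-token frame counts are a by-product of the merge step, matching the duration information shown next to the merge in Figure 1a.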
2.2. Transfer Learning Framework for TTS

The proposed framework builds upon the VITS architecture, which has a conditional variational autoencoder (CVAE) structure. VITS consists of a posterior encoder, a prior encoder, a decoder, and a duration predictor. The posterior encoder extracts the posterior distribution q_φ(z|x_lin) of the latent variable z given the linear spectrogram x_lin, and the decoder generates the output audio from the sampled z. The prior encoder consists of a normalizing flow and a text encoder, and it matches the conditional prior distribution p_θ(z|c_text, A_text) of z, given the text c_text and the alignment A_text. The alignment A_text is calculated by the monotonic alignment search (MAS) algorithm [4], and the duration predictor learns the duration of each phoneme using A_text. Additionally, adversarial training helps to generate high-fidelity sound. In the case of zero-shot multi-speaker TTS, we add a reference encoder with the same architecture as [26] to extract a speaker embedding. The speaker embedding is conditioned on the affine coupling layers of the normalizing flow in the same way as [5], which makes the prior distribution of z dependent on both text and speaker.

Pre-training: For pre-training, we exploit the vanilla VITS training method except for two differences. First, the pseudo phoneme and a pseudo text encoder are used instead of the phoneme and text encoder, respectively. The pseudo text encoder consists of 2 layers of 1D convolutions with a ReLU activation. Accordingly, the training objective is modified to maximize the conditional likelihood of speech given the pseudo phoneme, so the Kullback-Leibler divergence (KLD) loss for prior matching is changed from Eq. 1 to Eq. 2:

L_kl = log q_φ(z|x_lin) − log p_θ(z|c_text, A_text)    (1)

L′_kl = log q_φ(z|x_lin) − log p_ψ(z|c_i′, A_i′)    (2)

In Eq. 2, ψ represents the parameters of the modified prior encoder (the pseudo text encoder and the normalizing flow), and A_i′ denotes the alignment between speech and the pseudo phoneme sequence. For ZS-TTS, the reference encoder is jointly trained with the other components.
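The paper fixes only the overall structure of the pseudo text encoder (two 1D convolution layers with a ReLU activation). The sketch below is a minimal PyTorch rendering in which the embedding size, channel width, kernel size, and the final projection to prior statistics are illustrative assumptions, modeled loosely on how the VITS text encoder exposes its output.

```python
# Minimal sketch of the pseudo text encoder: two 1D conv layers with ReLU.
# The embedding size, channel width, and kernel size are illustrative
# assumptions; only the overall structure is specified in the paper.
import torch
import torch.nn as nn


class PseudoTextEncoder(nn.Module):
    def __init__(self, n_pseudo_phonemes: int = 128, hidden: int = 192, kernel_size: int = 5):
        super().__init__()
        self.embed = nn.Embedding(n_pseudo_phonemes, hidden)
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        # Project to the mean and log-std of the prior, mirroring the role of the
        # VITS text encoder's output projection (an assumption of this sketch).
        self.proj = nn.Conv1d(hidden, 2 * hidden, 1)

    def forward(self, pseudo_phonemes: torch.LongTensor):
        x = self.embed(pseudo_phonemes).transpose(1, 2)      # (B, hidden, T')
        x = self.convs(x)
        stats = self.proj(x)
        mu, logs = stats.chunk(2, dim=1)                     # prior mean / log-std
        return x, mu, logs
```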
Fine-tuning: The goal of fine-tuning is to adapt the pre-trained model to phoneme sequences. During fine-tuning, the pseudo phoneme and pseudo text encoder are replaced by the phoneme and text encoder, and the KLD loss is also returned to Eq. 1. We carefully divide the model into three parts for efficient training: frozen, fine-tuned, and trained from scratch. First, the decoder and posterior encoder are frozen during fine-tuning. As the role of the decoder and posterior encoder is to reconstruct high-quality raw audio, we assume that these modules are sufficiently trained on the large corpus and do not have to be fine-tuned on the small dataset. Freezing the decoder brings several benefits. Since we do not have to feed-forward through the frozen decoder, we save a large amount of memory and computation that would otherwise be spent generating high-resolution audio. In addition, because no speech output is produced during fine-tuning, we can skip the adversarial training of VITS, which would make the training process more complex. In the case of ZS-TTS, the reference encoder is also frozen during fine-tuning; we assume that fine-tuning the reference encoder on a small number of speakers can degrade the generalization performance for unseen speakers. Meanwhile, the normalizing flow is fine-tuned to adjust the pseudo-phoneme-dependent prior to the phoneme-dependent prior, which mitigates the mismatch between the phoneme and the pseudo phoneme. Lastly, the text encoder and duration predictor are trained from scratch. For better generalization, we do not condition the speaker embedding on the duration predictor. In short, the only fine-tuned part is the normalizing flow, and only the KLD loss and the duration prediction loss are used for fine-tuning.
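A minimal sketch of this parameter grouping is shown below. It assumes VITS-style attribute names (dec, enc_q, flow, enc_p, dp, ref_enc) that may differ in an actual implementation, and it uses reset_parameters() as shorthand for "trained from scratch"; the optimizer choice and learning rate are likewise illustrative.

```python
# Minimal sketch of the fine-tuning setup described above: freeze the decoder,
# posterior encoder, and (for ZS-TTS) reference encoder; fine-tune the
# normalizing flow; train the text encoder and duration predictor from scratch.
# Module names (dec, enc_q, flow, enc_p, dp, ref_enc) are illustrative.
import itertools
import torch


def build_finetune_optimizer(model, lr: float = 2e-4):
    # Frozen: decoder, posterior encoder, and (if present) reference encoder.
    for module in (model.dec, model.enc_q, getattr(model, "ref_enc", torch.nn.Identity())):
        for p in module.parameters():
            p.requires_grad = False

    # Trained from scratch: text encoder and duration predictor are
    # re-initialized instead of inheriting pre-trained weights.
    for module in (model.enc_p, model.dp):
        for m in module.modules():
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()

    # Optimize only the normalizing flow (fine-tuned) and the freshly
    # initialized text encoder / duration predictor.
    params = itertools.chain(
        model.flow.parameters(), model.enc_p.parameters(), model.dp.parameters()
    )
    return torch.optim.AdamW(params, lr=lr)
```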
2.3. Discussion on Framework Design

There is a notable difference between existing transfer learning methods and the proposed framework. In transfer learning, a domain D is defined as D = {X, P(X)}, where X is a feature space and P(X) is the probability distribution of X = {x_1, ..., x_n} ∈ X. Given a domain D, a task is defined as T = {Y, f(x)}, where Y is a label space and f : X → Y is an objective predictive function. The knowledge of the source domain D_S and source task T_S can help train the target function f_T(x) of the target task T_T in the target domain D_T. In our framework, however, the source domain (pseudo phoneme indices) and the target domain (phoneme indices) do not share any common feature space, meaning X_S ∩ X_T = ∅. Instead, Y_S and Y_T are highly related, since they are both speech. For this reason, feed-forward TTS models such as [2, 27–29] are not suitable for the proposed method: when the domain changes, the text encoder must be randomly initialized because of the feature space mismatch, so the decoder receives unlearned features from the text encoder. This makes it difficult to utilize the knowledge of the decoder acquired during pre-training and causes catastrophic forgetting [30] during fine-tuning. To avoid this issue, we exploit the invertibility of the normalizing flow. The normalizing flow runs in opposite directions for the training and inference procedures. In the training procedure, the output z of the posterior encoder is converted to the latent variable of the normalizing flow, z_p ∼ N(µ_θ(c_text, A_text), σ_θ(c_text, A_text)), where µ_θ and σ_θ are calculated from the text encoder. This property converts the domain shift problem into a target shift problem. Specifically, maximizing log p_θ(z_p; µ_θ, σ_θ) in the objective of Eq. 1 (note that in Eq. 1, p_θ(z|c_text, A_text) = N(z_p; µ_θ, σ_θ) |det ∂z_p/∂z|) resembles minimizing a weighted mean-squared error (MSE) between z_p and µ_θ(c_text, A_text), which can be seen as a target shift from µ_ψ(c_i′, A_i′). As the pseudo phoneme carries rich phonetic information, the text encoder can quickly learn the well-defined latent space of z_p, while the normalizing flow adapts to the gap between text and pseudo phoneme.
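To make this correspondence concrete, the diagonal Gaussian log-density in the notation above expands as (this is only the standard Gaussian log-likelihood written out, not an additional result):

\log \mathcal{N}(z_p;\, \mu_\theta, \sigma_\theta) \;=\; -\sum_{d}\left[\frac{\left(z_{p,d}-\mu_{\theta,d}\right)^2}{2\,\sigma_{\theta,d}^{2}} \;+\; \log \sigma_{\theta,d} \;+\; \tfrac{1}{2}\log 2\pi\right]

With σ_θ held fixed, maximizing this with respect to the text encoder output µ_θ(c_text, A_text) therefore amounts to minimizing an inverse-variance-weighted MSE between z_p and µ_θ, which is the target shift view used above.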
3. Experiments

In this section, we verify the effectiveness of the proposed method in single-speaker TTS and zero-shot multi-speaker TTS. The detailed experimental setup and results of each case are described in Sections 3.1 and 3.2, respectively. Our synthesized audio samples are publicly available at https://jmhxxi.github.io/TransferTTS-demo/.

3.1. Single Speaker TTS

For single-speaker TTS, we first pre-trained the single-speaker VITS without transcriptions and then fine-tuned it on labeled datasets of different sizes. This experiment demonstrates the effectiveness of the proposed method and its relation to the amount of labeled data.

Dataset: We used the LJSpeech [6] dataset for single-speaker TTS. The LJSpeech dataset consists of 13,100 short sentences, and the total length is about 24 hours at a sample rate of 22.05 kHz. The dataset was randomly split into a training set (12,500 sentences) and an evaluation set (500 sentences). All of the speech in the training set (23 hours, unlabeled) was used for pre-training, and different amounts of speech-text pairs from the training set were used for fine-tuning: LJ-10min (10 minutes labeled), LJ-1h (1 hour labeled), and LJ-10h (10 hours labeled).

Implementation Details: We adopted the basic configuration of [5] for the proposed model unless otherwise stated. We pre-trained the model for 500k iterations and fine-tuned it for only 30k iterations to generate high-quality speech. Both processes were trained with a mini-batch size of 64. For comparison, we used VITS-baseline, Glow-TTS [4], and FastSpeech2 [2]. These baselines were trained on the fine-tuning subsets without pre-training. HiFi-GAN [31], trained on the entire LJSpeech corpus, was used as the vocoder for Glow-TTS and FastSpeech2.

Evaluation Metrics: For the subjective evaluation of audio fidelity, we conducted a mean opinion score (MOS) test; 15 raters listened to randomly selected samples and gave scores from 1 to 5 based on naturalness. In addition, we estimated the character error rate (CER) to evaluate the intelligibility of the generated speech, using a pre-trained speech recognition model from the speechbrain toolkit [32] for transcription.
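For reference, CER is the character-level edit distance between the input text and the ASR transcription of the synthesized audio, normalized by the reference length. The paper does not spell out its exact CER implementation or which speechbrain recognizer was used, so the function below is a self-contained stand-in under those assumptions.

```python
# Self-contained CER sketch: character-level Levenshtein distance divided by the
# reference length. The hypothesis string is assumed to come from a pre-trained
# speech recognition model (the paper uses one from the speechbrain toolkit [32]).
def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference.lower()), list(hypothesis.lower())
    dist = list(range(len(hyp) + 1))          # DP row: distances to hyp prefixes
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i            # prev keeps the diagonal entry
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,            # deletion from the reference
                      dist[j - 1] + 1,        # insertion into the reference
                      prev + (r != h))        # substitution (0 if chars match)
            prev, dist[j] = dist[j], cur
    return dist[len(hyp)] / max(len(ref), 1)


# e.g. character_error_rate("the cat sat", "the cat sad") == 1 / 11
```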
Experimental Results: The results of single-speaker TTS are presented in Table 1. The proposed method shows the best performance in both CER and MOS in all cases, and the smaller the labeled dataset, the larger the performance gap. This indicates that our method, which exploits a speech-only corpus, significantly improves performance, especially when labeled data are scarce.

Table 1: Comparison of CER and MOS with 95% confidence intervals for single-speaker TTS.

  Method           CER     MOS
  Ground Truth     1.6     4.77±0.05
  -- LJSpeech 10 min --
  GlowTTS          31.6    1.82±0.11
  FastSpeech2      15.3    1.89±0.11
  VITS-baseline    8.0     2.20±0.10
  Proposed         3.5     4.42±0.08
  -- LJSpeech 1 hour --
  GlowTTS          8.2     2.83±0.12
  FastSpeech2      3.1     3.35±0.10
  VITS-baseline    2.4     3.87±0.11
  Proposed         2.3     4.58±0.07
  -- LJSpeech 10 hours --
  GlowTTS          2.1     4.37±0.09
  FastSpeech2      2.6     3.90±0.09
  VITS-baseline    1.9     4.60±0.07
  Proposed         1.9     4.66±0.06

3.2. Zero-Shot Multi-Speaker TTS

Similar to the single-speaker TTS experiment, we pre-trained the zero-shot multi-speaker VITS model and fine-tuned it on datasets of different sizes. This experiment verifies that the proposed method improves the speaker generalization performance of ZS-TTS.

Dataset: For pre-training, we used the train-clean-100 and train-clean-360 subsets of LibriTTS [33]. These subsets amount to about 245 hours in total and include 1,151 speakers. For fine-tuning, we used the LJSpeech corpus and the VCTK corpus [34]. We made subsets of these corpora based on the total duration and the number of speakers: LJ-30min (30 minutes, 1 speaker), VCTK-1h (1 hour, 4 speakers), VCTK-5h (5 hours, 20 speakers), and VCTK-20h (20 hours, 80 speakers). These subsets were used for fine-tuning. The evaluation set is composed of 28 VCTK speakers not included in the fine-tuning sets, so all speakers in the evaluation set are unseen during training. The evaluation set has a total of 560 sentences, with 20 sentences per speaker. All of the datasets were down-sampled to 22.05 kHz for this experiment.

Implementation Details: For ZS-TTS, the model configuration was the same as for single-speaker TTS except for the reference encoder. The proposed model was pre-trained for 500k iterations as in Section 3.1, then fine-tuned for 10k iterations on LJ-30min and for 100k iterations on the other subsets. For comparison, we used Meta-StyleSpeech [9], SC-GlowTTS [10] (for whose reference encoder we used a pre-trained speaker verification model [35]), and VITS-baseline. The VITS-baseline has the same architecture as the fine-tuned model but without pre-training, and these baselines were trained on the fine-tuning subsets except for LJ-30min. A pre-trained universal HiFi-GAN was used as the vocoder for Meta-StyleSpeech and SC-GlowTTS.

Evaluation Metrics: For speech quality and intelligibility, we used CER and MOS in the same way as in Section 3.1. In addition, we evaluated the speaker similarity between the reference speech and the synthesized speech with the similarity mean opinion score (SMOS) and the speaker embedding cosine similarity (SECS). For SMOS, 15 raters evaluated how well the generated speech reflects the speaker identity of the reference speech on a scale from 1 to 5. The SECS is the cosine similarity between the speaker embeddings of the generated speech and the reference speech; to extract speaker embeddings, we used a pre-trained speaker verification model [36] from the speechbrain toolkit. The SECS ranges from -1 to 1, and a higher score indicates higher similarity.
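SECS itself reduces to a cosine similarity between two embedding vectors. The sketch below assumes the embeddings have already been extracted with the pre-trained speaker verification model [36] from the speechbrain toolkit and are available as NumPy arrays.

```python
# SECS sketch: cosine similarity between speaker embeddings of the reference and
# the synthesized utterance. Embedding extraction with the pre-trained
# ECAPA-TDNN verification model [36] is assumed to happen elsewhere.
import numpy as np


def secs(ref_embedding: np.ndarray, syn_embedding: np.ndarray) -> float:
    """Speaker Embedding Cosine Similarity in [-1, 1]; higher means more similar."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    syn = syn_embedding / np.linalg.norm(syn_embedding)
    return float(np.dot(ref, syn))
```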
Experimental Results: The results of ZS-TTS are presented in Table 2. The overall performance gap between the baselines and the proposed model increases as the datasets get smaller, and the performance of the proposed model is relatively less affected by the amount of fine-tuning data. Specifically, the proposed method achieves lower CER in all of the cases. Even though the MOS of the proposed model is lower than that of VITS-baseline on VCTK-5h and VCTK-20h, we expect that this gap can be mitigated by carefully matching the domain between the pre-training and fine-tuning datasets. In terms of speaker similarity (SECS and SMOS), the proposed model significantly outperforms the baselines, including VITS-baseline, at all data sizes, except for the SECS of SC-GlowTTS on VCTK-20h. Since only SC-GlowTTS uses a pre-trained speaker verification model as its reference encoder, it may obtain a higher SECS as calculated by a speaker verification model, while perceptual speaker similarity shows a different tendency. One notable point is the result on LJ-30min: although the proposed model is fine-tuned on a small single-speaker dataset, it can generate voices of unseen speakers with comparable performance. Through this experiment, we show that the speaker similarity of ZS-TTS is heavily affected by the size of the multi-speaker TTS corpus, and that using a large-scale unlabeled speech corpus can improve speaker generalization, especially when the labeled dataset is limited.

Table 2: Comparison of results for zero-shot multi-speaker TTS. MOS and SMOS are described with 95% confidence intervals.

  Method                   CER     SECS     MOS          SMOS
  Ground Truth             4.1     0.632    4.72±0.06    4.65±0.07
  -- LJ 30 min, 1 speaker --
  Proposed                 6.1     0.212    3.99±0.10    3.47±0.09
  -- VCTK 1 h, 4 speakers --
  SC-GlowTTS               29.5    0.151    2.66±0.11    2.75±0.08
  Meta-StyleSpeech         18.1    0.107    2.12±0.11    1.98±0.09
  VITS-baseline (+ ref.)   18.9    0.107    3.36±0.11    2.84±0.08
  Proposed                 8.0     0.248    4.10±0.10    3.85±0.09
  -- VCTK 5 h, 20 speakers --
  SC-GlowTTS               14.0    0.248    3.34±0.11    3.36±0.08
  Meta-StyleSpeech         16.0    0.246    3.52±0.12    3.69±0.09
  VITS-baseline (+ ref.)   8.7     0.179    4.40±0.08    3.60±0.09
  Proposed                 8.5     0.298    4.21±0.10    3.92±0.09
  -- VCTK 20 h, 80 speakers --
  SC-GlowTTS               6.2     0.327    3.94±0.09    3.79±0.09
  Meta-StyleSpeech         14.4    0.302    3.42±0.11    3.85±0.08
  VITS-baseline (+ ref.)   5.7     0.267    4.44±0.08    4.00±0.09
  Proposed                 5.7     0.325    4.32±0.09    4.15±0.08

4. Conclusions

In this work, we introduced a transfer learning framework for TTS that uses a large amount of unlabeled speech for pre-training. The framework is based on the VITS architecture and on pseudo phonemes extracted from the wav2vec2.0 representation. Owing to the similar nature of phonemes and pseudo phonemes, we can pre-train the TTS model on a speech-only corpus, which improves overall performance, especially when labeled speech is scarce. Our experimental results show that the proposed method outperforms the other baselines over all of the fine-tuning subsets in single-speaker TTS and achieves higher speaker similarity in zero-shot multi-speaker TTS. This performance improvement was greater with smaller labeled datasets. With these advantages, we expect the proposed method to become a good starting point for low-resource TTS.

5. Acknowledgements

This work was supported by Samsung Research, Samsung Electronics Co., Ltd.
6. References

[1] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[2] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
[3] M. Jeong, H. Kim, S. J. Cheon, B. J. Choi, and N. S. Kim, "Diff-TTS: A denoising diffusion model for text-to-speech," arXiv preprint arXiv:2104.01409, 2021.
[4] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," arXiv preprint arXiv:2005.11129, 2020.
[5] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," arXiv preprint arXiv:2106.06103, 2021.
[6] K. Ito and L. Johnson, "The LJ Speech dataset," Online: https://keithito.com/LJ-Speech-Dataset, 2017.
[7] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," Advances in Neural Information Processing Systems, vol. 31, 2018.
[8] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6184–6188.
[9] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, "Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation," arXiv preprint arXiv:2106.03153, 2021.
[10] E. Casanova, C. Shulby, E. Gölge, N. M. Müller, F. S. de Oliveira, A. C. Junior, A. d. S. Soares, S. M. Aluisio, and M. A. Ponti, "SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model," arXiv preprint arXiv:2104.05557, 2021.
[11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[18] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," Advances in Neural Information Processing Systems, vol. 32, 2019.
[19] A. v. d. Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[20] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," arXiv preprint arXiv:2006.11477, 2020.
[21] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," arXiv preprint arXiv:2106.07447, 2021.
[22] A. Baevski, W.-N. Hsu, A. Conneau, and M. Auli, "Unsupervised speech recognition," arXiv preprint arXiv:2105.11084, 2021.
[23] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W.-N. Hsu, A. Mohamed, and E. Dupoux, "Speech resynthesis from discrete disentangled self-supervised representations," arXiv preprint arXiv:2104.00355, 2021.
[24] H.-S. Choi, J. Lee, W. Kim, J. Lee, H. Heo, and K. Lee, "Neural analysis and synthesis: Reconstructing speech from self-supervised representations," Advances in Neural Information Processing Systems, vol. 34, 2021.
[25] F. Kreuk, A. Polyak, J. Copet, E. Kharitonov, T.-A. Nguyen, M. Rivière, W.-N. Hsu, A. Mohamed, E. Dupoux, and Y. Adi, "Textless speech emotion conversion using decomposed and discrete representations," arXiv preprint arXiv:2111.07402, 2021.
[26] M. Kim, S. J. Cheon, B. J. Choi, J. J. Kim, and N. S. Kim, "Expressive text-to-speech using style tag," arXiv preprint arXiv:2104.00436, 2021.
[27] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, robust and controllable text to speech," Advances in Neural Information Processing Systems, vol. 32, 2019.
[28] A. Łańcucki, "FastPitch: Parallel text-to-speech with pitch prediction," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592.
[29] J. Vainer and O. Dušek, "SpeedySpeech: Efficient neural speech synthesis," arXiv preprint arXiv:2008.03802, 2020.
[30] R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
[31] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[32] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[33] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv preprint arXiv:1904.02882, 2019.
[34] J. Yamagishi, C. Veaux, K. MacDonald et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
[35] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, "Clova baseline system for the VoxCeleb speaker recognition challenge 2020," arXiv preprint arXiv:2009.14153, 2020.
[36] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," arXiv preprint arXiv:2005.07143, 2020.