The IMU speech synthesis entry for Blizzard Challenge 2019
Rui Liu, Jingdong Li, Feilong Bao and Guanglai Gao

College of Computer Science, Inner Mongolia University,
Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, Hohhot 010021, China

liurui_imu@163.com

Abstract

This paper describes the IMU speech synthesis entry for Blizzard Challenge 2019, where the task was to build a voice from Mandarin audio data. Our system is a typical end-to-end speech synthesis system: the acoustic parameters are modelled with the "Tacotron" model, and the vocoder is the Griffin-Lim algorithm. In the synthesis stage, the task is divided into the following parts: 1) segment long sentences into short sentences at commas; 2) predict an interjection label for each word in the short sentences; 3) predict a prosodic break label for each word in the short sentences; 4) generate the corresponding synthetic speech for each short sentence, enriched by the prosodic break labels and interjections; 5) concatenate the short sentences into the entire long sentence. The Blizzard Challenge listening test results show that the proposed system achieves unsatisfactory performance. The problems in the system are also discussed in this paper.

Index Terms: Blizzard Challenge 2019, end-to-end, Tacotron, prosodic phrase break, interjections

1. Introduction

This paper introduces the system submitted to Blizzard Challenge 2019 by Inner Mongolia University and the Inner Mongolia Key Laboratory of Mongolian Information Processing Technology, Hohhot, China. The name of our team is "IMU" and this is our first entry to the Blizzard Challenge.

The task of Blizzard Challenge 2019 is to build a voice from a Mandarin corpus of roughly 8 hours. This speech corpus is a collection of the voice of a well-known Chinese man named Zhenyu Luo. However, since these speech files were not recorded in a professional studio and their quality is not ideal, several challenges emerge. A testing set of sentences is also provided. All participants are asked to submit the corresponding speech files generated by their own model, and a large-scale subjective evaluation is then conducted.

Recent work on neural text-to-speech (TTS) can be split into two camps. In one camp, statistical parametric methods with deep neural network architectures are used [1, 2]. In the other camp, end-to-end models are used [3, 4, 5]. The recently proposed end-to-end TTS architectures (like Tacotron [3]) can be trained on <text, audio> pairs, eliminating the need for complex sub-systems that must be developed and trained separately. In our system, we used the end-to-end Tacotron model [3] to construct the experimental architecture.

Besides speech waveform generation, text analysis is another important component of a TTS system. Mandarin is the given language of this year's challenge. In order to retain the expression style of the speaker as much as possible, we used the text data in the training data to model the prosodic phrase breaks and interjections.

Since our system did not achieve satisfactory evaluations, we try to analyze its defects and problems. We hope that the problems we faced and the solutions we found may provide some useful information for other studies. No external training data was used in our system.

This paper is structured as follows: section 2 describes the data sets and pre-processing steps used for building the voice, while section 3 details the end-to-end speech synthesis techniques, along with the prosodic phrase break prediction and interjection prediction models. The results of the Blizzard Challenge listening tests are discussed in section 4, with concluding remarks in section 5.

2. Data Processing

2.1. Overview of the data

The data corpus released for the Blizzard Challenge 2019 consists of Mandarin talkshows, all read by the same male speaker in talkshow style. The database contains approximately 480 utterances with a total duration of 8 hours. The sampling rate is 24 kHz. All files in the database are in mp3 format.

Most of the utterances in the speech files are very rich in emotion, with a large number of interjection words. They also include many phrase breaks that are unique to the speaker. The testing transcriptions given by the organizer are drawn from news, long facts, English words, abbreviations, numbers, letters, Chinese poetry, etc.

2.2. Preprocessing

Although the recordings in the training data meet a certain standard, the quality of the talkshows may not be ideal for parametric speech synthesis. Moreover, some sentences in the training data are very long, which puts a burden on model training. Thus, the talkshow data is preprocessed as follows.

2.2.1. Segment and Alignment

The provided training data contained mismatches between the speech data and the text. These mismatches were caused by misreadings of the text, or by phrases or words that do not exist in the text, and they negatively impact the parameter mapping of TTS. As is well known, it is preferable for the training corpus to use text that fully matches the speech data. To overcome this problem, we investigated the automatic construction of a training corpus from the talkshows using a speech recognizer: it is expensive to manually obtain a large amount of aligned text, so a speech recognizer is used to recognize the text of the speech data.

Figure 1: Segment long sentences and align the text with its speech by using a speech recognizer.

Figure 1 gives an overview of the alignment method. First, we segment the long speech into short speech files with voice activity detection tools (https://webrtc.org/); second, all short speech files are recognized into their corresponding text transcriptions with the help of the Baidu online speech recognition service (https://ai.baidu.com/tech/speech/asr).
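As a rough illustration of the segmentation step (not the exact pipeline used in our system), the following sketch splits a long recording at silences with the py-webrtcvad bindings of the WebRTC VAD. It assumes the mp3 files have already been converted to 16 kHz, 16-bit mono PCM WAV, since webrtcvad only accepts 8/16/32/48 kHz input; the frame size and silence threshold are illustrative.

```python
# Minimal sketch: split a long recording into speech chunks with WebRTC VAD.
# Assumes 16 kHz, 16-bit mono PCM WAV input; thresholds are illustrative only.
import wave
import webrtcvad

def split_on_silence(wav_path, frame_ms=30, max_silence_frames=10):
    vad = webrtcvad.Vad(2)                       # aggressiveness 0 (loose) .. 3 (strict)
    with wave.open(wav_path, "rb") as wf:
        sample_rate = wf.getframerate()          # expected: 16000
        pcm = wf.readframes(wf.getnframes())
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 16-bit samples -> 2 bytes each
    chunks, current, silence = [], bytearray(), 0
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[i:i + frame_bytes]
        if vad.is_speech(frame, sample_rate):
            current.extend(frame)
            silence = 0
        else:
            silence += 1
            if current and silence > max_silence_frames:    # a long pause ends the chunk
                chunks.append(bytes(current))
                current, silence = bytearray(), 0
    if current:
        chunks.append(bytes(current))
    return chunks   # each chunk would then be sent to the ASR service for transcription
```

Each resulting chunk is then transcribed by the online speech recognition service to obtain its aligned text.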
Table 1 shows examples of the true content of a speech file, its text transcription from the training data, and its recognized text. In the example in Table 1, the word "J", which does not exist in written text, is present in the speech data. In addition, some of the speech files in the training data are too quiet to hear clearly, so we deleted these files. Through the above pre-processing, the size of our training data becomes 8630 utterances with a total duration of 7.68 hours.

Table 1: Examples of the true content of a speech file, its text transcription from the training data, and its recognized text.

    true content        /·ã˝ π’J
    text transcription  /·ã˝ π’
    recognized text     /·ã˝ π’J

2.2.2. Text Normalization

The aligned text data were manually checked. First, because the speech recognizer returns some erroneous transcriptions, the text was manually corrected to match the speech content. Second, due to mistakes in the voice activity detection, some incomplete or incoherent words in the speech waveform had been annotated in the text. Furthermore, in order to facilitate model training, we replaced all other punctuation marks with commas, and converted the digits, English words, letters, etc. in the text into Chinese word representations through regular expressions.

3. System Description

The training framework and the synthesis module of our system are shown in Figure 2. We introduce them in order below.

Figure 2: Training and synthesis phases of the IMU speech synthesis system. The prosody embedding is connected to the input of the encoder.

3.1. Training phase

In the training phase, we add extra prosody information to the end-to-end system according to the prosodic phrase break labels, so as to achieve high-quality natural speech with correct prosodic phrasing. The extra information serves as a local condition to control the prosody, as shown in the top part of Figure 2. We map the break label of each word into a one-hot vector called the "prosody embedding" (PE). The character embeddings (CE) are concatenated with the PE and sent to the Tacotron model. (The "characters" mentioned in this paper are pinyin; we use the open-source tool pypinyin, https://pypi.org/project/pypinyin/, to convert a Chinese word into its pinyin sequence.) Because the PE and the CE have different time resolutions, we need to upsample the PE to the length of the character-level sequence. We first make N copies of the PE, where the number of copies equals the number of characters in the word. Then we concatenate the PE and the original character-level embeddings across all characters in the word into a joint vector, which is used as the new character-level representation for the Tacotron model.

3.2. Synthesis phase

Before synthesis, the given long sentences are segmented into short sentences. To synthesize speech, each short sentence is enriched by the "interjection prediction" model and the "phrase break prediction" model. During text analysis we found that the data contains many interjections, which carry most of the emotional information in the speech. In order to better retain the speaker's emotional information, we perform interjection prediction on the text to be synthesized. Moreover, phrase breaks play an important role in both the naturalness and the intelligibility of speech [6, 7, 8, 9]: they break long utterances into meaningful units of information and make the speech more understandable. Therefore, identifying the break boundaries of prosodic phrases in the given text is crucial for speech synthesis.

As the bottom half of Figure 2 shows, as a complement to the main Tacotron model, the "interjection prediction" and "phrase break prediction" models learn the prosody features and the characteristic interjections of the particular speaker from the input text.

3.2.1. Interjection prediction model

We regard interjection prediction as a sequence labeling task. According to the statistics of interjections in the training data, we select the top eight interjections as prediction targets: J, P, ', @, , Œ, f, b. Their frequencies in the training data are shown in Table 2.

Table 2: The statistics of the top eight interjections in the training data.

    J 3876    P 328    ' 180    @ 712     674    Œ 726    f 146    b 1486

We use a Bidirectional Long Short-Term Memory (BiLSTM) model to predict the interjection for each word. In this module, each word in an input sentence is mapped to a sequence of word embeddings (WE_1, ..., WE_t) by pre-trained embeddings. Given the input vector sequence (WE_1, ..., WE_t), the forward LSTM reads it from left to right, while the backward LSTM reads it in reverse order:

\[ \overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(WE_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(WE_t, \overleftarrow{h}_{t-1}) \tag{1} \]

Finally, we use a softmax layer to produce the interjection labels. The softmax calculates a normalised probability distribution over all possible labels of each word:

\[ P(y_t = k \mid d_t) = \frac{e^{W_{o,k} d_t}}{\sum_{k' \in K} e^{W_{o,k'} d_t}} \tag{2} \]

where P(y_t = k | d_t) is the probability of the label of the t-th word (y_t) being k, K is the set of all possible labels, and W_{o,k} is the k-th row of the output weight matrix W_o.

After the interjection prediction, the original interjection-free text, now embedded with the predicted interjections, is richer in emotion, which helps the model produce better synthesized speech.
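To make the structure of this labeller concrete, below is a minimal sketch (not the code of our submission) of a BiLSTM sequence-labelling model in tf.keras, following the description above and the hyperparameters reported later in Section 3.2.4. The vocabulary size is an illustrative placeholder, and the same skeleton applies to the phrase break prediction model of Section 3.2.2.

```python
# Sketch of a word-level BiLSTM sequence labeller: embedding look-up -> BiLSTM -> softmax.
import tensorflow as tf

VOCAB_SIZE = 50_000      # illustrative; the real look-up table uses the Tencent embeddings [11]
EMB_DIM = 200            # 200-dimensional pre-trained word embeddings
NUM_LABELS = 9           # eight interjections plus "none" (for breaks: "B" / "NB")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),   # word-embedding look-up
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(200, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_LABELS, activation="softmax")),
])
model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=1.0),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_word_id_sequences, padded_label_sequences, batch_size=64, ...)
```

In practice the embedding layer would be initialized from the pre-trained 200-dimensional word vectors rather than trained from scratch.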
3.2.2. Phrase break prediction model

We extend the original Tacotron model by adding a "phrase break prediction" model, which takes the word sequence of the given text as input and computes the corresponding break labels with a BiLSTM network. In this model, each word in an input sentence is mapped to a sequence of word embeddings (WE_1, ..., WE_t) by a look-up table and passed through a BiLSTM. Finally, a softmax layer is again used to produce the phrase break labels.

The PE and the CE (from the interjection-enriched text) are then concatenated, using the upsampling method described in Section 3.1, to create a new character-level encoder input. We assume that this improved input sequence carries more realistic prosody and emotional information.

3.2.3. Tacotron model

For the proposed model, the main task is to generate speech spectrograms from the input character-level features, based on the original Tacotron model.

Figure 3: Detailed network architecture of the Tacotron model.

The Tacotron model is a complex neural network architecture, as shown in Figure 3. It contains a multi-stage encoder-decoder based on a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN). The character embeddings of the raw text are fed into an encoder that generates attention features, and the generated features are fed into every step of the decoder before generating spectrograms. Finally, the generated spectrograms are converted to a waveform by the Griffin-Lim method [10]. Our implementation is forked from the TensorFlow implementation by Keith Ito (https://github.com/keithito/tacotron), which is faithful to the original Tacotron paper. We first convert the input text into a character (pinyin) sequence. The character inputs are then converted into one-hot vectors, which are turned into character embeddings and fed to a pre-net, a multilayer perceptron. The output is fed into the CBHG, which stands for Convolutional Bank + Highway network + Gated recurrent unit (GRU). The output of the CBHG is the final encoder representation used by the attention module. The generated attention features are then fed into an RNN-based decoder. At every step the decoder generates a few frames of the spectrogram; the number of output frames is controlled by a hyperparameter, the reduction factor (r in Figure 3). Finally, the audio is reconstructed by the Griffin-Lim method.
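As an illustration of the grapheme-to-pinyin front end (the "characters" in this paper are pinyin syllables produced by pypinyin), the snippet below shows one way to obtain a tone-numbered pinyin sequence. The exact pypinyin options used in our system are not specified here, and polyphonic characters are left to the tool's default behaviour, a limitation discussed in Section 4.

```python
# Hypothetical front-end call: convert a Mandarin sentence into tone-numbered
# pinyin syllables with pypinyin (the options used in the actual system may differ).
from pypinyin import lazy_pinyin, Style

text = "语音合成"                                  # any Mandarin input sentence
syllables = lazy_pinyin(text, style=Style.TONE3)   # tone appended as a digit, e.g. "yu3"
print(" ".join(syllables))                         # -> "yu3 yin1 he2 cheng2"
```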
Figure 4: Naturalness ratings. System O is the proposed system, A is the natural speech, and B is the Merlin benchmark.

Figure 5: Speaker similarity ratings. System O is the proposed system, A is the natural speech, and B is the Merlin benchmark.

Figure 6: Pinyin (without tone) error rate. System O is the proposed system, A is the natural speech, and B is the Merlin benchmark.

Figure 7: Pinyin (with tone) error rate. System O is the proposed system, A is the natural speech, and B is the Merlin benchmark.
3.2.4. Model setup

The prosodic phrase breaks for the training data were labelled by expert annotators, both by listening to the utterances and by reading the transcriptions. At the end of this process each word is labelled "B" or "NB" depending on the location of the breaks introduced by the speaker, and we set the dimension of the PE to 2. For the interjection model, we first removed the interjections from the training data and then treated them as labels of the preceding word during model training; the dimension of the interjection one-hot vector is set to 9.

For the above two models, the input tokens fed into the BiLSTM network are transformed into embedding features by a look-up table. For Mandarin, we choose the "Tencent AI Lab Embedding Corpus for Chinese Words and Phrases" [11]. This corpus provides 200-dimensional vector representations (embeddings) for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data.

The LSTM layer size was set to 200 in both directions for all experiments. The size of the hidden layer d is 50. We set the learning rate to 1.0 and the batch size to 64. All parameters were optimised with the AdaDelta algorithm. The training/validation/test split is 8:1:1. At every epoch we measure performance on the training set, and we stop training if it does not improve for seven epochs. The best model from the training stage was then used for evaluation on the test set. The final accuracies of our interjection model and phrase break model were 97.24% and 90.02%, respectively. For the Tacotron model, we use the Adam optimizer with learning rate decay, and the network was trained for 200k steps.
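To make the prosody-embedding upsampling of Section 3.1 concrete under the dimensions given above (a 2-dimensional B/NB prosody embedding repeated over each word's pinyin characters), here is a minimal NumPy sketch; the word lengths, the character-embedding size, and the values are illustrative, not taken from the actual system.

```python
# Sketch of the prosody-embedding upsampling described in Section 3.1.
import numpy as np

def upsample_and_concat(char_embs, word_lens, break_labels, pe_table):
    """char_embs: (num_chars, emb_dim) pinyin-level embeddings;
    word_lens: characters per word; break_labels: one B/NB label per word."""
    pe_rows = []
    for n_chars, label in zip(word_lens, break_labels):
        pe_rows.append(np.repeat(pe_table[label][None, :], n_chars, axis=0))  # copy PE n_chars times
    pe = np.concatenate(pe_rows, axis=0)              # now aligned with the character sequence
    return np.concatenate([char_embs, pe], axis=-1)   # joint character-level input for Tacotron

pe_table = {"B": np.array([1.0, 0.0]), "NB": np.array([0.0, 1.0])}  # 2-d one-hot prosody embeddings
chars = np.random.randn(5, 256)        # e.g. 5 pinyin characters with 256-d embeddings (illustrative)
out = upsample_and_concat(chars, word_lens=[2, 3], break_labels=["NB", "B"], pe_table=pe_table)
print(out.shape)                       # (5, 258)
```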
4. Results and Analysis

In this section, we discuss the evaluation results in detail. Our designated system identification letter is "O". System A is the natural speech, system B is the Merlin benchmark system, and the others are participant systems. The subjects involved in the listening test include paid listeners, online volunteers and others. There are a total of 2546 sentences in the 2019 testing set. The evaluation results of all systems are shown in Figures 4, 5, 6 and 7.

Figure 4 shows the naturalness ratings presented as box plots, where the central solid bar marks the median, the shaded box represents the quartiles, and the extended dashed lines show 1.5 times the quartile range. The most relevant comparison can be made with the other known synthesis system, namely system B, the Merlin benchmark. As can be seen from the results, our system ranks 19th among all submitted systems, excluding natural speech. The results show that our proposed system O outperforms the Merlin benchmark B. There is a big gap between our system and the champion system M, and our system also differs from other systems in a statistical sense. As far as we know, the champion system M uses a neural vocoder while we only use the simple Griffin-Lim algorithm, so such an obvious gap is to be expected.

Speaker similarity scores are presented in Figure 5 with similar box plots. The results show that several systems, such as J, X, K, F and R, achieve results comparable to ours, and that the proposed system performs slightly better than the Merlin benchmark in speaker similarity while having lower similarity than most other systems, such as M, S, Y, Z, E and C. Four possible reasons may have led to the relatively low similarity score. First, in the interjection prediction model, we only select a subset of interjections (the top eight) as prediction targets and miss the rest; moreover, the interjection prediction model itself has a certain error rate. Second, in the prosodic prediction model, we did not address the prosodic hierarchy in detail, and the prosodic prediction also has errors. Third, we believe that the failure to extract the speaker's characteristics is an important reason for the relatively low similarity score. Fourth, long-sentence speech is generated by concatenating short sentences, which ignores the importance of pauses between sentences.

As shown in Figure 6 and Figure 7, the pinyin (without tone) error rate and the pinyin (with tone) error rate of our system are both about 18% in the intelligibility test, ranking 12th among all submitted systems. We found that the 2019 testing data contains digits, English words and letters, and classical Chinese poetry. Classical poetry contains a large number of polyphonic characters, and we only use the open-source tool pypinyin to convert words to their pinyin (with tone) sequences, without any special treatment of their pronunciation. In addition, English letters are not processed separately. These were among the factors degrading intelligibility.

To sum up, we summarize the problems in the system as follows:

1) The interjection model and the prosodic prediction model are too simple and leave much room for improvement.

2) The conversion of Chinese words to pinyin relies only on open-source tools, and special cases such as digits, English words or letters, and classical Chinese poetry receive no extra processing.

3) In order to retain the speaker's style and emotional information, we only conduct a simple analysis at the textual level, without considering the speaker's acoustic characteristics [12, 13, 14], which is a serious shortcoming.

4) The final result is obtained by concatenating short sentences into long sentences. In this concatenation, there is inevitably discontinuity at the connection points, which also has a negative impact on speech quality.

5) Last but not least, our system uses the Griffin-Lim algorithm as the vocoder to convert acoustic features to the waveform; we did not use a neural network vocoder to model the speech waveform. Given the successful application of WaveNet [5] in TTS, WaveNet-based vocoders are widely used in the field. By conditioning WaveNet on acoustic features, a WaveNet-based neural vocoder can learn the relationship between acoustic features and waveform samples automatically, and it can replace the conventional vocoder so that waveforms are generated directly. The quality of the synthetic speech is expected to improve further with such a vocoder.

5. Conclusions

We have introduced the IMU speech synthesis system for Blizzard Challenge 2019. The listening test results for our system are not good, but we have identified many interesting problems that we should have attacked. According to the subjective evaluation results, there is still much room for improvement in our method.
6. Acknowledgements

This research was supported by the China National Natural Science Foundation (No. 61563040, No. 61773224) and the Inner Mongolian Natural Science Foundation (No. 2016ZD06).

7. References

[1] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in ICASSP 2013 – 38th IEEE International Conference on Acoustics, Speech and Signal Processing, May 26-31, Vancouver, BC, Canada, Proceedings, 2013, pp. 7962–7966.

[2] Z. Ling, S. Kang, H. Zen, A. Senior, M. Schuster, X. Qian, H. M. Meng, and L. Deng, "Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends," IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015.

[3] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, Proceedings, 2017, pp. 4006–4010.

[4] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," 2017.

[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Z. Yu, Y. Wang, and R. J. Skerry-Ryan, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," 2017.

[6] Y. Zheng, Y. Li, Z. Wen, X. Ding, and J. Tao, "Improving prosodic boundaries prediction for Mandarin speech synthesis by using enhanced embedding feature and model fusion approach," in INTERSPEECH 2016 – 17th Annual Conference of the International Speech Communication Association, September 8-12, San Francisco, CA, USA, Proceedings, 2016, pp. 3201–3205.

[7] Y. Zheng, J. Tao, Z. Wen, and Y. Li, "BLSTM-CRF based end-to-end prosodic boundary prediction with context sensitive embeddings in a text-to-speech front-end," in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, Proceedings, 2018, pp. 3062–3066.

[8] R. Liu, F. Bao, G. Gao, and W. Wang, "A LSTM approach with sub-word embeddings for Mongolian phrase break prediction," in COLING 2018 – 27th International Conference on Computational Linguistics, August 20-26, Santa Fe, New Mexico, USA, Proceedings, 2018, pp. 2448–2455.

[9] R. Liu, F. Bao, G. Gao, and W. Wang, "Improving Mongolian phrase break prediction by using syllable and morphological embeddings with BiLSTM model," in INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, Proceedings, 2018, pp. 57–61.

[10] D. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[11] Y. Song, S. Shi, J. Li, and H. Zhang, "Directional skip-gram: Explicitly distinguishing left and right context for word embeddings," in NAACL 2018 – 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 1-6, New Orleans, USA, Proceedings, 2018, pp. 175–180.

[12] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, and I. L. Moreno, "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," 2018.

[13] D. Stanton, Y. Wang, and R. Skerry-Ryan, "Predicting expressive speaking style from text in end-to-end speech synthesis," arXiv preprint arXiv:1808.01410, 2018.

[14] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," arXiv preprint arXiv:1803.09047, 2018.

[15] INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, Proceedings, 2018.