IMPROVEMENTS TO PROSODIC ALIGNMENT FOR AUTOMATIC DUBBING
Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote
Amazon

ABSTRACT

Automatic dubbing is an extension of speech-to-speech translation such that the resulting target speech is carefully aligned to the original speaker in terms of duration, lip movements, timbre, emotion, prosody, etc., in order to achieve audiovisual coherence. Dubbing quality strongly depends on isochrony, i.e., arranging the translation of the original speech to optimally match its sequence of phrases and pauses. To this end, we present improvements to the prosodic alignment component of our recently introduced dubbing architecture. We present empirical results for four dubbing directions – English to French, Italian, German and Spanish – on a publicly available collection of TED Talks. Compared to previous work, our enhanced prosodic alignment model significantly improves prosodic alignment accuracy and provides segmentation perceptibly better or on par with manually annotated reference segmentation.

Index Terms— speech translation, text to speech, automatic dubbing

1. INTRODUCTION

Automatic Dubbing (AD) is the task of automatically replacing the speech in a video document with speech in a different language, while preserving as much as possible the user experience of the original video. AD differs from speech translation [1, 2, 3, 4] in significant ways. In speech translation, a speech utterance in the source language is recognized, translated (and possibly synthesized) into the target language. In speech translation, close to real-time response is expected, and typical use cases include human-to-human interaction, traveling, live lectures, etc. On the other hand, AD tries to automate the localization of audiovisual content, a complex and demanding workflow [5] managed during post-production by dubbing studios. A major requirement of dubbing is speech synchronization which, in order of priority, should happen at the utterance level (isochrony), lip movement level (lip synchrony), and body movement level (kinetic synchrony) [5]. Most of the work on AD [6, 7, 8], including this one, addresses isochrony, which aims to generate translations and utterances that match the phrase-pause arrangement of the original audio. Given a source sentence transcript, the first step is to generate a translation of more or less the same "duration" [9, 10], e.g. in number of characters or syllables. The second step, called prosodic alignment (PA) [6, 7, 8], segments the translation into phrases and pauses of the same duration as the original phrases.
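As a toy illustration of the first step (keeping the translation's verbosity close to the source), the sketch below compares source and translation by character count, the crude proxy for spoken duration mentioned above. The helper names and the 10% tolerance are our own illustrative choices, not part of the paper's MT system.

```python
# Toy verbosity check: character count as a crude proxy for spoken duration.
# The 10% tolerance is an arbitrary illustrative choice, not a value from the paper.

def length_ratio(source: str, translation: str) -> float:
    """Ratio of translation length to source length, in characters (spaces ignored)."""
    src_len = len(source.replace(" ", ""))
    tgt_len = len(translation.replace(" ", ""))
    return tgt_len / max(src_len, 1)

def is_isochrony_friendly(source: str, translation: str, tolerance: float = 0.10) -> bool:
    """True if the translation length is within +/- tolerance of the source length."""
    return abs(length_ratio(source, translation) - 1.0) <= tolerance

if __name__ == "__main__":
    src = "Thank you so much, it is a real pleasure to be here."
    hyp = "Merci beaucoup, c'est un vrai plaisir d'être ici."
    print(length_ratio(src, hyp), is_isochrony_friendly(src, hyp))
```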
This paper focuses on the PA step, by comparing previous [6, 7, 8] and new methods that allow us to optimally segment and temporally align a translation with the original phrases. Differently from previous work, we perform intrinsic and extrinsic evaluations of PA on a significantly larger test set extracted from the MuST-C corpus [11] and on four dubbing directions, English (en) to French (fr), Italian (it), German (de) and Spanish (es). Intrinsic evaluations measure the accuracy, fluency and smoothness of PA with respect to manually post-edited and segmented translations, while extrinsic evaluations measure the subjective quality of video clips dubbed by applying different PA models on the human translations.

Our paper is arranged as follows. First, we describe the automatic dubbing architecture used for our experiments; then we focus on existing and new PA methods; finally, we present and discuss experimental results of all compared methods.

Fig. 1. Speech translation pipeline (dotted box) with enhancements introduced to perform automatic dubbing (in bold).

2. DUBBING ARCHITECTURE

We build on the automatic dubbing architecture presented in [7, 8] and described in Figure 1, which extends a speech-to-speech translation [1, 2, 3] pipeline with: neural machine translation (MT) robust to ASR errors and able to control the verbosity of the output [12, 10, 13]; prosodic alignment (PA) [6], which addresses phrase-level synchronization of the MT output by leveraging the force-aligned source transcript; neural text-to-speech (TTS) [14, 15, 16] with precise duration control; and, finally, audio rendering that enriches the TTS output with the original background noise (extracted via audio source separation with deep U-Nets [17, 18]) and reverberation, estimated from the original audio [19, 20].
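To make the pipeline structure concrete, here is a minimal sketch of how the stages in Figure 1 could be composed. All function names, types and signatures are placeholders we introduce for illustration; they are not the actual components of the system described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Utterance:
    text: str                          # transcript or translation
    segments: List[Tuple[float, float]]  # (start, end) times of phrases, in seconds

# Each stage is a plain callable; a real system would wrap ASR, MT, PA, TTS and rendering models.
def run_dubbing_pipeline(
    source_audio: bytes,
    asr: Callable[[bytes], Utterance],
    mt: Callable[[str], str],
    prosodic_alignment: Callable[[str, Utterance], Utterance],
    tts: Callable[[Utterance], bytes],
    render: Callable[[bytes, bytes], bytes],
) -> bytes:
    source = asr(source_audio)                         # transcript + force-aligned phrase times
    translation = mt(source.text)                      # verbosity-controlled translation
    aligned = prosodic_alignment(translation, source)  # phrase/pause segmentation of the translation
    dubbed_speech = tts(aligned)                       # duration-controlled synthesis
    return render(source_audio, dubbed_speech)         # re-add background noise and reverberation
```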
3. RELATED WORK

In the past, there has been little work to address prosodic alignment for automatic dubbing [6, 7, 8]. The work of [6] utilized the attention mechanism of neural machine translation to achieve isochrony. While this approach achieves linguistic similarity between corresponding source and target phrases, it has no mechanism to explicitly control for uneven or extreme speaking rates that can cause unnatural-sounding dubbing. This was partly addressed in [7] by segmenting the translation according to the length similarity between corresponding source-target phrases. Moreover, [7] proposed a more efficient implementation based on dynamic programming as opposed to generating and rescoring segmentation hypotheses [6]. More recently, [8] further improved on speech fluency by controlling not only the speaking rate match between corresponding source-target phrases but also the speaking rate variation across consecutive target phrases. Additionally, it introduced a mechanism to relax the timing constraints to cope with too high TTS speaking rates. While it achieves much smoother speech, the human evaluation of the automatically dubbed videos revealed that users still prefer the human-annotated reference segmentation over the one produced by the model. Finally, another limitation of that study is the use of a rather small test set. In this work, we address these issues by using a much larger dataset and by combining the advantages of the content-based [6] and fluency-based [7, 8] approaches. In Sec. 6, we provide a direct comparison of our present work with [8] and an indirect comparison of our work with [6].

4. PROSODIC ALIGNMENT

The goal of PA [6, 7, 8] is to segment the target sentence to optimally match the sequence of phrases and pauses in the source utterance. Let e = e_1, e_2, ..., e_n be a source sentence of n words, segmented according to k breakpoints 1 <= i_1 < i_2 < ... < i_k = n, denoted with i. Let the temporal duration of e be T and the temporal intervals of the segmentation i be s_1 = [l_1, r_1], ..., s_k = [l_k, r_k], denoted by s, such that l_1 >= \Delta, l_i < r_i, l_{i+1} - r_i >= \Delta, T - r_k >= \Delta, where \Delta is the minimum silence interval after (and before) each break point. Given a target sentence f = f_1, f_2, ..., f_m of m words, the goal is to find k breakpoints 1 <= j_1 < j_2 < ... < j_k = m (denoted with j) that maximize the probability:

    max_j  log Pr(j | i, e, f, s)                                          (1)

By assuming a Markovian dependency on j, i.e.:

    log Pr(j | i, e, f, s) = \sum_{t=1}^{k} log Pr(j_t | j_{t-1}; t, i, e, f, s)    (2)

and omitting from the notation the constant terms i, e, f and s, we derive the following recurrent quantity:

    Q(j, t) = max_{j'} [ log Pr(j | j'; t) + Q(j', t - 1) ]                (3)

As in [8], each segment interval s_t can be relaxed by left and right relaxations \delta_l, \delta_r into an extended interval s*_t, constrained so that the left relaxation of segment t does not overlap with the right relaxation of segment t - 1. Additionally, owing to negative relaxations, we note that we have implicit though trivial restrictions on the choice of \delta_l, \delta_r, \delta'_r, \delta'_l to ensure |s*_t| > 0 and |s*_{t-1}| > 0.

We define the model probability with a log-linear model:

    log Pr(j, \delta_l, \delta_r | ...; t) \propto \sum_{a=1}^{5} w_a log s_a(j, \delta_l, \delta_r, ...; t)    (5)

where the weights w_a are learned from data and the functions s_a model the following features:

1. Language model score of the target break point.
2. Cross-lingual semantic match score across source and target segments.
3. Speaking rate variation across target segments.
4. Speaking rate match across source and target segments.
5. Isochrony score for left and right relaxations.

Speaking rate computations rely on the strings ẽ_t and f̃_t, denoting the t-th source and target segments, as well as the original interval s_t and the relaxed interval s*_t. Hence, the speaking rate of a source (target) segment is computed by taking the ratio between the duration of the utterance produced by the source (target) TTS run at normal speed and the source (target) interval length, i.e.:

    r_e(t) = duration(TTS_e(ẽ_t)) / |s_t|                                  (6)

    r_f(t) = duration(TTS_f(f̃_t)) / |s*_t|                                 (7)

4.1. Language model

As in previous work [7, 8], we define a language model score that estimates the probability of placing a break between consecutive target words f_{t-1} and f_t:

    s_lm(j, f_{t-1}, f_t) = Pr(br | f_{t-1}, f_t)
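The recurrence in Eq. (3) lends itself to a standard dynamic-programming (Viterbi-style) search over breakpoint positions. The sketch below is our simplified illustration: it drops the relaxation variables and treats the log-linear score of Eq. (5) as a black-box callable, so it is not the authors' implementation.

```python
from typing import Callable, List, Tuple

def align_breakpoints(
    m: int,                                        # number of target words
    k: int,                                        # number of segments; j_k = m is forced
    log_score: Callable[[int, int, int], float],   # score of segment t spanning words (j_prev, j]
) -> Tuple[float, List[int]]:
    """Find breakpoints 1 <= j_1 < ... < j_k = m maximizing the sum of segment scores (Eq. 3)."""
    NEG = float("-inf")
    # Q[t][j]: best score of segmenting the first j target words into t segments (break after word j).
    Q = [[NEG] * (m + 1) for _ in range(k + 1)]
    back = [[0] * (m + 1) for _ in range(k + 1)]
    Q[0][0] = 0.0
    for t in range(1, k + 1):
        for j in range(t, m + 1):                  # at least one word per segment
            for j_prev in range(t - 1, j):
                prev = Q[t - 1][j_prev]
                if prev == NEG:
                    continue
                cand = prev + log_score(j_prev, j, t)
                if cand > Q[t][j]:
                    Q[t][j] = cand
                    back[t][j] = j_prev
    # Recover the breakpoint sequence ending at j_k = m.
    breaks, j = [], m
    for t in range(k, 0, -1):
        breaks.append(j)
        j = back[t][j]
    return Q[k][m], breaks[::-1]

if __name__ == "__main__":
    # Toy score preferring segments of roughly equal length (stands in for the model of Eq. 5).
    target_len, n_segments = 12, 3
    ideal = target_len / n_segments
    score = lambda j_prev, j, t: -abs((j - j_prev) - ideal)
    print(align_breakpoints(target_len, n_segments, score))   # -> (0.0, [4, 8, 12])
```

The search runs in O(k m^2) score evaluations, which is what makes the dynamic-programming formulation of [7] preferable to generating and rescoring whole segmentation hypotheses.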
4.2. Cross-lingual semantic match

We provide a quantitative comparison of different multilingual embedding models in Table 3. Note that the accuracy of the prosodic alignment depends strongly on matching the linguistic content between the source and target phrases, and our past work [8] shows that accuracy strongly impacts the subjective viewing experience. Hence, we expect to benefit from using such a content-matching feature.

4.3. Speaking rate variation

As in [8], we penalize hypotheses resulting in high variation in speaking rates for consecutive target phrases. We define the speaking rate variation score s_sv(.) as follows:

    s_sv(f̃_t, s*_t, f̃_{t-1}, s*_{t-1}; t) = 1 - |r_f(t) - r_f(t-1)| / r_f(t-1)    (10)

Naturally, this feature works for t >= 2 and reaches its maximum value when consecutive phrases have the same speaking rate.

4.4. Speaking rate match

We realized that the speaking rate match score in [8] does not take into account global information of the target sentence, such as its verbosity relative to the source sentence. Hence, to better match speaking rates between target and source segments, we introduce the factor:

    \beta = duration(TTS_f(f)) / duration(TTS_e(e))                         (11)

which computes the ratio of the duration of the target sequence f and the source sequence e, both synthesized using TTS at normal speaking rate. Thus, we define the speaking rate match score as:

    s_sm(ẽ_t, f̃_t, s_t, s*_t; t) = 1 - |r_f(t) - \beta r_e(t)| / (\beta r_e(t))    (12)

This feature reaches its maximum when the target speaking rate is identical to the scaled source speaking rate.

4.5. Isochrony score

We extend the isochrony score s_is(.) of [8] to positive and negative relaxations \delta_l, \delta_r, as:

    s_is(\delta_l, \delta_r) = 1 - [\alpha |\delta_l| + (1 - \alpha) |\delta_r|]    (13)

This feature reaches its maximum when no relaxation occurs (\delta_r = \delta_l = 0), that is, when the TTS output is stretched to exactly fit the duration of the original utterance. The relaxation mechanism of [8] is able to mitigate only very high speaking rates. To mitigate very low speaking rates, we introduce negative values for relaxation and hence use absolute values for \delta_l, \delta_r in (13). Since relaxations are less tolerated at the beginning than at the end of a phrase [5], we set \alpha > 4/5 such that left relaxation is always penalized more than right relaxation, i.e.:

    \alpha |\delta_l| > (1 - \alpha) |\delta_r|   for all \delta_l, \delta_r in {0, ±1/4, ±2/4, ±3/4, ±1}    (14)
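To make Eqs. (6)-(13) concrete, the sketch below implements the speaking-rate and isochrony features as plain functions. TTS durations, interval lengths and the example values are passed in as precomputed numbers of our own choosing; the function and argument names are ours as well.

```python
def speaking_rate(tts_duration: float, interval_length: float) -> float:
    """Eqs. (6)-(7): segment duration synthesized at normal speed over its time interval."""
    return tts_duration / interval_length

def speaking_rate_variation(rf_t: float, rf_prev: float) -> float:
    """Eq. (10): penalize changes of target speaking rate across consecutive segments (t >= 2)."""
    return 1.0 - abs(rf_t - rf_prev) / rf_prev

def speaking_rate_match(rf_t: float, re_t: float, beta: float) -> float:
    """Eq. (12): match the target rate to the source rate scaled by the verbosity factor beta (Eq. 11)."""
    return 1.0 - abs(rf_t - beta * re_t) / (beta * re_t)

def isochrony_score(delta_l: float, delta_r: float, alpha: float = 0.85) -> float:
    """Eq. (13): penalize left/right relaxations; alpha > 4/5 penalizes left relaxations more.
    The default alpha used here is an arbitrary value satisfying that constraint."""
    return 1.0 - (alpha * abs(delta_l) + (1.0 - alpha) * abs(delta_r))

if __name__ == "__main__":
    beta = 6.1 / 5.4                   # target/source full-sentence TTS durations (made-up numbers)
    rf, rf_prev, re = 1.1, 1.0, 0.95   # made-up speaking rates
    print(speaking_rate_variation(rf, rf_prev))
    print(speaking_rate_match(rf, re, beta))
    print(isochrony_score(0.0, 0.25))
```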
Isochrony score same segmentation (and hence accuracy) of model B but should have better smoothness. We extend the isochrony score sis (·) of [8] to positive and negative relaxations δl , δr , as: 6.2. Automatic evaluation Table 1 shows the results of automatic evaluation. All the observed sis (δl , δr ) = 1 − [α |δl | + (1 − α) |δr |] (13) improvements of the new models B and C on our previous work A are statistically significant [24]. For all four languages, model This feature reaches its maximum when no relaxation occurs (δr = B outperforms model A with relative improvements in accuracy – δl = 0), that is when the TTS output is stretched to exactly fit the fr: +99.4%, it: +87.3%, de: +80.2%, es: +86.1%. This shows that duration of the original utterance. The relaxation mechanism of [8] the improvements in the speaking rate match feature, the language is able to mitigate only very high speaking rates. To mitigate very model and the addition of cross-lingual semantic match feature pro- low speaking rates, we introduce negative values for relaxation and vides a strong improvement in the scoring function. The reduction hence use absolute values for δl , δr in (13). Since relaxations are in fluency and smoothness metrics can be attributed to the fact that less tolerated at the beginning than at the end of a phrase [5], we set A uses relaxations and B does not. α > 45 such that left relaxation is always more penalized than the Table 3 shows a comparison of accuracy for PA model using right, i.e.: only the cross-lingual semantic match feature with different pre- trained models such as multilingual universal sentence encoder 1 2 3 α |δl | > (1 − α) |δr | ∀δl , δr ∈ 0, ± , ± , ± , ±1 (14) 4 Where (a, b)θ := θa + (1 − θ)b with θ ∈ [0.1] 4 4 4 3
6.2. Automatic evaluation

Table 1 shows the results of automatic evaluation. All the observed improvements of the new models B and C over our previous work A are statistically significant [24]. For all four languages, model B outperforms model A with relative improvements in accuracy – fr: +99.4%, it: +87.3%, de: +80.2%, es: +86.1%. This shows that the improvements in the speaking rate match feature, the language model and the addition of the cross-lingual semantic match feature provide a strong improvement in the scoring function. The reduction in the fluency and smoothness metrics can be attributed to the fact that A uses relaxations and B does not.

Table 1. Results of automatic evaluation with prosodic alignments: (A) previous work [8], (B) new model without relaxation, (C) new model with relaxation, and manual reference (R). The test set is made of 495 sentences. Significance testing is against model A, with levels p < 0.05 (+) and p < 0.01 (*).

          Metric        A        B        C        R
  en-fr   Accuracy    35.6%   70.9%*     -       100%
          Fluency     87.9%   62.8%*   82.6%*   59.2%
          Smoothness  82.4%   68.5%*   81.0%*   64%
  en-it   Accuracy    43.0%   80.6%*     -       100%
          Fluency     78.4%   50.7%*   59.6%*   52.3%
          Smoothness  82.9%   67.7%*   73.8%*   67.6%
  en-de   Accuracy    39.8%   71.7%*     -       100%
          Fluency     68.1%   57.6%*   73.5%*   59.1%
          Smoothness  70.2%   65.0%*   74.0%*   64.5%
  en-es   Accuracy    43.4%   80.8%*     -       100%
          Fluency     77.0%   44.9%*   52.1%*   44.2%
          Smoothness  77.4%   68.0%*   75.1%*   68.2%

Table 3 shows a comparison of accuracy for a PA model using only the cross-lingual semantic match feature with different pretrained models: multilingual universal sentence encoder (mUSE) [25], sentence BERT (SBERT) [26], language-agnostic BERT sentence embeddings (LaBSE) [27] and language-agnostic sentence representations (LASER) [23]. LASER substantially outperforms all other models on all languages for our dataset and is hence used for the cross-lingual semantic match feature.

Table 3. Model accuracy using only the cross-lingual semantic match feature with pretrained mUSE [25], SBERT [26], LaBSE [27] and LASER [23] embeddings; LASER significantly outperforms all other models on our dataset.

           mUSE      SBERT     LaBSE     LASER
  en-fr   46.06%    51.11%    50.3%     61.21%
  en-it   59.19%    63.84%    67.27%    74.95%
  en-de   54.14%    57.17%    59.19%    66.06%
  en-es   61.01%    64.04%    67.68%    74.55%
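One common way to realize a cross-lingual semantic match feature, and the spirit of the comparison in Table 3, is to embed the paired source and target phrases with a multilingual sentence encoder and score their similarity. Since the paper's exact scoring formula is not reproduced here, the sketch below simply uses cosine similarity with a pluggable encoder; in the paper, LASER embeddings are the selected encoder, while the stand-in encoder below only exists to make the example runnable.

```python
import numpy as np
from typing import Callable, List, Sequence

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def semantic_match_scores(
    source_phrases: Sequence[str],
    target_phrases: Sequence[str],
    embed: Callable[[Sequence[str]], np.ndarray],  # returns one vector per phrase
) -> List[float]:
    """Cosine similarity between the t-th source and t-th target phrase embeddings."""
    src = embed(source_phrases)
    tgt = embed(target_phrases)
    return [cosine(s, t) for s, t in zip(src, tgt)]

if __name__ == "__main__":
    # Stand-in encoder: random vectors, only to make the sketch self-contained and runnable.
    rng = np.random.default_rng(0)
    fake_embed = lambda phrases: rng.standard_normal((len(phrases), 16))
    print(semantic_match_scores(["good morning", "thank you"],
                                ["bonjour", "merci"], fake_embed))
```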
6.3. Human evaluation

For comparability with our previous work, we present results of a human evaluation on the same subset of 50 test sentences used in [8]. For each language and each test sentence, starting from the manually post-edited translation, we apply PA with models A and C, followed by the neural text-to-speech and audio rendering steps described in Sec. 2, to generate the dubbed videos. As reference (R), we also dubbed videos using the manual segmentation. We asked native speakers in each language to grade the viewing experience of each dubbed video on a scale of 0-10. To reduce the cognitive load, we perform two distinct evaluations, comparing two conditions in each case: A vs. C and C vs. R. We run evaluations using Amazon Mechanical Turk with 20 subjects each grading all videos, for a total of 2000 scores in each language.

We compare A, C and R using the Wins (percentage of times one condition is preferred over the other) and Score (average score of dubbed videos) metrics. We use a linear mixed-effects model (LMEM), defining subjects and sentences as random effects [29].

Table 2. Results of manual evaluations with prosodic alignments: (A) previous work [8], (C) new model with relaxation, and manual reference (R). The test set is made of 50 video clips. Significance testing is with levels p < 0.05 (+) and p < 0.01 (*).

           Manual      A vs. C           C vs. R
  en-fr    Wins      22.6%   52.4%*    39.6%   30.2%*
           Score     4.5     5.14*     4.96    4.76*
  en-it    Wins      26.9%   36.8%*    28.3%   27.9%
           Score     4.55    4.76*     4.49    4.49
  en-de    Wins      27.3%   47.6%*    38.8%   31.8%*
           Score     4.81    5.38*     5.24    5.15
  en-es    Wins      24.6%   34.5%*    20.8%   23.4%
           Score     4.56    4.78*     5.05    5.07

Table 2 shows that C clearly outperforms A on all languages for both Wins (fr: +131.8%, it: +36.8%, de: +74.4%, es: +40.2%) and Score (fr: +14.2%, it: +4.6%, de: +11.9%, es: +4.8%), with all results statistically significant (p < 0.01). Comparing C with R, for it and es we find no statistically significant difference for either metric. Instead, for de and fr we find +18% (+1.7%) and +23.7% (+4.2%) relative gains in Wins (Score), respectively. To understand why C outperforms R on fr and de, we used LMEMs to explain the observed Score variations by means of the three automatic metrics. We observe that for fr, smoothness is the only statistically significant factor (p < 0.01), while for de all three metrics are significant.
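For concreteness, the sketch below computes the Wins and Score metrics from paired per-clip ratings and tests the Score difference with a paired approximate randomization test in the spirit of [24]. It ignores the random-effects structure that the LMEM analysis accounts for, ties are excluded from the Wins numerator by assumption, and all example ratings are made up, so this is only an illustration.

```python
import random
from typing import List, Tuple

def wins_and_score(ratings_x: List[float], ratings_y: List[float]) -> Tuple[float, float, float]:
    """Wins of X over Y (ties excluded from the numerator) and the average Score of each condition."""
    wins_x = sum(1 for x, y in zip(ratings_x, ratings_y) if x > y)
    return wins_x / len(ratings_x), sum(ratings_x) / len(ratings_x), sum(ratings_y) / len(ratings_y)

def randomization_test(ratings_x: List[float], ratings_y: List[float],
                       n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference via sign-flipping permutations."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(ratings_x, ratings_y)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        perm = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(perm) / len(perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

if __name__ == "__main__":
    a = [4.5, 5.0, 4.0, 4.8, 5.2, 4.1]   # made-up per-clip scores for condition A
    c = [5.1, 5.4, 4.2, 5.0, 5.6, 4.6]   # made-up per-clip scores for condition C
    print(wins_and_score(c, a))
    print(randomization_test(c, a))
```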
8. REFERENCES

[1] F. Casacuberta, M. Federico, H. Ney, and E. Vidal, "Recent efforts in spoken language translation," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 80–88, 2008.
[2] Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen, "Sequence-to-Sequence Models Can Directly Translate Foreign Speech," in Proc. Interspeech, 2017, pp. 2625–2629.
[3] Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, and Marta R. Costa-jussà, "End-to-End Speech Translation with the Transformer," in Proc. IberSPEECH, 2018, pp. 60–63.
[4] Matthias Sperber and Matthias Paulik, "Speech Translation and the End-to-End Promise: Taking Stock of Where We Are," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7409–7421.
[5] Frederic Chaume, "Synchronization in dubbing: A translation approach," in Topics in Audiovisual Translation, Pilar Orero, Ed., pp. 35–52. John Benjamins B.V., 2004.
[6] A. Öktem, M. Farrús, and A. Bonafonte, "Prosodic Phrase Alignment for Machine Dubbing," in Proc. Interspeech, 2019.
[7] Marcello Federico, Robert Enyedi, Roberto Barra-Chicote, Ritwik Giri, Umut Isik, Arvindh Krishnaswamy, and Hassan Sawaf, "From Speech-to-Speech Translation to Automatic Dubbing," in Proceedings of the 17th International Conference on Spoken Language Translation, 2020, pp. 257–264.
[8] Marcello Federico, Yogesh Virkar, Robert Enyedi, and Roberto Barra-Chicote, "Evaluating and optimizing prosodic alignment for automatic dubbing," in Proc. Interspeech, 2020.
[9] Ashutosh Saboo and Timo Baumann, "Integration of Dubbing Constraints into Machine Translation," in Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), Florence, Italy, 2019, pp. 94–101.
[10] Surafel Melaku Lakew, Mattia Di Gangi, and Marcello Federico, "Controlling the output length of neural machine translation," in Proc. IWSLT, 2019.
[11] Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi, "MuST-C: a Multilingual Speech Translation Corpus," in Proc. NAACL, 2019, pp. 2012–2017.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. NIPS, 2017, pp. 5998–6008.
[13] Mattia Di Gangi, Robert Enyedi, Alessandra Brusadin, and Marcello Federico, "Robust neural machine translation for clean and noisy speech translation," in Proc. IWSLT, 2019.
[14] Nishant Prateek, Mateusz Lajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, and Trevor Wood, "In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data," in Proc. NAACL, 2019, pp. 205–213.
[15] Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, and Viacheslav Klimkov, "Effect of data reduction on sequence-to-sequence neural TTS," in Proc. ICASSP, 2019, pp. 7075–7079.
[16] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards Achieving Robust Universal Neural Vocoding," in Proc. Interspeech, 2019, pp. 181–185.
[17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. MICCAI. Springer, 2015, pp. 234–241.
[18] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. ISMIR, 2017.
[19] Heiner Löllmann, Emre Yilmaz, Marco Jeub, and Peter Vary, "An improved algorithm for blind reverberation time estimation," in Proc. IWAENC, 2010, pp. 1–4.
[20] Emanuel A. P. Habets, "Room impulse response generator," Tech. Rep. 2.4, Technische Universiteit Eindhoven, 2006.
[21] R. M. Ochshorn and M. Hawkins, "Gentle Forced Aligner," 2017.
[22] Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger, "Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi," in Proc. Interspeech, 2017, pp. 498–502.
[23] M. Artetxe and H. Schwenk, "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond," TACL, 2019.
[24] Eric W. Noreen, Computer-Intensive Methods for Testing Hypotheses: An Introduction, John Wiley & Sons, 1989.
[25] Y. Yang, D. Cer, M. Guo, A. Ahmad, J. Law, N. Constant, G. Abrego, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil, "Multilingual Universal Sentence Encoder for Semantic Retrieval," ACL, 2020.
[26] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP, 2019.
[27] F. Feng, Y. Tang, D. Cer, N. Arivazhagan, and W. Wang, "Language-Agnostic BERT Sentence Embedding," arXiv preprint arXiv:2007.01852, 2020.
[28] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker, "Fitting Linear Mixed-Effects Models Using lme4," Journal of Statistical Software, vol. 67, no. 1, pp. 1–48, 2015.
[29] Douglas Bates, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen, "Parsimonious Mixed Models," arXiv:1506.04967, 2015.