VSEGAN: Visual Speech Enhancement Generative Adversarial Network
Xinmeng Xu^{1,2}, Yang Wang^2, Dongxiang Xu^2, Yiyuan Peng^2, Cong Zhang^2, Jie Jia^2, Binbin Chen^2
^1 E.E. Engineering, Trinity College Dublin, Ireland
^2 vivo AI Lab, P.R. China
xux3@tcd.ie, {yang.wang.rj, dongxiang.xu, pengyiyuan, zhangcong, jie.jia, bb.chen}@vivo.com

arXiv:2102.02599v1 [eess.AS] 4 Feb 2021

Abstract

Speech enhancement is the task of improving speech quality in noisy scenarios. Several state-of-the-art approaches have introduced visual information for speech enhancement, since the visual aspect of speech is essentially unaffected by the acoustic environment. This paper proposes a novel framework that incorporates visual information for speech enhancement within a Generative Adversarial Network (GAN). In particular, the proposed visual speech enhancement GAN consists of two networks trained in an adversarial manner: i) a generator that adopts a multi-layer feature fusion convolution network to enhance the input noisy speech, and ii) a discriminator that attempts to minimize the discrepancy between the distributions of the clean speech signal and the enhanced speech signal. Experimental results demonstrate superior performance of the proposed model against several state-of-the-art models.

Index Terms: speech enhancement, visual information, multi-layer feature fusion convolution network, generative adversarial network

1. Introduction

Speech processing systems are used in a wide variety of applications such as speech recognition, speech coding, and hearing aids. These systems perform best when noise interference is absent. Consequently, speech enhancement is essential to improve their performance in noisy backgrounds [1]. Speech enhancement refers to algorithms that improve the quality and intelligibility of noisy speech, reduce hearing fatigue, and improve the performance of many speech processing systems [2].

Conventional speech enhancement algorithms are mainly based on signal processing techniques, e.g., exploiting speech signal characteristics of a known speaker, and include spectral subtraction [3], signal subspace methods [4], Wiener filtering [5], and model-based statistical algorithms [6]. Various deep learning architectures, such as fully connected networks [7], Convolutional Neural Networks (CNNs) [8, 9], and Recurrent Neural Networks (RNNs) [10], have been demonstrated to improve speech enhancement notably over conventional approaches. Although deep learning approaches make noisy speech more audible, deficiencies remain in restoring intelligibility.
Speech enhancement is inherently multimodal, where visual cues help to understand speech better. The correlation between the visible properties of the articulatory organs, e.g., lips, teeth, and tongue, and speech reception has been shown in numerous behavioural studies [11, 12]. Accordingly, a large number of previous works have addressed visual speech enhancement based on signal processing techniques and machine learning algorithms [13, 14]. Not surprisingly, visual speech enhancement has recently been addressed in the framework of DNNs: a fully connected network was used to jointly process audio and visual inputs to perform speech enhancement [15]. However, the fully connected architecture cannot effectively process visual information, which makes such an audio-visual system only slightly better than its audio-only counterpart. In addition, another model feeds video frames into a trained speech generation network and predicts clean speech from the noisy input [16], showing a more pronounced improvement over previous approaches.

The Generative Adversarial Network (GAN) [17] consists of a generator network and a discriminator network that play a min-max game against each other, and GANs have been explored for speech enhancement; SEGAN [18] is the first approach to apply a GAN to a speech enhancement model. This paper proposes a Visual Speech Enhancement Generative Adversarial Network (VSEGAN) that enhances noisy speech using visual information under a GAN architecture.

The main contributions of the proposed VSEGAN can be summarized as follows:
• It provides a robust audio-visual feature fusion strategy. The audio-visual fusion occurs at every processing level, rather than as a single fusion step in the whole network; hence, more audio-visual information can be exploited.
• It is the first to investigate GANs for visual speech enhancement. In addition, the GAN architecture helps visual speech enhancement control the proportion of information from the audio and visual modalities during the fusion process.

The rest of the article is organized as follows: Section 2 presents the proposed method in detail, Section 3 introduces the experimental setup, experimental results are discussed in Section 4, and a conclusion is given in Section 5.

2. Model Architecture

2.1. Generative Adversarial Network

A GAN is comprised of a generator (G) and a discriminator (D). The function of G is to map a noise vector x from a given prior distribution X to an output sample y from the distribution Y of the training data. D is a binary classifier that determines whether its input is real or fake: samples coming from Y are classified as real, whereas samples produced by G are classified as fake. The learning process can be regarded as a minimax game between G and D, expressed as [17]:

\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_y(y)}[\log D(y)] + \mathbb{E}_{x \sim p_x(x)}[\log(1 - D(G(x)))]    (1)

The training procedure of a GAN can be summarized as the repetition of the following three steps:
Step 1: D back-props on a batch of real samples y.
Step 2: Freeze the parameters of G, and D back-props on a batch of fake samples generated by G.
Step 3: Freeze the parameters of D, and G back-props so as to make D misclassify its samples.
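To make the three steps concrete, here is a minimal PyTorch sketch of one adversarial training iteration under the value function of Eq. (1). The tiny fully connected G and D, the 64-dimensional noise vector, and the BCE-with-logits formulation are illustrative assumptions rather than the paper's networks; VSEGAN's actual G and D are the convolutional models described in Section 2.2.

```python
import torch
import torch.nn as nn

# Toy networks; the actual VSEGAN G and D are convolutional (see Section 2.2).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
D = nn.Sequential(nn.Linear(32, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy, as in Eq. (1)

def train_step(y_real):
    batch = y_real.size(0)
    x = torch.randn(batch, 64)  # noise vector x drawn from the prior

    # Step 1: back-prop D on a batch of real samples y.
    # Step 2: detach G's output (G is frozen) and back-prop D on fake samples G(x).
    d_loss = bce(D(y_real), torch.ones(batch, 1)) + \
             bce(D(G(x).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 3: only G's optimizer steps (D is frozen); update G so D misclassifies.
    g_loss = bce(D(G(x)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with random "real" data standing in for training samples.
print(train_step(torch.randn(8, 32)))
```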
Regression tasks generally work with a conditioned version of the GAN [19], in which some extra information, encoded in a vector y_c, is provided together with the noise vector x at the input of G. In that case, the objective is expressed as:

\min_G \max_D V(D, G) = \mathbb{E}_{y, y_c \sim p_y(y, y_c)}[\log D(y, y_c)] + \mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[\log(1 - D(G(x, y_c), y_c))]    (2)

However, Eq. (2) suffers from vanishing gradients due to the sigmoid cross-entropy loss function [20]. To tackle this problem, the least-squares GAN approach [21] substitutes the cross-entropy loss with a mean-squares loss with binary coding, as given in Eq. (3) and Eq. (4):

\min_D V(D) = \frac{1}{2}\mathbb{E}_{y, y_c \sim p_y(y, y_c)}[(D(y, y_c) - 1)^2] + \frac{1}{2}\mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[D(G(x, y_c), y_c)^2]    (3)

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[(D(G(x, y_c), y_c) - 1)^2]    (4)
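As an illustration, the two least-squares objectives can be written as the following loss functions, assuming D already outputs a scalar score for each (sample, condition) pair; the tensor shapes and batching are assumptions made only for this sketch.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Eq. (3): push D's score towards 1 for real pairs (y, y_c)
    # and towards 0 for generated pairs (G(x, y_c), y_c).
    return 0.5 * torch.mean((d_real - 1.0) ** 2) + 0.5 * torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Eq. (4): push D's score on generated samples towards 1.
    return 0.5 * torch.mean((d_fake - 1.0) ** 2)

# Example with random discriminator outputs for a batch of 8 pairs.
d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(lsgan_d_loss(d_real, d_fake).item(), lsgan_g_loss(d_fake).item())
```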
2.2. Visual Speech Enhancement GAN

The G network of VSEGAN performs the enhancement: its inputs are the noisy speech ỹ and the video frames v, and its output is the enhanced speech ŷ = G(ỹ, v). The G network uses the Multi-layer Feature Fusion Convolution Network (MFFCN) [22], which follows an encoder-decoder scheme and consists of an encoder part, a fusion part, an embedding part, and a decoder part. The architecture of the G network is shown in Figure 1.

Figure 1: Network architecture of the generator. Conv-A, Conv-V, Conv-AV, BN, and Deconv denote convolution of the audio encoder, convolution of the video encoder, convolution of the audio-visual fusion, batch normalization, and transposed convolution, respectively.

The encoder part of the G network comprises an audio encoder and a video encoder. The audio encoder is a CNN that takes the spectrogram as input; each audio encoder layer is a strided convolutional layer [20] followed by batch normalization and Leaky-ReLU for non-linearity. The video encoder processes the input face embedding through a number of max-pooling convolutional layers, also followed by batch normalization and Leaky-ReLU. In the G network, the dimension of the visual feature vector after each convolution layer has to match that of the corresponding audio feature vector, since both vectors pass through a fusion block at every encoder layer during the encoding stage. The audio decoder mirrors the audio encoder using deconvolutions, followed again by batch normalization and Leaky-ReLU.

The fusion part designates a merged dimension at which fusion is implemented: the audio and video streams are concatenated and passed through several strided convolution layers, followed by batch normalization and Leaky-ReLU. The output of the fusion part at each level is fed to the corresponding decoder layer. The embedding part consists of three steps: 1) flatten the audio and visual streams, 2) concatenate the flattened audio and visual streams, and 3) feed the concatenated feature vector into several fully connected layers. The embedding part is a bottleneck that applies a deeper feature fusion strategy, but at a larger computational expense. Because the layer-wise fusion outputs also feed the decoder, the architecture of the G network avoids losing the many low-level details needed to reconstruct the speech waveform properly, which would happen if all information were forced to flow through the compression bottleneck.
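The following is a minimal PyTorch sketch of this layer-wise fusion idea: each encoder level applies a Conv-A block to the audio stream, a Conv-V block to the video stream, and a Conv-AV block to their concatenation, and the fused maps are passed to the decoder as skip connections. The channel counts, kernel sizes, bilinear size-matching, and two-level depth are illustrative assumptions; the actual VSEGAN layer configuration is the one given in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoderLayer(nn.Module):
    """One encoder level: Conv-A (strided), Conv-V (max-pooled), and Conv-AV fusion."""
    def __init__(self, a_in, v_in, out_ch):
        super().__init__()
        self.conv_a = nn.Sequential(
            nn.Conv2d(a_in, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))
        self.conv_v = nn.Sequential(
            nn.Conv2d(v_in, out_ch, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))
        self.conv_av = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))

    def forward(self, a, v):
        a, v = self.conv_a(a), self.conv_v(v)
        # VSEGAN matches audio/video feature sizes by its choice of strides and
        # pooling (Table 1); bilinear resizing here is only a simplification.
        v_m = F.interpolate(v, size=a.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.conv_av(torch.cat([a, v_m], dim=1))  # fusion at this level
        return a, v, fused  # the fused map is what feeds the decoder

class TinyFusionGenerator(nn.Module):
    """Two-level toy version of the encoder-fusion-decoder scheme of Figure 1."""
    def __init__(self):
        super().__init__()
        self.enc1 = FusionEncoderLayer(1, 5, 16)
        self.enc2 = FusionEncoderLayer(16, 16, 32)
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.LeakyReLU(0.2))
        self.dec1 = nn.ConvTranspose2d(16 + 16, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, spec, video):
        a1, v1, f1 = self.enc1(spec, video)  # level-1 fusion output
        a2, v2, f2 = self.enc2(a1, v1)       # level-2 fusion output
        d2 = self.dec2(f2)                   # decoder consumes the deepest fusion
        return self.dec1(torch.cat([d2, f1], dim=1))  # skip from the level-1 fusion

# Example: one 80x20 log-Mel spectrogram and five 80x80 video frames (batch of 2).
g = TinyFusionGenerator()
out = g(torch.randn(2, 1, 80, 20), torch.randn(2, 5, 80, 80))
print(out.shape)  # torch.Size([2, 1, 80, 20])
```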
The D network of VSEGAN has the same structure as that of SERGAN [23], as shown in Figure 2. D can be seen as a kind of loss function that transmits its classification (real or fake) back to G, i.e., G learns to predict waveforms closer to the realistic distribution while getting rid of the noisy signals labelled as fake. In addition, previous approaches [24, 25] demonstrated that using an L1 norm as an additional loss component is beneficial to G, and the L1 norm performs better than the L2 norm at minimizing the distance between the enhanced speech and the target speech [26]. Therefore, the G loss is modified as:

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x), \tilde{y} \sim p_y(\tilde{y})}[(D(G(x, (v, \tilde{y})), \tilde{y}) - 1)^2] + \lambda \lVert G(x, (v, \tilde{y})) - y \rVert_1    (5)

where λ is a hyper-parameter that controls the magnitude of the L1 norm.

Figure 2: Network architecture of the discriminator, and the GAN training procedure.

3. Experiment

3.1. Datasets

The model is trained on two datasets: the first is GRID [27], which consists of video recordings in which 18 male and 16 female speakers each pronounce 1000 sentences; the second is TCD-TIMIT [28], which consists of 32 male and 30 female speakers with around 200 videos each.

The noise signals are collected from the real world and categorized into 12 types: room, car, instrument, engine, train, talker speaking, air-brake, water, street, mic-noise, ring-bell, and music. At every training iteration, a random attenuation of the noise interference in the range of [-15, 0] dB is applied as a data augmentation scheme. This augmentation is done to make the network robust against various SNRs.
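Below is a minimal NumPy sketch of this augmentation step, assuming 1-D waveforms and that the noise clip is simply looped to cover the utterance; the paper does not specify these details, so they are assumptions of the sketch.

```python
import numpy as np

def mix_with_random_noise_gain(clean, noise, rng=np.random.default_rng()):
    """Attenuate the noise by a random gain in [-15, 0] dB and add it to the clean speech."""
    # Loop/trim the noise so it covers the clean utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]

    gain_db = rng.uniform(-15.0, 0.0)           # random attenuation in dB
    noise = noise * (10.0 ** (gain_db / 20.0))  # dB -> linear amplitude scaling
    return clean + noise

# Example with random signals standing in for speech and recorded noise.
clean = np.random.randn(16000)  # 1 s at an assumed 16 kHz sampling rate
noise = np.random.randn(8000)
noisy = mix_with_random_noise_gain(clean, noise)
print(noisy.shape)
```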
3.2. Training and Network Parameters

The video representation is extracted from the input video, which is resampled to 25 frames per second. Each video is divided into non-overlapping segments of 5 consecutive frames and cropped with a mouth-centred window of size 128 × 128 [29]. The audio representation is the magnitude spectrogram transformed to the log Mel-domain with 80 Mel frequency bands from 0 to 8 kHz, using a Hanning window of length 640 samples (40 milliseconds) and a hop size of 160 samples (10 milliseconds). The spectrograms are sliced into pieces of 200 milliseconds, corresponding to the length of 5 video frames.

Consequently, the size of the video feature is 128 × 128 × 5 and the size of the audio feature is 80 × 20 × 1. During training, the size of the visual modality has to match that of the audio modality; therefore, the processed video segment is resized to 80 × 80 × 5 by bilinear interpolation [30]. The proposed VSEGAN has 10 convolutional layers in each encoder and decoder of the generator; the details of the audio and visual encoders are given in Table 1, where each Conv-A or Conv-V block in Figure 1 comprises two of the convolution layers in Table 1.

Table 1: Detailed architecture of the VSEGAN generator encoders. Conv1 denotes the first convolution layer of the VSEGAN generator encoder part.

                 Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7   Conv8   Conv9   Conv10
Num Filters       64      64      128     128     256     256     512     512     1024    1024
Filter Size      (5, 5)  (4, 4)  (4, 4)  (4, 4)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)
Stride (audio)   (2, 2)  (1, 1)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)
MaxPool (video)  (2, 4)  (1, 2)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)

The model is trained with the ADAM optimizer for 70 epochs with a learning rate of 10^-4 and a batch size of 8, and the hyper-parameter λ of the loss function in Eq. (5) is set to 100 [18].

3.3. Evaluation

The performance of VSEGAN is evaluated with the following metrics: Perceptual Evaluation of Speech Quality (PESQ), more specifically the wideband version recommended in ITU-T P.862.2 (-0.5 to 4.5) [31], and Short-Time Objective Intelligibility (STOI) [32]. Each measure compares the enhanced speech with the clean reference of the test stimuli provided in the dataset.
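For reference, both metrics are available in common Python packages; the sketch below assumes the `pesq` and `pystoi` implementations and 16 kHz waveforms, neither of which the paper explicitly prescribes.

```python
import numpy as np
from pesq import pesq    # python-pesq: ITU-T P.862 / P.862.2 implementation
from pystoi import stoi  # pystoi: Short-Time Objective Intelligibility

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000):
    """Score an enhanced utterance against its clean reference.

    Both inputs are 1-D time-domain waveforms sampled at `fs`; 'wb' selects the
    wide-band PESQ of ITU-T P.862.2 (range roughly -0.5 to 4.5).
    """
    return pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs)

# Usage with real test waveforms loaded elsewhere:
#   pesq_score, stoi_score = evaluate(clean_waveform, enhanced_waveform, fs=16000)
```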
4. Results

Four networks are trained and compared:
• SEGAN [18]: an audio-only speech enhancement generative adversarial network.
• Baseline [33]: a baseline work on visual speech enhancement.
• MFFCN [22]: a visual speech enhancement approach using a multi-layer feature fusion convolution network; this model is the G network of the proposed model.
• VSEGAN: the proposed model, the visual speech enhancement generative adversarial network.

Table 2: Performance of the trained networks

Test SNR                  -5 dB            0 dB
Evaluation Metrics      STOI   PESQ      STOI   PESQ
Noisy                   51.4   1.03      62.6   1.24
SEGAN                   63.4   1.97      77.3   2.21
Baseline                81.3   2.35      87.9   2.94
MFFCN                   82.7   2.72      89.3   2.92
VSEGAN                  86.8   2.88      89.8   3.10

Table 2 demonstrates how performance improves as each new component is added to the architecture: visual information, the multi-layer feature fusion strategy, and finally the GAN model. VSEGAN outperforms SEGAN, which is evidence that visual information significantly improves the performance of the speech enhancement system. Moreover, the comparison between VSEGAN and MFFCN illustrates that the GAN model for visual speech enhancement is more robust than the generator-only model. Hence, the performance improvement from SEGAN to VSEGAN comes primarily from two sources: 1) using visual information, and 2) using the GAN model. Figure 3 shows spectrograms of the baseline, MFFCN, and VSEGAN enhancement, in which the most obvious spectral differences are framed by dotted boxes.¹

Figure 3: Example of input and enhanced spectra from an example speech utterance. (a) Noisy speech under the condition of noise at 0 dB. (b) Enhanced speech generated by the baseline work. (c) Enhanced speech generated by MFFCN. (d) Enhanced speech generated by VSEGAN.

To further investigate the superiority of the proposed method, the performance of VSEGAN is also compared to the following recent audio-visual speech enhancement approaches on the GRID dataset:
• Looking-to-Listen (L2L) model [34]: a speaker-independent audio-visual speech separation model.
• Online Visual Augmented (OVA) model [35]: a late-fusion visual speech enhancement model that involves an audio-based component, a visual-based component, and an augmentation component.
• AV(SE)² model [36]: an audio-visual squeeze-excite speech enhancement model.
• AMFFCN model [37]: a visual speech enhancement approach using an attentional multi-layer feature fusion convolution network.

Table 3: Performance comparison of VSEGAN with state-of-the-art results on GRID

Test SNR              -5 dB    0 dB
Evaluation Metric         PESQ
L2L                   2.61     2.92
OVA                   2.69     3.00
AV(SE)²               -        2.98
AMFFCN                2.81     3.04
VSEGAN                2.88     3.10

Table 3 shows that VSEGAN produces state-of-the-art results in terms of PESQ score when compared against four recently proposed methods that use DNNs to perform end-to-end visual speech enhancement. Results for the competing methods are taken from the corresponding papers, and missing entries indicate that the metric is not reported in the reference paper. Although the competing results are for reference only, VSEGAN performs better than state-of-the-art results on the GRID dataset.

¹ Speech samples are available at: https://XinmengXu.github.io/AVSE/VSEGAN
5. Conclusions

This paper proposed an end-to-end visual speech enhancement method implemented within the generative adversarial framework. The model adopts a multi-layer feature fusion convolution network structure, which provides better training behaviour, as the gradient can flow deeper through the whole structure. According to the experimental results, the performance of the speech enhancement system is significantly improved by involving visual information, and visual speech enhancement using a GAN yields better enhanced speech quality than several state-of-the-art models.

Possible future work involves integrating spatial and temporal information into the network, exploring better convolutional structures, and including weightings for the audio and visual modalities in the fusion blocks to control how the information from the two modalities is treated.
6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[2] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Pearson Education India, 2006.
[3] R. Martin, "Spectral subtraction based on minimum statistics," in Proc. European Signal Processing Conference (EUSIPCO), 1994.
[4] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.
[5] J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197–210, 1978.
[6] M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, no. 1, pp. 45–57, 1991.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[8] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in Interspeech, 2017, pp. 2013–2017.
[9] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[10] F. Weninger, F. Eyben, and B. Schuller, "Single-channel speech separation with memory-enhanced recurrent neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3709–3713.
[11] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[12] Q. Summerfield, "Use of visual information for phonetic perception," Phonetica, vol. 36, no. 4-5, pp. 314–331, 1979.
[13] W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, "Video assisted speech source separation," in Proceedings (ICASSP '05), IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5. IEEE, 2005, pp. v–425.
[14] J. R. Hershey and M. Casey, "Audio-visual sound separation via hidden Markov models," in Advances in Neural Information Processing Systems, 2002, pp. 1173–1180.
[15] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using deep neural networks," in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
[16] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680, 2014.
[18] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Interspeech 2017, 2017, pp. 3642–3646.
[19] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[20] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2016.
[21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[22] X. Xu, D. Xu, J. Jia, Y. Wang, and B. Chen, "MFFCN: Multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.05975, 2021.
[23] D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 106–110.
[24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[26] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5414–5418.
[27] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[28] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, 2015.
[29] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[30] K. T. Gribbon and D. G. Bailey, "A novel approach to real-time bilinear interpolation," in Proceedings, DELTA 2004, Second IEEE International Workshop on Electronic Design, Test and Applications. IEEE, 2004, pp. 126–131.
[31] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.
[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[33] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Interspeech, 2018, pp. 1170–1174.
[34] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, 2018.
[35] W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, "A robust audio-visual speech enhancement model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7529–7533.
[36] M. L. Iuzzolino and K. Koishida, "AV(SE)²: Audio-visual squeeze-excite speech enhancement," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7539–7543.
[37] X. Xu, Y. Wang, D. Xu, C. Zhang, Y. Peng, J. Jia, and B. Chen, "AMFFCN: Attentional multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.06268, 2021.