VSEGAN: Visual Speech Enhancement Generative Adversarial Network
Xinmeng Xu^{1,2}, Yang Wang^2, Dongxiang Xu^2, Yiyuan Peng^2, Cong Zhang^2, Jie Jia^2, Binbin Chen^2
^1 E.E. Engineering, Trinity College Dublin, Ireland
^2 vivo AI Lab, P.R. China
xux3@tcd.ie, {yang.wang.rj, dongxiang.xu, pengyiyuan, zhangcong, jie.jia, bb.chen}@vivo.com

arXiv:2102.02599v1 [eess.AS] 4 Feb 2021

Abstract

Speech enhancement is the task of improving speech quality in noisy scenarios. Several state-of-the-art approaches have introduced visual information for speech enhancement, since the visual aspect of speech is essentially unaffected by the acoustic environment. This paper proposes a novel framework that incorporates visual information for speech enhancement within a Generative Adversarial Network (GAN). In particular, the proposed visual speech enhancement GAN consists of two networks trained in an adversarial manner: i) a generator that adopts a multi-layer feature fusion convolution network to enhance the input noisy speech, and ii) a discriminator that attempts to minimize the discrepancy between the distributions of the clean speech signal and the enhanced speech signal. Experimental results demonstrate superior performance of the proposed model against several state-of-the-art models.

Index Terms: speech enhancement, visual information, multi-layer feature fusion convolution network, generative adversarial network

1. Introduction

Speech processing systems are used in a wide variety of applications such as speech recognition, speech coding, and hearing aids. These systems perform best when noise interference is absent. Consequently, speech enhancement is essential to improve their performance in noisy backgrounds [1]. Speech enhancement refers to algorithms that improve the quality and intelligibility of noisy speech, reduce hearing fatigue, and improve the performance of many speech processing systems [2].

Conventional speech enhancement algorithms are mainly based on signal processing techniques, e.g., exploiting speech signal characteristics of a known speaker, and include spectral subtraction [3], signal subspace methods [4], Wiener filtering [5], and model-based statistical algorithms [6]. Various deep learning architectures, such as fully connected networks [7], Convolutional Neural Networks (CNNs) [8, 9], and Recurrent Neural Networks (RNNs) [10], have been demonstrated to improve speech enhancement notably over conventional approaches. Although deep learning approaches make noisy speech more audible, deficiencies remain in restoring intelligibility.
Speech enhancement is inherently multimodal, where visual cues help to understand speech better. The correlation between the visible properties of the articulatory organs, e.g., lips, teeth, and tongue, and speech reception has been shown in numerous behavioural studies [11, 12]. Accordingly, a large number of previous works have addressed visual speech enhancement based on signal processing techniques and machine learning algorithms [13, 14]. Not surprisingly, visual speech enhancement has recently been addressed in the framework of DNNs: a fully connected network was used to jointly process audio and visual inputs to perform speech enhancement [15]. However, the fully connected architecture cannot effectively process visual information, which makes such an audio-visual system only slightly better than its audio-only counterpart. In addition, another model feeds video frames into a trained speech generation network and predicts clean speech from the noisy input [16], showing a more pronounced improvement over previous approaches.

The Generative Adversarial Network (GAN) [17] consists of a generator network and a discriminator network that play a min-max game against each other, and GANs have been explored for speech enhancement; SEGAN [18] is the first approach to apply a GAN to a speech enhancement model. This paper proposes a Visual Speech Enhancement Generative Adversarial Network (VSEGAN) that enhances noisy speech using visual information under a GAN architecture.

The main contributions of the proposed VSEGAN can be summarized as follows:
• It provides a robust audio-visual feature fusion strategy. The audio-visual fusion occurs at every processing level, rather than as a single fusion step in the whole network; hence, more audio-visual information can be exploited.
• It is the first to investigate GANs for visual speech enhancement. In addition, the GAN architecture helps visual speech enhancement control the proportion of information from the audio and visual modalities during the fusion process.

The rest of the article is organized as follows: Section 2 presents the proposed method in detail, Section 3 introduces the experimental setup, experimental results are discussed in Section 4, and a conclusion is given in Section 5.

2. Model Architecture

2.1. Generative Adversarial Network

A GAN is comprised of a generator (G) and a discriminator (D). The function of G is to map a noise vector x from a given prior distribution X to an output sample y from the distribution Y of the training data. D is a binary classifier that determines whether its input is real or fake: samples coming from Y are classified as real, whereas samples produced by G are classified as fake. The learning process can be regarded as a minimax game between G and D, expressed as [17]:

\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_y(y)}[\log D(y)] + \mathbb{E}_{x \sim p_x(x)}[\log(1 - D(G(x)))]    (1)

The training procedure of a GAN can be summarized as the repetition of the following three steps:
Step 1: D back-props on a batch of real samples y.
Step 2: Freeze the parameters of G, and D back-props on a batch of fake samples generated by G.
Step 3: Freeze the parameters of D, and G back-props so as to make D misclassify its samples.
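To make the three steps concrete, here is a minimal PyTorch sketch of one adversarial training iteration under the value function of Eq. (1). The tiny fully connected G and D, the 64-dimensional noise vector, and the BCE-with-logits formulation are illustrative assumptions rather than the paper's networks; VSEGAN's actual G and D are the convolutional models described in Section 2.2.

```python
import torch
import torch.nn as nn

# Toy networks; the actual VSEGAN G and D are convolutional (see Section 2.2).
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
D = nn.Sequential(nn.Linear(32, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy, as in Eq. (1)

def train_step(y_real):
    batch = y_real.size(0)
    x = torch.randn(batch, 64)  # noise vector x drawn from the prior

    # Step 1: back-prop D on a batch of real samples y.
    # Step 2: detach G's output (G is frozen) and back-prop D on fake samples G(x).
    d_loss = bce(D(y_real), torch.ones(batch, 1)) + \
             bce(D(G(x).detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 3: only G's optimizer steps (D is frozen); update G so D misclassifies.
    g_loss = bce(D(G(x)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with random "real" data standing in for training samples.
print(train_step(torch.randn(8, 32)))
```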
Regression tasks generally work with a conditioned version of the GAN [19], in which some extra information, encoded in a vector y_c, is provided together with the noise vector x at the input of G. In that case, the objective is expressed as:

\min_G \max_D V(D, G) = \mathbb{E}_{y, y_c \sim p_y(y, y_c)}[\log D(y, y_c)] + \mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[\log(1 - D(G(x, y_c), y_c))]    (2)

However, Eq. (2) suffers from vanishing gradients due to the sigmoid cross-entropy loss function [20]. To tackle this problem, the least-squares GAN approach [21] substitutes the cross-entropy loss with a mean-squares loss with binary coding, as given in Eq. (3) and Eq. (4):

\min_D V(D) = \frac{1}{2}\mathbb{E}_{y, y_c \sim p_y(y, y_c)}[(D(y, y_c) - 1)^2] + \frac{1}{2}\mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[D(G(x, y_c), y_c)^2]    (3)

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x), y_c \sim p_y(y_c)}[(D(G(x, y_c), y_c) - 1)^2]    (4)
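As an illustration, the two least-squares objectives can be written as the following loss functions, assuming D already outputs a scalar score for each (sample, condition) pair; the tensor shapes and batching are assumptions made only for this sketch.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Eq. (3): push D's score towards 1 for real pairs (y, y_c)
    # and towards 0 for generated pairs (G(x, y_c), y_c).
    return 0.5 * torch.mean((d_real - 1.0) ** 2) + 0.5 * torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Eq. (4): push D's score on generated samples towards 1.
    return 0.5 * torch.mean((d_fake - 1.0) ** 2)

# Example with random discriminator outputs for a batch of 8 pairs.
d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(lsgan_d_loss(d_real, d_fake).item(), lsgan_g_loss(d_fake).item())
```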
2.2. Visual Speech Enhancement GAN

The G network of VSEGAN performs the enhancement: its inputs are the noisy speech ỹ and the video frames v, and its output is the enhanced speech ŷ = G(ỹ, v). The G network uses the Multi-layer Feature Fusion Convolution Network (MFFCN) [22], which follows an encoder-decoder scheme and consists of an encoder part, a fusion part, an embedding part, and a decoder part. The architecture of the G network is shown in Figure 1.

Figure 1: Network architecture of the generator. Conv-A, Conv-V, Conv-AV, BN, and Deconv denote convolution of the audio encoder, convolution of the video encoder, convolution of the audio-visual fusion, batch normalization, and transposed convolution, respectively.

The encoder part of the G network comprises an audio encoder and a video encoder. The audio encoder is a CNN that takes the spectrogram as input; each audio encoder layer is a strided convolutional layer [20] followed by batch normalization and Leaky-ReLU for non-linearity. The video encoder processes the input face embedding through a number of max-pooling convolutional layers, also followed by batch normalization and Leaky-ReLU. In the G network, the dimension of the visual feature vector after each convolution layer has to match that of the corresponding audio feature vector, since both vectors pass through a fusion block at every encoder layer during the encoding stage. The audio decoder mirrors the audio encoder using deconvolutions, followed again by batch normalization and Leaky-ReLU.

The fusion part designates a merged dimension at which fusion is implemented: the audio and video streams are concatenated and passed through several strided convolution layers, followed by batch normalization and Leaky-ReLU. The output of the fusion part at each level is fed to the corresponding decoder layer. The embedding part consists of three steps: 1) flatten the audio and visual streams, 2) concatenate the flattened audio and visual streams, and 3) feed the concatenated feature vector into several fully connected layers. The embedding part is a bottleneck that applies a deeper feature fusion strategy, but at a larger computational expense. Because the layer-wise fusion outputs also feed the decoder, the architecture of the G network avoids losing the many low-level details needed to reconstruct the speech waveform properly, which would happen if all information were forced to flow through the compression bottleneck.
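The following is a minimal PyTorch sketch of this layer-wise fusion idea: each encoder level applies a Conv-A block to the audio stream, a Conv-V block to the video stream, and a Conv-AV block to their concatenation, and the fused maps are passed to the decoder as skip connections. The channel counts, kernel sizes, bilinear size-matching, and two-level depth are illustrative assumptions; the actual VSEGAN layer configuration is the one given in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoderLayer(nn.Module):
    """One encoder level: Conv-A (strided), Conv-V (max-pooled), and Conv-AV fusion."""
    def __init__(self, a_in, v_in, out_ch):
        super().__init__()
        self.conv_a = nn.Sequential(
            nn.Conv2d(a_in, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))
        self.conv_v = nn.Sequential(
            nn.Conv2d(v_in, out_ch, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))
        self.conv_av = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2))

    def forward(self, a, v):
        a, v = self.conv_a(a), self.conv_v(v)
        # VSEGAN matches audio/video feature sizes by its choice of strides and
        # pooling (Table 1); bilinear resizing here is only a simplification.
        v_m = F.interpolate(v, size=a.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.conv_av(torch.cat([a, v_m], dim=1))  # fusion at this level
        return a, v, fused  # the fused map is what feeds the decoder

class TinyFusionGenerator(nn.Module):
    """Two-level toy version of the encoder-fusion-decoder scheme of Figure 1."""
    def __init__(self):
        super().__init__()
        self.enc1 = FusionEncoderLayer(1, 5, 16)
        self.enc2 = FusionEncoderLayer(16, 16, 32)
        self.dec2 = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(16), nn.LeakyReLU(0.2))
        self.dec1 = nn.ConvTranspose2d(16 + 16, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, spec, video):
        a1, v1, f1 = self.enc1(spec, video)  # level-1 fusion output
        a2, v2, f2 = self.enc2(a1, v1)       # level-2 fusion output
        d2 = self.dec2(f2)                   # decoder consumes the deepest fusion
        return self.dec1(torch.cat([d2, f1], dim=1))  # skip from the level-1 fusion

# Example: one 80x20 log-Mel spectrogram and five 80x80 video frames (batch of 2).
g = TinyFusionGenerator()
out = g(torch.randn(2, 1, 80, 20), torch.randn(2, 5, 80, 80))
print(out.shape)  # torch.Size([2, 1, 80, 20])
```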
The D network of VSEGAN has the same structure as that of SERGAN [23], as shown in Figure 2. D can be seen as a kind of loss function that transmits its classification (real or fake) back to G, i.e., G learns to predict waveforms closer to the realistic distribution while getting rid of the noisy signals labelled as fake. In addition, previous approaches [24, 25] demonstrated that using an L1 norm as an additional loss component is beneficial to G, and the L1 norm performs better than the L2 norm at minimizing the distance between the enhanced speech and the target speech [26]. Therefore, the G loss is modified as:

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x), \tilde{y} \sim p_y(\tilde{y})}[(D(G(x, (v, \tilde{y})), \tilde{y}) - 1)^2] + \lambda \lVert G(x, (v, \tilde{y})) - y \rVert_1    (5)

where λ is a hyper-parameter that controls the magnitude of the L1 norm.

Figure 2: Network architecture of the discriminator, and the GAN training procedure.

3. Experiment

3.1. Datasets

The model is trained on two datasets: the first is GRID [27], which consists of video recordings in which 18 male and 16 female speakers each pronounce 1000 sentences; the second is TCD-TIMIT [28], which consists of 32 male and 30 female speakers with around 200 videos each.

The noise signals are collected from the real world and categorized into 12 types: room, car, instrument, engine, train, talker speaking, air-brake, water, street, mic-noise, ring-bell, and music. At every training iteration, a random attenuation of the noise interference in the range of [-15, 0] dB is applied as a data augmentation scheme. This augmentation is done to make the network robust against various SNRs.
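Below is a minimal NumPy sketch of this augmentation step, assuming 1-D waveforms and that the noise clip is simply looped to cover the utterance; the paper does not specify these details, so they are assumptions of the sketch.

```python
import numpy as np

def mix_with_random_noise_gain(clean, noise, rng=np.random.default_rng()):
    """Attenuate the noise by a random gain in [-15, 0] dB and add it to the clean speech."""
    # Loop/trim the noise so it covers the clean utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]

    gain_db = rng.uniform(-15.0, 0.0)           # random attenuation in dB
    noise = noise * (10.0 ** (gain_db / 20.0))  # dB -> linear amplitude scaling
    return clean + noise

# Example with random signals standing in for speech and recorded noise.
clean = np.random.randn(16000)  # 1 s at an assumed 16 kHz sampling rate
noise = np.random.randn(8000)
noisy = mix_with_random_noise_gain(clean, noise)
print(noisy.shape)
```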
3.2. Training and Network Parameters

The video representation is extracted from the input video, which is resampled to 25 frames per second. Each video is divided into non-overlapping segments of 5 consecutive frames and cropped with a mouth-centred window of size 128 × 128 [29]. The audio representation is the magnitude spectrogram transformed to the log Mel-domain with 80 Mel frequency bands from 0 to 8 kHz, using a Hanning window of length 640 samples (40 milliseconds) and a hop size of 160 samples (10 milliseconds). The spectrograms are sliced into pieces of 200 milliseconds, corresponding to the length of 5 video frames.

Consequently, the size of the video feature is 128 × 128 × 5 and the size of the audio feature is 80 × 20 × 1. During training, the size of the visual modality has to match that of the audio modality; therefore, the processed video segment is resized to 80 × 80 × 5 by bilinear interpolation [30]. The proposed VSEGAN has 10 convolutional layers in each encoder and decoder of the generator; the details of the audio and visual encoders are given in Table 1, where each Conv-A or Conv-V block in Figure 1 comprises two of the convolution layers in Table 1.

Table 1: Detailed architecture of the VSEGAN generator encoders. Conv1 denotes the first convolution layer of the VSEGAN generator encoder part.

                 Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7   Conv8   Conv9   Conv10
Num Filters       64      64      128     128     256     256     512     512     1024    1024
Filter Size      (5, 5)  (4, 4)  (4, 4)  (4, 4)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)
Stride (audio)   (2, 2)  (1, 1)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)
MaxPool (video)  (2, 4)  (1, 2)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)

The model is trained with the ADAM optimizer for 70 epochs with a learning rate of 10^-4 and a batch size of 8, and the hyper-parameter λ of the loss function in Eq. (5) is set to 100 [18].

3.3. Evaluation

The performance of VSEGAN is evaluated with the following metrics: Perceptual Evaluation of Speech Quality (PESQ), more specifically the wideband version recommended in ITU-T P.862.2 (-0.5 to 4.5) [31], and Short-Time Objective Intelligibility (STOI) [32]. Each measure compares the enhanced speech with the clean reference of the test stimuli provided in the dataset.
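For reference, both metrics are available in common Python packages; the sketch below assumes the `pesq` and `pystoi` implementations and 16 kHz waveforms, neither of which the paper explicitly prescribes.

```python
import numpy as np
from pesq import pesq    # python-pesq: ITU-T P.862 / P.862.2 implementation
from pystoi import stoi  # pystoi: Short-Time Objective Intelligibility

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000):
    """Score an enhanced utterance against its clean reference.

    Both inputs are 1-D time-domain waveforms sampled at `fs`; 'wb' selects the
    wide-band PESQ of ITU-T P.862.2 (range roughly -0.5 to 4.5).
    """
    return pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs)

# Usage with real test waveforms loaded elsewhere:
#   pesq_score, stoi_score = evaluate(clean_waveform, enhanced_waveform, fs=16000)
```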
4. Results

Four networks are trained and compared:
• SEGAN [18]: an audio-only speech enhancement generative adversarial network.
• Baseline [33]: a baseline work on visual speech enhancement.
• MFFCN [22]: a visual speech enhancement approach using a multi-layer feature fusion convolution network; this model is the G network of the proposed model.
• VSEGAN: the proposed model, the visual speech enhancement generative adversarial network.

Table 2: Performance of the trained networks

Test SNR                  -5 dB            0 dB
Evaluation Metrics      STOI   PESQ      STOI   PESQ
Noisy                   51.4   1.03      62.6   1.24
SEGAN                   63.4   1.97      77.3   2.21
Baseline                81.3   2.35      87.9   2.94
MFFCN                   82.7   2.72      89.3   2.92
VSEGAN                  86.8   2.88      89.8   3.10

Table 2 demonstrates how performance improves as each new component is added to the architecture: visual information, the multi-layer feature fusion strategy, and finally the GAN model. VSEGAN outperforms SEGAN, which is evidence that visual information significantly improves the performance of the speech enhancement system. Moreover, the comparison between VSEGAN and MFFCN illustrates that the GAN model for visual speech enhancement is more robust than the generator-only model. Hence, the performance improvement from SEGAN to VSEGAN comes primarily from two sources: 1) using visual information, and 2) using the GAN model. Figure 3 shows spectrograms of the baseline, MFFCN, and VSEGAN enhancement, in which the most obvious spectral differences are framed by dotted boxes.¹

Figure 3: Example of input and enhanced spectra from an example speech utterance. (a) Noisy speech under the condition of noise at 0 dB. (b) Enhanced speech generated by the baseline work. (c) Enhanced speech generated by MFFCN. (d) Enhanced speech generated by VSEGAN.

To further investigate the superiority of the proposed method, the performance of VSEGAN is also compared to the following recent audio-visual speech enhancement approaches on the GRID dataset:
• Looking-to-Listen (L2L) model [34]: a speaker-independent audio-visual speech separation model.
• Online Visual Augmented (OVA) model [35]: a late-fusion visual speech enhancement model that involves an audio-based component, a visual-based component, and an augmentation component.
• AV(SE)² model [36]: an audio-visual squeeze-excite speech enhancement model.
• AMFFCN model [37]: a visual speech enhancement approach using an attentional multi-layer feature fusion convolution network.

Table 3: Performance comparison of VSEGAN with state-of-the-art results on GRID

Test SNR              -5 dB    0 dB
Evaluation Metric         PESQ
L2L                   2.61     2.92
OVA                   2.69     3.00
AV(SE)²               -        2.98
AMFFCN                2.81     3.04
VSEGAN                2.88     3.10

Table 3 shows that VSEGAN produces state-of-the-art results in terms of PESQ score when compared against four recently proposed methods that use DNNs to perform end-to-end visual speech enhancement. Results for the competing methods are taken from the corresponding papers, and missing entries indicate that the metric is not reported in the reference paper. Although the competing results are for reference only, VSEGAN performs better than state-of-the-art results on the GRID dataset.

¹ Speech samples are available at: https://XinmengXu.github.io/AVSE/VSEGAN
5. Conclusions

This paper proposed an end-to-end visual speech enhancement method implemented within the generative adversarial framework. The model adopts a multi-layer feature fusion convolution network structure, which provides better training behaviour, as the gradient can flow deeper through the whole structure. According to the experimental results, the performance of the speech enhancement system is significantly improved by involving visual information, and visual speech enhancement using a GAN yields better enhanced speech quality than several state-of-the-art models.

Possible future work involves integrating spatial and temporal information into the network, exploring better convolutional structures, and including weightings for the audio and visual modalities in the fusion blocks to control how the information from the two modalities is treated.
6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[2] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Pearson Education India, 2006.
[3] R. Martin, "Spectral subtraction based on minimum statistics," in Proc. European Signal Processing Conference (EUSIPCO), 1994.
[4] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.
[5] J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197–210, 1978.
[6] M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, no. 1, pp. 45–57, 1991.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[8] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in Interspeech, 2017, pp. 2013–2017.
[9] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[10] F. Weninger, F. Eyben, and B. Schuller, "Single-channel speech separation with memory-enhanced recurrent neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 3709–3713.
[11] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[12] Q. Summerfield, "Use of visual information for phonetic perception," Phonetica, vol. 36, no. 4-5, pp. 314–331, 1979.
[13] W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, "Video assisted speech source separation," in Proceedings (ICASSP '05), IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5. IEEE, 2005, pp. v–425.
[14] J. R. Hershey and M. Casey, "Audio-visual sound separation via hidden Markov models," in Advances in Neural Information Processing Systems, 2002, pp. 1173–1180.
[15] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using deep neural networks," in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
[16] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680, 2014.
[18] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Interspeech 2017, 2017, pp. 3642–3646.
[19] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[20] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2016.
[21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[22] X. Xu, D. Xu, J. Jia, Y. Wang, and B. Chen, "MFFCN: Multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.05975, 2021.
[23] D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 106–110.
[24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[26] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5414–5418.
[27] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[28] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, 2015.
[29] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[30] K. T. Gribbon and D. G. Bailey, "A novel approach to real-time bilinear interpolation," in Proceedings, DELTA 2004, Second IEEE International Workshop on Electronic Design, Test and Applications. IEEE, 2004, pp. 126–131.
[31] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.
[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[33] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Interspeech, 2018, pp. 1170–1174.
[34] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, 2018.
[35] W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, "A robust audio-visual speech enhancement model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7529–7533.
[36] M. L. Iuzzolino and K. Koishida, "AV(SE)²: Audio-visual squeeze-excite speech enhancement," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7539–7543.
[37] X. Xu, Y. Wang, D. Xu, C. Zhang, Y. Peng, J. Jia, and B. Chen, "AMFFCN: Attentional multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.06268, 2021.