WITH GENERATIVE ADVERSARIAL NETWORKS

Marco Comunità, Huy Phan, Joshua D. Reiss

Centre for Digital Music, Queen Mary University of London, UK

arXiv:2110.09605v2 [cs.SD] 10 Dec 2021

ABSTRACT

Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results for the task at hand.

Index Terms: footsteps, neural synthesis, sound effects, GAN

1. INTRODUCTION

When sound designers are given the task of creating sound effects for movies, video games or radio shows, they have essentially two options: pre-recorded samples or procedural audio. In the first case, they usually rely on large libraries of high quality audio recordings, from which they have to select, edit and mix samples for each event they need to sonify. For a realistic result, especially in video games where the same actions are repeated many times, several samples are selected for every event and randomised during action. This creates challenges in terms of memory requirements, assets management and implementation time.

Alternatively, procedural audio [1] aims to synthesise sound effects in real time, based on a set of input parameters. In the context of video games, these parameters might come from the specific interaction of a character with the environment. This approach presents challenges in terms of development of procedural models which can synthesise high quality and realistic audio, as well as finding the right parameters values for each sound event.

Footsteps sounds are a typical example of the challenges of sound design. It is an omnipresent sound, which is generally fairly repetitive, but on which the human ear is being constantly trained. In fact, it is possible for a subject to identify from recorded or synthesised footsteps, things like: gender [2], emotions [3], posture [4], identity [5], ground materials [6], and type of locomotion [7].

Thus, it is not surprising that researchers put substantial efforts into understanding the acoustic features [8] and developing synthesis models of footsteps sounds. Cook [9] made a first attempt at synthesising footsteps on different surfaces, based on his previous work on physically-informed stochastic models (PhISM) [10]. Another physically-informed model - based on using a stochastic controller to drive sums of microscopic impacts - was proposed by Fontana and Bresin [11]. DeWitt and Bresin [12] proposed a model that included the user’s emotion parameter. Farnell [13] instead, developed a procedural model by studying the characteristics of locomotion in primates. In [14] Turchet developed physical and physically-inspired [...]

[...] of neural networks for the synthesis of footsteps sounds although there is substantial literature exploring neural synthesis of broadband impulsive sounds, such as drums samples, which have some similarities to footsteps. One of the first attempts was in [15], where Donahue et al. developed WaveGAN - a generative adversarial network for unconditional audio synthesis. Another example of neural synthesis of drums is [16], where the authors used a Progressive Growing GAN. Variational autoencoders [17] and U-Nets [18] have also been used for the same task. But, the application of recent developments to neural synthesis of sound effects is yet to be explored. We could only find one other work [19], related to the present study, where the authors focused on synthesis of knocking sounds with emotional content using a conditional WaveGAN.

In this paper we describe the first attempt at neural synthesis of footsteps sound effects. In Section 2, we propose a hybrid architecture that improves the quality and better approximates the real data distribution, with respect to a standard conditional WaveGAN. Objective evaluation of this architecture is provided in Section 3. Section 4 reports on the first listening test that compares “traditional” and neural synthesis models on the task at hand. Discussion and conclusions are given in Section 5.

2. ARCHITECTURE

2.1. Data

To train our models we curated a small dataset (81 samples) using free footsteps sounds available on the Zapsplat website¹. These are high quality samples, recorded by Foley artists using a single type of shoes (women, high heels) on seven surfaces (carpet, deck, metal, pavement, rug, wood and wood internal). The samples were converted to WAV file format, resampled at 16kHz, time aligned and normalised to -6dBFS.

2.2. Generator

The original WaveGAN generator [15] was designed to synthesise 16384 samples (∼1s at 16kHz sampling frequency); and afterwards extended to 32768 or 65536 samples² (∼2s and ∼4s at 16kHz). We adapted the architecture to synthesise 8192 samples, which is sufficient to capture all footsteps samples in our dataset. As shown in Fig. 1, the generator expands the latent variable z to the final audio output size. After reshaping and concatenation with the conditioning label [19, 20], the output is synthesised by passing the input through 5 upsampling 1-D convolutional layers. Upsampling is obtained by: zero-stuffing or nearest neighbour, linear or cubic interpolation. Zero-stuffing plus 1-D convolution is equivalent to using 1-D transposed convolution with stride equal to the upsampling rate. In the other cases, we split the operation into upsampling and 1-D convolution with stride of 1. Every convolutional layer uses a kernel
size of 25, as in the original design. Unlike the original, we included batch normalisation layers after each convolution. With the exception of the last layer, the number of output channels is halved at each convolution layer.

Fig. 1: WaveGAN generator.
[Diagram summary: z (1, 100) -> Linear -> (1, 8192) -> Reshape -> (1024, 8) -> ReLU; label (1) -> Embedding (#labels * 20) -> Linear -> (1, 8); concatenated to (1025, 8) -> five Up+Conv blocks (out_ch = 512, 256, 128, 64, 1) -> y (1, 8192). Each Up+Conv block is Upsample1D (x4) followed by Conv1D (k=25, s=1), mapping (ch, d) to (ch/2, 4d).]

2.3. Discriminator

Recent results in the field of neural speech synthesis [21–26] have shown how GAN-based vocoders are capable of reaching state-of-the-art results in terms of mean opinion scores. You et al. [27] hypothesised that this success is not related to the specific design choices or training strategies; instead, they attributed it to the multi-resolution discriminating framework. In their work, the authors trained 6 different generators using the same discriminator (HiFi-GAN [26]), reaching very similar performance independently of the specific generator.

We experimented with a similar approach by implementing conditional versions of WaveGAN and HiFi-GAN discriminators. Our WaveGAN discriminator is again an adaptation of the original architecture to 8192 samples, based on five 1-D convolutional layers with stride of 4 (see Fig. 2). The HiFi-GAN discriminator is made of two separate discriminators (multi-scale and multi-period), each of which is made of several sub-discriminators that work with inputs of different resolutions. The multi-scale discriminator works on raw audio, ×2 average-pooled and ×4 average-pooled audio (i.e., a downsampled and smoothed version of the original signal). The multi-period discriminator works on “equally spaced samples of an input audio; the space is given as period p”. The periods are set in the original model to 2, 3, 5, 7 and 11.

Fig. 2: WaveGAN discriminator.
[Diagram summary: x (1, 8192); label (1) -> Embedding (#labels * 20) -> Linear -> (1, 8192); concatenated to (2, 8192) -> five Conv blocks (out_ch = 64, 128, 256, 512, 1024) -> (1024, 8) -> Reshape -> (1, 8192) -> Linear -> y. Each Conv block is Conv1D (k=25, s=4), mapping (ch, d) to (2ch, d/4).]

2.4. Loss and Training Procedure

We trained the two architectures - WaveGAN generator and discriminator (referred to as WaveGAN), WaveGAN generator and HiFi-GAN discriminator (referred to as HiFi-WaveGAN) - using different paradigms.

In the first case we opted for a Wasserstein GAN with gradient penalty [28] (WGAN-GP), which has been shown to improve training stability and to help convergence towards a minimum that better approximates the real data distribution, as well as to improve the quality of the synthesised samples. When training a WGAN-GP the discriminator’s [...] we followed the standard approach of 5 to 1 updates ratio.

For HiFi-WaveGAN we followed the approach suggested in [26], opting for least squares GAN (LS-GAN) [29], where the binary cross-entropy terms of the original GAN [30] are replaced with least squares losses. The authors of [26] included additional losses for the generator; specifically, a mel-spectrogram loss and a feature matching loss. The mel-spectrogram loss measures the $\ell_1$-distance between the mel-spectrograms of a synthesised and a ground-truth waveform. We discarded this term since, unlike HiFi-GAN, our generator was not designed and trained to synthesise waveforms from ground-truth spectrograms. However, we kept the feature matching loss, which measures the $\ell_1$-distance of the features extracted at every level of each sub-discriminator, between real and generated samples. See [26] for a more detailed description of each loss term. The final losses for our HiFi-WaveGAN were:

$$L_G = L_{Adv}(G; D) + \lambda_{fm} L_{FM},$$
$$L_D = L_{Adv}(D; G),$$

where $L_G$ and $L_D$ are the generator and discriminator total losses, with $L_{Adv}(G; D)$ and $L_{Adv}(D; G)$ the adversarial loss terms for generator and discriminator and $\lambda_{fm} L_{FM}$ the feature matching loss. Both architectures were trained for 120k batches, with a batch size of 16 and a learning rate of 0.0001. WaveGAN used the Adam optimiser, while HiFi-WaveGAN used AdamW.

3. OBJECTIVE EVALUATION

There are no formalised methods to reliably evaluate the quality and diversity of synthesised audio, but there are several metrics which are commonly adopted to analyse and compare neural synthesis models. We followed a similar approach to [16] for objective evaluation, where the authors relied on Inception Score (IS), Kernel Inception Distance (KID) and Fréchet Audio Distance (FAD). We also relied on the maximum mean discrepancy (MMD) [31] as a measure of similarity between real and synthesised samples, using the same formulation adopted in [32], and computed the MMD using the $\ell_1$-distance between OpenL3 embeddings [33] (env, mel128, 512)³.

To analyse whether our models were capable of learning the real data distribution - and to verify that they were not affected by overfitting or mode collapse - we used principal components analysis (PCA) on the OpenL3 embeddings of real and synthesised samples. We generated 1000 samples for each class and plotted the results of PCA in Fig. 4. It can be seen that HiFi-WaveGAN reached a better approximation of the real data without collapsing into synthesising only samples that were seen during training. However, there is currently no way to establish a correlation between distance in the embedding space and distance in terms of human perception. It is therefore not possible to comment on the diversity of synthesised samples from these results.

Instead, we used IS, which gave us a way to measure diversity and semantic discriminability. This measure ranges from 1 to n (with n the number of classes) and it is maximised for models which can generate samples for all possible classes and whose samples are classified with high confidence. As classifier, we used the same Inception Net variant developed in [16] and adapted it to our domain. We trained the model on a separate dataset (Zapsplat Misc), obtained by scraping all the freely available footsteps samples on the Zapsplat website⁴ and organising them into 5 classes depending on the surface material (carpet/rug, deck/boardwalk, metal, pavement/concrete, wood/wood internal). Likewise, we merged the training (Zapsplat Heels) and generated data into the same 5 classes by grouping carpet/rug and wood/wood internal samples together.

We trained the Inception Net for 100 epochs, reaching 86% validation accuracy. The results are shown in Table 1. It is interesting to notice how the IS for HiFi-WaveGAN is slightly higher than it is for the training data (Zapsplat Heels). Assuming that the synthesised data cannot reach a higher semantic discriminability than the data it learned from, the score should be related to diversity, which in turn might depend on two factors. First, we evaluated IS on a number of HiFi-WaveGAN samples orders of magnitude greater than the training data (3500 versus 81). Excluding the case of mode collapse, this will inherently carry higher diversity. Second, HiFi-WaveGAN is actually able to synthesise samples not seen during training, and could therefore lead to a more diverse set. However, it is not possible to comment on the “perceptual” diversity of the generated samples based on IS only.

    Dataset           IS
    Zapsplat Misc     4.139
    Zapsplat Heels    3.221
    HiFi-WaveGAN      3.411
    WaveGAN           3.232

Table 1: Inception Score for training and generated data.

Fig. 3 shows the other metrics we adopted to compare real (Misc and Heels) and synthesised (HiFi and Wave) data. For clarity, we represent FAD, KID and MMD on 3 graphs, where the distance between nodes is roughly proportional to the value of each metric (written on the edges). The 3 metrics, which are based on comparing embedding distributions (VGG-ish for FAD, Inception for KID and OpenL3 for MMD), all depict a similar picture where HiFi-WaveGAN seems to better approximate the training data. FAD and KID also place HiFi-WaveGAN samples nearer to the Misc dataset, with greater diversity than the training data. FAD, correlating with human judgement, is also a measure of the perceived quality of individual sounds (the lower the FAD, the higher the quality). In our case, HiFi-WaveGAN also seems to score higher in terms of quality.

Fig. 3: Graphs representing Fréchet Audio Distance (A), Kernel Inception Distance (B) and Maximum Mean Discrepancy (C) for our models.

4. SUBJECTIVE EVALUATION

For the subjective evaluation we followed a similar paradigm to [34]. The Web Audio Evaluation Tool [35] was adopted to run an audio perceptual evaluation (APE) [36]: a multi-stimulus test in which a participant was presented with a series of samples to be compared and rated on a continuous scale from 0 to 1. On this scale, 4 reference values (very unrealistic, somewhat unrealistic, somewhat realistic, very realistic) were given at the 0, 0.33, 0.66 and 1 points, respectively. Eight synthesis methods were compared to real recordings:

    1. Procedural model 1 (PM1) by Fontana and Bresin [11]
    2. Procedural model 2 (PM2) by Farnell [13]⁵
    3. Procedural model 3 (PM3) from Nemisindo [37]⁶
    4. Sinusoidal plus stochastic (SPS) by Amatriain et al. [38]⁷
    5. Statistical modelling (STAT) by McDermott et al. [39]⁸
    6. Additive synthesis (ADD) by Verron et al. [40]⁹
    7. WaveGAN (WAVE)
    8. HiFi-WaveGAN (HIFI)

To present a more realistic and reliable scenario we prepared 10 s long walks by concatenating single samples. We started from real recordings and chose - through informal listening - the time interval between samples that gave the most realistic result. The same pace was then replicated for all the other synthesis methods. We prepared a total of 10 series of 9 walks (1 for each synthesis method plus the real recordings), and presented each participant with 5 of these series to be able to compare many different conditions (i.e., shoe types and surface materials) while keeping the test short.
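The construction of the 10 s walks described in Section 4 can be illustrated with a minimal sketch. It is not the authors' tooling: the function name `build_walk`, the 0.5 s interval used in the usage note, and the choice to sum samples that overlap are all assumptions made for illustration. The paper only states that the interval was chosen by informal listening and that the audio is sampled at 16 kHz.

```python
from itertools import cycle


def build_walk(steps, interval_s, duration_s=10.0, sample_rate=16000):
    """Place footstep samples at a fixed pace inside a silent buffer.

    steps is a list of sample lists; they are cycled through in order.
    Returns the walk and the onset index of every placed step.
    """
    total = int(round(duration_s * sample_rate))
    hop = int(round(interval_s * sample_rate))
    if hop <= 0 or not steps:
        raise ValueError("need at least one step and a positive interval")

    walk = [0.0] * total
    onsets = []
    start = 0
    for step in cycle(steps):
        # Stop as soon as the next step would not fit entirely in the walk.
        if start + len(step) > total:
            break
        # Assumption: overlapping tails are mixed by simple addition.
        for i, value in enumerate(step):
            walk[start + i] += value
        onsets.append(start)
        start += hop
    return walk, onsets
```

For example, `build_walk([step_a, step_b], interval_s=0.5)` alternates two hypothetical footstep samples every 0.5 s. Reusing the same interval for every synthesis method mirrors the paper's choice of replicating one pace across all conditions.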
Fig. 4: Scatter plots of principal components analysis on OpenL3 embeddings for: A) Training Data, B) HiFi-WaveGAN and C) WaveGAN.
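The claim in Section 3 that the Inception Score ranges from 1 to n can be checked with a small sketch. It assumes the standard definition, IS = exp(E_x[KL(p(y|x) || p(y))]), which the paper does not spell out; the function name and the toy probability vectors are hypothetical and stand in for the outputs of the Inception Net classifier.

```python
import math


def inception_score(probabilities):
    """Inception Score from per-sample class probabilities p(y|x).

    probabilities is a list of rows, each row summing to 1 over n classes.
    """
    n_samples = len(probabilities)
    n_classes = len(probabilities[0])

    # Marginal class distribution p(y), averaged over all samples.
    marginal = [
        sum(row[c] for row in probabilities) / n_samples
        for c in range(n_classes)
    ]

    # Mean KL divergence between each p(y|x) and the marginal p(y).
    total_kl = 0.0
    for row in probabilities:
        for c in range(n_classes):
            if row[c] > 0.0:  # terms with p = 0 contribute nothing
                total_kl += row[c] * math.log(row[c] / marginal[c])
    return math.exp(total_kl / n_samples)
```

With confident predictions spread evenly over 5 classes the score reaches its maximum of 5, while a model whose samples all fall in one class, or whose predictions are all uniform, scores 1.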

4.1. Results

A total of 19 participants took part in the online test¹⁰. Of these, 10 identified as male, 6 as female and 3 preferred not to indicate their gender. Three participants were excluded since they had no previous experience with critical listening tests. Of the remaining participants, all but 1 had experience as: musicians (15 out of 16, µ = 9.7, σ = 7.6 and max = 23 years), sound engineers (11 out of 16, µ = 4.6, σ = 7.6, max = 15 years) and sound designers (7 out of 16, µ = 1.7, σ = 2.9, max = 10 years). We also enquired about the headphone models and verified that all participants used good quality devices.

Two criteria were adopted to judge the reliability of each rating. We used one of the procedural models (PM1) as an anchor, and excluded all cases where this model was rated above 0.1. Also, we considered real recordings to be the reference, and excluded all cases where they were rated below 0.5.

Fig. 5: Results of the subjective evaluation.
[Plot of ratings on a scale from 0.0 to 1.0 for PM1, PM2, PM3, STAT, SPS, ADD, WAVE, HIFI and REAL.]

Final results are shown in Fig. 5. Together with the anchor (PM1), Farnell’s model (PM2), as well as statistical and sinusoidal modelling, are the lowest rated. This result is not surprising since Farnell’s procedural model is a fairly basic implementation, which lacks the necessary output quality. Statistical modelling is usually adopted for texture synthesis and, even in those situations, it shows good results only for very specific cases (see [34, 39]). Also, sinusoidal modelling is more suitable for harmonic sounds (e.g. musical instruments) where the broadband, noisy components play a minor role in the overall spectrum. Nemisindo (PM3), being a more recent and advanced procedural model, scores higher than the other two; and additive synthesis, by synthesising sounds as a sum of five core elements (modal impact, noisy impact, chirped impact, band-limited noise, equalised noise), is rated even higher. Both WaveGAN and HiFi-WaveGAN score as high as the reference. This result shows that the two neural synthesis models manage to capture all the important details of the training data, and allow the generated samples to reach synthesis quality and realism comparable to recorded audio.

5. DISCUSSION

We presented a first attempt at neural synthesis of footsteps - one of the most common and challenging sound effects in sound design. In this work, two GAN architectures were implemented: a standard conditional WaveGAN, and a hybrid consisting of a conditional WaveGAN generator and a conditional HiFi-GAN discriminator. The hybrid architecture improves the audio quality of generated samples and better approximates the training data distribution. Contrary to what is commonly suggested in the literature, upsampling with zero-stuffing gave us better results than nearest neighbour interpolation. The same is true for linear or cubic interpolation.

Both objective and subjective evaluation of results were conducted. It is not common for neural synthesis methods to be compared with “traditional” synthesis algorithms in a listening test, which makes it impossible to establish whether a neural synthesis method can actually reach state-of-the-art results. In this work, we compared the two architectures with 6 other methods as well as recorded samples. The two architectures reached “realism” ratings as high as real sounds. From informal listening tests, however, we would have expected a greater difference in the ratings. In fact, samples synthesised by WaveGAN are affected by a perceivable amount of background noise and distortion, while the opposite is true for HiFi-WaveGAN¹¹.

This opens questions about the definition of realism of sound effects, to what extent audio quality correlates with perceived realism, and what other aspects play a significant role in such judgements. Future work will focus on increasing the degrees of control and diversity of synthesised samples, while retaining the audio quality.

6. ACKNOWLEDGEMENTS

Funded by UKRI and EPSRC as part of the “UKRI CDT in Artificial Intelligence and Music”, under grant EP/S022694/1.
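As a supplementary illustration of the upsampling discussed in Sections 2.2 and 5, the sketch below checks numerically that zero-stuffing followed by a 1-D convolution gives the same output as a 1-D transposed convolution whose stride equals the upsampling rate. It is a pure-Python toy, not the authors' implementation. The function names are hypothetical, a full (unpadded) true convolution is assumed, and deep learning frameworks typically differ from this by a kernel flip and by their padding conventions.

```python
def zero_stuff(x, rate):
    """Insert rate - 1 zeros between consecutive samples of x."""
    out = [0.0] * ((len(x) - 1) * rate + 1)
    for n, value in enumerate(x):
        out[n * rate] = value
    return out


def full_convolution(x, kernel):
    """Full 1-D convolution with stride 1 and no cropping."""
    out = [0.0] * (len(x) + len(kernel) - 1)
    for n, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[n + k] += value * weight
    return out


def transposed_convolution(x, kernel, stride):
    """1-D transposed convolution: each input sample adds a scaled copy
    of the kernel to the output, shifted by stride per sample."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for n, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[n * stride + k] += value * weight
    return out
```

Calling `full_convolution(zero_stuff(x, 4), kernel)` and `transposed_convolution(x, kernel, 4)` on the same inputs returns the same sequence. This is why only the nearest neighbour, linear and cubic variants need a separate upsampling step before a stride-1 convolution.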