NEURAL SYNTHESIS OF FOOTSTEPS SOUND EFFECTS WITH GENERATIVE ADVERSARIAL NETWORKS
Marco Comunità, Huy Phan, Joshua D. Reiss
Centre for Digital Music, Queen Mary University of London, UK

arXiv:2110.09605v2 [cs.SD] 10 Dec 2021

ABSTRACT

Footsteps are among the most ubiquitous sound effects in multimedia applications. There is substantial research into understanding the acoustic features and developing synthesis models for footstep sound effects. In this paper, we present a first attempt at adopting neural synthesis for this task. We implemented two GAN-based architectures and compared the results with real recordings as well as six traditional sound synthesis methods. Our architectures reached realism scores as high as recorded samples, showing encouraging results for the task at hand.

Index Terms— footsteps, neural synthesis, sound effects, GAN

1. INTRODUCTION

When sound designers are given the task of creating sound effects for movies, video games or radio shows, they have essentially two options: pre-recorded samples or procedural audio. In the first case, they usually rely on large libraries of high quality audio recordings, from which they have to select, edit and mix samples for each event they need to sonify. For a realistic result, especially in video games where the same actions are repeated many times, several samples are selected for every event and randomised during action. This creates challenges in terms of memory requirements, asset management and implementation time.

Alternatively, procedural audio [1] aims to synthesise sound effects in real time, based on a set of input parameters. In the context of video games, these parameters might come from the specific interaction of a character with the environment. This approach presents challenges in terms of developing procedural models which can synthesise high quality and realistic audio, as well as finding the right parameter values for each sound event.

Footstep sounds are a typical example of the challenges of sound design. They are omnipresent and generally fairly repetitive, yet the human ear is constantly trained on them. In fact, it is possible for a listener to identify, from recorded or synthesised footsteps, attributes such as gender [2], emotions [3], posture [4], identity [5], ground materials [6], and type of locomotion [7].

Thus, it is not surprising that researchers put substantial effort into understanding the acoustic features [8] and developing synthesis models of footstep sounds. Cook [9] made a first attempt at synthesising footsteps on different surfaces, based on his previous work on physically-informed stochastic models (PhISM) [10]. Another physically-informed model - based on using a stochastic controller to drive sums of microscopic impacts - was proposed by Fontana and Bresin [11]. DeWitt and Bresin [12] proposed a model that included the user's emotion as a parameter. Farnell [13], instead, developed a procedural model by studying the characteristics of locomotion in primates. In [14], Turchet developed physical and physically-inspired models coupled with additive synthesis and signal multiplication.

To this day, there has not yet been an attempt at exploring the use of neural networks for the synthesis of footsteps sounds, although there is substantial literature exploring neural synthesis of broadband impulsive sounds, such as drum samples, which have some similarities to footsteps. One of the first attempts was in [15], where Donahue et al. developed WaveGAN - a generative adversarial network for unconditional audio synthesis. Another example of neural synthesis of drums is [16], where the authors used a Progressive Growing GAN. Variational autoencoders [17] and U-Nets [18] have also been used for the same task. However, the application of recent developments to neural synthesis of sound effects is yet to be explored. We could only find one other work [19], related to the present study, where the authors focused on synthesis of knocking sounds with emotional content using a conditional WaveGAN.

In this paper we describe the first attempt at neural synthesis of footsteps sound effects. In Section 2, we propose a hybrid architecture that improves the quality and better approximates the real data distribution with respect to a standard conditional WaveGAN. Objective evaluation of this architecture is provided in Section 3. Section 4 reports on the first listening test that compares "traditional" and neural synthesis models on the task at hand. Discussion and conclusions are given in Section 5.

2. ARCHITECTURE

2.1. Data

To train our models we curated a small dataset (81 samples) using free footsteps sounds available on the Zapsplat website¹. These are high quality samples, recorded by Foley artists using a single type of shoes (women, high heels) on seven surfaces (carpet, deck, metal, pavement, rug, wood and wood internal). The samples were converted to WAV file format, resampled at 16kHz, time aligned and normalised to -6dBFS.

¹ https://www.zapsplat.com/sound-effect-packs/footsteps-in-high-heels
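For illustration, a minimal sketch of this preprocessing step is given below: it resamples a recording to 16 kHz and peak-normalises it to -6 dBFS. The file names and the choice of librosa/soundfile are assumptions made for the example (not the toolchain used for the paper), and the manual time alignment of onsets is not covered.

    # Preprocessing sketch: resample to 16 kHz and peak-normalise to -6 dBFS.
    # File paths and libraries (librosa, soundfile) are illustrative assumptions.
    import librosa
    import numpy as np
    import soundfile as sf

    TARGET_SR = 16000
    TARGET_PEAK_DBFS = -6.0

    def preprocess(in_path, out_path):
        # librosa resamples on load when sr is given; mono=True mixes down if needed
        audio, sr = librosa.load(in_path, sr=TARGET_SR, mono=True)
        peak = np.max(np.abs(audio))
        if peak > 0:
            target_peak = 10.0 ** (TARGET_PEAK_DBFS / 20.0)  # -6 dBFS ~= 0.501
            audio = audio * (target_peak / peak)
        sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")  # 16-bit WAV

    preprocess("heels_wood_01.wav", "heels_wood_01_16k.wav")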
2.2. Generator

The original WaveGAN generator [15] was designed to synthesise 16384 samples (∼1s at 16kHz sampling frequency), and was afterwards extended to 32768 or 65536 samples² (∼2s and ∼4s at 16kHz). We adapted the architecture to synthesise 8192 samples, which is sufficient to capture all footsteps samples in our dataset. As shown in Fig. 1, the generator expands the latent variable z to the final audio output size. After reshaping and concatenation with the conditioning label [19, 20], the output is synthesised by passing the input through 5 upsampling 1-D convolutional layers. Upsampling is obtained by zero-stuffing, or by nearest neighbour, linear or cubic interpolation. Zero-stuffing plus 1-D convolution is equivalent to using a 1-D transposed convolution with stride equal to the upsampling rate. In the other cases, we split the operation into upsampling and 1-D convolution with a stride of 1. Every convolutional layer uses a kernel size of 25, as in the original design. Differently from the original, we included batch normalisation layers after each convolution. With the exception of the last layer, the number of output channels is halved at each convolution layer.

² https://github.com/chrisdonahue/wavegan

Fig. 1: WaveGAN generator.
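To make the two upsampling options concrete, the sketch below shows a single generator up-block (×4 upsampling, kernel size 25, batch normalisation and ReLU) in the two flavours described above: zero-stuffing implemented as a strided transposed convolution, and interpolation followed by a stride-1 convolution. The padding values and channel sizes are illustrative assumptions chosen so that the output length is exactly four times the input length; they are not taken from the authors' code.

    # Sketch of one generator up-block in the two variants discussed in the text.
    import torch
    import torch.nn as nn

    class UpBlock(nn.Module):
        def __init__(self, in_ch, out_ch, mode="zero_stuffing", factor=4, kernel=25):
            super().__init__()
            if mode == "zero_stuffing":
                # Zero-stuffing + convolution == transposed convolution with stride=factor.
                # Padding/output_padding chosen so that L_out = factor * L_in.
                pad = (kernel - factor) // 2 + 1
                out_pad = factor + 2 * pad - kernel
                self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel, stride=factor,
                                             padding=pad, output_padding=out_pad)
                self.conv = nn.Identity()
            else:
                # "nearest" or "linear" interpolation followed by a stride-1 convolution
                self.up = nn.Upsample(scale_factor=factor, mode=mode)
                self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=1, padding=kernel // 2)
            self.bn = nn.BatchNorm1d(out_ch)
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(self.bn(self.conv(self.up(x))))

    x = torch.randn(1, 1024, 8)                      # (batch, channels, time) after the reshape
    y = UpBlock(1024, 512, mode="zero_stuffing")(x)
    print(y.shape)                                   # torch.Size([1, 512, 32])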
Fig. 2: WaveGAN discriminator.

2.3. Discriminator

Recent results in the field of neural speech synthesis [21–26] have shown how GAN-based vocoders are capable of reaching state-of-the-art results in terms of mean opinion scores. You et al. [27] hypothesised that this success is not related to specific design choices or training strategies; instead, they attributed it to the multi-resolution discriminating framework. In their work, the authors trained 6 different generators using the same discriminator (HiFi-GAN [26]), reaching very similar performance independently of the specific generator.

We experimented with a similar approach by implementing conditional versions of the WaveGAN and HiFi-GAN discriminators. Our WaveGAN discriminator was again adapted from the original architecture to 8192 samples, and is based on 5 1-D convolutional layers with a stride of 4 (see Fig. 2). The HiFi-GAN discriminator is made of two separate discriminators (multi-scale and multi-period), each of which is made of several sub-discriminators that work with inputs of different resolutions. The multi-scale discriminator works on raw audio, ×2 average-pooled and ×4 average-pooled audio (i.e., a downsampled and smoothed version of the original signal). The multi-period discriminator works on "equally spaced samples of an input audio; the space is given as period p". The periods are set in the original model to 2, 3, 5, 7 and 11.
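A minimal sketch of one possible reading of the conditional WaveGAN discriminator in Fig. 2 is shown below: the label is embedded, projected to an 8192-sample "label channel", stacked with the waveform, and passed through five stride-4 convolutions followed by a linear layer producing a single score. The embedding size, padding and other details are assumptions drawn from the figure rather than from released code.

    # Sketch of the conditional WaveGAN discriminator (cf. Fig. 2); details assumed.
    import torch
    import torch.nn as nn

    class CondWaveGANDiscriminator(nn.Module):
        def __init__(self, n_labels=7, emb_dim=20, n_samples=8192):
            super().__init__()
            self.embed = nn.Embedding(n_labels, emb_dim)    # label -> embedding vector
            self.project = nn.Linear(emb_dim, n_samples)    # embedding -> label channel
            chans = [2, 64, 128, 256, 512, 1024]            # 2 input channels: audio + label
            self.convs = nn.ModuleList(
                nn.Conv1d(chans[i], chans[i + 1], kernel_size=25, stride=4, padding=11)
                for i in range(5)
            )
            self.act = nn.LeakyReLU(0.2)
            self.out = nn.Linear(1024 * (n_samples // 4 ** 5), 1)   # (1024, 8) -> scalar

        def forward(self, x, label):
            lab = self.project(self.embed(label)).unsqueeze(1)      # (B, 1, 8192)
            h = torch.cat([x, lab], dim=1)                          # (B, 2, 8192)
            for conv in self.convs:
                h = self.act(conv(h))
            return self.out(h.flatten(1))                           # one score per example

    D = CondWaveGANDiscriminator()
    score = D(torch.randn(4, 1, 8192), torch.randint(0, 7, (4,)))
    print(score.shape)   # torch.Size([4, 1])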
2.4. Loss and Training Procedure

We trained the two architectures - WaveGAN generator and discriminator (referred to as WaveGAN), and WaveGAN generator and HiFi-GAN discriminator (referred to as HiFi-WaveGAN) - using different paradigms.

In the first case we opted for a Wasserstein GAN with gradient penalty [28] (WGAN-GP), which has been shown to improve training stability and to help convergence towards a minimum that better approximates the real data distribution, improving the quality of synthesised samples. When training a WGAN-GP, the discriminator's weights are updated several times for each update of the generator; we followed the standard approach of a 5 to 1 update ratio.

For HiFi-WaveGAN we followed the approach suggested in [26], opting for a least squares GAN (LS-GAN) [29], where the binary cross-entropy terms of the original GAN [30] are replaced with least squares losses. [26] included additional losses for the generator; specifically, a mel-spectrogram loss and a feature matching loss. The mel-spectrogram loss measures the ℓ1-distance between the mel-spectrograms of a synthesised and a ground-truth waveform. We discarded this term since, differently from HiFi-GAN, our generator was not designed and trained to synthesise waveforms from ground-truth spectrograms. However, we kept the feature matching loss, which measures the ℓ1-distance between the features extracted at every level of each sub-discriminator for real and generated samples. See [26] for a more detailed description of each loss term. The final losses for our HiFi-WaveGAN were:

    L_G = L_Adv(G; D) + λ_fm · L_FM,
    L_D = L_Adv(D; G),

where L_G and L_D are the generator and discriminator total losses, L_Adv(G; D) and L_Adv(D; G) are the adversarial loss terms for generator and discriminator, and λ_fm · L_FM is the weighted feature matching loss.

Both architectures were trained for 120k batches, with a batch size of 16 and a learning rate of 0.0001. WaveGAN used the Adam optimiser, while HiFi-WaveGAN used AdamW.
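Written out as code, the HiFi-WaveGAN objective above could look roughly as follows: least-squares adversarial terms plus an ℓ1 feature-matching term over the discriminator's intermediate activations. This is a sketch of the standard LS-GAN / feature-matching formulation from [26, 29]; it assumes a discriminator that returns both its score and a list of intermediate feature maps, and the value of λ_fm is illustrative.

    # Sketch of LS-GAN adversarial losses and the l1 feature-matching loss
    # (L_G = L_Adv + lambda_fm * L_FM, L_D = L_Adv); lambda_fm is illustrative.
    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real_score, d_fake_score):
        # LS-GAN: push real scores towards 1 and fake scores towards 0
        return torch.mean((d_real_score - 1.0) ** 2) + torch.mean(d_fake_score ** 2)

    def generator_adversarial_loss(d_fake_score):
        # LS-GAN: the generator tries to push fake scores towards 1
        return torch.mean((d_fake_score - 1.0) ** 2)

    def feature_matching_loss(real_feats, fake_feats):
        # l1 distance between discriminator activations for real and generated audio,
        # averaged over layers (and over sub-discriminators, if several are used)
        loss = 0.0
        for fr, ff in zip(real_feats, fake_feats):
            loss = loss + F.l1_loss(ff, fr)
        return loss / len(real_feats)

    def generator_loss(d_fake_score, real_feats, fake_feats, lambda_fm=2.0):
        return generator_adversarial_loss(d_fake_score) + \
               lambda_fm * feature_matching_loss(real_feats, fake_feats)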
3. OBJECTIVE EVALUATION

There are no formalised methods to reliably evaluate the quality and diversity of synthesised audio, but there are several metrics which are commonly adopted to analyse and compare neural synthesis models. We followed a similar approach to [16] for objective evaluation, where the authors relied on the Inception Score (IS), Kernel Inception Distance (KID) and Fréchet Audio Distance (FAD). We also relied on the maximum mean discrepancy (MMD) [31] as a measure of similarity between real and synthesised samples, using the same formulation adopted in [32], and computed the MMD using the ℓ1-distance between OpenL3 embeddings [33] (env, mel128, 512)³.

³ https://github.com/torchopenl3/torchopenl3

To analyse whether our models were capable of learning the real data distribution - and to verify that they were not affected by overfitting or mode collapse - we used principal components analysis (PCA) on the OpenL3 embeddings of real and synthesised samples. We generated 1000 samples for each class and plot the results of the PCA in Fig. 4. It can be seen that HiFi-WaveGAN reached a better approximation of the real data without collapsing into synthesising only samples that were seen during training. However, there is currently no way to establish a correlation between distance in the embedding space and distance in terms of human perception. It is therefore not possible to comment on the diversity of synthesised samples from these results.

Instead, we used the IS, which gave us a way to measure diversity and semantic discriminability. This measure ranges from 1 to n (with n the number of classes) and is maximised for models which can generate samples for all possible classes and whose samples are classified with high confidence. As classifier, we used the same Inception Net variant developed in [16] and adapted it to our domain. We trained the model on a separate dataset (Zapsplat Misc) obtained by scraping all the freely available footsteps samples on the Zapsplat website⁴ and organising them into 5 classes depending on the surface material (carpet/rug, deck/boardwalk, metal, pavement/concrete, wood/wood internal). Likewise, we merged the training (Zapsplat Heels) and generated data into the same 5 classes by grouping carpet/rug and wood/wood internal samples together.

⁴ https://www.zapsplat.com
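For reference, the standard definition of the Inception Score is IS = exp(E_x[KL(p(y|x) || p(y))]), computed from the classifier's class posteriors. A small sketch, assuming the posteriors are already available as an (n_samples, n_classes) array, is given below; it is a generic implementation of this definition, not the authors' evaluation code.

    # Standard Inception Score from class posteriors p(y|x) of shape (N, n_classes).
    import numpy as np

    def inception_score(p_yx, eps=1e-12):
        p_y = p_yx.mean(axis=0, keepdims=True)                  # marginal p(y)
        kl = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))    # KL(p(y|x) || p(y))
        return float(np.exp(kl.sum(axis=1).mean()))             # exp of the mean KL

    # Example: 1000 posteriors over the 5 surface classes
    probs = np.random.dirichlet(np.ones(5), size=1000)
    print(inception_score(probs))   # ranges from 1 (worst) to 5 (best) for 5 classes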
We trained the Inception Net for 100 epochs, reaching 86% validation accuracy. The results are shown in Table 1. It is interesting to notice how the IS for HiFi-WaveGAN is slightly higher than it is for the training data (Zapsplat Heels). Assuming that the synthesised data cannot reach a higher semantic discriminability than the data it learned from, the score should be related to diversity, which in turn might depend on two factors. First, we evaluated the IS on a number of HiFi-WaveGAN samples orders of magnitude greater than the training data (3500 versus 81). Excluding the case of mode collapse, this will inherently carry higher diversity. And second, HiFi-WaveGAN is actually able to synthesise samples not seen during training, and could therefore lead to a more diverse set. However, it is not possible to comment on the "perceptual" diversity of the generated samples based on the IS only.

Table 1: Inception Score for training and generated data.

    Dataset          IS
    Zapsplat Misc    4.139
    Zapsplat Heels   3.221
    HiFi-WaveGAN     3.411
    WaveGAN          3.232

Fig. 3: Graphs representing Fréchet Audio Distance (A), Kernel Inception Distance (B) and Maximum Mean Discrepancy (C) for our models.

Fig. 3 shows the other metrics we adopted to compare real (Misc and Heels) and synthesised (HiFi and Wave) data. For clarity, we represent FAD, KID and MMD on 3 graphs, where the distance between nodes is roughly proportional to each metric's values (written on the edges). The 3 metrics, which are based on comparing embedding distributions (VGG-ish for FAD, Inception for KID and OpenL3 for MMD), all depict a similar picture in which HiFi-WaveGAN seems to better approximate the training data. FAD and KID also place HiFi-WaveGAN samples nearer to the Misc dataset, again suggesting that the model is capable of synthesising samples with greater diversity than the training data. FAD, correlating with human judgement, is also a measure of the perceived quality of individual sounds (the lower the FAD, the higher the quality). In our case, HiFi-WaveGAN also seems to score higher in terms of quality.

4. SUBJECTIVE EVALUATION

For the subjective evaluation we followed a similar paradigm to [34]. The Web Audio Evaluation Tool [35] was adopted to run an audio perceptual evaluation (APE) [36]; a multi-stimulus test in which a participant was presented with a series of samples to be compared and rated on a continuous scale from 0 to 1. On this scale, 4 reference values (very unrealistic, somewhat unrealistic, somewhat realistic, very realistic) were given at the 0, 0.33, 0.66 and 1 points, respectively. Eight synthesis methods were compared to real recordings:

1. Procedural model 1 (PM1) by Fontana and Bresin [11]
2. Procedural model 2 (PM2) by Farnell [13]⁵
3. Procedural model 3 (PM3) from Nemisindo [37]⁶
4. Sinusoidal plus stochastic (SPS) by Amatriain et al. [38]⁷
5. Statistical modelling (STAT) by McDermott et al. [39]⁸
6. Additive synthesis (ADD) by Verron et al. [40]⁹
7. WaveGAN (WAVE)
8. HiFi-WaveGAN (HIFI)

⁵ http://aspress.co.uk/sd/practical26.html
⁶ https://nemisindo.com/
⁷ https://www.dafx.de/DAFX Book Page 2nd edition/chapter10.html
⁸ https://mcdermottlab.mit.edu/downloads.html
⁹ http://www.charlesverron.com/spad.html

To present a more realistic and reliable scenario we prepared 10 s long walks by concatenating single samples. We started from real recordings and chose - through informal listening - the time interval between samples that gave the most realistic result. The same pace was then replicated for all the other synthesis methods. We prepared a total of 10 series of 9 walks (1 for each synthesis method plus the real recordings), and presented each participant with 5 of these series, to be able to compare many different conditions (i.e., shoe types and surface materials) while keeping the test short.
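As an illustration of how such stimuli can be assembled, the sketch below places individual footstep samples at a fixed inter-step interval into a 10 s buffer. The 0.5 s interval and the data layout are assumptions made for the example; in the study the pace was chosen by informal listening on the real recordings.

    # Sketch: build a 10 s "walk" by placing footstep samples at a fixed interval.
    import numpy as np

    def make_walk(steps, sr=16000, duration_s=10.0, interval_s=0.5):
        walk = np.zeros(int(duration_s * sr), dtype=np.float32)
        onset, i = 0, 0
        while onset < len(walk):
            step = steps[i % len(steps)]                 # cycle through the samples
            end = min(onset + len(step), len(walk))
            walk[onset:end] += step[: end - onset]       # overlap-add the footstep
            onset += int(interval_s * sr)
            i += 1
        return walk

    # steps: list of 1-D arrays (e.g. 8192-sample footsteps at 16 kHz)
    steps = [np.random.randn(8192).astype(np.float32) * 0.1 for _ in range(5)]
    print(make_walk(steps).shape)   # (160000,)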
Fig. 4: Scatter plots of principal components analysis on OpenL3 embeddings for: A) Training Data, B) HiFi-WaveGAN and C) WaveGAN.

4.1. Results

A total of 19 participants took part in the online test¹⁰. Of these, 10 identified as male, 6 as female and 3 preferred not to indicate their gender. 3 participants were excluded since they had no previous experience with critical listening tests. Of the remaining participants, all but 1 had experience as: musicians (15 out of 16, µ = 9.7, σ = 7.6, max = 23 years), sound engineers (11 out of 16, µ = 4.6, σ = 7.6, max = 15 years) and sound designers (7 out of 16, µ = 1.7, σ = 2.9, max = 10 years). We also enquired about the headphone models used and verified that all participants used good quality devices.

¹⁰ http://webprojects.eecs.qmul.ac.uk/mc309/FootEval/test.html?url=tests/ape footsteps.xml

Two criteria were adopted to judge the reliability of each rating. We used one of the procedural models (PM1) as an anchor, and excluded all cases where this model was rated above 0.1. Also, we considered the real recordings to be the reference, and excluded all cases where they were rated below 0.5.
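A minimal sketch of how these two screening rules could be applied to the collected ratings is given below; the dictionary layout (method name to rating, per trial) is our own assumption about how the data might be stored.

    # Screening sketch: drop trials where the anchor (PM1) was rated above 0.1
    # or the real recordings (REAL) below 0.5. The data layout is assumed.
    def is_reliable(trial):
        return trial["PM1"] <= 0.1 and trial["REAL"] >= 0.5

    trials = [
        {"PM1": 0.05, "REAL": 0.90, "HIFI": 0.85, "WAVE": 0.80},
        {"PM1": 0.30, "REAL": 0.90, "HIFI": 0.70, "WAVE": 0.60},  # rejected: anchor too high
    ]
    reliable = [t for t in trials if is_reliable(t)]
    print(len(reliable))   # 1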
Fig. 5: Results of the subjective evaluation.

Final results are shown in Fig. 5. Together with the anchor (PM1), Farnell's model (PM2), as well as statistical and sinusoidal modelling, are the lowest rated. This result is not surprising since Farnell's procedural model is a fairly basic implementation, which lacks the necessary output quality. Statistical modelling is usually adopted for texture synthesis, and even in those situations it shows good results only for very specific cases (see [34, 39]). Also, sinusoidal modelling is more suitable for harmonic sounds (e.g. musical instruments) where the broadband, noisy components play a minor role in the overall spectrum. Nemisindo (PM3), being a more recent and advanced procedural model, scores higher than the other two; and additive synthesis, by synthesising sounds as a sum of five core elements (modal impact, noisy impact, chirped impact, band-limited noise, equalised noise), is rated even higher. Both WaveGAN and HiFi-WaveGAN score as high as the reference. This result shows that the two neural synthesis models manage to capture all the important details of the training data, and allow the generated samples to reach synthesis quality and realism comparable to recorded audio.

5. DISCUSSION

We presented a first attempt at neural synthesis of footsteps - one of the most common and challenging sound effects in sound design. In this work, two GAN architectures were implemented: a standard conditional WaveGAN, and a hybrid consisting of a conditional WaveGAN generator and a conditional HiFi-GAN discriminator. The hybrid architecture improves the audio quality of generated samples and better approximates the training data distribution. Differently from what is commonly suggested in the literature, upsampling with zero-stuffing gave us better results than nearest neighbour interpolation. The same is true for linear or cubic interpolation.

Both objective and subjective evaluation of the results were conducted. It is not common for neural synthesis methods to be compared with "traditional" synthesis algorithms in a listening test, which makes it impossible to establish whether a neural synthesis method can actually reach state-of-the-art results. In this work, we compared the two architectures with 6 other methods as well as recorded samples. The two architectures reached "realism" ratings as high as real sounds, although, from informal listening tests, we would have expected a greater difference in the ratings. In fact, samples synthesised by WaveGAN are affected by a perceivable amount of background noise and distortion, while the opposite is true for HiFi-WaveGAN¹¹.

¹¹ https://mcomunita.github.io/hifi-wavegan-footsteps page/

This opens questions about the definition of realism of sound effects, to what extent audio quality correlates with perceived realism, and what other aspects play a significant role in such judgements. Following work will focus on increasing the degrees of control and diversity of synthesised samples, while retaining the audio quality.

6. ACKNOWLEDGEMENTS

Funded by UKRI and EPSRC as part of the "UKRI CDT in Artificial Intelligence and Music", under grant EP/S022694/1.
7. REFERENCES

[1] A. Farnell, Designing Sound, MIT Press, 2010.
[2] X. Li, R.J. Logan, et al., "Perception of acoustic source characteristics: Walking sounds," J. of the Acoustical Society of America, vol. 90, no. 6, 1991.
[3] B. Giordano and R. Bresin, "Walking and playing: What's the origin of emotional expressiveness in music," in Int. Conf. Music Perception and Cognition, 2006.
[4] R.E. Pastore, J.D. Flint, et al., "Auditory event perception: The source—perception loop for posture in human gait," Perception & Psychophysics, vol. 70, no. 1, 2008.
[5] K. Makela, J. Hakulinen, et al., "The use of walking sounds in supporting awareness," in Int. Conf. on Auditory Display, 2003.
[6] R. Nordahl, S. Serafin, et al., "Sound synthesis and evaluation of interactive footsteps for virtual reality applications," in IEEE Virtual Reality Conference, 2010.
[7] R. Bresin and S. Dahl, "Experiments on gestures: walking, running, and hitting," The Sounding Object, 2003.
[8] L. Turchet, D. Moffat, et al., "What do your footsteps sound like? An investigation on interactive footstep sounds adjustment," Applied Acoustics, vol. 111, 2016.
[9] P.R. Cook, "Modeling Bill's gait: Analysis and parametric synthesis of walking sounds," in AES Conf.: Virtual, Synthetic, and Entertainment Audio, 2002.
[10] P.R. Cook, "Physically informed sonic modeling (PhISM): Synthesis of percussive sounds," Computer Music J., vol. 21, no. 3, 1997.
[11] F. Fontana and R. Bresin, "Physics-based sound synthesis and control: crushing, walking and running by crumpling sounds," in Colloquium on Musical Informatics, 2003.
[12] A. DeWitt and R. Bresin, "Sound design for affective interaction," in Int. Conf. on Affective Computing and Intelligent Interaction, 2007.
[13] A.J. Farnell, "Marching onwards: procedural synthetic footsteps for video games and animation," in Pure Data Conv., 2007.
[14] L. Turchet, "Footstep sounds synthesis: design, implementation, and evaluation of foot–floor interactions, surface materials, shoe types, and walkers' features," Applied Acoustics, vol. 107, 2016.
[15] C. Donahue, J. McAuley, et al., "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.
[16] J. Nistal, S. Lattner, et al., "DrumGAN: Synthesis of drum sounds with timbral feature conditioning using generative adversarial networks," arXiv preprint arXiv:2008.12073, 2020.
[17] C. Aouameur, P. Esling, et al., "Neural drum machine: An interactive system for real-time synthesis of drum sounds," arXiv preprint arXiv:1907.02637, 2019.
[18] A. Ramires, P. Chandna, et al., "Neural percussive synthesis parameterised by high-level timbral features," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2020.
[19] R.A. Barahona and S. Pauletto, "Synthesising knocking sound effects using conditional WaveGAN," in Sound and Music Computing Conf., 2020.
[20] C.Y. Lee, A. Toffy, et al., "Conditional WaveGAN," arXiv preprint arXiv:1809.10636, 2018.
[21] K. Kumar, R. Kumar, et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," arXiv preprint arXiv:1910.06711, 2019.
[22] R. Yamamoto, E. Song, et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2020.
[23] W. Jang, D. Lim, et al., "Universal MelGAN: A robust neural vocoder for high-fidelity waveform generation in multiple domains," arXiv preprint arXiv:2011.09631, 2020.
[24] E. Song, R. Yamamoto, et al., "Improved Parallel WaveGAN vocoder with perceptually weighted spectrogram loss," in IEEE Spoken Language Technology Workshop, 2021.
[25] J. Yang, J. Lee, et al., "VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network," arXiv preprint arXiv:2007.15256, 2020.
[26] J. Kong, J. Kim, et al., "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," arXiv preprint arXiv:2010.05646, 2020.
[27] J. You, D. Kim, et al., "GAN vocoder: Multi-resolution discriminator is all you need," arXiv preprint arXiv:2103.05236, 2021.
[28] I. Gulrajani, F. Ahmed, et al., "Improved training of Wasserstein GANs," arXiv preprint arXiv:1704.00028, 2017.
[29] X. Mao, Q. Li, et al., "Least squares generative adversarial networks," in IEEE Int. Conf. on Computer Vision, 2017.
[30] I. Goodfellow, J. Pouget-Abadie, et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[31] A. Gretton, K.M. Borgwardt, et al., "A kernel two-sample test," J. of Machine Learning Research, vol. 13, no. 1, 2012.
[32] J. Turian, J. Shier, et al., "One billion audio sounds from GPU-enabled modular synthesis," arXiv preprint arXiv:2104.12922, 2021.
[33] J. Cramer, H. Wu, et al., "Look, listen, and learn more: Design choices for deep audio embeddings," in IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2019.
[34] D. Moffat and J.D. Reiss, "Perceptual evaluation of synthesized sound effects," ACM Trans. on Applied Perception, vol. 15, no. 2, 2018.
[35] N. Jillings, B. De Man, et al., "Web Audio Evaluation Tool: A browser-based listening test environment," 2015.
[36] B. De Man and J.D. Reiss, "APE: Audio perceptual evaluation toolbox for MATLAB," in AES Conv. 136, 2014.
[37] P. Bahadoran, A. Benito, et al., "FXive: A web platform for procedural sound synthesis," in AES Conv. 144, 2018.
[38] X. Amatriain, J. Bonada, et al., "Spectral processing," DAFX – Digital Audio Effects, 2002.
[39] J.H. McDermott and E.P. Simoncelli, "Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis," Neuron, vol. 71, no. 5, 2011.
[40] C. Verron, M. Aramaki, et al., "A 3-D immersive synthesizer for environmental sounds," IEEE Trans. on Audio, Speech, and Language Processing, vol. 18, no. 6, 2009.