DEFENDING YOUR VOICE: ADVERSARIAL ATTACK ON VOICE CONVERSION
Chien-yu Huang∗, Yist Y. Lin∗, Hung-yi Lee, Lin-shan Lee
College of Electrical Engineering and Computer Science, National Taiwan University, Taiwan
∗ These authors contributed equally.
arXiv:2005.08781v2 [eess.AS] 19 Jan 2021

ABSTRACT

Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies have also led to concerns about privacy and authentication. It thus becomes highly desirable to be able to prevent one's voice from being improperly utilized with such voice conversion technologies. This is why we report in this paper the first known attempt to perform adversarial attack on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances so as to sound as if they were produced by the defended speaker. Preliminary experiments were conducted on two currently state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. It was shown that the speaker characteristics of the converted utterances were made obviously different from those of the defended speaker, while the adversarial examples of the defended speaker are not distinguishable from the authentic utterances.

Index Terms— voice conversion, adversarial attack, speaker verification, speaker representation

1. INTRODUCTION

Voice conversion aims to alter some specific acoustic characteristics of an utterance, such as the speaker identity, while preserving the linguistic content. These technologies were made much more powerful by deep learning [1, 2, 3, 4, 5], but the improved technologies also led to concerns about privacy and authentication. One's identity may be counterfeited by voice conversion and exploited in improper ways, which is only one of the many deepfake problems generated by deep learning observed today, such as synthesized fake photos or fake voices. Detecting such artifacts or defending against such activities is thus increasingly important [6, 7, 8, 9], which applies equally to voice conversion.

On the other hand, it has been widely known that neural networks are fragile in the presence of some specific noise, or prone to yield different or incorrect results if the input is disturbed by subtle perturbations imperceptible to humans [10]. Adversarial attack is to generate such subtle perturbations that can fool the neural networks. It has been successful on some discriminative models [11, 12, 13], but is less reported on generative models [14].

In this paper, we propose to perform adversarial attack on voice conversion to prevent one's speaker characteristics from being improperly utilized with voice conversion. Human-imperceptible perturbations are added to the utterances produced by the speaker to be defended. Three different approaches, the end-to-end attack, embedding attack, and feedback attack, are proposed, such that the speaker characteristics of the converted utterances would be made very different from those of the defended speaker. We conducted objective and subjective evaluations on two recent state-of-the-art zero-shot voice conversion models. Objective speaker verification showed the converted utterances were significantly different from those produced by the defended speaker, which was then verified by a subjective similarity test. The effectiveness of the proposed approaches was also verified for black-box attack via a proxy model, closer to the real application scenario.

2. RELATED WORKS

2.1. Voice conversion

Traditionally, parallel data are required for voice conversion, i.e., the training utterances of the two speakers must be paired and aligned. To overcome this problem, Chou et al. [1] obtained disentangled representations respectively for linguistic content and speaker information with adversarial training; CycleGAN-VC [2] used cycle-consistency to ensure the converted speech to be linguistically meaningful with the target speaker's features; and StarGAN-VC [3] introduced conditional input for many-to-many voice conversion. All these are limited to speakers seen in training.

Zero-shot approaches then tried to convert utterances to any speaker given only one example utterance without fine-tuning, so the target speaker is not necessarily seen before. Chou et al. [4] employed adaptive instance normalization for this purpose; AUTOVC [5] integrated a pre-trained d-vector and an encoder bottleneck, achieving the state-of-the-art results.
Fig. 1: The encoder-decoder based voice conversion model and the three proposed approaches (end-to-end attack, Sec. 3.1; embedding attack, Sec. 3.2; feedback attack, Sec. 3.3). Perturbations are updated on the utterances providing speaker characteristics, as the blue dashed lines indicate.

2.2. Attacking and defending voice

Automatic speech recognition (ASR) systems have been shown to be prone to adversarial attacks. Applying perturbations on the waveforms, spectrograms, or MFCC features was able to make ASR systems fail to recognize the speech correctly [15, 16, 17, 18, 19]. Similar goals were achieved on speaker recognition by generating adversarial examples to fool automatic speaker verification (ASV) systems into predicting that these examples had been uttered by a specific speaker [20, 21, 22]. Different approaches for spoofing ASV were also proposed to show the vulnerabilities of such systems [23, 24, 25]. But to our knowledge, applying adversarial attacks to voice conversion has not been reported yet.

On the other hand, many approaches were proposed to defend one's voice when ASV systems were shown to be vulnerable to spoofing attacks [26, 27, 28, 29]. In addition to the ASVspoof challenges for spoofing techniques and countermeasures [30], Liu et al. [20] conducted adversarial attacks on those countermeasures, showing their fragility. Obviously, all neural network models are under the threat of adversarial attacks [11], which led to the idea of attacking voice conversion models as proposed here.

3. METHODOLOGIES

A widely used model for voice conversion adopts an encoder-decoder structure, in which the encoder is further divided into a content encoder and a speaker encoder, as shown in Fig. 1. This paper is also based on this model. The content encoder Ec extracts the content information from an input utterance t, yielding Ec(t), while the speaker encoder Es embeds the speaker characteristics of an input utterance x as a latent vector Es(x), as in the left part of Fig. 1. Taking Ec(t) and Es(x) as the input, the decoder D generates a spectrogram F(t, x) with content information based on Ec(t) and speaker characteristics based on Es(x).

Here we only focus on the utterances fed into the speaker encoder, since we are defending the speaker characteristics provided by such utterances. Motivated by the prior work [14], here we present three approaches to performing the attack, with the target being either the output spectrogram F(t, x) (Sec. 3.1), the speaker embedding Es(x) (Sec. 3.2), or the combination of the two (Sec. 3.3), as shown in Fig. 1.

3.1. End-to-end attack

A straightforward approach to perform adversarial attack on the above model in Fig. 1 is to take the decoder output F(t, x) as the target, referred to as the end-to-end attack, also shown in Fig. 1. Denote the original spectrogram of an utterance produced by the speaker to be defended as x ∈ R^(M×T) and the adversarial perturbation on x as δ ∈ R^(M×T), where M and T are the total numbers of frequency components and time frames respectively. An untargeted attack simply aims to alter the output of the voice conversion model and can be expressed as:

    maximize_δ   L(F(t, x + δ), F(t, x))
    subject to   ||δ||_∞ < ε                                        (1)

L(·, ·) is the distance between two vectors or the spectrograms of two signals, and ε is a constraint making the perturbation subtle. The signal t, which provides the content of the output utterance, can be arbitrary; we do not focus on it here.

Given a certain utterance y produced by a target speaker, we can formulate a targeted attack for an output signal with specific speaker characteristics:

    minimize_δ   L(F(t, x + δ), F(t, y)) − λ L(F(t, x + δ), F(t, x))
    subject to   ||δ||_∞ < ε                                        (2)

The first term in the objective of (2) aims to make the model output sound as if produced by the speaker of y, while the second term is to eliminate the original speaker identity in x. λ is a positive-valued hyperparameter balancing the importance between source and target.
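As a concrete illustration, the following is a minimal PyTorch-style sketch of the targeted end-to-end objective in (2). The module handles (content_encoder, speaker_encoder, decoder) and the use of an L2 distance are assumptions for illustration only, not the authors' exact implementation.

    import torch

    def end_to_end_loss(decoder, content_encoder, speaker_encoder,
                        t, x, y, delta, lam=0.1):
        """Targeted end-to-end objective of Eq. (2) (sketch, assumed interfaces).

        t: spectrogram providing content, x: spectrogram of the defended speaker,
        y: spectrogram of the target speaker, delta: perturbation on x.
        """
        c = content_encoder(t)                            # Ec(t)
        f_adv = decoder(c, speaker_encoder(x + delta))    # F(t, x + delta)
        f_tgt = decoder(c, speaker_encoder(y))            # F(t, y)
        f_src = decoder(c, speaker_encoder(x))            # F(t, x)
        dist = lambda a, b: torch.norm(a - b)             # L(., .): here an L2 distance
        # Pull the converted output toward the target speaker, away from the source.
        return dist(f_adv, f_tgt) - lam * dist(f_adv, f_src)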
To effectively constrain the range of the perturbation within [−ε, ε] while solving (2), we adopt the change-of-variable approach used previously [31], based on the tanh(·) function. In this way (2) above becomes (3) below:

    minimize_w   L(F(t, x + δ), F(t, y)) − λ L(F(t, x + δ), F(t, x))
    subject to   δ = ε · tanh(w)                                     (3)

where w ∈ R^(M×T). No clipping function is needed here.

3.2. Embedding attack

The speaker encoder Es in Fig. 1 embeds an utterance into a latent vector. These latent vectors for utterances produced by the same speaker tend to cluster closely together, while those by different speakers tend to be separated apart. The second approach proposed here focuses on the speaker encoder by directly changing the speaker embeddings of the utterances, referred to as the embedding attack, also in Fig. 1. As the decoder D produces the output F(t, x) with speaker characteristics based on the speaker embedding Es(x) as in Fig. 1, changing the speaker embeddings thus alters the output of the decoder. Following the notations and expressions in (3), we have:

    minimize_w   L(Es(x + δ), Es(y)) − λ L(Es(x + δ), Es(x))
    subject to   δ = ε · tanh(w)                                     (4)

where the adversarial attack is now performed with the speaker encoder Es only. Since only the speaker encoder is involved, it is more efficient.
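A minimal sketch of the embedding-attack objective (4) with the tanh change of variable is shown below; the speaker_encoder handle and the L2 distance are illustrative assumptions rather than the authors' exact code.

    import torch

    def embedding_attack_loss(speaker_encoder, x, y, w, eps=0.05, lam=0.1):
        """Embedding-attack objective of Eq. (4) (sketch under assumed interfaces).

        The perturbation is parameterized as delta = eps * tanh(w), so that
        ||delta||_inf < eps holds by construction and no clipping is needed.
        """
        delta = eps * torch.tanh(w)              # change of variable, Eqs. (3)/(4)
        e_adv = speaker_encoder(x + delta)       # Es(x + delta)
        e_tgt = speaker_encoder(y)               # Es(y), target speaker
        e_src = speaker_encoder(x)               # Es(x), defended speaker
        dist = lambda a, b: torch.norm(a - b)    # L(., .): here an L2 distance
        return dist(e_adv, e_tgt) - lam * dist(e_adv, e_src)

The feedback attack of Sec. 3.3 below would only change e_adv to the embedding of the re-encoded converted spectrogram, Es(F(t, x + δ)).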
3.3. Feedback attack

The third approach proposed here tries to combine the above two approaches by feeding the output spectrogram F(t, x + δ) from the decoder D back to the speaker encoder Es (the red feedback loop in Fig. 1) and considering the speaker embedding obtained in this way. More specifically, Es(x + δ) in (4) is replaced by Es(F(t, x + δ)) in (5). This is referred to as the feedback attack, also in Fig. 1.

    minimize_w   L(Es(F(t, x + δ)), Es(y)) − λ L(Es(F(t, x + δ)), Es(x))
    subject to   δ = ε · tanh(w)                                     (5)

4. EXPERIMENTAL SETTINGS

We conducted experiments on the model proposed by Chou et al. [4] (referred to as Chou's model below) and AUTOVC. Both are able to perform zero-shot voice conversion on unseen speakers given only a few of their utterances without fine-tuning, and are thus considered suitable for our scenarios. Models such as StarGAN-VC, on the other hand, are limited to the voices produced by speakers seen during training, which makes them less likely to be used to counterfeit other speakers' voices, and we thus do not consider them here.

4.1. Speaker encoders

For Chou's model, all modules were trained jointly from scratch on the CSTR VCTK Corpus [32]. The speaker encoder took a 512-dim mel spectrogram and generated a 128-dim speaker embedding. AUTOVC utilized a pre-trained d-vector [33] as the speaker encoder, with an 80-dim mel spectrogram as input and a 256-dim speaker embedding as output, pre-trained on VoxCeleb1 [34] and LibriSpeech [35] but generalizable to unseen speakers.

4.2. Vocoders

In inference, Chou's model leveraged the Griffin-Lim algorithm [36] to synthesize the audio. AUTOVC previously adopted WaveNet [37] as the spectrogram inverter, but here we used a WaveRNN-based vocoder [38] pre-trained on the VCTK corpus to generate waveforms with similar quality, due to time limitations.

As we introduced the perturbation on the spectrogram, vocoders converting the spectrogram into a waveform were necessary. We respectively adopted the Griffin-Lim algorithm and the WaveRNN-based vocoder for attacks on Chou's model and AUTOVC (a brief Griffin-Lim inversion sketch is given after Sec. 4.3).

4.3. Attack scenarios

Two scenarios were tested here. In the first scenario, the attacker has full access to the model to be attacked. With the complete architecture and all trained parameters of the model available, we can apply the adversarial attack directly. This is referred to as the white-box scenario, in which all experiments were conducted on Chou's model and AUTOVC with their publicly available network parameters and evaluated on exactly the same models.

The second scenario is referred to as the black-box scenario, in which the attacker cannot directly access the parameters of the model to be attacked, and the architecture might even be unknown. For attacking Chou's model, we trained a new model with the same architecture but different initialization, whereas for AUTOVC we trained a new speaker encoder with an architecture similar to the one in the original AUTOVC. These newly trained models were then used as proxy models to generate adversarial examples, which were evaluated with the publicly available models in the same way as in the white-box scenario.
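For readers unfamiliar with the Griffin-Lim inversion mentioned in Sec. 4.2, the snippet below is a minimal, generic example using librosa; the sampling rate and STFT parameters are placeholders and do not reflect the exact configuration used for Chou's model.

    import librosa

    def mel_to_waveform(mel_spectrogram, sr=22050, n_fft=1024, hop_length=256):
        """Invert a (power) mel spectrogram to a waveform via Griffin-Lim.

        Generic librosa-based sketch, not the vocoder configuration of the paper;
        all parameters here are illustrative.
        """
        return librosa.feature.inverse.mel_to_audio(
            mel_spectrogram, sr=sr, n_fft=n_fft, hop_length=hop_length)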
Fig. 2: Speaker verification accuracy for the defended speaker with Chou's (ε = 0.075) and AUTOVC (ε = 0.05) by the three proposed approaches under the white-box scenario; (a) Chou's, (b) AUTOVC, each with sections (i) end-to-end, (ii) embedding, (iii) feedback, and bars for adversarial input and adversarial output.

4.4. Attack procedure

We selected the ℓ2 norm as L(·, ·) and set λ = 0.1 for all experiments. w in (3), (4), (5) was initialized from a standard normal distribution, and the perturbation added to the utterances was ε · tanh(w). The Adam optimizer [39] was adopted to update w iteratively according to the loss functions defined in (3), (4), (5), with the learning rate being 0.001 and the number of iterations being 1500.
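Putting Sec. 3 and the settings above together, the following is a compact sketch of the optimization loop; loss_fn is an assumed callable returning the objective of (3), (4) or (5) for a perturbed input, so this illustrates the described procedure rather than the authors' released implementation.

    import torch

    def craft_adversarial_example(loss_fn, x, eps, lr=1e-3, n_iters=1500):
        """Optimize w so that x + eps*tanh(w) fools the voice conversion model.

        Follows the procedure described above: standard-normal initialization of w,
        Adam with learning rate 0.001, 1500 iterations (sketch, assumed interfaces).
        """
        w = torch.randn_like(x, requires_grad=True)    # w ~ N(0, I)
        optimizer = torch.optim.Adam([w], lr=lr)
        for _ in range(n_iters):
            delta = eps * torch.tanh(w)                # ||delta||_inf < eps by design
            loss = loss_fn(x + delta)                  # objective of (3), (4) or (5)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return (x + eps * torch.tanh(w)).detach()      # adversarial input x + delta

For instance, loss_fn could wrap the embedding_attack_loss sketch from Sec. 3.2 with a fixed target utterance y.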
ing utterance pairs randomly sampled from the VCTK corpus Similarly in Fig. 2(b) for AUTOVC. We can see the adversar- in the following way. We sampled 256 utterances for each ial inputs sounded very close to the defended speaker or the speaker in the dataset, half of which were used as positive perturbation δ almost imperceptible (the blue bars very close samples and the other half as negative ones. For positive sam- 2 We only evaluated the examples that can be successfully converted by ples, the similarity was computed with random utterances of the voice conversion model because if an example cannot be successfully 1 https://github.com/resemble-ai/Resemblyzer converted, then there is no need to defend it.
We randomly collected a sufficient number of utterances offering the speaker characteristics (x in Fig. 1) from the 109 speakers in the VCTK corpus, and a sufficient number of utterances offering the content (t in Fig. 1), and performed voice conversion with both Chou's model and AUTOVC to generate the original outputs (F(t, x) in Fig. 1). We then randomly collected 100 such pairs (x, F(t, x)) produced by both Chou's model and AUTOVC which were considered by the speaker verification system mentioned above to be produced by the same speaker, to be used in the test below (we only evaluated examples that can be successfully converted by the voice conversion model, because if an example cannot be successfully converted there is no need to defend it). Thus the speaker verification accuracy of all these original outputs F(t, x) is 1.00. We then created the corresponding adversarial examples (x + δ in (3), (4), (5)), targeting speakers with gender opposite to the defended speaker, and performed speaker verification respectively on these adversarial example utterances (referred to as adversarial input) and the converted utterances F(t, x + δ) (referred to as adversarial output). The same examples were used in the tests for Chou's model and AUTOVC.

Fig. 2(a) shows the speaker verification accuracy for the adversarial input and output utterances evaluated with respect to the defended speaker under the white-box scenario for Chou's model. Results for the three approaches mentioned in Sec. 3 are in the three sections (i), (ii), (iii), with the blue crosshatch-dotted bar and the red diagonal-lined bar respectively for the adversarial input and the adversarial output. Fig. 2(b) shows the same for AUTOVC. We can see the adversarial inputs sounded very close to the defended speaker, i.e., the perturbation δ was almost imperceptible (the blue bars very close to 1.00), while the converted utterances sounded as if from a different speaker (the red bars much lower). All three approaches were effective, although the feedback attack worked the best for Chou's model (section (iii) in Fig. 2(a)), while the embedding attack worked very well for both Chou's model and AUTOVC with respect to both adversarial input and output (section (ii) of each chart).

Fig. 3: Speaker verification accuracy for different perturbation scales ε on Chou's model under the black-box scenario for the three proposed approaches: (a) end-to-end, (b) embedding, and (c) feedback attacks.

For the black-box scenario, we analyzed the same speaker verification accuracy as in Fig. 2(a) for Chou's model only, but with varying scale of the perturbation ε; the results are plotted in Fig. 3 (a), (b), (c) respectively for the three proposed approaches. We see that when ε = 0.075 the adversarial inputs were kept almost intact (blue curves close to 1.0) while the adversarial outputs were seriously disturbed (red curves much lower). However, for ε ≥ 0.1 the speaker characteristics of the adversarial inputs were altered drastically (blue curves went down), although the adversarial outputs sounded very different (red curves went very low).

Fig. 4: Same as Fig. 3 except on AUTOVC and for the embedding attack only.

Fig. 4 shows the same results as in Fig. 3 except on AUTOVC with the embedding attack only (as the other two methods did not work well in the white-box scenario in Fig. 2(b)). We see very similar results as in Fig. 3, and the embedding attack worked successfully with AUTOVC for good choices of ε (0.05 ≤ ε ≤ 0.075).

Among the three proposed approaches, the embedding attack turned out to be the most attractive, considering both defending effectiveness (as mentioned above) and time efficiency. The feedback attack offered very good performance on Chou's model, but was less effective on AUTOVC. It also took more time to apply the perturbation, as one more complete encoder-to-decoder inference was required. It is interesting to note that the end-to-end attack offered performance comparable to the other two approaches, although it is based on the distance between spectrograms, very different from the distance between speaker embeddings on which the other two approaches relied.

5.2. Subjective tests

The above speaker verification tests were objective but not necessarily adequate, so we also performed subjective evaluation, but only with the most attractive embedding attack, on both Chou's model and AUTOVC for both the white- and black-box scenarios. We randomly selected 50 out of the 100 example utterances (x) from the set used in the objective evaluation described above. The corresponding adversarial inputs (x + δ), adversarial outputs (F(t, x + δ)) and original outputs (F(t, x)) as used above were then reused in the subjective evaluation here, for ε = 0.075 and 0.05 respectively for Chou's model and AUTOVC. The subjects were asked to decide whether two given utterances were from the same speaker by choosing one out of four options: (I) Different, absolutely sure; (II) Different, but not very sure; (III) Same, but not very sure; and (IV) Same, absolutely sure. Among the two utterances given, one is the original utterance x, whereas the other is the adversarial input, the adversarial output, or the original output. Each utterance pair was evaluated by 6 subjects. To remove possible outliers in the subjective results, we deleted the two extreme ballots (one from each end) out of the 6 ballots received for each utterance pair (delete a (I) if there is one, or delete a (II) if there are no (I)'s, etc.; similarly for (IV) and (III)). In this way 4 ballots were collected for each utterance pair, and 200 ballots for the 50 utterance pairs.
Fig. 5: Subjective evaluation results with the embedding attack for Chou's (ε = 0.075) and AUTOVC (ε = 0.05). On the x-axis, "adv. input" stands for adversarial input and "adv. output" stands for adversarial output.

The percentages of ballots choosing (I), (II), (III), (IV) out of these 200 are shown in Fig. 5 for Chou's model and AUTOVC respectively, in bars 1 and 2 for the white-box scenario, bars 3 and 4 for the black-box scenario, and bar 5 for the original output.

For Chou's model, we can see in Fig. 5(a) that at least 70% - 78% of the ballots chose (IV), i.e., considered the adversarial inputs to preserve the original speaker characteristics very well (red parts in bars 1 and 3), yet at least 41% - 58% of the ballots chose (I), i.e., considered the adversarial outputs to be obviously from a different speaker (blue parts in bars 2 and 4). Also, for the original output, at least 82% of the ballots considered them close to the original speaker ((III) plus (IV) in bar 5). As in Fig. 5(b) for AUTOVC, at least 85% - 90% of the ballots chose (IV) (red parts in bars 1 and 3), yet more than 54% - 68% of the ballots chose (I) (blue parts in bars 2 and 4). However, only about 27% of the ballots considered the original outputs to be from the same speaker (red and orange parts in bar 5). This is probably because the objective speaker verification system used here did not match human perception very well: an original output selected because its similarity to the original utterance was above the threshold may not sound to human subjects as if produced by the same speaker. Also, for both models, the black-box scenario was in general more challenging than the white-box one (lower green and blue parts, bar 4 vs. bar 2), but the approach is still effective to a good extent.

The demo samples can be found at https://yistlin.github.io/attack-vc-demo, and the source code is at https://github.com/cyhuang-tw/attack-vc.

6. CONCLUSIONS

Improved voice conversion techniques imply a higher demand for new technologies to defend personal speaker characteristics. This paper presents the first known attempt to perform adversarial attack on voice conversion. Three different approaches are proposed and tested on two state-of-the-art voice conversion models in both objective and subjective evaluations, with very encouraging results, including for the black-box scenario closer to real applications.

7. ACKNOWLEDGMENTS

We are grateful to the authors of AUTOVC for offering the complete source code for our experiments, and to Chou for kind help in model training.
8. REFERENCES

[1] Ju-chieh Chou, Cheng-chieh Yeh, Hung-yi Lee, and Lin-shan Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Proc. Interspeech 2018, 2018, pp. 501–505.

[2] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6820–6824.

[3] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 266–273.

[4] Ju-chieh Chou and Hung-yi Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," in Proc. Interspeech 2019, 2019, pp. 664–668.

[5] Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 5210–5219.

[6] Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi, "A comparison of features for synthetic speech detection," in INTERSPEECH 2015, 2015, pp. 2087–2091.

[7] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer, "The deepfake detection challenge dataset," 2020.

[8] Yuezun Li and Siwei Lyu, "Exposing deepfake videos by detecting face warping artifacts," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.

[9] U. A. Ciftci, I. Demir, and L. Yin, "FakeCatcher: Detection of synthetic portrait videos using biological signals," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.

[10] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, "Intriguing properties of neural networks," in International Conference on Learning Representations, 2014.

[11] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations, 2015.

[12] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio, "Adversarial machine learning at scale," in International Conference on Learning Representations, 2017.

[13] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, "Towards deep learning models resistant to adversarial attacks," in International Conference on Learning Representations, 2018.

[14] Jernej Kos, Ian Fischer, and Dawn Song, "Adversarial examples for generative models," in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 36–42.

[15] Nicholas Carlini and David Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in 2018 IEEE Security and Privacy Workshops (SPW), 2018, pp. 1–7.

[16] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa, "Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding," in Network and Distributed System Security Symposium (NDSS), 2019.

[17] Moustafa Alzantot, Bharathan Balaji, and Mani B. Srivastava, "Did you hear that? Adversarial examples against automatic speech recognition," in Machine Deception Workshop, Neural Information Processing Systems (NIPS), 2017.

[18] R. Taori, A. Kamsetty, B. Chu, and N. Vemuri, "Targeted adversarial examples for black box audio systems," in 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 15–20.

[19] Moustapha M. Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet, "Houdini: Fooling deep structured visual and speech recognition models with adversarial examples," in Advances in Neural Information Processing Systems 30, 2017, pp. 6977–6987.

[20] S. Liu, H. Wu, H. Lee, and H. Meng, "Adversarial attacks on spoofing countermeasures of automatic speaker verification," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 312–319.

[21] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, "Adversarial attacks on GMM i-vector based speaker verification systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6579–6583.

[22] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1962–1966.

[23] Yee Wah Lau, M. Wagner, and D. Tran, "Vulnerability of speaker verification to voice mimicking," in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004, pp. 145–148.

[24] Z. Wu, T. Kinnunen, E. S. Chng, H. Li, and E. Ambikairajah, "A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case," in Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, 2012, pp. 1–5.

[25] Nicholas Evans, Tomi Kinnunen, and Junichi Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in INTERSPEECH 2013, 2013, pp. 925–929.

[26] Yanmin Qian, Nanxin Chen, and Kai Yu, "Deep features for automatic spoofing detection," Speech Communication, vol. 85, pp. 43–52, 2016.

[27] Galina Lavrentyeva, Sergey Novoselov, Egor Malykh, Alexander Kozlov, Oleg Kudashev, and Vadim Shchemelinin, "Audio replay attack detection with deep learning frameworks," in Proc. Interspeech 2017, 2017, pp. 82–86.

[28] Giacomo Valenti, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, and Laurent Pilati, "An end-to-end spoofing countermeasure for automatic speaker verification using evolving recurrent neural networks," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 288–295.

[29] S. Chen, K. Ren, S. Piao, C. Wang, Q. Wang, J. Weng, L. Su, and A. Mohaisen, "You can hear but you cannot steal: Defending against voice impersonation attacks on smartphones," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 183–195.

[30] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi H. Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," in Proc. Interspeech 2019, 2019, pp. 1008–1012.

[31] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.

[32] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.

[33] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4879–4883.

[34] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech 2017, 2017, pp. 2616–2620.

[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[36] D. Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[37] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.

[38] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards achieving robust universal neural vocoding," in Proc. Interspeech 2019, 2019, pp. 181–185.

[39] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR), 2015.