AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
arXiv:2104.03603v4 [cs.SD] 10 Aug 2021

Yihui Fu1,*, Luyao Cheng1,*, Shubo Lv1, Yukai Jv1, Yuxiang Kong1, Zhuo Chen2, Yanxin Hu1, Lei Xie1,#, Jian Wu3, Hui Bu4, Xin Xu4, Jun Du5, Jingdong Chen1
1 Northwestern Polytechnical University, Xi'an, China; 2 Microsoft Corporation, USA; 3 Microsoft Corporation, China; 4 Beijing Shell Shell Technology Co., Ltd., Beijing, China; 5 University of Science and Technology of China, Hefei, China
{yhfu, lycheng, shblv, ykjv, yxkong, yxhu}@npu-aslp.org, lxie@nwpu.edu.cn, {zhuc, wujian}@microsoft.com, {buhui, xuxin}@aishelldata.com, jundu@ustc.edu.cn, jingdongchen@ieee.org
(* Equal contribution. # Corresponding author.)

Abstract

In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in the conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation, such as short pauses, speech overlap, quick speaker turns, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.

Index Terms: AISHELL-4, speech front-end processing, speech recognition, speaker diarization, conference scenario, Mandarin

1. Introduction

Meeting transcription is the process of estimating "who speaks what at when" in meeting recordings, which usually consist of utterances from multiple speakers with a certain amount of speech overlap. It is considered one of the most challenging problems in speech processing, as it is a combination of various speech tasks, including speech front-end processing, speech activity detection (SAD), automatic speech recognition (ASR), speaker identification and diarization, etc. While several works show promising results [1, 2], the problem is still considered unsolved, especially for real-world applications [3].

The scarcity of relevant data is one of the main factors that hinder the development of advanced meeting transcription systems. As meeting transcription involves multiple aspects of the input audio, such as speech content, speaker identities and their onset/offset times, precise annotation of all aspects is expensive and usually difficult to obtain. Meanwhile, meetings often contain a noticeable amount of speech overlap, quick speaker turns, non-grammatical presentation and noise, further increasing the transcription difficulty and cost. For example, to transcribe the overlapped regions in a meeting, the transcriber is often required to listen to the audio multiple times in order to understand each involved speaker. Therefore, most of the previous works on meeting transcription or relevant tasks such as overlapped speech recognition employ synthetic datasets for experimentation [4, 5, 6].

To help with the data insufficiency problem, several relevant datasets have been collected and released [6, 7, 8, 9, 10]. However, most of the existing datasets suffer from various limitations, ranging from corpus setup (e.g., corpus size, speaker and spatial variety, collection condition) to corpus content (e.g., recording quality, accented speech, speaking style). Meanwhile, almost all publicly available meeting corpora are in English, which largely limits the data variation for meeting transcription systems. As each language possesses unique properties, a solution designed for English meetings is likely to be sub-optimal for other languages.

In this work, in order to boost research on advanced meeting transcription, we design and release a sizeable collection of real meetings, namely the AISHELL-4 dataset (http://www.aishelltech.com/aishell_4). AISHELL-4 is a Mandarin corpus containing 120 hours of meeting recordings, collected with an 8-channel microphone array. The corpus is designed to cover a variety of aspects of real-world meetings, including diverse recording conditions and various numbers of meeting participants and overlap ratios. High-quality transcriptions on multiple aspects are provided for each meeting sample, allowing researchers to explore different aspects of meeting processing. A baseline system is also released along with the corpus, to further facilitate research on this complicated problem.

2. Previous Works

For ASR with conversational speech, Switchboard [11], collected by Texas Instruments, is a dataset of telephone communication with about 2,500 conversations by 500 native speakers. However, its data scale is relatively limited. Fisher [12] is another conversational telephone speech dataset, which has yielded more than 16,000 English conversations and 2,000 hours of speech, covering various speakers and speaking styles. However, Fisher, together with Switchboard, is recorded with a sampling rate of 8 kHz, which does not meet the modern requirement of high-accuracy meeting transcription. Furthermore, these datasets are recorded in a single-speaker manner and contain no speech overlap.

To cope with speech overlap, a variety of datasets have been developed. WSJ0-2mix [13] and its variants, including WHAM! [14] and WHAMR! [15], are mostly adopted for speech separation evaluation and consist of artificially mixed speech samples built from the Wall Street Journal dataset. LibriMix [16], derived from LibriSpeech [17], is an alternative to the datasets mentioned above for the noisy speech separation task; its sizeable data scale leads to better generalization in model training. However, all datasets mentioned above are synthetic and are created in a short-segment, fully overlapped manner, which severely mismatches real conversation, where the overlap ratio is usually below 20% [18].

LibriCSS and CHiME-6 are recently introduced datasets for conference or indoor conversation transcription that take overlapped speech into consideration. LibriCSS mixes LibriSpeech utterances from different speakers to form single-channel continuous utterances, which are then played by multiple Hi-Fi loudspeakers and recorded by a 7-channel microphone array in one conference venue. However, LibriSpeech is a reading-style corpus with relatively stable speed and pronunciation, which mismatches natural conversation. On the other hand, the CHiME-6 challenge includes a track aimed at recognizing unsegmented dinner-party conversations recorded on multiple microphone arrays. However, the recording quality of CHiME-6 is relatively low, as a result of audio clipping, a large amount of interjection and laughter, signal attenuation, etc. The number of participants in each CHiME-6 meeting is fixed to four, making it difficult to evaluate the generalization of meeting transcription systems. ICSI [8], CHIL [9] and AMI [10] are meeting corpora collected by academic laboratories rather than with commercial microphones, which have relatively limited speaker variation and restrict the universality of the data. The data scale of the Santa Barbara Corpus of Spoken American English [19] is considerable; unfortunately, it only contains single-channel data.

For Mandarin speech corpora, the commonly used open source datasets for speech recognition include AISHELL-1 [20], AISHELL-2 [21], aidatatang (http://www.openslr.org/62/), etc. However, most of these datasets are recorded in a near-field scenario and contain no speech overlap or obvious noise and reverberation. To the best of our knowledge, there is no publicly available meeting dataset in Mandarin.

3. Datasets

AISHELL-4 contains 120 hours of speech data in total, divided into a 107.50-hour training set and a 12.72-hour evaluation set. The training and evaluation sets contain 191 and 20 sessions, respectively. Each session consists of a 30-minute discussion by a group of participants. The total numbers of participants in the training and evaluation sets are 36 and 25, with balanced gender coverage.

The dataset is collected in 10 conference venues, which are divided into three types: small, medium and large rooms, whose sizes range from 7 x 3 x 3 m3 to 15 x 7 x 3 m3. The wall materials of the conference venues cover cement, glass, etc. Other furnishings in the conference venues include sofas, TVs, blackboards, fans, air conditioners, plants, etc. During recording, the participants of the conference sit around the microphone array, which is placed on the table in the middle of the room, and conduct natural conversation. The microphone-speaker distance ranges from 0.6 to 6.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the conference, various kinds of indoor noise, including but not limited to clicking, keyboard, door opening/closing, fan and babble noise, are made by participants naturally. For the training set, the participants are required to remain in the same position during recording, while for the evaluation set, the participants may move naturally within a small range. No room is shared between the training and evaluation sets, and only one speaker appears in both. An example of a recording venue from the training set, including the topology of the microphone array, is shown in Fig. 1.

Figure 1: An example of recording venue of training set and the topology of microphone array.

The number of participants within one conference session ranges from 4 to 8. To ensure the coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratios of the training and evaluation sets are 19.04% and 9.31%, respectively. More details of AISHELL-4 are shown in Table 1. A detailed session-level overlap ratio distribution of the training and evaluation sets is shown in Table 2.

Table 1: Details of the AISHELL-4 dataset.

                          Training    Evaluation
    Duration (h)          107.50      12.72
    #Session              191         20
    #Room                 5           5
    #Participant          36          25
    #Male                 16          11
    #Female               20          14
    Overlap Ratio (Avg.)  19.04%      9.31%

Table 2: Session-level overlap ratio distribution. 0%-10%, 10%-20%, 20%-30%, 30%-40% and 40%-100% indicate the range of overlap ratio of each session; the numbers give the number of sessions with the corresponding overlap ratio in the training and evaluation sets, respectively.

    Overlap Ratio   Training    Evaluation
    0%-10%          41          12
    10%-20%         76          6
    20%-30%         44          2
    30%-40%         20          0
    40%-100%        10          0

We also record the near-field signal of each participant using a headset microphone. To obtain the transcription, we first align the signal recorded by the headset microphone with the first channel of the microphone array and then select the signal with the higher quality for manual labeling. Before labeling the scripts, we apply automatic speech recognition to the recorded data to help the transcribers produce accurate labels. Inspectors then double-check the labeling results of each session and decide whether they meet the acceptance standard. Praat (https://github.com/praat/praat) is used for further calibration, to check the accuracy of the speaker distribution and to avoid mis-cutting of speech segments. We also pay special attention to accurate punctuation labeling. Each session is labeled by three professional annotators on average, as a secondary inspection to improve the labeling quality.

All scripts of the speech data are prepared in TextGrid format for each session, which contains the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments of each speaker, and the timestamp and transcription of each segment. Non-speech events, such as pauses, laughing, coughing and breathing, are transcribed as well. Overlapping and non-overlapping segments are also identified.
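Since each TextGrid file stores per-speaker segment boundaries, statistics such as the session-level overlap ratios in Table 2 can be recomputed directly from the annotations. The following is a minimal, hypothetical Python sketch, not part of the released baseline; it assumes the speaker-attributed segments have already been read from the TextGrid tiers with any TextGrid parser, and it takes overlap ratio to mean the fraction of speech time covered by two or more simultaneous speakers, which may differ slightly from the exact definition used for Table 2.

```python
from typing import List, Tuple

def overlap_ratio(segments: List[Tuple[str, float, float]]) -> float:
    """Estimate a session-level overlap ratio from speaker-attributed
    segments (speaker_id, start_sec, end_sec), e.g. read from the
    per-speaker interval tiers of an AISHELL-4 TextGrid file.

    Overlap ratio is taken here as: time covered by two or more active
    speakers, divided by time covered by at least one active speaker.
    """
    # Sweep over segment boundaries, tracking how many speakers are active.
    events = []
    for _, start, end in segments:
        events.append((start, +1))
        events.append((end, -1))
    events.sort()

    speech, overlap, active, prev_t = 0.0, 0.0, 0, 0.0
    for t, delta in events:
        if active >= 1:
            speech += t - prev_t
        if active >= 2:
            overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0


if __name__ == "__main__":
    # Toy example: two speakers overlapping for 2 s out of 10 s of speech.
    segs = [("S1", 0.0, 6.0), ("S2", 4.0, 10.0)]
    print(f"overlap ratio: {overlap_ratio(segs):.2%}")  # -> 20.00%
```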
4. Baseline

We build our baseline meeting transcription system (https://github.com/felixfuyihui/AISHELL-4) inspired by [6, 2]. The baseline system contains three major sub-modules, namely speech front-end processing, speaker diarization and ASR. We train each sub-module separately. During evaluation, the meeting recording is first processed by the speaker diarization module, which generates the speaker attribution and boundary information of each individual utterance. Then the speech separation module is applied to each segmented utterance to remove potential speech overlap. Finally, the processed audio is recognized by the ASR module to obtain the speaker-attributed transcription.

4.1. Speaker Diarization

We adopt the Kaldi-based diarization system from the CHiME-6 challenge. The diarization module includes three components: SAD, a speaker embedding extractor and clustering.

We employ the CHiME-6 SAD model [22] (http://kaldi-asr.org/models/12/0012_sad_v1.tar.gz), which consists of a 5-layer time-delay neural network (TDNN) and a 2-layer statistics pooling for continuous conference segmentation. The features sent to the SAD model are 40-dimensional mel-frequency cepstral coefficients (MFCC) extracted every 10 ms with a 25 ms window.

The speaker embedding network (https://github.com/BUTSpeechFIT/VBx) is based on ResNet [23] and is trained on VoxCeleb1 [24], VoxCeleb2 [25] and CN-Celeb [26], using stochastic gradient descent (SGD) and the additive angular margin loss [27]. We ramp up the margin during the first two epochs and then train the network for the following epochs with the margin fixed at m = 0.2. Data augmentation is performed in the same way as in the SRE16 Kaldi recipe (https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2). The features fed to the speaker embedding network are 64-dimensional mel-filterbanks extracted every 15 ms with a 25 ms window. The post-processing of the speaker embeddings relies on a probabilistic linear discriminant analysis (PLDA) [28] model trained on the same data as the speaker embedding network.

All generated speaker embeddings are first pre-clustered with the agglomerative hierarchical clustering (AHC) [29] algorithm to obtain initial speaker labels; the threshold of the AHC algorithm is set to 0.015. The dimension of the speaker embeddings is then reduced from 256 to 128 for further clustering with the Variational Bayesian HMM clustering (VBx) [30] model. The diarization module thus outputs the segmented utterances of the whole conference session together with their corresponding speaker information. Note that the speaker diarization does not take overlapped speech into consideration, as we assume that each speech frame corresponds to only one speaker. Diarization systems such as the one in [31] are designed to address speech overlap; we will explore their integration in follow-up work.
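To make the clustering stage concrete, below is a small illustrative sketch of the AHC initialisation using SciPy. It is not the released recipe: the baseline scores embeddings with PLDA (hence the 0.015 threshold) and then refines the labels with VBx, whereas this sketch clusters length-normalised embeddings by cosine distance with an arbitrary cut-off, purely to show the shape of the data flow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ahc_initial_labels(embeddings: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Agglomerative clustering of speaker embeddings into initial labels.

    embeddings : (N, D) array, one embedding per SAD sub-segment.
    threshold  : cosine-distance cut-off; a placeholder value only, since the
                 released baseline thresholds PLDA scores (0.015) instead.
    Returns an (N,) array of integer cluster (speaker) labels, which the
    baseline would then refine with VBx resegmentation.
    """
    # Length-normalise so that cosine distance behaves as in speaker verification.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    Z = linkage(emb, method="average", metric="cosine")
    return fcluster(Z, t=threshold, criterion="distance")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers": embeddings scattered around two random centroids.
    centers = rng.normal(size=(2, 256))
    X = np.vstack([c + 0.05 * rng.normal(size=(40, 256)) for c in centers])
    print(ahc_initial_labels(X))  # roughly two distinct labels
```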
4.2. Speech Front-end Processing

We employ the separation solution from [3] in our baseline system, which contains a multi-channel speech separation network and a mask-based minimum variance distortionless response (MVDR) beamformer. For each test segment, two masks are first estimated by the network, followed by MVDR beamforming to generate the final separation output. The separation network consists of 3 LSTM layers, each with 3084 nodes, followed by a fully connected (FC) layer with a sigmoid activation function that forms the speaker masks.

We use synthetic overlapped speech to train the separation network. Each mixture sample consists of fully or partially overlapped speech, directional noise and isotropic noise. 460 hours of clean speech from LibriSpeech is adopted as the close-talk data for simulation. The noise set comes from MUSAN and AudioSet, which contain 49 and 88 hours of recordings, respectively. The multi-channel RIRs and isotropic noise are simulated based on the topology of the 8-channel microphone array. The room size ranges from 3 x 3 x 3 m3 to 10 x 10 x 3 m3, while the microphone array is randomly placed within a 2 x 2 m2 area in the center of the room, with its height randomly chosen between 0.6 and 1.2 m. RT60 ranges from 0.2 to 0.8 s. The angle between each speaker and the noise is set to at least 20 degrees to ensure the spatial distinctiveness of each sound source. The SNR range is [5, 20] dB and the SDR range is [-5, 5] dB. The overlap ratio of the simulated data is divided into three parts with equal data amounts: 1) non-overlapped, 2) overlap ratio randomly ranging from 0 to 20%, and 3) overlap ratio randomly ranging from 20 to 80%. Finally, we simulate 364 and 10 hours of data for model training and development, respectively.

All simulated data is segmented into 4.0 s short utterances during training. Inter-channel phase difference (IPD) features are calculated for four microphone pairs: (1,5), (2,6), (3,7) and (4,8). The frame length and hop length used in the STFT are set to 32 and 16 ms, respectively. The mask estimator is trained for 20 epochs with the Adam optimizer, where the initial learning rate is set to 0.001 and is halved if there is no improvement on the development set.

We concatenate the magnitude of the first channel of the observed signal and the cosIPD of the selected microphone pairs along the frequency axis as the input of the model. The IPD is calculated as $\mathrm{IPD}_{i,j} = \angle y^{i}_{t,f} - \angle y^{j}_{t,f}$, where $i, j$ denote microphone indices. For the mask estimator, we use the STFT mask as the training target, and the loss function for magnitude approximation with the permutation invariant training (PIT) criterion is used for training.

In the MVDR step, the covariance matrices are calculated from the estimated masks via

$$R^{k}_{f} = \frac{1}{\sum_{t} m^{k}_{t,f}} \sum_{t} m^{k}_{t,f} \, y_{t,f} \, y^{H}_{t,f}, \qquad (1)$$

where $m^{k}_{t,f}$ refers to the mask value of source $k$ on each T-F bin and $y_{t,f}$ denotes the corresponding observed vector in the frequency domain. In this work, $k \in \{0, 1\}$ as we consider at most two speakers. The filter coefficients of the MVDR beamformer $w^{k}_{f}$ for source $k$ are derived as

$$w^{k}_{f} = \frac{(R^{1-k}_{f})^{-1} R^{k}_{f}}{\operatorname{tr}\big((R^{1-k}_{f})^{-1} R^{k}_{f}\big)} \, u_{r}, \qquad (2)$$

where $\operatorname{tr}(\cdot)$ is the matrix trace operation and $u_{r}$ is a one-hot vector indicating the reference microphone. Thus, we obtain the final enhanced and separated result for each source via

$$\hat{y}^{k}_{t,f} = (w^{k}_{f})^{H} y_{t,f}. \qquad (3)$$

From the two separated channels of each utterance, we pick the channel with the larger energy as the final processing result for this utterance.
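For readers who prefer code to equations, the following NumPy sketch mirrors the mask-based MVDR step of Eqs. (1)-(3). It is an illustrative re-implementation under simplifying assumptions (the released baseline performs this step in PyTorch), the mask estimation is assumed to have been done by the separation network, and the small diagonal loading term is an addition of ours for numerical stability rather than part of Eq. (2).

```python
import numpy as np

def mvdr_from_masks(Y, masks, ref_mic=0):
    """Mask-based MVDR beamforming following Eqs. (1)-(3).

    Y     : (C, T, F) complex STFT of the C-channel mixture.
    masks : (K, T, F) real T-F masks for the K sources (here K = 2).
    Returns (K, T, F) complex STFTs of the separated sources.
    """
    C, T, F = Y.shape
    K = masks.shape[0]

    # Eq. (1): mask-weighted spatial covariance matrix per source and frequency.
    R = np.zeros((K, F, C, C), dtype=np.complex128)
    for k in range(K):
        for f in range(F):
            yf = Y[:, :, f]                      # (C, T)
            m = masks[k, :, f]                   # (T,)
            R[k, f] = (m * yf) @ yf.conj().T / np.maximum(m.sum(), 1e-8)

    # Eq. (2): w_f^k = (R^{1-k})^{-1} R^k u_r / tr((R^{1-k})^{-1} R^k).
    u = np.zeros(C)
    u[ref_mic] = 1.0
    out = np.zeros((K, T, F), dtype=np.complex128)
    for k in range(K):
        for f in range(F):
            # Small diagonal loading (not in Eq. (2)) keeps the inverse well conditioned.
            num = np.linalg.solve(R[1 - k, f] + 1e-6 * np.eye(C), R[k, f])
            w = (num @ u) / (np.trace(num) + 1e-8)
            # Eq. (3): apply the filter w^H y to every frame of this frequency bin.
            out[k, :, f] = w.conj() @ Y[:, :, f]
    return out


if __name__ == "__main__":
    # Toy usage with random data: 8 channels, 200 frames, 257 frequency bins.
    Y = np.random.randn(8, 200, 257) + 1j * np.random.randn(8, 200, 257)
    masks = np.random.rand(2, 200, 257)
    print(mvdr_from_masks(Y, masks).shape)  # (2, 200, 257)
```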
4.3. Automatic Speech Recognition

A transformer-based end-to-end speech recognition model [32] with a sequence-to-sequence architecture is employed as the back-end of our system. A 2-layer 2D CNN is first applied to reduce the input frame rate, followed by an 8-layer transformer encoder and a 6-layer transformer decoder. The multi-head self-attention layers contain 8 heads with 512 dimensions, and the inner dimension of the feed-forward network is 1024. The model is trained with a joint CTC and cross-entropy objective function [33] to accelerate the convergence of training. The beam size is set to 24 during beam search decoding.

The training data is composed of two parts: simulated and real-recorded data. For the simulated data, the clean speech comes from AISHELL-1, aidatatang_200zh and Primewords (http://www.openslr.org/47/), while the noise, RIRs and SNR settings are the same as those of the front-end model. We combine the simulated and original data to form 768 and 97 hours of single-speaker, single-channel data as the training and development sets, respectively. For the real-recorded data, we select the non-overlapped part of the AISHELL-4 training set according to the ground-truth segmentation information, which amounts to 63 hours of data used for model training only.

The frame length and hop size of the STFT are set to 25 and 10 ms. We utilize 80-dimensional log-fbank features with utterance-wise mean-variance normalization as the input features, and we apply SpecAugment [34] for data augmentation. We train the model for a maximum of 50 epochs with the Adam optimizer. A warm-up and decay learning rate scheduler with a peak value of 10^-4 is adopted, where the number of warm-up steps is set to 25,000.

4.4. Evaluation and Results

Two metrics are included in the baseline system, referred to as speaker-independent and speaker-dependent character error rate (CER). The speaker-independent task is designed to measure the performance of the front-end and speech recognition modules; in this task, we use the ground-truth utterance boundaries instead of the diarization output to segment the utterances. The speaker-dependent task, on the other hand, takes all involved modules into consideration. After obtaining the transcription or speaker-attributed transcription, Asclite (https://github.com/usnistgov/SCTK), which can align multiple hypotheses against multiple reference transcriptions, is used to estimate the CERs.

Table 3: %CER results for the speaker independent and speaker dependent tasks.

            Speaker Independent   Speaker Dependent
    No FE   32.56                 41.55
    FE      30.49                 39.86

We report the CER of the reference system, with and without the front-end, in Table 3. The CERs on data without front-end processing ("No FE") are 32.56% and 41.55% on the speaker-independent and speaker-dependent tasks, respectively. With front-end processing ("FE"), we get 30.49% CER on the speaker-independent task and 39.86% CER on the speaker-dependent task when applying SAD and speaker diarization. In the future, we will consider performing speech front-end processing in a continuous manner and reducing the usage of ground-truth information to obtain more promising results.
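Asclite is needed above because a hypothesis must be aligned against multiple, possibly overlapping reference speakers. For a single reference/hypothesis pair, however, CER reduces to a character-level edit distance; the toy sketch below only illustrates the metric itself and is not the scoring pipeline used to produce Table 3.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: (substitutions + deletions + insertions) / len(ref),
    computed with standard Levenshtein dynamic programming over characters."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(r), 1)


if __name__ == "__main__":
    # One inserted character against an 8-character Mandarin reference.
    print(f"{cer('今天开会讨论预算', '今天开会讨论预算了'):.2%}")  # -> 12.50%
```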
5. Conclusions

In this paper, we present AISHELL-4, a sizable Mandarin speech dataset for speech enhancement, separation, recognition and speaker diarization in the conference scenario. All released data are real-recorded, multi-channel recordings collected in real conference venues and acoustic conditions. We also release a training and evaluation framework as a baseline for research in this field. Our experimental results show that the complex acoustic scenario degrades ASR performance to a great extent, resulting in 32.56% and 41.55% CER on the speaker-independent and speaker-dependent tasks, respectively. After front-end and diarization processing, the CERs reach 30.49% and 39.86% on the speaker-independent and speaker-dependent tasks, respectively. We hope that this dataset and the associated baseline can encourage researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks, and help to reduce the gap between the state of research and real conferencing applications.

6. References

[1] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, "Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 809-816.
[2] X. Wang, N. Kanda, Y. Gaur, Z. Chen, Z. Meng, and T. Yoshioka, "Exploring end-to-end multi-channel ASR with bias information for meeting transcription," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 833-840.
[3] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang et al., "Advances in online audio-visual meeting transcription," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 276-283.
[4] C. Li, Y. Luo, C. Han, J. Li, T. Yoshioka, T. Zhou, M. Delcroix, K. Kinoshita, C. Boeddeker, Y. Qian et al., "Dual-path RNN for long recording speech separation," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 865-872.
[5] S. Chen, Y. Wu, Z. Chen, T. Yoshioka, S. Liu, and J. Li, "Don't shoot butterfly with rifles: Multi-channel continuous speech separation with early exit transformer," arXiv preprint arXiv:2010.12180, 2020.
[6] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, "Continuous speech separation: Dataset and analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7284-7288.
[7] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.
[8] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke et al., "The ICSI meeting corpus," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE, 2003, pp. 1-5.
[9] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. M. Chu, A. Tyagi, J. R. Casas, J. Turmo, L. Cristoforetti, F. Tobia et al., "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms," Language Resources and Evaluation, vol. 41, no. 3, pp. 389-407, 2007.
[10] S. Renals, T. Hain, and H. Bourlard, "Interpretation of multiparty meetings: The AMI and AMIDA projects," in Hands-Free Speech Communication and Microphone Arrays. IEEE, 2008, pp. 115-118.
[11] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "Switchboard: Telephone speech corpus for research and development," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE Computer Society, 1992, pp. 517-520.
[12] C. Cieri, D. Miller, and K. Walker, "The Fisher corpus: A resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69-71.
[13] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31-35.
[14] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, "WHAM!: Extending speech separation to noisy environments," arXiv preprint arXiv:1907.01160, 2019.
[15] M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, "WHAMR!: Noisy and reverberant single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 696-700.
[16] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, "LibriMix: An open-source dataset for generalizable speech separation," arXiv preprint arXiv:2005.11262, 2020.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.
[18] Ö. Çetin and E. Shriberg, "Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition," in Ninth International Conference on Spoken Language Processing, 2006.
[19] A. Stupakov, E. Hanusa, D. Vijaywargi, D. Fox, and J. Bilmes, "The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments," Computer Speech & Language, vol. 26, no. 1, pp. 52-66, 2012.
[20] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1-5.
[21] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," arXiv preprint arXiv:1808.10583, 2018.
[22] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, "Acoustic modelling from the signal domain using CNNs," in Interspeech, 2016, pp. 3434-3438.
[23] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[24] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[25] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020.
[26] Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, Y. Cai, and D. Wang, "CN-CELEB: A challenging Chinese speaker recognition dataset," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7604-7608.
[27] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690-4699.
[28] S. Ioffe, "Probabilistic linear discriminant analysis," in European Conference on Computer Vision. Springer, 2006, pp. 531-542.
[29] K. J. Han, S. Kim, and S. S. Narayanan, "Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1590-1601, 2008.
[30] F. Landini, J. Profant, M. Diez, and L. Burget, "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks," arXiv preprint arXiv:2012.14952, 2020.
[31] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, and Y. Gong, "Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020," arXiv preprint arXiv:2010.11458, 2020.
[32] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884-5888.
[33] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
[34] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.