AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario
arXiv:2104.03603v4 [cs.SD] 10 Aug 2021

Yihui Fu1,*, Luyao Cheng1,*, Shubo Lv1, Yukai Jv1, Yuxiang Kong1, Zhuo Chen2, Yanxin Hu1, Lei Xie1,#, Jian Wu3, Hui Bu4, Xin Xu4, Jun Du5, Jingdong Chen1
1 Northwestern Polytechnical University, Xi'an, China; 2 Microsoft Corporation, USA; 3 Microsoft Corporation, China; 4 Beijing Shell Shell Technology Co., Ltd., Beijing, China; 5 University of Science and Technology of China, Hefei, China
{yhfu, lycheng, shblv, ykjv, yxkong, yxhu}@npu-aslp.org, lxie@nwpu.edu.cn, {zhuc, wujian}@microsoft.com, {buhui, xuxin}@aishelldata.com, jundu@ustc.edu.cn, jingdongchen@ieee.org
(* Equal contribution. # Corresponding author.)

Abstract

In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in the conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation, such as short pauses, speech overlap, quick speaker turns, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field.

Index Terms: AISHELL-4, speech front-end processing, speech recognition, speaker diarization, conference scenario, Mandarin

1. Introduction

Meeting transcription is the process of estimating "who speaks what at when" in meeting recordings, which usually consist of utterances from multiple speakers with a certain amount of speech overlap. It is considered one of the most challenging problems in speech processing, as it is a combination of various speech tasks, including speech front-end processing, speech activity detection (SAD), automatic speech recognition (ASR), speaker identification and diarization, etc. While several works show promising results [1, 2], the problem is still considered unsolved, especially for real-world applications [3].

The scarcity of relevant data is one of the main factors that hinder the development of advanced meeting transcription systems. As meeting transcription involves multiple aspects of the input audio, such as speech content, speaker identities and their onset/offset times, precise annotation of all aspects is expensive and usually difficult to obtain. Meanwhile, meetings often contain a noticeable amount of speech overlap, quick speaker turns, non-grammatical presentation and noise, further increasing the transcription difficulty and cost. For example, to transcribe the overlapped regions in a meeting, the transcriber is often required to listen to the audio multiple times in order to understand each involved speaker. Therefore, most of the previous works on meeting transcription or relevant tasks such as overlapped speech recognition employ synthetic datasets for experimentation [4, 5, 6].

To help with the data insufficiency problem, several relevant datasets have been collected and released [6, 7, 8, 9, 10]. However, most of the existing datasets suffer from various limitations, ranging from corpus setup (e.g., corpus size, speaker and spatial variety, collection condition) to corpus content (e.g., recording quality, accented speech, speaking style). Meanwhile, almost all publicly available meeting corpora are in English, which largely limits the data variation for meeting transcription systems. As each language possesses unique properties, a solution designed for English meetings is likely to be sub-optimal for other languages.

In this work, in order to boost research on advanced meeting transcription, we design and release a sizeable collection of real meetings, namely the AISHELL-4 dataset (http://www.aishelltech.com/aishell_4). AISHELL-4 is a Mandarin corpus containing 120 hours of meeting recordings, collected with an 8-channel microphone array. The corpus is designed to cover a variety of aspects of real-world meetings, including diverse recording conditions and various numbers of meeting participants and overlap ratios. High-quality transcriptions on multiple aspects are provided for each meeting sample, allowing researchers to explore different aspects of meeting processing. A baseline system is also released along with the corpus, to further facilitate research on this complicated problem.

2. Previous Works

For ASR with conversational speech, Switchboard [11], collected by Texas Instruments, is a dataset of telephone communication with about 2,500 conversations by 500 native speakers. However, its data scale is relatively limited. Fisher [12] is another conversational telephone speech dataset, which has yielded more than 16,000 English conversations and 2,000 hours of speech, covering various speakers and speaking styles. However, Fisher, together with Switchboard, is recorded with a sampling rate of 8 kHz, which does not meet the modern requirement of high-accuracy meeting transcription. Furthermore, these datasets are recorded in a single-speaker manner and contain no speech overlap.

To cope with speech overlap, a variety of datasets have been developed. WSJ0-2mix [13] and its variants, including WHAM! [14] and WHAMR! [15], are mostly adopted for speech separation evaluation and consist of artificially mixed speech samples built from the Wall Street Journal dataset. LibriMix [16], derived from LibriSpeech [17], is an alternative to the datasets mentioned above for the noisy speech separation task; its sizeable data scale leads to better generalization in model training. However, all datasets mentioned above are synthetic and are created in a short-segment, fully overlapped manner, which severely mismatches real conversation, where the overlap ratio is usually below 20% [18].

LibriCSS and CHiME-6 are recently introduced datasets for conference or indoor conversation transcription that take overlapped speech into consideration. LibriCSS mixes LibriSpeech utterances from different speakers to form single-channel continuous utterances, which are then played by multiple Hi-Fi loudspeakers and recorded by a 7-channel microphone array in one conference venue. However, LibriSpeech is a reading-style corpus with relatively stable speed and pronunciation, which mismatches natural conversation. On the other hand, the CHiME-6 challenge includes a track aimed at recognizing unsegmented dinner-party conversations recorded on multiple microphone arrays. However, the recording quality of CHiME-6 is relatively low, as a result of audio clipping, a large amount of interjection and laughter, signal attenuation, etc. The number of participants in each CHiME-6 meeting is fixed to four, making it difficult to evaluate the generalization of meeting transcription systems. ICSI [8], CHIL [9] and AMI [10] are meeting corpora collected by academic laboratories rather than with commercial microphones, which have relatively limited speaker variation and restrict the universality of the data. The data scale of the Santa Barbara Corpus of Spoken American English [19] is considerable; unfortunately, it only contains single-channel data.

For Mandarin speech corpora, the commonly used open source datasets for speech recognition include AISHELL-1 [20], AISHELL-2 [21], aidatatang (http://www.openslr.org/62/), etc. However, most of these datasets are recorded in a near-field scenario and contain no speech overlap or obvious noise and reverberation. To the best of our knowledge, there is no publicly available meeting dataset in Mandarin.

3. Datasets

AISHELL-4 contains 120 hours of speech data in total, divided into a 107.50-hour training set and a 12.72-hour evaluation set. The training and evaluation sets contain 191 and 20 sessions, respectively. Each session consists of a 30-minute discussion by a group of participants. The total numbers of participants in the training and evaluation sets are 36 and 25, with balanced gender coverage.

The dataset is collected in 10 conference venues, which are divided into three types: small, medium and large rooms, whose sizes range from 7 x 3 x 3 m3 to 15 x 7 x 3 m3. The wall materials of the conference venues cover cement, glass, etc. Other furnishings in the conference venues include sofas, TVs, blackboards, fans, air conditioners, plants, etc. During recording, the participants of the conference sit around the microphone array, which is placed on the table in the middle of the room, and conduct natural conversation. The microphone-speaker distance ranges from 0.6 to 6.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the conference, various kinds of indoor noise, including but not limited to clicking, keyboard, door opening/closing, fan and babble noise, are made by participants naturally. For the training set, the participants are required to remain in the same position during recording, while for the evaluation set, the participants may move naturally within a small range. No room is shared between the training and evaluation sets, and only one speaker appears in both. An example of a recording venue from the training set, including the topology of the microphone array, is shown in Fig. 1.

Figure 1: An example of recording venue of training set and the topology of microphone array.

The number of participants within one conference session ranges from 4 to 8. To ensure the coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratios of the training and evaluation sets are 19.04% and 9.31%, respectively. More details of AISHELL-4 are shown in Table 1. A detailed session-level overlap ratio distribution of the training and evaluation sets is shown in Table 2.

Table 1: Details of the AISHELL-4 dataset.

                          Training    Evaluation
    Duration (h)          107.50      12.72
    #Session              191         20
    #Room                 5           5
    #Participant          36          25
    #Male                 16          11
    #Female               20          14
    Overlap Ratio (Avg.)  19.04%      9.31%

Table 2: Session-level overlap ratio distribution. 0%-10%, 10%-20%, 20%-30%, 30%-40% and 40%-100% indicate the range of overlap ratio of each session; the numbers give the number of sessions with the corresponding overlap ratio in the training and evaluation sets, respectively.

    Overlap Ratio   Training    Evaluation
    0%-10%          41          12
    10%-20%         76          6
    20%-30%         44          2
    30%-40%         20          0
    40%-100%        10          0

We also record the near-field signal of each participant using a headset microphone. To obtain the transcription, we first align the signal recorded by the headset microphone with the first channel of the microphone array and then select the signal with the higher quality for manual labeling. Before labeling the scripts, we apply automatic speech recognition to the recorded data to help the transcribers produce accurate labels. Inspectors then double-check the labeling results of each session and decide whether they meet the acceptance standard. Praat (https://github.com/praat/praat) is used for further calibration, to check the accuracy of the speaker distribution and to avoid mis-cutting of speech segments. We also pay special attention to accurate punctuation labeling. Each session is labeled by three professional annotators on average, as a secondary inspection to improve the labeling quality.

All scripts of the speech data are prepared in TextGrid format for each session, which contains the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments of each speaker, and the timestamp and transcription of each segment. Non-speech events, such as pauses, laughing, coughing and breathing, are transcribed as well. Overlapping and non-overlapping segments are also identified.
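Since each TextGrid file stores per-speaker segment boundaries, statistics such as the session-level overlap ratios in Table 2 can be recomputed directly from the annotations. The following is a minimal, hypothetical Python sketch, not part of the released baseline; it assumes the speaker-attributed segments have already been read from the TextGrid tiers with any TextGrid parser, and it takes overlap ratio to mean the fraction of speech time covered by two or more simultaneous speakers, which may differ slightly from the exact definition used for Table 2.

```python
from typing import List, Tuple

def overlap_ratio(segments: List[Tuple[str, float, float]]) -> float:
    """Estimate a session-level overlap ratio from speaker-attributed
    segments (speaker_id, start_sec, end_sec), e.g. read from the
    per-speaker interval tiers of an AISHELL-4 TextGrid file.

    Overlap ratio is taken here as: time covered by two or more active
    speakers, divided by time covered by at least one active speaker.
    """
    # Sweep over segment boundaries, tracking how many speakers are active.
    events = []
    for _, start, end in segments:
        events.append((start, +1))
        events.append((end, -1))
    events.sort()

    speech, overlap, active, prev_t = 0.0, 0.0, 0, 0.0
    for t, delta in events:
        if active >= 1:
            speech += t - prev_t
        if active >= 2:
            overlap += t - prev_t
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0


if __name__ == "__main__":
    # Toy example: two speakers overlapping for 2 s out of 10 s of speech.
    segs = [("S1", 0.0, 6.0), ("S2", 4.0, 10.0)]
    print(f"overlap ratio: {overlap_ratio(segs):.2%}")  # -> 20.00%
```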
4. Baseline

We build our baseline meeting transcription system (https://github.com/felixfuyihui/AISHELL-4) inspired by [6, 2]. The baseline system contains three major sub-modules, namely speech front-end processing, speaker diarization and ASR. We train each sub-module separately. During evaluation, the meeting recording is first processed by the speaker diarization module, which generates the speaker attribution and boundary information of each individual utterance. Then the speech separation module is applied to each segmented utterance to remove potential speech overlap. Finally, the processed audio is recognized by the ASR module to obtain the speaker-attributed transcription.

4.1. Speaker Diarization

We adopt the Kaldi-based diarization system from the CHiME-6 challenge. The diarization module includes three components: SAD, a speaker embedding extractor and clustering.

We employ the CHiME-6 SAD model [22] (http://kaldi-asr.org/models/12/0012_sad_v1.tar.gz), which consists of a 5-layer time-delay neural network (TDNN) and a 2-layer statistics pooling for continuous conference segmentation. The features sent to the SAD model are 40-dimensional mel-frequency cepstral coefficients (MFCC) extracted every 10 ms with a 25 ms window.

The speaker embedding network (https://github.com/BUTSpeechFIT/VBx) is based on ResNet [23] and is trained on VoxCeleb1 [24], VoxCeleb2 [25] and CN-Celeb [26], using stochastic gradient descent (SGD) and the additive angular margin loss [27]. We ramp up the margin during the first two epochs and then train the network for the following epochs with the margin fixed at m = 0.2. Data augmentation is performed in the same way as in the SRE16 Kaldi recipe (https://github.com/kaldi-asr/kaldi/blob/master/egs/sre16/v2). The features fed to the speaker embedding network are 64-dimensional mel-filterbanks extracted every 15 ms with a 25 ms window. The post-processing of the speaker embeddings relies on a probabilistic linear discriminant analysis (PLDA) [28] model trained on the same data as the speaker embedding network.

All generated speaker embeddings are first pre-clustered with the agglomerative hierarchical clustering (AHC) [29] algorithm to obtain initial speaker labels; the threshold of the AHC algorithm is set to 0.015. The dimension of the speaker embeddings is then reduced from 256 to 128 for further clustering with the Variational Bayesian HMM clustering (VBx) [30] model. The diarization module thus outputs the segmented utterances of the whole conference session together with their corresponding speaker information. Note that the speaker diarization does not take overlapped speech into consideration, as we assume that each speech frame corresponds to only one speaker. Diarization systems such as the one in [31] are designed to address speech overlap; we will explore their integration in follow-up work.
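To make the clustering stage concrete, below is a small illustrative sketch of the AHC initialisation using SciPy. It is not the released recipe: the baseline scores embeddings with PLDA (hence the 0.015 threshold) and then refines the labels with VBx, whereas this sketch clusters length-normalised embeddings by cosine distance with an arbitrary cut-off, purely to show the shape of the data flow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ahc_initial_labels(embeddings: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Agglomerative clustering of speaker embeddings into initial labels.

    embeddings : (N, D) array, one embedding per SAD sub-segment.
    threshold  : cosine-distance cut-off; a placeholder value only, since the
                 released baseline thresholds PLDA scores (0.015) instead.
    Returns an (N,) array of integer cluster (speaker) labels, which the
    baseline would then refine with VBx resegmentation.
    """
    # Length-normalise so that cosine distance behaves as in speaker verification.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    Z = linkage(emb, method="average", metric="cosine")
    return fcluster(Z, t=threshold, criterion="distance")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers": embeddings scattered around two random centroids.
    centers = rng.normal(size=(2, 256))
    X = np.vstack([c + 0.05 * rng.normal(size=(40, 256)) for c in centers])
    print(ahc_initial_labels(X))  # roughly two distinct labels
```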
4.2. Speech Front-end Processing

We employ the separation solution from [3] in our baseline system, which contains a multi-channel speech separation network and a mask-based minimum variance distortionless response (MVDR) beamformer. For each test segment, two masks are first estimated by the network, followed by MVDR beamforming to generate the final separation output. The separation network consists of 3 LSTM layers, each with 3084 nodes, followed by a fully connected (FC) layer with a sigmoid activation function that forms the speaker masks.

We use synthetic overlapped speech to train the separation network. Each mixture sample consists of fully or partially overlapped speech, directional noise and isotropic noise. 460 hours of clean speech from LibriSpeech is adopted as the close-talk data for simulation. The noise set comes from MUSAN and AudioSet, which contain 49 and 88 hours of recordings, respectively. The multi-channel RIRs and isotropic noise are simulated based on the topology of the 8-channel microphone array. The room size ranges from 3 x 3 x 3 m3 to 10 x 10 x 3 m3, while the microphone array is randomly placed within a 2 x 2 m2 area in the center of the room, with its height randomly chosen between 0.6 and 1.2 m. RT60 ranges from 0.2 to 0.8 s. The angle between each speaker and the noise is set to at least 20 degrees to ensure the spatial distinctiveness of each sound source. The SNR range is [5, 20] dB and the SDR range is [-5, 5] dB. The overlap ratio of the simulated data is divided into three parts with equal data amounts: 1) non-overlapped, 2) overlap ratio randomly ranging from 0 to 20%, and 3) overlap ratio randomly ranging from 20 to 80%. Finally, we simulate 364 and 10 hours of data for model training and development, respectively.

All simulated data is segmented into 4.0 s short utterances during training. Inter-channel phase difference (IPD) features are calculated for four microphone pairs: (1,5), (2,6), (3,7) and (4,8). The frame length and hop length used in the STFT are set to 32 and 16 ms, respectively. The mask estimator is trained for 20 epochs with the Adam optimizer, where the initial learning rate is set to 0.001 and is halved if there is no improvement on the development set.

We concatenate the magnitude of the first channel of the observed signal and the cosIPD of the selected microphone pairs along the frequency axis as the input of the model. The IPD is calculated as $\mathrm{IPD}_{i,j} = \angle y^{i}_{t,f} - \angle y^{j}_{t,f}$, where $i, j$ denote microphone indices. For the mask estimator, we use the STFT mask as the training target, and the loss function for magnitude approximation with the permutation invariant training (PIT) criterion is used for training.

In the MVDR step, the covariance matrices are calculated from the estimated masks via

$$R^{k}_{f} = \frac{1}{\sum_{t} m^{k}_{t,f}} \sum_{t} m^{k}_{t,f} \, y_{t,f} \, y^{H}_{t,f}, \qquad (1)$$

where $m^{k}_{t,f}$ refers to the mask value of source $k$ on each T-F bin and $y_{t,f}$ denotes the corresponding observed vector in the frequency domain. In this work, $k \in \{0, 1\}$ as we consider at most two speakers. The filter coefficients of the MVDR beamformer $w^{k}_{f}$ for source $k$ are derived as

$$w^{k}_{f} = \frac{(R^{1-k}_{f})^{-1} R^{k}_{f}}{\operatorname{tr}\big((R^{1-k}_{f})^{-1} R^{k}_{f}\big)} \, u_{r}, \qquad (2)$$

where $\operatorname{tr}(\cdot)$ is the matrix trace operation and $u_{r}$ is a one-hot vector indicating the reference microphone. Thus, we obtain the final enhanced and separated result for each source via

$$\hat{y}^{k}_{t,f} = (w^{k}_{f})^{H} y_{t,f}. \qquad (3)$$

From the two separated channels of each utterance, we pick the channel with the larger energy as the final processing result for this utterance.
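For readers who prefer code to equations, the following NumPy sketch mirrors the mask-based MVDR step of Eqs. (1)-(3). It is an illustrative re-implementation under simplifying assumptions (the released baseline performs this step in PyTorch), the mask estimation is assumed to have been done by the separation network, and the small diagonal loading term is an addition of ours for numerical stability rather than part of Eq. (2).

```python
import numpy as np

def mvdr_from_masks(Y, masks, ref_mic=0):
    """Mask-based MVDR beamforming following Eqs. (1)-(3).

    Y     : (C, T, F) complex STFT of the C-channel mixture.
    masks : (K, T, F) real T-F masks for the K sources (here K = 2).
    Returns (K, T, F) complex STFTs of the separated sources.
    """
    C, T, F = Y.shape
    K = masks.shape[0]

    # Eq. (1): mask-weighted spatial covariance matrix per source and frequency.
    R = np.zeros((K, F, C, C), dtype=np.complex128)
    for k in range(K):
        for f in range(F):
            yf = Y[:, :, f]                      # (C, T)
            m = masks[k, :, f]                   # (T,)
            R[k, f] = (m * yf) @ yf.conj().T / np.maximum(m.sum(), 1e-8)

    # Eq. (2): w_f^k = (R^{1-k})^{-1} R^k u_r / tr((R^{1-k})^{-1} R^k).
    u = np.zeros(C)
    u[ref_mic] = 1.0
    out = np.zeros((K, T, F), dtype=np.complex128)
    for k in range(K):
        for f in range(F):
            # Small diagonal loading (not in Eq. (2)) keeps the inverse well conditioned.
            num = np.linalg.solve(R[1 - k, f] + 1e-6 * np.eye(C), R[k, f])
            w = (num @ u) / (np.trace(num) + 1e-8)
            # Eq. (3): apply the filter w^H y to every frame of this frequency bin.
            out[k, :, f] = w.conj() @ Y[:, :, f]
    return out


if __name__ == "__main__":
    # Toy usage with random data: 8 channels, 200 frames, 257 frequency bins.
    Y = np.random.randn(8, 200, 257) + 1j * np.random.randn(8, 200, 257)
    masks = np.random.rand(2, 200, 257)
    print(mvdr_from_masks(Y, masks).shape)  # (2, 200, 257)
```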
4.3. Automatic Speech Recognition

A transformer-based end-to-end speech recognition model [32] with a sequence-to-sequence architecture is employed as the back-end of our system. A 2-layer 2D CNN is first applied to reduce the input frame rate, followed by an 8-layer transformer encoder and a 6-layer transformer decoder. The multi-head self-attention layers contain 8 heads with 512 dimensions, and the inner dimension of the feed-forward network is 1024. The model is trained with a joint CTC and cross-entropy objective function [33] to accelerate the convergence of training. The beam size is set to 24 during beam search decoding.

The training data is composed of two parts: simulated and real-recorded data. For the simulated data, the clean speech comes from AISHELL-1, aidatatang_200zh and Primewords (http://www.openslr.org/47/), while the noise, RIRs and SNR settings are the same as those of the front-end model. We combine the simulated and original data to form 768 and 97 hours of single-speaker, single-channel data as the training and development sets, respectively. For the real-recorded data, we select the non-overlapped part of the AISHELL-4 training set according to the ground-truth segmentation information, which amounts to 63 hours of data used for model training only.

The frame length and hop size of the STFT are set to 25 and 10 ms. We utilize 80-dimensional log-fbank features with utterance-wise mean-variance normalization as the input features, and we apply SpecAugment [34] for data augmentation. We train the model for a maximum of 50 epochs with the Adam optimizer. A warm-up and decay learning rate scheduler with a peak value of 10^-4 is adopted, where the number of warm-up steps is set to 25,000.

4.4. Evaluation and Results

Two metrics are included in the baseline system, referred to as speaker-independent and speaker-dependent character error rate (CER). The speaker-independent task is designed to measure the performance of the front-end and speech recognition modules; in this task, we use the ground-truth utterance boundaries instead of the diarization output to segment the utterances. The speaker-dependent task, on the other hand, takes all involved modules into consideration. After obtaining the transcription or speaker-attributed transcription, Asclite (https://github.com/usnistgov/SCTK), which can align multiple hypotheses against multiple reference transcriptions, is used to estimate the CERs.

Table 3: %CER results for the speaker independent and speaker dependent tasks.

            Speaker Independent   Speaker Dependent
    No FE   32.56                 41.55
    FE      30.49                 39.86

We report the CER of the reference system, with and without the front-end, in Table 3. The CERs on data without front-end processing ("No FE") are 32.56% and 41.55% on the speaker-independent and speaker-dependent tasks, respectively. With front-end processing ("FE"), we get 30.49% CER on the speaker-independent task and 39.86% CER on the speaker-dependent task when applying SAD and speaker diarization. In the future, we will consider performing speech front-end processing in a continuous manner and reducing the usage of ground-truth information to obtain more promising results.
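Asclite is needed above because a hypothesis must be aligned against multiple, possibly overlapping reference speakers. For a single reference/hypothesis pair, however, CER reduces to a character-level edit distance; the toy sketch below only illustrates the metric itself and is not the scoring pipeline used to produce Table 3.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: (substitutions + deletions + insertions) / len(ref),
    computed with standard Levenshtein dynamic programming over characters."""
    r, h = list(ref), list(hyp)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(r), 1)


if __name__ == "__main__":
    # One inserted character against an 8-character Mandarin reference.
    print(f"{cer('今天开会讨论预算', '今天开会讨论预算了'):.2%}")  # -> 12.50%
```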
5. Conclusions

In this paper, we present AISHELL-4, a sizable Mandarin speech dataset for speech enhancement, separation, recognition and speaker diarization in the conference scenario. All released data are real-recorded, multi-channel recordings collected in real conference venues and acoustic conditions. We also release a training and evaluation framework as a baseline for research in this field. Our experimental results show that the complex acoustic scenario degrades ASR performance to a great extent, resulting in 32.56% and 41.55% CER on the speaker-independent and speaker-dependent tasks, respectively. After front-end and diarization processing, the CERs reach 30.49% and 39.86% on the speaker-independent and speaker-dependent tasks, respectively. We hope that this dataset and the associated baseline can encourage researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks, and help to reduce the gap between the state of research and real conferencing applications.

6. References

[1] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, and T. Yoshioka, "Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 809-816.
[2] X. Wang, N. Kanda, Y. Gaur, Z. Chen, Z. Meng, and T. Yoshioka, "Exploring end-to-end multi-channel ASR with bias information for meeting transcription," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 833-840.
[3] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang et al., "Advances in online audio-visual meeting transcription," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 276-283.
[4] C. Li, Y. Luo, C. Han, J. Li, T. Yoshioka, T. Zhou, M. Delcroix, K. Kinoshita, C. Boeddeker, Y. Qian et al., "Dual-path RNN for long recording speech separation," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 865-872.
[5] S. Chen, Y. Wu, Z. Chen, T. Yoshioka, S. Liu, and J. Li, "Don't shoot butterfly with rifles: Multi-channel continuous speech separation with early exit transformer," arXiv preprint arXiv:2010.12180, 2020.
[6] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, and J. Li, "Continuous speech separation: Dataset and analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7284-7288.
[7] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.
[8] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke et al., "The ICSI meeting corpus," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE, 2003, pp. 1-5.
[9] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. M. Chu, A. Tyagi, J. R. Casas, J. Turmo, L. Cristoforetti, F. Tobia et al., "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms," Language Resources and Evaluation, vol. 41, no. 3, pp. 389-407, 2007.
[10] S. Renals, T. Hain, and H. Bourlard, "Interpretation of multiparty meetings: The AMI and AMIDA projects," in Hands-Free Speech Communication and Microphone Arrays. IEEE, 2008, pp. 115-118.
[11] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "Switchboard: Telephone speech corpus for research and development," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. IEEE Computer Society, 1992, pp. 517-520.
[12] C. Cieri, D. Miller, and K. Walker, "The Fisher corpus: A resource for the next generations of speech-to-text," in LREC, vol. 4, 2004, pp. 69-71.
[13] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31-35.
[14] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, "WHAM!: Extending speech separation to noisy environments," arXiv preprint arXiv:1907.01160, 2019.
[15] M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, "WHAMR!: Noisy and reverberant single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 696-700.
[16] J. Cosentino, M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, "LibriMix: An open-source dataset for generalizable speech separation," arXiv preprint arXiv:2005.11262, 2020.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.
[18] Ö. Çetin and E. Shriberg, "Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: Insights for automatic speech recognition," in Ninth International Conference on Spoken Language Processing, 2006.
[19] A. Stupakov, E. Hanusa, D. Vijaywargi, D. Fox, and J. Bilmes, "The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments," Computer Speech & Language, vol. 26, no. 1, pp. 52-66, 2012.
[20] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1-5.
[21] J. Du, X. Na, X. Liu, and H. Bu, "AISHELL-2: Transforming Mandarin ASR research into industrial scale," arXiv preprint arXiv:1808.10583, 2018.
[22] P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, "Acoustic modelling from the signal domain using CNNs," in Interspeech, 2016, pp. 3434-3438.
[23] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[24] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[25] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020.
[26] Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, Y. Cai, and D. Wang, "CN-CELEB: A challenging Chinese speaker recognition dataset," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7604-7608.
[27] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690-4699.
[28] S. Ioffe, "Probabilistic linear discriminant analysis," in European Conference on Computer Vision. Springer, 2006, pp. 531-542.
[29] K. J. Han, S. Kim, and S. S. Narayanan, "Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1590-1601, 2008.
[30] F. Landini, J. Profant, M. Diez, and L. Burget, "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks," arXiv preprint arXiv:2012.14952, 2020.
[31] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, and Y. Gong, "Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020," arXiv preprint arXiv:2010.11458, 2020.
[32] L. Dong, S. Xu, and B. Xu, "Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5884-5888.
[33] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
[34] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.