SPEAKER DIARIZATION WITH SESSION-LEVEL SPEAKER EMBEDDING REFINEMENT USING GRAPH NEURAL NETWORKS

Jixuan Wang 1,2∗, Xiong Xiao 3, Jian Wu 3, Ranjani Ramamurthy 3, Frank Rudzicz 1,2, Michael Brudno 1,2

1 University of Toronto, Canada   2 Vector Institute, Canada   3 Microsoft, USA
{jixuan, frank, brudno}@cs.toronto.edu   {xioxiao, jianwu, ranjanir}@microsoft.com

arXiv:2005.11371v1 [eess.AS] 22 May 2020

∗ Initial work by the first author was done as an intern at Microsoft.

ABSTRACT

Deep speaker embedding models are commonly used as building blocks for speaker diarization systems; however, the embedding model is usually trained with a global loss defined on the training data, which can be sub-optimal for distinguishing speakers locally within a specific meeting session. In this work we present the first use of graph neural networks (GNNs) for the speaker diarization problem, utilizing a GNN to refine speaker embeddings locally using the structural information between speech segments inside each session. The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space in which the different speakers within a single session are better separated. The model is trained for linkage prediction in a supervised manner by minimizing the difference between the affinity matrix constructed from the refined embeddings and the ground-truth adjacency matrix. Spectral clustering is then applied on top of the refined embeddings. We show that the clustering performance of the refined speaker embeddings significantly outperforms that of the original embeddings on both simulated and real meeting data, and that our system achieves the state-of-the-art result on the NIST SRE 2000 CALLHOME database.

Index Terms— Speaker diarization, graph neural networks, deep speaker embedding.

Fig. 1. Overview of the proposed method. A graph is constructed for each session using speaker embeddings extracted by a pre-trained speaker embedding model. A GNN model remaps the original embeddings into another embedding space; it is trained according to a loss function defined by the difference between the refined affinity matrix and the ground-truth adjacency matrix.

1. INTRODUCTION

Speaker diarization is the problem of determining "who spoke when". A typical speaker diarization system contains multiple steps. First, the non-speech parts are filtered out by voice activity detection (VAD). Second, the speech parts are split into small homogeneous segments, either uniformly or according to detected speaker change points. Third, each segment is mapped into a fixed-dimensional embedding, such as an i-vector [1], x-vector [2], or d-vector [3, 4, 5, 6, 7]. Finally, clustering methods or end-to-end approaches are applied to generate the diarization results [8, 9, 10, 11, 12, 13]. Usually a classifier needs to be trained for similarity scoring of i-vectors and x-vectors, while similarities between d-vectors can usually be measured by simple distance metrics, e.g., cosine or Euclidean distance. Commonly used clustering methods for speaker diarization include K-means [14], agglomerative hierarchical clustering (AHC) [15], spectral clustering (SC) [16], and affinity propagation [17].

Although deep learning methods driven by large-scale datasets have dominated the fields of speaker and speech recognition, it remains non-trivial to design an end-to-end objective function for speaker diarization that is permutation invariant with respect to both speaker order and speaker number. Most recently, several end-to-end approaches have appeared that either utilize a factored generative model [13] or are trained with a permutation-free loss [18].
In this paper, we consider a task upstream of speaker diarization: we suggest that diarization systems can be further improved by locally refining the speaker embeddings for each session. Speaker embedding models are usually designed to generalize well across a large range of speakers with different characteristics. However, from the perspective of distinguishing speakers, diarization is a simpler task than speaker recognition: we only need to separate the few speakers present in each session.

Our approach is to refine speaker embeddings by improving their performance for similarity measurement between speech segments. If we could correctly predict, for every pair of speech segments, whether they belong to the same speaker, the diarization problem would be solved. Our method utilizes the local structural information in the speaker embedding space of each session, expressed as a graph built upon the session's speech segments. Predicting similarities between pairs of nodes is analogous to the link prediction task on graphs, so our model can be seen as a neural link predictor [19, 20, 21] consisting of an encoding component and a scoring component. The encoding component uses GNN [22, 23, 24, 25] layers to map speaker embeddings into another embedding space; fully connected layers or cosine similarity serve as the scoring component. To train the model, a loss function is defined as the difference between the affinity matrix constructed from the refined embeddings and the ground-truth adjacency matrix. By minimizing this loss, the model learns to refine the original embeddings so that different speakers are better separated within each session.

Very recently (and in parallel to this work), Lin et al. [26] proposed the use of LSTMs to improve similarity measurement for speaker diarization. Unlike their work, we utilize the structural information in the embedding space rather than the temporal information in label sequences. Moreover, our model outputs not only better similarity measurements but also refined speaker embeddings, which could potentially be fed into non-clustering-based methods, and it achieves better performance under the same experimental setup. Experiments on both simulated and real meeting data show that speaker number detection and speaker diarization with the refined embeddings significantly outperform the original embeddings, and our system achieves the state-of-the-art result on the NIST SRE 2000 CALLHOME database.
2. GRAPH BASED SPEAKER DIARIZATION

2.1. Building graphs of speech segments

A graph is built for each session using the pre-trained speaker embeddings. Each node represents an audio segment, which could be word-level, utterance-level, or extracted by a sliding window with a fixed step. The speaker embedding of each segment serves as its node features. Edge weights between nodes are given by PLDA scores for x-vectors or cosine similarities for d-vectors. We keep only edges whose weight is larger than a threshold, which is treated as a hyperparameter.

Formally, each meeting session can be represented as a graph $G(V, E, A)$, where $V$ is the set of nodes (speech segments), $E$ is the set of edges, and $A \in \mathbb{R}^{N \times N}$ is the affinity matrix with $A_{ij} > 0$ if edge $e_{ij} = (v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise. The matrix of segment embeddings $X \in \mathbb{R}^{N \times D}$ serves as the node features of $G$, where $N$ is the total number of segments and $D$ is the dimension of the embedding space; the features of node $v_i$ are its embedding $\mathbf{x}_i$, $i \in \{1, 2, \ldots, N\}$. The goal of speaker diarization can then be formulated as predicting a label $y_i$ for each segment embedding $\mathbf{x}_i$ such that $y_i = y_j$ if and only if $\mathbf{x}_i$ and $\mathbf{x}_j$ belong to the same speaker, for $i, j \in \{1, 2, \ldots, N\}$.
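To make the construction concrete, here is a minimal sketch of building such a graph for the cosine-similarity (d-vector) case. The function name, the dense affinity representation, and the retention of diagonal self-similarity entries are our illustrative choices, not details prescribed by the paper; the 0.2 threshold matches the value reported in the experiments.

```python
import torch
import torch.nn.functional as F

def build_session_graph(embeddings: torch.Tensor, threshold: float = 0.2):
    """Build a graph over the speech segments of one session.

    embeddings: (N, D) tensor of pre-extracted speaker embeddings,
                one row per segment.
    Returns the thresholded affinity matrix and an edge index in the
    (2, E) COO format expected by PyTorch Geometric.
    """
    # Pairwise cosine similarities between all segments.
    normed = F.normalize(embeddings, dim=1)
    affinity = normed @ normed.t()                 # (N, N)

    # Keep only edges whose weight exceeds the threshold
    # (diagonal self-similarities are 1.0 and therefore kept).
    affinity = affinity * (affinity > threshold)

    # Convert the surviving entries to an edge list.
    edge_index = affinity.nonzero().t()            # (2, E)
    return affinity, edge_index
```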
2.2. Graph neural networks

We apply several variants of GNNs that fall under the framework of message passing neural networks (MPNNs) [25]. Under the MPNN framework, the convolutional operator is expressed as a message passing scheme:

$$\mathbf{x}'_i = \gamma_\Theta\Big(\mathbf{x}_i,\ \square_{j \in \mathcal{N}(i)}\, \phi_\Theta(\mathbf{x}_i, \mathbf{x}_j, \mathbf{e}_{i,j})\Big) \tag{1}$$

where $\mathbf{x}_i$ is the feature of node $i$ in the current layer with dimension $D$, $\mathbf{x}'_i$ is the node feature in the next layer with dimension $D'$, $\mathbf{e}_{i,j}$ is the edge feature from node $i$ to node $j$, $\gamma(\cdot)$ and $\phi(\cdot)$ are the update function and message function, respectively, parameterized by $\Theta$, and $\square$ denotes the aggregation function, e.g., sum, mean, or max.

One GNN variant we apply is the graph convolutional network (GCN) described in [24], in which the aggregation $\square$ corresponds to taking a certain weighted average of the neighboring nodes. The updating scheme between two layers is:

$$\mathbf{x}'_i = \sigma\Big(W \sum_j L_{ij}\, \mathbf{x}_j\Big) \tag{2}$$

where $L = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$ is the normalized affinity matrix with self-connections added, $\hat{A} = A + I_N$, $\hat{D}$ denotes the degree matrix of $\hat{A}$, $W \in \mathbb{R}^{D' \times D}$ is a layer-specific trainable weight matrix, and $\sigma(\cdot)$ is a nonlinear function.
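As an expository companion to Eq. (2), the following is a dense, single-layer implementation of the GCN propagation rule. It is a sketch for clarity only: the actual system uses the equivalent sparse GCNConv layer from PyTorch Geometric [32], and (per Section 2.3 below) no nonlinearity is applied between the GNN layers, so sigma would be the identity there.

```python
import torch

def gcn_layer(X: torch.Tensor, A: torch.Tensor, W: torch.Tensor,
              sigma=torch.relu) -> torch.Tensor:
    """One dense GCN step implementing Eq. (2): x'_i = sigma(W sum_j L_ij x_j).

    X: (N, D) node features, A: (N, N) affinity matrix,
    W: (D', D) trainable weight matrix.
    """
    N = A.size(0)
    A_hat = A + torch.eye(N, device=A.device)   # add self-connections
    d_hat = A_hat.sum(dim=1)                    # degrees of A_hat
    D_inv_sqrt = torch.diag(d_hat.pow(-0.5))
    L = D_inv_sqrt @ A_hat @ D_inv_sqrt         # normalized affinity matrix
    return sigma(L @ X @ W.t())                 # aggregate, then transform
```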
2.3. Model design

Our model consists of two components: an encoding component and a scoring component. GNN layers are applied as the encoding component, with no nonlinear functions between them: because the speaker embeddings are already well trained, adding nonlinearity may discard information contained in the original embeddings and degrade the results. The encoding component generates the refined embeddings, every pair of which is concatenated as input to the scoring component.

Due to the distinct nature of x-vectors and d-vectors, different scoring components are designed for each. For d-vectors, it is common to leave the complexity to the model and use a simple distance metric, e.g., cosine distance; since the original d-vectors are trained to be comparable by such a metric, we keep using it to score the refined embeddings. For x-vectors, on the other hand, an additional classifier, e.g., PLDA, is usually trained to measure similarity. In our model, we therefore add fully connected layers with nonlinear functions after the GNN layers, which act as a classifier on top of the refined embeddings and are trained together with the GNN layers.
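Putting Sections 2.2 and 2.3 together, a sketch of the x-vector variant might look as follows. The layer sizes come from the implementation details in Section 3.1; the class name, the exact pairing scheme, and details such as bias terms are our assumptions. For the d-vector variant, the scorer would simply be cosine similarity between refined embeddings.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class EmbeddingRefiner(nn.Module):
    """Two GCN layers with no nonlinearity in between (Sec. 2.3),
    followed by a pairwise classifier over concatenated refined
    embeddings (x-vector variant)."""

    def __init__(self, in_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, out_dim)
        self.conv2 = GCNConv(out_dim, out_dim)
        self.scorer = nn.Sequential(         # 128 -> 64 (ELU) -> 1 (sigmoid)
            nn.Linear(2 * out_dim, 64), nn.ELU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x, edge_index):
        # Encode: refine embeddings, no nonlinearity between GCN layers.
        h = self.conv2(self.conv1(x, edge_index), edge_index)  # (N, out_dim)

        # Score every ordered pair of refined embeddings.
        n = h.size(0)
        pairs = torch.cat(
            [h.repeat_interleave(n, dim=0), h.repeat(n, 1)], dim=1)
        affinity = self.scorer(pairs).view(n, n)
        return h, affinity   # refined embeddings and predicted affinity
```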
2.4. Loss function

We train the GNN model with respect to a loss function defined as the difference between the affinity matrix constructed from the refined embeddings and the ground-truth adjacency matrix. This is effectively binary classification of linkage over all pairs of segments inside each session. For the x-vector based system, we simply apply the binary cross entropy (BCE) loss to all pairs of segments in each session.

However, this simple loss function does not work for d-vectors. Due to the simplicity of the distance metric used, the gradient would be very noisy if the model were trained to exactly match the cosine similarities to the ground-truth 1s and 0s. Instead we apply the following loss function:

$$\mathcal{L} = \mathrm{hist\_loss}(A, A_{gt}) + \alpha \cdot \lVert A - A_{gt} \rVert_{\mathrm{nuclear}} \tag{3}$$

in which $\mathrm{hist\_loss}$ is the histogram loss [27], $\lVert \cdot \rVert_{\mathrm{nuclear}}$ is the nuclear norm of the difference between $A$ and the ground-truth adjacency matrix $A_{gt}$, and $\alpha$ is a scalar hyperparameter. The histogram loss encourages the distance between pairs of segments belonging to different speakers to be larger than the distance between pairs belonging to the same speaker. The nuclear norm, which is the sum of the singular values of a matrix [28], depends on the rank match between $A$ and $A_{gt}$ rather than on an exact match. The combination of these two losses yields similarity scores that stay within a reasonable range.
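Both objectives can be sketched compactly. In the sketch below, the histogram loss itself is taken from [27] and passed in as an assumed helper rather than reimplemented; `gt_adjacency` is assumed to be a float tensor of 0s and 1s.

```python
import torch
import torch.nn.functional as F

def xvector_loss(pred_affinity: torch.Tensor,
                 gt_adjacency: torch.Tensor) -> torch.Tensor:
    """BCE over all segment pairs in one session (x-vector system)."""
    return F.binary_cross_entropy(pred_affinity, gt_adjacency)

def dvector_loss(affinity: torch.Tensor, gt_adjacency: torch.Tensor,
                 alpha: float, histogram_loss) -> torch.Tensor:
    """Eq. (3): histogram loss plus a nuclear-norm penalty."""
    # Nuclear norm = sum of singular values of (A - A_gt).
    nuclear = torch.linalg.svdvals(affinity - gt_adjacency).sum()
    return histogram_loss(affinity, gt_adjacency) + alpha * nuclear
```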
2.5. Spectral clustering

We use spectral clustering as our backend method, following the standard steps described in [16]. However, to detect the number of speakers, we first perform an eigen-decomposition of the refined affinity matrix and then take the speaker number to be the number of eigenvalues larger than a threshold, which we treat as a tunable parameter. As shown in Section 4.2, this method outperforms the commonly used method based on the maximal eigengap.
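The speaker-count step can be written in a few lines. This is a sketch: the symmetrization before the eigen-decomposition is our assumption for numerical stability, and the threshold is the tunable parameter described above (values around 2.0 to 3.0 appear in Figure 2).

```python
import numpy as np

def estimate_num_speakers(affinity: np.ndarray, threshold: float) -> int:
    """Count the eigenvalues of the refined affinity matrix that
    exceed a tuned threshold (Sec. 2.5)."""
    sym = 0.5 * (affinity + affinity.T)   # symmetrize (our assumption)
    eigvals = np.linalg.eigvalsh(sym)
    return int((eigvals > threshold).sum())
```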
3. EXPERIMENTS WITH X-VECTORS

3.1. Datasets and implementation details

To compare our x-vector-based system with other methods, we follow the evaluation steps described in the callhome_diarization/v2 recipe of Kaldi [29]. We use the pre-trained x-vector model and PLDA scoring model trained on augmented Switchboard and NIST SRE datasets [30]. Results are reported on the commonly used NIST SRE 2000 CALLHOME dataset (LDC2001S97, Disk 8).

Our model architecture consists of two GCN [24] layers followed by two fully connected layers. The input x-vector is 128-dimensional. The output embedding of the GCN layers is 64-dimensional; every pair of output embeddings is concatenated into a 128-dimensional input to the fully connected layers. The first fully connected layer is 64-dimensional with the ELU activation function [31], and the last layer has a single output followed by a sigmoid function. The GCN layers are implemented with the PyTorch Geometric library [32].

Similar to previous work [13, 26], we conduct 5-fold cross validation. The dataset contains 500 sessions in total, uniformly split into 5 subsets of 100 sessions for evaluation. In each fold, the model is trained for 50 epochs on 400 sessions with an initial learning rate of 0.001, reduced by a factor of 10 after 40 epochs. The optimal threshold for speaker number detection on the training set is used for evaluation on the testing set. A graph is built for each session, connecting a pair of nodes if their PLDA score is higher than 0.2. The model is trained session by session, i.e., each batch contains a single graph constructed from one session. All hyperparameters are tuned on a validation set split from the training set of each fold. Results are shown in Table 1 and summarized in the following section.

Table 1. DER (%) on the NIST SRE 2000 CALLHOME dataset. SC refers to spectral clustering and AHC to agglomerative hierarchical clustering.

    Method                                        DER (%)
    Baseline: x-vector + PLDA + AHC (5-fold)      8.64
    Baseline: x-vector + PLDA + SC (5-fold)       8.05
    Recent work: Wang et al. [12]                 12.0
    Recent work: Sell et al. [33]                 11.5
    Recent work: Romero et al. [34]               9.9
    Recent work: Zhang et al. [13] (5-fold)       8.5
    Recent work: Lin et al. [26] (5-fold)         7.73
    Ours: GNN based (5-fold)                      7.24

3.2. Speaker diarization results

We compare the speaker diarization error rate (DER) of our model with several recent works. We achieve state-of-the-art performance with a DER of 7.24%, outperforming both the baselines and all recent works. For fairness, we only include the results of [26] without system fusion and the result of [13] with in-domain training; the DER of [13] with external data is 7.6%, which is still worse than our result. Moreover, the speaker embedding model of [13] was trained on a much larger dataset than the one we are using, which may also influence the results.

4. EXPERIMENTS WITH D-VECTORS

4.1. Datasets and implementation details

For the d-vector-based system, we train the model on simulated conversations and evaluate it on real meetings. The d-vector extraction model is trained on, and fake meetings are simulated from, the VoxCeleb2 dataset [35]. For each simulated meeting, we randomly select 2-15 speakers from the VoxCeleb2 training set as the meeting participants, and for each speaker we sample 2-60 speech segments of 1.5 s duration. In addition, we use 45 sessions of in-house real meetings as another testing set. On average, each meeting contains 6 speakers (ranging from 2 to 16), and the average session duration is about 30 minutes.

To build the graphs, we only connect a pair of speech segments if the cosine similarity between their d-vectors is larger than a threshold of 0.2, tuned on the validation set. Our model contains two GCN layers, whose hidden and output dimensions are both equal to the input dimension of 128. We use 150 bins for the histogram loss. The reported DER only includes speaker confusion.

4.2. Speaker number detection results

We first evaluate the refined embeddings for speaker number detection. We search for the optimal threshold for this task using simulated conversations on the validation set; the threshold achieving the lowest mean speaker number detection error across all simulated sessions is chosen for evaluation on the testing set. As shown in Figure 2, the minimal error achieved by the refined embeddings is lower than that of the original embeddings, at thresholds of 3.0 and 2.0, respectively. The slope of the refined-embedding curve beyond the optimal threshold is also flatter than that of the original embeddings, indicating that speaker number detection with the refined embeddings is more robust.

Fig. 2. Mean error of speaker number detection at different thresholds. The dashed line (original) shows results with the original speaker embeddings; the solid line (refined) shows results with the refined speaker embeddings.

With the optimal threshold tuned on the validation set, we evaluate speaker number detection on two test sets: fake conversations simulated from the VoxCeleb2 testing set and the in-house meeting dataset, which are in-domain and out-of-domain respectively. As shown in Table 2, the refined embeddings outperform the original embeddings on both datasets with both the eigengap-based method [12] and the threshold-based method. Intriguingly, while the threshold-based methods with original and refined embeddings both outperform the eigengap-based methods on the in-domain dataset (VoxCeleb2 Test), the performance of the threshold-based method with original embeddings degrades on the out-of-domain (in-house meeting) dataset. This is likely because the original embeddings are less robust with respect to the threshold, whereas the refined embeddings remain robust even with a threshold tuned on a different dataset. Consequently, the threshold-based method with refined embeddings significantly outperforms the other approaches on both in-domain and out-of-domain datasets.

Table 2. Mean error of speaker number detection (first two rows) and DER (last row) for the eigengap-based method (gap) and the threshold-based method (thred), using original (O) and refined (R) embeddings, respectively. The last row shows the DER improvement relative to O-gap.

    Dataset                   O-gap    O-thred    R-gap    R-thred
    VoxCeleb2 Test            4.77     1.56       2.64     1.03
    In-house meetings         3.23     4.39       2.86     1.20
    In-house meetings
    (Rel. DER reduction)      0%       -116.7%    16.3%    73.7%

4.3. Speaker diarization results

We compare the performance of spectral clustering for speaker diarization using the original and refined embeddings. The results in Table 2 are consistent with those for speaker number detection. Here again, the threshold-based method with the original embeddings actually performs worse than the eigengap-based method on the out-of-domain dataset. However, the model using refined embeddings with threshold-based speaker number detection achieves a 73.7% relative DER reduction compared to the model with the original embeddings. This indicates that our system should generalize to real-world applications.

5. CONCLUSION

In this paper, we present the first use of GNNs for speaker diarization. A graph of speech segments is built for each meeting session, and a GNN model is trained to refine the speaker embeddings utilizing the structural information in the graph. Our model achieves state-of-the-art performance on a public dataset, while experiments on both simulated and real meetings show that spectral clustering based on the refined embeddings achieves much better performance than the original embeddings in terms of both speaker number detection and clustering.

6. ACKNOWLEDGEMENT

The authors would like to thank Tianyan Zhou and Yong Zhao for providing the d-vector extraction model, and Chun-Hao Chang and Zhuo Chen for helpful discussions. Rudzicz is a CIFAR Chair in AI.
7. REFERENCES

[1] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.

[2] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5329–5333.

[3] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4052–4056.

[4] Hervé Bredin, "TristouNet: Triplet loss for speaker turn embedding," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5430–5434.

[5] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.

[6] Jixuan Wang, Kuan-Chieh Wang, Marc T. Law, Frank Rudzicz, and Michael Brudno, "Centroid-based deep metric learning for speaker recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3652–3656.

[7] Tianyan Zhou, Yong Zhao, Jinyu Li, Yifan Gong, and Jian Wu, "CNN with phonetic attention for text-independent speaker verification," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 2019.

[8] Sylvain Meignier and Teva Merlin, "LIUM SpkDiarization: An open source toolkit for diarization," 2010.

[9] Stephen H. Shum, Najim Dehak, Réda Dehak, and James R. Glass, "Unsupervised methods for speaker diarization: An integrated and iterative approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.

[10] Gregory Sell and Daniel Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in 2014 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2014, pp. 413–417.

[11] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, "Speaker diarization using deep neural network embeddings," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4930–4934.

[12] Quan Wang, Carlton Downey, Li Wan, Philip Andrew Mansfield, and Ignacio Lopez Moreno, "Speaker diarization with LSTM," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5239–5243.

[13] Aonan Zhang, Quan Wang, Zhenyao Zhu, John Paisley, and Chong Wang, "Fully supervised speaker diarization," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6301–6305.

[14] Stuart Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[15] Lior Rokach and Oded Maimon, "Clustering methods," in Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, 2005.

[16] Ulrike von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[17] Brendan J. Frey and Delbert Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[18] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," Proc. Interspeech 2019, pp. 4300–4304, 2019.

[19] Thomas N. Kipf and Max Welling, "Variational graph auto-encoders," arXiv preprint arXiv:1611.07308, 2016.

[20] Muhan Zhang and Yixin Chen, "Link prediction based on graph neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 5165–5175.

[21] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich, "A review of relational machine learning for knowledge graphs," Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.

[22] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu, "A comprehensive survey on graph neural networks," arXiv preprint arXiv:1901.00596, 2019.

[23] Hongyun Cai, Vincent W. Zheng, and Kevin Chen-Chuan Chang, "A comprehensive survey of graph embedding: Problems, techniques, and applications," IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.

[24] Thomas N. Kipf and Max Welling, "Semi-supervised classification with graph convolutional networks," in 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017. OpenReview.net.

[25] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1263–1272.

[26] Qingjian Lin, Ruiqing Yin, Ming Li, Hervé Bredin, and Claude Barras, "LSTM based similarity measurement with spectral clustering for speaker diarization," Proc. Interspeech 2019, 2019.

[27] Evgeniya Ustinova and Victor Lempitsky, "Learning deep embeddings with histogram loss," in Advances in Neural Information Processing Systems, 2016, pp. 4170–4178.

[28] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[29] "callhome_diarization/v2 recipe of Kaldi," https://github.com/kaldi-asr/kaldi/tree/master/egs/callhome_diarization/v2, Accessed: 2019-10-15.

[30] "SRE16 x-vector model," http://kaldi-asr.org/models/m6, Accessed: 2019-10-15.

[31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in 4th International Conference on Learning Representations (ICLR), 2016.

[32] Matthias Fey and Jan Eric Lenssen, "Fast graph representation learning with PyTorch Geometric," arXiv preprint arXiv:1903.02428, 2019.

[33] Gregory Sell and Daniel Garcia-Romero, "Diarization resegmentation in the factor analysis subspace," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4794–4798.

[34] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan McCree, "Speaker diarization using deep neural network embeddings," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4930–4934.

[35] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech 2018, 2018, pp. 1086–1090.