Jointly optimal denoising, dereverberation, and source separation
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 1 Jointly optimal denoising, dereverberation, and source separation Tomohiro Nakatani, Senior Member, IEEE, Christoph Boeddeker, Student Member, IEEE, Keisuke Kinoshita, Senior Member, IEEE, Rintaro Ikeshita, Member, IEEE, Marc Delcroix, Senior Member, IEEE, Reinhold Haeb-Umbach, Fellow, IEEE Abstract—This paper proposes methods that can optimize a the acquired signal. For performing denoising (DN), beam- Convolutional BeamFormer (CBF) for jointly performing denois- forming techniques have been investigated for decades [1], [2], ing, dereverberation, and source separation (DN+DR+SS) in a [3], [4], and the Minimum Variance Distortionless Response computationally efficient way. Conventionally, cascade configu- ration composed of a Weighted Prediction Error minimization (MVDR) beamformer and the Minimum Power Distortionless (WPE) dereverberation filter followed by a Minimum Variance Response (MPDR) beamformer, are now widely used as the Distortionless Response (MVDR) beamformer has been used state-of-the-art techniques. For source separation (SS), a num- as the state-of-the-art frontend of far-field speech recognition, ber of blind signal processing techniques have been developed, however, overall optimality of this approach is not guaranteed. including independent component analysis [5], independent In the blind signal processing area, an approach for jointly optimizing dereverberation and source separation (DR+SS) has vector analysis [6], and spatial clustering-based beamforming been proposed, however, this approach requires huge computing [7]. For dereverberation (DR), a Weighted Prediction Error cost, and has not been extended for application to DN+DR+SS. minimization (WPE) based linear prediction technique [8], To overcome the above limitations, this paper develops new [9] and its variants [10] have been actively studied as an approaches for jointly optimizing DN+DR+SS in a computation- effective approach. With these techniques, for determining ally much more efficient way. To this end, we first present an objective function to optimize a CBF for performing DN+DR+SS the coefficients of the filtering, it is crucial to accurately based on the maximum likelihood estimation, on an assumption estimate statistics of the speech signals and the noise, such as that the steering vectors of the target signals are given or can their spatial covariances and time-varying variances. However, be estimated, e.g., using a neural network. This paper refers the estimation often becomes inaccurate when the signals to a CBF optimized by this objective function as a weighted mixed under reverberant and noisy conditions, which seriously Minimum-Power Distortionless Response (wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based degrades the performance of these techniques. on two different ways of factorizing a CBF into WPE filters To enhance the robustness of the above techniques, recently, and beamformers: one based on extension of the conventional neural network-supported microphone array speech enhance- joint optimization approach proposed for DR+SS and the other ment has been actively studied, and showed its effectiveness based on a novel technique. Experiments using noisy reverberant for denoising [11], dereverberation [12], and source separation sound mixtures show that the proposed optimization approaches greatly improve the performance of the speech enhancement in [13], [14]. With this approach, estimation of the statistics of comparison with the conventional cascade configuration in terms the signals and noise, such as Time-Frequency (TF) masks and of the signal distortion measures and ASR performance. It is time-varying variances, is conducted by neural networks [13], also shown that the proposed approaches can greatly reduce the [15], [16], [17], while speech enhancement is performed by computing cost with improved estimation accuracy in comparison the microphone array signal processing. This combination is with the conventional joint optimization approach. particularly effective because neural networks can very well Index Terms—Beamforming, dereverberation, source separa- capture spectral patterns of signals over wide TF ranges, and tion, microphone array, automatic speech recognition, maximum can reliably estimate such statistics of the signals, which was likelihood estimation not well handled by the conventional signal processing. On the other hand, neural networks often introduce nonlinear I. I NTRODUCTION distortions into the processed signal, which are harmful to When a speech signal is captured by distant microphones, perceived speech quality and ASR, while it can be well e.g., in a conference room, it often contains reverberation, dif- avoided by microphone array techniques. A number of articles fuse noise, and extraneous speakers’ voices. These components have reported the usefulness of this combination, particularly are detrimental to the intelligibility of the captured speech for far-field ASR, e.g., at the REVERB challenge [18] and the and often cause serious degradation in many applications CHiME-3/4/5 challenges [19], [20]. such as hands-free teleconferencing and Automatic Speech Despite the success of the neural network-supported mi- Recognition (ASR). crophone array speech enhancement, it is still not yet well Microphone array speech enhancement has been extensively investigated how to optimally combine individual microphone studied to minimize the aforementioned detrimental effects in array techniques for performing denoising, dereverberation, and source separation (DN+DR+SS) at the same time in a T. Nakatani, K. Kinoshita, R. Ikeshita, and M. Delcroix are with NTT Corporation. C. Boeddeker and R. Haeb-Umbach are with Paderborn Univ. computationally efficient way. For example, for denoising and Manuscript received January 1, 2020; revised XXXX XX, 2020. dereverberation (DN+DR), cascade configuration of a WPE
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 2 filter followed by a MVDR/MPDR beamformer has been estimation of the steering vectors. While both approaches widely used as the state-of-the-art frontend, e.g., at the far- work comparably well in terms of the estimation accuracy, field ASR challenges [18], [19], [20], [21]. However, the WPE the source-wise factorization has advantages in terms of filter and the beamformer are separately optimized, and the computational efficiency. An additional benefit of the source- overall optimality of this approach is not guaranteed. In order wise factorization is that it can be used, without loss of to perform DN+DR in an optimal way, several techniques optimality, for extraction of a single target source from a sound have been proposed using a Kalman filter [22], [23], [24]. A mixture, which is now an important application area of speech technique, called Integrated Sidelobe Cancellation and Linear enhancement [13], [29]. Prediction (ISCLP) [24], optimizes an integrated filter that Experiments based on noisy reverberant sound mixtures can cancel noise and reverberation from the observed signal created using the REVERB Challenge dataset [18] show that using a sidelobe cancellation framework. With this technique, the proposed optimization approaches substantially improve however, a steering vector of the target signal needs to be the performance of DN+DR+SS in comparison to the con- estimated in advance directly from noisy reverberant speech, ventional cascade configuration in terms of ASR performance which is a challenging problem, and thus limits the overall and reduction of signal distortion. It is also shown that the estimation accuracy. In the blind signal processing area, on two proposed approaches can greatly reduce the computing the other hand, a technique to jointly optimize a pair of a cost with improved estimation accuracy in comparison with WPE filter followed by a beamformer has been proposed the conventional joint optimization approach. for dereverberation and source separation (DR+SS) under Certain parts of this paper have already been presented in noiseless conditions [25], [26], [27]. One advantage of this our recent conference papers. In [30], the ML formulation for approach is that we can access multichannel dereverberated optimizing a CBF was derived for DN+DR. In [31], it was signals obtained as the output of the WPE filter during the shown that a CBF for DN+DR can be factorized into a WPE optimization, and utilize them to reliably estimate the beam- filter and a wMPDR (non-convolutional) beamformer, and former. However, this approach requires 1) huge computing jointly optimized without loss of optimality. Furthermore, [32] cost for the optimization, and 2) has not been extended for presented ways to reliably estimate TF masks for DN+DR+SS. application to DN+DR+SS. This paper integrates these techniques to perform DN+DR+SS To overcome the above limitations, this paper develops in a computationally efficient way. algorithms for optimizing a Convolutional BeamFormer (CBF) In the remainder of this paper, the model of the observed that can perform DN+DR+SS in a computationally much more signal and that of the CBF are defined in Section II. Then, efficient way. A CBF is a filter that is applied to a multichannel Section III presents the proposed optimization methods. Sec- observed signal to yield the desired output signals. For the tion IV summarizes characteristics and advantages of the optimization of a CBF, this paper first presents a common proposed methods. In Sections V and VI, experimental results objective function based on the Maximum Likelihood (ML) and concluding remarks are described, respectively. criterion, on an assumption that the steering vectors of the desired signals are given, or can be estimated. This paper II. M ODELS OF SIGNAL AND BEAMFORMER refers to a CBF optimized by this objective function as a weighted MPDR (wMPDR) CBF. Then, showing that a CBF This paper assumes that I source signals are captured by can be factorized into WPE filter(s) and beamformer(s) in M (≥ I) microphones in a noisy reverberant environment. The two different ways, we derive two different algorithms for captured signal at each TF point in the short-time Fourier optimizing the wMPDR CBF, based on the ways of the CBF transform (STFT) domain is modeled by factorization. The first approach, referred to as a source- I packed factorization, is an extension of the conventional joint (i) X xt,f = xt,f + nt,f , (1) optimization technique proposed for DR+SS [25], [26], [27]. i=1 In this paper, we first show that the direct application of this (i) (i) (i) xt,f = dt,f + rt,f , (2) approach to DN+DR+SS has serious problems in terms of the computational efficiency and the estimation accuracy, and where t and f are time and frequency indices, respec- then present its extension for solving the problems. The second tively, xt,f = [x1,t,f , . . . , xM,t,f ]> ∈ CM ×1 is a column approach, referred to as a source-wise factorization, is based vector containing all microphone signals at a TF point. on a novel factorization technique. It factorizes a CBF into a (i) Here, (·)> denotes the non-conjugate transpose. xt,f = set of sub-filter pairs, each of which is composed of a WPE (i) (i) [x1,t,f , . . . , xM,t,f ]> is a (noiseless) reverberant signal cor- filter and a beamformer, and aimed at estimating each source responding to the ith source, and nt,f = [n1,t,f , . . . , nM,t,f ]> independently. For both approaches, we also present a method (i) for robustly estimating the steering vectors of the desired is the additive diffuse noise. xt,f for each source in Eq. (1) is signals during the optimization of the wMPDR CBF using further decomposed into two parts in Eq. (2), one consisting the output of the WPE filters. A neural network supported of the direct signal and early reflections, referred to as a (i) TF-mask estimation technique is also incorporated1 for the desired signal dt,f , and the other corresponding to the late (i) reverberation rt,f . Hereafter, the frequency indices of the 1 It is worth noting that the proposed techniques can also be applied to the symbols are omitted for brevity, on an assumption that each conventional blind signal processing for DR+SS as discussed in [28]. frequency bin is processed independently in the same way.
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 3 Convoluonal Mulple- Beamformer beamformer target matrix for for dereverberaon, linear separaon denoising, and predicon and source separaon (LP) denoising (a) A MIMO CBF (b) A MIMO CBF with source-packed factorization (1) Convoluonal (1) Single-target Beamformer (1) beamformer for = 1 LP for = 1 for = 1 … … … … … ( ) Convoluonal ( ) Single-target Beamformer ( ) beamformer for = LP for = for = (c) A set of MISO CBFs (d) MISO CBFs with source-wise factorization Fig. 1. A Multi-Input Multi-Output (MIMO) CBF and its three different implementations. They are equivalent to each other in the sense that whatever values are set to coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize the same input-output relationship. Thus, the optimal solutions of all the implementations are identical as long as they are optimized based on the same objective function. (i) (i) (i) (i) In this paper, the goal of DN+DR+SS is to estimate dt where aτ = [a1,τ , . . . , aM,τ ]> for τ ∈ {∆, . . . , La − 1} (i) for each source i from xt in Eq. (1) by reducing rt of the are the convolutional acoustic transfer functions, and ∆ is the (i0 ) mixing time, representing the relative frame delay of the late source i, xt of all the other sources i0 6= i, and the diffuse noise nt . It is known that in noisy reverberant environments reverberation start time to the direct signal. (i) the early reflections enhance the intelligibility of speech for In this paper, we further assume that dt is statistically 2 human perception [33] and improve the ASR performance by independent of the following variables: computer [34], and thus we include them in the desired signal. • (i) st0 for t0 ≤ t − ∆ (and thus dt (i) is statistically Hereafter, this paper uses m = 1 as the reference microphone, (i) independent of xt0 for t0 ≤ t − ∆), (i) and describes a method for estimating the desired signal, d1,t , (i) • rt00 for t00 ≤ t, at the microphone without loss of generality. (i0 ) (i) • xt0 and nt0 for all t, t0 and i0 6= i. To achieve the above goal, we further model dt as These assumptions are used to derive the optimization algo- (i) (i) (i) dt = v(i) st = ṽ(i) d1,t , (3) rithms described in the following. (i) where st is the ith clean speech at a TF point. In Eq. (3), the (i) (i) A. Definition of a CBF and its three different implementations desired signal of the ith source, dt , is modeled by v(i) st , i.e., a product in the STFT domain of the clean speech with We now define a CBF, which will be later factorized into a transfer function v(i) , hereafter referred to as a steering WPE filter(s) and beamformer(s), as vector, assuming that the duration of the impulse response corresponding to the direct signal and early reflections in L−1 X the time domain is sufficiently short in comparison with the yt = W0H xt + WτH xt−τ , (6) analysis window [35]. Then, the desired signal is further τ =∆ (i) rewritten as ṽ(i) d1,t , i.e., a product of the desired signal at the (1) (I) (i) (i) (i) where yt = [yt , . . . , yt ]> ∈ CI×1 is the output of the CBF reference microphone d1,t = v1 st with a Relative Transfer corresponding to estimates of I desired signals, Wτ ∈ CM ×I Function (RTF) [36] which is defined as the steering vector for each τ ∈ {0, ∆, ∆+1, . . . , L−1} is a matrix composed of divided by its reference microphone element, namely the beamformer coefficients, (·)H denotes conjugate transpose, (i) and ∆ is the prediction delay of the CBF. In this paper, ṽ(i) = v(i) /v1 . (4) we set ∆ equal to the mixing time introduced in Eq. (5), In contrast, assuming that the duration of the late reverberation so that the desired signals are included only in the first in the time domain is larger than the analysis window, the late term of Eq. (6), and that they are statistically independent (i) reverberation rt is modeled by a convolution in the STFT of the second term according to the assumptions introduced domain [37] of the clean speech with a time series of acoustic in the signal model. Then, this paper performs DN+DR+SS transfer functions corresponding to the late reverberation as by estimating beamformer coefficients that can estimate the desired signals included in the first term of Eq. (6). a −1 LX (i) (i) rt = a(i) τ st−τ , (5) 2 See [8] for more precise discussion on the statistical independence between (i) τ =∆ dt and st0 for t0 ≤ t − ∆.
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 4 For the sake of notational simplicity, we also introduce a W, and are used to extract the ith desired signal. Then, Eq. (7) matrix representation of a CBF as can be rewritten for each source i as H " #H W0 xt (i) yt = , (7) (i) w0 xt W xt yt = . (14) w(i) xt where W is a matrix containing Wτ for ∆ ≤ τ ≤ L − 1 and For example, MISO CBFs were used in [30], [39]. ISCLP xt is a column vector containing past multichannel observed proposed in [24] can also be viewed as a realization of a MISO signals xt−τ for ∆ ≤ τ ≤ L − 1, respectively, defined as CBF using a sidelobe cancellation framework [40]. > > > 3) Source-wise factorization: With the source-wise factor- W = W∆ , . . . , WL−1 ∈ CM (L−∆)×I , (8) > > ization shown in Fig. 1 (d), we further factorize each MISO > M (L−∆)×1 xt = xt−∆ , . . . , xt−L+1 ∈ C . (9) CBF defined in Eq. (14) for a source i as " # " # Hereafter, we refer to the CBF defined by Eqs. (6) and (7) as (i) IM w0 a MIMO CBF. = (i) q(i) , (15) w(i) −G In the following, we further present three different imple- mentations of the CBF, including the two ways of factorizing (i) where q(i) ∈ CM ×1 and G ∈ CM (L−∆)×M . Then, Eq. (14) the CBF. Figure 1 illustrates the MIMO CBF and its three can be rewritten as a pair of a linear prediction filter followed different implementations. by a beamformer, respectively, defined as 1) Source-packed factorization: With the implementation (i) H shown in Fig. 1 (b), we directly factorize3 the MIMO CBF in (i) zt = x t − G xt , (16) Eq. (7) as H (i) (i) yt = q(i) zt , W0 IM (17) = Q, (10) W −G (i) (i) where zt ∈ CM ×1 and G are the output and the prediction where Q ∈ CM ×I , G ∈ CM (L−∆)×M , and IM ∈ RM ×M is matrix of the linear prediction, and q(i) is the coefficient vector an identity matrix. Then, Eq. (6) can be rewritten as a pair of of the beamformer. Because Eq. (16) is performed only for a (convolutional) linear prediction filter followed by a (non- estimation of the ith source, it is referred to as a single-target convolutional) beamformer matrix, respectively, defined as linear prediction. H zt = xt − G xt , (11) 4) Relationship between two factorization approaches: The H yt = Q zt . (12) difference between the two factorization approaches, namely Fig. 1 (b) and (d), is based only on the way to perform linear Here, zt ∈ CM ×1 and G are the output and the prediction prediction, Eq. (11) or Eq. (16), and more specifically based (i) matrix of the linear prediction, and Q is the coefficient matrix on whether the prediction matrices, G and G , are common of the beamformer. Eq. (11) is supposed to dereverberate all to all the sources or different over different sources. According the sources at the same time, and thus referred to as a multiple- to this, different optimization algorithms with different char- target linear prediction, while Eq. (12) is supposed to perform acteristics are derived, as will be shown in Section III. In denoising and source separation at the same time. Because contrast, the beamformer parts, Q and q(i) in Eqs. (12) and the individual sources are not distinguished in the output of (17) are identical in the two approaches, viewing q(i) as the the WPE filter, this implementation is called source-packed ith column of Q, because they satisfy W0 = Q in Eq. (10) factorization. (i) and w0 = q(i) in Eq. (15). One example of the source-packed factorization is the In addition, it should be noted that all the above implemen- cascade configuration composed of a WPE filter followed by tations of a CBF are equivalent to each other in the sense a beamformer, which has been widely used for DN+DR+SS that whatever values are set to coefficients of one imple- in the far-field speech recognition area [14], [20], [38], and mentation, certain coefficients of the other implementations the other example is one used in the joint optimization of a can be determined such that they realize the same input- WPE filter and a beamformer, which has been investigated for output relationship. Thus, the optimal solutions of all the DR+SS in the blind signal processing area [25], [26], [27]. implementations are identical as long as they are based on 2) Multi-Input Single-Output (MISO) CBF: Next, we define the same objective function. a set of MISO CBFs shown in Fig. 1 (c). They are obtained by decomposing the beamformer coefficients in Eq. (7) as III. ML ESTIMATION OF CBF " (1) (2) (I) # W0 w0 w0 . . . w0 In this section, two different optimization algorithms are = , (13) W w(1) w(2) . . . w(I) derived, respectively, using (b) source-packed factorization and (i) (d) source-wise factorization. For the derivation, we assume where w0 ∈ CM ×1 and w(i) ∈ CM (L−∆)×1 are column that the RTFs ṽ(i) and the time-varying variances of the output vectors, respectively, containing the ith columns of W0 and (i) signals yielded by the optimal CBF, denoted by λt , are given. (i) 3 The existence of G that satisfies W = −GQ is guaranteed for any W Later in Section III-E, we describe a way for estimating λt when M ≥ I and rank{Q} = I. jointly with the CBF coefficients based on the ML criterion,
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 5 and a way for estimating ṽ(i) based on the output of the WPE noise remaining in the output of the CBF, respectively, written filter obtained at a step of the optimization. using the MISO CBF form as " #H " # (i) (i) (i) w0 rt r̂t = (i) , (22) A. Probabilistic model w(i) xt #H " (i0 ) " # (i) First, we formulate the objective function for DN+DR+SS (i0 ) w0 xt x̂t = (i0 ) , (23) by reinterpreting the objective function proposed for DN+DR w(i) xt in [30]. For this formulation, we interpret DN+DR+SS to " #H (i) be composed of a set of separate processing steps, each of w0 nt n̂t = , (24) which applies DN+DR to enhance a source i by reducing w(i) nt the late reverberation of the source (DR) and by reducing the additive noise including the other sources and the diffuse noise where nt = [n> > > t−∆ , . . . , nt−L+1 ] . According to the statistical (i) (DN). With this interpretation, we introduce the following independence assumptions introduced in Section II, d1,t is sta- (i) (i0 ) assumptions similar to [30]. tistically independent of r̂t , x̂t , and n̂t . Then, substituting (i) Eq. (21) in Eq. (19) and omitting constant terms, we obtain • The output of the optimal CBF for each i, namely yt , in the expectation sense follows zero-mean complex Gaussian distribution with a 2 2 (i) (i) (i) P (i0 ) time-varying variance λt = E yt [8]. T E r̂t + i0 6=i tx̂ + n̂ t n o 1X • The beamformer satisfies a distortionless constraint for E Li (θ(i) ) = (i) . T t=1 λt each source i defined using the RTF ṽ(i) in Eq. (4) as (25) H H (i) (i) (i) (i) The above equation indicates that minimization of0 the objec- w0 ṽ = 1 or q ṽ = 1 . (18) (i) (i ) tive function indeed minimizes the sum of r̂t , x̂t for i0 6= i, and n̂t in Eq. (21). Then, according to the discussion in [30], we can approxi- Before deriving the optimization algorithms, we define a mately derive the objective function to minimize for estimating matrix that is frequently used in the derivation, referred to (i) the CBF coefficients for source i, e.g., θ(i) = {w0 , w(i) }, as a variance-normalized spatio-temporal covariance matrix. based on the ML estimation as Letting xt be a column vector composed of the current and 2 past observed signals at all the microphones, defined as T (i) 1 y H > > t xt = x> ∈ CM (L−∆+1)×1 , t , xt (26) (i) (i) X Li (θ(i) ) = (i) + log λt s.t. w0 ṽ(i) = 1. T t=1 λt the matrix is defined as (19) T 1 X xt xH Rx(i) = t ∈ CM (L−∆+1)×M (L−∆+1) . (27) T t=1 λ(i) The objective function for estimating all the sources can then t be obtained by summing Eq. (19) over all the sources as A factorized form of the matrix is also defined as H (i) (i) I x R P x R(i) H x = X (i) , (28) L (Θ) = Li (θ(i) ), s.t. w0 ṽ(i) = 1 for all i, (20) (i) (i) Px Rx i=1 where where Θ = θ(1) , . . . , θ(I) . This objective function is used T 1 X xt xH commonly for all the implementations of a CBF. In this paper, Rx(i) = t ∈ CM ×M , (29) we call a CBF optimized by the above objective function as T t=1 λ(i) t weighted MPDR (wMPDR) CBF because it minimizes the T 1 X xt xHt (i) average power of the output yt weighted by the time-varying Px(i) = ∈ CM (L−∆)×M , (30) (i) T t=1 λ(i) t variance, λt , of the signal. T Here, let us briefly explain how DN+DR+SS is performed (i) 1 X xt xHt Rx = ∈ CM (L−∆)×M (L−∆) . (31) by Eqs. (19) and (20). Substituting Eqs. (1) and (2) in Eq. (14) T t=1 λ(i) t and using the model of the desired signal in Eq. (3) and the distortionless constraint in Eq. (18), we obtain B. Optimization based on source-packed factorization This subsection discusses methods for optimizing a CBF (i) (i) (i) (i0 ) X yt = d1,t + r̂t + x̂t + n̂t , (21) with the source-packed factorization. In the following, after i0 6=i describing a method for directly applying the conventional joint optimization technique used for DR+SS to DN+DR+SS, (i) (i0 ) where r̂t , x̂t for i0 6= i, and n̂t are late reverberation of we summarize the problems in it, and present the solutions to the ith source, all the other sources, and the additive diffuse the problems.
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 6 (i) 1) Direct application of conventional technique: With the where Rz is a variance-normalized spatial covariance matrix source-packed factorization in Eqs. (11) and (12), it is difficult of the output of the multiple-target WPE filter, calculated as to estimate both Q and G at the same time in closed form even T (i) when λt and ṽ(i) are all given. Instead, we use an iterative 1 X zt zHt R(i) z = . (41) and alternate estimation scheme, following the idea from the T t=1 λ(i) t blind signal processing technique [25], [26], [27], where at each estimation step, one of Q and G is updated while fixing Then, q(i) that minimizes Eq. (40) under the distortionless H the other. constraint q(i) ṽ(i) = 1 can be obtained as For updating G, we fix Q at its previously estimated value. −1 (i) For the derivation of the algorithm, the representation of linear Rz ṽ(i) prediction in Eq. (11) is slightly modified as q(i) = H −1 . (42) (i) ṽ(i) Rz ṽ(i) zt = xt − Xt g, (32) Because the above beamformer minimizes the average power where Xt and g are equivalent to xt and G with modified of zt weighted by the time-varying variance, we call this as matrix structure defined as weighted MPDR (wMPDR) beamformer4 . As will be shown 2 in Section III-C, a wMPDR beamformer is a special case of X t = IM ⊗ x >t ∈C M ×M (L−∆) , (33) a wMPDR CBF. A wMPDR CBF is reduced to a wMPDR H 2 g = g1 , . . . , g> > ∈ CM (L−∆)×1 , beamformer when setting the length of the CBF L = 1, i.e., M (34) by just making it a non-convolutional beamformer. where ⊗ is Kronecker product, and gm is the mth column of The above algorithm, however, has two serious problems. G. Then, considering that the CBF in Eqs. (11) and (12) can Firstly, the size of the covariance matrix in Eq. (38) is very (i) H be written as yt = q(i) xt − Xt g and omitting normal- large, requiring huge computing cost for calculating it and ization terms, the objective function in Eq. (20) becomes its inverse. Secondly, as will be shown in the experiments, T the iterative and alternate estimation of Q and G tends to 1X 2 converge to a sub-optimal point. This is probably because the Lg (g) = x t − Xt g Φq,t , (35) T t=1 update of G is performed based only on the output of the fixed beamformer in the iterative and alternate estimation, as 2 where kxkR = xH Rx, and Φq,t is a semi-definite Hermitian in Eq. (19), while the signal dimension of the beamformer matrix defined as output, i.e., I, is reduced from that of the original signal I H space, i.e., M , with the over-determined case, i.e., I < M . X q(i) q(i) As a consequence, signal components that are relevant for Φq,t = (i) ∈ CM ×M . (36) i=1 λt the update of G may be reduced in the beamformer output, especially when the estimation of Q is not accurate at the Because Eq. (35) is a quadratic form with a lower bound, g early stage of the optimization. This can seriously degrade the that minimizes it can be obtained as update of G. g = Ψ+ ψ, (37) 2) Proposed extension: Here, we present two techniques 1X H 2 2 to mitigate the above problems within the source-packed Ψ= Xt Φq,t Xt ∈ CM (L−∆)×M (L−∆) , (38) factorization approach. The first one is used to reduce the T t computing cost. As shown in Appendix A, Eqs. (38) and (39) 1X H 2 can be rewritten, using Eq. (28), as ψ= Xt Φq,t xt ∈ CM (L−∆)×1 , (39) T t I X H (i) > Ψ= q(i) q(i) ⊗ Rx , (43) where (·)+ is the Moore-Penrose pseudo-inverse. Because i=1 the rank of Ψ is equal to or smaller than M I(L − ∆) as I ∗ will be shown in Section III-B2, Ψ is rank deficient for X ψ= q(i) ⊗ P(i) x q (i) , (44) over-determined cases, namely when M > I, and thus the i=1 use of the pseudo-inverse is indispensable. Eqs. (37) to (39) are equivalent to those used in the dereverberation step for where ()∗ denotes complex conjugate. In the above equations, (i) DR+SS [25], [26], [27] except that in this paper denoising is the majority of the calculation is coming from that of Rx . additionally included in the objective and that over-determined Because the size of the matrix is much smaller than that of cases are also considered. This filter is referred to as the Ψ, we can greatly reduce the computing cost with this mod- multiple-target WPE filter in this paper. ification5 in comparison with direct calculation of Eqs. (38) For the update of Q, fixing g at its previously estimated 4 A wMPDR beamformer is also called a Maximum-Likelihood Distortion- value, the objective in Eq. (20) can be rewritten as less Response (MLDR) beamformer in [41]. 5 In general, computational complexity of a matrix multiplication is more I 2 H than O(n2 ). Because the size of Ψ is M -times larger than Rx , it is suggested X LQ (Q) = q(i) (i) s.t. q(i) ṽ(i) = 1, (40) that the computational complexity for calculating Ψ is at least M 2 times Rz i=1 larger than that for calculating Rx .
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 7 and (39). Although we still need to calculate the inverse of where R(i) is the covariance matrix defined in Eq. (28), and h x > i> the huge matrix Ψ even with this modification, the cost is (i) v = ṽ (i) , 0, . . . , 0 ∈ CM (L−∆+1)×1 corresponds to relatively small in comparison with the direct calculation of Ψ. (Note that Eq. (43) also shows the rank of Ψ to be equal the RTF ṽ(i) with zero padding. Finally, the solution is given to or smaller than M I(L − ∆).) as −1 The second technique introduces a heuristic to improve R(i) v(i) x the update of the WPE filter. In order to use a whole M - w(i) = . (51) H −1 dimensional signal space to be considered for the update, v(i) R(i) v (i) x we modify the CBF to output not only I desired signals, but also M − I auxiliary signals included in the orthogonal The above equation gives the simplest form of the solution complement Q⊥ of Q, and model the auxiliary signals as zero- to a wMPDR CBF. It clearly shows that a wMPDR CBF mean time-varying complex Gaussians. With this modification, is a general case of a wMPDR beamformer. By setting the optimization is performed by calculating the summation L = 1 in the above solution, namely by letting it be a in Eqs. (43) and (44) over not only 1 ≤ i ≤ I but also non-convolutional beamformer, it reduces to the solution of I < i ≤ M , letting q(I+1) , . . . , q(M ) be the orthonormal a wMPDR beamformer in Eq. (42). bases for the orthogonal complement Q⊥ . Because it is not An advantage of the solution using the MISO CBFs is that (i) it can be obtained by a closed form equation provided the important to distinguish the variances λt of the auxiliary signals, we use the same value for them calculated as RTFs and the time-varying variances of the desired signals M are given, and we do not need to consider the interaction 1 H 2 between DN and DR. With this approach, however, it is X λ⊥ t = q(i) zt , (45) M −I necessary to estimate the RTFs directly from a reverberant i=I+1 observation similar to [24]. A solution to this problem is to ⊥ and calculate P⊥ x and Rx based on Eqs. (30) and (31) use dereverberation preprocessing based on a WPE filter for accordingly. In summary, we can implement this modification the RTF estimation. It was shown in [30] that the output of a by adding the following terms, respectively, to Ψ and ψ in WPE filter can be obtained in a computationally efficient way Eqs. (43) and (44). within the framework of this approach. However, the source- M H ! ⊥ > wise factorization approach described in the following can X Ψ⊥ = q(i) q(i) ⊗ Rx , (46) more naturally solve this problem. So, this paper adopts it i=I+1 as the solution. M X ∗ ψ⊥ = q(i) ⊗ P⊥ xq (i) . (47) D. Optimization based on source-wise factorization i=i+1 With the source-wise factorization, similar to the case with the direct optimization of the MISO CBFs, the optimization C. Direct optimization of MISO CBFs can be performed separately for each source, and the resultant Before deriving the optimization with the source-wise fac- algorithm is identical to that proposed for DN+DR in [31]. torization, we show that we can directly optimize the MISO Considering that a CBF can be written based on Eqs. (16) CBFs in Eq. (14), and summarize its characteristics. With this H (i) H (i) setting, the CBFs and the objective function are both defined and (17) as yt = q(i) xt − G xt and using the separately for each source in Eqs. (14) and (19), and thus, factorized form of R(i) x in Eq. (28), the objective function in the optimization can be performed separately for each source. Eq. (19) can be rewritten as The resultant algorithm is, therefore, identical to that proposed (i) (i) −1 2 for DN+DR in [42], where this type of CBF is also referred (i) Li G , q(i) = G − Rx P(i) x q (i) to as a Weighted Power minimization Distortionless response (i) Rx (WPD) CBF. 2 For presenting the solution, we introduce the following + q(i) (i) (i) H (i) −1 (i) . (52) Rx − Px Rx Px vector representation of Eq. (14). H (i) (i) In the above objective function, G is contained only in the yt = w(i) xt , (48) first term, and the term can be minimized, not depending on (i) where w(i) is defined as the value of q(i) , when G takes the following value. " # (i) (i) −1 (i) (i) w0 G = Rx P(i) x . (53) w = , (49) w(i) (i) So, this is a solution6 of G that globally minimizes the (i) (i) (i) Then, when λt and ṽ are given, Eq. (19) becomes a simple objective function given the time-varing variance λt . Inter- constraint quadratic form as estingly, this solution is identical to that of the conventional 2 H Lw (w(i) ) = w(i) (i) s.t. w(i) v(i) = 1, (50) 6 This is not a unique solution. The first term is minimized even when an Rx arbitrary matrix, of which null space includes q(i) , is added to Eq. (53).
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 8 Algorithm 1: Source-packed factorization-based opti- Algorithm 2: Source-wise factorization-based opti- mization for estimation of all the sources. mization for estimation of the ith source Data: Observed signal xt for all t Data: Observed signal xt for all t (i) (i) TF masks γt for all t and 1 ≤ i ≤ I TF masks γt for all t (i) (i) Result: Estimated sources yt for all t and 1 ≤ i ≤ I Result: Estimated ith source yt for all t (i) 2 (i) 2 1 Initialize λt as ||xt ||I /M for all t and 1 ≤ i ≤ I 1 Initialize λt as ||xt ||I /M for all t M M (i) 2 Initialize q as the ith column of IM for 1 ≤ i ≤ I 2 repeat 3 Initialize zt as xt for all t (i) PT x xH 3 Rx ← T1 t=1 t(i)t λ 4 repeat PT xttxHt (i) 1 (i) PT x xH 4 Px ← T t=1 (i) 5 Rx ← T1 t=1 t(i)t for 1 ≤ i ≤ I (i) + λt λ (i) 1 PT xttxHt 5 (i) G ← Rx Px (i) 6 Px ← T t=1 (i) for 1 ≤ i ≤ I λt (i) (i) H PI H (i) > 6 zt ← x t − G xt 7 Ψ ← i=1 q(i) q(i) ⊗ Rx (i) (i) PI (i) ∗ 7 Estimate ṽ(i) based on zt and γt 8 ψ ← i=1 q(i) ⊗ Px q(i) PT z(i) (i) H zt (i) t 9 Begin Add orthogonal complement beamformer 8 Ŕz ← T1 t=1 (i) λt + 10 Set q(I+1) , . . . , q(M ) be orthonormal bases for 9 q (i) ← (Ŕ(i) z ) ṽ (i) + the orthogonal complement Q⊥ of Q H (i) (ṽ(i) ) Ŕz ṽ(i) 2 (i) H (i) (i) H (i) PM λ⊥ 1 yt ← q zt t ← M −I i=I+1 q zt 11 10 2 ⊥ PT x xH (i) (i) 12 Rx ← T1 t=1 λt ⊥t 11 λt ← yt PT xt xt Ht 12 until convergence 13 P⊥x ← T 1 t=1 ⊥ P λt H ⊥ > M (i) 14 Ψ←Ψ+ i=I+1 q q(i) ⊗ Rx (i) ⊥ (i) ∗ 15 PM ψ ← ψ + i=I+1 q ⊗ Px q (i) where Ŕz is a variance-normalized covariance matrix of the output of the single-target WPE filter, calculated as 16 End H 17 g ← Ψ+ ψ T z(i) z(i) zt ← x t − X t g 1 X t t 18 Ŕz(i) = (i) ∈ CM ×M . (55) 19 (i) Estimate ṽ(i) based on zt and γt for 1 ≤ i ≤ I T t=1 λt H (i) T Rz ← T1 t=1 zt (z(i)t ) for 1 ≤ i ≤ I P 20 λt Then, the solution can be obtained, under the distortionless ( R(i) ) + (i) ṽ constraint, as a wMPDR beamformer defined by 21 q(i) ← H z for 1 ≤ i ≤ I (i) + (i) −1 ( ) ṽ(i) Rz ṽ (i) (i) H Ŕz ṽ(i) 22 yt ← q(i) zt for 1 ≤ i ≤ I (i) q = . (56) 2 H (i) −1 23 (i) λt ← yt (i) for 1 ≤ i ≤ I ṽ(i) Ŕz ṽ(i) 24 until convergence Eqs. (54) to (56) are very similar to Eqs. (40) to (42), and the difference is whether the dereverberation is performed by multiple-target WPE filter or single-target WPE filter. WPE dereverberation. This means that the WPE filter op- With the source-wise factorization, the solution can be (i) timized solely for dereverberation can perform the optimal obtained in closed form when λt and ṽ(i) are given, similar dereverberation for the joint optimization not depending on to the case with the direct optimization of the MISO CBFs. (i) the subsequent beamforming, provided the time-varying vari- In addition, the output of the WPE filter is obtained as zt ance of the desired source is given for the optimization. In in Eq. (16), and can be efficiently used for estimation of the addition, unlike the source-packed factorization approach, this RTFs. Furthermore, the size of the temporal-spatial covariance approach does not need to compensate for the dimensionality matrix in Eq. (31) is much smaller than that in Eq. (38) of the (i) source-packed factorization, and thus the computational cost reduction of the beamformer output for the update of G can be reduced. (See Section IV for more detailed discussion because it considers a whole signal space without adding any (i) on the computing cost.) modification. We refer to this filter G as a single-target WPE filter in this paper. (i) (i) Once G is obtained as the above solution, the objective E. Processing flow with estimation of λt and ṽ(i) function in Eq. (19) can be rewritten as This subsection describes examples of processing flows, shown in Algorithms 1 and 2, for optimizing a CBF based 2 H Li q(i) = q(i) s.t. q(i) ṽ(i) = 1, (54) on source-packed factorization and source-wise factorization, (i) (i) Ŕz including the estimation of the time-varying variances, λt ,
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 9 WPE for wMPDR for similar to one used by a fully-Convolutional Time-domain Audio Separation Network (Conv-TasNet) [44]. The network Dereverb Beamform was trained so that it receives the output of the WPE filter that is obtained at the first iteration in the iterative optimization of Es mate Es mate the CBF, and estimates the TF masks of the desired signals. The input of the network was set as concatenation of the real and imaginary parts of STFT coefficients, and the loss Es mate Es mate function was set as the (scale-dependent) signal-to-distortion ratio (SDR) of the enhanced signal obtained by multiplying the TF masks estimated masks to the observed signal. For the training and For 1st me For subsequent mes validation data, we synthesized mixtures using two utterances randomly extracted from the WSJ-CAM0 corpus [45] and two Fig. 2. Processing flow of source-wise factorization-based CBF for estimating room impulse responses and background noise extracted from a source i. the REVERB Challenge training set [18]. For the estimation of the RTFs, ṽ(i) , we adopt a method and the RTFs, ṽ(i) . Hereafter, we refer to both algorithms as based on eigenvalue decomposition with noise covariance A-1 and A-2 for brevity. While A-1 estimates all the sources, whitening [46], [47]. With this technique, the steering vector (i) yt for all i, at the same time from the observed signal xt , v(i) is first estimated as (i) A-2 estimates only one of the sources, yt for a certain i, and v(i) = R\i MaxEig R−1 \i Ri , (57) (if necessary) is repeatedly applied to the observed signal to estimate all the sources one after another. As auxiliary inputs, where MaxEig(·) is a function that calculates the eigenvector (i) TF masks are provided for both algorithms. A TF mask γt is corresponding to the maximum eigenvalue, Ri and R\i are a associated with a source and a TF point, takes a value between spatial covariance matrix of the i-th desired signal and that of 0 and 1, and indicates whether the desired signal of the source the other signals, respectively, estimated as (i) (i) dominates the TF point (γt = 1) or not (γt = 0). The TF P (i) (i) (i) H masks over all the TF points are used to estimate the RTF(s) of t γ t zt zt the desired signal(s) in line 19 of A-1 and line 7 of A-2. (See Ri = P (i) , (58) Section III-E1 for the detail of estimation of the TF masks t γt P H and the RTFs.) (i) (i) (i) (i) t 1 − γt zt zt Both algorithms estimate the time-varying variances λt R\i = P . (59) (i) based on the same objective as that for the CBF, defined in t 1 − γt Eq. (19). Because a closed form solution to the estimation Then, the RTF is obtained by Eq. (4). of the CBF and the time-varying variances is not known, an iterative and alternate optimization scheme is introduced to both algorithms. In each iteration, the time-varying variances, IV. D ISCUSSION (i) λt , are updated in line 23 of A-1 and line 11 of A-2 as In summary, the proposed techniques can optimize a CBF the power of the previously estimated values of the desired for jointly performing DN+DR+SS with greatly reduced com- (i) (i) signal yt , and then the CBF and the desired signal yt are puting cost in comparison with the direct application of updated while fixing the time-varying variances. The iteration the conventional joint optimization technique proposed for is repeated until convergence is obtained. DR+SS to DN+DR+SS. With the conventional technique, it is The optimization methods described in Sections III-B and necessary to calculate the huge covariance matrix Ψ in order III-D are used in respective algorithms for update of the CBF to take into account the dependency of G on Q inherently and the desired signal(s). The WPE filter is first estimated in introduced in the source-packed factorization. This makes lines 5 to 17 of A-1 and lines 3 to 5 of A-2, and applied in the computing cost of the conventional technique extremely line 18 of A-1 and line 6 of A-2. After the RTF(s) is updated high. In contrast, the proposed extension of the source-packed using the dereverberated signals, the wMPDR beamformer is factorization approach reduces the size of the matrix to be estimated in lines 20 and 21 of A-1 and lines 8 and 9 of A-2, calculated substantively from M 2 (L−∆) for Ψ to M (L−∆) and applied in line 22 of A-1 and line 10 of A-2. for Rx , and thus can effectively reduce the computing cost. (i) Figure 2 also illustrates the processing flow of a CBF with On the other hand, with the source-wise factorization, G the source-wise factorization for estimating a source i. can be optimized independently of q(i) , which also allows 1) Methods for estimating TF masks and RTFs: In ex- us to reduce the size of the matrix to be calculated to the (i) periments, for estimating TF masks, γt , for all i and t at same as that of the proposed extension of the source-packed each frequency, we used a Convolutional Neural Network factorization approach. In addition, we can skip the calculation ⊥ that works in the TF domain and is trained using utterance- of an additional matrix, Rx , and that of the inverse of level Permutation Invariant Training criterion (CNN-uPIT) the huge matrix, Ψ−1 , which are required for the proposed [43]. According to our preliminary experiments [32], we set extension of the source-packed factorization approach. This the network structure as a CNN with a large receptive field makes the source-wise factorization approach computationally
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 10 TABLE I CBF S COMPARED IN EXPERIMENTS . (1) AND (2) ARE CONVENTIONAL CASCADE CONFIGURATION APPROACHES , (5) IS A CONVENTIONAL JOINT OPTIMIZATION APPROACH , (6) AND (7) ARE PROPOSED JOINT OPTIMIZATION APPROACHES , AND (3) AND (4) ARE TEST CONDITIONS USED JUST FOR COMPARISON . (5), (6), AND (7) ARE CATEGORIZED AS “J OINTLY OPTIMAL” BECAUSE THEY ARE COMPOSED OF WPE AND W MPDR AND OPTIMIZED BASED ON INTEGRATED VARIANCE ESTIMATION ( SEE F IG . 3 FOR THE DIFFERENCE BETWEEN SEPARATE AND INTEGRATED VARIANCE ESTIMATION ). Name of method Jointly WPE BF Variance Category optimal estimation (1) WPE+MPDR (separate) Multiple-target MPDR Separate Cascade (conventional) (2) WPE+MVDR (separate) Multiple-target MVDR Separate Cascade (conventional) (3) WPE+wMPDR (separate) Multiple-target wMPDR Separate Test condition (4) WPE+MPDR (integrated) Single-target MPDR Integrated Test condition (5) Source-packed factorization (conventional) X Multiple-target wMPDR Integrated Jointly optimal (conventional) (6) Source-packed factorization (extended) X Multiple-target wMPDR Integrated Jointly optimal (proposed) (7) Source-wise factorization X Single-target wMPDR Integrated Jointly optimal (proposed) further efficient. A drawback of the source-wise factorization is that it has to handle I-times larger number of dereverberated Derev BF Derev BF signals than the source-packed factorization. The source-wise factorization approach has additional ben- (a) Separate optimization (b) Integrated optimization efits w.r.t. computational efficiency when it is used in specific scenarios listed below: Fig. 3. Separate and integrated variance optimization schemes. While the separate variance optimization updates λt for Derev as the variance of the • The source-wise factorization approach can estimate the Derev output, the integrated variance optimization updates it as the variance CBF by a closed-form equation when time-varying source of the beamformer output. As a consequence, λt for Derev is common to all variances are given, or estimated, e.g., using neural net- the sources with separate variance optimization. works [15], [12]. In such a case, we can skip the iterative optimization. In contrast, the source-packed factorization approach needs to maintain iterations to estimate Q and CBF using a different type of masks and also to obtain g alternately due to their mutual dependency. the top-line performance of a CBF. • The source-wise factorization approach is advantageous when it is combined with neural network-based single A. Dataset and evaluation metrics target speaker extraction that has been actively studied For the evaluation, we prepared for a set of noisy reverberant recently [13]. With this combination, we can skip the es- speech mixtures (REVERB-2MIX) using the REVERB Chal- timation of sources other than the target source, allowing lenge dataset (REVERB) [18]. Each utterance in REVERB us to further reduce the computing cost. contains a single reverberant speech with moderate stationary diffuse noise. For generating a set of test data, we mixed two V. E XPERIMENTS utterances extracted from REVERB, one from its development This section experimentally confirms the effectiveness of the set (Dev set) and the other from its evaluation set (Eval set), proposed joint optimization approaches. Table I summarizes so that each pair of mixed utterances were recorded in the optimization methods to be compared in the experiments (See same room, by the same microphone array, and under the same Sections V-C and V-D for the detail of the methods). We condition (near or far, RealData or SimData). We categorize compare them in the following three aspects. the test data according to the original categories of the data in 1) Effectiveness of joint optimization REVERB (e.g., SimData or RealData). We created the same We compare a CBF with and without joint optimization number of mixtures in the test data as in the REVERB Eval set, in terms of estimation accuracy. The source-wise fac- such that each utterance in the REVERB Eval set is contained torization approach (Table I (7)) is compared with the in one of the mixtures in the test data. Furthermore, the length conventional cascade configuration (Table I (1) and (2)), of each mixture in the test data was set at the same as that of and two additional test conditions (Table I (3) and (4)). the corresponding utterance in the REVERB Eval set, while 2) Comparison among joint optimization approaches the utterance from Dev set was trimmed or zero-padded at its We compare three joint optimization approaches, i.e., end to have the same length as that of Eval set. the source-packed factorization approach with its con- For experiments in Section V-E, we also prepared for a ventional setting (Table I (5)), its proposed extension set of noisy reverberant speech mixtures, each of which is (Table I (6)), and the source-wise factorization approach composed of three speaker utterances (REVERB-3MIX). We (Table I (7)), described, respectively, in Sections III-B1, created REVERB-3MIX by adding one utterance extracted III-B2, and III-D, in terms of computational efficiency from REVERB Dev set to each mixture in REVERB-2MIX. and estimation accuracy. Only RealData (i.e., real recordings of reverberant data) was 3) Evaluation using oracle masks created for REVERB-3MIX. We use oracle masks instead of estimated masks for In the experiments, we estimated two and three speech evaluating a CBF, in order to test the performance of a signals from each mixture, respectively, for REVERB-2MIX
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 11 TABLE II TABLE III B EAMFORMER CONFIGURATIONS USED IN EXPERIMENTS WER (%) FOR R EAL DATA AND CD ( D B), FWSSNR ( D B), PESQ AND STOI FOR S IM DATA IN REVERB-2MIX OBTAINED USING DIFFERENT M L at each freq. range (kHz) #iterations BEAMFORMERS AFTER FIVE ESTIMATION ITERATIONS WITH C ONFIG -1. 0.0-0.8 0.8-1.5 1.5-8.0 S CORES FOR REVERB-2MIX AND REVERB ( I . E ., SINGLE SPEAKER ) Config-1 8 20 16 8 10 WITH NO ENHANCEMENT (N O E NH ), ARE ALSO SHOWN . Config-2 4 20 16 8 10 Enhancement method WER CD FWSSNR PESQ STOI No Enh (REVERB-2MIX) 62.49 5.44 1.12 1.12 0.55 and REVERB-3MIX, and evaluated only one of them corre- No Enh (REVERB) 18.61 3.97 3.62 1.48 0.75 sponding to the REVERB Eval set, using baseline evaluation MPDR (w/o iteration) 30.79 4.40 3.07 1.45 0.73 tools prepared for it. We selected the signal to be evaluated MVDR (w/o iteration) 30.89 4.43 3.00 1.44 0.73 from all the estimated speech signals based on the corre- wMPDR 28.75 3.96 4.46 1.60 0.75 lation between the separated signals and the original signal (1) WPE+MPDR (separate) 23.04 4.30 3.77 1.58 0.77 in the REVERB Eval set. As objective measures for speech (2) WPE+MVDR (separate) 23.34 4.34 3.66 1.57 0.76 enhancement [48], we used the Cepstrum Distance (CD), the (3) WPE+wMPDR (separate) 21.53 3.74 5.42 1.77 0.82 Frequency-Weighted Segmental SNR (FWSSNR), the Percep- (4) WPE+MPDR (integrated) 23.22 4.28 3.66 1.56 0.76 (7) Source-wise factorization 20.03 3.67 5.57 1.80 0.81 tual Evaluation of Speech Quality (PESQ), and the Short-Time Objective Intelligibility measure (STOI) [49]. To evaluate the ASR performance, we used a baseline ASR system for (1) WPE+MPDR (separate) (4) WPE+MPDR (integrated) REVERB that was recently developed using Kaldi [50]. This (2) WPE+MVDR (separate) (7) Source-wise factorization (3) WPE+wMPDR (separate) system is composed of a Time-Delay Neural Network (TDNN) acoustic model trained using lattice-free maximum mutual 24 4.4 information (LF-MMI) and online i-vector extraction, and a trigram language model. They are trained on the REVERB 23 4.2 training set. 22 WER (%) CD (dB) 4 B. Configurations of CBF 21 Table I summarizes two configurations of the CBF examined 3.8 in experiments including the number of microphones M , the 20 filter length L, and the number of optimization iterations. The 3.6 sampling frequency was 16 kHz. A Hann window was used for a short-time analysis with the frame length and shift being 19 2 4 6 8 10 2 4 6 8 10 set at 32 ms and 8 ms, respectively. The prediction delay was #iterations #iterations set at ∆ = 4 for the WPE filter. 6 1.85 In the iterative optimization, the time-varying variances of 1.8 sources were initialized as those of the observed signal for the 5.5 WPE filter and as 1 for the wMPDR beamformer for all the 1.75 FWSSNR (dB) methods. 5 1.7 PESQ 4.5 1.65 C. Experiment-1: effectiveness of joint optimization 1.6 In this experiment, we evaluated the effectiveness of the 4 joint optimization focusing on its two characteristics. First, 1.55 we compared three different filter combinations, a WPE filter 3.5 1.5 followed by a wMPDR beamformer (WPE+wMPDR), a WPE 2 4 6 8 10 2 4 6 8 10 filter followed by an MPDR beamformer (WPE+MPDR), #iterations #iterations and a WPE filter followed by an MVDR beamformer Fig. 4. Comparison among joint optimization and cascade configuration (WPE+MVDR). The first combination is required for the approaches when using WPE+MPDR and WPE+wMPDR with integrated and jointly optimal processing, and the others have been used for separate optimization schemed using Config-1 for REVERB-2MIX. the conventional cascade configuration. Second, we compared two different variance optimization schemes shown in Fig. 3, namely “separate” and “integrated” variance optimization beamformer. A significant difference between the two schemes schemes. With the separate variance optimization, the itera- is whether the WPE filter uses the same variances for all tive estimation of the time-varying variance was performed the sources or different variances dependant on the sources separately for the WPE filter and for the beamformer. This is estimated by the beamformer. the scheme used by the conventional cascade configuration. In Table III compares WERs, CDs, FWSSNRs, PESQs, and contrast, with the integrated variance optimization, the iterative STOIs obtained after five estimation iterations using three estimation was performed jointly for the WPE filter and the beamformers (MPDR, MVDR, and wMPDR), two conven-
JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. X, XXXXX 2020 12 8 8 Frequency (kHz) Frequency (kHz) 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (a) Observed signal (b) MVDR 8 8 Frequency (kHz) Frequency (kHz) 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (c) WPE+MVDR (d) CBF with source-wise factorization Fig. 5. A spectrogram of (a) a noisy reverberant mixture in RealData of REVERB-2MIX and those of enhanced signals obtained by (b) MVDR, (c) WPE+MVDR and (d) CBF with source-wise factorization. The mixture is composed of two female speakers under far conditions. tional cascade configuration approaches ((1) WPE+MPDR the integrated variance optimization, are both important for and (2) WPE+MVDR), two test conditions ((3) and (4)), achieving the optimal performance. and a proposed joint optimization approach ((7) source-wise factorization). All methods used configuration Config-1 in D. Experiment-2: Comparison among joint optimization ap- Table I. Table III shows that 1) WPE+MPDR, WPE+MVDR, proaches and WPE+wMPDR greatly outperformed MPDR, MVDR, In this experiments, we compared three joint optimiza- and wMPDR, respectively, with all the conditions, 2) the tion approaches, namely two source-packed factorization ap- joint optimization approach, i.e., (7) source-wise factorization, proaches, respectively, described in Sections III-B1 and substantially outperformed all the other methods in terms of III-B2, and denoted as “(5) Source-packed factorization (con- all the measures except for a case in terms of STOI where ventional)” and “(6) Source-packed factorization (extended),” WPE+wMPDR (separate) gave a slightly better score than and the source-wise factorization approach, denoted as “(7) (7) source-wise factorization. Furthermore, Fig. 4 shows con- Source-wise factorization.” (5) Source-packed factorization vergence curves of the two cascade configuration approaches, (conventional) corresponds to the conventional joint optimiza- two test conditions, and the joint optimization approach. The tion technique, and (6) Source-packed factorization (extended) performance of (7) source-wise factorization was the best of and (7) Source-wise factorization correspond to our proposed all and improved as the number of iterations increased. The methods. Figure 6 compares the WERs obtained using the second best was (3) WPE+wMPDR (separate). In contrast, the three joint optimization approaches with Config-1 and Config- other methods did not improve the scores after the first iter- 2. It shows that the proposed methods, i.e., (6) Source-packed ation with both integrated and separate variance optimization factorization (extended) and (7) Source-wise factorization, schemes. performed comparably well and both greatly outperformed (5) Figure 5 shows a spectrogram of a noisy reverberant mixture Source-packed factorization (conventional). in RealData of REVERB-2MIX, and those of enhanced signals Table IV compares computing times required for the three obtained using MVDR, WPE+MVDR, and CBF with source- approaches to estimate and apply the CBFs with ten estimation wise factorization. The figure demonstrates that while all the iterations for processing a mixture utterance with the length enhancement methods were effective, the CBF with source- being 9.44 s. The computing time was measured by a Mat- wise factorization was the best of all for achieving denoising, lab interpreter as the elapsed time. The computing time for dereverberation, and source separation. estimating masks, which was 0.63 s and 7.2 s, respectively, The above results clearly show that the two characteristics with and without a GPU (NVIDIA 2080ti), is not included of the joint optimization approach, i.e., 1) the optimal com- in the table. As shown in the table, for both configurations, bination of a WPE filter and a wMPDR beamformer, and 2) (6) Source-packed factorization (extended) greatly reduced
You can also read