A NONLOCALLY WEIGHTED SOFT-CONSTRAINED NATURAL GRADIENT ALGORITHM FOR

A NONLOCALLY WEIGHTED SOFT-CONSTRAINED NATURAL GRADIENTALGORITHM FOR BLIND SEPARATION OF REVERBERANT SPEECH

Jack Xin, Meng Yu, Yingyong Qi, Hsin-I Yang, Fan-Gang Zeng

Mathematics and Biomedical EngineeringUniversity of California at Irvine

Irvine, CA 92697, USA.{jack.xin, m.yu}@uci.edu.

ABSTRACT

A nonlocally weighted soft-constrained natural gradient itera-tive method is introduced for robust blind separation in reverber-ant environment. The nonlocal weighting of the iterations pro-motes stability and convergence of the algorithm for long demix-ing filters. The scaling degree of freedom is controled by soft-constraints built into the auxillary difference equations. The smalldivisor problem of iterations in silence durations of speech is re-solved. Computations on synthetic speech mixtures based on mea-sured binaural room impulse responses show that the algorithmachieves higher signal-to-inteference ratio improvement than ex-isting method (natural gradient time domain algorithm) in an officesize room with reverberation time over 0.5 second.

Index Terms— Adaptive estimation, weighted and controlediteration, reverberant speech, blind separation.

1. INTRODUCTION

A variety of methods rooted in the independent component analy-sis have been generalized to convolutive mixtures in the past decade,[2, 3, 4, 5, 6, 7, 8, 9, 10, 12] among others. However room rever-beration remains one of the main barriers to applications of blindsource separation methods (BSS) in realistic listening conditions.A typical office size room has reverberation time (T60, [1]) near0.5 second. At 10 kHz sampling frequency, separating two chan-nel mixtures of two sources already requires finding 5000 × 4 or20k demixing filter taps. It is incredibly challenging to find an ac-curate solution to this high dimensional problem, considering alsothat the associated objective function is non-convex. On the otherhand, a practically useful BSS method should perform well in areverberant environment (T60 ≥ 0.5s), with at least 5 dB improve-ment after processing. The 5 dB is a typical gain of beamformingdirectional microphones.

In this paper, we introduce a new time domain convolutive nat-ural gradient algorithm for robust BSS of reverberant speech sig-nals. Computational results shall demonstrate that it outperformsthe recent method [4], and improves signal-to interference ratio(SIR) of reverberant speech (T60 = 0.56 second) by over 10 dB.

To motivate the main ingredients of our algorithm design, letus recall the existing convolutive natural gradient method [2, 4].Let the mixing model be

x(t) = [A ∗ s](t) (1.1)

Partially supported by NSF Grant DMS-0712881, NIH Grant2R44DC006734, UCI grant CORCLR-MI-2006-07-6.

where [A ∗ s](t) =∑q

i=0 A(i)s(t− i) is the linear convolution ofsource vector s ∈ Rn with A(i), a group of n×n matrices. Let Lbe the demixing filter length, and N be the number of data pointsin each block (frame). The demixing (inversion) formula is:

yk(l) =L∑

p=0

Wk(p) x(l − p). (1.2)

The scaled natural gradient method [4] is:

Wk+1(p) = (1 + μ) d−1k Wk(p) − μ d−2

k Hk(p), (1.3)

where μ is the learning parameter, the scaling factor dk is linear inthe mixture signal x in the current block, and:

Hk(p)�=

1

N

kN∑l=(k−1)N+1

L∑q=0

f(yk(l))yTk (l − p + q)Wk(q),

(1.4)where y is related x as in (1.2), and f is the sign function. Thescaling variable d(k) is intended to cure divergence that arises if dk

were a positive constant independent of k as originally proposed[2]. However, during silence (or near silence) period of speech,dk may be very small to cause a drastic growth (instability) in thesolution, or a small divisor problem. Specifically,

dk =1

n

n∑i,j=1

2L∑q=0

|gijk (q)|, (1.5)

where (gijk (q)) are entries of an n × n matrix G for each (k, q).

The matrix G is:

Gk(q)�=

L∑p=0

Fk(q − p) W�k (L − p), q ∈ [0, 2L] (1.6)

where the matrix F is:

Fk(p)�=

1

N

kN∑l=(k−1)N+1

f(yk(l)) x�(l − p). (1.7)

In the sum of (1.6), Fk(p) is set to zero if p �∈ [0, L]. It followsfrom a direct calculation that

Hk(p) =L∑

q=0

Gk(L + p − q) Wk(q), (1.8)

978-1-4244-3679-8/09/$25.00 ©2009 IEEE 81

2009 IEEE Workshop on Applications of Audio and Acoustics October 18-21, 2009, New Paltz, NY2009 IEEE Workshop on Applications of Audio and Acoustics October 18-21, 2009, New Paltz, NY

−200 −150 −100 −50 0 50 100 150 200−0.01

−0.005

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

kernel position

kern

el am

plitud

e

Figure 1: Measured (sliding window) generalized correlation func-tion from a clean speech signal

which is another way to express Hk(p) in (1.4). One observes thatif x(i) = 0 for i in a block, then Fk(p) = 0 by (1.7), implyingthat Gk(p) = 0 for p ∈ [0, L], and dk = 0. Likewise, if x(i) ≈ 0for i in a block, dk ≈ 0 causing small divisor problem in (1.3).The expressions (1.6)-(1.8) are much less transparent than (1.4),but are helpful for implementation.

There is also an inconsistency problem in (1.2)-(1.4). Supposethat yk converges up to a limiting signal s with independent com-ponents at large values of N , and that Wk(q) converges to W∞(q),then

H∞(p) ≡ limk→∞,N→∞

Hk(p) =

L∑q=0

E[f(s) sT (q − p)]W∞(q),

(1.9)which is a convolutive (non-local) product that cannot be balancedby the local term (constant multiple of Wk) in (1.3)! The gen-eralized correlation matrix E[f(s)(·) sT (· − η)] is diagonal withdiagonal entries being functions of η with finite support. Only inthe special case that s is independent from time to time, the ma-trix E[f(s)(·) sT (· − η)] is zero if η �= 0, and the sum in (1.9)reduces to a local multiplication. In other words, with local term(1 + μ) d−1

k Wk(p) in (1.3), convergence cannot happen for mix-tures of colored stochastic signals such as speech for which f is thesign function as a consequence of Laplace distribution of speechamplitudes. The solution is to adopt a nonlocal term which is equalto convolution of a kernel function with Wk(p), see (2.2) below.

The present contribution resolves both the inconsistency andthe small divisor scaling problem for convolutive source separa-tion by equiping the natural gradient iteration with two importantingredients: nonlocal weighting and soft constraints. In section 2,the new algorithm and its properties are outlined. In section 3, ex-perimental results on reverberant speech show robust and excellentperformance. Conclusions are in section 4.

2. NONLOCALLY WEIGHTED SOFT-CONSTRAINEDITERATIVE SCHEME

Let us introduce the nonlocally weighted demixing matrix iterationas:

Wk+1(p) = Wk(p) + σ1,k [M � Wk](p) − σ2,k Hk(p), (2.1)

−20 −15 −10 −5 0 5 10 15 200

0.005

0.01

0.015

0.02

0.025

0.03

Kernel Position

Kern

el Am

plitud

e

Figure 2: Piecewise linear kernel function used in the computation(right), with base width LM = 5 and peak height 0.03.

where H is by (1.4) and � is a nonlocal product with a nonlinearkernel function M :

[M � Wk](p)�=

LM∑i=1

M(i)Wk

(p − LM − 1

2+ i − 1

)(2.2)

and LM is the length of the support of the kernel function. Thekernel function M = M(i) used in our computation is plotted inFig. 2. It is constructed from numerical data of (sliding window)generalized correlation functions of clean speech signals, see theFig. 1 for a typical example. We simplify the ”measured kernels”into a piecewise linear function (hat function) with LM = 5 andamplitude equal to 0.03, shown in the right panel. The shape ofkernel function is obtained by optimal searching. The value of Wk

on the right hand side of (2.2) is set to zero if p− LM−12

+ i−1 �∈[0, L].

Notice that the nonlocal product in (2.2) captures the limitingform of Hk(p) in (1.9). For simplicity, we have chosen the samekernel function M for all entries of Wk(p) matrix. The locallyscaled version (1.3) is a special case, for example when the supportof piecewise linear M shrinks to a single point.

The scaling variables (σ1,k, σ2,k) are updated by the equa-tions:

σ1,k+1 = σ1,k exp{−νF1,k(σ1,k, σ2,k)}σ2,k+1 = σ2,k exp{−νF2,k(σ1,k, σ2,k)} (2.3)

where ν is a positive constant, and:

F1,k�= σ1,k

L∑p=0

n∑j=1

|W 1,jk (p)| + σ2,k

L∑p=0

n∑j=1

|H1,jk (p)| − c1,

(2.4)

F2,k�= σ1,k

L∑p=0

n∑j=1

|W 2,jk (p)| + σ2,k

L∑p=0

n∑j=1

|H2,jk (p)|

− c2 + σ2,kE (2.5)

where c1, c2 and E are positive constants.Equations (2.3)-(2.5) say that if |Wk| were too large, F1 and

F2 would be positive and reduce (σ1, σ2) so that |Wk| would notcontinue to grow in k and become unbounded. Similarly, if |Wk|were too small, the nonlinear term H would be smaller, and soF1 and F2 would turn negative so that (σ1, σ2) would grow in k

82


and not become trivial. The growth of σ1 in turn helps the growthof |W | which dominates that of H in equation (2.1) when |W | issmall enough. It follows that the dynamics by (2.1)-(2.5) admitan invariant region where the iterates are bounded and nontrivial,which implies weak convergence of the iteration and the separationcondition that the off-diagonal elements of < f(yk(t)yT

k (t−η) >tend to zero as k 1 for a range of η, < · > being a temporalaverage in t, [11].

3. EXPERIMENTAL RESULTS

The proposed nonlocally weighted soft-constrained natural gradi-ent (NLW-SCNG) algorithm is tested on convolutive and reverber-ant speech mixtures. Two clean speech signals are convolved witha set of measured binaural room impulse responses (BRIRs), [14].The source signals are sentences of two male speakers, 5 secondsin duration, recorded at a sampling rate of 10 kHz. The BRIRs aremeasured in a 5×9×3.5 m ordinary classroom using the KnowlesElectronic Manikin for Auditory Research (KEMAR), positionedat 1.5 m above the floor and at ear level [14]. The BRIR dataare more faithful to the setting of a human wearing hearing aids,and are also used in [4, 8]. By convolving the speech signals withthe pre-measured room impulse responses, one source is virtuallyplaced directly at the front of the listener and the other at an an-gle of 60 ◦ in the azimuth to the right, while both are located at1 m away from the KEMAR. We then calculate the reverberationtime to quantify the degree of difficulty of a separation task [13].To assess the separation ability of the algorithm, we calculate thesignal-to-interference-ratio improvement (SIRI, [8]) by measuringthe overall amount of crosstalk reduction achieved by the algo-rithm before (SIRi) and after (SIRo) the demixing stage. Follow-ing [8], the SIRI in dB is:

SIRI�= SIRo − SIRi

�= 10 log10

(m∑

i=1

∑Lengthl=1 |sii|2∑n

j=1,j �=i

∑Lengthl=0 |sij |2

)

− 10 log10

(m∑

i=1

∑Lengthl=1 |xii|2∑n

j=1,j �=i

∑Lengthl=0 |xij |2

)

(3.6)

where the s’s are the output signals, and x’s are the input signals;m denotes the number of microphones, n denotes the number ofsources, and Length is for the length of speech signal. We shallfocus on SIRI in this work, and report on subjective speech qualitymeasure in a future study.

Let us compare the novel algorithm NLW-SCNG with the re-cent natural gradient time domain algorithm (NGTD). It is wellknown that beamforming initialization and data prewhitening im-prove separation. In order to evaluate the algorithms themselves,neither NGTD nor NLW-SCNG has beamforming initializationand prewhitening preprocessing.

First consider the BRIR mixtures with reverberation time T60 =150 ms. We use the piecewise linear kernel in Fig. 2 with basewidth 5 and peak height 0.03. Two schemes are tested. Scheme I(in Fig. 3) refers to applying NLW-SCNG once through the mix-tures, and scheme II (in Fig. 4) refers to processing by reinitial-izing and rerunning NLW-SCNG on the output of scheme I. Theparameters are: ν = 0.00125, c1 = 1, c2 = 3, E = 0.04, framelength is 10000, frame step size is 100 point, and demixing filter

Demixing filter WH

Kernel

Convolutive mixturesSpeech ( )x t

Iteration and update demixing filter W

SeparatedSpeech ( )s t

M W

Figure 3: Schematic diagram of Scheme I (NLW-SCNG). Theinput mixed reverberant speeches are iterated frame by frameto update demixing filter through nonlocally weighted and soft-constrained natural gradient iterations.

ReinitializationNLW-SCNG NLW-SCNG

ReverberantSpeech ( )x t

SeparatedSpeech ( )s t

Figure 4: Schematic diagram of Scheme II (NLW-SCNG withreinitialization). The first-step output ”separated” speech signalsare sent into NLW-SCNG again as input ”mixed” speech signals.

length L is 1000. The initial conditions are: W0(0) = 0.2 I2,I2 the 2x2 identity matrix; W0(p) = 0, p ≥ 1; σ1 = 1.2 andσ2 = 0.24. We found that scheme II improves on scheme I un-der 1 dB in slightly reverberant conditions (T60 ≤ 150 ms). Theimproved SIR is 17.37 dB.

Shown in Fig. 5 and Fig. 6 are the evolution of control vari-ables (F1, F2) and the scaling variables (σ1, σ2) in terms of it-eration numbers. The algorithm in reverberant environment con-verges in approximately 250 iterations as illustrated in the Fig-ures. If only soft-constrained, i.e. without nonlocal weighting,the control and scaling sequences appear as damped oscillationswhich may take as many as 3000 iterations to converge. With addi-tional nonlocal weighting, the control variables (F1, F2) and scal-ing variables (σ1, σ2) in the new algorithm NLW-SCNG convergesignificantly faster.

For the NGTD algorithm, the initial condition of W is givenby W0(1) = 0.001I , W0(p) = 0 if p > 1; step size μ = 0.2[4], demixing filter length L = 1000; frame length 10000, andframe step size is equal to 100 sample points. Under the acousticenvironment T60 = 150 ms, the improved SIR is 12.75 dB.

Now a typical reverberant room environment with T60 = 560

0 50 100 150 200 2500

5

10

15

F1

iteration

0 50 100 150 200 250−5

0

5

10

15

F2

iteration

Figure 5: Convergence of control variables F1 and F2 in terms ofiteration numbers.

83


Documents

A NONLOCALLY WEIGHTED SOFT-CONSTRAINED NATURAL GRADIENT ALGORITHM FOR