E4896 Music Signal Processing (Dan Ellis) 2013-04-29

Lecture 14: Source Separation

Dan Ellis, Dept. of Electrical Engineering, Columbia University
[email protected]
http://www.ee.columbia.edu/~dpwe/e4896/

1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation



  • 1. Sources, Mixtures, & Perception

    Sound is a linear process (superposition): there is no "opacity," unlike vision, so sources combine into "auditory scenes" (polyphony).

    Humans perceive discrete sources, a subjective construct.

    [Figure: spectrogram of "02_m+s-15-evil-goodvoice-fade" (0-12 s, 0-4000 Hz, level in dB, -60 to 0), with labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant).]

  • Spatial Hearing

    People perceive sources based on spatial (binaural) cues: interaural time difference (ITD) and interaural level difference (ILD).

    ITD arises from the path-length difference between the two ears; ILD arises from head shadow (mainly at high frequencies).

    [Figure: left/right waveforms of "shatr78m3" around 2.2-2.235 s showing the inter-ear delay; source/head geometry after Blauert '96.]
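As a rough illustration (not from the slides), the ITD cue can be estimated by cross-correlating the left and right ear signals and picking the lag with the largest correlation. The helper name `estimate_itd` and the synthetic noise-burst signals below are made up for the sketch:

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate interaural time difference (seconds) as the lag that
    maximizes the cross-correlation of the two ear signals.
    Positive lag means the left channel is a delayed copy of the right."""
    n = len(left)
    corr = np.correlate(left, right, mode="full")  # lags -(n-1) .. (n-1)
    lag = np.argmax(corr) - (n - 1)
    return lag / fs

# Synthetic check: the same noise burst, arriving 10 samples later
# at the right ear (so the left channel leads).
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(1024)
delay = 10
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])
itd = estimate_itd(left, right, fs)
```

With this sign convention, a leading left channel gives a negative lag; real binaural processing would also use ILD and per-band (e.g. cross-correlogram) analysis rather than one broadband lag.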

  • Auditory Scene Analysis

    Spatial cues may not be enough, or may be unavailable (a single-channel signal).

    The brain uses signal-intrinsic cues, such as common onset and harmonicity, to form sources.

    [Figure: spectrogram (0-4 s, 0-4 kHz, level in dB, -40 to +30) of the Reynolds-McAdams oboe; Bregman '90.]

  • Auditory Scene Analysis

    "Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)

    Quite a challenge!

  • Audio Mixing

    Studio recording combines separate tracks into, e.g., 2 channels (stereo), with different levels, panning, and other effects.

    Stereo intensity panning manipulates ILD only, at constant power; with more channels, one option is to use just the nearest pair.

    [Figure: L and R constant-power gain curves (0 to 1) versus pan position (-1 to +1).]
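A minimal sketch of constant-power intensity panning as described above (the function name and the pan-to-angle mapping are this sketch's own choices, not from the slides): the two gains follow a sin/cos law so their squares always sum to one, keeping perceived power constant across pan positions.

```python
import numpy as np

def constant_power_pan(mono, pan):
    """Pan a mono signal to stereo.
    pan in [-1, 1]: -1 = hard left, +1 = hard right.
    Gains follow a cos/sin law, so gL**2 + gR**2 == 1 (constant power)."""
    theta = (pan + 1.0) * np.pi / 4.0   # map pan to an angle in [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono

# Center pan: both channels get gain 1/sqrt(2), i.e. -3 dB each.
left, right = constant_power_pan(np.ones(8), 0.0)
```

A simple linear crossfade (gL + gR = 1) would instead dip by about 3 dB at the center, which is why mixers use the constant-power law.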

  • 2. Spatial Filtering

    N sources detected by M sensors: need enough degrees of freedom, M ≥ N (else need other constraints).

    Consider the 2 × 2 case with directional mics; the mixing matrix is

      [m₁; m₂] = [a₁₁ a₁₂; a₂₁ a₂₂] [s₁; s₂]

    so the sources are recovered by inverting it:

      [ŝ₁; ŝ₂] = A⁻¹ m

  • Source Cancelation

    Simple 2 × 2 example:

      [m₁; m₂] = [1 0.5; 0.8 1] [s₁; s₂]

      m₁(t) = s₁(t) + 0.5 s₂(t)
      m₂(t) = 0.8 s₁(t) + s₂(t)

      ⇒ m₁(t) − 0.5 m₂(t) = 0.6 s₁(t)

    If there is no delay and the sums are linearly independent, each combination can cancel one source.
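The 2 × 2 cancellation above can be checked numerically; this sketch uses random stand-in sources and the slide's mixing matrix, and also recovers both sources by inverting the matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
s1, s2 = rng.standard_normal((2, 1000))   # two stand-in source signals

A = np.array([[1.0, 0.5],
              [0.8, 1.0]])                # mixing matrix from the slide
m1, m2 = A @ np.array([s1, s2])

# Cancel s2: m1 - 0.5*m2 = (1 - 0.5*0.8) * s1 = 0.6 * s1
est = m1 - 0.5 * m2

# Full unmixing: invert A to recover both sources exactly
s_hat = np.linalg.inv(A) @ np.array([m1, m2])
```

This only works because the mix is instantaneous (no delay) and the rows of A are linearly independent; real rooms add filtering, which is why convolutive methods are needed in practice.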

  • Independent Component Analysis

    "Blind" combinations can be separated by maximizing the independence of the outputs, using, e.g., kurtosis as a proxy for independence:

      kurt(y) = E[((y − µ)/σ)⁴] − 3

    [Figure: 2 × 2 mixing network (sources s₁, s₂ through a₁₁, a₁₂, a₂₁, a₂₂ to mics m₁, m₂); scatter plot of mix 1 vs. mix 2 showing the source directions; kurtosis vs. unmixing angle, peaking at the directions of s₁ and s₂.]

    Bell & Sejnowski '95
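The intuition behind the kurtosis criterion is the central limit theorem: a mixture of independent sources is "more Gaussian" (lower excess kurtosis) than the sources themselves, so an unmixing direction that maximizes kurtosis points back at a source. A small demonstration, using Laplacian noise as a stand-in for super-Gaussian audio sources (the mixing weights are arbitrary choices for the sketch):

```python
import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu)/sigma)**4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
# Laplacian sources are super-Gaussian (excess kurtosis 3)
s1, s2 = rng.laplace(size=(2, 100_000))

# Any mixture of the two is pushed toward Gaussian, lowering kurtosis
mix = 0.6 * s1 + 0.8 * s2
```

Full ICA algorithms (e.g. Bell & Sejnowski's infomax) search for the unmixing matrix by gradient ascent on such a non-Gaussianity or mutual-information objective rather than evaluating one fixed direction.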

  • Microphone Arrays

    If interference is diffuse, we can simply boost energy from the target direction, e.g., a shotgun mic, or delay-and-sum beamforming.

    Caveats: off-axis spectral coloration; many variants exist (filter-and-sum, sidelobe cancelation, ...).

    [Figure: delay-and-sum array with element spacing D; beam patterns (0 to -40 dB) for λ = 4D, λ = 2D, λ = D; Benesty, Chen & Huang '08.]
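A minimal delay-and-sum sketch (integer-sample steering delays only; real beamformers use fractional delays derived from array geometry): each microphone is advanced by its steering delay so that signals from the look direction add coherently, while off-axis signals add incoherently and are attenuated.

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Advance each mic signal by its (integer) steering delay and average,
    so arrivals from the look direction line up and add coherently."""
    n = min(len(m) - d for m, d in zip(mics, delays_samples))
    return sum(m[d:d + n] for m, d in zip(mics, delays_samples)) / len(mics)

# Toy example: 3 mics receive the target 0, 3, and 6 samples late.
rng = np.random.default_rng(0)
target = rng.standard_normal(500)
mics = [np.concatenate([np.zeros(3 * i), target, np.zeros(6 - 3 * i)])
        for i in range(3)]
out = delay_and_sum(mics, [0, 3, 6])   # steered at the target direction
```

Steering with the wrong delays leaves the copies misaligned, which is exactly the off-axis attenuation shown in the beam patterns; the frequency dependence (λ vs. D) is what causes off-axis spectral coloration.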

  • 3. Time-Frequency Masking

    What if there is only one channel? We cannot have fixed cancellation, but we could have fast time-varying filtering.

    The trick is finding the right mask...

    [Figure: spectrograms (0-8 kHz, 0.5-3 s) of a mixture and the mask-selected regions; Brown & Cooke '94, Roweis '01.]

  • Time-Frequency Masking

    Works well for overlapping voices, though time-frequency resolution is a limitation.

    [Figure: spectrograms (0-8 kHz, 0-1 s, level in dB, -20 to +40) of the male and female originals, the mix with oracle labels, and the oracle-based resyntheses; audio: cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav.]
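The "oracle" labels above correspond to an ideal binary mask: keep each time-frequency cell where the target is stronger than the interference. A sketch (the simple magnitude-STFT helper, a sine "voice," and noise "interference" are all stand-ins for this illustration; resynthesis is omitted):

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude STFT with a periodic Hann window (analysis only)."""
    win = 0.5 * (1 - np.cos(2 * np.pi * np.arange(n_fft) / n_fft))
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # freq x time

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 220 * t)        # stand-in target source
noise = 0.3 * rng.standard_normal(fs)      # stand-in interference

S, N = stft_mag(voice), stft_mag(noise)
X = stft_mag(voice + noise)

mask = (S > N).astype(float)   # oracle: needs the true sources, hence "ideal"
S_hat = mask * X               # masked mixture approximates the target
```

The oracle mask is an upper bound on what masking can do; practical systems must estimate the mask from the mixture alone, using cues such as pitch, onsets, or spatial position.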

  • Pan-Based Filtering

    Time-frequency masking can be used even for stereo: e.g., calculate a "panning index" as the ILD, then mask cells matching that ILD.

    [Figure: mix spectrogram (0-6 kHz, 0-20 s, level in dB), per-cell ILD in dB (-5 to +5), and the result of an ILD mask (1024-pt window, -2.5 .. +1.0 dB); Avendano 2003.]
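A simplified stand-in for the panning-index idea (Avendano's actual index is a normalized cross-channel similarity; here plain per-cell ILD is used instead, and the toy "spectra" are made up): compute the level difference in each time-frequency cell and keep cells near the target pan position.

```python
import numpy as np

def ild_mask(L, R, target_db, tol_db=3.0, eps=1e-12):
    """Binary mask keeping T-F cells whose inter-channel level difference
    is within tol_db of the target pan position (a simple stand-in for
    a full panning index)."""
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return (np.abs(ild - target_db) < tol_db).astype(float)

# Toy magnitude spectra: bins 0-3 carry a source mixed with gains (1.0, 0.5),
# i.e. about +6 dB ILD; bins 4-7 carry one mixed with gains (0.5, 1.0).
A = np.array([1, 2, 3, 4, 0, 0, 0, 0], dtype=float)
B = np.array([0, 0, 0, 0, 4, 3, 2, 1], dtype=float)
L = 1.0 * A + 0.5 * B
R = 0.5 * A + 1.0 * B

m = ild_mask(L, R, target_db=6.0)   # selects the left-panned source's cells
```

This works for intensity-panned studio mixes because each source has a fixed, known ILD; it breaks down when sources overlap in the same cells or when real-room delays make the level difference frequency dependent.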

  • Harmonic-Based Masking

    Time-frequency masking can also be used to pick out harmonics: given a pitch track, we know where to expect them.

    [Figure: spectrograms (0-4 kHz, 0-8 s) before and after harmonic masking; Denbigh & Zhao 1992.]

  • Harmonic Filtering

    Given a pitch track, we could use a time-varying comb filter to extract the harmonics, or isolate each harmonic by heterodyning:

      x̂(t) = Σₖ âₖ(t) cos(k ω̂₀(t) t)
      âₖ(t) = |LPF{ x(t) e^(−j k ω̂₀(t) t) }|

    i.e., multiply by a complex exponential to shift harmonic k down to DC, low-pass filter, and take the magnitude as its amplitude envelope.

    [Figure: spectrograms (0-4 kHz, 0-8 s) of the mixture and the heterodyne-resynthesized harmonics; Avery Wang 1995.]
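A numerical sketch of the heterodyne step for a constant pitch (the function name, the crude moving-average low-pass, and the two-harmonic test tone are this sketch's own choices; Wang's method tracks a time-varying ω̂₀(t)):

```python
import numpy as np

def harmonic_envelope(x, f0, k, fs, lpf_len=401):
    """Heterodyne harmonic k of f0 down to DC, low-pass filter
    (moving average here), and take the magnitude as a_k(t)."""
    t = np.arange(len(x)) / fs
    baseband = x * np.exp(-2j * np.pi * k * f0 * t)   # shift k*f0 to 0 Hz
    kernel = np.ones(lpf_len) / lpf_len               # crude low-pass
    # factor 2 recovers the real-signal amplitude from the analytic half
    return 2.0 * np.abs(np.convolve(baseband, kernel, mode="same"))

fs = 8000
t = np.arange(fs) / fs
f0 = 200.0
# Test tone: fundamental at amplitude 1.0, second harmonic at 0.5
x = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

a2 = harmonic_envelope(x, f0, 2, fs)   # should settle near 0.5 mid-signal
```

The low-pass bandwidth must stay below f0/2 so that neighboring harmonics (shifted to multiples of f0 after heterodyning) are rejected; with a time-varying pitch the exponent uses the integrated phase ∫ k ω̂₀(τ) dτ rather than k ω̂₀ t.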

  • Nonnegative Matrix Factorization

    Decomposition of spectrograms into templates + activations:

      X = W · H

    A fast and forgiving gradient-descent algorithm; fits neatly with time-frequency masking.

    [Figure: spectrogram X (DFT index vs. DFT slices), the three bases from W, and the corresponding rows of H; Smaragdis '04; sound examples from Virtanen '03.]

    Lee & Seung '99; Abdallah & Plumbley '04; Smaragdis & Brown '03; Virtanen '07
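The factorization can be computed with Lee & Seung's multiplicative updates for the Euclidean cost ‖X − WH‖²; a minimal sketch on a toy "spectrogram" built from two known templates (the matrix sizes, iteration count, and random data are arbitrary choices for the demonstration):

```python
import numpy as np

def nmf(X, r, n_iter=500, seed=0):
    """Lee & Seung multiplicative updates minimizing ||X - W H||_F^2.
    Returns nonnegative templates W (freq x r) and activations H (r x time).
    The update rules preserve nonnegativity because they only multiply
    by nonnegative ratios."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r)) + 1e-3
    H = rng.random((r, X.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# Toy "spectrogram": 2 spectral templates with sparse activations
rng = np.random.default_rng(1)
W0 = rng.random((64, 2))
H0 = rng.random((2, 100)) * (rng.random((2, 100)) > 0.5)
X = W0 @ H0

W, H = nmf(X, r=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

For separation, each learned component W[:, k] H[k, :] is typically turned into a time-frequency mask (its share of the modeled energy per cell) and applied to the mixture spectrogram, which is the "fits neatly with time-frequency masking" point above.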

  • 4. Model-Based Separation

    When N (sources) > M (sensors), additional constraints are needed to solve the problem, e.g., the assumption of a single dominant pitch.

    These can be assembled into a model M of the source signals that defines the set of "possible" waveforms; probabilistically, Pr(sᵢ | M).

    Source separation from the mixture then becomes inference:

      ŝ = {ŝᵢ} = argmaxₛ Pr(x | s, A) P(A) ∏ᵢ Pr(sᵢ | M)

    where Pr(x | s, A) = N(x | A s, Σ).

  • Source Models

    Models can constrain: source spectra (e.g., harmonic, noisy, smooth), temporal evolution (piecewise-continuous), and spatial arrangement (point-source vs. diffuse).

    Factored decomposition:

      Vex,j = Wex,j Uex,j Gex,j Hex,j
      Vex,j = Eex,j Pex,j, with Eex,j = Wex,j Uex,j and Pex,j = Gex,j Hex,j

    Ozerov, Vincent & Bimbot 2010
    http://bass-db.gforge.inria.fr/fasst/

    [Figure: spectrograms (0-4000 Hz, 0.5-3 s) of a stereo instantaneous mix and separated sources 1-3. Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov.]

  • Summary

    Acoustic source mixtures: the normal situation in real-world sounds.

    Spatial filtering: canceling sources by subtracting channels.

    Time-frequency masking: selecting spectrogram cells.

    Model-based separation: exploiting regularities in source signals.

  • References

    S. Abdallah & M. Plumbley, "Polyphonic transcription by non-negative sparse coding of power spectra," Proc. Int. Symp. Music Info. Retrieval, 2004.
    C. Avendano, "Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications," Proc. IEEE WASPAA, Mohonk, pp. 55-58, 2003.
    A. Bell & T. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.
    J. Benesty, J. Chen, & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
    J. Blauert, Spatial Hearing, MIT Press, 1996.
    A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
    G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994.
    P. Denbigh & J. Zhao, "Pitch extraction and separation of overlapping speech," Speech Communication, vol. 11, no. 2-3, pp. 119-125, 1992.
    D. Lee & S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, p. 788, 1999.
    A. Ozerov, E. Vincent, & F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation," INRIA Tech. Rep. 7453, Nov. 2010. http://hal.inria.fr/docs/00/54/07/35/PDF/RR-7453.pdf
    S. Roweis, "One microphone source separation," Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
    P. Smaragdis & J. Brown, "Non-negative matrix factorization for polyphonic music transcription," Proc. IEEE WASPAA, pp. 177-180, Oct. 2003.
    T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Tr. Audio, Speech, & Lang. Proc., vol. 15, no. 3, pp. 1066-1074, 2007.
    A. Wang, Instantaneous and frequency-warped signal processing techniques for auditory source separation, Ph.D. dissertation, Stanford CCRMA, 1995.