E4896 Music Signal Processing (Dan Ellis) 2013-04-29

Lecture 14: Source Separation

Dan Ellis, Dept. of Electrical Engineering, Columbia University
[email protected]
http://www.ee.columbia.edu/~dpwe/e4896/

1. Sources, Mixtures, & Perception
2. Spatial Filtering
3. Time-Frequency Masking
4. Model-Based Separation



  • 1. Sources, Mixtures, & Perception

    Sound is a linear process (superposition): there is no "opacity," unlike vision, so sources combine into "auditory scenes" (polyphony).

    Humans perceive discrete sources, a subjective construct.

    [Figure: spectrogram of "02_m+s-15-evil-goodvoice-fade" (0-12 s, 0-4000 Hz, level in dB, -60 to 0), with labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant).]

  • Spatial Hearing

    People perceive sources based on spatial (binaural) cues: interaural time difference (ITD) and interaural level difference (ILD).

    ITD arises from the path-length difference between the two ears; ILD arises from head shadow (mainly at high frequencies).

    [Figure: left/right waveforms of "shatr78m3" around 2.2-2.235 s showing the inter-ear delay; source/head geometry after Blauert '96.]
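As a rough illustration (not from the slides), the ITD cue can be estimated by cross-correlating the left and right ear signals and picking the lag with the largest correlation. The helper name `estimate_itd` and the synthetic noise-burst signals below are made up for the sketch:

```python
import numpy as np

def estimate_itd(left, right, fs):
    """Estimate interaural time difference (seconds) as the lag that
    maximizes the cross-correlation of the two ear signals.
    Positive lag means the left channel is a delayed copy of the right."""
    n = len(left)
    corr = np.correlate(left, right, mode="full")  # lags -(n-1) .. (n-1)
    lag = np.argmax(corr) - (n - 1)
    return lag / fs

# Synthetic check: the same noise burst, arriving 10 samples later
# at the right ear (so the left channel leads).
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(1024)
delay = 10
left = np.concatenate([src, np.zeros(delay)])
right = np.concatenate([np.zeros(delay), src])
itd = estimate_itd(left, right, fs)
```

With this sign convention, a leading left channel gives a negative lag; real binaural processing would also use ILD and per-band (e.g. cross-correlogram) analysis rather than one broadband lag.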

  • Auditory Scene Analysis

    Spatial cues may not be enough, or may be unavailable (a single-channel signal).

    The brain uses signal-intrinsic cues, such as common onset and harmonicity, to form sources.

    [Figure: spectrogram (0-4 s, 0-4 kHz, level in dB, -40 to +30) of the Reynolds-McAdams oboe; Bregman '90.]

  • Auditory Scene Analysis

    "Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)

    Quite a challenge!

  • Audio Mixing

    Studio recording combines separate tracks into, e.g., 2 channels (stereo), with different levels, panning, and other effects.

    Stereo intensity panning manipulates ILD only, at constant power; with more channels, one option is to use just the nearest pair.

    [Figure: L and R constant-power gain curves (0 to 1) versus pan position (-1 to +1).]
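A minimal sketch of constant-power intensity panning as described above (the function name and the pan-to-angle mapping are this sketch's own choices, not from the slides): the two gains follow a sin/cos law so their squares always sum to one, keeping perceived power constant across pan positions.

```python
import numpy as np

def constant_power_pan(mono, pan):
    """Pan a mono signal to stereo.
    pan in [-1, 1]: -1 = hard left, +1 = hard right.
    Gains follow a cos/sin law, so gL**2 + gR**2 == 1 (constant power)."""
    theta = (pan + 1.0) * np.pi / 4.0   # map pan to an angle in [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono

# Center pan: both channels get gain 1/sqrt(2), i.e. -3 dB each.
left, right = constant_power_pan(np.ones(8), 0.0)
```

A simple linear crossfade (gL + gR = 1) would instead dip by about 3 dB at the center, which is why mixers use the constant-power law.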

  • 2. Spatial Filtering

    N sources detected by M sensors: need enough degrees of freedom, M ≥ N (else need other constraints).

    Consider the 2 × 2 case with directional mics; the mixing matrix is

      [m₁; m₂] = [a₁₁ a₁₂; a₂₁ a₂₂] [s₁; s₂]

    so the sources are recovered by inverting it:

      [ŝ₁; ŝ₂] = A⁻¹ m

  • Source Cancelation

    Simple 2 × 2 example:

      [m₁; m₂] = [1 0.5; 0.8 1] [s₁; s₂]

      m₁(t) = s₁(t) + 0.5 s₂(t)
      m₂(t) = 0.8 s₁(t) + s₂(t)

      ⇒ m₁(t) − 0.5 m₂(t) = 0.6 s₁(t)

    If there is no delay and the sums are linearly independent, each combination can cancel one source.
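The 2 × 2 cancellation above can be checked numerically; this sketch uses random stand-in sources and the slide's mixing matrix, and also recovers both sources by inverting the matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
s1, s2 = rng.standard_normal((2, 1000))   # two stand-in source signals

A = np.array([[1.0, 0.5],
              [0.8, 1.0]])                # mixing matrix from the slide
m1, m2 = A @ np.array([s1, s2])

# Cancel s2: m1 - 0.5*m2 = (1 - 0.5*0.8) * s1 = 0.6 * s1
est = m1 - 0.5 * m2

# Full unmixing: invert A to recover both sources exactly
s_hat = np.linalg.inv(A) @ np.array([m1, m2])
```

This only works because the mix is instantaneous (no delay) and the rows of A are linearly independent; real rooms add filtering, which is why convolutive methods are needed in practice.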

  • Independent Component Analysis

    "Blind" combinations can be separated by maximizing the independence of the outputs, using, e.g., kurtosis as a proxy for independence:

      kurt(y) = E[((y − µ)/σ)⁴] − 3

    [Figure: 2 × 2 mixing network (sources s₁, s₂ through a₁₁, a₁₂, a₂₁, a₂₂ to mics m₁, m₂); scatter plot of mix 1 vs. mix 2 showing the source directions; kurtosis vs. unmixing angle, peaking at the directions of s₁ and s₂.]

    Bell & Sejnowski '95
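The intuition behind the kurtosis criterion is the central limit theorem: a mixture of independent sources is "more Gaussian" (lower excess kurtosis) than the sources themselves, so an unmixing direction that maximizes kurtosis points back at a source. A small demonstration, using Laplacian noise as a stand-in for super-Gaussian audio sources (the mixing weights are arbitrary choices for the sketch):

```python
import numpy as np

def excess_kurtosis(y):
    """kurt(y) = E[((y - mu)/sigma)**4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
# Laplacian sources are super-Gaussian (excess kurtosis 3)
s1, s2 = rng.laplace(size=(2, 100_000))

# Any mixture of the two is pushed toward Gaussian, lowering kurtosis
mix = 0.6 * s1 + 0.8 * s2
```

Full ICA algorithms (e.g. Bell & Sejnowski's infomax) search for the unmixing matrix by gradient ascent on such a non-Gaussianity or mutual-information objective rather than evaluating one fixed direction.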

  • Microphone Arrays

    If interference is diffuse, we can simply boost energy from the target direction, e.g., a shotgun mic, or delay-and-sum beamforming.

    Caveats: off-axis spectral coloration; many variants exist (filter-and-sum, sidelobe cancelation, ...).

    [Figure: delay-and-sum array with element spacing D; beam patterns (0 to -40 dB) for λ = 4D, λ = 2D, λ = D; Benesty, Chen & Huang '08.]
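A minimal delay-and-sum sketch (integer-sample steering delays only; real beamformers use fractional delays derived from array geometry): each microphone is advanced by its steering delay so that signals from the look direction add coherently, while off-axis signals add incoherently and are attenuated.

```python
import numpy as np

def delay_and_sum(mics, delays_samples):
    """Advance each mic signal by its (integer) steering delay and average,
    so arrivals from the look direction line up and add coherently."""
    n = min(len(m) - d for m, d in zip(mics, delays_samples))
    return sum(m[d:d + n] for m, d in zip(mics, delays_samples)) / len(mics)

# Toy example: 3 mics receive the target 0, 3, and 6 samples late.
rng = np.random.default_rng(0)
target = rng.standard_normal(500)
mics = [np.concatenate([np.zeros(3 * i), target, np.zeros(6 - 3 * i)])
        for i in range(3)]
out = delay_and_sum(mics, [0, 3, 6])   # steered at the target direction
```

Steering with the wrong delays leaves the copies misaligned, which is exactly the off-axis attenuation shown in the beam patterns; the frequency dependence (λ vs. D) is what causes off-axis spectral coloration.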

  • 3. Time-Frequency Masking

    What if there is only one channel? We cannot have fixed cancellation, but we could have fast time-varying filtering.

    The trick is finding the right mask...

    [Figure: spectrograms (0-8 kHz, 0.5-3 s) of a mixture and the mask-selected regions; Brown & Cooke '94, Roweis '01.]

  • Time-Frequency Masking

    Works well for overlapping voices, though time-frequency resolution is a limitation.

    [Figure: spectrograms (0-8 kHz, 0-1 s, level in dB, -20 to +40) of the male and female originals, the mix with oracle labels, and the oracle-based resyntheses; audio: cooke-v3n7.wav, cooke-v3msk-ideal.wav, cooke-n7msk-ideal.wav.]
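The "oracle" labels above correspond to an ideal binary mask: keep each time-frequency cell where the target is stronger than the interference. A sketch (the simple magnitude-STFT helper, a sine "voice," and noise "interference" are all stand-ins for this illustration; resynthesis is omitted):

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude STFT with a periodic Hann window (analysis only)."""
    win = 0.5 * (1 - np.cos(2 * np.pi * np.arange(n_fft) / n_fft))
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # freq x time

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 220 * t)        # stand-in target source
noise = 0.3 * rng.standard_normal(fs)      # stand-in interference

S, N = stft_mag(voice), stft_mag(noise)
X = stft_mag(voice + noise)

mask = (S > N).astype(float)   # oracle: needs the true sources, hence "ideal"
S_hat = mask * X               # masked mixture approximates the target
```

The oracle mask is an upper bound on what masking can do; practical systems must estimate the mask from the mixture alone, using cues such as pitch, onsets, or spatial position.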

  • Pan-Based Filtering

    Time-frequency masking can be used even for stereo: e.g., calculate a "panning index" as the ILD, then mask cells matching that ILD.

    [Figure: mix spectrogram (0-6 kHz, 0-20 s, level in dB), per-cell ILD in dB (-5 to +5), and the result of an ILD mask (1024-pt window, -2.5 .. +1.0 dB); Avendano 2003.]
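A simplified stand-in for the panning-index idea (Avendano's actual index is a normalized cross-channel similarity; here plain per-cell ILD is used instead, and the toy "spectra" are made up): compute the level difference in each time-frequency cell and keep cells near the target pan position.

```python
import numpy as np

def ild_mask(L, R, target_db, tol_db=3.0, eps=1e-12):
    """Binary mask keeping T-F cells whose inter-channel level difference
    is within tol_db of the target pan position (a simple stand-in for
    a full panning index)."""
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    return (np.abs(ild - target_db) < tol_db).astype(float)

# Toy magnitude spectra: bins 0-3 carry a source mixed with gains (1.0, 0.5),
# i.e. about +6 dB ILD; bins 4-7 carry one mixed with gains (0.5, 1.0).
A = np.array([1, 2, 3, 4, 0, 0, 0, 0], dtype=float)
B = np.array([0, 0, 0, 0, 4, 3, 2, 1], dtype=float)
L = 1.0 * A + 0.5 * B
R = 0.5 * A + 1.0 * B

m = ild_mask(L, R, target_db=6.0)   # selects the left-panned source's cells
```

This works for intensity-panned studio mixes because each source has a fixed, known ILD; it breaks down when sources overlap in the same cells or when real-room delays make the level difference frequency dependent.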

  • Harmonic-Based Masking

    Time-frequency masking can also be used to pick out harmonics: given a pitch track, we know where to expect them.

    [Figure: spectrograms (0-4 kHz, 0-8 s) before and after harmonic masking; Denbigh & Zhao 1992.]

  • Harmonic Filtering

    Given a pitch track, we could use a time-varying comb filter to extract the harmonics, or isolate each harmonic by heterodyning:

      x̂(t) = Σₖ âₖ(t) cos(k ω̂₀(t) t)
      âₖ(t) = |LPF{ x(t) e^(−j k ω̂₀(t) t) }|

    i.e., multiply by a complex exponential to shift harmonic k down to DC, low-pass filter, and take the magnitude as its amplitude envelope.

    [Figure: spectrograms (0-4 kHz, 0-8 s) of the mixture and the heterodyne-resynthesized harmonics; Avery Wang 1995.]
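A numerical sketch of the heterodyne step for a constant pitch (the function name, the crude moving-average low-pass, and the two-harmonic test tone are this sketch's own choices; Wang's method tracks a time-varying ω̂₀(t)):

```python
import numpy as np

def harmonic_envelope(x, f0, k, fs, lpf_len=401):
    """Heterodyne harmonic k of f0 down to DC, low-pass filter
    (moving average here), and take the magnitude as a_k(t)."""
    t = np.arange(len(x)) / fs
    baseband = x * np.exp(-2j * np.pi * k * f0 * t)   # shift k*f0 to 0 Hz
    kernel = np.ones(lpf_len) / lpf_len               # crude low-pass
    # factor 2 recovers the real-signal amplitude from the analytic half
    return 2.0 * np.abs(np.convolve(baseband, kernel, mode="same"))

fs = 8000
t = np.arange(fs) / fs
f0 = 200.0
# Test tone: fundamental at amplitude 1.0, second harmonic at 0.5
x = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)

a2 = harmonic_envelope(x, f0, 2, fs)   # should settle near 0.5 mid-signal
```

The low-pass bandwidth must stay below f0/2 so that neighboring harmonics (shifted to multiples of f0 after heterodyning) are rejected; with a time-varying pitch the exponent uses the integrated phase ∫ k ω̂₀(τ) dτ rather than k ω̂₀ t.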

  • Nonnegative Matrix Factorization

    Decomposition of spectrograms into templates + activations:

      X = W · H

    A fast and forgiving gradient-descent algorithm; fits neatly with time-frequency masking.

    [Figure: spectrogram X (DFT index vs. DFT slices), the three bases from W, and the corresponding rows of H; Smaragdis '04; sound examples from Virtanen '03.]

    Lee & Seung '99; Abdallah & Plumbley '04; Smaragdis & Brown '03; Virtanen '07
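The factorization can be computed with Lee & Seung's multiplicative updates for the Euclidean cost ‖X − WH‖²; a minimal sketch on a toy "spectrogram" built from two known templates (the matrix sizes, iteration count, and random data are arbitrary choices for the demonstration):

```python
import numpy as np

def nmf(X, r, n_iter=500, seed=0):
    """Lee & Seung multiplicative updates minimizing ||X - W H||_F^2.
    Returns nonnegative templates W (freq x r) and activations H (r x time).
    The update rules preserve nonnegativity because they only multiply
    by nonnegative ratios."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r)) + 1e-3
    H = rng.random((r, X.shape[1])) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-12)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

# Toy "spectrogram": 2 spectral templates with sparse activations
rng = np.random.default_rng(1)
W0 = rng.random((64, 2))
H0 = rng.random((2, 100)) * (rng.random((2, 100)) > 0.5)
X = W0 @ H0

W, H = nmf(X, r=2)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

For separation, each learned component W[:, k] H[k, :] is typically turned into a time-frequency mask (its share of the modeled energy per cell) and applied to the mixture spectrogram, which is the "fits neatly with time-frequency masking" point above.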

  • 4. Model-Based Separation

    When N (sources) > M (sensors), additional constraints are needed to solve the problem, e.g., the assumption of a single dominant pitch.

    These can be assembled into a model M of the source signals that defines the set of "possible" waveforms; probabilistically, Pr(sᵢ | M).

    Source separation from the mixture then becomes inference:

      ŝ = {ŝᵢ} = argmaxₛ Pr(x | s, A) P(A) ∏ᵢ Pr(sᵢ | M)

    where Pr(x | s, A) = N(x | A s, Σ).

  • Source Models

    Models can constrain: source spectra (e.g., harmonic, noisy, smooth), temporal evolution (piecewise-continuous), and spatial arrangement (point-source vs. diffuse).

    Factored decomposition:

      Vex,j = Wex,j Uex,j Gex,j Hex,j
      Vex,j = Eex,j Pex,j, with Eex,j = Wex,j Uex,j and Pex,j = Gex,j Hex,j

    Ozerov, Vincent & Bimbot 2010
    http://bass-db.gforge.inria.fr/fasst/

    [Figure: spectrograms (0-4000 Hz, 0.5-3 s) of a stereo instantaneous mix and separated sources 1-3. Music: Shannon Hurley / Mix: Michel Desnoues & Alexey Ozerov / Separations: Alexey Ozerov.]

  • Summary

    Acoustic source mixtures: the normal situation in real-world sounds.

    Spatial filtering: canceling sources by subtracting channels.

    Time-frequency masking: selecting spectrogram cells.

    Model-based separation: exploiting regularities in source signals.

  • References

    S. Abdallah & M. Plumbley, "Polyphonic transcription by non-negative sparse coding of power spectra," Proc. Int. Symp. Music Info. Retrieval, 2004.
    C. Avendano, "Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications," Proc. IEEE WASPAA, Mohonk, pp. 55-58, 2003.
    A. Bell & T. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.
    J. Benesty, J. Chen, & Y. Huang, Microphone Array Signal Processing, Springer, 2008.
    J. Blauert, Spatial Hearing, MIT Press, 1996.
    A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
    G. Brown & M. Cooke, "Computational auditory scene analysis," Computer Speech and Language, vol. 8, no. 4, pp. 297-336, 1994.
    P. Denbigh & J. Zhao, "Pitch extraction and separation of overlapping speech," Speech Communication, vol. 11, no. 2-3, pp. 119-125, 1992.
    D. Lee & S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, p. 788, 1999.
    A. Ozerov, E. Vincent, & F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation," INRIA Tech. Rep. 7453, Nov. 2010. http://hal.inria.fr/docs/00/54/07/35/PDF/RR-7453.pdf
    S. Roweis, "One microphone source separation," Adv. Neural Info. Proc. Sys., pp. 793-799, 2001.
    P. Smaragdis & J. Brown, "Non-negative matrix factorization for polyphonic music transcription," Proc. IEEE WASPAA, pp. 177-180, Oct. 2003.
    T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Tr. Audio, Speech, & Lang. Proc., vol. 15, no. 3, pp. 1066-1074, 2007.
    A. Wang, Instantaneous and frequency-warped signal processing techniques for auditory source separation, Ph.D. dissertation, Stanford CCRMA, 1995.