LabROSA Research Overview

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] http://labrosa.ee.columbia.edu/

1. Real-World Sound
2. Speech Separation
3. Environmental Audio Classification
4. Music Audio Analysis

LabROSA Overview - Dan Ellis 2011-09-09

LabROSA Overview

[Figure: overview diagram relating Information Extraction, Machine Learning, and Signal Processing to the domains Speech, Music, and Environment and to the tasks Recognition, Retrieval, and Separation]

• Getting information from sound

1. Real-World Sound

• Sounds rarely occur in isolation, so analyzing mixtures (“scenes”) is a problem for both humans and machines

[Figure: spectrogram of 02_m+s-15-evil-goodvoice-fade (frq/Hz 0-4000 vs. time/s 0-12, level / dB -60 to 0), with an analysis into its components: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]

Auditory Scene Analysis

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman ’90)

• Received waveform is a mixture
    2 sensors, N sources - underconstrained

• Use prior knowledge (models) to constrain
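To make "underconstrained" concrete, here is a minimal sketch assuming an instantaneous linear mixture of 3 sources onto 2 sensors. The mixing gains in `A` and the source values are hypothetical, chosen only for illustration. It shows two different source vectors that produce exactly the same sensor observations, which is why a prior model is needed to pick one.

```python
import numpy as np

# Hypothetical mixing gains: 2 sensors (rows), 3 sources (columns).
A = np.array([[1.0, 0.6, 0.2],
              [0.3, 0.8, 1.0]])
s_true = np.array([1.0, -2.0, 0.5])   # one sample of each source
x = A @ s_true                        # what the two sensors receive

# Minimum-norm source estimate that explains the observation.
s_min = np.linalg.pinv(A) @ x

# The last right singular vector spans the null space of A (rank 2, 3 columns),
# so adding any multiple of it leaves the sensor signals unchanged.
null_dir = np.linalg.svd(A)[2][-1]
s_alt = s_min + 3.0 * null_dir

print(np.allclose(A @ s_min, x), np.allclose(A @ s_alt, x))
```

Both candidates reproduce `x`, so the observations alone cannot decide between them.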

2. Speech Separation

• Given models for sources, find “best” (most likely) states for spectra:

    combination model:
    $$p(x \mid i_1, i_2) = \mathcal{N}(x;\, c_{i_1} + c_{i_2},\, \Sigma)$$

    inference of source state:
    $$\{i_1(t), i_2(t)\} = \arg\max_{i_1, i_2}\, p(x(t) \mid i_1, i_2)$$

    can include sequential constraints...

• E.g. stationary noise:

[Figure: three spectrograms (freq / mel bin vs. time / s): Original speech; In speech-shaped noise (mel magsnr = 2.41 dB); VQ inferred states (mel magsnr = 3.6 dB)]

Roweis ’01, ’03; Kristjansson ’04, ’06
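The frame-wise inference above can be sketched as an exhaustive search over codeword pairs. This is a minimal illustration, not the cited systems: the codebooks `C1` and `C2`, the diagonal covariance `var`, and all sizes are hypothetical, and sequential constraints are ignored.

```python
import numpy as np

def infer_states(X, C1, C2, var):
    """For each frame x(t), return argmax over (i1, i2) of N(x; c_i1 + c_i2, diag(var)).

    X: (T, D) observed frames; C1: (K1, D), C2: (K2, D) codebooks; var: (D,).
    """
    means = C1[:, None, :] + C2[None, :, :]            # (K1, K2, D) combined means
    diff = X[:, None, None, :] - means[None]           # (T, K1, K2, D)
    # Log-likelihood up to a constant that does not depend on (i1, i2).
    loglik = -0.5 * np.sum(diff ** 2 / var, axis=-1)   # (T, K1, K2)
    flat = loglik.reshape(len(X), -1).argmax(axis=1)
    return np.unravel_index(flat, (C1.shape[0], C2.shape[0]))

# Toy data: 5 codewords for source 1, 3 for source 2, 8 spectral bins.
rng = np.random.default_rng(0)
C1 = rng.normal(size=(5, 8))
C2 = rng.normal(size=(3, 8))
true_i1 = np.array([3, 0, 4, 1])
true_i2 = np.array([1, 2, 0, 1])
X = C1[true_i1] + C2[true_i2] + 0.01 * rng.normal(size=(4, 8))

i1, i2 = infer_states(X, C1, C2, var=np.ones(8))
print(i1, i2)
```

The search costs K1 x K2 likelihood evaluations per frame, which is one reason compact source models matter.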

Eigenvoices

• Idea: Find speaker model parameter space
    generalize without losing detail?

• Eigenvoice model:
    Speaker models: 280 states x 320 bins = 89,600 dimensions
    Speaker subspace bases: 10-30 dimensions

    $$\mu = \bar{\mu} + U w + B h$$

    where $\mu$ is the adapted model, $\bar{\mu}$ the mean voice, $U$ the eigenvoice bases, $w$ the weights, $B$ the channel bases, and $h$ the channel weights

[Figure: Mean Voice and Eigenvoice dimensions 1-3, each shown as Frequency (kHz, 0-8) against phone states: b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aa aw ay ah ao ow uw ax]

Weiss & Ellis ’09, ’10
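A minimal sketch of the subspace idea, under simplifying assumptions: the eigenvoice bases are taken as principal directions of toy training speaker models, the channel term B h is omitted, and the weights are fit by least squares to a clean target. All sizes and data here are hypothetical stand-ins. Adapting from a mixture, as in the cited work, is harder than this.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, dim, n_eig = 20, 50, 3      # toy sizes, not 89,600 dimensions

# Toy training set: one stacked model-mean vector per speaker,
# generated so that speakers really do differ along only 3 directions.
true_dirs = rng.normal(size=(dim, n_eig))
train = rng.normal(size=(n_speakers, n_eig)) @ true_dirs.T + 5.0

mu_bar = train.mean(axis=0)             # mean voice
# Principal directions of the centered speaker models act as eigenvoice bases.
_, _, Vt = np.linalg.svd(train - mu_bar, full_matrices=False)
U = Vt[:n_eig].T                        # (dim, n_eig)

# Adapt to an unseen speaker by solving for the weights w in mu = mu_bar + U w.
new_speaker = rng.normal(size=n_eig) @ true_dirs.T + 5.0
w, *_ = np.linalg.lstsq(U, new_speaker - mu_bar, rcond=None)
mu_adapted = mu_bar + U @ w

err = np.linalg.norm(mu_adapted - new_speaker)
print(w.shape, err)
```

Only `n_eig` numbers are estimated per speaker, rather than the full model.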

Speaker-Adapted Separation

• Eigenvoices for Speech Separation task
    speaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-independent (SI)

[Example panels: Mix, SA]

3. Soundtrack Classification

• Short video clips as the evolution of snapshots
    10-100 sec, one location, no editing
    browsing?

• Need information for indexing...
    video + audio
    foreground + background

MFCC Covariance Representation

• Each clip/segment → fixed-size statistics
    similar to speaker ID and music genre classification

• Full Covariance matrix of MFCCs maps the kinds of spectral shapes present

• Clip-to-clip distances for SVM classifier
    by KL or 2nd Gaussian model

[Figure: Video Soundtrack, MFCC features, and MFCC Covariance Matrix for clip VTS_04_0001: spectrogram (freq / kHz vs. time / sec, level / dB), MFCC features (MFCC bin vs. time / sec), and MFCC covariance (MFCC dimension vs. MFCC dimension)]
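The clip-level statistics and a KL-based clip-to-clip distance can be sketched as follows. This is an illustration under assumptions: random arrays stand in for real MFCC frames, each clip is summarized by a single full-covariance Gaussian, and the symmetrized KL divergence is used as the distance. The exact distance and kernel used in the cited work are not stated on the slide.

```python
import numpy as np

def clip_stats(mfcc):
    """Summarize a clip's MFCC frames (n_frames, n_coeffs) as mean and full covariance."""
    return mfcc.mean(axis=0), np.cov(mfcc, rowvar=False)

def kl_gauss(mu_p, S_p, mu_q, S_q):
    """KL(p || q) between two full-covariance Gaussians."""
    d = mu_p.size
    S_q_inv = np.linalg.inv(S_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(S_p)
    _, logdet_q = np.linalg.slogdet(S_q)
    return 0.5 * (np.trace(S_q_inv @ S_p) + diff @ S_q_inv @ diff
                  - d + logdet_q - logdet_p)

def sym_kl(a, b):
    """Symmetrized KL, usable as a clip-to-clip distance."""
    return kl_gauss(*a, *b) + kl_gauss(*b, *a)

# Stand-in "MFCC" frames for two clips with different statistics.
rng = np.random.default_rng(0)
clip_a = clip_stats(rng.normal(size=(500, 20)))
clip_b = clip_stats(rng.normal(loc=0.5, scale=2.0, size=(500, 20)))

d_ab = sym_kl(clip_a, clip_b)
d_ba = sym_kl(clip_b, clip_a)
d_aa = sym_kl(clip_a, clip_a)
print(d_ab, d_aa)
```

A matrix of such distances over all clip pairs could then be turned into an SVM kernel, for example by exponentiating the negated distance. That last step is a common choice, assumed here rather than taken from the slide.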

Classification Results

• All classifiers vs. all labels
    some concepts are more audio-related

• Mutual Information Proportion:

    $$\mathrm{MIP} = \frac{I(\text{classifier};\, \text{label})}{H(\text{label})}$$

[Figure: classifiers vs. labels matrices over the CCV concepts (Basketball, Baseball, Soccer, IceSkating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday, WedRecep, WedCerem, WedDance, MusicPerf, NonMusicPerf, Parade, Beach, Playground, RAND): CCV Average Precision (mean=0.300) and Mutual Info Prop (mean=0.175)]

Chang, Ellis et al. ’07; Lee & Ellis ’10
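The MIP ratio can be computed directly from paired classifier decisions and ground-truth labels. The sketch below assumes binary (thresholded) classifier outputs and binary labels, which is an assumption for illustration. The function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zeros are skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mip(decisions, labels):
    """Mutual Information Proportion: I(classifier; label) / H(label)."""
    joint = np.zeros((2, 2))
    for d, l in zip(decisions, labels):
        joint[d, l] += 1                      # count each (decision, label) pair
    joint /= joint.sum()
    h_label = entropy(joint.sum(axis=0))      # marginal over decisions
    h_dec = entropy(joint.sum(axis=1))        # marginal over labels
    mutual_info = h_dec + h_label - entropy(joint.ravel())
    return mutual_info / h_label

labels = np.array([0, 0, 1, 1])
mip_perfect = mip(np.array([0, 0, 1, 1]), labels)   # decisions equal labels
mip_useless = mip(np.array([0, 1, 0, 1]), labels)   # decisions independent of labels
print(mip_perfect, mip_useless)
```

Dividing by H(label) normalizes for how rare or common a label is, so a classifier that fully determines the label scores 1 regardless of class balance.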

Matching Videos via Fingerprints

• Landmark pairs are a noise-robust fingerprint

• Use to match distinct videos with same sound ambience

[Figure: spectrograms (freq / kHz, 0-4, vs. time / sec) of matching excerpts: Video IMpLQaiHWbE at 195 s and Video Yi1hkNkqHBc at 218 s]

Cotton & Ellis ’10
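A minimal sketch of landmark-pair fingerprinting and matching, assuming spectral peaks have already been picked as (time frame, frequency bin) tuples. The pairing limits (`max_dt`, `max_df`, `fan_out`) and the peak lists are hypothetical. Each pair of nearby peaks becomes a hash, and two recordings match when many hashes agree on a single time offset.

```python
from collections import Counter

def landmark_hashes(peaks, max_dt=32, max_df=16, fan_out=3):
    """Pair each peak with up to fan_out later peaks; return (hash, anchor_time) items."""
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                               # later peaks are even further away
            if dt > 0 and abs(f2 - f1) <= max_df:
                out.append(((f1, f2 - f1, dt), t1)) # hash = (freq, freq step, time step)
                paired += 1
                if paired == fan_out:
                    break
    return out

def best_offset(query, reference):
    """Vote on the time offset implied by each hash shared between query and reference."""
    index = {}
    for h, t in reference:
        index.setdefault(h, []).append(t)
    votes = Counter(t_ref - t_q for h, t_q in query for t_ref in index.get(h, []))
    return votes.most_common(1)[0] if votes else (None, 0)

# Toy peaks: the query holds the same events 100 frames earlier, plus one spurious peak.
ref_peaks = [(110, 40), (114, 45), (120, 38), (127, 50), (133, 44), (141, 47)]
query_peaks = [(t - 100, f) for t, f in ref_peaks] + [(22, 90)]

offset, n_votes = best_offset(landmark_hashes(query_peaks), landmark_hashes(ref_peaks))
print(offset, n_votes)
```

Because a hash depends only on a few strong peaks, added noise can remove some landmarks without destroying the rest, which is the sense in which the fingerprint is noise-robust.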

4. Music Audio Analysis

• ... at all levels from notes to genres

[Figure: analysis layers for “Let it Be” (final verse): Signal (spectrogram, freq / kHz vs. time / s, level / dB), Melody and Piano transcriptions (C2-C5), Onsets & Beats, Per-frame chroma, and Per-beat normalized chroma (A C D E G vs. time / beats, intensity 0-1)]

Polyphonic Transcription

• Apply the Eigenvoice idea to music
    eigeninstruments?

• Subspace NMF

Grindlay & Ellis ’09
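As background for the NMF step, here is a sketch of plain NMF with multiplicative updates for the squared-error objective. It is not the subspace variant named on the slide, which additionally constrains the instrument templates to a learned subspace. The toy matrix stands in for a magnitude spectrogram, with `W` as note templates and `H` as their activations over time.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0):
    """Factor nonnegative V (n_bins, n_frames) as W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3   # spectral templates
    H = rng.random((rank, V.shape[1])) + 1e-3   # activations over time
    eps = 1e-12                                 # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": two templates with random nonnegative activations.
rng = np.random.default_rng(1)
V = rng.random((30, 2)) @ rng.random((2, 40))

W0, H0 = nmf(V, rank=2, n_iter=0)               # random initialization only
W, H = nmf(V, rank=2)
err_init = np.linalg.norm(V - W0 @ H0)
err_final = np.linalg.norm(V - W @ H)
print(err_init, err_final)
```

The updates keep both factors nonnegative, so a row of `H` can be read as when a template sounds.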

Melodic-Harmonic Mining

• Million Song Dataset
    as Echo Nest Analyze

• Frequent clusters of 12 x 8 binarized event-chroma

[Figure: the 100 most frequent 12 x 8 binarized event-chroma patterns, each labeled with rank and count, from #1 (3491), #2 (2775), #3 (2255) down to #100 (395)]

[Diagram blocks: Music audio, Beat tracking, Chroma features, Key normalization, Landmark identification, Locality Sensitive Hash Table]

[Figure panels: Original, Reconstruction]

Bertin-Mahieux et al. ’10, ’11
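Counting frequent binarized patches can be sketched as below. This is an illustration under assumptions: the 0.5 threshold, the toy chroma matrix, and exact-match counting are all stand-ins. Key normalization and the clustering used in the cited work are omitted.

```python
from collections import Counter
import numpy as np

def binarized_patches(chroma, width=8, thresh=0.5):
    """Yield every 12 x width patch of a (12, n_events) chroma array as a hashable key."""
    binary = (chroma >= thresh).astype(np.uint8)
    for start in range(binary.shape[1] - width + 1):
        yield binary[:, start:start + width].tobytes()

# Toy chroma: one binary motif repeated three times, separated by random filler.
rng = np.random.default_rng(0)
motif = (rng.random((12, 8)) > 0.7).astype(float)
chroma = np.hstack([motif, rng.random((12, 8)), motif, rng.random((12, 8)), motif])

counts = Counter(binarized_patches(chroma))
top_key, top_count = counts.most_common(1)[0]
top_patch = np.frombuffer(top_key, dtype=np.uint8).reshape(12, 8)
print(top_count)
```

Binarizing makes patches directly comparable, so the most common keys correspond to the most frequent patterns.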

Results - Beatles

• Over 86 Beatles tracks

• All beat offsets = 41,705 patches
    LSH takes 300 sec - approx N log N in patches?

• High-pass along time to avoid sustained notes

• Song filter
    remove hits in same track

[Figure: matching beat-chroma patches (chroma bin, 1-12, vs. beat): 02-I Should Have Known Better 92.4-97.7s; 05-Here There And Everywhere 12.1-20.5s; 09-Martha My Dear 90.9-98.6s; 12-Piggies 22.0-29.6s]
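The steps above can be sketched together as follows: patches at all beat offsets, a high-pass along time, hashing, and the song filter. The random-hyperplane hash, the first-difference filter, the patch width, and the toy tracks are assumptions for illustration. The slide does not specify which LSH scheme or filter was used.

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

WIDTH = 8   # beats per patch (hypothetical choice for this sketch)

def patch_vector(patch):
    """High-pass along time (first difference over beats), then flatten."""
    return np.diff(patch, axis=1).ravel()

def lsh_hits(tracks, n_bits=16, seed=0):
    """Return pairs of (track, offset) sharing a hash bucket, across different tracks only."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, 12 * (WIDTH - 1)))    # random hyperplanes
    buckets = defaultdict(list)
    for track_id, chroma in enumerate(tracks):
        for offset in range(chroma.shape[1] - WIDTH + 1):   # all beat offsets
            vec = patch_vector(chroma[:, offset:offset + WIDTH])
            key = (planes @ vec > 0).tobytes()               # sign pattern as bucket key
            buckets[key].append((track_id, offset))
    hits = []
    for members in buckets.values():
        for a, b in combinations(members, 2):
            if a[0] != b[0]:                                 # song filter
                hits.append((a, b))
    return hits

# Two toy tracks sharing one 8-beat motif at different positions.
rng = np.random.default_rng(1)
motif = rng.random((12, WIDTH))
track0 = np.hstack([rng.random((12, 10)), motif, rng.random((12, 6))])
track1 = np.hstack([rng.random((12, 4)), motif, rng.random((12, 9))])

hits = lsh_hits([track0, track1])
print(hits)
```

Only patches that land in the same bucket are compared, which avoids the all-pairs comparison over tens of thousands of patches.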

Summary

• LabROSA : getting information from sound

• Speech
    monaural separation using eigenvoices
    binaural + reverb using MESSL

• Environmental
    classification of consumer video
    landmark-based events and matching

• Music
    transcription of notes, chords, ...
    large corpus mining

• http://labrosa.ee.columbia.edu/