Speech Separation: Learning - Dan Ellis 2004-11-05
1. Scene Analysis systems
2. Disambiguation
3. Learning
Learning and Scene Analysis Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio, Dept. Electrical Eng., Columbia Univ., NY USA
1. Scene Analysis Systems
• “Scene Analysis”: not necessarily separation, recognition, ...
- scene = overlapping objects, ambiguity
• General Framework:
- distinguish input and output representations
- distinguish engine (algorithm) and control (computational model)
Human and Machine Scene Analysis
• CASA (Brown’92 et seq.):
- Input: periodicity, continuity, onset “maps”
- Output: waveform (or mask)
- Engine: time-frequency masking
- Control: “grouping cues” from input, or spatial features (Roman, ...)
• ICA (Bell & Sejnowski et seq.):
- Input: waveform (or STFT)
- Output: waveform (or STFT)
- Engine: cancelation
- Control: statistical independence of outputs, or energy minimization for beamforming
• Human Listeners:
- Input: excitation patterns ...
- Output: percepts ...
- Engine: ?
- Control: find a plausible explanation
• Scene ⇒ multiple possible explanations; Analysis ⇒ choose the most reasonable one
• “Most reasonable” means...
- consistent with grouping cues (CASA)
- independent sources (ICA)
- consistent with experience ... (human)
- max P({Si} | X) ∝ P(X | {Si}) P({Si})
• i.e. some kind of constraints to disambiguate
- learning as the source of this disambiguation knowledge
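The MAP criterion above can be sketched numerically. This is a toy illustration, not the talk's actual system: the likelihood P(X | {Si}) and prior P({Si}) are invented unit-variance Gaussians, and two candidate explanations of the same mixture are scored; the one closer to the learned priors wins even though both reconstruct X exactly.

```python
import numpy as np

def log_likelihood(x, sources):
    """log P(X | {S_i}): how well the summed sources explain the mixture."""
    residual = x - np.sum(sources, axis=0)
    return -0.5 * np.sum(residual ** 2)   # unit-variance Gaussian noise model

def log_prior(sources, means):
    """log P({S_i}): how 'typical' each source is under its learned model."""
    return sum(-0.5 * np.sum((s - m) ** 2) for s, m in zip(sources, means))

def map_score(x, sources, means):
    # log P({S_i} | X) = log P(X | {S_i}) + log P({S_i}) + const
    return log_likelihood(x, sources) + log_prior(sources, means)

# Two candidate explanations of the same mixture; the more 'reasonable'
# one (matching the priors) wins even though both reconstruct X exactly.
means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([1.0, 1.0])
cand_a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # typical sources
cand_b = [np.array([2.0, -1.0]), np.array([-1.0, 2.0])]  # atypical sources
best = max([cand_a, cand_b], key=lambda s: map_score(x, s, means))
```

Both candidates have zero reconstruction error, so the prior alone breaks the tie: this is exactly the "knowledge as disambiguation" role described above.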
2. Disambiguation
[Diagram: disambiguation via combination physics and source models]
3. Learning
• “Reasonable” = like what we’ve seen before?
- i.e. infer source models P({Si}) from observations
• Ways to learn:
- “memorize” instances
- generalize to a subspace (linear or parametric)
• Learning and Recognition:
- recognition is classification: set of possible labels
- learning properties (distinctions) as best approach
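The two "ways to learn" above can be contrasted in a few lines. This sketch uses invented random data: memorizing instances is nearest-neighbour lookup over stored examples, while generalizing is fitting a linear subspace (top principal directions) and explaining a new query as its projection onto that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
# invented training data living mostly in a 3-D subspace of an 8-D space
train = rng.normal(size=(100, 8)) @ np.diag([3., 2., 1., .1, .1, .1, .1, .1])

def memorize(query, instances):
    """'Memorize instances': return the stored example closest to the query."""
    d = np.sum((instances - query) ** 2, axis=1)
    return instances[np.argmin(d)]

def subspace_model(instances, dims=3):
    """'Generalize to a subspace': fit the top principal directions."""
    u, s, vt = np.linalg.svd(instances - instances.mean(0),
                             full_matrices=False)
    return instances.mean(0), vt[:dims]

def project(query, mean, basis):
    """Explain a query as the nearest point in the learned subspace."""
    return mean + (query - mean) @ basis.T @ basis

mean, basis = subspace_model(train)
q = rng.normal(size=8)
near = memorize(q, train)          # an exactly remembered instance
approx = project(q, mean, basis)   # a generalization off the training set
```

The memorized answer is always one of the stored examples; the subspace answer can be a point never seen in training, which is the sense in which it generalizes.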
Disambiguating with Knowledge
• Use strength of match to models as the “reasonableness” measure for control
• e.g. MAXVQ (Roweis’03):
- learn a dictionary of spectrogram slices
- find the ones that ‘fit’ (or a combination)
- ... then filter out the excess energy
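A minimal sketch of the MAXVQ idea, with an invented tiny dictionary: each source is modeled by a codebook of log-spectrogram slices, a mixture frame is explained by the element-wise max of one codeword per source, and the target is recovered by keeping only the bins where its codeword dominates.

```python
import numpy as np

def maxvq_separate(frame, speech_book, noise_book):
    """Explain a mixture frame as max(speech codeword, noise codeword)."""
    best, best_err = None, np.inf
    for s in speech_book:            # exhaustive search over codeword pairs
        for n in noise_book:
            err = np.sum((frame - np.maximum(s, n)) ** 2)
            if err < best_err:
                best_err, best = err, (s, n)
    s, n = best
    mask = s >= n                    # bins where speech dominates
    return np.where(mask, frame, 0.0), mask

# toy 4-bin log-magnitude codebooks (purely illustrative values)
speech_book = np.array([[5., 1., 4., 0.], [1., 5., 0., 4.]])
noise_book = np.array([[2., 2., 2., 2.]])
mix = np.maximum(speech_book[0], noise_book[0])   # ideal max-mixed frame
recovered, mask = maxvq_separate(mix, speech_book, noise_book)
```

The real system searches much larger codebooks (512 speech, 32 noise in the experiment below) over spectrogram slices rather than single toy frames, but the fit-then-mask logic is the same.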
MAXVQ Results: Denoising
[Spectrogram panels (frequency vs. time): noise-corrupt speech and its denoised reconstructions]
Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
Matching templates (from Sam Roweis’s Montreal 2003 presentation)
Recognition for Separation
• Speech recognizers embody knowledge:
- trained on 100s of hours of speech
- use them as a ‘reasonableness’ measure
• e.g. Seltzer, Raj, Reyes: the speech recognizer’s best match provides the optimization target
Manuel J. Reyes-Gomez, Daniel P.W. Ellis and Bhiksha Raj (Columbia University / Mitsubishi Electric Research Laboratories), Workshop on Applications of Signal Processing to Audio and Acoustics, October 19th 2003
Target Estimation Iteration & Filter Training
[Diagram: per-speaker filter banks (Filter 1 ... Filter M) applied to microphone inputs (MIC1 ... MICM), with ASR alignment and factorial processing providing Target 1 and Target 2 for the filter-training error feedback] (from Manuel Reyes’s WASPAA 2003 presentation)
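A much-reduced sketch of the filter-training step: a recognizer-derived target stands in for the clean reference, and a filter is fit by least squares to bring the microphone signal toward it. Everything here is invented for illustration (single channel, a 3-tap FIR filter, a synthetic target); the actual system trains multi-channel filters per speaker and iterates the target estimation.

```python
import numpy as np

def fit_filter(mic, target, taps=3):
    """Least-squares FIR filter bringing the mic signal toward the target."""
    n = len(mic)
    X = np.zeros((n, taps))
    for k in range(taps):
        X[k:, k] = mic[:n - k]      # X[t, k] = mic[t - k] (causal taps)
    h, *_ = np.linalg.lstsq(X, target, rcond=None)
    return h

rng = np.random.default_rng(0)
mic = rng.normal(size=64)
h_true = np.array([0.5, 0.3, 0.2])
# stand-in for a recognizer-derived target: here, a known filtering of mic
target = np.convolve(mic, h_true)[:64]
h = fit_filter(mic, target)
```

When the target is exactly reachable by a 3-tap filter, the least-squares solution recovers it; with a noisy ASR-derived target the same machinery gives the best approximating filter, which is the role it plays in the iteration above.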
Learning Elsewhere
• Control: learn what is “reasonable”
• Input: discriminant features, learned subspaces
• Engine: clustering parameters
• Output: restoration ...
Obliteration and Outputs
• Perfect separation is rarely possible:
- e.g. no cancelation after psychoacoustic coding
- strong interference will obliterate part of the target
• What should the output be?
- can fill in missing-data holes using source models: ‘pretend’ we observed the full signal
- but this hides the observed/inferred distinction
- output internal model state instead? e.g. ASR output
- depends on eventual use ...
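The missing-data fill-in can be sketched in a few lines. This is an invented toy, not the talk's system: bins obliterated by interference are marked missing, a codebook of spectral shapes is matched on the observed bins only, and the holes are imputed from the best-matching codeword. Note the output no longer distinguishes observed from inferred bins, which is exactly the caveat raised above.

```python
import numpy as np

def fill_in(frame, observed, codebook):
    """Impute missing spectral bins from the best-matching codeword."""
    errs = [np.sum((frame[observed] - c[observed]) ** 2) for c in codebook]
    best = codebook[int(np.argmin(errs))]
    out = frame.copy()
    out[~observed] = best[~observed]     # holes filled from the model
    return out

# toy 4-bin spectral codebook (purely illustrative values)
codebook = np.array([[4., 1., 3., 0.], [0., 3., 1., 4.]])
frame = np.array([4., 1., 0., 0.])       # last two bins obliterated
observed = np.array([True, True, False, False])
restored = fill_in(frame, observed, codebook)
```

The match is decided entirely by the two surviving bins; the restored frame then looks complete, so any downstream consumer must be told separately which bins were actually observed.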
Conclusions
• Framework for scene analysis: Input, Output, Engine, Control
• Scene analysis as disambiguation: finding the additional constraints
• Learning to spot a reasonable solution
• Various implementations:
- direct dictionary fit
- compare output to recognizer’s state
• Learned states as the output?