Speech Separation: Learning - Dan Ellis 2004-11-05
1. Scene Analysis systems
2. Disambiguation
3. Learning
Learning and Scene Analysis Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio, Dept. Electrical Eng., Columbia Univ., NY USA
1. Scene Analysis Systems
• “Scene Analysis”: not necessarily separation, recognition, ...
- scene = overlapping objects, ambiguity
• General Framework:
- distinguish input and output representations
- distinguish engine (algorithm) and control (computational model)
Human and Machine Scene Analysis
• CASA (Brown’92 et seq.):
- Input: periodicity, continuity, onset “maps”
- Output: waveform (or mask)
- Engine: time-frequency masking
- Control: “grouping cues” from input, or spatial features (Roman, ...)
• ICA (Bell & Sejnowski et seq.):
- Input: waveform (or STFT)
- Output: waveform (or STFT)
- Engine: cancelation
- Control: statistical independence of outputs, or energy minimization for beamforming
• Human Listeners:
- Input: excitation patterns ...
- Output: percepts ...
- Engine: ?
- Control: find a plausible explanation
• Scene ⇒ multiple possible explanations; Analysis ⇒ choose the most reasonable one
• “Most reasonable” means...
- consistent with grouping cues (CASA)
- independent sources (ICA)
- consistent with experience ... (human)
- max P({Si} | X) ∝ P(X | {Si}) P({Si})
• i.e. some kind of constraints to disambiguate
- learning as the source of this disambiguation knowledge
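The MAP criterion above can be sketched numerically. This is a toy illustration, not the talk's actual system: the likelihood P(X | {Si}) and prior P({Si}) are invented unit-variance Gaussians, and two candidate explanations of the same mixture are scored; the one closer to the learned priors wins even though both reconstruct X exactly.

```python
import numpy as np

def log_likelihood(x, sources):
    """log P(X | {S_i}): how well the summed sources explain the mixture."""
    residual = x - np.sum(sources, axis=0)
    return -0.5 * np.sum(residual ** 2)   # unit-variance Gaussian noise model

def log_prior(sources, means):
    """log P({S_i}): how 'typical' each source is under its learned model."""
    return sum(-0.5 * np.sum((s - m) ** 2) for s, m in zip(sources, means))

def map_score(x, sources, means):
    # log P({S_i} | X) = log P(X | {S_i}) + log P({S_i}) + const
    return log_likelihood(x, sources) + log_prior(sources, means)

# Two candidate explanations of the same mixture; the more 'reasonable'
# one (matching the priors) wins even though both reconstruct X exactly.
means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([1.0, 1.0])
cand_a = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # typical sources
cand_b = [np.array([2.0, -1.0]), np.array([-1.0, 2.0])]  # atypical sources
best = max([cand_a, cand_b], key=lambda s: map_score(x, s, means))
```

Both candidates have zero reconstruction error, so the prior alone breaks the tie: this is exactly the "knowledge as disambiguation" role described above.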
2. Disambiguation
[Diagram: disambiguation via combination physics and source models]
3. Learning
• “Reasonable” = like what we’ve seen before?
- i.e. infer source models P({Si}) from observations
• Ways to learn:
- “memorize” instances
- generalize to a subspace (linear or parametric)
• Learning and Recognition:
- recognition is classification: set of possible labels
- learning properties (distinctions) as best approach
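The two "ways to learn" above can be contrasted in a few lines. This sketch uses invented random data: memorizing instances is nearest-neighbour lookup over stored examples, while generalizing is fitting a linear subspace (top principal directions) and explaining a new query as its projection onto that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
# invented training data living mostly in a 3-D subspace of an 8-D space
train = rng.normal(size=(100, 8)) @ np.diag([3., 2., 1., .1, .1, .1, .1, .1])

def memorize(query, instances):
    """'Memorize instances': return the stored example closest to the query."""
    d = np.sum((instances - query) ** 2, axis=1)
    return instances[np.argmin(d)]

def subspace_model(instances, dims=3):
    """'Generalize to a subspace': fit the top principal directions."""
    u, s, vt = np.linalg.svd(instances - instances.mean(0),
                             full_matrices=False)
    return instances.mean(0), vt[:dims]

def project(query, mean, basis):
    """Explain a query as the nearest point in the learned subspace."""
    return mean + (query - mean) @ basis.T @ basis

mean, basis = subspace_model(train)
q = rng.normal(size=8)
near = memorize(q, train)          # an exactly remembered instance
approx = project(q, mean, basis)   # a generalization off the training set
```

The memorized answer is always one of the stored examples; the subspace answer can be a point never seen in training, which is the sense in which it generalizes.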
Disambiguating with Knowledge
• Use strength of match to models as the “reasonableness” measure for control
• e.g. MAXVQ (Roweis’03):
- learn a dictionary of spectrogram slices
- find the ones that ‘fit’ (or a combination)
- ... then filter out the excess energy
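A minimal sketch of the MAXVQ idea, with an invented tiny dictionary: each source is modeled by a codebook of log-spectrogram slices, a mixture frame is explained by the element-wise max of one codeword per source, and the target is recovered by keeping only the bins where its codeword dominates.

```python
import numpy as np

def maxvq_separate(frame, speech_book, noise_book):
    """Explain a mixture frame as max(speech codeword, noise codeword)."""
    best, best_err = None, np.inf
    for s in speech_book:            # exhaustive search over codeword pairs
        for n in noise_book:
            err = np.sum((frame - np.maximum(s, n)) ** 2)
            if err < best_err:
                best_err, best = err, (s, n)
    s, n = best
    mask = s >= n                    # bins where speech dominates
    return np.where(mask, frame, 0.0), mask

# toy 4-bin log-magnitude codebooks (purely illustrative values)
speech_book = np.array([[5., 1., 4., 0.], [1., 5., 0., 4.]])
noise_book = np.array([[2., 2., 2., 2.]])
mix = np.maximum(speech_book[0], noise_book[0])   # ideal max-mixed frame
recovered, mask = maxvq_separate(mix, speech_book, noise_book)
```

The real system searches much larger codebooks (512 speech, 32 noise in the experiment below) over spectrogram slices rather than single toy frames, but the fit-then-mask logic is the same.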
MAXVQ Results: Denoising
[Spectrogram panels (frequency vs. time): noise-corrupt speech and its denoised reconstructions]
Training: 300 sec of isolated speech (TIMIT) to fit 512 codewords, and 100 sec of isolated noise (NOISEX) to fit 32 codewords; testing on new signals at 0 dB SNR
Matching templates (from Sam Roweis’s Montreal 2003 presentation)
Recognition for Separation
• Speech recognizers embody knowledge:
- trained on 100s of hours of speech
- use them as a ‘reasonableness’ measure
• e.g. Seltzer, Raj, Reyes: the speech recognizer’s best match provides the optimization target
Manuel J. Reyes-Gomez, Daniel P.W. Ellis and Bhiksha Raj (Columbia University / Mitsubishi Electric Research Laboratories), Workshop on Applications of Signal Processing to Audio and Acoustics, October 19th 2003
Target Estimation Iteration & Filter Training
[Diagram: per-speaker filter banks (Filter 1 ... Filter M) applied to microphone inputs (MIC1 ... MICM), with ASR alignment and factorial processing providing Target 1 and Target 2 for the filter-training error feedback] (from Manuel Reyes’s WASPAA 2003 presentation)
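A much-reduced sketch of the filter-training step: a recognizer-derived target stands in for the clean reference, and a filter is fit by least squares to bring the microphone signal toward it. Everything here is invented for illustration (single channel, a 3-tap FIR filter, a synthetic target); the actual system trains multi-channel filters per speaker and iterates the target estimation.

```python
import numpy as np

def fit_filter(mic, target, taps=3):
    """Least-squares FIR filter bringing the mic signal toward the target."""
    n = len(mic)
    X = np.zeros((n, taps))
    for k in range(taps):
        X[k:, k] = mic[:n - k]      # X[t, k] = mic[t - k] (causal taps)
    h, *_ = np.linalg.lstsq(X, target, rcond=None)
    return h

rng = np.random.default_rng(0)
mic = rng.normal(size=64)
h_true = np.array([0.5, 0.3, 0.2])
# stand-in for a recognizer-derived target: here, a known filtering of mic
target = np.convolve(mic, h_true)[:64]
h = fit_filter(mic, target)
```

When the target is exactly reachable by a 3-tap filter, the least-squares solution recovers it; with a noisy ASR-derived target the same machinery gives the best approximating filter, which is the role it plays in the iteration above.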
Learning Elsewhere
• Control: learn what is “reasonable”
• Input: discriminant features, learned subspaces
• Engine: clustering parameters
• Output: restoration ...
Obliteration and Outputs
• Perfect separation is rarely possible:
- e.g. no cancelation after psychoacoustic coding
- strong interference will obliterate part of the target
• What should the output be?
- can fill in missing-data holes using source models: ‘pretend’ we observed the full signal
- but this hides the observed/inferred distinction
- output internal model state instead? e.g. ASR output
- depends on eventual use ...
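The missing-data fill-in can be sketched in a few lines. This is an invented toy, not the talk's system: bins obliterated by interference are marked missing, a codebook of spectral shapes is matched on the observed bins only, and the holes are imputed from the best-matching codeword. Note the output no longer distinguishes observed from inferred bins, which is exactly the caveat raised above.

```python
import numpy as np

def fill_in(frame, observed, codebook):
    """Impute missing spectral bins from the best-matching codeword."""
    errs = [np.sum((frame[observed] - c[observed]) ** 2) for c in codebook]
    best = codebook[int(np.argmin(errs))]
    out = frame.copy()
    out[~observed] = best[~observed]     # holes filled from the model
    return out

# toy 4-bin spectral codebook (purely illustrative values)
codebook = np.array([[4., 1., 3., 0.], [0., 3., 1., 4.]])
frame = np.array([4., 1., 0., 0.])       # last two bins obliterated
observed = np.array([True, True, False, False])
restored = fill_in(frame, observed, codebook)
```

The match is decided entirely by the two surviving bins; the restored frame then looks complete, so any downstream consumer must be told separately which bins were actually observed.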
Conclusions
• Framework for scene analysis: Input, Output, Engine, Control
• Scene analysis as disambiguation: finding the additional constraints
• Learning to spot a reasonable solution
• Various implementations:
- direct dictionary fit
- compare output to recognizer’s state
• Learned states as the output?