Sound Organization by Models - Dan Ellis 2006-12-09 - 1/29
Sound Organization by Source Models in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
[email protected] http://labrosa.ee.columbia.edu/

Outline:
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
The Problem of Mixtures

"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)

• Received waveform is a mixture
  2 sensors, N sources: underconstrained
• Undoing mixtures: hearing's primary goal?
  ... by any means available
Sound Organization Scenarios

• Interactive voice systems
  human-level understanding is expected
• Speech prostheses
  crowds: the #1 complaint of hearing-aid users
• Archive analysis
  identifying and isolating sound events
• Unmixing / remixing / enhancement ...
[Figure: spectrogram of an example sound mixture, pa-2004-09-10-131540.wav]
How Can We Separate?

• By between-sensor differences (spatial cues)
  'steer a null' onto a compact interfering source
  the filtering / signal-processing paradigm
• By finding a 'separable representation'
  spectral? sources are broadband but sparse
  periodicity? maybe – for pitched sounds
  something more signal-specific ...
• By inference (based on knowledge/models)
  acoustic sources are redundant → use part to guess the remainder
  limited possible solutions
Separation vs. Inference

• Ideal separation is rarely possible
  i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
  scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  i.e. posteriors of sources {Si} given observations X:

  P({Si} | X) ∝ P(X | {Si}) P({Si})
• Better source models → better inference
.. learn from examples?
A Simple Example

• Source models are codebooks from separate subspaces
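A toy sketch of the codebook idea (the three-bin "spectra" below are invented for illustration, not data from the talk): each source may only emit one of its codewords, so an exhaustive search over codeword pairs finds the combination that best explains the mixture.

```python
import numpy as np

# Hypothetical codebooks occupying roughly separate subspaces:
codebook_a = np.array([[8.0, 1.0, 1.0], [6.0, 2.0, 1.0]])  # "low" source
codebook_b = np.array([[1.0, 1.0, 8.0], [1.0, 2.0, 6.0]])  # "high" source

def separate(x, ca, cb):
    """Return (i, j) minimizing ||x - (ca[i] + cb[j])||^2, i.e. the most
    likely state pair under a simple additive combination model."""
    best, best_err = None, np.inf
    for i, a in enumerate(ca):
        for j, b in enumerate(cb):
            err = np.sum((x - (a + b)) ** 2)
            if err < best_err:
                best, best_err = (i, j), err
    return best

x = codebook_a[0] + codebook_b[1]           # mixture of codewords A0 and B1
print(separate(x, codebook_a, codebook_b))  # -> (0, 1)
```

Once the state pair is known, each source's codeword serves as its reconstruction; the next slide adds temporal structure to the same search.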
A Slightly Less Simple Example

• Sources with Markov transitions
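The Markov version can be sketched as a Viterbi search over pairs of codeword states. The transition matrices and additive combination rule below are illustrative assumptions, not the talk's exact formulation:

```python
import numpy as np
from itertools import product

def viterbi_pair(X, ca, cb, Ta, Tb):
    """X: (T, F) mixture frames; ca/cb: codebooks for the two sources;
    Ta/Tb: log transition matrices. Returns the best pair-state path."""
    states = list(product(range(len(ca)), range(len(cb))))

    def loglik(x, s):
        i, j = s
        return -np.sum((x - (ca[i] + cb[j])) ** 2)  # additive combination

    score = {s: loglik(X[0], s) for s in states}
    back = []
    for x in X[1:]:
        new, ptr = {}, {}
        for s in states:
            i, j = s
            prev = max(states, key=lambda p: score[p] + Ta[p[0], i] + Tb[p[1], j])
            new[s] = score[prev] + Ta[prev[0], i] + Tb[prev[1], j] + loglik(x, s)
            ptr[s] = prev
        score, back = new, back + [ptr]
    s = max(states, key=score.get)
    path = [s]
    for ptr in reversed(back):  # trace back through the stored pointers
        s = ptr[s]
        path.append(s)
    return path[::-1]

# Tiny invented example: two 2-codeword sources, uniform transitions.
ca = np.array([[4.0, 0.0], [0.0, 4.0]])
cb = np.array([[1.0, 1.0], [2.0, 0.0]])
T_uniform = np.log(np.full((2, 2), 0.5))
X = np.array([ca[0] + cb[0], ca[1] + cb[0]])
print(viterbi_pair(X, ca, cb, T_uniform, T_uniform))  # -> [(0, 0), (1, 0)]
```

The search is exhaustive over pair-states, which is exactly the tractability problem the "Source Model Issues" slide later raises: the joint state space grows as the product of the individual codebook sizes.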
What Is a Source Model?

• A source model describes signal behavior
  it encapsulates constraints on the form of the signal
  (any such constraint can be seen as a model ...)
• A model has parameters
  model + parameters → instance
• What is not a source model?
  detail not provided in the instance
    e.g. using phase from the original mixture
  constraints on interaction between sources
    e.g. independence, clustering attributes
[Figure: excitation source g[n] (time domain) driving a resonance filter H(e^jω)]
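The excitation/resonance picture is the classic source-filter model. A minimal sketch, with illustrative pitch and formant values (not those of the talk's figure): a periodic pulse train stands in for the glottal excitation g[n], shaped by a two-pole resonance.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
pitch_hz = 100
n = np.arange(fs)                               # one second of samples
g = (n % (fs // pitch_hz) == 0).astype(float)   # impulse train at 100 Hz

# Single resonance near 500 Hz with ~100 Hz bandwidth (two-pole filter)
f0, bw = 500.0, 100.0
r = np.exp(-np.pi * bw / fs)                    # pole radius from bandwidth
theta = 2 * np.pi * f0 / fs                     # pole angle from frequency
a = [1.0, -2 * r * np.cos(theta), r * r]        # denominator of H(z)
speech_like = lfilter([1.0], a, g)
print(speech_like.shape)  # (8000,)
```

The point of the factorization is that the same excitation can drive different filters (different phones) and vice versa, which is what the later "factored dictionaries" slide exploits.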
Outline

1. Mixtures and Models
2. Human Sound Organization
   Auditory Scene Analysis
   Using source characteristics
   Illusions
3. Machine Sound Organization
4. Research Questions
Auditory Scene Analysis
• How do people analyze sound mixtures?
  break the mixture into small elements (in time-frequency)
  elements are grouped into sources using cues
  sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
  cues: common onset/modulation, harmonicity, ...
• Also learned "schema" (for speech etc.)
[Figure (after Darwin 1996): Sound → frequency analysis → elements; onset, harmonicity, and spatial maps feed a grouping mechanism that yields sources and source properties]

Bregman '90; Darwin & Carlyon '95
Perceiving Sources

• Harmonics are distinct in the ear, but perceived as one source ("fused"):
  depends on common onset
  depends on harmonics
• Experimental techniques
  ask subjects "how many"
  match attributes, e.g. pitch, vowel identity
  brain recordings (EEG "mismatch negativity")
Auditory "Illusions"

• How do we explain illusions?
  pulsation threshold
  sinewave speech
  phonemic restoration
• Something is providing the missing (illusory) pieces ... source models
[Figure: spectrograms illustrating the pulsation threshold, sinewave speech, and phonemic restoration examples]
Human Speech Separation

• Task: Coordinate Response Measure (CRM)
  "Ready Baron go to green eight now"
  256 variants, 16 speakers
  correct = color and number for "Baron"
• Accuracy as a function of spatial separation:
  (A, B: same speaker; note the range effect)

Brungart et al. '02
Sound example: crm-11737+16515.wav
Separation by Vocal Differences

• CRM varying the level and voice character
  energetic vs. informational masking
  more than pitch ... source models
• Monaural informational masking: level differences
  a level difference can help in speech segregation
  listen for the quieter talker
  (same spatial location)

Brungart et al. '01
Outline

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
   Computational Auditory Scene Analysis
   Dictionary Source Models
4. Research Questions
Source Model Issues

• Domain
  parsimonious expression of constraints
  nice combination physics
• Tractability
  size of the search space
  tricks to speed search/inference
• Acquisition
  hand-designed vs. learned
  static vs. short-term
• Factorization
  independent aspects
  hierarchy & specificity
Computational Auditory Scene Analysis

• Central idea: segment time-frequency into sources, based on perceptual grouping cues
  ... the principal cue is harmonicity

Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...
[Figure: input mixture → front end (signal features / maps: onset, period, freq. mod.) → object formation (discrete objects) → grouping rules → source groups; time-frequency displays of the segment and group stages]
CASA Limitations

• Limitations of T-F masking
  cannot undo overlaps – leaves gaps
• Driven by local features
  limited model scope → no inference or illusions
• Does not learn from data
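A minimal demonstration of time-frequency masking, using synthetic tones and an ideal (oracle) binary mask; scipy's STFT stands in for the auditory front end. It also shows the limitation noted above: cells assigned to one source are simply removed from the other, leaving gaps rather than undoing overlap.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interferer = np.sin(2 * np.pi * 1200 * t)
mix = target + interferer

_, _, Zmix = stft(mix, fs=fs, nperseg=256)
_, _, Ztgt = stft(target, fs=fs, nperseg=256)
_, _, Zint = stft(interferer, fs=fs, nperseg=256)

mask = np.abs(Ztgt) > np.abs(Zint)        # keep cells where the target dominates
_, recovered = istft(Zmix * mask, fs=fs, nperseg=256)
```

With well-separated tones the mask recovers the target almost perfectly; with overlapping broadband sources, the kept cells would still contain a mixture and the discarded cells would leave audible holes.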
from Hu & Wang ’04
huwang-v3n7.wav
Basic Dictionary Models

• Given models for sources, find the "best" (most likely) states for each spectrum:

  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)     (inference of source state)
  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)                 (combination model)

  can include sequential constraints ...
  different domains for combining c and defining ...

• E.g. stationary noise:
[Figure: mel spectrograms (freq / mel bin vs. time / s): original speech; in speech-shaped noise (with SNR); VQ inferred states (with SNR)]

Roweis '01, '03; Kristjansson '04, '06
Deeper Models: Iroquois

• Optimal inference on mixed spectra
  speaker-specific models (e.g. 512-mixture GMMs)
  Algonquin inference
• ... for the Speech Separation Challenge (Cooke & Lee '06)
  exploit grammar constraints – higher-level dynamics

Kristjansson, Hershey et al. '06
[Embedded paper: T. Kristjansson (Machine Learning and Applied Statistics, Microsoft Research) & J. Hershey (Machine Perception Lab, UC San Diego), "High Resolution Signal Reconstruction", ASRU 2003. It exploits the harmonic structure of speech with a high-frequency-resolution speech model in the log-spectrum domain, reconstructing the signal from the estimated posteriors of the clean signal and the phases of the noisy signal; reported gains: 8.38 dB SNR for enhancement at 0 dB, and relative WER reductions on Aurora 2 at 0 dB SNR of 43.75% over the baseline and 15.90% over the equivalent low-resolution algorithm. Its Fig. 1 shows noisy, clean, and estimated-clean vectors, with estimate uncertainty largest in the low-SNR valleys between harmonic peaks.]
[Figure: WER (%) vs. SNR (6 dB down to -9 dB) for same-gender speakers, comparing the SDL recognizer with no dynamics, acoustic dynamics, grammar dynamics, and human listeners]
Faster Search: the Fragment Decoder

• Match the 'uncorrupt' spectrum to ASR models using missing-data recognition
  easy if you know the segregation mask S
• Joint search for model M and segregation S to maximize:

Barker et al. '05
Speech Fragment Decoding:

• Standard classification chooses between models M to match source features X:

  M* = argmax_M P(M | X) = argmax_M P(X | M) P(M) / P(X)

• Mixtures: observed features Y, segregation S, all related by P(X | Y, S)
• Joint classification of model and segregation (P(X) no longer constant):

  P(M, S | Y) = P(M) ∫ [P(X | M) P(X | Y, S) / P(X)] dX · P(S | Y)

  [isolated source model P(X|M); segregation model P(S|Y); observation Y(f), segregation S, source X(f)]
CASA in the Fragment Decoder

• CASA can help the search
  consider only segregations made from CASA chunks
• CASA can rate segregations
  construct P(S | Y) to reward CASA qualities:
  is this segregation worth considering? how likely is it?
  periodicity/harmonicity: frequency bands belong together
  onset/continuity: a time-frequency region must be whole

[Figure: frequency proximity, common onset, and harmonicity cues]
(Pitch) Factored Dictionaries

• Separate representations for "source" (pitch) and "filter"
  N·M codewords from N+M entries
  but: overgeneration ...
• Faster search
  direct extraction of pitches
  immediate separation of (most of) the spectra

Ghandi & Hasegawa-Johnson '04; Radfar et al. '06
Discriminant Models for Music

• Transcribe piano recordings by classification
  train SVM detectors for every piano note
  88 separate detectors, independent smoothing
• Trained on player-piano recordings
• Can resynthesize from the transcript ...
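The one-detector-per-note idea can be sketched as follows. The talk's system trains SVMs (88, one per piano key) on real recordings; here a least-squares linear scorer stands in for the SVM, and the "spectra" are synthetic frames built from made-up note templates, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins, n_notes, n_frames = 16, 3, 400
note_templates = rng.random((n_notes, n_bins))      # hypothetical per-note spectra

# Polyphonic labels: several notes may sound in the same frame.
labels = rng.random((n_frames, n_notes)) < 0.3
X = labels.astype(float) @ note_templates + 0.02 * rng.standard_normal((n_frames, n_bins))
Xa = np.hstack([X, np.ones((n_frames, 1))])         # append a bias term

# One independent detector per note: column k of W scores "note k is sounding".
W = np.linalg.lstsq(Xa, labels.astype(float), rcond=None)[0]
pred = (Xa @ W) > 0.5
print((pred == labels).mean())                      # per-frame detection accuracy
```

Each detector is trained and run independently, matching the slide; the talk's additional temporal smoothing of each detector's output is omitted here.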
[Figure: piano-roll transcription, Bach 847, Disklavier recording; freq / pitch (A1–A6) vs. time / sec; level / dB]

Poliner & Ellis '06
Piano Transcription Results

• Significant improvement from the classifier; frame-level accuracy results:

Table 1: Frame-level transcription results.

Algorithm             Errs    False Pos   False Neg   d'
SVM                   43.3%   27.9%       15.4%       3.44
Klapuri & Ryynanen    66.6%   28.1%       38.5%       2.71
Marolt                84.6%   36.5%       48.1%       2.35

• Breakdown by frame type:
• Overall Accuracy Acc: a frame-level version of the metric proposed by Dixon [Dixon, 2000], defined as:

  Acc = N / (FP + FN + N)                          (3)

  where N is the number of correctly transcribed frames, FP is the number of unvoiced frames UV transcribed as voiced V, and FN is the number of voiced frames transcribed as unvoiced.

• Error Rate Err: the unbounded error rate is defined as:

  Err = (FP + FN) / V                              (4)

  Additionally, the false positive rate FPR and false negative rate FNR are defined as FP/V and FN/V respectively.

• Discriminability d': a measure of the sensitivity of a detector that attempts to factor out the overall bias toward labeling any frame as voiced (which can move both hit rate and false-alarm rate up and down in tandem). It converts the hit rate and false-alarm rate into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes [Duda et al., 2001]:

  d' = |Qinv(N/V) - Qinv(FP/UV)|                   (5)
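The three metrics are straightforward to compute directly from raw counts; the counts below are made-up for illustration, and Qinv is implemented as the inverse survival function of the standard normal:

```python
from scipy.stats import norm

def transcription_metrics(N, FP, FN, V, UV):
    """N: correctly transcribed voiced frames; FP/FN: false positives/negatives;
    V/UV: total voiced/unvoiced frames."""
    acc = N / (FP + FN + N)                             # Eq. (3)
    err = (FP + FN) / V                                 # Eq. (4)
    d_prime = abs(norm.isf(N / V) - norm.isf(FP / UV))  # Eq. (5), Qinv = isf
    return acc, err, d_prime

acc, err, d = transcription_metrics(N=700, FP=150, FN=300, V=1000, UV=1000)
print(round(acc, 3), round(err, 3), round(d, 2))
```

Note that Err is unbounded (it can exceed 100% when insertions pile up), which is why the accuracy and d' figures are reported alongside it.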
As displayed in Table 1, the discriminative model provides a significant performance advantage on the test set with respect to frame-level transcription accuracy. This result highlights the merit of a discriminative model for candidate note identification. Since the transcription problem becomes more complex with the number of simultaneous notes, we have also plotted the frame-level classification accuracy versus the number of notes present for each of the algorithms in the left panel of Figure 4; the composition of the classification error rate versus the number of simultaneously occurring notes for the proposed algorithm is displayed in the right panel. As expected, there is an inverse relationship between the number of notes present and the proportional contribution of insertion errors to the total error rate. However, the performance degradation of the proposed algorithm is not as significant as that of the harmonic-based models.
[Figure: classification error (%) vs. number of notes present (1–8), broken down into false negatives and false positives]
Outline

1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
   Task and Evaluation
   Generic vs. Specific
Task & Evaluation

• How to measure separation performance?
  depends what you are trying to do
• SNR?
  energy (and distortions) are not created equal
  different nonlinear components [Vincent et al. '06]
• Human intelligibility?
  rare for nonlinear processing to improve intelligibility
  listening tests are expensive
• ASR performance?
  separate-then-recognize is too simplistic;
  ASR needs to accommodate separation
[Figure: transmission errors vs. aggressiveness of processing – reduced interference trades against increased artefacts, giving an optimum net effect]
How Many Models?

• More specific models → better separation
  need individual dictionaries for "everything"??
• Model adaptation and hierarchy
  speaker-adapted models: base + parameters
  extrapolation beyond the normal domain
  generic → specific: pitched → piano → this piano
• Time scales of model acquisition
  innate/evolutionary (hair-cell tuning)
  developmental (mother-tongue phones)
  dynamic – the "slung mugs" effect; Ozerov
[Figure: "Recognition of Scaled Vowels" (CNBH, Physiology Department, Cambridge University) – mean vowel regions /a/ /e/ /i/ /o/ /u/ in the plane of pitch (Hz) vs. vocal-tract length ratio, with the domain of normal speakers marked. Smith, Patterson, Turner, Kawahara & Irino, JASA (2005)]

Smith, Patterson et al. '05
Summary & Conclusions

• Listeners do well at separating sound mixtures
  using signal cues (location, periodicity)
  using source-property variations
• Machines do less well
  difficult to apply enough constraints
  need to exploit signal detail
• Models capture the constraints
  learn from the real world
  adapt to sources
• Separation is feasible only sometimes
  describing source properties is easier