Sound Organization by Models - Dan Ellis 2006-12-09 - 1/29
Sound Organization by Source Models in Humans and Machines
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Dept. of Electrical Engineering, Columbia University, NY, USA
[email protected] http://labrosa.ee.columbia.edu/

Outline:
1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
The Problem of Mixtures

"Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?" (after Bregman '90)

• Received waveform is a mixture
  2 sensors, N sources: underconstrained
• Undoing mixtures: hearing's primary goal?
  ... by any means available
Sound Organization Scenarios

• Interactive voice systems
  human-level understanding is expected
• Speech prostheses
  crowds: the #1 complaint of hearing-aid users
• Archive analysis
  identifying and isolating sound events
• Unmixing / remixing / enhancement ...
[Figure: spectrogram of an example sound mixture, pa-2004-09-10-131540.wav]
How Can We Separate?

• By between-sensor differences (spatial cues)
  'steer a null' onto a compact interfering source
  the filtering / signal-processing paradigm
• By finding a 'separable representation'
  spectral? sources are broadband but sparse
  periodicity? maybe – for pitched sounds
  something more signal-specific ...
• By inference (based on knowledge/models)
  acoustic sources are redundant → use part to guess the remainder
  limited possible solutions
Separation vs. Inference

• Ideal separation is rarely possible
  i.e. no projection can completely remove overlaps
• Overlaps → ambiguity
  scene analysis = find the "most reasonable" explanation
• Ambiguity can be expressed probabilistically
  i.e. posteriors of sources {Si} given observations X:

  P({Si} | X) ∝ P(X | {Si}) P({Si})
• Better source models → better inference
.. learn from examples?
A Simple Example

• Source models are codebooks from separate subspaces
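A toy sketch of the codebook idea (the three-bin "spectra" below are invented for illustration, not data from the talk): each source may only emit one of its codewords, so an exhaustive search over codeword pairs finds the combination that best explains the mixture.

```python
import numpy as np

# Hypothetical codebooks occupying roughly separate subspaces:
codebook_a = np.array([[8.0, 1.0, 1.0], [6.0, 2.0, 1.0]])  # "low" source
codebook_b = np.array([[1.0, 1.0, 8.0], [1.0, 2.0, 6.0]])  # "high" source

def separate(x, ca, cb):
    """Return (i, j) minimizing ||x - (ca[i] + cb[j])||^2, i.e. the most
    likely state pair under a simple additive combination model."""
    best, best_err = None, np.inf
    for i, a in enumerate(ca):
        for j, b in enumerate(cb):
            err = np.sum((x - (a + b)) ** 2)
            if err < best_err:
                best, best_err = (i, j), err
    return best

x = codebook_a[0] + codebook_b[1]           # mixture of codewords A0 and B1
print(separate(x, codebook_a, codebook_b))  # -> (0, 1)
```

Once the state pair is known, each source's codeword serves as its reconstruction; the next slide adds temporal structure to the same search.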
A Slightly Less Simple Example

• Sources with Markov transitions
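The Markov version can be sketched as a Viterbi search over pairs of codeword states. The transition matrices and additive combination rule below are illustrative assumptions, not the talk's exact formulation:

```python
import numpy as np
from itertools import product

def viterbi_pair(X, ca, cb, Ta, Tb):
    """X: (T, F) mixture frames; ca/cb: codebooks for the two sources;
    Ta/Tb: log transition matrices. Returns the best pair-state path."""
    states = list(product(range(len(ca)), range(len(cb))))

    def loglik(x, s):
        i, j = s
        return -np.sum((x - (ca[i] + cb[j])) ** 2)  # additive combination

    score = {s: loglik(X[0], s) for s in states}
    back = []
    for x in X[1:]:
        new, ptr = {}, {}
        for s in states:
            i, j = s
            prev = max(states, key=lambda p: score[p] + Ta[p[0], i] + Tb[p[1], j])
            new[s] = score[prev] + Ta[prev[0], i] + Tb[prev[1], j] + loglik(x, s)
            ptr[s] = prev
        score, back = new, back + [ptr]
    s = max(states, key=score.get)
    path = [s]
    for ptr in reversed(back):  # trace back through the stored pointers
        s = ptr[s]
        path.append(s)
    return path[::-1]

# Tiny invented example: two 2-codeword sources, uniform transitions.
ca = np.array([[4.0, 0.0], [0.0, 4.0]])
cb = np.array([[1.0, 1.0], [2.0, 0.0]])
T_uniform = np.log(np.full((2, 2), 0.5))
X = np.array([ca[0] + cb[0], ca[1] + cb[0]])
print(viterbi_pair(X, ca, cb, T_uniform, T_uniform))  # -> [(0, 0), (1, 0)]
```

The search is exhaustive over pair-states, which is exactly the tractability problem the "Source Model Issues" slide later raises: the joint state space grows as the product of the individual codebook sizes.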
What Is a Source Model?

• A source model describes signal behavior
  it encapsulates constraints on the form of the signal
  (any such constraint can be seen as a model ...)
• A model has parameters
  model + parameters → instance
• What is not a source model?
  detail not provided in the instance
    e.g. using phase from the original mixture
  constraints on interaction between sources
    e.g. independence, clustering attributes
[Figure: excitation source g[n] (time domain) driving a resonance filter H(e^jω)]
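The excitation/resonance picture is the classic source-filter model. A minimal sketch, with illustrative pitch and formant values (not those of the talk's figure): a periodic pulse train stands in for the glottal excitation g[n], shaped by a two-pole resonance.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
pitch_hz = 100
n = np.arange(fs)                               # one second of samples
g = (n % (fs // pitch_hz) == 0).astype(float)   # impulse train at 100 Hz

# Single resonance near 500 Hz with ~100 Hz bandwidth (two-pole filter)
f0, bw = 500.0, 100.0
r = np.exp(-np.pi * bw / fs)                    # pole radius from bandwidth
theta = 2 * np.pi * f0 / fs                     # pole angle from frequency
a = [1.0, -2 * r * np.cos(theta), r * r]        # denominator of H(z)
speech_like = lfilter([1.0], a, g)
print(speech_like.shape)  # (8000,)
```

The point of the factorization is that the same excitation can drive different filters (different phones) and vice versa, which is what the later "factored dictionaries" slide exploits.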
Outline

1. Mixtures and Models
2. Human Sound Organization
   Auditory Scene Analysis
   Using source characteristics
   Illusions
3. Machine Sound Organization
4. Research Questions
Auditory Scene Analysis
• How do people analyze sound mixtures?
  break the mixture into small elements (in time-frequency)
  elements are grouped into sources using cues
  sources have aggregate attributes
• Grouping rules (Darwin, Carlyon, ...):
  cues: common onset/modulation, harmonicity, ...
• Also learned "schema" (for speech etc.)
[Figure (after Darwin 1996): Sound → frequency analysis → elements; onset, harmonicity, and spatial maps feed a grouping mechanism that yields sources and source properties]

Bregman '90; Darwin & Carlyon '95
Perceiving Sources

• Harmonics are distinct in the ear, but perceived as one source ("fused"):
  depends on common onset
  depends on harmonics
• Experimental techniques
  ask subjects "how many"
  match attributes, e.g. pitch, vowel identity
  brain recordings (EEG "mismatch negativity")
Auditory "Illusions"

• How do we explain illusions?
  pulsation threshold
  sinewave speech
  phonemic restoration
• Something is providing the missing (illusory) pieces ... source models
[Figure: spectrograms illustrating the pulsation threshold, sinewave speech, and phonemic restoration examples]
Human Speech Separation

• Task: Coordinate Response Measure (CRM)
  "Ready Baron go to green eight now"
  256 variants, 16 speakers
  correct = color and number for "Baron"
• Accuracy as a function of spatial separation:
  (A, B: same speaker; note the range effect)

Brungart et al. '02
Sound example: crm-11737+16515.wav
Separation by Vocal Differences

• CRM varying the level and voice character
  energetic vs. informational masking
  more than pitch ... source models
• Monaural informational masking: level differences
  a level difference can help in speech segregation
  listen for the quieter talker
  (same spatial location)

Brungart et al. '01
Outline

1. Mixtures and Models
2. Human Sound Organization
3. Machine Sound Organization
   Computational Auditory Scene Analysis
   Dictionary Source Models
4. Research Questions
Source Model Issues

• Domain
  parsimonious expression of constraints
  nice combination physics
• Tractability
  size of the search space
  tricks to speed search/inference
• Acquisition
  hand-designed vs. learned
  static vs. short-term
• Factorization
  independent aspects
  hierarchy & specificity
Computational Auditory Scene Analysis

• Central idea: segment time-frequency into sources, based on perceptual grouping cues
  ... the principal cue is harmonicity

Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...
[Figure: input mixture → front end (signal features / maps: onset, period, freq. mod.) → object formation (discrete objects) → grouping rules → source groups; time-frequency displays of the segment and group stages]
CASA Limitations

• Limitations of T-F masking
  cannot undo overlaps – leaves gaps
• Driven by local features
  limited model scope → no inference or illusions
• Does not learn from data
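A minimal demonstration of time-frequency masking, using synthetic tones and an ideal (oracle) binary mask; scipy's STFT stands in for the auditory front end. It also shows the limitation noted above: cells assigned to one source are simply removed from the other, leaving gaps rather than undoing overlap.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interferer = np.sin(2 * np.pi * 1200 * t)
mix = target + interferer

_, _, Zmix = stft(mix, fs=fs, nperseg=256)
_, _, Ztgt = stft(target, fs=fs, nperseg=256)
_, _, Zint = stft(interferer, fs=fs, nperseg=256)

mask = np.abs(Ztgt) > np.abs(Zint)        # keep cells where the target dominates
_, recovered = istft(Zmix * mask, fs=fs, nperseg=256)
```

With well-separated tones the mask recovers the target almost perfectly; with overlapping broadband sources, the kept cells would still contain a mixture and the discarded cells would leave audible holes.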
from Hu & Wang ’04
huwang-v3n7.wav
Basic Dictionary Models

• Given models for sources, find the "best" (most likely) states for each spectrum:

  {i1(t), i2(t)} = argmax_{i1,i2} p(x(t) | i1, i2)     (inference of source state)
  p(x | i1, i2) = N(x; c_i1 + c_i2, Σ)                 (combination model)

  can include sequential constraints ...
  different domains for combining c and defining ...

• E.g. stationary noise:
[Figure: mel spectrograms (freq / mel bin vs. time / s): original speech; in speech-shaped noise (with SNR); VQ inferred states (with SNR)]

Roweis '01, '03; Kristjansson '04, '06
Deeper Models: Iroquois

• Optimal inference on mixed spectra
  speaker-specific models (e.g. 512-mixture GMMs)
  Algonquin inference
• ... for the Speech Separation Challenge (Cooke & Lee '06)
  exploit grammar constraints – higher-level dynamics

Kristjansson, Hershey et al. '06
[Embedded paper: T. Kristjansson (Machine Learning and Applied Statistics, Microsoft Research) & J. Hershey (Machine Perception Lab, UC San Diego), "High Resolution Signal Reconstruction", ASRU 2003. It exploits the harmonic structure of speech with a high-frequency-resolution speech model in the log-spectrum domain, reconstructing the signal from the estimated posteriors of the clean signal and the phases of the noisy signal; reported gains: 8.38 dB SNR for enhancement at 0 dB, and relative WER reductions on Aurora 2 at 0 dB SNR of 43.75% over the baseline and 15.90% over the equivalent low-resolution algorithm. Its Fig. 1 shows noisy, clean, and estimated-clean vectors, with estimate uncertainty largest in the low-SNR valleys between harmonic peaks.]
[Figure: WER (%) vs. SNR (6 dB down to -9 dB) for same-gender speakers, comparing the SDL recognizer with no dynamics, acoustic dynamics, grammar dynamics, and human listeners]
Faster Search: the Fragment Decoder

• Match the 'uncorrupt' spectrum to ASR models using missing-data recognition
  easy if you know the segregation mask S
• Joint search for model M and segregation S to maximize:

Barker et al. '05
Speech Fragment Decoding:

• Standard classification chooses between models M to match source features X:

  M* = argmax_M P(M | X) = argmax_M P(X | M) P(M) / P(X)

• Mixtures: observed features Y, segregation S, all related by P(X | Y, S)
• Joint classification of model and segregation (P(X) no longer constant):

  P(M, S | Y) = P(M) ∫ [P(X | M) P(X | Y, S) / P(X)] dX · P(S | Y)

  [isolated source model P(X|M); segregation model P(S|Y); observation Y(f), segregation S, source X(f)]
CASA in the Fragment Decoder

• CASA can help the search
  consider only segregations made from CASA chunks
• CASA can rate segregations
  construct P(S | Y) to reward CASA qualities:
  is this segregation worth considering? how likely is it?
  periodicity/harmonicity: frequency bands belong together
  onset/continuity: a time-frequency region must be whole

[Figure: frequency proximity, common onset, and harmonicity cues]
(Pitch) Factored Dictionaries

• Separate representations for "source" (pitch) and "filter"
  N·M codewords from N+M entries
  but: overgeneration ...
• Faster search
  direct extraction of pitches
  immediate separation of (most of) the spectra

Ghandi & Hasegawa-Johnson '04; Radfar et al. '06
Discriminant Models for Music

• Transcribe piano recordings by classification
  train SVM detectors for every piano note
  88 separate detectors, independent smoothing
• Trained on player-piano recordings
• Can resynthesize from the transcript ...
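The one-detector-per-note idea can be sketched as follows. The talk's system trains SVMs (88, one per piano key) on real recordings; here a least-squares linear scorer stands in for the SVM, and the "spectra" are synthetic frames built from made-up note templates, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bins, n_notes, n_frames = 16, 3, 400
note_templates = rng.random((n_notes, n_bins))      # hypothetical per-note spectra

# Polyphonic labels: several notes may sound in the same frame.
labels = rng.random((n_frames, n_notes)) < 0.3
X = labels.astype(float) @ note_templates + 0.02 * rng.standard_normal((n_frames, n_bins))
Xa = np.hstack([X, np.ones((n_frames, 1))])         # append a bias term

# One independent detector per note: column k of W scores "note k is sounding".
W = np.linalg.lstsq(Xa, labels.astype(float), rcond=None)[0]
pred = (Xa @ W) > 0.5
print((pred == labels).mean())                      # per-frame detection accuracy
```

Each detector is trained and run independently, matching the slide; the talk's additional temporal smoothing of each detector's output is omitted here.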
[Figure: piano-roll transcription, Bach 847, Disklavier recording; freq / pitch (A1–A6) vs. time / sec; level / dB]

Poliner & Ellis '06
Piano Transcription Results

• Significant improvement from the classifier; frame-level accuracy results:

Table 1: Frame-level transcription results.

Algorithm             Errs    False Pos   False Neg   d'
SVM                   43.3%   27.9%       15.4%       3.44
Klapuri & Ryynanen    66.6%   28.1%       38.5%       2.71
Marolt                84.6%   36.5%       48.1%       2.35

• Breakdown by frame type:
• Overall Accuracy Acc: a frame-level version of the metric proposed by Dixon [Dixon, 2000], defined as:

  Acc = N / (FP + FN + N)                          (3)

  where N is the number of correctly transcribed frames, FP is the number of unvoiced frames UV transcribed as voiced V, and FN is the number of voiced frames transcribed as unvoiced.

• Error Rate Err: the unbounded error rate is defined as:

  Err = (FP + FN) / V                              (4)

  Additionally, the false positive rate FPR and false negative rate FNR are defined as FP/V and FN/V respectively.

• Discriminability d': a measure of the sensitivity of a detector that attempts to factor out the overall bias toward labeling any frame as voiced (which can move both hit rate and false-alarm rate up and down in tandem). It converts the hit rate and false-alarm rate into standard deviations away from the mean of an equivalent Gaussian distribution, and reports the difference between them. A larger value indicates a detection scheme with better discrimination between the two classes [Duda et al., 2001]:

  d' = |Qinv(N/V) - Qinv(FP/UV)|                   (5)
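The three metrics are straightforward to compute directly from raw counts; the counts below are made-up for illustration, and Qinv is implemented as the inverse survival function of the standard normal:

```python
from scipy.stats import norm

def transcription_metrics(N, FP, FN, V, UV):
    """N: correctly transcribed voiced frames; FP/FN: false positives/negatives;
    V/UV: total voiced/unvoiced frames."""
    acc = N / (FP + FN + N)                             # Eq. (3)
    err = (FP + FN) / V                                 # Eq. (4)
    d_prime = abs(norm.isf(N / V) - norm.isf(FP / UV))  # Eq. (5), Qinv = isf
    return acc, err, d_prime

acc, err, d = transcription_metrics(N=700, FP=150, FN=300, V=1000, UV=1000)
print(round(acc, 3), round(err, 3), round(d, 2))
```

Note that Err is unbounded (it can exceed 100% when insertions pile up), which is why the accuracy and d' figures are reported alongside it.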
As displayed in Table 1, the discriminative model provides a significant performance advantage on the test set with respect to frame-level transcription accuracy. This result highlights the merit of a discriminative model for candidate note identification. Since the transcription problem becomes more complex with the number of simultaneous notes, we have also plotted the frame-level classification accuracy versus the number of notes present for each of the algorithms in the left panel of Figure 4; the composition of the classification error rate versus the number of simultaneously occurring notes for the proposed algorithm is displayed in the right panel. As expected, there is an inverse relationship between the number of notes present and the proportional contribution of insertion errors to the total error rate. However, the performance degradation of the proposed algorithm is not as significant as that of the harmonic-based models.
[Figure: classification error (%) vs. number of notes present (1–8), broken down into false negatives and false positives]
Outline

1. Mixtures & Models
2. Human Sound Organization
3. Machine Sound Organization
4. Research Questions
   Task and Evaluation
   Generic vs. Specific
Task & Evaluation

• How to measure separation performance?
  depends what you are trying to do
• SNR?
  energy (and distortions) are not created equal
  different nonlinear components [Vincent et al. '06]
• Human intelligibility?
  rare for nonlinear processing to improve intelligibility
  listening tests are expensive
• ASR performance?
  separate-then-recognize is too simplistic;
  ASR needs to accommodate separation
[Figure: transmission errors vs. aggressiveness of processing – reduced interference trades against increased artefacts, giving an optimum net effect]
How Many Models?

• More specific models → better separation
  need individual dictionaries for "everything"??
• Model adaptation and hierarchy
  speaker-adapted models: base + parameters
  extrapolation beyond the normal domain
  generic → specific: pitched → piano → this piano
• Time scales of model acquisition
  innate/evolutionary (hair-cell tuning)
  developmental (mother-tongue phones)
  dynamic – the "slung mugs" effect; Ozerov
[Figure: "Recognition of Scaled Vowels" (CNBH, Physiology Department, Cambridge University) – mean vowel regions /a/ /e/ /i/ /o/ /u/ in the plane of pitch (Hz) vs. vocal-tract length ratio, with the domain of normal speakers marked. Smith, Patterson, Turner, Kawahara & Irino, JASA (2005)]

Smith, Patterson et al. '05
Summary & Conclusions

• Listeners do well at separating sound mixtures
  using signal cues (location, periodicity)
  using source-property variations
• Machines do less well
  difficult to apply enough constraints
  need to exploit signal detail
• Models capture the constraints
  learn from the real world
  adapt to sources
• Separation is feasible only sometimes
  describing source properties is easier