EE E6820: Speech & Audio Processing & Recognition
Lecture 4: Auditory Perception
Mike Mandel and Dan Ellis
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 10, 2009
1 Motivation: Why & how
2 Auditory physiology
3 Psychophysics: Detection & discrimination
4 Pitch perception
5 Speech perception
6 Auditory organization & Scene analysis
E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 1 / 44
Why study perception?
Perception is messy: can we avoid it? No!
Audition provides the ‘ground truth’ in audio
  • what is relevant and irrelevant
  • subjective importance of distortion (coding etc.)
  • (there could be other information in sound...)
Some sounds are ‘designed’ for audition
  • co-evolution of speech and hearing
The auditory system is very successful
  • we would do extremely well to duplicate it
We are now able to model complex systems
  • faster computers, bigger memories
How to study perception?
Three different approaches:
Analyze the example: physiology
  • dissection & nerve recordings
Black box input/output: psychophysics
  • fit simple models of simple functions
Information processing models
  • investigate and model complex functions, e.g. scene analysis, speech perception
Physiology
Processing chain from air to brain:
Outer ear → Middle ear → Inner ear (cochlea) → Auditory nerve → Midbrain → Cortex
Study via:
  • anatomy
  • nerve recordings
Signals flow in both directions
Outer & middle ear
[Figure: cross-section of the ear, labeled pinna, ear canal, eardrum (tympanum), middle ear bones, cochlea (inner ear)]
Pinna ‘horn’
  • complex reflections give spatial (elevation) cues
Ear canal
  • acoustic tube
Middle ear
  • bones provide impedance matching
Inner ear: Cochlea
[Figure: cochlea with the oval window (driven by the middle ear bones) and the basilar membrane (BM) carrying a travelling wave; resonant frequency vs. position along the BM, spanning 16 kHz to 50 Hz over 0 to 35 mm]
Mechanical input from middle ear starts travelling wave moving down basilar membrane
Varying stiffness and mass of BM results in continuous variation of resonant frequency
At resonance, travelling wave energy is dissipated in BM vibration
  • Frequency (Fourier) analysis
Cochlea hair cells
Ear converts sound to BM motion
  • each point on BM corresponds to a frequency
[Figure: cross-section of the cochlea, labeled basilar membrane, tectorial membrane, inner hair cell (IHC), outer hair cell (OHC), auditory nerve]
Hair cells on BM convert motion into nerve impulses (firings)
Inner Hair Cells detect motion
Outer Hair Cells? Variable damping?
Inner Hair Cells
IHCs convert BM vibration into nerve firings
Human ear has ~3500 IHCs
  • each IHC has ~7 connections to Auditory Nerve
Each nerve fires (sometimes) near peak displacement
[Figure: local BM displacement and typical nerve signal (mV) vs. time / ms]
Histogram to get firing probability
[Figure: firing count vs. cycle angle]
Auditory nerve (AN) signals
Single nerve measurements
[Figure, three panels: tone burst histogram (spike count vs. time, 100 ms burst); frequency threshold (dB SPL vs. tone burst (log) frequency, 100 Hz to 10 kHz); rate vs. intensity (spikes/sec vs. intensity / dB SPL)]
One fiber: ~25 dB dynamic range
Hearing dynamic range > 100 dB
Hard to measure: probe living ANs?
AN population response
All the information the brain has about sound
  • average rate & spike timings on 30,000 fibers
Not unlike a (constant-Q) spectrogram
[Figure: "PatSla rectsmoo on bbctmp2 (2001-02-18)"; freq / 8ve re 100 Hz (0 to 5) vs. time / ms (0 to 60)]
Beyond the auditory nerve
Ascending and descending
Tonotopic × ?
  • modulation, position, source??
Periphery models
[Block diagram: Sound → outer/middle ear filtering → cochlea filterbank → IHC (one per channel)]
Modeled aspects
  • outer / middle ear
  • hair cell transduction
  • cochlea filtering
  • efferent feedback?
Results: ‘neurogram’ / ‘cochleagram’
[Figure: "SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218)"; channel (10 to 60) vs. time / s (0 to 0.5)]
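The periphery chain above (filterbank, then hair-cell transduction) can be sketched in a few lines. This is a rough illustration only, not the Slaney/Patterson model named in the figure caption: the Gaussian log-frequency filters, the half-wave rectifier standing in for the IHC, the moving-average smoother, and the function name `cochleagram` are all assumptions made for the example. Only the 12 channels/octave spacing from 180 Hz is taken from the slide.

```python
import numpy as np

def cochleagram(x, sr, fmin=180.0, chans_per_oct=12, n_chans=60, smooth_ms=5.0):
    """Crude periphery model: log-spaced band-pass bank, then rectify and smooth."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    # Centre frequencies spaced chans_per_oct per octave upward from fmin.
    centres = fmin * 2.0 ** (np.arange(n_chans) / chans_per_oct)
    # Gaussian weighting on a log-frequency axis gives constant-Q bands (assumed shape).
    log_f = np.log2(np.maximum(freqs, 1e-6))
    width_oct = 1.0 / chans_per_oct
    win = max(1, int(round(smooth_ms * 1e-3 * sr)))
    kernel = np.ones(win) / win
    out = np.empty((n_chans, n))
    for i, fc in enumerate(centres):
        gain = np.exp(-0.5 * ((log_f - np.log2(fc)) / width_oct) ** 2)
        band = np.fft.irfft(spectrum * gain, n=n)
        rectified = np.maximum(band, 0.0)  # half-wave rectification (IHC stand-in)
        out[i] = np.convolve(rectified, kernel, mode="same")  # moving-average smoothing
    return centres, out

# A 1 kHz tone should excite the channel whose centre is nearest 1 kHz.
sr = 16000
t = np.arange(int(round(0.2 * sr))) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
centres, cg = cochleagram(tone, sr)
best = int(np.argmax(cg.mean(axis=1)))
print(f"strongest channel: {best} (centre {centres[best]:.0f} Hz)")
```

Rows of `cg` are channels and columns are time samples, so plotting it as an image gives a picture in the spirit of the cochleagram figure.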
Psychophysics
Physiology looks at the implementation
Psychology looks at the function/behavior
Analyze audition as signal detection: p(θ | x)
  • psychological tests reflect internal decisions
  • assume optimal decision process
  • infer nature of internal representations, noise, ...
  → lower bounds on more complex functions
Different aspects to measure
  • time, frequency, intensity
  • tones, complexes, noise
  • binaural
  • pitch, detuning
Basic psychophysics
Relate physical and perceptual variables, e.g. intensity → loudness, frequency → pitch
Methodology: subject tests
  • just noticeable difference (JND)
  • magnitude scaling, e.g. “adjust to twice as loud”
Results for Intensity vs Loudness: Weber’s law $\Delta I \propto I \Rightarrow \log(L) = k \log(I)$
[Figure: Hartmann (1993) classroom loudness scaling data; log(loudness rating) vs. sound level / dB. Power law fit: $L \propto I^{0.22}$. Textbook figure: $L \propto I^{0.3}$]
With the textbook exponent, loudness doubles for roughly every 10 dB:
$\log_2(L) = 0.3 \log_2(I) = 0.3\,\frac{\log_{10} I}{\log_{10} 2} = \frac{0.3}{\log_{10} 2}\cdot\frac{\mathrm{dB}}{10} \approx \mathrm{dB}/10$
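As a quick numerical check of the power-law relation, the following sketch converts a level change in dB to a predicted loudness ratio. The function name `loudness_ratio` is made up for this example; the two exponents are the ones quoted on the slide.

```python
def loudness_ratio(delta_db, exponent=0.3):
    """Loudness ratio predicted by L ∝ I**exponent for a level change in dB."""
    intensity_ratio = 10.0 ** (delta_db / 10.0)
    return intensity_ratio ** exponent

for delta in (3, 10, 20):
    print(f"+{delta} dB -> x{loudness_ratio(delta):.2f} (textbook 0.3), "
          f"x{loudness_ratio(delta, 0.22):.2f} (classroom fit 0.22)")
```

With the textbook exponent a 10 dB step comes out very close to a doubling of loudness, while the shallower classroom fit predicts a smaller change for the same step.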
Loudness as a function of frequency
Fletcher-Munson equal-loudness curves
[Figure: equal-loudness contours, intensity / dB SPL (0 to 120) vs. freq / Hz (100 to 10,000); two side panels plot equivalent loudness @ 1 kHz vs. intensity / dB for 100 Hz and 1 kHz tones, annotated ‘rapid loudness growth’]
Loudness as a function of bandwidth
Same total energy, different distribution, e.g. 2 channels at −6 dB (not −10 dB)
[Figure: two spectra (mag vs. freq) with the same total energy I·B (I0 over B0, I1 over B1) ... but the wider one is perceived as louder; loudness vs. bandwidth B, with the ‘critical’ bandwidth marked]
Critical bands: independent frequency channels
  • ~25 total (4-6 / octave)
Simultaneous masking
A louder tone can ‘mask’ the perception of a second tone nearby in frequency:
[Figure: intensity / dB vs. log freq, showing the masking tone, the absolute threshold, and the masked threshold]
Suggests an ‘internal noise’ model:
[Figure: distributions p(x | I) and p(x | I+∆I) of the decision variable x, spread by internal noise σn]
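The internal-noise picture can be turned into a prediction for a forced-choice experiment. The sketch below assumes equal-variance Gaussian noise on the decision variable and a listener who picks the interval with the larger x; both are standard signal-detection assumptions rather than anything stated on the slide, and the function names are illustrative.

```python
import math

def dprime(delta_i, sigma_n):
    """Separation of the two distributions in units of the internal noise."""
    return delta_i / sigma_n

def p_correct_2ifc(delta_i, sigma_n):
    """P(correct) when the listener picks the interval with the larger x."""
    # The difference of two independent Gaussian samples has std sigma_n * sqrt(2).
    z = dprime(delta_i, sigma_n) / math.sqrt(2.0)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for delta in (0.0, 0.5, 1.0, 2.0):
    print(f"dI = {delta:.1f} sigma -> P(correct) = {p_correct_2ifc(delta, 1.0):.3f}")
```

An increment equal to one noise standard deviation gives roughly 76% correct under these assumptions, which is why thresholds are often quoted near that performance level.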
Sequential masking
Backward/forward in time:
[Figure: intensity / dB vs. time, showing the masker envelope and the masked threshold: simultaneous masking ~10 dB, backward masking ~5 ms, forward masking ~100 ms]
→ Time-frequency masking ‘skirt’:
[Figure: intensity over time and freq, showing the masking tone and the masked threshold around it]
What we do and don’t hear
“Two-interval forced-choice”: A, B and X are presented in sequence; is X = A or B?
Timing: 2 ms attack resolution, 20 ms discrimination
  • but: spectral splatter
Tuning: ~1% discrimination
  • but: beats
Spectrum: profile changes, formants
  • variable time-frequency resolution
Harmonic phase?
Noisy signals & texture
(Trace vs categorical memory)
Pitch perception: a classic argument in psychophysics
Harmonic complexes are a pattern on AN
  • but give a fused percept (ecological)
[Figure: AN response to a harmonic complex; freq. chan. (10 to 70) vs. time / s (0 to 0.1)]
What determines the pitch percept?
  • not the fundamental
How is it computed?
Two competing models: place and time
Place model of pitch
AN excitation pattern shows individual peaks
‘Pattern matching’ method to find pitch
Correlate with harmonic ‘sieve’:
[Figure: AN excitation vs. frequency channel, with resolved low harmonics and broader HF channels that cannot resolve harmonics; pitch strength vs. frequency channel from the sieve]
Support: Low harmonics are very important
But: Flat-spectrum noise can carry pitch
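A toy version of the harmonic sieve illustrates the pattern-matching idea. It takes a list of resolved peak frequencies and scores each candidate f0 by how well the peaks fall on its harmonics. The 3% tolerance, the triangular fit weighting, and the 0.1-per-harmonic penalty against subharmonics are arbitrary choices for this sketch, not values from the lecture.

```python
import numpy as np

def sieve_pitch(peaks_hz, candidates_hz, tol=0.03):
    """Return the candidate f0 whose harmonic sieve best fits the resolved peaks."""
    peaks = np.asarray(peaks_hz, dtype=float)
    best_f0, best_score = None, -np.inf
    for f0 in candidates_hz:
        n = np.maximum(np.round(peaks / f0), 1.0)        # nearest harmonic number
        dev = np.abs(peaks - n * f0) / (tol * n * f0)    # mistuning relative to slot width
        fit = np.clip(1.0 - dev, 0.0, None).sum()        # 1 for a perfect hit, 0 outside the slot
        # Penalise sieves that need high harmonic numbers to reach the peaks,
        # which would otherwise let subharmonics (f0/2, f0/3, ...) tie.
        score = fit - 0.1 * n.max()
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# Missing fundamental: harmonics 2, 3, 4 of 200 Hz, with no energy at 200 Hz itself.
peaks = [400.0, 600.0, 800.0]
candidates = np.arange(50.0, 500.0, 1.0)
f0 = sieve_pitch(peaks, candidates)
print(f"sieve pitch: {f0:.0f} Hz")
```

The example recovers a pitch that is not among the input peaks, which matches the observation that the pitch percept is not determined by the fundamental alone.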
Time model of pitch
Timing information is preserved in AN down to ∼1 ms scale
Extract periodicity by e.g. autocorrelation and combine across frequency channels
[Figure: per-channel autocorrelation (freq vs. lag / ms, 0 to 30) is combined into a summary autocorrelation whose peak marks the common period (pitch)]
But: HF gives weak pitch (in practice)
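The time model can be sketched as per-channel autocorrelation followed by a sum across channels. The version below is deliberately minimal: it uses two brick-wall channels chosen by hand, a half-wave rectifier as a hair-cell stand-in, and a 2.5 ms minimum lag for the peak search. All of these are assumptions of the example, and `summary_autocorrelation` is a hypothetical helper name.

```python
import numpy as np

def summary_autocorrelation(x, sr, band_edges_hz, max_lag_ms=30.0):
    """Sum normalised per-channel autocorrelations over a few band-pass channels."""
    n = len(x)
    max_lag = int(round(max_lag_ms * 1e-3 * sr))
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    summary = np.zeros(max_lag + 1)
    for lo, hi in band_edges_hz:
        mask = (freqs >= lo) & (freqs < hi)        # brick-wall channel (a simplification)
        band = np.fft.irfft(spectrum * mask, n=n)
        band = np.maximum(band, 0.0)               # half-wave rectify
        band = band - band.mean()
        # Autocorrelation via the power spectrum, zero-padded to avoid wrap-around.
        power = np.abs(np.fft.rfft(band, n=2 * n)) ** 2
        acf = np.fft.irfft(power)[: max_lag + 1]
        if acf[0] > 0:
            summary += acf / acf[0]
    return summary

# Harmonics 3, 4, 5 of 200 Hz: the fundamental itself is absent.
sr = 16000
t = np.arange(int(round(0.1 * sr))) / sr
x = sum(np.sin(2 * np.pi * 200.0 * k * t) for k in (3, 4, 5))
sacf = summary_autocorrelation(x, sr, [(500, 900), (900, 1300)])
min_lag = int(round(2.5e-3 * sr))                  # ignore the trivial peak near lag 0
lag = min_lag + int(np.argmax(sacf[min_lag:]))
pitch_hz = sr / lag
print(f"common period: {1000.0 * lag / sr:.1f} ms, pitch: {pitch_hz:.0f} Hz")
```

The channel holding only the 1 kHz component has peaks every 1 ms, but only the lag shared with the other channel survives the summation, which is the sense in which the summary picks out the common period.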
Alternate & competing cues
Pitch perception could rely on various cues
  • average excitation pattern
  • summary autocorrelation
  • more complex pattern matching
Relying on just one cue is brittle
  • e.g. missing fundamental
→ Perceptual system appears to use a flexible, opportunistic combination
Optimal detector justification?
$\arg\max_\theta p(\theta \mid x) = \arg\max_\theta p(x \mid \theta)\,p(\theta) = \arg\max_\theta p(x_1 \mid \theta)\,p(x_2 \mid \theta)\,p(\theta)$
  • if $x_1$ and $x_2$ are conditionally independent
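The factorised MAP rule above amounts to adding log-likelihoods from each cue before taking the maximum. The sketch below shows this for three candidate pitches; the likelihood values are invented purely to illustrate two cues that are each ambiguous on their own, and `map_pitch` is a hypothetical name.

```python
import numpy as np

def map_pitch(candidates_hz, loglik_place, loglik_time, log_prior=None):
    """MAP estimate from two cues assumed conditionally independent given pitch."""
    total = np.asarray(loglik_place, dtype=float) + np.asarray(loglik_time, dtype=float)
    if log_prior is not None:
        total = total + np.asarray(log_prior, dtype=float)
    return candidates_hz[int(np.argmax(total))]

candidates = [100.0, 200.0, 400.0]
# Made-up likelihoods: the place cue cannot separate 200 from 400 Hz,
# and the time cue cannot separate 100 from 200 Hz.
loglik_place = np.log([0.10, 0.45, 0.45])
loglik_time = np.log([0.45, 0.45, 0.10])
estimate = map_pitch(candidates, loglik_place, loglik_time)
print(f"combined estimate: {estimate:.0f} Hz")
```

Neither cue alone prefers 200 Hz, yet the product of the two does, which is one way to read the claim that a combination of cues is less brittle than any single one.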
Speech perception
Highly specialized function
  • subsequent to source organization?
  ... but also can interact
Kinds of speech sounds
[Figure: spectrogram (freq / Hz, 0 to 4000, vs. time / s, 1.4 to 2.6; level / dB) of the words “has a watch thin as a dime”, with segments labeled stop burst, fricative, vowel, nasal, glide]
Cues to phoneme perception
Linguists describe speech with phonemes
[Figure: the words “has a watch thin as a dime” aligned with their phoneme labels]
Acoustic-phoneticians describe phonemes by
• formants & transitions
• bursts & onset times
[Figure: schematic spectrogram (freq vs. time) marking vowel formants, transition, stop burst, voicing onset time]
Categorical perception
(Some) speech sounds perceived categorically rather than analogically
  • e.g. stop-burst and timing:
[Figure: schematic spectrogram (freq vs. time) of a stop burst at frequency f_b followed by vowel formants; map of the perceived stop (P, T or K) as a function of burst freq f_b / Hz (1000 to 4000) and following vowel]
  • tokens within category are hard to distinguish
  • category boundaries are very sharp
Categories are learned for native tongue
  • “merry” / “Mary” / “marry”
Where is the information in speech?
‘Articulation’ of high/low-pass filtered speech:
[Figure: articulation / % (20 to 80) vs. freq / Hz (1000 to 4000) for high-pass and low-pass filtered speech; the two curves sum to more than 1...]
Speech message is highly redundant
  • e.g. constraints of language, context
→ listeners can understand with very few cues
Top-down influences: Phonemic restoration (Warren, 1970)
What if a noise burst obscures speech?
[Figure: spectrogram, freq / Hz (0 to 4000) vs. time / s (1.4 to 2.6)]
auditory system ‘restores’ the missing phoneme
  ... based on semantic content
  ... even in retrospect
Subjects are typically unaware of which sounds are restored
A predisposition for speech: Sinewave replicas
Replace each formant with a single sinusoid (Remez et al., 1981)
[Figure: two spectrograms, freq / Hz (0 to 5000) vs. time / s (0.5 to 3), labeled “Speech” and “Sines”]
speech is (somewhat) intelligible
people hear both whistles and speech (“duplex”)
processed as speech despite un-speech-like
What does it take to be speech?
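The sinewave-replica construction (one sinusoid per formant) is easy to sketch once formant tracks are available. The code below does only the synthesis step; the two gliding tracks and their amplitudes are invented for the example, whereas a real replica would take them from formant analysis of recorded speech. The name `sinewave_replica` is illustrative.

```python
import numpy as np

def sinewave_replica(formant_hz, formant_amp, sr):
    """Sum one sinusoid per formant track; tracks have shape (n_formants, n_samples)."""
    formant_hz = np.asarray(formant_hz, dtype=float)
    formant_amp = np.asarray(formant_amp, dtype=float)
    # Integrate instantaneous frequency to get phase, so glides stay smooth.
    phase = 2.0 * np.pi * np.cumsum(formant_hz, axis=1) / sr
    return np.sum(formant_amp * np.sin(phase), axis=0)

sr = 8000
n = sr // 2
# Hypothetical tracks: F1 gliding 300 -> 700 Hz, F2 gliding 2200 -> 1200 Hz.
tracks = np.vstack([np.linspace(300, 700, n), np.linspace(2200, 1200, n)])
amps = np.vstack([np.full(n, 1.0), np.full(n, 0.5)])
y = sinewave_replica(tracks, amps, sr)
print(y.shape, float(np.max(np.abs(y))))
```

Accumulating phase rather than evaluating sin(2πft) directly matters here: with a time-varying f the direct form would produce the wrong instantaneous frequency during glides.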
Simultaneous vowels
Mix synthetic vowels with different f0s
[Figure: spectra (dB vs. freq) of /iy/ @ 100 Hz + /ah/ @ 125 Hz = mixture]
Pitch difference helps (though not necessarily)
[Figure: DV identification vs. ∆f0 (200 ms) (Culling & Darwin 1993); % both vowels correct (25 to 75) vs. ∆f0 (semitones: 0, 1/4, 1/2, 1, 2, 4)]
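To relate the example f0s to the semitone axis of the identification plot, the interval between two frequencies can be computed as 12 times the base-2 log of their ratio. A small sketch, with `semitones` as an illustrative helper name:

```python
import math

def semitones(f_low_hz, f_high_hz):
    """Musical interval between two frequencies, in equal-tempered semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)

delta = semitones(100.0, 125.0)
print(f"/iy/ @ 100 Hz vs /ah/ @ 125 Hz: {delta:.2f} semitones")
```

The 100 Hz and 125 Hz vowels in the example are therefore a little under 4 semitones apart, near the upper end of the plotted range.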
Computational models of speech perception
Various theoretical-practical models of speech comprehension, e.g.
Speech → Phoneme recognition → Lexical access → Grammar constraints → Words
Open questions:
  • mechanism of phoneme classification
  • mechanism of lexical recall
  • mechanism of grammar constraints
ASR is a practical implementation (?)
Auditory organization
The detection model is a huge simplification
The real role of hearing is much more general: recover useful information from the outside world
→ Sound organization into events and sources
[Figure: spectrogram, frq / Hz (0 to 4000) vs. time / s (0 to 4), with events labeled Voice, Stab, Rumble]
Research questions:
  • what determines perception of sources?
  • how do humans separate mixtures?
  • how much can we tell about a source?
Auditory scene analysis: simultaneous fusion
Harmonics are distinct on AN, but perceived as one sound (“fused”)
[Figure: schematic spectrogram (freq vs. time) of a harmonic complex]
  • depends on common onset
  • depends on harmonicity (common period)
Methodologies:
  • ask subject how many ‘objects’
  • match attributes e.g. object pitch
  • manipulate high level e.g. vowel identity
Sequential grouping: streaming
Pattern / rhythm: property of a set of objects
  • subsequent to fusion ∵ employs fused events?
[Figure: tone sequence, frequency vs. time, annotated “–2 octaves”, “TRT: 60-150 ms”, “∆f: 1 kHz”]
Measure by relative timing judgments
  • cannot compare between streams
Separate ‘coherence’ and ‘fusion’ boundaries
Can interact and compete with fusion
Continuity and restoration
Tone is interrupted by noise burst: what happened?
[Figure: schematic spectrograms (freq vs. time) of a tone interrupted by a noise burst, with alternative decompositions of what happened during the noise]
  • masking makes tone undetectable during noise
Need to infer most probable real-world events
  • observation equally likely for either explanation
  • prior on continuous tone much higher ⇒ choose
Top-down influence on perceived events...
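The inference described above is a two-hypothesis application of Bayes' rule. In the sketch below the likelihoods and the prior are placeholder numbers chosen only to show the mechanics: when the observation is equally likely under both explanations, the posterior simply equals the prior.

```python
def posterior_continuous(lik_continuous, lik_interrupted, prior_continuous):
    """Posterior probability that the tone continued through the noise burst."""
    num = lik_continuous * prior_continuous
    den = num + lik_interrupted * (1.0 - prior_continuous)
    return num / den

# Placeholder values: the masked observation cannot tell the two events apart,
# but continuous tones are assumed to be far more common a priori.
p = posterior_continuous(lik_continuous=0.5, lik_interrupted=0.5, prior_continuous=0.9)
print(f"P(continuous | observation) = {p:.2f}")
```

If the noise were too quiet to mask the tone, the likelihood of the observation under the continuous hypothesis would drop and the same formula would favor the interrupted explanation.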
Models of auditory organization
Psychological accounts suggest bottom-up
[Block diagram: input mixture → Front end → signal features (maps: onset, period, frq. mod; freq vs. time) → Object formation → discrete objects → Grouping rules → Source groups. Brown and Cooke (1994)]
Complicated in practice
formation of separate elements
contradictory cues
influence of top-down constraints (context, expectations, . . . )
Summary
Auditory perception provides the ‘ground truth’ underlying audio processing
Physiology specifies information available
Psychophysics measures basic sensitivities
Sound sources require further organization
Strong contextual effects in speech perception
[Block diagram: Sound → Transduce → Multiple represent'ns → Scene analysis → High-level recognition]
Parting thought
Is pitch central to communication? Why?
References
Richard M. Warren. Perceptual restoration of missing speech sounds. Science, 167(3917):392–393, January 1970.
R. E. Remez, P. E. Rubin, D. B. Pisoni, and T. D. Carrell. Speech perception without traditional speech cues. Science, 212(4497):947–949, May 1981.
G. J. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech & Language, 8(4):297–336, 1994.
Brian C. J. Moore. An Introduction to the Psychology of Hearing. Academic Press, fifth edition, April 2003. ISBN 0125056281.
James O. Pickles. An Introduction to the Physiology of Hearing. Academic Press, second edition, January 1988. ISBN 0125547544.