How Does Auditory Perceptual Organization Work?
by Elvira Perez and Georg Meyer
Dept. Psychology, Liverpool University, UK
Hoarse Meeting, Chrysler Ulm, Germany, 28th-30th October, 2004
1. Introduction:
Ears receive mixtures of sounds.
We can tolerate surprisingly high levels of noise and still orient our attention to whatever we want to attend to.
But how can the auditory system do this so accurately?
• Auditory scene analysis (Bregman, 1990) is a theoretical framework that aims to explain auditory perceptual organisation.
• Basics:
– The environment contains multiple objects.
• Decomposition into its constituent elements.
• Grouping.
• It proposes two grouping mechanisms:
– 1. ‘Bottom-up’: Primitive cues (F0, intensity, location). Grouping mechanism based on Gestalt principles.
– 2. ‘Top-down’: Schema-based (speech pattern matching).
• Primitive process (Gestalt; Koffka, 1935):
– Similarity
– Good continuation
– Common fate
– Disjoint locations
– Closure
• Criticisms: Too simplistic. Whatever cannot be explained by the primitive processes is attributed to the schema-based processes.
• Primitive processes only work in the lab.
• Sine-wave replicas of utterances (Remez et al., 1992):
– Phonetic principles of organization find a single speech stream, whereas auditory principles find several simultaneous whistles.
– Grouping by phonetic rather than by simple auditory coherence.
3. Experiments (baseline):
• The purpose of these studies is to explore how noise (a chirp) affects speech perception.
• The stimulus is a vowel-nasal syllable that is perceived as /en/ when presented in isolation, but as /em/ when presented with a frequency-modulated sine wave in the position where the second formant transition would be expected.
• In the three experiments, participants categorised the synthetic syllable they heard as /em/ or /en/.
• The direction, duration, and position of the chirp were the manipulated variables.
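The percept of the nasal changes from /n/ to /m/ when a chirp is added between the vowel and the nasal F2.
[Figure: schematic spectrogram of the stimulus. Formant frequency (Hz) axis marked at 2700, 2000, 800 and 375; time axis spans the vowel then the nasal, marked at 100 and 200 ms.]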
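The chirp added to the syllable is described as a frequency-modulated sine wave. A minimal Python sketch of one way to synthesise such a chirp follows. The linear sweep, the 44.1 kHz sample rate, and the function name `linear_chirp` are assumptions made for illustration; the slides do not state how the stimuli were generated.

```python
import math


def linear_chirp(f_start_hz, f_end_hz, duration_s, sample_rate_hz=44100):
    """Return samples of a sine wave whose frequency sweeps linearly.

    Assumption: a linear sweep; the original stimuli may have used
    a different modulation shape.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    sweep_rate = (f_end_hz - f_start_hz) / duration_s  # Hz per second
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        # Phase is the integral of the instantaneous frequency.
        phase = 2.0 * math.pi * (f_start_hz * t + 0.5 * sweep_rate * t * t)
        samples.append(math.sin(phase))
    return samples


# Hypothetical example values: a 20 ms up chirp and a 20 ms down chirp.
up_chirp = linear_chirp(1000.0, 2000.0, 0.020)
down_chirp = linear_chirp(2000.0, 1000.0, 0.020)
```

Swapping the start and end frequencies gives the down chirp, and changing `duration_s` gives the duration manipulation.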
Experiment 1 Baseline/Direction
[Figure: stimulus schematic (vowel then nasal, time marks at 100 and 200 ms) with an up chirp and a down chirp; plot of p(m), 0.0 to 1.0, by condition: ctrl, chirp up, chirp dn.]
In 80% of the trials, participants heard the difference between the up and down chirps.
Experiment 2 Duration
[Figure: stimulus schematic (vowel then nasal, time marks at 100 and 200 ms); plot of p(m), 0.0 to 1.0, against chirp duration (ms), 0 to 30, with separate curves for the down chirp and the up chirp.]
Experiment 3 Position
[Figure: stimulus schematic (vowel then nasal, time marks at 100 and 200 ms); plot of p(m), 0.0 to 1.0, against chirp position relative to midpoint (ms): -100, -50, -25, -10, 0, 10, 25, 50, 100, plus ctrl.]
5. Conclusions:
Chirps of 4 ms to 20 ms duration in the 1 kHz to 2 kHz range are apparently integrated into the speech signal, independently of their direction, and change the percept from /en/ to /em/.
Subjects very clearly hear two objects, so some scene analysis is taking place: the chirp is not integrated completely into the speech.
Duplex perception with one ear.
Listeners also seem able to discriminate the direction of the chirp when they focus their attention on the chirp and a higher level of auditory processing takes place (80% accuracy).
Mr. Background Noise
• Do human listeners actively generate a representation of background noise to improve speech recognition?
• Hypothesis: Recognition performance should be highest if the spectral and temporal structure of the interfering noise is regular, so that a good noise model can be generated, compared with unpredictable noise.
Experiment 4 & 5
• Stimuli: chirps + /en/
• Ten subjects
• The amplitude of the chirp varies (5 conditions: 0 dB, -8 dB, -14 dB, -20 dB, no-chirp)
• Background noise (down chirps):
– Quantity: Lots (170/20ms) vs Few (19/20ms)
– Time of appearance: Regular vs Irregular
• Categorization task 2FC.
• Threshold shifts.
[Figure: sequences of /en/ tokens illustrating the regular condition and the irregular condition.]
[Figure: probability of hearing /m/ (0.0 to 1.0) against chirp amplitude (0.0 to 1.0), with the threshold marked.]
Each point in the scatter is the mean threshold over all subjects for a given session. The solid lines show the Boltzmann fit (Eq. (1)) for each individual subject in the five different conditions. All the fits have the same upper and lower asymptotes.
$$ y = \frac{A_1 - A_2}{1 + e^{(x - x_0)/dx}} + A_2 \qquad (1) $$
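The Boltzmann function used for the psychometric fits can be written out as a short Python sketch. The form follows the equation reported with the fit output later in the slides, y = A2 + (A1-A2)/(1 + exp((x-x0)/dx)). The function names and the example parameter values are hypothetical, chosen only for illustration.

```python
import math


def boltzmann(x, a1, a2, x0, dx):
    """Boltzmann sigmoid: tends to a1 for small x and a2 for large x (dx > 0)."""
    z = (x - x0) / dx
    if z > 700:  # math.exp would overflow; the fraction is effectively zero
        return a2
    return a2 + (a1 - a2) / (1.0 + math.exp(z))


def level_for_probability(p, a1, a2, x0, dx):
    """Invert the sigmoid: the stimulus level at which the fit predicts p."""
    return x0 + dx * math.log((a1 - a2) / (p - a2) - 1.0)


# Hypothetical parameters: asymptotes at 0 and 1, midpoint at 0.5.
midpoint_value = boltzmann(0.5, 0.0, 1.0, 0.5, 0.1)
threshold_50 = level_for_probability(0.5, 0.0, 1.0, 0.5, 0.1)
```

With asymptotes fixed, as the caption describes, the midpoint parameter x0 is the level at which the fitted curve lies halfway between them, so it can serve as the threshold estimate.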
Exp. 4 rand/reg
[Figure: threshold for /m/ (-0.4 to 1.0) by condition: control, fewrand, fewreg, lotsrand, lotsreg.]
lots vs. few (t = -3.34, df = 38, p = 0.001).
control vs. lots (t = -3.34, df = 38, p = 0.001).
No effect between irregular and regular.
• Two aspects change from Exp. 4 to Exp. 5:
– Amplitude scale of the chirps (0 dB, -4 dB, -8 dB, -16 dB, no-chirp).
– The ‘lots’ condition now uses 100/20’’ instead of the previous 170/20’’.
Exp. 5 rand/reg
[Figure: threshold for /m/ (0.0 to 0.7) by condition: control, fewrand, fewreg, lotsrand, lotsreg.]
lots vs. few (t = 2.27, df = 38, p = 0.05).
control vs. lots (t = 3.12, df = 38, p < 0.05).
No effect between irregular and regular.
5. Conclusions
Only the amount of background noise seems to affect performance in the recognition task.
The regularity of the background noise seems to be irrelevant as a cue for improving auditory stream segregation and, therefore, speech perception.
Counterintuitive phenomenon.
• Irrelevant sound effect (ISE) (Colle & Welsh, 1976): irrelevant sound disrupts serial recall.
• The level of meaning (reverse vs forward speech), the predictability of the sequence (random vs regular), and the similarity (semantic or physical) of the IS to the target material seem to have little impact on the focal task (Jones et al., 1990).
• Changing state: The degree of variability or physical change within an auditory stream is the primary determinant of the degree of disruption in the focal task.
[Figure: frequency against time for two panels, smooth change and abrupt change, with a zoomed-in panel labelled Up, Bottom, Top and Down.]
Experiment 6
• Stimuli: Synthesised vowel-nasal + background FM tone + Chirps
• Three blocks (200 trials per block): Control first, then Smooth and Abrupt (counterbalanced order)
• Chirps: Four different frequencies: Up/Down/Top/Bottom
• Five amplitudes: 0 dB, -4 dB, -8 dB, -16 dB, no-chirp
• Subjects: 42
• Musicians vs Non-musicians
• Female vs Male
• Nationality (27)
• Age (27.7), languages spoken (3)
• Hearing issues (AP, RP, Tinnitus)
Results
Tukey test p < 0.001
down vs. up: YES
control vs. smooth: YES
control vs. abrupt: YES
No effect between smooth and abrupt.
[Figure: dB for P /m/ (-70 to 50) by condition: C-D, S-D, A-D, C-UP, S-UP, A-UP; data annotated 1 to 6 and A to F.]
Results Musicians vs. Non Musicians
[Figure: dB for P /m/ (-70 to 50) by condition: C-D-M, C-D-NM, S-D-M, S-D-NM, A-D-M, A-D-NM, C-UP-M, C-UP-NM, S-UP-M, S-UP-NM, A-UP-M, A-UP-NM.]
…no differences.
More analysis…
• Take away the intermediate conditions
– Results remain the same
• Habituation
– Differences between musicians and non-musicians; overall, 10, 5, 3 first blocks, but not the first block.
– Again control vs smooth/abrupt
6. Conclusions
• It seems that listeners do not use pattern prediction as a cue for auditory perceptual organisation.
• … or they do it extremely fast (3 ms), or it is due to STM (pre-perceptual auditory storage)
• Attention must be focused on an object (background noise) for a change in that object to be detected (Rensink et al., 1997)
• … or we just ignore the information contained in the transitions because it is not reliable.
[Figure: four schematic spectrograms of the stimulus (formant frequency in Hz, marked at 2700, 2000, 800 and 375; vowel then nasal). One panel is labelled ‘Conflict area’ and one ‘Ignore is not reliable’; the other two are each marked ‘= /em/’.]
• Which information are we taking into account?
• Do we combine cues?
• Do we measure the variance associated with each cue to test its reliability?
• Maximum likelihood integrator
• Antecedents: The nervous system seems to combine visual and haptic information in a way similar to an MLE (Ernst et al., 2002)
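The idea of a maximum likelihood integrator can be illustrated with a small Python sketch. It uses the standard inverse-variance weighting rule for independent Gaussian cues, in which each cue is weighted by its reliability (1/variance). The function name, the two cue labels, and all numbers are hypothetical and are not taken from the experiments.

```python
def mle_combine(estimates, variances):
    """Combine independent cue estimates by inverse-variance weighting.

    Returns (combined_estimate, combined_variance, weights).
    Assumes independent Gaussian noise on each cue.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    return combined, 1.0 / total, weights


# Hypothetical example: the transition cue points to /m/ (coded 1.0) but is
# noisy; the nasal cue points to /n/ (coded 0.0) and is more reliable.
combined, combined_var, weights = mle_combine([1.0, 0.0], [4.0, 1.0])
```

In this example the unreliable transition cue receives a weight of about 0.2, so the combined estimate stays close to the nasal cue. That is one way to express ignoring the transition when it is not reliable.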
[Figure: conditions, each drawn as vowel, transition and nasal segments with ‘noise’ labels: Full syllable, No transition, No nasal, No transition/nasal.]
Preliminary results
[Figure: response (0.0 to 1.0; y-axis left untitled in the original) against noise level (0 to 4), with Boltzmann fits.]
Data: Data2_C, Data2_B
Model: Boltzmann
Equation: y = A2 + (A1-A2)/(1 + exp((x-x0)/dx))
Weighting: y(1) No weighting; y(2) No weighting
Chi^2/DoF = 0.003
R^2 = 0.99546
A1 = 0.15 ± 0.03873
A2 = 1 ± 0.03873
x0 = 1.9523 ± --
dx = 0.02368 ± --
A1_2 = -0.05798 ± 0.11851
A2_2 = 0.72447 ± 0.05901
x0_2 = 1.43035 ± 0.26337
dx_2 = 0.57667 ± 0.23807
New Methodology
• Until now: Method of Constants. Several stimulus levels are chosen beforehand, and groups of observations are placed at each of these stimulus levels. The order of observations is randomized. A conventional method of estimation is used in fitting the psychometric function to the resulting data.
• Adaptive procedures: The stimulus level on any one trial is determined by the preceding stimuli and responses.
– Sequential experiment: The course of the experiment is dependent on the experimental data.
• Up-Down procedures or staircase method: The stimulus level (amplitude of the speech signal) is decreased after a positive response (or increased after a negative one).
• On each trial, the participant is required to give both a binary judgment (em/en/e) and a confidence rating. The binary judgments are used to decide on the direction of change in the stimulus level, and the confidence ratings are used to decide on the step size (dB).
• Advantages: Most of the observations are placed at or near the 50% level.
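The up-down procedure with confidence-dependent step sizes can be sketched in Python as follows. The slides state only that the judgment sets the direction and the confidence rating sets the step size. The specific rule here (step = base step times a confidence rating of 1 to 3), the 2 dB base step, and the simulated listener are all assumptions made for illustration.

```python
def next_level(level_db, positive_response, confidence, base_step_db=2.0):
    """One staircase update: the response sets direction, confidence sets step size.

    Assumption: higher confidence (e.g. 1 to 3) means a larger step.
    """
    step_db = base_step_db * confidence
    if positive_response:
        return level_db - step_db  # decrease after a positive response
    return level_db + step_db      # increase after a negative response


def run_staircase(listener, start_db, n_trials, base_step_db=2.0):
    """Run the staircase and return the level presented on each trial."""
    level_db = start_db
    presented = []
    for _ in range(n_trials):
        presented.append(level_db)
        positive_response, confidence = listener(level_db)
        level_db = next_level(level_db, positive_response, confidence, base_step_db)
    return presented


def deterministic_listener(level_db, threshold_db=5.0):
    """Hypothetical listener: responds positively above a fixed threshold,
    always with the lowest confidence rating."""
    return level_db > threshold_db, 1


levels = run_staircase(deterministic_listener, start_db=20.0, n_trials=20)
```

After the initial descent, the presented levels alternate just above and below the simulated threshold, which is the sense in which most observations end up near the 50% point.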
Conclusions
• …In the next meeting.
Thank you