Auditory Scene Analysis: phenomena, theories and computational

ASA - Dan Ellis 1998jul11 - 1

Auditory Scene Analysis:phenomena, theories

and computational models

July 1998

Dan EllisInternational Computer Science Institute, Berkeley CA

<[email protected]>

Outline

The computational theory of ASA

Cues & grouping

Expectations & inference

Big issues

1

2

3

4


Auditory Scene Analysis

What does our sense of hearing do?

- recover useful information ... about objects of interest... in a wide range of circumstances

Measuring objects in an auditory scene:


Subjective analysis of auditory scenes

• Subjects identify structures in dense scenes with high agreement

S3−crash (not car)

S3−closeup car

S3−1st horn

S3−2nd horn

S1−slam

S4−horn1

S4−crash

S5−Honk

S5−Trash can

S5−Acceleration

S6−double horn

S1−honk, honk

S6−slam

S6−doppler horn

S6−acceleration

S7−gunshot

S7−hornS7−horn2

S7−horn3

S7−horn4

S8−car horns

S8−car horns

S8−car horns

S8−large object crash

S8−truck engine

S1−rev up/passing

S9−horn 2

S9−horn 3

S9−door Slam?

S9−horn 5

S10−car horn

S10−car horn

S10−door slamming

S10−wheels on road

S2−first double horn

S2−crash

S2−horn during crash

S2−truck accelerating

200

400

1000

2000

4000

f/Hz City

0 1 2 3 4 5 6 7 8 9

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)


Outline


- ASA and CASA- The grouping paradigm- Marr’s three levels of explanation

Cues & grouping


Big issues

1

2

3

4


Auditory Scene Analysis (ASA)

“The organization of sound scenes according to their inferred sources”

• Real-world sounds rarely occur in isolation

→

a useful sense of hearing must be able to segregate mixtures

- people (and ...) do this very well;unexpectedly difficult to model

- depends on: subjective definition of relevant sources regularity/constraints of real-world sounds

• Studied via experimental psychology

- characterize ‘rules’ for organizing simple pieces(tones, noise bursts, clicks)i.e. ‘reductive’ approach


Computational Auditory Scene Analysis(CASA)

• Psychological ‘rules’ suggest computer implementation

- .. but many practical problems arise!

• Motivations:Practical applications

- real-world interactive systems- indexing of media databases- hearing prostheses

Crossover opportunities

- unknown signal/information processing principles?

Benefits for theory

- implementations are very revealing


The grouping paradigm

• Standard theory of ASA (Bregman, Darwin &c):

- sound mixture is broken up into small elements e.g. time-frequency ‘cells’

- each element has a number of feature dimensions (amplitude, ITD, period)

- elements are grouped together according to their features to form larger structures

- resulting groups have overall attributes (pitch, location)

(from Darwin 1996)


Marr’s levels-of-explanationof information processing

• Three distinct aspects to info. processing

Why bother?

- to help organize understanding- avoid confusion/wasted effort

→

use as an analysis tool...

Computational Theory

‘what’ and ‘why’; the overall goal

Sound source

organization

Algorithm

‘how’; an approach to meeting the goal

Auditory grouping

Implementation

practical realization of the

process.

Feature calculation &

binding


Level 1: Computational theory

• The underlying regularities that make the problem possible

- i.e. the ‘ecological’ facts

• Implicit definition of “what is a source?”: Independence

of attributes between sources

Continuity

of attributes for each source

+ other source-specific constraints


Level 2: Algorithm

• A particular approach to exploiting the constraints of the computational theory

- both process & representation

• Audition: the “elements-then-grouping” approach

- could have been otherwise e.g. templates

• Often the focus of analysis

- but: debate is muddled without a clear computational theory


Level 3: Implementation

• A specific realization of the algorithm

- computer programs- neurons- ...

• Can be analyzed separately?

- provided epiphenomena are correctly assigned

• Needs context of algorithm , computational theory

“You cannot understand stereopsis simply by thinking about neurons”


The advantage of the appropriate level

• Computational theory

- determines the purpose of the process;provides focus necessary for analysis

e.g. biosonar: benefit of hyperresolution

• Algorithm

- abstraction that is still specific, transferablee.g. autocorrelation for pitch

• Implementation

- explain ‘epiphenomena’e.g. ‘subjective octave’ from refractory period


An example: Neural inhibition

Computational theory

Frequency-domain

processing

Algorithm

Discrete-time filtering

(subtraction)

Implementation

Neurons with GABAergic inhibitions

f

X(f)


Summary

• Acoustic scenes are very complex

• .. but the auditory system extracts useful information

• Grouping is the main focus of Auditory Scene Analysis

• .. but it fits into a larger Marrian framework

1


Outline


Cues & grouping

- Cue analysis- Simple scenes- Models- Complications: interaction, ambiguity, time


Big issues

1

2

3

4


Cues to grouping

• Common onset/offset/modulation (“fate”)

• Common periodicity (“pitch”)

• Spatial location (ITD, ILD, spectral cues)

• Sequential cues...

• Source-specific cues...

Common onset Periodicity


Acoustic consequences tend to be synchronized

(Nonlinear) cyclic processes are

common

Algorithm

Group elements that start in a time range

? Place patterns? Autocorrelation

Implementation

Onset detector cellsSynchronized osc’s?

? Delay-and-mult? Modulation spect


Simple grouping

• E.g. isolated tones


• common onset• common period (harmonicity)

Algorithm

• locate elements (tracks)• group by shared features

Implementation

? exhaustive search• evolution in time

time

freq


Computer models of grouping

• “Bregman at face value” (e.g. Brown 1992):

- feature maps- periodicity cue- common-onset boost- resynthesis

inputmixture

signalfeatures

(maps)

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

onset

period

frq.mod

time

freq


Grouping model results

• Able to extract voiced speech:

• Periodicity is the primary cue

- how to handle aperiodic energy?

• Limitations

- resynthesis via filter-mask-

only

periodic targets- robustness of discrete objects

0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300400

600

1000

15002000

3000

frq/Hzbrn1h.aif

0.2 0.4 0.6 0.8 1.0 time/s

100

150200

300400

600

1000

15002000

3000

frq/Hzbrn1h.fi.aif


Complications for grouping:1: Cues in conflict

• Mistuned harmonic (Moore, Darwin..):

- harmonic usually groups by onset & periodicity- can alter frequency and/or onset time- ‘degree of grouping’ from overall pitch match

• Gradual, various results:

- heard as separate tone, still affects pitch

time

freq

3%mistuning

pitch shift


Complications for grouping:2: The effect of time

• Added harmonics:

- onset cue initially segregates;periodicity eventually fuses

• The effect of time

- some cues take time to become apparent- onset cue becomes increasingly distant...

• What is the impetus for fission?

- e.g. double vowels- depends on what you expect .. ?

time

freq


Summary

• Known grouping cues make sense

• Simple examples are straightforward

• Models can be implemented directly

• .. but problematic situations abound

2


Outline


Cues & grouping

Expectations & inference- “Old-plus-new”- Streaming- Restoration & illusions- Top-down models

Big issues

1

2

3

4


The effect of context

• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation

• e.g. Bregman’s “old-plus-new” principle:A change in a signal will be interpreted as an added source whenever possible

- a different division of the same energy depending on what preceded it

+

time/s

freq/kHz

0.0 0.4 0.8

1

2

1.20


Streaming

• Successive tone events form separate streams

• Order, rhythm &c within , not between , streams


Consistency of properties for successive source events

Algorithm• ‘expectation window’ for known

streams (widens with time)

Implementation• competing time-frequency

affinity weights...

±2 octaves

TRT: 60-150 ms

time

freq.

∆f:1 kHz


Restoration & illusions

• Direct evidence may be masked or distorted→make best guess using available information

• E.g. the ‘continuity illusion’:

- tones alternates with noise bursts- noise is strong enough to mask tone

... so listener discriminate presence- continuous tone distinctly perceived

for gaps ~100s of ms

→ Inference acts at low, preconscious level

1000

2000

4000

f/Hz ptshort

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4time/s


Speech restoration

• Speech provides very strong bases for inference (coarticulation, grammar, semantics):

1.2 1.3 1.4 1.5 1.6 1.7 time/s0

500

1000

1500

2000

2500

3000

3500frq/Hz

nsoffee.aif

5

10

15f/Bark S1−env.pf:0

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8

40

60

80

Temporal compound (1998jul10)

time / ms50 100 150 200 250 300 350 400 450 500 550

20

40

60

80

100

120

• Phonemic restoration

• Sinewave speech (duplex?)

• Temporal compounds


Models of top-down processing

Perception as a search for plausible explanations

• ‘Prediction-driven’ CASA (PDCASA):

• An approach as well as an implementation...

• Key features:- ‘complete explanation’ of all scene energy- vocabulary of periodic/noise/transient elements- multiple hypotheses- explanation hierarchy

inputmixture

signalfeatures

predictionerrors

hypotheses

predictedfeaturesFront end Compare

& reconcile

Hypothesismanagement

Predict& combinePeriodic

components

Noisecomponents


PDCASA for old-plus-new

• Incremental analysist1 t2 t3

Input signal

Time t1:initial element created

Time t2:Additional element required

Time t3:Second element finished


PDCASA for the continuity illusion

• Subjects hear the tone as continuous... if the noise is a plausible masker

• Data-driven analysis gives just visible portions:

• Prediction-driven can infer masking:

1000

2000

4000

f/Hz ptshort

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4i /


PDCASA analysis of a complex scene

−70

−60

−50

−40

dB

200400

100020004000

f/Hz Noise1

200400

100020004000

f/Hz Noise2,Click1

200400

100020004000

f/Hz City

0 1 2 3 4 5 6 7 8 9

50100200400

1000

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)

Squeal (6/10)

Horn4 (8/10)Horn5 (10/10)

0 1 2 3 4 5 6 7 8 9time/s

200400

100020004000

f/Hz Wefts1−4

50100200400

1000

Weft5 Wefts6,7 Weft8 Wefts9−12


Marrian analysis of PDCASA

• Marr invoked to separate high-level function from low-level details

“It is not enough to be able to describe the response of single cells, nor predict the results of psychophysical experiments.Nor is it enough even to write computer programs that perform approximately in the desired way: One has to do all these things at once, and also be very aware of the computational theory...”


• Objects persist predictably• Observations interact irreversibly

Algorithm• Build hypotheses from generic

elements• Update by prediction-reconciliation

Implementation ???


Summary

• Perceptual processing is highly context-dependent

• Auditory system will use prior knowledge to fill-in gaps (subconsciously)

• Prediction-reconciliation models can encompass this behavior

3


Outline


Cues & grouping


Big issues- the state of ASA and CASA- outstanding issues- discussion points

1

2

3

4


The current state of ASA and CASA

• ASA- detailed descriptions of “in vitro” tests- some quite subtle effects explained (DV beats)but: how to extend to complex scenarios?

• CASA- numerous models, some convergence

(mainly periodicity-based)- best results sound impressive

(least plausible systems!)- applications in speech recognition?but: domains limited, poor robustness


Big issues in CASA:

• Plausibility- correct level for human correspondence?- which phenomena are important to match?- how to implement symbolic-style processing?

• Top-down vs. bottom-up- different approaches to ambiguity, latency- how far down for top-down?- how far ‘up’ for high level?- choice between extraction & inference?

• Integrating multiple cues (e.g. binaural)

• Other debates:- what is the real goal?- resynthesis- evaluation


Big issues in ASA & CASA:

• Knowledge:how to acquire, represent & store ...- short-term: context- long-term: memories- abstract: classes, generalities

• Attention:- what does it mean in these models?- limitation or important principle?


Conclusions

• Real-world sounds are complex;scene-analysis is required

• We know certain cues & some rules, but real situations raise contradictions

• Current models handle ‘obvious’ cases;robustness & generality are hard

• Many issues remain


Discussion points

• Are Marr’s levels important? Useful?Can you study levels in isolation?

• What do restoration phenomena imply about internal representations?

• Do we have an adequate account of an ASA algorithm? e.g. where do hypotheses come from?

• How important/challenging are phenomena like duplex perception, sinewave speech etc.?

Documents

Auditory Scene Analysis: phenomena, theories and computational