
L C S

Informative Subspaces: High-level Function from Low-level Fusion

John Fisher

Oxygen Workshop, January, 2002

L C S Overview

• Motivation for early fusion of audio and video.
  – Why is it hard?
  – Why is it useful?
• A statistical model consistent with the approach.
• Information-theoretic perspective.
• Algorithmic description.
• Applications of the method:
  – Video localization of a speaker
  – Audio enhancement
  – Audio/video synchrony


L C S Perceptual Context

• Who is there? (presence, identity)
• Which person said that? (audiovisual grouping)
• Where are they? (location)
• What are they looking / pointing at? (pose, gaze)
• What are they doing? (activity)

L C S

[Cartoon: amid background chatter ("blah blah blah"), a user says "computer, show me the PUI presentation" and the device responds "Is that you talking?"]

A perceptual link between user and device:
• detect and recognize the user
• confirm that the utterance came from the user


L C S Information Loss

[Diagram: a common cause C drives the raw signals; video features and audio features are extracted and passed to decision logic.]

Data Processing Inequality: processing data can only destroy information. With X the raw signal, F_1 the extracted features, and F_2 the decisions,

$$I(C; F_1) \le I(C; X) \qquad\qquad I(C; F_2) \le I(C; F_1)$$
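To make the inequality concrete, here is a minimal numerical check on a toy discrete chain C → X → F; the distributions below are invented for illustration and are not from the talk.

```python
# Toy check of the data processing inequality on a discrete chain
# C -> X -> F. All distributions below are illustrative assumptions.
import numpy as np

def mutual_info(p_xy):
    """Exact I(X;Y) in nats from a joint probability table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return float((p_xy[m] * np.log(p_xy[m] / (px @ py)[m])).sum())

rng = np.random.default_rng(0)
p_c = np.array([0.5, 0.5])                          # cause C
p_x_given_c = rng.dirichlet(np.ones(4), size=2)     # signal X given C
p_f_given_x = rng.dirichlet(np.ones(3), size=4)     # feature F given X

p_cx = p_c[:, None] * p_x_given_c                   # joint p(C, X)
p_cf = p_cx @ p_f_given_x                           # joint p(C, F)

# I(C;F) can never exceed I(C;X): feature extraction only loses information.
print(mutual_info(p_cx), ">=", mutual_info(p_cf))
```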

L C S When should we "fuse" data?

[Diagram: video features and audio features feed decision logic; fusion can happen at three points along the pipeline.]

• Signal level: high dimension, complex joint statistics.
• Feature level: moderate dimension, complex joint statistics.
• Decision level: low dimension, discrete statistics.

• Data fusion implies estimating the joint statistics of two or more data sources.
• The point of "fusion" is dictated by a tradeoff between robustness and simplicity.


L C S A Compromise

[Diagram: video features and audio features are mapped into a common low-dimensional space; the density is estimated in that "feature" space.]

• How might we (implicitly) model the joint statistics?
• What objective criterion is appropriate for adapting the mapping from signals to features?

Feedback from an objective criterion on the joint statistics guides the audio/video feature extraction.

L C S Associative Modeling at the Signal Level

[Image pairs labeled "consistent" and "not consistent".]

Sounds and motions which are consistent may be attributed to a common cause…

Question: how do we quantify "consistent"?


L C S A Simple Independent Cause Model

[Graphical model: independent causes A, B, C; observation U depends on (A, B) and is mapped through g(U,α); observation V depends on (B, C) and is mapped through f(V,β). Example causes: a person speaking (the common cause), monitor flicker, fan noise.]

$$p(A,B,C,U,V) = p(A)\,p(B)\,p(C)\,p(U \mid A,B)\,p(V \mid B,C)$$
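A toy sampler makes the key property of this cause structure visible: U and V are dependent, but only through the common cause B. The linear-Gaussian instance below is my own illustration, not the talk's model.

```python
# Toy sampler for the cause model p(A)p(B)p(C)p(U|A,B)p(V|B,C):
# U mixes the common cause B with distractor A; V mixes B with distractor C.
# The linear-Gaussian generative details are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
A = rng.normal(size=n)          # e.g. monitor flicker
B = rng.normal(size=n)          # e.g. person speaking (common cause)
C = rng.normal(size=n)          # e.g. fan noise

U = 0.7 * B + 0.7 * A + 0.1 * rng.normal(size=n)    # "audio" observation
V = 0.7 * B + 0.7 * C + 0.1 * rng.normal(size=n)    # "video" observation

# U and V are correlated only through B; removing B removes the dependence.
print(np.corrcoef(U, V)[0, 1])                      # noticeably nonzero
print(np.corrcoef(U - 0.7 * B, V - 0.7 * B)[0, 1])  # approximately zero
```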


L C S U conveys information about V through the joint of (A,B)

$$\begin{aligned} p(A,B,C,U,V) &= p(A)\,p(B)\,p(C)\,p(U \mid A,B)\,p(V \mid B,C) \\ &= p(U)\,p(A,B \mid U)\,p(C)\,p(V \mid B,C) \end{aligned}$$

L C S V conveys information about U through the joint of (B,C)

$$\begin{aligned} p(A,B,C,U,V) &= p(A)\,p(B)\,p(C)\,p(U \mid A,B)\,p(V \mid B,C) \\ &= p(U)\,p(A,B \mid U)\,p(C)\,p(V \mid B,C) \\ &= p(V)\,p(B,C \mid V)\,p(A)\,p(U \mid A,B) \end{aligned}$$


L C S A Big Mess!!

$$\begin{aligned} p(A,B,C,U,V) &= p(A)\,p(B)\,p(C)\,p(U \mid A,B)\,p(V \mid B,C) \\ &= p(U)\,p(A,B \mid U)\,p(C)\,p(V \mid B,C) \\ &= p(V)\,p(B,C \mid V)\,p(A)\,p(U \mid A,B) \\ &= p(U,V)\,p(A,B,C \mid U,V) \end{aligned}$$

L C S Separating Functions

Suppose a partitioning of U and V exists such that

$$p(U,A,B) = p(A)\,p(B)\,p(U_A \mid A)\,p(U_B \mid B)$$
$$p(V,B,C) = p(B)\,p(C)\,p(V_B \mid B)\,p(V_C \mid C)$$

then… (bear in mind we still have the task of finding it).


L C S Separating Functions

The joint density becomes

$$p(A,B,C,U,V) = p(A)\,p(B)\,p(C)\,p(U_A \mid A)\,p(U_B \mid B)\,p(V_B \mid B)\,p(V_C \mid C)$$

[Graphical model: A, B, C with children U_A, U_B (partitioning U) and V_B, V_C (partitioning V).]

L C S Separating Functions

or

$$\begin{aligned} p(A,B,C,U,V) &= p(A)\,p(B)\,p(C)\,p(U_A \mid A)\,p(U_B \mid B)\,p(V_B \mid B)\,p(V_C \mid C) \\ &= p(U_B)\,p(B \mid U_B)\,p(V_B \mid B)\;p(A)\,p(C)\,p(U_A \mid A)\,p(V_C \mid C) \end{aligned}$$


L C S Separating Functions

or

$$\begin{aligned} p(A,B,C,U,V) &= p(A)\,p(B)\,p(C)\,p(U_A \mid A)\,p(U_B \mid B)\,p(V_B \mid B)\,p(V_C \mid C) \\ &= p(U_B)\,p(B \mid U_B)\,p(V_B \mid B)\;p(A)\,p(C)\,p(U_A \mid A)\,p(V_C \mid C) \\ &= p(V_B)\,p(B \mid V_B)\,p(U_B \mid B)\;p(A)\,p(C)\,p(U_A \mid A)\,p(V_C \mid C) \end{aligned}$$

L C S Summarizing Common Information

[Graphical model as above, with feature mappings g(U,α) and f(V,β).]

$$I\big(g(U,\alpha); f(V,\beta)\big) \le I\big(B; f(V,\beta)\big) \qquad I\big(g(U,\alpha); f(V,\beta)\big) \le I\big(B; g(U,\alpha)\big)$$


L C S Fusion of Audio/Video Measurements using Mutual Information

$$f_{v_i} = g_v(v_i, \alpha_v) \qquad f_{a_i} = g_a(a_i, \alpha_a)$$

• Choose the mapping parameters such that the mutual information between the extracted features is maximized (i.e. project onto a maximally informative subspace).

L C S Fusion of Audio/Video Measurements using Mutual Information

• By maximizing MI we are summarizing the common information in the measurements (i.e. the part related to their common cause).
• From the information-theoretic perspective, the joint density of the feature variables is a proxy for the "observable" part of their common cause.


L C S Maximally Informative Subspace

$$f_{v_i} = h_v^T V_i \qquad f_{a_i} = h_a^T A_i$$

• Treat each image/audio frame in the sequence as a sample of a random variable.
• The projections optimize the joint audio/video statistics in the lower-dimensional feature space (a toy version follows below).
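A minimal sketch of the "project, then score by MI" idea, under simplifying assumptions of my own: 2-D features per frame, 1-D projections found by grid search over angle, and a histogram MI estimate standing in for the kernel estimator developed later in the talk.

```python
# Hedged sketch: choose projections h_v, h_a that maximize a histogram
# estimate of mutual information between projected per-frame features.
# Data, dimensions, binning, and the grid search are all illustrative.
import numpy as np

def hist_mi(x, y, bins=16):
    """Plug-in MI estimate (nats) from a 2-D histogram of samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

rng = np.random.default_rng(3)
n = 2000
b = rng.normal(size=n)                                       # hidden common cause
V = np.c_[b + 0.3 * rng.normal(size=n), rng.normal(size=n)]  # "video" features
A = np.c_[rng.normal(size=n), b + 0.3 * rng.normal(size=n)]  # "audio" features

best = (-np.inf, None, None)
for t in np.linspace(0, np.pi, 32):                  # candidate h_v angles
    h_v = np.array([np.cos(t), np.sin(t)])
    f_v = V @ h_v
    for s in np.linspace(0, np.pi, 32):              # candidate h_a angles
        h_a = np.array([np.cos(s), np.sin(s)])
        mi = hist_mi(f_v, A @ h_a)
        if mi > best[0]:
            best = (mi, h_v, h_a)

# The winning projections recover the directions carrying the common cause.
print("max MI:", best[0], "h_v:", best[1], "h_a:", best[2])
```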

L C S Formally...

$$I\big(y_\theta^{(1)}; y_\theta^{(2)}\big) = h\big(y_\theta^{(1)}\big) + h\big(y_\theta^{(2)}\big) - h\big(y_\theta^{(1)}, y_\theta^{(2)}\big)$$

$$H(z) = -\sum_i p_Z(z_i)\log p_Z(z_i) \;\;\text{(z discrete)} \qquad h(z) = -\int_{\Omega_z} p_Z(z)\log p_Z(z)\,dz \;\;\text{(z continuous)}$$

• Mutual information quantifies the reduction in uncertainty (on average) about one random variable achieved by observing another.
• The entropy terms depend on whether the random variable is discrete or continuous (both cases are sketched below).
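For concreteness, a small sketch of both entropy terms. The KDE-based resubstitution estimator here is my choice for illustration, not the estimator used in the talk.

```python
# Hedged sketch of the two entropy terms: exact discrete entropy from a
# pmf, and a kernel-density resubstitution estimate of differential
# entropy from samples. The estimator choice is an assumption.
import numpy as np
from scipy.stats import gaussian_kde

def discrete_entropy(p):
    """H(Z) = -sum_i p(z_i) log p(z_i), in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def differential_entropy(samples):
    """h(Z) ~= -(1/N) sum_i log p_hat(z_i) with a Gaussian KDE."""
    p_hat = gaussian_kde(samples)
    return float(-np.mean(np.log(p_hat(samples))))

print(discrete_entropy(np.array([0.25, 0.25, 0.5])))  # ~1.04 nats
z = np.random.default_rng(2).normal(size=2000)
print(differential_entropy(z))  # near 0.5*log(2*pi*e) ~= 1.42 nats
```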

12

L C S Entropy vs. Moments (i.e. correlation)
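The slide's contrast can be reproduced numerically with a toy pair that is strongly dependent but nearly uncorrelated: a second-order measure sees nothing while MI does. The example and estimator are mine, repeated from the earlier sketch so this block runs on its own.

```python
# Toy contrast between correlation (a second-order moment) and MI:
# Y = X^2 is a deterministic function of X yet nearly uncorrelated with it.
import numpy as np

def hist_mi(x, y, bins=16):
    """Plug-in MI estimate (nats) from a 2-D histogram of samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

x = np.random.default_rng(7).normal(size=5000)
y = x ** 2
print("correlation:", np.corrcoef(x, y)[0, 1])   # ~0: moments are blind
print("mutual info:", hist_mi(x, y))             # clearly positive
```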

L C S Approximating Differential Entropy

$$H(p) = -\int_{\Omega_y} p(y)\log p(y)\,dy$$

• Substitute the approximation into the integral and simplify:

$$\hat H(p) = H(p) + D(p\|q) - \frac{1}{2}\int_{\Omega_y}\frac{\big(p(y) - q(y)\big)^2}{q(y)}\,dy$$

• Consequently, maximizing this approximation to entropy is equivalent to minimizing the chi-squared distance between the density, p, and the expansion density, q.
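The link between the KL term and the chi-squared integral is the standard second-order expansion; this filling-in is mine, consistent with the slide's claim. Writing $\epsilon(y) = (p(y)-q(y))/q(y)$,

$$D(p\|q) = \int_{\Omega_y} p(y)\log\frac{p(y)}{q(y)}\,dy = \int_{\Omega_y} q(y)\,(1+\epsilon)\log(1+\epsilon)\,dy \approx \int_{\Omega_y} q(y)\Big(\epsilon + \frac{\epsilon^2}{2}\Big)dy = \frac{1}{2}\int_{\Omega_y}\frac{(p(y)-q(y))^2}{q(y)}\,dy,$$

since $\int q\,\epsilon\,dy = \int (p-q)\,dy = 0$. So for $p$ near $q$ the KL and chi-squared terms nearly cancel and $\hat H(p) \approx H(p)$.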

13

L C S Exact Evaluation of the Integral Criterion Gradient

The gradient of the approximation can be computed exactly by evaluating N functions at N sample locations.

$$\begin{aligned} \hat H(p_Y) &= \log V_{\Omega_Y} - \frac{V_{\Omega_Y}}{2}\int_{\Omega_Y}\big(\hat p_Y(y) - p_u(y)\big)^2\,dy \\ &= \log V_{\Omega_Y} - \frac{V_{\Omega_Y}}{2}\int_{\Omega_Y}\Big(\frac{1}{N}\sum_i \kappa(y - y_i; h_N) - p_u(y)\Big)^2\,dy \\ &= \log V_{\Omega_Y} - \frac{V_{\Omega_Y}}{2}\int_{\Omega_Y}\Big(\frac{1}{N}\sum_i \kappa\big(y - g(x_i;\alpha); h_N\big) - p_u(y)\Big)^2\,dy \end{aligned}$$

$$\frac{\partial}{\partial \alpha}\hat H(p_Y) = -\frac{V_{\Omega_Y}}{N}\sum_{i=1}^N \varepsilon(y_i)\,\frac{\partial}{\partial \alpha} g(x_i;\alpha)$$

$$\varepsilon(y_i) = f_r(y_i) - \frac{1}{N}\sum_{j \ne i} f_a(y_i - y_j), \qquad f_r(z) = p_u(z) * \kappa(z; h_N), \qquad f_a(z) = \kappa(z; h_N) * \kappa(z; h_N)$$
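A hedged 1-D sketch of the ε-weight computation under the reconstruction above: with a Gaussian kernel and a uniform reference density p_u on [0,1] (both assumptions), f_r and f_a have closed forms, so the N-function / N-sample evaluation is direct.

```python
# Sketch of the N-sample gradient weights: with Gaussian kappa and a
# uniform reference density on [0,1], f_r = p_u * kappa is a difference
# of Gaussian CDFs and f_a = kappa * kappa is a sqrt(2)-wider Gaussian.
# Kernel width, domain, and signs follow the assumptions stated above.
import numpy as np
from scipy.stats import norm

h = 0.1                                              # kernel width (assumed)

def f_a(z):
    return norm.pdf(z, scale=np.sqrt(2) * h)         # kappa * kappa

def f_r(y):
    return norm.cdf(y / h) - norm.cdf((y - 1) / h)   # p_u * kappa on [0,1]

def epsilon_weights(y):
    """epsilon_i = f_r(y_i) - (1/N) * sum_{j != i} f_a(y_i - y_j)."""
    N = len(y)
    S = f_a(y[:, None] - y[None, :])                 # all pairwise kernels
    np.fill_diagonal(S, 0.0)                         # exclude j == i
    return f_r(y) - S.sum(axis=1) / N

# Clumped samples; each weight multiplies dg(x_i; alpha)/dalpha in the
# chain rule for the entropy gradient.
y = np.random.default_rng(4).uniform(0.2, 0.4, size=200)
print(epsilon_weights(y)[:5])
```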

L C S Video Localization of a Single Speaker in the Presence of Motion Distractors

• Which pixels are “related” to the associated audio?

• Joint statistics of video and audio modalities are not well modeled by parametric forms.

• Slaney and Covell (NIPS ’00) demonstrate that canonical correlations (a second-order statistical measure) do not successfully detect audio/video synchrony using spectral representations.

• Classical sensor fusion approaches are formulated as joint Bayesian estimation problems; in the non-parametric case this is equivalent to the MI criterion.


L C S Detecting motion (change) is not enough

• Red squares indicate regions with large pixel variance.
• Variance image of the sequence at left.
• Magnitude of the max-MI video projection shown at center.
• Inspection of the learned video projection coefficients tells us which pixels are associated with the audio signal.

L C S Pixel Representation vs. Motion Representation

• A similar result is obtained using an optic-flow representation of motion in the video [Anandan '89].


L C S Audio Enhancement

[Waveforms: left and right input channels; Wiener-filtered output (left, right); MI-based output (left, right).]

In this experiment, regions of the video are selected for enhancement (e.g. by a face detector, or manually).
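For reference, a hedged sketch of the comparison baseline: scipy's adaptive Wiener filter on a synthetic noisy signal, scored by noise-power reduction in dB (my reading of "SPG"; the talk's exact metric, data, and filter configuration are not reproduced here).

```python
# Hedged baseline sketch: adaptive Wiener filtering of a noisy 1-D signal.
# The signal, noise level, window size, and gain metric are illustrative.
import numpy as np
from scipy.signal import wiener

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 8000)
clean = np.sin(2 * np.pi * 220 * t) * np.exp(-2 * t)   # toy "voice"
noisy = clean + 0.5 * rng.normal(size=t.size)

denoised = wiener(noisy, mysize=31)                    # local adaptive filter

# Gain = reduction of residual noise power, in dB (one plausible "SPG").
spg = 10 * np.log10(np.mean((noisy - clean) ** 2) /
                    np.mean((denoised - clean) ** 2))
print(f"signal processing gain: {spg:.1f} dB")
```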

L C S Wiener Filter Comparison

                        Wiener filter   Pixel-Periodogram   Optical Flow-Periodogram
                                        Representation      Representation
SPG (female voice)      10.5 dB         5.7 dB              5.6 dB
SPG (male voice)        10.43 dB        8.9 dB              9.2 dB


L C S Audio-visual synchrony detection

[Figure: matched and mismatched audio/video pairs with MI scores 0.68, 0.61, 0.19, 0.20.]

Compute the confusion matrix for 8 subjects: no errors, and no training!

Also usable for audio/visual temporal alignment…
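A hedged toy version of the protocol, with synthetic features standing in for the learned projections; hist_mi is the histogram MI estimator from the earlier sketches, repeated so the block runs on its own.

```python
# Toy synchrony test: score every (audio i, video j) pairing by MI and
# check that matched pairs dominate the diagonal of the confusion matrix.
# Synthetic features stand in for the learned audio/video projections.
import numpy as np

def hist_mi(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    return float((pxy[m] * np.log(pxy[m] / (px @ py)[m])).sum())

rng = np.random.default_rng(6)
n_subjects, n_frames = 8, 500
latent = rng.normal(size=(n_subjects, n_frames))      # per-subject cause
audio = latent + 0.4 * rng.normal(size=latent.shape)
video = latent + 0.4 * rng.normal(size=latent.shape)

conf = np.array([[hist_mi(audio[i], video[j]) for j in range(n_subjects)]
                 for i in range(n_subjects)])
hits = (conf.argmax(axis=1) == np.arange(n_subjects)).sum()
print(f"correct matches: {hits} / {n_subjects}")
```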

L C S An Admission

While sounds and motions may be consistent with each other, they are not always attributable to a common cause…

But: we can incorporate prior models when available.


L C S Conclusions

• We've motivated a tractable and computationally feasible information-theoretic approach to signal-level fusion.
• The method:
  – starts from a simple assumption (consistency at the signal level);
  – makes little use of prior models;
  – provides, through its statistical framework, a clear methodology for incorporating prior models when their use is appropriate.
• Tracking, tracking, tracking…
• Audio-video synchrony is a straightforward application of the approach; the fact that we learn a generative subspace model allows us to do more…