
Page 1

Information Bottleneck versus Maximum Likelihood

Felix Polyakov

Page 2

Overview of the talk

• Brief review of the Information Bottleneck

• Maximum Likelihood

• Information Bottleneck and Maximum Likelihood

• Example from Image Segmentation

Page 3

A Simple Example...

          Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1          12       0    0     0       8    0       0  ...
Doc2           0       9    2    11       1    0       6  ...
Doc3           0      10    1     6       0    0      20  ...
Doc4           9       1    0     0       7    0       1  ...
Doc5           0       3    9     0       1   10       0  ...
Doc6           1      11    0     6       0    1       7  ...
Doc7           0       0    8     0       2   12       2  ...
Doc8          15       0    1     1      10    0       0  ...
Doc9           0      12    1    16       0    1      12  ...
Doc10          1       0    9     0       1   11       2  ...
...          ...     ...  ...   ...     ...  ...     ...  ...

N. Tishby

Page 4

Simple Example

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1          12       8       0     0       0    0    0  ...
Doc4           9       7       1     0       1    0    0  ...
Doc8          15      10       0     1       0    1    0  ...

Doc2           0       1       9    11       6    2    0  ...
Doc3           0       0      10     6      20    1    0  ...
Doc6           1       0      11     6       7    0    1  ...
Doc9           0       0      12    16      12    1    1  ...

Doc5           0       1       3     0       0    9   10  ...
Doc7           0       2       0     0       2    8   12  ...
Doc10          1       1       0     0       2    9   11  ...
...          ...     ...     ...   ...     ...  ...  ...  ...

N. Tishby

Page 5

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1      36      25       1     1       1    1    0  ...
Cluster2       1       1      42    39      45    4    2  ...
Cluster3       1       4       3     0       4   26   33  ...
...          ...     ...     ...   ...     ...  ...  ...  ...

A new compact representation

The document clusters preserve the relevant information shared between the documents and the words

N. Tishby

Page 6

Feature Selection?

• NO ASSUMPTIONS about the source of the data
• Extracting relevant structure from data – functions of the data (statistics) that preserve information
• Information about what?
• Need a principle that is both general and precise.

N. Tishby

Page 7

[Diagram: documents X = {x_1, ..., x_n} are mapped to clusters C_x = {c_1, ..., c_k}, which preserve the relevant information I(C_x; Y) about the words Y = {y_1, ..., y_m}.]

N. Tishby

Page 8

The Information Bottleneck, or relevance through distortion

• We would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible

[Diagram: T compresses X through p(t|x) and predicts Y through p(y|t), preserving part of I(X;Y).]

I(T;Y) = \sum_{t} \sum_{y} p(t)\, p(y|t)\, \log \frac{p(y|t)}{p(y)}

N. Tishby, F. Pereira, and W. Bialek
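To make these quantities concrete, here is a minimal numpy sketch (mine, not from the talk) that evaluates the two mutual-information terms of the trade-off for a candidate soft clustering q(t|x); the function names are illustrative.

import numpy as np

def mutual_information(pxy):
    # I(X;Y) in nats for a joint distribution pxy[i, j] = p(x_i, y_j)
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(y)
    mask = pxy > 0                               # 0 log 0 = 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def ib_functional(pxy, q_t_given_x, beta):
    # L_IB = I(T;X) - beta * I(T;Y) for a candidate encoder q(t|x)
    px = pxy.sum(axis=1)
    qxt = q_t_given_x * px[:, None]              # joint q(x, t) = q(t|x) p(x)
    qty = q_t_given_x.T @ pxy                    # joint q(t, y) = sum_x q(t|x) p(x, y)
    return mutual_information(qxt) - beta * mutual_information(qty)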

Page 9

[Diagram: the observed joint distribution P(X,Y) is modeled through the compressed joint q(T,Y).]

• Goal: find q(T|X)
  – note the Markovian independence relation: T ↔ X ↔ Y

Page 10

Variational problem

Iterative algorithm
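In the standard form of the cited Tishby, Pereira & Bialek paper, the variational problem is

\min_{q(t|x)} \; \mathcal{L}_{IB} = I(T;X) - \beta\, I(T;Y)

and the iterative algorithm repeats the self-consistent updates until convergence:

q(t|x) = \frac{q(t)}{Z(x,\beta)}\, \exp\!\big(-\beta\, \mathrm{KL}[\, p(y|x) \,\|\, q(y|t)\,]\big)

q(t) = \sum_x p(x)\, q(t|x)

q(y|t) = \frac{1}{q(t)} \sum_x q(t|x)\, p(x)\, p(y|x)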

Page 11

Overview of the talk

• Short review of the Information Bottleneck

• Maximum Likelihood

• Information Bottleneck and Maximum Likelihood

• Example from Image Segmentation

Page 12

A simple example...

• A coin is known to be biased.
• The coin is tossed three times – two heads and one tail.
• Use ML to estimate the probability of throwing a head.

• Model: p(head) = P, p(tail) = 1 - P

Try P = 0.2:  L(O) = 0.2 * 0.2 * 0.8 = 0.032
Try P = 0.4:  L(O) = 0.4 * 0.4 * 0.6 = 0.096
Try P = 0.6:  L(O) = 0.6 * 0.6 * 0.4 = 0.144
Try P = 0.8:  L(O) = 0.8 * 0.8 * 0.2 = 0.128

[Plot: likelihood of the data versus the probability of a head; the maximum lies at P = 2/3.]
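A minimal Python sketch (mine, not from the talk) of the same grid search:

import numpy as np

P = np.linspace(0.01, 0.99, 99)       # candidate values for p(head)
L = P * P * (1 - P)                   # L(O) = P * P * (1 - P)
print(P[np.argmax(L)])                # ~0.67, close to the exact ML estimate 2/3

The exact maximizer is P = 2/3, the relative frequency of heads in the data.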

Page 13

A bit more complicated example: a mixture model

• Three baskets with white (O = 1), grey (O = 2), and black (O = 3) balls

B1   B2   B3

• 15 balls were drawn as follows:
  1. Choose a basket according to p(i) = b_i
  2. Draw ball j from basket i with probability a_ij
• Use ML to estimate the parameters θ = {b_i, a_ij} given the observations: the sequence of ball colors
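A minimal Python sketch (not from the talk) of this generative process, with made-up values for b_i and a_ij:

import numpy as np

rng = np.random.default_rng(0)
b = np.array([0.5, 0.3, 0.2])               # p(basket i) = b_i
a = np.array([[0.7, 0.2, 0.1],              # a[i, j] = p(color j | basket i)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

baskets = rng.choice(3, size=15, p=b)       # hidden variables x
colors = np.array([rng.choice(3, p=a[i]) for i in baskets])  # observations y
print(colors + 1)                           # colors coded as O = 1, 2, 3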

Page 14

•Likelihood of observations

•Log Likelihood of observations

•Maximal Likelihood of observations
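For the basket mixture these read (a reconstruction, using the b_i and a_ij notation from the previous slide):

L(\theta) = \prod_{n=1}^{15} p(y_n; \theta) = \prod_{n=1}^{15} \sum_{i} b_i\, a_{i, y_n}

\log L(\theta) = \sum_{n=1}^{15} \log \sum_{i} b_i\, a_{i, y_n}

\theta^{ML} = \arg\max_{\theta} \log L(\theta)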

Page 15

Likelihood of the observed data
• x – hidden random variables [e.g. basket]
• y – observed random variables [e.g. color]
• θ – model parameters [e.g. they define p(y|x)]
• θ_0 – current estimate of the model parameters

\log L(\theta; y) = \log p(y; \theta)
  = E_{x|y;\theta_0}\big[\log p(x,y;\theta)\big] - E_{x|y;\theta_0}\big[\log p(x|y;\theta)\big]
  = \sum_x p(x|y;\theta_0)\, \log p(x,y;\theta) - \sum_x p(x|y;\theta_0)\, \log p(x|y;\theta)

Page 16

\log L(\theta; y)
  = \sum_x p(x|y;\theta_0)\, \log p(x,y;\theta) - \sum_x p(x|y;\theta_0)\, \log p(x|y;\theta)
    + \sum_x p(x|y;\theta_0)\, \log p(x|y;\theta_0) - \sum_x p(x|y;\theta_0)\, \log p(x|y;\theta_0)

  = \underbrace{\sum_x p(x|y;\theta_0)\, \log p(x,y;\theta)}_{Q(\theta,\theta_0)}
    + \underbrace{\mathrm{KL}\big[\, p_{x|y;\theta_0} \,\|\, p_{x|y;\theta}\,\big]}_{KL}
    + \underbrace{H_{x|y;\theta_0}}_{H}

\log L(\theta; y) = Q(\theta,\theta_0) + \mathrm{KL}\big[\, p_{x|y;\theta_0} \,\|\, p_{x|y;\theta}\,\big] + H_{x|y;\theta_0}

Page 17

Expectation-Maximization algorithm (I)

\log L(\theta; y) = Q(\theta,\theta_0) + \mathrm{KL}\big[\, p_{x|y;\theta_0} \,\|\, p_{x|y;\theta}\,\big] + H_{x|y;\theta_0}

1. Expectation
   – compute p(x|y;\theta^t)
   – get Q(\theta, \theta^t) = E_{x|y;\theta^t}\big[\log p(x,y;\theta)\big]

2. Maximization
   – \theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^t)

• \log L(\theta^{t+1}; y) \ge \log L(\theta^t; y)
• The EM algorithm converges to a local maximum
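A minimal Python sketch (mine, not from the talk) of EM for the basket mixture above, with the basket of each draw as the hidden variable x:

import numpy as np

def em_step(y, b, a):
    # E-step: responsibilities p(x = i | y_n; theta_t)
    post = b[:, None] * a[:, y]                 # shape (baskets, draws)
    post /= post.sum(axis=0, keepdims=True)
    # M-step: re-estimate b and a from the expected counts
    b_new = post.sum(axis=1) / post.sum()
    a_new = np.stack([post[:, y == j].sum(axis=1) for j in range(a.shape[1])], axis=1)
    a_new /= a_new.sum(axis=1, keepdims=True)
    return b_new, a_new

y = np.array([0, 1, 1, 0, 2, 1, 2, 0, 1, 2, 0, 0, 1, 2, 2])  # observed colors
b = np.full(3, 1 / 3)                                        # initial guess for b_i
a = np.full((3, 3), 1 / 3) + 0.01 * np.random.default_rng(1).random((3, 3))
a /= a.sum(axis=1, keepdims=True)                            # initial guess for a_ij
for _ in range(100):                                         # iterate to a local maximum
    b, a = em_step(y, b, a)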

Page 18

Log-likelihood is non-decreasing, examples

Page 19

EM – another approach

Goal: maximize

\log p(y;\theta) = \log \sum_x p(x,y;\theta) = \log \sum_x q(x)\, \frac{p(x,y;\theta)}{q(x)}

Jensen's inequality for a concave function f:

f\Big(\sum_{k=1}^{n} \lambda_k x_k\Big) \ge \sum_{k=1}^{n} \lambda_k\, f(x_k), \qquad \sum_{k=1}^{n} \lambda_k = 1,\ \lambda_k \ge 0

Applying it to the concave log:

\log \sum_x q(x)\, \frac{p(x,y;\theta)}{q(x)} \ge \sum_x q(x)\, \log \frac{p(x,y;\theta)}{q(x)} = \sum_x q(x)\, \log p(x,y;\theta) + H(q) \equiv F(q, \theta)

Page 20

\log p(y;\theta) \ge \sum_x q(x)\, \log \frac{p(x,y;\theta)}{q(x)} = F(q,\theta)

0 \le \log p(y;\theta) - F(q,\theta)
  = \log p(y;\theta) \sum_x q(x) - \sum_x q(x)\, \log \frac{p(x,y;\theta)}{q(x)}
  = \sum_x q(x)\, \log \frac{q(x)}{p(x|y;\theta)}
  = \mathrm{KL}\big[\, q(x) \,\|\, p(x|y;\theta)\,\big]

Page 21

0 \le \log p(y;\theta) - F(q,\theta) = \mathrm{KL}\big[\, q(x) \,\|\, p(x|y;\theta)\,\big]

\log p(y;\theta) = F\big(p(x|y;\theta), \theta\big) = \max_q F(q; \theta)
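A quick numeric check (mine, not from the talk) that the gap between log p(y;θ) and F(q,θ) is exactly the KL term:

import numpy as np

rng = np.random.default_rng(0)
pxy = 0.2 * rng.random(3)                     # the slice p(x, y) for one fixed y; sums to p(y) < 1
q = rng.random(3); q /= q.sum()               # an arbitrary distribution q(x)

py = pxy.sum()                                # p(y)
F = (q * np.log(pxy / q)).sum()               # F(q, theta) restricted to this y
kl = (q * np.log(q / (pxy / py))).sum()       # KL[q(x) || p(x|y)]
print(np.isclose(np.log(py) - F, kl))         # True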

Page 22

Expectation-Maximization algorithm (II)

1. Expectation
   \hat{q}^{t+1} = \arg\max_q F(q; \theta^t) = p(x|y;\theta^t)

2. Maximization
   \theta^{t+1} = \arg\max_{\theta} F(\hat{q}^{t+1}; \theta)
              = \arg\max_{\theta} \Big[\sum_x \hat{q}^{t+1}(x)\, \log p(x,y;\theta) + H(\hat{q}^{t+1})\Big]
              = \arg\max_{\theta} \sum_x p(x|y;\theta^t)\, \log p(x,y;\theta)
              = \arg\max_{\theta} Q(\theta, \theta^t)

(I) and (II) are equivalent.

Page 23

Scheme of the approach

\log p(y;\theta^t) = F\big(p(x|y;\theta^t); \theta^t\big)
  \le F\big(p(x|y;\theta^t); \theta^{t+1}\big)
  \le F\big(p(x|y;\theta^{t+1}); \theta^{t+1}\big) = \log p(y;\theta^{t+1})

Page 24

Overview of the talk

• Short review of the Information Bottleneck

• Maximum Likelihood

• Information Bottleneck and Maximum Likelihood for a toy problem

• Example from Image Segmentation

Page 25

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1          12       8       0     0       0    0    0  ...
Doc4           9       7       1     0       1    0    0  ...
Doc8          15      10       0     1       0    1    0  ...

Doc2           0       1       9    11       6    2    0  ...
Doc3           0       0      10     6      20    1    0  ...
Doc6           1       0      11     6       7    0    1  ...
Doc9           0       0      12    16      12    1    1  ...

Doc5           0       1       3     0       0    9   10  ...
Doc7           0       2       0     0       2    8   12  ...
Doc10          1       1       0     0       2    9   11  ...
...          ...     ...     ...   ...     ...  ...  ...  ...

Rows: Documents - X.  Columns: Words - Y.

Topics - T:
  t ~ π(t)
  x ~ π(x)
  y|t ~ π(y|t)

Page 26

Model parameters

[Bar chart: π(x) over the documents x = 1, ..., 10, with values between 0 and 0.2.]

Sampling algorithm
• For i = 1:N
  – choose x_i by sampling from π(x)
  – choose y_i by sampling from π(y | t(x_i))
  – increase n(x_i, y_i) by one

Example
• x_i = 9
  – t(9) = 2
  – sample from π(y|2) and get y_i = "Drug"
  – set n(9, "Drug") = n(9, "Drug") + 1
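A minimal Python sketch (not from the talk) of this sampling algorithm, with made-up parameters:

import numpy as np

rng = np.random.default_rng(3)
n_docs, n_words, n_topics, N = 10, 7, 3, 500
pi_x = np.full(n_docs, 1 / n_docs)                   # pi(x), here uniform
t_of_x = rng.integers(n_topics, size=n_docs)         # topic assignment t(x)
pi_y_given_t = rng.dirichlet(np.ones(n_words), size=n_topics)  # pi(y|t)

n = np.zeros((n_docs, n_words), dtype=int)           # the count matrix n(x, y)
for _ in range(N):
    x = rng.choice(n_docs, p=pi_x)                   # choose a document
    y = rng.choice(n_words, p=pi_y_given_t[t_of_x[x]])  # word from its topic
    n[x, y] += 1                                     # increase n(x_i, y_i)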

Page 27

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1          12       8       0     0       0    0    0  ...
Doc4           9       7       1     0       1    0    0  ...
Doc8          15      10       0     1       0    1    0  ...

Doc2           0       1       9    11       6    2    0  ...
Doc3           0       0      10     6      20    1    0  ...
Doc6           1       0      11     6       7    0    1  ...
Doc9           0       0      12    16      12    1    1  ...

Doc5           0       1       3     0       0    9   10  ...
Doc7           0       2       0     0       2    8   12  ...
Doc10          1       1       0     0       2    9   11  ...
...          ...     ...     ...   ...     ...  ...  ...  ...

  X      1  2  3  4  5  6  7  8  9  10
  t(X)   1  2  2  1  3  2  3  1  2  3

Doc1, Doc4, Doc8 were sampled from π(y|t=1); Doc2, Doc3, Doc6, Doc9 from π(y|t=2); Doc5, Doc7, Doc10 from π(y|t=3).

Page 28

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1      36      25       1     1       1    1    0  ...
Cluster2       1       1      42    39      45    4    2  ...
Cluster3       1       4       3     0       4   26   33  ...
...          ...     ...     ...   ...     ...  ...  ...  ...

Toy problem: which parameters maximize the likelihood?

• T = topics, X = documents, Y = words
• Parameters: θ = {t(x), π(x), π(y|t)}

X → T → Y:  t = t(x),  y ~ π(y | t(x))

\log L(\theta) = \sum_{x,y} n(x,y)\, \log \big[\pi(x)\, \pi(y\,|\,t(x))\big]

Page 29

EM approach

• E-step

q(t|x) = k(x)\, \pi(t)\, e^{-n(x)\, \mathrm{KL}[\, n(y|x) \,\|\, \pi(y|t)\,]}

where k(x) is a normalization factor and n(y|x) = n(x,y) / n(x).

• M-step

\pi(t) = \frac{\sum_x q(t|x)}{\sum_{t'} \sum_x q(t'|x)}, \qquad \pi(y|t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_x n(x)\, q(t|x)}
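A minimal Python sketch (mine, not from the talk) of one such EM iteration; it assumes every document has at least one word:

import numpy as np

def em_iteration(n, pi_t, pi_y_given_t):
    n_x = n.sum(axis=1)                         # n(x)
    n_y_given_x = n / n_x[:, None]              # n(y|x)
    eps = 1e-12                                 # guards the log against zeros
    # E-step: q(t|x) proportional to pi(t) exp(-n(x) KL[n(y|x) || pi(y|t)])
    kl = (n_y_given_x[:, None, :] *
          np.log((n_y_given_x[:, None, :] + eps) / (pi_y_given_t[None, :, :] + eps))).sum(-1)
    q = pi_t[None, :] * np.exp(-n_x[:, None] * kl)
    q /= q.sum(axis=1, keepdims=True)           # the k(x) normalization
    # M-step
    pi_t = q.sum(axis=0) / q.sum()
    pi_y_given_t = (q.T @ n) / (q.T @ n).sum(axis=1, keepdims=True)
    return q, pi_t, pi_y_given_t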

Page 30

IB approach

q(t|x) = \frac{q(t)}{Z(x,\beta)}\, e^{-\beta\, \mathrm{KL}[\, p(y|x) \,\|\, q_{IB}(y|t)\,]}

where Z(x,\beta) is a normalization factor and, with q_{IB}(x,y,t) = p(x,y)\, q(t|x):

q(t) = \sum_{x,y} q_{IB}(x,y,t) = \sum_x q(t|x)\, p(x)

q_{IB}(y|t) = \frac{1}{q(t)} \sum_x q_{IB}(x,y,t) = \frac{1}{q(t)} \sum_x q(t|x)\, p(x)\, p(y|x)
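The corresponding iIB iteration as a Python sketch (mine, not from the talk):

import numpy as np

def iib_iteration(pxy, q_t_given_x, beta):
    px = pxy.sum(axis=1)
    p_y_given_x = pxy / px[:, None]
    q_t = q_t_given_x.T @ px                    # q(t) = sum_x q(t|x) p(x)
    q_y_given_t = (q_t_given_x * px[:, None]).T @ p_y_given_x / q_t[:, None]
    eps = 1e-12                                 # guards the log against zeros
    kl = (p_y_given_x[:, None, :] *
          np.log((p_y_given_x[:, None, :] + eps) / (q_y_given_t[None, :, :] + eps))).sum(-1)
    q_new = q_t[None, :] * np.exp(-beta * kl)   # unnormalized q(t|x)
    return q_new / q_new.sum(axis=1, keepdims=True)  # the Z(x, beta) normalization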

Page 31

ML ↔ IB

The EM updates (page 29) and the iterative IB updates (page 30) coincide under the mapping:

  ML          ↔   IB
  q(t|x)      ↔   q(t|x)
  π(t)        ↔   q(t)
  π(y|t)      ↔   q(y|t)
  n(x,y)/N    ↔   p(x,y)
  r           ↔   β

where N = \sum_{x,y} n(x,y) and r is a scaling constant.

Page 32

• X is uniformly distributed
• r = |X|

Under these conditions the EM algorithm is equivalent to the IB iterative algorithm.

[Diagram: the fixed points of iterative IB and of EM correspond one-to-one under the IB ↔ ML mapping.]

Page 33

• X is uniformly distributed
• β = n(x)

All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional

\mathcal{L}_{IB} = I(T;X) - \beta\, I(T;Y)

At the fixed points, \mathcal{L}_{IB} = -\log L + const.

[Diagram: IB ↔ ML mapping between the fixed points.]

Page 34

• X is uniformly distributed
• β = n(x)

\mathcal{L}_{IB} = -\frac{1}{r}\, F - H(Y) = -\frac{1}{r}\, F + const

Every algorithm increases F iff it decreases \mathcal{L}_{IB}.

Page 35

Deterministic case

• N → ∞ (or β → ∞)

EM (ML):  q(t|x) = 1 if t = \arg\min_{t'} \mathrm{KL}[\, n(y|x) \,\|\, \pi(y|t')\,], and 0 otherwise

IB:  q^{new}(t|x) = 1 if t = \arg\min_{t'} \mathrm{KL}[\, p(y|x) \,\|\, q(y|t')\,], and 0 otherwise

with the correspondence \pi(y|t) ↔ q(y|t).
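In this limit both iterations reduce to the same hard assignment; a small Python sketch (mine, not from the talk):

import numpy as np

def hard_assignment(p_y_given_x, q_y_given_t):
    eps = 1e-12
    kl = (p_y_given_x[:, None, :] *
          np.log((p_y_given_x[:, None, :] + eps) / (q_y_given_t[None, :, :] + eps))).sum(-1)
    t_of_x = kl.argmin(axis=1)                  # argmin_t KL[p(y|x) || q(y|t)]
    q = np.zeros_like(kl)
    q[np.arange(len(t_of_x)), t_of_x] = 1.0     # q(t|x) becomes 0/1
    return q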

Page 36

• N → ∞ (or β → ∞)
  – no assumption about the uniformity of X is needed here

All the fixed points of L are mapped to all the fixed points of \mathcal{L}_{IB} = -F + const.

Every algorithm which finds a fixed point of L induces a fixed point of \mathcal{L}_{IB}, and vice versa.

In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes \mathcal{L}_{IB}.

Page 37

Example

  π(x)   x
  2/3    Yellow submarine
  1/3    Red bull

N → ∞, β → ∞:

  t   π(t) (EM)   q(t) (IB)
  1   1/2         2/3
  2   1/2         1/3

• p(x) is non-uniform
• This does not mean that q(t) = π(t)

Page 38

When N → ∞, every algorithm increases F iff it decreases \mathcal{L}_{IB} with β → ∞.

• How large must N (or β) be?

• How is it related to the "amount of uniformity" in n(x)?

Page 39

Simulations for iIB

Page 40

Simulations for EM

Page 41

Simulations

• 200 runs = 100 (small N) + 100 (large N)

• In 58 runs, iIB converged to a smaller value of (-F) than EM

• In 46 runs, EM converged to a value of (-F) corresponding to a smaller value of \mathcal{L}_{IB}

Page 42

Quality estimation for the EM solution

• The quality of an IB solution is measured through the theoretical upper bound

\frac{I(T;Y)}{I(X;Y)} \le 1

• Using the IB ↔ ML mapping, one can adopt this measure for the ML estimation problem, for large enough N
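Reusing the mutual_information helper sketched after page 8, the bound can be evaluated directly (an illustrative sketch, not from the talk):

def ib_quality(pxy, q_t_given_x):
    # I(T;Y) / I(X;Y): a value of 1 means T kept all the relevant information
    qty = q_t_given_x.T @ pxy                 # joint q(t, y)
    return mutual_information(qty) / mutual_information(pxy)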

Page 43

Summary: IB versus ML

• The ML and IB approaches are equivalent under certain conditions

• Model comparison
  – The mixture model assumes that Y is independent of X given T(X): X → T → Y
  – In the IB framework, T is defined through the IB Markovian independence relation: T ↔ X ↔ Y

• The quality estimation measure can be adapted from IB to the ML estimation problem, for large N

Page 44

Overview of the talk

• Brief review of the Information Bottleneck

• Maximum Likelihood

• Information Bottleneck and Maximum Likelihood

• Example from Image Segmentation (L. Hermes et al.)

Page 45

The clustering model

• Pixels o_i, i = 1, ..., n
• Deterministic clusters c_ν, ν = 1, ..., k
• Boolean assignment matrix M ∈ {0,1}^{n×k}, with entries M_{iν}
• Observations X_i = (x_{i1}, ..., x_{i n_i})

p(x | o_i) = \sum_{\nu=1}^{k} M_{i\nu} \sum_{\alpha=1}^{l} p_{\nu\alpha}\, g(x | \theta_\alpha)

Page 46

[Figure: a pixel o_i at grid position (q, r), with n_i color observations drawn from its neighborhood.]

• Observations X_i = (x_{i1}, ..., x_{i n_i})

p(x | o_i) = \sum_{\nu=1}^{k} M_{i\nu} \sum_{\alpha=1}^{l} p_{\nu\alpha}\, g(x | \theta_\alpha)

Page 47

Likelihood

• Discretization of the color space into intervals I_j
• Set G_{j\alpha} = \int_{I_j} g(x | \theta_\alpha)\, dx
• Data likelihood

p(X | M, \theta) = \prod_{i} \prod_{j} \Big( \sum_{\nu} M_{i\nu} \sum_{\alpha} p_{\nu\alpha}\, G_{j\alpha} \Big)^{n_{ij}}

Page 48

Relation to the IB

[Diagram: the cluster memberships play the role of the IB variable T.]

Page 49

Log-likelihood

L(M, \theta) = \log p(X | M, \theta) = \sum_{i} \sum_{j} n_{ij}\, \log \Big( \sum_{\nu} M_{i\nu} \sum_{\alpha} p_{\nu\alpha}\, G_{j\alpha} \Big)

IB functional

The IB functional has the same structure, with each pixel's contribution weighted by n_i.

• Assume that n_i = const and set β = n_i; then \mathcal{L}_{IB} = -\log L + const.

Page 50

Images generated from the learned statistics

Page 51

References

• N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck Method.
• N. Slonim and Y. Weiss. Maximum Likelihood and the Information Bottleneck.
• R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.
• J. Goldberger. Lecture notes.
• L. Hermes, T. Zöller, and J. M. Buhmann. Parametric Distributional Clustering for Image Segmentation.

The end