Information Bottleneck versus Maximum Likelihood
Felix Polyakov
Overview of the talk
• Brief review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation
A Simple Example...

         Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1     12      0       0    0     8       0    0       ...
Doc2     0       9       2    11    1       0    6       ...
Doc3     0       10      1    6     0       0    20      ...
Doc4     9       1       0    0     7       0    1       ...
Doc5     0       3       9    0     1       10   0       ...
Doc6     1       11      0    6     0       1    7       ...
Doc7     0       0       8    0     2       12   2       ...
Doc8     15      0       1    1     10      0    0       ...
Doc9     0       12      1    16    0       1    12      ...
Doc10    1       0       9    0     1       11   2       ...
...      ...     ...     ...  ...   ...     ...  ...     ...

N. Tishby
Simple Example

         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...

Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...

Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

N. Tishby
          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...

A new compact representation
The document clusters preserve the relevant information between the documents and the words.

N. Tishby
Feature Selection?
• NO ASSUMPTIONS about the source of the data
• Extracting relevant structure from data – functions of the data (statistics) that preserve information
• Information about what?
• Need a principle that is both general and precise.

N. Tishby
[Diagram: the documents X = {x1, ..., xn} are mapped to clusters Cx = {c1, ..., ck} that preserve the information I(Cx; Y) about the words Y = {y1, ..., ym}. Documents on one side, words on the other.]
N. Tishby
The information bottleneck or relevance through distortion
• We would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible
[Diagram: X → T → Y; T is obtained from X through the compression mapping p(t|x) and predicts Y through p(y|t); the input data are characterized by I(X;Y).]

I(T;Y) = \sum_{t,y} p(t,y) \log \frac{p(y \mid t)}{p(y)}
N. Tishby, F. Pereira, and W. Bialek
[Diagram: the empirical joint distribution P(X,Y) is summarized by the compressed joint distribution q(T,Y).]
• Goal: find q(T | X)
  – note the Markovian independence relation T ↔ X ↔ Y
• Variational problem: minimize the IB functional L_IB = I(T;X) − β·I(T;Y) over q(T | X)
• Iterative algorithm – a sketch follows below
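To make the iterative algorithm concrete, here is a minimal Python sketch of the self-consistent IB updates; the function name, random initialization, and smoothing constants are my own implementation choices, not part of the original formulation.

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Minimal sketch of iterative IB: alternate the three self-consistent
    updates for q(t|x), q(t), q(y|t) until (local) convergence."""
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_given_x = p_xy / p_x[:, None]            # p(y|x)

    q_t_given_x = rng.random((n_x, n_clusters))  # random soft initialization
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = q_t_given_x.T @ p_x                            # q(t) = sum_x p(x) q(t|x)
        q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]  # q(y|t)
        # KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = np.log(p_y_given_x[:, None, :] + 1e-12) \
                  - np.log(q_y_given_t[None, :, :] + 1e-12)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # q(t|x) ~ q(t) exp(-beta * KL); shift exponents for numerical stability
        w = -beta * kl
        w -= w.max(axis=1, keepdims=True)
        q_t_given_x = q_t[None, :] * np.exp(w)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x, q_t, q_y_given_t
```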
Overview of the talk
• Short review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation
A coin is known to be biased. The coin is tossed three times – two heads and one tail. Use ML to estimate the probability of throwing a head.
[Plot: likelihood of the data as a function of the probability of a head.]

Try P = 0.2: L(O) = 0.2 · 0.2 · 0.8 = 0.032
Try P = 0.4: L(O) = 0.4 · 0.4 · 0.6 = 0.096
Try P = 0.6: L(O) = 0.6 · 0.6 · 0.4 = 0.144
Try P = 0.8: L(O) = 0.8 · 0.8 · 0.2 = 0.128
A simple example...
• Model: p(head) = P, p(tail) = 1 − P
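A quick Python check of the search above (the grid and helper name are mine); it reproduces the four likelihood values and locates the maximum near P = 2/3:

```python
def likelihood(p):
    # L(O) = P * P * (1 - P): two heads, one tail
    return p * p * (1 - p)

for p in (0.2, 0.4, 0.6, 0.8):
    print(f"P = {p}: L(O) = {likelihood(p):.3f}")

grid = [i / 1000 for i in range(1001)]
p_ml = max(grid, key=likelihood)
print(f"ML estimate: P = {p_ml:.3f}")   # ~0.667, i.e. 2/3 heads
```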
A bit more complicated example: Mixture Model
• Three baskets with white (O = 1), grey (O = 2), and black (O = 3) balls
[Figure: baskets B1, B2, B3.]
• 15 balls were drawn as follows:
  1. Choose a basket according to p(i) = b_i
  2. Draw ball j from basket i with probability p(j | i)
• Use ML to estimate the parameters θ = {b_i, p(j|i)} given the observations: the sequence of ball colors
• Likelihood of the observations: L(\theta) = \prod_k \sum_i b_i\, p(o_k \mid i)
• Log-likelihood of the observations: \log L(\theta) = \sum_k \log \sum_i b_i\, p(o_k \mid i)
• Maximal likelihood of the observations: \theta^{*} = \arg\max_\theta \log L(\theta)
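As a sketch of the likelihood computation in code – the parameter values below are invented for illustration; only the structure (marginalize over the hidden basket, then multiply across draws) follows the slide:

```python
import math

b = [0.5, 0.3, 0.2]            # hypothetical basket priors p(i) = b_i
p_ball = [[0.7, 0.2, 0.1],     # hypothetical p(j|i); rows = baskets,
          [0.2, 0.6, 0.2],     # columns = white, grey, black
          [0.1, 0.2, 0.7]]
observations = [0, 1, 1, 2, 0] # observed colors (0 = white, 1 = grey, 2 = black)

# The basket is hidden, so each draw contributes p(j) = sum_i b_i * p(j|i).
log_l = sum(math.log(sum(b[i] * p_ball[i][j] for i in range(3)))
            for j in observations)
print("log-likelihood:", log_l)
```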
Likelihood of the observed data
• x – hidden random variables [e.g., basket]
• y – observed random variables [e.g., color]
• θ – model parameters [e.g., they define p(y|x)]
• θ0 – current estimate of the model parameters
\log L(\theta; y) = \log p(y;\theta) = \mathbb{E}_{x|y;\theta_0}\!\left[\log p(x,y;\theta)\right] - \mathbb{E}_{x|y;\theta_0}\!\left[\log p(x|y;\theta)\right]

= \sum_x p(x|y;\theta_0) \log p(x,y;\theta) - \sum_x p(x|y;\theta_0) \log p(x|y;\theta)

The first term is Q(\theta, \theta_0) = \sum_x p(x|y;\theta_0) \log p(x,y;\theta); the second splits into a KL divergence and an entropy:

- \sum_x p(x|y;\theta_0) \log p(x|y;\theta) = \mathrm{KL}\!\left[p(x|y;\theta_0)\,\|\,p(x|y;\theta)\right] + H_{x|y;\theta_0}

so that

\log L(\theta; y) = Q(\theta, \theta_0) + \mathrm{KL}\!\left[p(x|y;\theta_0)\,\|\,p(x|y;\theta)\right] + H_{x|y;\theta_0}
Expectation–maximization algorithm (I)
1. Expectation
   – Compute p(x \mid y; \theta_t)
   – Get Q(\theta, \theta_t) = \mathbb{E}_{x|y;\theta_t}\!\left[\log p(x,y;\theta)\right]
2. Maximization
   – \theta_{t+1} = \arg\max_\theta Q(\theta, \theta_t)
• Log-likelihood is non-decreasing: \log L(\theta_{t+1}; y) \ge \log L(\theta_t; y)
• EM algorithm converges to local maxima
[Plot: examples of the log-likelihood increasing monotonically across EM iterations.]
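Here is a minimal EM sketch for the basket mixture above; the function and variable names are mine, and the random initialization is an implementation choice. Printing the log-likelihood each iteration shows the non-decreasing behavior.

```python
import numpy as np

def em_mixture(obs, n_baskets, n_colors, n_iter=50, seed=0):
    """EM for the basket mixture: estimate priors b_i and color
    probabilities p(j|i) from the observed color sequence."""
    rng = np.random.default_rng(seed)
    b = np.full(n_baskets, 1.0 / n_baskets)
    p = rng.random((n_baskets, n_colors))
    p /= p.sum(axis=1, keepdims=True)
    counts = np.bincount(obs, minlength=n_colors)     # color histogram

    for _ in range(n_iter):
        joint = b[:, None] * p                        # b_i * p(j|i)
        print("log L:", (counts * np.log(joint.sum(axis=0))).sum())
        # E-step: responsibilities p(basket = i | color = j; theta_t)
        r = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        expected = r * counts[None, :]
        b = expected.sum(axis=1) / expected.sum()
        p = expected / expected.sum(axis=1, keepdims=True)
    return b, p

b_hat, p_hat = em_mixture([0, 1, 1, 2, 0, 2, 2, 1, 0, 1, 2, 0, 1, 2, 2], 3, 3)
```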
EM – another approach

Goal: maximize \log p(y;\theta) = \log \sum_x q(x)\, \frac{p(x,y;\theta)}{q(x)}

Jensen's inequality for a concave function: f\!\left(\sum_k \lambda_k x_k\right) \ge \sum_k \lambda_k f(x_k), \quad \sum_k \lambda_k = 1,\ \lambda_k \in [0,1]

Applying it to the (concave) logarithm:

\log p(y;\theta) \ge \sum_x q(x) \log \frac{p(x,y;\theta)}{q(x)} = \sum_x q(x) \log p(x,y;\theta) + H(q) = F(q, \theta)

The gap is a KL divergence:

0 \le \log p(y;\theta) - F(q,\theta) = \mathrm{KL}\!\left[q(x)\,\|\,p(x|y;\theta)\right]

and therefore \log p(y;\theta) = F\!\left(p(x|y;\theta), \theta\right) = \max_q F(q, \theta).
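A small numeric check of this bound (the toy numbers are mine): F(q, θ) never exceeds log p(y; θ), and equals it exactly at q = p(x|y; θ).

```python
import numpy as np

p_xy = np.array([0.1, 0.3, 0.2])   # toy values of p(x, y; theta) at the observed y
log_p_y = np.log(p_xy.sum())       # log p(y; theta)

def free_energy(q):
    """F(q, theta) = sum_x q(x) log p(x, y; theta) + H(q)."""
    q = np.asarray(q, dtype=float)
    nz = q > 0                      # 0 * log 0 = 0 by convention
    return (q[nz] * np.log(p_xy[nz])).sum() - (q[nz] * np.log(q[nz])).sum()

print("log p(y):        ", log_p_y)
print("F at q = p(x|y): ", free_energy(p_xy / p_xy.sum()))  # equals log p(y)
print("F at uniform q:  ", free_energy([1/3, 1/3, 1/3]))    # strictly smaller
```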
Expectation–maximization algorithm (II)
1. Expectation: \hat{q}_{t+1} = \arg\max_q F(q; \theta_t) = p(x|y;\theta_t)
2. Maximization: \theta_{t+1} = \arg\max_\theta \left[\sum_x \hat{q}_{t+1}(x) \log p(x,y;\theta) + H(\hat{q}_{t+1})\right] = \arg\max_\theta \sum_x p(x|y;\theta_t) \log p(x,y;\theta) = \arg\max_\theta Q(\theta, \theta_t)

(I) and (II) are equivalent. Monotonicity follows from

\log p(y;\theta_t) = F\!\left(p(x|y;\theta_t); \theta_t\right) \le F\!\left(p(x|y;\theta_t); \theta_{t+1}\right) \le F\!\left(p(x|y;\theta_{t+1}); \theta_{t+1}\right) = \log p(y;\theta_{t+1})
Scheme of the approach
[Figure: alternating coordinate ascent on F(q, θ) – the E-step maximizes over q, the M-step over θ.]
Overview of the talk
• Short review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood for a toy problem
• Example from Image Segmentation
         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...
Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...
Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

Documents – X (rows), Words – Y (columns), Topics – t

Model parameters: t ~ π(t), x ~ π(x), y|t ~ π(y|t)
[Bar plot: the prior π(x) over the documents X = 1, ..., 10, with values between 0 and 0.2.]

Example
• x_i = 9
  – t(9) = 2
  – sample from π(y|2), get y_i = "Drug"
  – set n(9, "Drug") = n(9, "Drug") + 1
Sampling algorithm
• For i = 1:N
  – choose x_i by sampling from π(x)
  – choose y_i by sampling from π(y|t(x_i))
  – increase n(x_i, y_i) by one
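A direct Python transcription of this sampling scheme (the names and the toy parameters below are mine):

```python
import numpy as np

def sample_counts(pi_x, t_of_x, pi_y_given_t, n_samples, seed=0):
    """Draw x ~ pi(x), then y ~ pi(y | t(x)), and accumulate n(x, y)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = len(pi_x), pi_y_given_t.shape[1]
    n = np.zeros((n_x, n_y), dtype=int)
    for _ in range(n_samples):
        x = rng.choice(n_x, p=pi_x)                     # choose a document
        y = rng.choice(n_y, p=pi_y_given_t[t_of_x[x]])  # word from its topic
        n[x, y] += 1
    return n

# Toy setup mirroring the slides: 10 documents, 3 topics, 7 words.
t_of_x = np.array([0, 1, 1, 0, 2, 1, 2, 0, 1, 2])  # slide's 1 2 2 1 3 2 3 1 2 3, zero-based
pi_x = np.full(10, 0.1)
pi_y_given_t = np.random.default_rng(1).dirichlet(np.ones(7), size=3)
print(sample_counts(pi_x, t_of_x, pi_y_given_t, n_samples=150))
```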
         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...
Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...
Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

X:    1  2  3  4  5  6  7  8  9  10
t(X): 1  2  2  1  3  2  3  1  2  3

The rows of the aggregated table give π(y|t=1), π(y|t=2), π(y|t=3) (up to normalization):

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...
Toy problem: which parameters maximize the likelihood?
• T = topics, X = documents, Y = words
[Diagram: x → t(x) → y, i.e., each document x is assigned a topic t(x), which emits words via π(y|t(x)).]
• Likelihood of the observed counts n(x, y):

\log L = \sum_{x,y} n(x,y) \log\left[\pi(x)\, \pi(y \mid t(x))\right]
EM approach
• E-step:
  q(t \mid x) = k(x)\, \pi(t)\, e^{-n(x)\, \mathrm{KL}\left[\hat{p}(y|x) \,\|\, \pi(y|t)\right]}
  where k(x) is a normalization factor, n(x) = \sum_y n(x,y), and \hat{p}(y|x) = n(x,y)/n(x)
• M-step:
  \pi(y \mid t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_x n(x)\, q(t|x)}, \qquad \pi(t) \propto \sum_x q(t|x)
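A sketch of these two steps in Python. The names are mine; the π(t) normalization below averages q(t|x) uniformly over documents, which matches the uniform-X setting discussed later.

```python
import numpy as np

def e_step(n_xy, pi_t, pi_y_given_t, eps=1e-12):
    """q(t|x) = k(x) pi(t) exp(-n(x) KL[p_hat(y|x) || pi(y|t)])."""
    n_x = n_xy.sum(axis=1)                          # n(x): total words in document x
    p_hat = n_xy / n_x[:, None]                     # empirical word distribution p_hat(y|x)
    log_ratio = np.log(p_hat + eps)[:, None, :] - np.log(pi_y_given_t + eps)[None, :, :]
    kl = (p_hat[:, None, :] * log_ratio).sum(axis=2)   # KL for every (x, t)
    q = pi_t[None, :] * np.exp(-n_x[:, None] * kl)
    return q / q.sum(axis=1, keepdims=True)            # k(x) normalizes over t

def m_step(n_xy, q_t_given_x):
    """Re-estimate pi(t) and pi(y|t) from the soft assignments."""
    pi_t = q_t_given_x.sum(axis=0) / q_t_given_x.shape[0]   # average over documents
    weighted = q_t_given_x.T @ n_xy                         # sum_x q(t|x) n(x, y)
    return pi_t, weighted / weighted.sum(axis=1, keepdims=True)
```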
IB approach
Define q^{IB}(x, y, t) = p(x, y)\, q(t \mid x). The iterative updates are:
  q(t \mid x) = \frac{q(t)}{Z(x, \beta)}\, e^{-\beta\, \mathrm{KL}\left[p(y|x) \,\|\, q(y|t)\right]}
  where Z(x, \beta) is a normalization factor,
  q(t) = \sum_x q(t \mid x)\, p(x), \qquad q(y \mid t) = \frac{1}{q(t)} \sum_x q(t \mid x)\, p(x)\, p(y \mid x)
The two sets of update equations, side by side:

ML (EM):
  q(t \mid x) = k(x)\, \pi(t)\, e^{-n(x)\, \mathrm{KL}[\hat{p}(y|x) \,\|\, \pi(y|t)]}
  \pi(y \mid t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_x n(x)\, q(t|x)}, \qquad \pi(t) \propto \sum_x q(t|x)

IB:
  q(t \mid x) = \frac{q(t)}{Z(x,\beta)}\, e^{-\beta\, \mathrm{KL}[p(y|x) \,\|\, q(y|t)]}
  q(t) = \sum_x q(t \mid x)\, p(x), \qquad q(y \mid t) = \frac{1}{q(t)} \sum_x q(t \mid x)\, p(x)\, p(y \mid x)
IB ↔ ML mapping:
  q(t|x) ↔ q(t|x),  π(t) ↔ q(t),  π(y|t) ↔ q(y|t)
  p(x, y) = \frac{1}{N}\, n(x, y), \qquad \beta = \frac{N}{r}
  where r is a scaling constant and N = \sum_{x,y} n(x,y).
• X is uniformly distributed
• r = |X|
⇒ The EM algorithm is equivalent to the iterative IB algorithm.
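In code, the mapping is just a re-labeling plus two substitutions; a tiny sketch (function name mine, following the table above):

```python
def ml_to_ib(n_xy, pi_t, pi_y_given_t, q_t_given_x, r):
    """Package EM quantities as inputs/outputs of iterative IB:
    p(x,y) = n(x,y)/N and beta = N/r, with r a scaling constant."""
    N = n_xy.sum()
    return {"p_xy": n_xy / N,             # joint distribution fed to iIB
            "beta": N / r,                # IB trade-off parameter
            "q_t": pi_t,                  # pi(t)   <-> q(t)
            "q_y_given_t": pi_y_given_t,  # pi(y|t) <-> q(y|t)
            "q_t_given_x": q_t_given_x}   # q(t|x)  <-> q(t|x)
```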
[Figure: under the mapping, the fixed points of iterative IB and of EM coincide.]
• X is uniformly distributed
• β = n(x)
• X is uniformly distributed, β = n(x)
• All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_{IB} = I(T;X) − β·I(T;Y).
• At the fixed points, L_{IB} = −(1/r)·log L + const.
• −(1/r)·F − H(Y) = L_{IB} + const, so every algorithm increases F iff it decreases L_{IB}.
Deterministic case
• N → ∞ (or β → ∞)
EM (ML):
  q_{t+1}(t \mid x) \to 1 \text{ if } t = \arg\min_{t'} \mathrm{KL}\left[\hat{p}(y|x) \,\|\, \pi(y|t')\right], \quad 0 \text{ otherwise}
IB:
  q_{\mathrm{new}}(t \mid x) \to 1 \text{ if } t = \arg\min_{t'} \mathrm{KL}\left[p(y|x) \,\|\, q(y|t')\right], \quad 0 \text{ otherwise}
with q(y|t) ↔ π(y|t).
• As N → ∞ (or β → ∞) – no assumption about the uniformity of X is needed here.
• All the fixed points of L are mapped to all the fixed points of L_{IB} (= −F + const).
• Every algorithm which finds a fixed point of L induces a fixed point of L_{IB}, and vice versa.
• In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_{IB}.
Example

  π(x)   x
  2/3    Yellow submarine
  1/3    Red bull

N → ∞ (β → ∞):

  t   π(t) (EM)   q(t) (IB)
  1   1/2         2/3
  2   1/2         1/3

• p(x) is non-uniform
• The equivalence of fixed points does not mean that q(t) = π(t)
• When N → ∞ (β → ∞), every algorithm increases F iff it decreases L_{IB}.
• How large must N (or β) be?
• How is it related to the "amount of uniformity" in n(x)?
Simulations
[Plots: simulation results for iIB and for EM.]
• 200 runs = 100 (small N) + 100 (large N)
• In 58 runs, iIB converged to a smaller value of (−F) than EM
• In 46 runs, EM converged to a (−F) corresponding to a smaller value of L_{IB}
Quality estimation for the EM solution
• The quality of an IB solution is measured through the information-theoretic upper bound: \frac{I(T;Y)}{I(X;Y)} \le 1
• Using the mapping, one can adopt this measure for the ML estimation problem, for large enough N.
Summary: IB versus ML
• ML and IB approaches are equivalent under certain conditions.
• Model comparison:
  – The mixture model assumes that Y is independent of X given T(X): X → T → Y
  – In the IB framework, T is defined through the IB Markovian independence relation: T ↔ X ↔ Y
• One can adapt the quality-estimation measure from IB to the ML estimation problem, for large N.
Overview of the talk
• Brief review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation (L. Hermes et al.)
The clustering model
• Pixels o_i, i = 1, …, n
• Deterministic clusters c_ν, ν = 1, …, k
• Boolean assignment matrix M ∈ {0, 1}^{n×k} with entries M_{iν}
• Observations X_i = (x_{i1}, …, x_{in_i}) collected around pixel o_i
[Figure: pixel o_i with the neighborhood sites from which its n_i observations are drawn.]
• Cluster-conditional color model – a mixture of l Gaussians:

p(x \mid c_\nu) = \sum_{\alpha=1}^{l} p_{\nu\alpha}\, g(x \mid \mu_\alpha, \sigma_\alpha)
Likelihood
• Discretization of the color space into intervals I_j
• Set G_{j\alpha} = \int_{I_j} g(x \mid \mu_\alpha, \sigma_\alpha)\, dx
• Data likelihood:

p(X \mid M) = \prod_i \prod_j \left( \sum_{\alpha} p_{\nu(i)\alpha}\, G_{j\alpha} \right)^{n_{ij}}

where n_{ij} is the number of observations of pixel o_i falling into interval I_j and ν(i) is the cluster with M_{iν(i)} = 1.
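A small sketch of this likelihood in Python – the Gaussian bin integrals G_{jα} via the error function, then the log-likelihood for a given hard assignment. Function names and the interface are mine, assuming 1-D color values:

```python
import math

def bin_probs(mus, sigmas, edges):
    """G[j][a] = integral of Gaussian a over color interval I_j = [edges[j], edges[j+1])."""
    def cdf(x, mu, s):
        return 0.5 * (1 + math.erf((x - mu) / (s * math.sqrt(2))))
    return [[cdf(hi, mu, s) - cdf(lo, mu, s) for mu, s in zip(mus, sigmas)]
            for lo, hi in zip(edges[:-1], edges[1:])]

def log_likelihood(n_ij, assign, p_mix, G):
    """log p(X|M) = sum_i sum_j n_ij log(sum_a p[nu(i)][a] G[j][a]),
    with assign[i] = nu(i), the cluster of pixel i."""
    ll = 0.0
    for i, row in enumerate(n_ij):
        weights = p_mix[assign[i]]
        for j, n in enumerate(row):
            if n:
                ll += n * math.log(sum(w * g for w, g in zip(weights, G[j])))
    return ll
```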
Relation to the IB

Log-likelihood: \log L(M, p) = \log p(X \mid M, p) = \sum_i \sum_j n_{ij} \log\left(\sum_\alpha p_{\nu(i)\alpha}\, G_{j\alpha}\right)

IB functional: the same expression with the counts n_{ij} replaced by the relative frequencies n_{ij}/n_i.

• Assume that n_i = const and set β = n_i; then L_{IB} = −log L + const.
[Figure: images generated from the learned statistics.]
References
• N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method.
• N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck.
• R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants.
• J. Goldberger. Lecture notes.
• L. Hermes, T. Zöller, and J. M. Buhmann. Parametric distributional clustering for image segmentation.
The end