Information Bottleneck versus Maximum Likelihood
Felix Polyakov
Overview of the talk
• Brief review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation
A Simple Example...

         Israel  Health  www  Drug  Jewish  Dos  Doctor  ...
Doc1     12      0       0    0     8       0    0       ...
Doc2     0       9       2    11    1       0    6       ...
Doc3     0       10      1    6     0       0    20      ...
Doc4     9       1       0    0     7       0    1       ...
Doc5     0       3       9    0     1       10   0       ...
Doc6     1       11      0    6     0       1    7       ...
Doc7     0       0       8    0     2       12   2       ...
Doc8     15      0       1    1     10      0    0       ...
Doc9     0       12      1    16    0       1    12      ...
Doc10    1       0       9    0     1       11   2       ...
...      ...     ...     ...  ...   ...     ...  ...     ...

N. Tishby
Simple Example

         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...

Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...

Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

N. Tishby
          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...

A new compact representation
The document clusters preserve the relevant information between the documents and the words.

N. Tishby
Feature Selection?
• NO ASSUMPTIONS about the source of the data
• Extracting relevant structure from data – functions of the data (statistics) that preserve information
• Information about what?
• Need a principle that is both general and precise.

N. Tishby
[Diagram: the documents X = {x1, ..., xn} are mapped to clusters Cx = {c1, ..., ck} that preserve the information I(Cx; Y) about the words Y = {y1, ..., ym}. Documents on one side, words on the other.]
N. Tishby
The information bottleneck or relevance through distortion
• We would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible
[Diagram: X → T → Y; T is obtained from X through the compression mapping p(t|x) and predicts Y through p(y|t); the input data are characterized by I(X;Y).]

I(T;Y) = \sum_{t,y} p(t,y) \log \frac{p(y \mid t)}{p(y)}
N. Tishby, F. Pereira, and W. Bialek
[Diagram: the empirical joint distribution P(X,Y) is summarized by the compressed joint distribution q(T,Y).]
• Goal: find q(T | X)
  – note the Markovian independence relation T ↔ X ↔ Y
• Variational problem: minimize the IB functional L_IB = I(T;X) − β·I(T;Y) over q(T | X)
• Iterative algorithm – a sketch follows below
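To make the iterative algorithm concrete, here is a minimal Python sketch of the self-consistent IB updates; the function name, random initialization, and smoothing constants are my own implementation choices, not part of the original formulation.

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Minimal sketch of iterative IB: alternate the three self-consistent
    updates for q(t|x), q(t), q(y|t) until (local) convergence."""
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_given_x = p_xy / p_x[:, None]            # p(y|x)

    q_t_given_x = rng.random((n_x, n_clusters))  # random soft initialization
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = q_t_given_x.T @ p_x                            # q(t) = sum_x p(x) q(t|x)
        q_y_given_t = (q_t_given_x.T @ p_xy) / q_t[:, None]  # q(y|t)
        # KL[p(y|x) || q(y|t)] for every (x, t) pair
        log_ratio = np.log(p_y_given_x[:, None, :] + 1e-12) \
                  - np.log(q_y_given_t[None, :, :] + 1e-12)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # q(t|x) ~ q(t) exp(-beta * KL); shift exponents for numerical stability
        w = -beta * kl
        w -= w.max(axis=1, keepdims=True)
        q_t_given_x = q_t[None, :] * np.exp(w)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x, q_t, q_y_given_t
```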
Overview of the talk
• Short review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation
A coin is known to be biased. The coin is tossed three times – two heads and one tail. Use ML to estimate the probability of throwing a head.
[Plot: likelihood of the data as a function of the probability of a head.]

Try P = 0.2: L(O) = 0.2 · 0.2 · 0.8 = 0.032
Try P = 0.4: L(O) = 0.4 · 0.4 · 0.6 = 0.096
Try P = 0.6: L(O) = 0.6 · 0.6 · 0.4 = 0.144
Try P = 0.8: L(O) = 0.8 · 0.8 · 0.2 = 0.128
A simple example...
• Model: p(head) = P, p(tail) = 1 − P
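A quick Python check of the search above (the grid and helper name are mine); it reproduces the four likelihood values and locates the maximum near P = 2/3:

```python
def likelihood(p):
    # L(O) = P * P * (1 - P): two heads, one tail
    return p * p * (1 - p)

for p in (0.2, 0.4, 0.6, 0.8):
    print(f"P = {p}: L(O) = {likelihood(p):.3f}")

grid = [i / 1000 for i in range(1001)]
p_ml = max(grid, key=likelihood)
print(f"ML estimate: P = {p_ml:.3f}")   # ~0.667, i.e. 2/3 heads
```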
A bit more complicated example: Mixture Model
• Three baskets with white (O = 1), grey (O = 2), and black (O = 3) balls
[Figure: baskets B1, B2, B3.]
• 15 balls were drawn as follows:
  1. Choose a basket according to p(i) = b_i
  2. Draw ball j from basket i with probability p(j | i)
• Use ML to estimate the parameters θ = {b_i, p(j|i)} given the observations: the sequence of ball colors
• Likelihood of the observations: L(\theta) = \prod_k \sum_i b_i\, p(o_k \mid i)
• Log-likelihood of the observations: \log L(\theta) = \sum_k \log \sum_i b_i\, p(o_k \mid i)
• Maximal likelihood of the observations: \theta^{*} = \arg\max_\theta \log L(\theta)
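As a sketch of the likelihood computation in code – the parameter values below are invented for illustration; only the structure (marginalize over the hidden basket, then multiply across draws) follows the slide:

```python
import math

b = [0.5, 0.3, 0.2]            # hypothetical basket priors p(i) = b_i
p_ball = [[0.7, 0.2, 0.1],     # hypothetical p(j|i); rows = baskets,
          [0.2, 0.6, 0.2],     # columns = white, grey, black
          [0.1, 0.2, 0.7]]
observations = [0, 1, 1, 2, 0] # observed colors (0 = white, 1 = grey, 2 = black)

# The basket is hidden, so each draw contributes p(j) = sum_i b_i * p(j|i).
log_l = sum(math.log(sum(b[i] * p_ball[i][j] for i in range(3)))
            for j in observations)
print("log-likelihood:", log_l)
```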
Likelihood of the observed data
• x – hidden random variables [e.g., basket]
• y – observed random variables [e.g., color]
• θ – model parameters [e.g., they define p(y|x)]
• θ0 – current estimate of the model parameters
\log L(\theta; y) = \log p(y;\theta) = \mathbb{E}_{x|y;\theta_0}\!\left[\log p(x,y;\theta)\right] - \mathbb{E}_{x|y;\theta_0}\!\left[\log p(x|y;\theta)\right]

= \sum_x p(x|y;\theta_0) \log p(x,y;\theta) - \sum_x p(x|y;\theta_0) \log p(x|y;\theta)

The first term is Q(\theta, \theta_0) = \sum_x p(x|y;\theta_0) \log p(x,y;\theta); the second splits into a KL divergence and an entropy:

- \sum_x p(x|y;\theta_0) \log p(x|y;\theta) = \mathrm{KL}\!\left[p(x|y;\theta_0)\,\|\,p(x|y;\theta)\right] + H_{x|y;\theta_0}

so that

\log L(\theta; y) = Q(\theta, \theta_0) + \mathrm{KL}\!\left[p(x|y;\theta_0)\,\|\,p(x|y;\theta)\right] + H_{x|y;\theta_0}
Expectation–maximization algorithm (I)
1. Expectation
   – Compute p(x \mid y; \theta_t)
   – Get Q(\theta, \theta_t) = \mathbb{E}_{x|y;\theta_t}\!\left[\log p(x,y;\theta)\right]
2. Maximization
   – \theta_{t+1} = \arg\max_\theta Q(\theta, \theta_t)
• Log-likelihood is non-decreasing: \log L(\theta_{t+1}; y) \ge \log L(\theta_t; y)
• EM algorithm converges to local maxima
[Plot: examples of the log-likelihood increasing monotonically across EM iterations.]
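Here is a minimal EM sketch for the basket mixture above; the function and variable names are mine, and the random initialization is an implementation choice. Printing the log-likelihood each iteration shows the non-decreasing behavior.

```python
import numpy as np

def em_mixture(obs, n_baskets, n_colors, n_iter=50, seed=0):
    """EM for the basket mixture: estimate priors b_i and color
    probabilities p(j|i) from the observed color sequence."""
    rng = np.random.default_rng(seed)
    b = np.full(n_baskets, 1.0 / n_baskets)
    p = rng.random((n_baskets, n_colors))
    p /= p.sum(axis=1, keepdims=True)
    counts = np.bincount(obs, minlength=n_colors)     # color histogram

    for _ in range(n_iter):
        joint = b[:, None] * p                        # b_i * p(j|i)
        print("log L:", (counts * np.log(joint.sum(axis=0))).sum())
        # E-step: responsibilities p(basket = i | color = j; theta_t)
        r = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        expected = r * counts[None, :]
        b = expected.sum(axis=1) / expected.sum()
        p = expected / expected.sum(axis=1, keepdims=True)
    return b, p

b_hat, p_hat = em_mixture([0, 1, 1, 2, 0, 2, 2, 1, 0, 1, 2, 0, 1, 2, 2], 3, 3)
```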
EM – another approach

Goal: maximize \log p(y;\theta) = \log \sum_x q(x)\, \frac{p(x,y;\theta)}{q(x)}

Jensen's inequality for a concave function: f\!\left(\sum_k \lambda_k x_k\right) \ge \sum_k \lambda_k f(x_k), \quad \sum_k \lambda_k = 1,\ \lambda_k \in [0,1]

Applying it to the (concave) logarithm:

\log p(y;\theta) \ge \sum_x q(x) \log \frac{p(x,y;\theta)}{q(x)} = \sum_x q(x) \log p(x,y;\theta) + H(q) = F(q, \theta)

The gap is a KL divergence:

0 \le \log p(y;\theta) - F(q,\theta) = \mathrm{KL}\!\left[q(x)\,\|\,p(x|y;\theta)\right]

and therefore \log p(y;\theta) = F\!\left(p(x|y;\theta), \theta\right) = \max_q F(q, \theta).
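A small numeric check of this bound (the toy numbers are mine): F(q, θ) never exceeds log p(y; θ), and equals it exactly at q = p(x|y; θ).

```python
import numpy as np

p_xy = np.array([0.1, 0.3, 0.2])   # toy values of p(x, y; theta) at the observed y
log_p_y = np.log(p_xy.sum())       # log p(y; theta)

def free_energy(q):
    """F(q, theta) = sum_x q(x) log p(x, y; theta) + H(q)."""
    q = np.asarray(q, dtype=float)
    nz = q > 0                      # 0 * log 0 = 0 by convention
    return (q[nz] * np.log(p_xy[nz])).sum() - (q[nz] * np.log(q[nz])).sum()

print("log p(y):        ", log_p_y)
print("F at q = p(x|y): ", free_energy(p_xy / p_xy.sum()))  # equals log p(y)
print("F at uniform q:  ", free_energy([1/3, 1/3, 1/3]))    # strictly smaller
```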
Expectation–maximization algorithm (II)
1. Expectation: \hat{q}_{t+1} = \arg\max_q F(q; \theta_t) = p(x|y;\theta_t)
2. Maximization: \theta_{t+1} = \arg\max_\theta \left[\sum_x \hat{q}_{t+1}(x) \log p(x,y;\theta) + H(\hat{q}_{t+1})\right] = \arg\max_\theta \sum_x p(x|y;\theta_t) \log p(x,y;\theta) = \arg\max_\theta Q(\theta, \theta_t)

(I) and (II) are equivalent. Monotonicity follows from

\log p(y;\theta_t) = F\!\left(p(x|y;\theta_t); \theta_t\right) \le F\!\left(p(x|y;\theta_t); \theta_{t+1}\right) \le F\!\left(p(x|y;\theta_{t+1}); \theta_{t+1}\right) = \log p(y;\theta_{t+1})
Scheme of the approach
[Figure: alternating coordinate ascent on F(q, θ) – the E-step maximizes over q, the M-step over θ.]
Overview of the talk
• Short review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood for a toy problem
• Example from Image Segmentation
         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...
Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...
Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

Documents – X (rows), Words – Y (columns), Topics – t

Model parameters: t ~ π(t), x ~ π(x), y|t ~ π(y|t)
[Bar plot: the prior π(x) over the documents X = 1, ..., 10, with values between 0 and 0.2.]

Example
• x_i = 9
  – t(9) = 2
  – sample from π(y|2), get y_i = "Drug"
  – set n(9, "Drug") = n(9, "Drug") + 1
Sampling algorithm
• For i = 1:N
  – choose x_i by sampling from π(x)
  – choose y_i by sampling from π(y|t(x_i))
  – increase n(x_i, y_i) by one
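A direct Python transcription of this sampling scheme (the names and the toy parameters below are mine):

```python
import numpy as np

def sample_counts(pi_x, t_of_x, pi_y_given_t, n_samples, seed=0):
    """Draw x ~ pi(x), then y ~ pi(y | t(x)), and accumulate n(x, y)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = len(pi_x), pi_y_given_t.shape[1]
    n = np.zeros((n_x, n_y), dtype=int)
    for _ in range(n_samples):
        x = rng.choice(n_x, p=pi_x)                     # choose a document
        y = rng.choice(n_y, p=pi_y_given_t[t_of_x[x]])  # word from its topic
        n[x, y] += 1
    return n

# Toy setup mirroring the slides: 10 documents, 3 topics, 7 words.
t_of_x = np.array([0, 1, 1, 0, 2, 1, 2, 0, 1, 2])  # slide's 1 2 2 1 3 2 3 1 2 3, zero-based
pi_x = np.full(10, 0.1)
pi_y_given_t = np.random.default_rng(1).dirichlet(np.ones(7), size=3)
print(sample_counts(pi_x, t_of_x, pi_y_given_t, n_samples=150))
```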
         Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Doc1     12      8       0       0     0       0    0    ...
Doc4     9       7       1       0     1       0    0    ...
Doc8     15      10      0       1     0       1    0    ...
Doc2     0       1       9       11    6       2    0    ...
Doc3     0       0       10      6     20      1    0    ...
Doc6     1       0       11      6     7       0    1    ...
Doc9     0       0       12      16    12      1    1    ...
Doc5     0       1       3       0     0       9    10   ...
Doc7     0       2       0       0     2       8    12   ...
Doc10    1       1       0       0     2       9    11   ...
...      ...     ...     ...     ...   ...     ...  ...  ...

X:    1  2  3  4  5  6  7  8  9  10
t(X): 1  2  2  1  3  2  3  1  2  3

The rows of the aggregated table give π(y|t=1), π(y|t=2), π(y|t=3) (up to normalization):

          Israel  Jewish  Health  Drug  Doctor  www  Dos  ...
Cluster1  36      25      1       1     1       1    0    ...
Cluster2  1       1       42      39    45      4    2    ...
Cluster3  1       4       3       0     4       26   33   ...
...       ...     ...     ...     ...   ...     ...  ...  ...
Toy problem: which parameters maximize the likelihood?
• T = topics, X = documents, Y = words
[Diagram: x → t(x) → y, i.e., each document x is assigned a topic t(x), which emits words via π(y|t(x)).]
• Likelihood of the observed counts n(x, y):

\log L = \sum_{x,y} n(x,y) \log\left[\pi(x)\, \pi(y \mid t(x))\right]
EM approach
• E-step:
  q(t \mid x) = k(x)\, \pi(t)\, e^{-n(x)\, \mathrm{KL}\left[\hat{p}(y|x) \,\|\, \pi(y|t)\right]}
  where k(x) is a normalization factor, n(x) = \sum_y n(x,y), and \hat{p}(y|x) = n(x,y)/n(x)
• M-step:
  \pi(y \mid t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_x n(x)\, q(t|x)}, \qquad \pi(t) \propto \sum_x q(t|x)
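A sketch of these two steps in Python. The names are mine; the π(t) normalization below averages q(t|x) uniformly over documents, which matches the uniform-X setting discussed later.

```python
import numpy as np

def e_step(n_xy, pi_t, pi_y_given_t, eps=1e-12):
    """q(t|x) = k(x) pi(t) exp(-n(x) KL[p_hat(y|x) || pi(y|t)])."""
    n_x = n_xy.sum(axis=1)                          # n(x): total words in document x
    p_hat = n_xy / n_x[:, None]                     # empirical word distribution p_hat(y|x)
    log_ratio = np.log(p_hat + eps)[:, None, :] - np.log(pi_y_given_t + eps)[None, :, :]
    kl = (p_hat[:, None, :] * log_ratio).sum(axis=2)   # KL for every (x, t)
    q = pi_t[None, :] * np.exp(-n_x[:, None] * kl)
    return q / q.sum(axis=1, keepdims=True)            # k(x) normalizes over t

def m_step(n_xy, q_t_given_x):
    """Re-estimate pi(t) and pi(y|t) from the soft assignments."""
    pi_t = q_t_given_x.sum(axis=0) / q_t_given_x.shape[0]   # average over documents
    weighted = q_t_given_x.T @ n_xy                         # sum_x q(t|x) n(x, y)
    return pi_t, weighted / weighted.sum(axis=1, keepdims=True)
```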
IB approach
Define q^{IB}(x, y, t) = p(x, y)\, q(t \mid x). The iterative updates are:
  q(t \mid x) = \frac{q(t)}{Z(x, \beta)}\, e^{-\beta\, \mathrm{KL}\left[p(y|x) \,\|\, q(y|t)\right]}
  where Z(x, \beta) is a normalization factor,
  q(t) = \sum_x q(t \mid x)\, p(x), \qquad q(y \mid t) = \frac{1}{q(t)} \sum_x q(t \mid x)\, p(x)\, p(y \mid x)
The two sets of update equations, side by side:

ML (EM):
  q(t \mid x) = k(x)\, \pi(t)\, e^{-n(x)\, \mathrm{KL}[\hat{p}(y|x) \,\|\, \pi(y|t)]}
  \pi(y \mid t) = \frac{\sum_x n(x,y)\, q(t|x)}{\sum_x n(x)\, q(t|x)}, \qquad \pi(t) \propto \sum_x q(t|x)

IB:
  q(t \mid x) = \frac{q(t)}{Z(x,\beta)}\, e^{-\beta\, \mathrm{KL}[p(y|x) \,\|\, q(y|t)]}
  q(t) = \sum_x q(t \mid x)\, p(x), \qquad q(y \mid t) = \frac{1}{q(t)} \sum_x q(t \mid x)\, p(x)\, p(y \mid x)
IB ↔ ML mapping:
  q(t|x) ↔ q(t|x),  π(t) ↔ q(t),  π(y|t) ↔ q(y|t)
  p(x, y) = \frac{1}{N}\, n(x, y), \qquad \beta = \frac{N}{r}
  where r is a scaling constant and N = \sum_{x,y} n(x,y).
• X is uniformly distributed
• r = |X|
⇒ The EM algorithm is equivalent to the iterative IB algorithm.
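In code, the mapping is just a re-labeling plus two substitutions; a tiny sketch (function name mine, following the table above):

```python
def ml_to_ib(n_xy, pi_t, pi_y_given_t, q_t_given_x, r):
    """Package EM quantities as inputs/outputs of iterative IB:
    p(x,y) = n(x,y)/N and beta = N/r, with r a scaling constant."""
    N = n_xy.sum()
    return {"p_xy": n_xy / N,             # joint distribution fed to iIB
            "beta": N / r,                # IB trade-off parameter
            "q_t": pi_t,                  # pi(t)   <-> q(t)
            "q_y_given_t": pi_y_given_t,  # pi(y|t) <-> q(y|t)
            "q_t_given_x": q_t_given_x}   # q(t|x)  <-> q(t|x)
```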
[Figure: under the mapping, the fixed points of iterative IB and of EM coincide.]
• X is uniformly distributed
• β = n(x)
• X is uniformly distributed, β = n(x)
• All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_{IB} = I(T;X) − β·I(T;Y).
• At the fixed points, L_{IB} = −(1/r)·log L + const.
• −(1/r)·F − H(Y) = L_{IB} + const, so every algorithm increases F iff it decreases L_{IB}.
Deterministic case
• N → ∞ (or β → ∞)
EM (ML):
  q_{t+1}(t \mid x) \to 1 \text{ if } t = \arg\min_{t'} \mathrm{KL}\left[\hat{p}(y|x) \,\|\, \pi(y|t')\right], \quad 0 \text{ otherwise}
IB:
  q_{\mathrm{new}}(t \mid x) \to 1 \text{ if } t = \arg\min_{t'} \mathrm{KL}\left[p(y|x) \,\|\, q(y|t')\right], \quad 0 \text{ otherwise}
with q(y|t) ↔ π(y|t).
• As N → ∞ (or β → ∞) – no assumption about the uniformity of X is needed here.
• All the fixed points of L are mapped to all the fixed points of L_{IB} (= −F + const).
• Every algorithm which finds a fixed point of L induces a fixed point of L_{IB}, and vice versa.
• In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_{IB}.
Example

  π(x)   x
  2/3    Yellow submarine
  1/3    Red bull

N → ∞ (β → ∞):

  t   π(t) (EM)   q(t) (IB)
  1   1/2         2/3
  2   1/2         1/3

• p(x) is non-uniform
• The equivalence of fixed points does not mean that q(t) = π(t)
• When N → ∞ (β → ∞), every algorithm increases F iff it decreases L_{IB}.
• How large must N (or β) be?
• How is it related to the "amount of uniformity" in n(x)?
Simulations
[Plots: simulation results for iIB and for EM.]
• 200 runs = 100 (small N) + 100 (large N)
• In 58 runs, iIB converged to a smaller value of (−F) than EM
• In 46 runs, EM converged to a (−F) corresponding to a smaller value of L_{IB}
Quality estimation for the EM solution
• The quality of an IB solution is measured through the information-theoretic upper bound: \frac{I(T;Y)}{I(X;Y)} \le 1
• Using the mapping, one can adopt this measure for the ML estimation problem, for large enough N.
Summary: IB versus ML
• ML and IB approaches are equivalent under certain conditions.
• Model comparison:
  – The mixture model assumes that Y is independent of X given T(X): X → T → Y
  – In the IB framework, T is defined through the IB Markovian independence relation: T ↔ X ↔ Y
• One can adapt the quality-estimation measure from IB to the ML estimation problem, for large N.
Overview of the talk
• Brief review of the Information Bottleneck
• Maximum Likelihood
• Information Bottleneck and Maximum Likelihood
• Example from Image Segmentation (L. Hermes et al.)
The clustering model
• Pixels o_i, i = 1, …, n
• Deterministic clusters c_ν, ν = 1, …, k
• Boolean assignment matrix M ∈ {0, 1}^{n×k} with entries M_{iν}
• Observations X_i = (x_{i1}, …, x_{in_i}) collected around pixel o_i
[Figure: pixel o_i with the neighborhood sites from which its n_i observations are drawn.]
• Cluster-conditional color model – a mixture of l Gaussians:

p(x \mid c_\nu) = \sum_{\alpha=1}^{l} p_{\nu\alpha}\, g(x \mid \mu_\alpha, \sigma_\alpha)
Likelihood
• Discretization of the color space into intervals I_j
• Set G_{j\alpha} = \int_{I_j} g(x \mid \mu_\alpha, \sigma_\alpha)\, dx
• Data likelihood:

p(X \mid M) = \prod_i \prod_j \left( \sum_{\alpha} p_{\nu(i)\alpha}\, G_{j\alpha} \right)^{n_{ij}}

where n_{ij} is the number of observations of pixel o_i falling into interval I_j and ν(i) is the cluster with M_{iν(i)} = 1.
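A small sketch of this likelihood in Python – the Gaussian bin integrals G_{jα} via the error function, then the log-likelihood for a given hard assignment. Function names and the interface are mine, assuming 1-D color values:

```python
import math

def bin_probs(mus, sigmas, edges):
    """G[j][a] = integral of Gaussian a over color interval I_j = [edges[j], edges[j+1])."""
    def cdf(x, mu, s):
        return 0.5 * (1 + math.erf((x - mu) / (s * math.sqrt(2))))
    return [[cdf(hi, mu, s) - cdf(lo, mu, s) for mu, s in zip(mus, sigmas)]
            for lo, hi in zip(edges[:-1], edges[1:])]

def log_likelihood(n_ij, assign, p_mix, G):
    """log p(X|M) = sum_i sum_j n_ij log(sum_a p[nu(i)][a] G[j][a]),
    with assign[i] = nu(i), the cluster of pixel i."""
    ll = 0.0
    for i, row in enumerate(n_ij):
        weights = p_mix[assign[i]]
        for j, n in enumerate(row):
            if n:
                ll += n * math.log(sum(w * g for w, g in zip(weights, G[j])))
    return ll
```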
Relation to the IB

Log-likelihood: \log L(M, p) = \log p(X \mid M, p) = \sum_i \sum_j n_{ij} \log\left(\sum_\alpha p_{\nu(i)\alpha}\, G_{j\alpha}\right)

IB functional: the same expression with the counts n_{ij} replaced by the relative frequencies n_{ij}/n_i.

• Assume that n_i = const and set β = n_i; then L_{IB} = −log L + const.
[Figure: images generated from the learned statistics.]
References
• N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method.
• N. Slonim and Y. Weiss. Maximum likelihood and the information bottleneck.
• R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants.
• J. Goldberger. Lecture notes.
• L. Hermes, T. Zöller, and J. M. Buhmann. Parametric distributional clustering for image segmentation.
The end