Information theory and machine learning
I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression
Summarize data by keeping only relevant information and throwing away irrelevant information.
Need a measure for information -> Shannon’s mutual information.
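Mutual information can be computed directly from a joint distribution; a minimal sketch in Python (the joint distribution below is an arbitrary illustration, not from the slides):

```python
import numpy as np

# Illustrative joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def mutual_information(p_xy):
    """Shannon mutual information I(X;Y) in bits, from a joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, 2)
    mask = p_xy > 0                        # 0 log 0 is taken to be 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

print(round(mutual_information(p_xy), 4))  # dependent variables -> I > 0
```

For an independent joint (e.g. uniform on all four cells) the same function returns 0.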
Need a notion of relevance.
Relevance (1)
Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.
Note: This is related to the similarity measure in unsupervised learning/cluster analysis.
Distortion function
Degree of freedom: which function to use is up to the experimenter.
It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no “natural” measure.
Example: Speech.
Relevance (2)
Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon's mutual information by defining:
Relevant information = the information about a variable of interest.
Then, there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.
Learning and lossy data compression
When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.
Example: K-means. Map N data points to K clusters with centroids x_c. If K << N, then we get a substantial compression: log(K) << log(N).
Recall: Entropy H[X] ~ log(N).
How “efficient” is the representation?
Entropy can be used to measure the compactness of the model. Sometimes called “statistical complexity” (J. P. Crutchfield.)
This is a good measure of complexity if we are searching for a deterministic map (in clustering, a "hard partition").
In general, we may search over probabilistic assignments (in clustering, a "soft partition").
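The compactness measure is easy to make concrete; the hard partition below (N = 8 points spread evenly over K = 4 clusters) is an illustrative assumption:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H[p] in bits (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical hard partition: N = 8 points spread evenly over K = 4 clusters.
# The representation costs H = log2(K) = 2 bits,
# versus log2(N) = 3 bits to single out an individual point.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```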
Trade-off between compression and accuracy
Think about a continuous variable.
To describe it exactly, you need infinitely many bits.
But only finitely many bits are needed to describe it up to a given accuracy.
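This trade-off can be illustrated with a uniform quantizer: more bits buy a finer description of a continuous value. A minimal sketch (the quantizer and settings are illustrative, not from the slides):

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer on [0, 1): 2**bits levels, midpoint reconstruction."""
    levels = 2 ** bits
    return (np.floor(x * levels) + 0.5) / levels

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)  # samples of a continuous variable

for bits in (2, 4, 8):
    max_err = np.max(np.abs(x - quantize(x, bits)))
    print(f"{bits} bits: max error {max_err:.5f} (bound {2.0 ** -(bits + 1)})")
```

Each extra bit halves the worst-case reconstruction error; an exact description would require infinitely many bits.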
Rate distortion theory used for clustering
Find assignments x \to c = 1, \ldots, K from data points to clusters, and class representatives ("cluster centers") x_c, such that the average distortion

D = \langle d(x, x_c) \rangle

is small.
Compress the data by summarizing it efficiently in terms of bit cost: minimize the coding rate, the information which the assignments p(c|x) retain about the raw data,

I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x, c)}{p(x)\, p(c)}
Constrained optimization

\min_{p(c|x)} \left[ I(x; c) + \beta \langle d(x, x_c) \rangle \right]

Solution:

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

with x_c from the centroid condition:

\left\langle \frac{\partial}{\partial x_c} d(x, x_c) \right\rangle_{p(x|c)} = 0

Iterate between p(c|x) and x_c.
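These two self-consistent conditions can be solved by alternating between them at fixed \beta. A minimal numerical sketch, under illustrative assumptions (synthetic 1-D data, squared-error distortion, uniform p(x) over the sample):

```python
import numpy as np

def rd_update(x, xc, pc, beta):
    """One alternation: assignments p(c|x), then marginal p(c) and centers x_c.
    Uses d(x, x_c) = (x - x_c)**2 / 2 and uniform p(x) over the sample."""
    d = 0.5 * (x[:, None] - xc[None, :]) ** 2
    w = pc[None, :] * np.exp(-beta * d)
    pcx = w / w.sum(axis=1, keepdims=True)                  # p(c|x)
    pc = pcx.mean(axis=0)                                   # p(c) = sum_x p(x) p(c|x)
    xc = (pcx * x[:, None]).sum(axis=0) / pcx.sum(axis=0)   # centroid condition
    return xc, pc, pcx

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.3, 100), rng.normal(2, 0.3, 100)])
xc = np.array([-0.1, 0.1])   # slightly separated start to break symmetry
pc = np.array([0.5, 0.5])
for _ in range(100):
    xc, pc, pcx = rd_update(x, xc, pc, beta=5.0)
print(np.sort(xc))  # centers approach the two cluster means near -2 and 2
```

For squared-error distortion the centroid update is just the p(x|c)-weighted mean, as the Remarks slide below notes.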
Rate distortion curve
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

Family of optimal solutions, one for each value of the Lagrange multiplier \beta. This parameter controls the trade-off between compression and fidelity.
Evaluate the objective function at the optimum for each value of \beta and plot I vs. D. => Rate-distortion curve.
Remarks
For squared-error distortion, d = (x - x_c)^2 / 2, the centroid condition reduces to

x_c = \langle x \rangle_{p(x|c)}

"Soft K-means", because assignments can be probabilistic (fuzzy):

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]
Remarks
In the zero-temperature limit, \beta \to \infty, we have deterministic (or "hard") assignments, because with

c^* := \arg\min_c d(x, x_c), \qquad \Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0,

we get

p(c^*|x) = \frac{\exp(-\beta\, d(x, x_{c^*}))}{\sum_c \exp(-\beta\, d(x, x_c))} = \left[ 1 + \sum_{c \neq c^*} \exp(-\beta\, \Delta(x, c)) \right]^{-1} \to 1

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
Soft K-means algorithm
Choose a distortion measure.
Fix the "temperature" T to a very large value (corresponds to small \beta = 1/T).
Solve iteratively, until convergence:

Assignments: p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

Centroids: \left\langle \frac{\partial}{\partial x_c} d(x, x_c) \right\rangle_{p(x|c)} = 0

Lower the temperature: T <- aT, and repeat.
a = "annealing rate", a positive constant smaller than 1.
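A minimal sketch of the full annealing loop for squared-error distortion on synthetic 1-D data. The starting temperature, annealing rate a, perturbation size, and iteration counts are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def soft_kmeans_annealed(x, K=2, T0=100.0, a=0.8, T_min=1e-2, inner=50, seed=0):
    """Deterministic-annealing soft K-means, d(x, x_c) = (x - x_c)**2 / 2."""
    rng = np.random.default_rng(seed)
    xc = x.mean() + 1e-3 * rng.standard_normal(K)  # start near the global mean
    pc = np.full(K, 1.0 / K)
    T = T0
    while T > T_min:
        beta = 1.0 / T
        # Tiny perturbation so identical centers can split at phase transitions.
        xc = xc + 1e-3 * rng.standard_normal(K)
        for _ in range(inner):
            d = 0.5 * (x[:, None] - xc[None, :]) ** 2
            w = pc[None, :] * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
            pcx = w / w.sum(axis=1, keepdims=True)                 # assignments
            pc = pcx.mean(axis=0)
            xc = (pcx * x[:, None]).sum(axis=0) / pcx.sum(axis=0)  # centroids
        T *= a  # lower the temperature and repeat
    return np.sort(xc)

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3, 0.4, 150), rng.normal(3, 0.4, 150)])
print(soft_kmeans_annealed(x))  # approximately the two component means, -3 and 3
```

At high T the assignments are nearly uniform and both centers sit at the global mean; as T drops past a critical value the centers split, which is the annealing behavior described by Rose.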
Rate-distortion curve
[Figure: bit rate vs. distortion. One optimal curve for each number of clusters, K = 2, K = 3, etc.; each curve separates the feasible region (above) from the infeasible region (below).]
How to choose the distortion function?
Example: Speech
Cluster speech such that signals which encode the same word group together.
It may be extremely difficult to define a distortion function which achieves this.
Intelligibility criterion?
Should measure how well the meaning is preserved.
Information Bottleneck Method (Tishby, Pereira, Bialek, 1999)
Instead of guessing a distortion function, define relevant information as information the data carries about a quantity of interest (Example: phonemes or words.)
The data are compressed such that the relevant information is maximally preserved.
[Diagram: the data X are compressed into clusters C (min I(x, c)) while preserving information about the relevance variable y (max I(c, y)).]
Constrained optimization

\max_{p(c|x)} \left[ I(y; c) - \beta\, I(x; c) \right]

Optimal assignment rule?
Constrained optimization

\max_{p(c|x)} \left[ I(y; c) - \beta\, I(x; c) \right]

Optimal assignment rule: the Kullback-Leibler divergence emerges as the distortion function:

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\frac{1}{\beta} D_{KL}[p(y|x) \| p(y|c)] \right]

where

D_{KL}[p(y|x) \| p(y|c)] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}
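The assignment rule can be sketched directly: given current cluster predictions p(y|c), each x is drawn toward the cluster whose prediction is closest in KL divergence. The distributions and \beta below are illustrative assumptions:

```python
import numpy as np

def kl_bits(p, q):
    """D_KL[p || q] in bits (assumes q > 0 wherever p > 0)."""
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

def ib_assignments(p_y_x, p_y_c, pc, beta):
    """IB rule: p(c|x) proportional to p(c) exp(-(1/beta) D_KL[p(y|x)||p(y|c)])."""
    logits = -np.array([[kl_bits(px, pyc) for pyc in p_y_c]
                        for px in p_y_x]) / beta
    w = pc[None, :] * np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# Two inputs with opposite predictive distributions over a binary y,
# and two clusters whose current predictions p(y|c) roughly mirror them:
p_y_x = np.array([[0.9, 0.1], [0.1, 0.9]])
p_y_c = np.array([[0.85, 0.15], [0.15, 0.85]])
pcx = ib_assignments(p_y_x, p_y_c, pc=np.array([0.5, 0.5]), beta=0.1)
print(pcx.round(3))  # each x goes almost entirely to the matching cluster
```

No ad hoc distortion function was chosen here; the KL term plays that role automatically, as the slide states.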
Information plane
[Figure: I(c, y) vs. I(c, x). One optimal curve for each number of clusters, K = 2, K = 3, K = 4, etc.; the region above the curves is infeasible.]
Homework
Implement Soft K-means algorithm.
Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.