Information theory and machine learning
I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression
Summarize data by keeping only relevant information and throwing away irrelevant information.
Need a measure for information -> Shannon’s mutual information.
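Mutual information can be computed directly from a joint distribution; a minimal sketch in Python (the joint distribution below is an arbitrary illustration, not from the slides):

```python
import numpy as np

# Illustrative joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

def mutual_information(p_xy):
    """Shannon mutual information I(X;Y) in bits, from a joint distribution."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, 2)
    mask = p_xy > 0                        # 0 log 0 is taken to be 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

print(round(mutual_information(p_xy), 4))  # dependent variables -> I > 0
```

For an independent joint (e.g. uniform on all four cells) the same function returns 0.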
Need a notion of relevance.
Relevance (1)
Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.
Note: This is related to the similarity measure in unsupervised learning/cluster analysis.
Distortion function
Degree of freedom: which function to use is up to the experimenter.
It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no “natural” measure.
Example: Speech.
Relevance (2)
Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon's mutual information by defining:
Relevant information = the information about a variable of interest.
Then, there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.
Learning and lossy data compression
When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.
Example: K-means. Map N data points to K clusters with centroids x_c. If K << N, then we get a substantial compression: log(K) << log(N).
Recall: Entropy H[X] ~ log(N).
How “efficient” is the representation?
Entropy can be used to measure the compactness of the model. Sometimes called “statistical complexity” (J. P. Crutchfield.)
This is a good measure of complexity if we are searching for a deterministic map (in clustering, a "hard partition").
In general, we may search over probabilistic assignments (in clustering, a "soft partition").
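The compactness measure is easy to make concrete; the hard partition below (N = 8 points spread evenly over K = 4 clusters) is an illustrative assumption:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H[p] in bits (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical hard partition: N = 8 points spread evenly over K = 4 clusters.
# The representation costs H = log2(K) = 2 bits,
# versus log2(N) = 3 bits to single out an individual point.
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```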
Trade-off between compression and accuracy
Think about a continuous variable.
To describe it exactly, you need infinitely many bits.
But only finitely many bits are needed to describe it up to a given accuracy.
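This trade-off can be illustrated with a uniform quantizer: more bits buy a finer description of a continuous value. A minimal sketch (the quantizer and settings are illustrative, not from the slides):

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer on [0, 1): 2**bits levels, midpoint reconstruction."""
    levels = 2 ** bits
    return (np.floor(x * levels) + 0.5) / levels

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)  # samples of a continuous variable

for bits in (2, 4, 8):
    max_err = np.max(np.abs(x - quantize(x, bits)))
    print(f"{bits} bits: max error {max_err:.5f} (bound {2.0 ** -(bits + 1)})")
```

Each extra bit halves the worst-case reconstruction error; an exact description would require infinitely many bits.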
Rate distortion theory used for clustering
Find assignments x \to c = 1, \ldots, K from data points to clusters, and class representatives ("cluster centers") x_c, such that the average distortion

D = \langle d(x, x_c) \rangle

is small.
Compress the data by summarizing it efficiently in terms of bit cost: minimize the coding rate, the information which the assignments p(c|x) retain about the raw data,

I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x, c)}{p(x)\, p(c)}
Constrained optimization

\min_{p(c|x)} \left[ I(x; c) + \beta \langle d(x, x_c) \rangle \right]

Solution:

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

with x_c from the centroid condition:

\left\langle \frac{\partial}{\partial x_c} d(x, x_c) \right\rangle_{p(x|c)} = 0

Iterate between p(c|x) and x_c.
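These two self-consistent conditions can be solved by alternating between them at fixed \beta. A minimal numerical sketch, under illustrative assumptions (synthetic 1-D data, squared-error distortion, uniform p(x) over the sample):

```python
import numpy as np

def rd_update(x, xc, pc, beta):
    """One alternation: assignments p(c|x), then marginal p(c) and centers x_c.
    Uses d(x, x_c) = (x - x_c)**2 / 2 and uniform p(x) over the sample."""
    d = 0.5 * (x[:, None] - xc[None, :]) ** 2
    w = pc[None, :] * np.exp(-beta * d)
    pcx = w / w.sum(axis=1, keepdims=True)                  # p(c|x)
    pc = pcx.mean(axis=0)                                   # p(c) = sum_x p(x) p(c|x)
    xc = (pcx * x[:, None]).sum(axis=0) / pcx.sum(axis=0)   # centroid condition
    return xc, pc, pcx

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.3, 100), rng.normal(2, 0.3, 100)])
xc = np.array([-0.1, 0.1])   # slightly separated start to break symmetry
pc = np.array([0.5, 0.5])
for _ in range(100):
    xc, pc, pcx = rd_update(x, xc, pc, beta=5.0)
print(np.sort(xc))  # centers approach the two cluster means near -2 and 2
```

For squared-error distortion the centroid update is just the p(x|c)-weighted mean, as the Remarks slide below notes.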
Rate distortion curve
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

Family of optimal solutions, one for each value of the Lagrange multiplier \beta. This parameter controls the trade-off between compression and fidelity.
Evaluate the objective function at the optimum for each value of \beta and plot I vs. D. => Rate-distortion curve.
Remarks
For squared-error distortion, d = (x - x_c)^2 / 2, the centroid condition reduces to

x_c = \langle x \rangle_{p(x|c)}

"Soft K-means", because assignments can be probabilistic (fuzzy):

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]
Remarks
In the zero-temperature limit, \beta \to \infty, we have deterministic (or "hard") assignments, because with

c^* := \arg\min_c d(x, x_c), \qquad \Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0,

we get

p(c^*|x) = \frac{\exp(-\beta\, d(x, x_{c^*}))}{\sum_c \exp(-\beta\, d(x, x_c))} = \left[ 1 + \sum_{c \neq c^*} \exp(-\beta\, \Delta(x, c)) \right]^{-1} \to 1

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
Soft K-means algorithm
Choose a distortion measure.
Fix the "temperature" T to a very large value (corresponds to small \beta = 1/T).
Solve iteratively, until convergence:

Assignments: p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp[-\beta\, d(x, x_c)]

Centroids: \left\langle \frac{\partial}{\partial x_c} d(x, x_c) \right\rangle_{p(x|c)} = 0

Lower the temperature: T <- aT, and repeat.
a = "annealing rate", a positive constant smaller than 1.
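A minimal sketch of the full annealing loop for squared-error distortion on synthetic 1-D data. The starting temperature, annealing rate a, perturbation size, and iteration counts are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def soft_kmeans_annealed(x, K=2, T0=100.0, a=0.8, T_min=1e-2, inner=50, seed=0):
    """Deterministic-annealing soft K-means, d(x, x_c) = (x - x_c)**2 / 2."""
    rng = np.random.default_rng(seed)
    xc = x.mean() + 1e-3 * rng.standard_normal(K)  # start near the global mean
    pc = np.full(K, 1.0 / K)
    T = T0
    while T > T_min:
        beta = 1.0 / T
        # Tiny perturbation so identical centers can split at phase transitions.
        xc = xc + 1e-3 * rng.standard_normal(K)
        for _ in range(inner):
            d = 0.5 * (x[:, None] - xc[None, :]) ** 2
            w = pc[None, :] * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
            pcx = w / w.sum(axis=1, keepdims=True)                 # assignments
            pc = pcx.mean(axis=0)
            xc = (pcx * x[:, None]).sum(axis=0) / pcx.sum(axis=0)  # centroids
        T *= a  # lower the temperature and repeat
    return np.sort(xc)

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3, 0.4, 150), rng.normal(3, 0.4, 150)])
print(soft_kmeans_annealed(x))  # approximately the two component means, -3 and 3
```

At high T the assignments are nearly uniform and both centers sit at the global mean; as T drops past a critical value the centers split, which is the annealing behavior described by Rose.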
Rate-distortion curve
[Figure: bit rate vs. distortion. One optimal curve for each number of clusters, K = 2, K = 3, etc.; each curve separates the feasible region (above) from the infeasible region (below).]
How to choose the distortion function?
Example: Speech
Cluster speech such that signals which encode the same word group together.
It may be extremely difficult to define a distortion function which achieves this.
Intelligibility criterion?
Should measure how well the meaning is preserved.
Information Bottleneck Method (Tishby, Pereira, Bialek, 1999)
Instead of guessing a distortion function, define relevant information as information the data carries about a quantity of interest (Example: phonemes or words.)
The data are compressed such that the relevant information is maximally preserved.
[Diagram: the data X are compressed into clusters C (min I(x, c)) while preserving information about the relevance variable y (max I(c, y)).]
Constrained optimization

\max_{p(c|x)} \left[ I(y; c) - \beta\, I(x; c) \right]

Optimal assignment rule?
Constrained optimization

\max_{p(c|x)} \left[ I(y; c) - \beta\, I(x; c) \right]

Optimal assignment rule: the Kullback-Leibler divergence emerges as the distortion function:

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\frac{1}{\beta} D_{KL}[p(y|x) \| p(y|c)] \right]

where

D_{KL}[p(y|x) \| p(y|c)] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}
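The assignment rule can be sketched directly: given current cluster predictions p(y|c), each x is drawn toward the cluster whose prediction is closest in KL divergence. The distributions and \beta below are illustrative assumptions:

```python
import numpy as np

def kl_bits(p, q):
    """D_KL[p || q] in bits (assumes q > 0 wherever p > 0)."""
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

def ib_assignments(p_y_x, p_y_c, pc, beta):
    """IB rule: p(c|x) proportional to p(c) exp(-(1/beta) D_KL[p(y|x)||p(y|c)])."""
    logits = -np.array([[kl_bits(px, pyc) for pyc in p_y_c]
                        for px in p_y_x]) / beta
    w = pc[None, :] * np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# Two inputs with opposite predictive distributions over a binary y,
# and two clusters whose current predictions p(y|c) roughly mirror them:
p_y_x = np.array([[0.9, 0.1], [0.1, 0.9]])
p_y_c = np.array([[0.85, 0.15], [0.15, 0.85]])
pcx = ib_assignments(p_y_x, p_y_c, pc=np.array([0.5, 0.5]), beta=0.1)
print(pcx.round(3))  # each x goes almost entirely to the matching cluster
```

No ad hoc distortion function was chosen here; the KL term plays that role automatically, as the slide states.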
Information plane
[Figure: I(c, y) vs. I(c, x). One optimal curve for each number of clusters, K = 2, K = 3, K = 4, etc.; the region above the curves is infeasible.]
Homework
Implement Soft K-means algorithm.
Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.