Information theory and machine learning I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means. II: Information Bottleneck Method.



Page 1:

Information theory and machine learning

I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.

II: Information Bottleneck Method.

Page 2:

Lossy Compression

Summarize data by keeping only relevant information and throwing away irrelevant information.

Need a measure for information -> Shannon’s mutual information.

Need a notion of relevance.

Page 3:

Relevance (1)

Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.

Note: This is related to the similarity measure in unsupervised learning/cluster analysis.

Page 4:

Distortion function

Degree of freedom: which function to use is up to the experimenter.

It is not always obvious which function should be used, especially if the data do not live in a metric space and there is therefore no “natural” measure.

Example: Speech.

Page 5:

Relevance (2)

Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon’s mutual information by defining:

Relevant information = the information about a variable of interest.

Then, there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.

Page 6:

Learning and lossy data compression

When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.

Example: K-means. Map N data points to K clusters with centroids x_c. If K << N, then we get a substantial compression: log(K) << log(N).

Recall: Entropy H[X] ~ log(N).
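To make the bit savings concrete (illustrative numbers, not from the lecture):

\[
N = 10^6:\ \log_2 N \approx 19.9 \text{ bits per point}; \qquad
K = 16:\ \log_2 K = 4 \text{ bits per point}.
\]

Naming a point’s cluster instead of the point itself costs roughly a factor of five fewer bits.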

Page 7:

How “efficient” is the representation?

Entropy can be used to measure the compactness of the model; this is sometimes called “statistical complexity” (J. P. Crutchfield).

This is a good measure of complexity if we are searching for a deterministic map (in clustering called a “hard partition”).

In general, we may search over probabilistic assignments (in clustering, a “soft partition”).

Page 8:

Trade-off between compression and accuracy

Think about a continuous variable.

To describe it exactly, you need infinitely many bits.

But only finitely many bits are needed to describe it up to some finite accuracy.
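A standard back-of-the-envelope example (uniform scalar quantization; not from the slides): quantize a variable that is uniform on $[0,1)$ into $2^R$ equal bins of width $\Delta = 2^{-R}$. Describing the bin index takes $R$ bits, and rounding to the bin midpoint gives mean squared error

\[
D(R) = \frac{\Delta^2}{12} = \frac{2^{-2R}}{12},
\]

so each additional bit quarters the squared error, while an exact description (zero error) would need $R \to \infty$.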

Page 9:

Rate distortion theory used for clustering

Find assignments $p(c|x)$ from data $x \in X$ to clusters $c = 1, \dots, K$, and find class-representatives (“cluster centers”) $x_c$, such that the average distortion

\[
D = \langle d(x, x_c) \rangle
\]

is small.

Compress the data by summarizing it efficiently in terms of bit-cost: minimize the coding rate, the information which the clusters retain about the raw data,

\[
I(x, c) = \sum_{x, c} p(x, c) \log \frac{p(x, c)}{p(x)\, p(c)}.
\]

Page 10:

Constrained optimization

Minimize over the assignments $p(c|x)$ and the representatives $x_c$:

\[
\min_{p(c|x),\, x_c} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle \right]
\]

Solution:

\[
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]
\]

The representatives $x_c$ follow from the centroid condition

\[
\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0.
\]
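For completeness, a sketch of the standard variational argument behind this solution (not spelled out on the slide): treat the marginal $p(c)$ as fixed (it is updated self-consistently), add a Lagrange multiplier $\lambda(x)$ for the normalization of $p(c|x)$, and set the variation to zero:

\[
\frac{\delta}{\delta p(c|x)} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle + \sum_x \lambda(x) \sum_c p(c|x) \right]
= p(x) \left[ \log \frac{p(c|x)}{p(c)} + 1 + \beta\, d(x, x_c) \right] + \lambda(x) = 0.
\]

Solving for $p(c|x)$ gives $p(c|x) \propto p(c)\, e^{-\beta d(x, x_c)}$; the partition function $Z(x, \beta) = \sum_c p(c)\, e^{-\beta d(x, x_c)}$ absorbs $\lambda(x)$ and enforces normalization.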

Page 11:

Rate distortion curve

\[
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]
\]

Family of optimal solutions, one for each value of the Lagrange multiplier β. This parameter controls the trade-off between compression and fidelity.

Evaluate the objective function at the optimum for each value of β and plot I vs. D. => Rate-distortion curve.
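One standard property of this curve, worth noting alongside the plot (a known rate-distortion fact, not stated on the slide): the multiplier is the negative slope of the curve at the corresponding optimum,

\[
\frac{dI}{dD} = -\beta,
\]

so β → ∞ traces out the low-distortion, high-rate end and β → 0 the high-compression end.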

Page 12:

Remarks

For squared-error distortion, $d(x, x_c) = (x - x_c)^2 / 2$, the centroid condition reduces to

\[
x_c = \langle x \rangle_{p(x|c)}.
\]

“Soft K-means,” because assignments can be probabilistic (fuzzy):

\[
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]
\]
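The reduction is a one-line computation from the centroid condition:

\[
\frac{d}{dx_c} \frac{(x - x_c)^2}{2} = -(x - x_c)
\quad\Longrightarrow\quad
\langle x - x_c \rangle_{p(x|c)} = 0
\quad\Longrightarrow\quad
x_c = \langle x \rangle_{p(x|c)}.
\]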

Page 13:

Remarks

In the zero-temperature limit, $\beta \to \infty$, we have deterministic (or “hard”) assignments, because with

\[
c^* := \arg\min_c d(x, x_c), \qquad
\Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0 \ \text{ for } c \neq c^*,
\]

the assignment probability of the closest center tends to one:

\[
p(c^*|x) = \frac{\exp\left( -\beta\, d(x, x_{c^*}) \right)}{\sum_c \exp\left( -\beta\, d(x, x_c) \right)}
= \left[ 1 + \sum_{c \neq c^*} \exp\left( -\beta\, \Delta(x, c) \right) \right]^{-1}
\longrightarrow 1.
\]

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).

Page 14:

Soft K-means algorithm

Choose a distortion measure.

Fix the “temperature,” T, to a very large value (corresponds to small β = 1/T).

Solve iteratively, until convergence:

Assignments:

\[
p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]
\]

Centroids:

\[
\left\langle \frac{d}{dx_c}\, d(x, x_c) \right\rangle_{p(x|c)} = 0
\]

Lower the temperature: T <- aT, and repeat.

a = “annealing rate,” a positive number less than 1.
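A minimal runnable sketch of this loop in Python/NumPy (not the course’s reference code; the squared-error distortion, the function name, the starting temperature, the cooling rate a = 0.9, and the convergence tolerance are illustrative choices):

import numpy as np

def soft_kmeans(X, K, T0=100.0, a=0.9, T_min=1e-3, tol=1e-6, seed=0):
    # Deterministic-annealing soft K-means with squared-error distortion
    # d(x, x_c) = ||x - x_c||^2 / 2 and uniform data weights p(x) = 1/N.
    rng = np.random.default_rng(seed)
    N, dim = X.shape
    xc = X.mean(axis=0) + 0.01 * rng.standard_normal((K, dim))  # centroids
    pc = np.full(K, 1.0 / K)                                    # p(c)
    T = T0
    while T > T_min:
        beta = 1.0 / T
        for _ in range(500):
            # Assignment step: p(c|x) proportional to p(c) exp(-beta d(x, x_c)).
            d = 0.5 * ((X[:, None, :] - xc[None, :, :]) ** 2).sum(axis=-1)
            logits = np.log(pc + 1e-12)[None, :] - beta * d
            logits -= logits.max(axis=1, keepdims=True)   # avoid overflow
            pcx = np.exp(logits)
            pcx /= pcx.sum(axis=1, keepdims=True)
            # Centroid step (squared-error centroid condition): x_c = <x>_{p(x|c)}.
            pc = pcx.mean(axis=0)                          # p(c) = sum_x p(x) p(c|x)
            xc_new = pcx.T @ X / (pcx.sum(axis=0)[:, None] + 1e-12)
            if np.abs(xc_new - xc).max() < tol:
                xc = xc_new
                break
            xc = xc_new
        T *= a   # annealing: lower the temperature
    return pcx, xc, pc

On two well-separated Gaussian blobs, for example X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))]) with K = 2, the centroids should split into the two blobs once T drops below the between-cluster scale.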

Page 15:

Rate-distortion curve

[Figure: rate-distortion curves, bit rate (vertical axis) versus distortion (horizontal axis), one curve per number of clusters (K = 2, K = 3, etc.). The region below the curves is infeasible; the region above is feasible.]

Page 16:

How to choose the distortion function?

Example: Speech

Cluster speech such that signals that encode the same word group together.

It may be extremely difficult to define a distortion function that achieves this.

An intelligibility criterion?

It should measure how well the meaning is preserved.

Page 17:

Information Bottleneck Method (Tishby, Pereira, Bialek, 1999)

Instead of guessing a distortion function, define relevant information as information the data carries about a quantity of interest (Example: phonemes or words.)

The data are compressed such that the relevant information is maximally preserved.

[Diagram: the data X are mapped to clusters C; y is the relevant variable. Minimize I(x,c) while maximizing I(c,y).]

Page 18:

Constrained optimization

\[
\max_{p(c|x)} \left[ I(y, c) - \lambda\, I(x, c) \right]
\]

Optimal assignment rule?

Page 19:

Constrained optimization

\[
\max_{p(c|x)} \left[ I(y, c) - \lambda\, I(x, c) \right]
\]

Optimal assignment rule:

\[
p(c|x) = \frac{p(c)}{Z(x, \lambda)} \exp\left[ -\frac{1}{\lambda}\, D_{KL}\!\left[ p(y|x) \,\|\, p(y|c) \right] \right]
\]

The Kullback-Leibler divergence emerges as the distortion function:

\[
D_{KL}\!\left[ p(y|x) \,\|\, p(y|c) \right] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}
\]
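A minimal sketch of the resulting self-consistent iteration in Python/NumPy (the alternating updates follow the iterative IB scheme of Tishby, Pereira, and Bialek; the function name, the random initialization, the fixed iteration count, and the use of natural logarithms, which only rescales λ, are assumptions of this sketch):

import numpy as np

def information_bottleneck(pxy, K, lam, n_iter=200, seed=0):
    # Alternate the three self-consistent equations for p(c|x), p(c), p(y|c)
    # derived from max [I(y,c) - lam * I(x,c)]. pxy is the joint p(x, y).
    rng = np.random.default_rng(seed)
    nx, ny = pxy.shape
    px = pxy.sum(axis=1)                      # p(x)
    py_x = pxy / px[:, None]                  # p(y|x)
    pcx = rng.random((nx, K))                 # random initial p(c|x)
    pcx /= pcx.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        pc = px @ pcx                         # p(c) = sum_x p(x) p(c|x)
        # p(y|c) = (1/p(c)) sum_x p(x) p(c|x) p(y|x)
        py_c = (pcx * px[:, None]).T @ py_x / (pc[:, None] + eps)
        # D_KL[p(y|x) || p(y|c)] for every (x, c) pair; natural log here.
        dkl = (py_x[:, None, :] *
               (np.log(py_x[:, None, :] + eps) -
                np.log(py_c[None, :, :] + eps))).sum(axis=-1)   # (nx, K)
        # Assignment rule: p(c|x) proportional to p(c) exp(-(1/lam) D_KL)
        logits = np.log(pc + eps)[None, :] - dkl / lam
        logits -= logits.max(axis=1, keepdims=True)             # stability
        pcx = np.exp(logits)
        pcx /= pcx.sum(axis=1, keepdims=True)
    return pcx, pc, py_c

Small λ (weak compression penalty) drives the assignments toward hard clusters that preserve I(y, c); large λ collapses all of x into few clusters.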

Page 20:

Information Plane

[Figure: the information plane, relevant information I(c,y) (vertical axis) versus compression cost I(c,x) (horizontal axis), with one optimal curve per cardinality (K=2, K=3, K=4, etc.). The region above the curves is infeasible; suboptimal solutions (marked x) lie below them.]

Page 21:

Homework

Implement the Soft K-means algorithm.

Helpful reading (on course website): K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.