
2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy


Page 1: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Copyright © 2001, 2003, Andrew W. Moore

Entropy and Information Gain

Andrew W. Moore

Professor

School of Computer Science

Carnegie Mellon University www.cs.cmu.edu/~awm

[email protected]

412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Page 2: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 2 Copyright © 2001, 2003, Andrew W. Moore

Bits You are watching a set of independent random samples of X

You see that X has four possible values

So you might see: BAACBADCDADDDA…

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11)

0100001001001110110011111100…

P(X=A) = 1/4 P(X=B) = 1/4 P(X=C) = 1/4 P(X=D) = 1/4

Page 3: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 3 Copyright © 2001, 2003, Andrew W. Moore

Fewer Bits Someone tells you that the probabilities are not equal

It’s possible…

…to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

Page 4: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 4 Copyright © 2001, 2003, Andrew W. Moore

Fewer Bits Someone tells you that the probabilities are not equal

It’s possible…

…to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

(This is just one of several ways)

P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

A 0

B 10

C 110

D 111
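A quick numeric check (a minimal R sketch using the probabilities and code lengths above): the expected length of this prefix code is 1.75 bits per symbol, matching the claim.

p <- c(A = 1/2, B = 1/4, C = 1/8, D = 1/8)    # symbol probabilities
len <- c(A = 1, B = 2, C = 3, D = 3)          # lengths of the codewords 0, 10, 110, 111
sum(p * len)                                  # expected bits per symbol = 1.75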

Page 5: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 5 Copyright © 2001, 2003, Andrew W. Moore

Fewer Bits Suppose there are three equally likely values…

Here’s a naïve coding, costing 2 bits per symbol

Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3

A 00

B 01

C 10
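The 1.58496 figure is log2(3). One way to get close to it (an illustrative block-coding scheme, not from the slides): encode 5 symbols at a time; there are 3^5 = 243 ≤ 256 = 2^8 possible blocks, so 8 bits cover every block, i.e. 8/5 = 1.6 bits per symbol on average.

log2(3)    # 1.584963, the theoretical limit
8 / 5      # 1.6 bits per symbol for the 5-symbol block code sketched above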

Page 6: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 6 Copyright © 2001, 2003, Andrew W. Moore

Suppose X can have one of m values… V1, V2, … Vm

What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s

H(X) = The entropy of X (Shannon, 1948)

• “High Entropy” means X is from a uniform (boring) distribution

• “Low Entropy” means X is from varied (peaks and valleys) distribution

General Case

P(X=V1) = p1    P(X=V2) = p2    ….    P(X=Vm) = pm

H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm

     = −Σj pj log2 pj    (summing over j = 1, …, m)
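This definition translates directly into R (a minimal sketch; entropy is just a local helper name, not something from the slides):

entropy <- function(p) -sum(p * log2(p))    # p: a vector of probabilities summing to 1
entropy(c(1/4, 1/4, 1/4, 1/4))              # 2 bits: the uniform four-value case
entropy(c(1/2, 1/4, 1/8, 1/8))              # 1.75 bits: the "fewer bits" case
entropy(c(1/3, 1/3, 1/3))                   # 1.584963 bits: three equally likely values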

Page 7: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 7 Copyright © 2001, 2003, Andrew W. Moore

Suppose X can have one of m values… V1, V2, … Vm

What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s

H(X) = The entropy of X

• “High Entropy” means X is from a uniform (boring) distribution

• “Low Entropy” means X is from varied (peaks and valleys) distribution

General Case

P(X=V1) = p1    P(X=V2) = p2    ….    P(X=Vm) = pm

H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm

     = −Σj pj log2 pj    (summing over j = 1, …, m)

A histogram of the frequency distribution of values of X would be flat

A histogram of the frequency distribution of values of X would have many lows and one or two highs

Page 8: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 8 Copyright © 2001, 2003, Andrew W. Moore

Suppose X can have one of m values… V1, V2, … Vm

What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s

H(X) = The entropy of X

• “High Entropy” means X is from a uniform (boring) distribution

• “Low Entropy” means X is from varied (peaks and valleys) distribution

General Case

P(X=V1) = p1    P(X=V2) = p2    ….    P(X=Vm) = pm

H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm

     = −Σj pj log2 pj    (summing over j = 1, …, m)

A histogram of the frequency distribution of values of X would be flat

A histogram of the frequency distribution of values of X would have many lows and one or two highs

..and so the values sampled from it would be all over the place

..and so the values sampled from it would be more predictable

Page 9: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 9 Copyright © 2001, 2003, Andrew W. Moore

Entropy in a nut-shell

Low Entropy High Entropy

Page 10: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 10 Copyright © 2001, 2003, Andrew W. Moore

Entropy in a nut-shell

Low Entropy: the values (locations of soup) sampled entirely from within the soup bowl.

High Entropy: the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room.

Page 11: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 11 Copyright © 2001, 2003, Andrew W. Moore

Entropy of a PDF

Entropy of X:   H[X] = −∫ p(x) log p(x) dx        (natural log, ln or log_e)

The larger the entropy of a distribution…

…the harder it is to predict

…the harder it is to compress it

…the less spiky the distribution

Page 12: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 12 Copyright © 2001, 2003, Andrew W. Moore

The “box” distribution

p(x) = 1/w   if |x| ≤ w/2
       0     if |x| > w/2

(The density is a flat box of height 1/w on the interval from −w/2 to w/2.)

H[X] = −∫ p(x) log p(x) dx

     = −∫_{−w/2}^{w/2} (1/w) log(1/w) dx

     = log w

Page 13: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 13 Copyright © 2001, 2003, Andrew W. Moore

Unit variance box distribution

p(x) = 1/w   if |x| ≤ w/2
       0     if |x| > w/2

E[X] = 0        Var[X] = w²/12

if w = 2√3 then Var[X] = 1 and H[X] = log(2√3) ≈ 1.242

(The unit-variance box runs from −√3 to √3 and has height 1/(2√3).)
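The log(2√3) ≈ 1.242 value can be checked numerically (a minimal R sketch; p is a local helper for the box density):

w <- 2 * sqrt(3)                                  # width that gives unit variance
p <- function(x) ifelse(abs(x) <= w/2, 1/w, 0)    # the box density
integrate(function(x) -p(x) * log(p(x)), -w/2, w/2)$value    # ≈ 1.2425
log(w)                                            # the same value, log(2*sqrt(3))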

Page 14: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 14 Copyright © 2001, 2003, Andrew W. Moore

The Hat distribution

p(x) = (w − |x|) / w²   if |x| ≤ w
       0                if |x| > w

(The density is a triangular “hat” of height 1/w at x = 0, falling to zero at x = −w and x = w.)

E[X] = 0        Var[X] = w²/6

Page 15: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 15 Copyright © 2001, 2003, Andrew W. Moore

Unit variance hat distribution

p(x) = (w − |x|) / w²   if |x| ≤ w
       0                if |x| > w

E[X] = 0        Var[X] = w²/6

if w = √6 then Var[X] = 1 and H[X] ≈ 1.396

(The unit-variance hat runs from −√6 to √6 and has height 1/√6 at x = 0.)
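The 1.396 figure can be checked the same way (a minimal R sketch; the ifelse guard avoids taking log(0) at the edges of the support):

w <- sqrt(6)                                      # width that gives unit variance
p <- function(x) pmax((w - abs(x)) / w^2, 0)      # the hat density
h <- function(x) ifelse(p(x) > 0, -p(x) * log(p(x)), 0)
integrate(h, -w, w)$value                         # ≈ 1.3959
0.5 + 0.5 * log(6)                                # 1/2 + log(w), the closed form for this triangle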

Page 16: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 16 Copyright © 2001, 2003, Andrew W. Moore

The “2 spikes” distribution

p(x) = ½ δ(x + 1) + ½ δ(x − 1)        (Dirac deltas: spikes of probability ½ at x = −1 and x = +1)

E[X] = 0        Var[X] = 1

H[X] = −∫ p(x) log p(x) dx

Page 17: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 17 Copyright © 2001, 2003, Andrew W. Moore

Entropies of unit-variance distributions

Distribution     Entropy
Box              1.242
Hat              1.396
2 spikes         −infinity
???              1.4189   ← the largest possible entropy of any unit-variance distribution

Page 18: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 18 Copyright © 2001, 2003, Andrew W. Moore

Unit variance Gaussian

p(x) = (1/√(2π)) exp(−x²/2)

E[X] = 0        Var[X] = 1

H[X] = −∫ p(x) log p(x) dx ≈ 1.4189
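A quick R check (a minimal sketch; dnorm and integrate are base R). The 1.4189 value is ½ log(2πe), which the previous slide identified as the largest possible entropy of any unit-variance distribution:

integrate(function(x) -dnorm(x) * log(dnorm(x)), -Inf, Inf)$value    # ≈ 1.418939
0.5 * log(2 * pi * exp(1))                                           # the same value in closed form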

Page 19: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 19 Copyright © 2001, 2003, Andrew W. Moore

Specific Conditional Entropy H(Y|X=v)

Suppose I’m trying to predict output Y and I have input X

Let’s assume this reflects the true probabilities

E.G. From this data we estimate

• P(LikeG = Yes) = 0.5

• P(Major = Math & LikeG = No) = 0.25

• P(Major = Math) = 0.5

• P(LikeG = Yes | Major = History) = 0

Note:

• H(X) = 1.5

• H(Y) = 1

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 20: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 20 Copyright © 2001, 2003, Andrew W. Moore

Definition of Specific Conditional Entropy:

H(Y |X=v) = The entropy of Y among only those records in which X has value v

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Specific Conditional Entropy H(Y|X=v)

Page 21: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 21 Copyright © 2001, 2003, Andrew W. Moore

Definition of Specific Conditional Entropy:

H(Y |X=v) = The entropy of Y among only those records in which X has value v

Example:

• H(Y|X=Math) = 1

• H(Y|X=History) = 0

• H(Y|X=CS) = 0

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Specific Conditional Entropy H(Y|X=v)

Page 22: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 22 Copyright © 2001, 2003, Andrew W. Moore

Conditional Entropy H(Y|X)

Definition of Conditional Entropy:

H(Y |X) = The average specific conditional entropy of Y

= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row’s value of X

= Expected number of bits to transmit Y if both sides will know the value of X

= Σj Prob(X=vj) H(Y | X = vj)

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 23: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 23 Copyright © 2001, 2003, Andrew W. Moore

Conditional Entropy Definition of Conditional Entropy:

H(Y|X) = The average conditional entropy of Y

= Σj Prob(X=vj) H(Y | X = vj)

X = College Major

Y = Likes “Gladiator”

Example:

vj Prob(X=vj) H(Y | X = vj)

Math 0.5 1

History 0.25 0

CS 0.25 0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 24: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 24 Copyright © 2001, 2003, Andrew W. Moore

Information Gain Definition of Information Gain:

IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

IG(Y|X) = H(Y) - H(Y | X)

X = College Major

Y = Likes “Gladiator”

Example:

• H(Y) = 1

• H(Y|X) = 0.5

• Thus IG(Y|X) = 1 – 0.5 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes
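The whole calculation on this slide can be reproduced from the eight (Major, LikeG) rows (a minimal R sketch; the variable names are local choices):

major <- c("Math", "History", "CS", "Math", "Math", "CS", "History", "Math")
likeg <- c("Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes")
H <- function(x) {                       # entropy of the empirical distribution of x
  p <- table(x) / length(x)
  p <- p[p > 0]                          # drop zero-probability cells before taking logs
  -sum(p * log2(p))
}
H(likeg)                                 # H(Y) = 1 bit
HYgivenX <- sum(sapply(unique(major),    # H(Y|X) = Σj P(X=vj) H(Y | X=vj)
  function(v) mean(major == v) * H(likeg[major == v])))
HYgivenX                                 # 0.5
H(likeg) - HYgivenX                      # IG(Y|X) = 0.5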

Page 25: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 25 Copyright © 2001, 2003, Andrew W. Moore

Relative Entropy: the Kullback-Leibler Distance

D(p, q) = Σx p(x) log2( p(x) / q(x) )
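In R this is a one-liner (a minimal sketch; kl is a local helper name, and both arguments are assumed to be strictly positive probability vectors of the same length):

kl <- function(p, q) sum(p * log2(p / q))    # D(p, q)
p <- c(1/2, 1/4, 1/8, 1/8)
q <- c(1/4, 1/4, 1/4, 1/4)
kl(p, q)    # 0.25 bits: the cost of using the uniform code for the skewed source
kl(p, p)    # 0: a distribution is at distance zero from itself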

Page 26: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 26 Copyright © 2001, 2003, Andrew W. Moore

Mutual Information: a quantity that measures the mutual dependence of two random variables.

For discrete X and Y, with marginals p(x) and q(y):

I(X, Y) = Σx Σy p(x, y) log2( p(x, y) / ( p(x) q(y) ) )

Conditioned on a third variable C:

I(X, Y | C) = Σx Σy p(x, y | c) log2( p(x, y | c) / ( p(x | c) q(y | c) ) )

For continuous X and Y:

I(X, Y) = ∫∫ p(x, y) log2( p(x, y) / ( p(x) q(y) ) ) dx dy

Page 27: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 27 Copyright © 2001, 2003, Andrew W. Moore

Mutual Information: I(X,Y) = H(Y) − H(Y|X)

I(X, Y) = Σx Σy p(x, y) log2( p(y|x) / q(y) )

        = −Σx Σy p(x, y) log2 q(y) + Σx Σy p(x, y) log2 p(y|x)

        = −Σy q(y) log2 q(y) + Σx p(x) Σy p(y|x) log2 p(y|x)

        = H(Y) − H(Y|X)

Page 28: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 28 Copyright © 2001, 2003, Andrew W. Moore

Mutual information

• I(X,Y) = H(Y) − H(Y|X)

• I(X,Y) = H(X) − H(X|Y)

• I(X,Y)=H(X)+H(Y)-H(X,Y)

• I(X,Y)=I(Y,X)

• I(X,X)=H(X)
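Using the College Major / Likes “Gladiator” table from the earlier slides, the discrete formula gives I(X, Y) = 0.5 bits, matching H(Y) − H(Y|X) = 1 − 0.5 (a minimal R sketch; joint is a local name for the 3×2 table of joint probabilities):

joint <- rbind(Math    = c(No = 0.25, Yes = 0.25),   # P(Major, LikeG)
               History = c(No = 0.25, Yes = 0.00),
               CS      = c(No = 0.00, Yes = 0.25))
px <- rowSums(joint)                                 # marginal of Major
qy <- colSums(joint)                                 # marginal of LikeG
terms <- joint * log2(joint / outer(px, qy))
sum(terms[joint > 0])                                # drop the zero cells; I(X, Y) = 0.5 bits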

Page 29: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 29 Copyright © 2001, 2003, Andrew W. Moore

Information Gain Example

Page 30: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 30 Copyright © 2001, 2003, Andrew W. Moore

Another example

Page 31: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 31 Copyright © 2001, 2003, Andrew W. Moore

Relative Information Gain Definition of Relative Information Gain:

RIG(Y|X) = I must transmit Y, what fraction of the bits on average would it save me if both ends of the line knew X?

RIG(Y|X) = [H(Y) - H(Y | X) ]/ H(Y)

X = College Major

Y = Likes “Gladiator”

Example:

• H(Y|X) = 0.5

• H(Y) = 1

• Thus RIG(Y|X) = (1 – 0.5)/1 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 32: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 32 Copyright © 2001, 2003, Andrew W. Moore

What is Information Gain used for?

Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find…

•IG(LongLife | HairColor) = 0.01

•IG(LongLife | Smoker) = 0.2

•IG(LongLife | Gender) = 0.25

•IG(LongLife | LastDigitOfSSN) = 0.00001

IG tells you how interesting a 2-d contingency table is going to be.

Page 33: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 33 Copyright © 2001, 2003, Andrew W. Moore

Cross Entropy

H_C(X) = E[ −log q(x) ] = H(X) + KL(p, q)

Let X be a random variable with known distribution p(x) and estimated distribution q(x). The cross entropy measures the difference between the two distributions, and is defined by the expression above, where H(X) is the entropy of X with respect to the distribution p and KL is the Kullback-Leibler distance between p and q.

If p and q are discrete, this reduces to:

H_C(X) = −Σx p(x) log2 q(x)

and for continuous p and q:

H_C(X) = −∫ p(x) log q(x) dx
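A discrete check of that identity in R (a minimal sketch; the helper names are local choices):

p <- c(1/2, 1/4, 1/8, 1/8)
q <- c(1/4, 1/4, 1/4, 1/4)
crossent <- function(p, q) -sum(p * log2(q))
entropy  <- function(p) -sum(p * log2(p))
kl       <- function(p, q) sum(p * log2(p / q))
crossent(p, q)               # 2 bits
entropy(p) + kl(p, q)        # 1.75 + 0.25 = 2 bits, as the identity says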

Page 34: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 34 Copyright © 2001, 2003, Andrew W. Moore

Bivariate Gaussians

Write the r.v. X = (X; Y), with

μ = (μx; μy)       Σ = [ σx²   σxy ]
                       [ σxy   σy² ]

Then define X ~ N(μ, Σ) to mean

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Where the Gaussian’s parameters are the mean vector μ and the covariance matrix Σ, and where we insist that Σ is symmetric non-negative definite.

Page 35: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 35 Copyright © 2001, 2003, Andrew W. Moore

Bivariate Gaussians

Write the r.v. X = (X; Y), with

μ = (μx; μy)       Σ = [ σx²   σxy ]
                       [ σxy   σy² ]

Then define X ~ N(μ, Σ) to mean

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Where the Gaussian’s parameters are the mean vector μ and the covariance matrix Σ, and where we insist that Σ is symmetric non-negative definite.

It turns out that E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)

Page 36: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 36 Copyright © 2001, 2003, Andrew W. Moore

Evaluating p(x): Step 1

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

1. Begin with vector x

Page 37: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 37 Copyright © 2001, 2003, Andrew W. Moore

Evaluating p(x): Step 2

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

1. Begin with vector x

2. Define Δ = x − μ

Page 38: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 38 Copyright © 2001, 2003, Andrew W. Moore

Evaluating p(x): Step 3

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

1. Begin with vector x

2. Define Δ = x − μ

3. Count the number of contours crossed of the ellipsoids formed by Σ⁻¹.
   D = this count = √(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ

   Contours are defined by √(Δᵀ Σ⁻¹ Δ) = constant

Page 39: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 39 Copyright © 2001, 2003, Andrew W. Moore

Evaluating p(x): Step 4

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

1. Begin with vector x

2. Define Δ = x − μ

3. Count the number of contours crossed of the ellipsoids formed by Σ⁻¹.
   D = this count = √(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ

4. Define w = exp(−D²/2)

   (Plot: w = exp(−D²/2) against D². An x close to μ in squared Mahalanobis distance gets a large weight; far away gets a tiny weight.)

Page 40: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 40 Copyright © 2001, 2003, Andrew W. Moore

Evaluating p(x): Step 5

p(x) = 1 / (2π ||Σ||^(1/2)) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

1. Begin with vector x

2. Define Δ = x − μ

3. Count the number of contours crossed of the ellipsoids formed by Σ⁻¹.
   D = this count = √(Δᵀ Σ⁻¹ Δ) = the Mahalanobis distance between x and μ

4. Define w = exp(−D²/2)

5. Multiply w by 1 / (2π ||Σ||^(1/2)) to ensure ∫ p(x) dx = 1
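The five steps translate directly into R for the bivariate case (a minimal sketch; mu, Sigma and x are example values, not from the slides):

mu    <- c(0, 0)
Sigma <- matrix(c(2, 0.6, 0.6, 1), nrow = 2)         # symmetric, positive definite
x     <- c(1, -0.5)
delta <- x - mu                                      # step 2
D2    <- drop(t(delta) %*% solve(Sigma) %*% delta)   # step 3: squared Mahalanobis distance
w     <- exp(-D2 / 2)                                # step 4
w / (2 * pi * sqrt(det(Sigma)))                      # step 5: normalize to get p(x)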

Page 41: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 41 Copyright © 2001, 2003, Andrew W. Moore

Bivariate Normal NB(0,0,1,1,0)
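The persp and filled.contour calls on these two slides assume a grid x, y and a matrix a of density values; one way they might have been built (an assumption, since the setup is not shown on the slides) is:

x <- seq(-3, 3, length.out = 50)
y <- seq(-3, 3, length.out = 50)
a <- outer(x, y, function(x, y) dnorm(x) * dnorm(y))   # NB(0,0,1,1,0): two independent standard normals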

persp(x,y,a,theta=30,phi=10,zlab="f(x,y)",box=FALSE,col=4)

Page 42: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 42 Copyright © 2001, 2003, Andrew W. Moore

Bivariate Normal NB(0,0,1,1,0)

(Filled contour plot of the density over x, y ∈ [−3, 3]; density values range from 0.00 to about 0.20.)

filled.contour(x,y,a,nlevels=4,col=2:5)

Page 43: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 43 Copyright © 2001, 2003, Andrew W. Moore

Multivariate Gaussians

Write the r.v. X = (X1; X2; …; Xm), with

μ = (μ1; μ2; …; μm)       Σ = [ σ1²   σ12   …   σ1m ]
                              [ σ12   σ2²   …   σ2m ]
                              [  …     …    …    …  ]
                              [ σ1m   σ2m   …   σm² ]

Then define X ~ N(μ, Σ) to mean

p(x) = 1 / ( (2π)^(m/2) ||Σ||^(1/2) ) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Where we insist that Σ is symmetric non-negative definite.

Again, E[X] = μ and Cov[X] = Σ. (Note that this is a resulting property of Gaussians, not a definition.)

Page 44: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 44 Copyright © 2001, 2003, Andrew W. Moore

General Gaussians

μ = (μ1; μ2; …; μm)       Σ = any symmetric non-negative definite covariance matrix (the σij may all be non-zero)

(Figure: the contours of p(x) in the (x1, x2) plane are ellipses of arbitrary orientation.)

Page 45: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 45 Copyright © 2001, 2003, Andrew W. Moore

Axis-Aligned Gaussians

μ = (μ1; μ2; …; μm)       Σ = diag(σ1², σ2², …, σm²)    (all off-diagonal entries are 0)

Xi is independent of Xj for i ≠ j.

(Figure: the contours of p(x) in the (x1, x2) plane are axis-aligned ellipses.)

Page 46: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 46 Copyright © 2001, 2003, Andrew W. Moore

Spherical Gaussians

μ = (μ1; μ2; …; μm)       Σ = σ² I    (diagonal, with the same variance σ² in every direction)

Xi is independent of Xj for i ≠ j.

(Figure: the contours of p(x) in the (x1, x2) plane are circles.)

Page 47: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 47 Copyright © 2001, 2003, Andrew W. Moore

Subsets of variables

Write X = (X1; X2; …; Xm) as X = (U; V), where

U = (X1; X2; …; Xm(u))        V = (Xm(u)+1; …; Xm)

This will be our standard notation for breaking an m-dimensional distribution into subsets of variables

Page 48: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 48 Copyright © 2001, 2003, Andrew W. Moore

Gaussian Marginals are Gaussian

Write X = (X1; X2; …; Xm) as X = (U; V), where U = (X1; …; Xm(u)) and V = (Xm(u)+1; …; Xm).

Marginalize: (U; V) → U

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U is also distributed as a Gaussian:  U ~ N(μu, Σuu)

Page 49: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 49 Copyright © 2001, 2003, Andrew W. Moore

Gaussian Marginals are Gaussian

Write X = (X1; X2; …; Xm) as X = (U; V), where U = (X1; …; Xm(u)) and V = (Xm(u)+1; …; Xm).

Marginalize: (U; V) → U

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U is also distributed as a Gaussian:  U ~ N(μu, Σuu)

This fact is not immediately obvious.

Obvious, once we know it’s a Gaussian (why?)

Page 50: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 50 Copyright © 2001, 2003, Andrew W. Moore

Gaussian Marginals are Gaussian

Write X = (X1; X2; …; Xm) as X = (U; V), where U = (X1; …; Xm(u)) and V = (Xm(u)+1; …; Xm).

Marginalize: (U; V) → U

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U is also distributed as a Gaussian:  U ~ N(μu, Σuu)

How would you prove this?

(snore...)

p(u) = ∫v p(u, v) dv
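A numerical sanity check of the marginalization fact (a minimal R sketch with made-up parameters): integrating the joint bivariate density over v at a fixed u reproduces the N(μu, Σuu) density.

mu    <- c(1, 2)
Sigma <- matrix(c(4, 1.5, 1.5, 9), nrow = 2)           # joint covariance of (u, v); Σuu = 4
djoint <- function(u, v) {                             # bivariate N(mu, Sigma) density
  d <- rbind(u - mu[1], v - mu[2])
  exp(-0.5 * colSums(d * (solve(Sigma) %*% d))) / (2 * pi * sqrt(det(Sigma)))
}
u0 <- 2.3
integrate(function(v) djoint(u0, v), -Inf, Inf)$value  # marginal density of U at u0
dnorm(u0, mean = mu[1], sd = sqrt(Sigma[1, 1]))        # N(μu, Σuu) density: the same number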

Page 51: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 51 Copyright © 2001, 2003, Andrew W. Moore

Linear Transforms remain Gaussian

X ~ N(μ, Σ)        Multiply by matrix A:  X → AX

Assume X is an m-dimensional Gaussian r.v.

Define Y to be a p-dimensional r.v. by Y = AX, where A is a p × m matrix (note p ≤ m). Then

Y ~ N(Aμ, A Σ Aᵀ)

Page 52: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 52 Copyright © 2001, 2003, Andrew W. Moore

Adding samples of 2 independent Gaussians is Gaussian

if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X and Y are independent,

then X + Y ~ N(μx + μy, Σx + Σy)

Why doesn’t this hold if X and Y are dependent?

Which of the below statements is true?

If X and Y are dependent, then X+Y is Gaussian but possibly with some other covariance

If X and Y are dependent, then X+Y might be non-Gaussian

Page 53: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 53 Copyright © 2001, 2003, Andrew W. Moore

Conditional of Gaussian is Gaussian

Conditionalize: (U; V) → U | V

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U | V ~ N(μu|v, Σu|v), where

μu|v = μu + Σuv Σvv⁻¹ (V − μv)

Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

Page 54: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 54 Copyright © 2001, 2003, Andrew W. Moore

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U | V ~ N(μu|v, Σu|v), where

μu|v = μu + Σuv Σvv⁻¹ (V − μv)

Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

For two scalar variables w and y this reads:

IF  (w; y) ~ N( (μw; μy),  [ σw²   σwy ]
                           [ σwy   σy² ] )

THEN w | y ~ N(μw|y, σ²w|y), where

μw|y = μw + (σwy / σy²) (y − μy)

σ²w|y = σw² − σwy² / σy²

Page 55: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 55 Copyright © 2001, 2003, Andrew W. Moore

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U | V ~ N(μu|v, Σu|v), where

μu|v = μu + Σuv Σvv⁻¹ (V − μv)

Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

For two scalar variables w and y this reads:

IF  (w; y) ~ N( (μw; μy),  [ σw²   σwy ]
                           [ σwy   σy² ] )

THEN w | y ~ N(μw|y, σ²w|y), where

μw|y = μw + (σwy / σy²) (y − μy)

σ²w|y = σw² − σwy² / σy²

P(w|m=82)

P(w|m=76)

P(w)

Page 56: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 56 Copyright © 2001, 2003, Andrew W. Moore

IF  (U; V) ~ N( (μu; μv),  [ Σuu   Σuv ]
                           [ Σuvᵀ  Σvv ] )

THEN U | V ~ N(μu|v, Σu|v), where

μu|v = μu + Σuv Σvv⁻¹ (V − μv)

Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

For two scalar variables w and y this reads:

IF  (w; y) ~ N( (μw; μy),  [ σw²   σwy ]
                           [ σwy   σy² ] )

THEN w | y ~ N(μw|y, σ²w|y), where

μw|y = μw + (σwy / σy²) (y − μy)

σ²w|y = σw² − σwy² / σy²

P(w|m=82)

P(w|m=76)

P(w)

Note: the conditional variance Σu|v is independent of the given value of v.

Note: the conditional variance can only be equal to or smaller than the marginal variance.

Note: the conditional mean is a linear function of v.

Note: when the given value of v is μv, the conditional mean of u is μu.

Page 57: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 57 Copyright © 2001, 2003, Andrew W. Moore

Gaussians and the chain rule

Chain rule: U | V and V → (U; V)

Let A be a constant matrix.

IF  U | V ~ N(AV, Σu|v)  and  V ~ N(μv, Σvv)

THEN  (U; V) ~ N(μ, Σ),  with

μ = (A μv; μv)        Σ = [ Σu|v + A Σvv Aᵀ    A Σvv ]
                          [ Σvv Aᵀ             Σvv   ]

Page 58: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 58 Copyright © 2001, 2003, Andrew W. Moore

Available Gaussian tools

• Marginalize: (U; V) → U
  IF (U; V) ~ N( (μu; μv), [ Σuu  Σuv ; Σuvᵀ  Σvv ] )  THEN U ~ N(μu, Σuu)

• Multiply by a matrix A: X → AX
  IF X ~ N(μ, Σ) and Y = AX  THEN Y ~ N(Aμ, A Σ Aᵀ)

• Add independent Gaussians: X, Y → X + Y
  if X ~ N(μx, Σx) and Y ~ N(μy, Σy) and X and Y are independent, then X + Y ~ N(μx + μy, Σx + Σy)

• Conditionalize: (U; V) → U | V
  IF (U; V) ~ N( (μu; μv), [ Σuu  Σuv ; Σuvᵀ  Σvv ] )  THEN U | V ~ N(μu|v, Σu|v), where
  μu|v = μu + Σuv Σvv⁻¹ (V − μv)   and   Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ

• Chain rule: U | V and V → (U; V)
  IF U | V ~ N(AV, Σu|v) and V ~ N(μv, Σvv)  THEN (U; V) ~ N(μ, Σ), with
  μ = (A μv; μv)   and   Σ = [ Σu|v + A Σvv Aᵀ   A Σvv ; Σvv Aᵀ   Σvv ]

Page 59: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 59 Copyright © 2001, 2003, Andrew W. Moore

Assume…

• You are an intellectual snob

• You have a child

Page 60: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 60 Copyright © 2001, 2003, Andrew W. Moore

Intellectual snobs with children

• …are obsessed with IQ

• In the world as a whole, IQs are drawn from a Gaussian N(100, 15²)

Page 61: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 61 Copyright © 2001, 2003, Andrew W. Moore

IQ tests

• If you take an IQ test you’ll get a score that, on average (over many tests) will be your IQ

• But because of noise on any one test the score will often be a few points lower or higher than your true IQ.

SCORE | IQ ~ N(IQ, 10²)

Page 62: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 62 Copyright © 2001, 2003, Andrew W. Moore

Assume… • You drag your kid off to get tested

• She gets a score of 130

• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

P(X < 130 | μ = 100, σ² = 15²)

= P(X < 2 | μ = 0, σ² = 1)

= Φ(2) ≈ 0.977    (Φ is the standard normal CDF, pnorm in R)

Page 63: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 63 Copyright © 2001, 2003, Andrew W. Moore

Assume… • You drag your kid off to get tested

• She gets a score of 130

• “Yippee” you screech and start deciding how to casually refer to her membership of the top 2% of IQs in your Christmas newsletter.

P(X < 130 | μ = 100, σ² = 15²)

= P(X < 2 | μ = 0, σ² = 1)

= Φ(2) ≈ 0.977    (Φ is the standard normal CDF, pnorm in R)

You are thinking:

Well sure the test isn’t accurate, so she might have an IQ of 120 or she might have an IQ of 140, but the most likely IQ given the evidence “score=130” is, of course, 130.

Can we trust this reasoning?

Page 64: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 64 Copyright © 2001, 2003, Andrew W. Moore

What we really want: • IQ ~ N(100, 15²)

• S | IQ ~ N(IQ, 10²)

• S=130

• Question: What is IQ | (S=130)?

Called the Posterior Distribution of IQ

Page 65: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 65 Copyright © 2001, 2003, Andrew W. Moore

Which tool or tools? • IQ ~ N(100, 15²)

• S | IQ ~ N(IQ, 10²)

• S=130

• Question: What is IQ | (S=130)?

(Toolbox reminder: Chain rule: U | V, V → (U; V);  Conditionalize: (U; V) → U | V;  Add: X, Y → X + Y;  Multiply by a matrix A: X → AX;  Marginalize: (U; V) → U.)

Page 66: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 66 Copyright © 2001, 2003, Andrew W. Moore

Plan • IQ ~ N(100, 15²)

• S | IQ ~ N(IQ, 10²)

• S=130

• Question: What is IQ | (S=130)?

Plan: apply the Chain rule to IQ and S | IQ to get the joint (S; IQ); then swap to (IQ; S) and Conditionalize to get IQ | S.

Page 67: 2013-1 Machine Learning Lecture 02 - Andrew Moore: Entropy

Information Gain: Slide 67 Copyright © 2001, 2003, Andrew W. Moore

Working… IQ ~ N(100, 15²)

S | IQ ~ N(IQ, 10²)

S = 130

Question: What is IQ | (S=130)?

The two tools being applied:

Chain rule:  IF U | V ~ N(AV, Σu|v) and V ~ N(μv, Σvv)
THEN (U; V) ~ N(μ, Σ), with  μ = (A μv; μv)  and  Σ = [ Σu|v + A Σvv Aᵀ   A Σvv ; Σvv Aᵀ   Σvv ]

Conditionalize:  IF (U; V) ~ N( (μu; μv), [ Σuu  Σuv ; Σuvᵀ  Σvv ] )
THEN U | V ~ N(μu|v, Σu|v), where  μu|v = μu + Σuv Σvv⁻¹ (V − μv)  and  Σu|v = Σuu − Σuv Σvv⁻¹ Σuvᵀ
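Carrying the plan through numerically (a minimal R sketch, not on the slides themselves; it just plugs the numbers into the chain rule and conditionalize formulas above, with A = 1):

mu_IQ <- 100; var_IQ <- 15^2        # prior: IQ ~ N(100, 15^2)
var_noise <- 10^2                   # test noise: S | IQ ~ N(IQ, 10^2), so A = 1
# chain rule: the joint of (S, IQ) is Gaussian with
mu_S     <- mu_IQ                   # A * mu_v = 100
var_S    <- var_noise + var_IQ      # Sigma_u|v + A Sigma_vv A' = 325
cov_S_IQ <- var_IQ                  # A Sigma_vv = 225
# conditionalize on S = 130 to get the posterior of IQ
s <- 130
post_mean <- mu_IQ + cov_S_IQ / var_S * (s - mu_S)    # 100 + (225/325) * 30
post_var  <- var_IQ - cov_S_IQ^2 / var_S              # 225 - 225^2 / 325
c(post_mean, sqrt(post_var))                          # ≈ 120.8 and ≈ 8.3

Under the stated model, the most likely IQ given a score of 130 is about 121, not 130, with a posterior standard deviation of about 8.3.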