CS 461: Machine Learning, Lecture 6
2/9/08, Winter 2008
Dr. Kiri [email protected]
Plan for Today
- Note: Room change for 2/16: E&T A129
- Solution to Midterm
- Solution to Homework 3
- Parametric methods
  - Data comes from a distribution
  - Bernoulli, Gaussian, and their parameters
  - How good is a parameter estimate? (bias, variance)
- Bayes estimation
  - ML: use the data
  - MAP: use the prior and the data
  - Bayes estimator: integrated estimate (weighted)
- Parametric classification
  - Maximize the posterior probability
Review from Lecture 5
- Probability Axioms
- Bayesian Learning
  - Classification
  - Bayes's Rule
  - Bayesian Networks
  - Naïve Bayes Classifier
  - Association Rules
Parametric Methods
Chapter 4
Parametric Learning
Assume: data x comes from a distribution p(x)
Model this distribution by selecting parameters θ, e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Model
(Last time: Max Likelihood Classification)

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
[Alpaydin 2004 The MIT Press]
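As a minimal numerical sketch of this definition (not from the slides): hypothetical data assumed to come from an exponential distribution, with a grid search standing in for the analytic maximization.

```python
import numpy as np

# Hypothetical sample, assumed to come from an exponential distribution
# with unknown rate theta: p(x|theta) = theta * exp(-theta * x)
X = np.array([0.8, 1.3, 0.4, 2.1, 0.9])

def log_likelihood(theta, X):
    # L(theta|X) = sum_t log p(x^t|theta)
    return np.sum(np.log(theta) - theta * X)

# theta* = argmax_theta L(theta|X), approximated by a grid search
thetas = np.linspace(0.01, 5.0, 1000)
theta_star = thetas[np.argmax([log_likelihood(th, X) for th in thetas])]
print(theta_star)       # close to the analytic answer below
print(1.0 / X.mean())   # for the exponential, the MLE is 1/mean
```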
Example: do you wear glasses?
Bernoulli: two states, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)

Log likelihood:
L(p₀|X) = log ∏t p₀^(x^t) (1 − p₀)^(1−x^t)

MLE: p̂₀ = ∑t x^t / N
[Alpaydin 2004 The MIT Press]
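A quick sketch of this estimator on hypothetical glasses data (1 = wears glasses), not from the slides:

```python
import numpy as np

# Hypothetical responses: 1 = wears glasses, 0 = does not
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# MLE: p0_hat = sum_t x^t / N, i.e., simply the fraction of 1s
p0_hat = x.mean()
print(p0_hat)  # 0.625
```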
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑t x^t / N
s² = ∑t (x^t − m)² / N

[Alpaydin 2004 The MIT Press]
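Both estimators in code, on hypothetical data. Note that the MLE for the variance divides by N, not N−1, which makes it biased (the next slide's topic):

```python
import numpy as np

# Hypothetical 1-D sample
x = np.array([2.3, 1.9, 3.1, 2.7, 2.0])

m = x.mean()                 # m = sum_t x^t / N
s2 = ((x - m) ** 2).mean()   # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)
```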
How good is that estimate?
Let d be the estimate of θ.
- d is also a random variable; technically it is d(X), since it depends on the sample X.
- E[d] is the expected value of d (averaged over samples, so it no longer depends on any particular X).

Bias: bθ(d) = E[d] − θ. How far off the correct value is it?

Variance: Var(d) = E[(d − E[d])²]. How much does it change with different X?

Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
[Alpaydin 2004 The MIT Press]
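A simulation sketch of these three quantities for the variance estimator s² above. The true parameters are assumed for illustration, and averaging over many repeated samples X stands in for the expectation E[·]:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0     # true variance (sigma = 2); an assumed ground truth
N = 10
trials = 100_000

# d(X) = s^2, the MLE of the variance, computed on many independent samples X
d = np.empty(trials)
for i in range(trials):
    x = rng.normal(0.0, 2.0, N)
    d[i] = ((x - x.mean()) ** 2).mean()

bias = d.mean() - theta                  # b_theta(d) = E[d] - theta (about -theta/N here)
var = ((d - d.mean()) ** 2).mean()       # Var(d) = E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()          # r(d, theta) = E[(d - theta)^2]
print(bias, var, mse, bias**2 + var)     # MSE matches Bias^2 + Variance
```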
Bayes Estimator: Using what we already know
- Prior information: p(θ)
- Bayes's rule (get the posterior): p(θ|X) = p(X|θ) p(θ) / p(X)
- Maximum a Posteriori (MAP): θMAP = argmax_θ p(θ|X)
- Maximum Likelihood (ML): θML = argmax_θ p(X|θ)
- Bayes estimator: θBayes = E[θ|X] = ∫ θ p(θ|X) dθ. In a sense, it is the "weighted mean" for θ.
- For our purposes, θMAP = θBayes
[Alpaydin 2004 The MIT Press]
Bayes Estimator: Continuous Example

Assume x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)

θML = m (sample mean)

θMAP = θBayes = E[θ|X] = [(N/σ₀²) / (N/σ₀² + 1/σ²)] m + [(1/σ²) / (N/σ₀² + 1/σ²)] μ

Estimated mean = weighted average of the sample mean m and the prior mean μ. The weights indicate how much you trust the sample vs. the prior.
[Alpaydin 2004 The MIT Press]
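A sketch of this weighted average on hypothetical numbers (the prior and noise variances are assumptions, not from the slides):

```python
import numpy as np

mu, sigma2 = 0.0, 1.0     # assumed prior: theta ~ N(mu, sigma2)
sigma0_2 = 4.0            # assumed noise: x^t ~ N(theta, sigma0_2)

x = np.array([2.9, 3.4, 2.5, 3.1])   # hypothetical observations
N, m = len(x), x.mean()              # theta_ML = m

# Weight on the sample mean: (N/sigma0^2) / (N/sigma0^2 + 1/sigma^2)
w = (N / sigma0_2) / (N / sigma0_2 + 1 / sigma2)
theta_bayes = w * m + (1 - w) * mu   # E[theta|X]
print(m, theta_bayes)  # the Bayes estimate is pulled from m toward mu
```

As N grows, the weight w approaches 1 and the estimate trusts the sample more than the prior.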
Example: Coin flipping

[Figure series, Copyright Terran Lane: the estimated distribution over the coin's bias after 0, 1, 5, 10, 20, 50, and 100 total flips. With 0 flips the bias is entirely unknown ("?"); the distribution sharpens as flips accumulate.]
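The figures themselves are not reproduced, but the computation behind them can be sketched. Assuming a Beta prior on the coin's bias (the slides do not specify the prior, so Beta(1,1) is a hypothetical choice), the posterior after each batch of flips is again a Beta distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
true_bias = 0.7     # assumed ground truth for the simulation
a, b = 1.0, 1.0     # Beta(1,1) prior: every bias equally plausible ("?")

flips_so_far = 0
for total in [0, 1, 5, 10, 20, 50, 100]:
    new = rng.random(total - flips_so_far) < true_bias  # simulate the new flips
    a += new.sum()              # one count per head
    b += len(new) - new.sum()   # one count per tail
    flips_so_far = total
    post_mean = a / (a + b)     # posterior mean of the bias
    post_map = (a - 1) / (a + b - 2) if a + b > 2 else float("nan")  # MAP estimate
    print(total, round(post_mean, 3), round(post_map, 3))
```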
Parametric Classification
Remember Naïve Bayes?

Maximum likelihood estimator (MLE):
C_predict = argmax_c P(X₁ = u₁, …, X_m = u_m | C = c)

Maximum a-posteriori (MAP) classifier:
C_predict = argmax_c P(C = c | X₁ = u₁, …, X_m = u_m)
[Copyright Andrew Moore]
Parametric Classification
Discriminant (take the max over i):
g_i(x) = p(x|C_i) P(C_i)
or equivalently
g_i(x) = log p(x|C_i) + log P(C_i)

If we assume p(x|C_i) are Gaussian:
p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

[Alpaydin 2004 The MIT Press]
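A sketch of this Gaussian discriminant with two hypothetical classes (all parameters invented for illustration):

```python
import numpy as np

def g(x, mu_i, sigma_i, prior_i):
    # g_i(x) = -1/2 log 2pi - log sigma_i - (x - mu_i)^2 / (2 sigma_i^2) + log P(C_i)
    return (-0.5 * np.log(2 * np.pi) - np.log(sigma_i)
            - (x - mu_i) ** 2 / (2 * sigma_i ** 2) + np.log(prior_i))

# Two hypothetical classes; pick the one with the largest discriminant
x = 1.2
scores = [g(x, 0.0, 1.0, 0.5), g(x, 3.0, 1.0, 0.5)]
print(int(np.argmax(scores)))  # 0: x = 1.2 is closer to mu = 0, and priors are equal
```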
Given the sample X = {x^t, r^t}_{t=1}^N, with 1-D data x ∈ ℝ and class indicators
r_i^t = 1 if x^t ∈ C_i, and 0 if x^t ∈ C_j, j ≠ i

ML estimates of the priors, means, and variances are:
P̂(C_i) = ∑t r_i^t / N (observed frequency)
m_i = ∑t x^t r_i^t / ∑t r_i^t (observed mean)
s_i² = ∑t (x^t − m_i)² r_i^t / ∑t r_i^t (observed variance)

The discriminant becomes:
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)

[Alpaydin 2004 The MIT Press]
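These estimates and the resulting classifier in a short sketch, using hypothetical labeled 1-D data:

```python
import numpy as np

# Hypothetical labeled 1-D training data (classes 0 and 1)
x = np.array([1.0, 1.2, 0.8, 3.9, 4.1, 4.3, 3.7])
y = np.array([0,   0,   0,   1,   1,   1,   1])

priors, means, variances = [], [], []
for i in [0, 1]:
    r = (y == i)                    # class indicators r_i^t
    priors.append(r.mean())         # P_hat(C_i) = sum_t r_i^t / N
    means.append(x[r].mean())       # m_i: observed mean
    variances.append(((x[r] - x[r].mean()) ** 2).mean())  # s_i^2: observed variance

def g(z, i):
    # g_i(x) = -1/2 log 2pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(variances[i])
            - (z - means[i]) ** 2 / (2 * variances[i]) + np.log(priors[i]))

print(max([0, 1], key=lambda i: g(2.0, i)))  # classify a new point x = 2.0
```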
Example: 2 classes (same variance)
- Likelihood = Gaussian with its own mean and std dev
- Posterior = discriminant g(x) normalized by P(x)
- Assume both classes have the same prior

[Alpaydin 2004 The MIT Press]
Example: 2 classes (diff variance)
- Likelihood = Gaussian with its own mean and std dev
- Posterior = discriminant g(x) normalized by P(x)
- Assume both classes have the same prior

[Alpaydin 2004 The MIT Press]
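A sketch of the normalization described above, for two hypothetical Gaussian classes with different variances and equal priors:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-5.0, 10.0, 7)
lik1 = gaussian(x, 0.0, 1.0)   # p(x|C1): hypothetical mean and std dev
lik2 = gaussian(x, 4.0, 2.0)   # p(x|C2): a different variance
prior = 0.5                    # both classes have the same prior

evidence = prior * lik1 + prior * lik2   # P(x)
post1 = prior * lik1 / evidence          # P(C1|x) = likelihood * prior / P(x)
print(np.round(post1, 3))  # near 1 where C1's likelihood dominates, near 0 elsewhere
```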
Example: Predicting student’s major
From HW 3: use the "glasses" feature (yes or no), modeled as a discrete distribution.

Likelihood (of data): P(glasses | major)
Example: Predicting student’s major
Posterior (of class):
- Priors: P(CS) = 0.4, P(Physics) = 0.3, P(EE) = 0.3
- Probability of data: P(yes) = 0.6, P(no) = 0.4
- Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
- The MAP class is computed for both observations: glasses = yes and glasses = no.
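A sketch of this computation. The priors and P(glasses) come from the slide; the HW 3 likelihood table is not reproduced here, so the P(glasses|major) values below are hypothetical placeholders (chosen only to be consistent with P(yes) = 0.6):

```python
priors = {"CS": 0.4, "Physics": 0.3, "EE": 0.3}   # P(major), from the slide
p_yes = {"CS": 0.75, "Physics": 0.6, "EE": 0.4}   # P(yes|major): hypothetical values,
                                                  # chosen so that P(yes) = 0.6 holds
p_glasses = {"yes": 0.6, "no": 0.4}               # P(glasses), from the slide

for obs in ["yes", "no"]:
    # Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
    lik = {m: (p if obs == "yes" else 1 - p) for m, p in p_yes.items()}
    post = {m: lik[m] * priors[m] / p_glasses[obs] for m in priors}
    print(obs, post, "MAP:", max(post, key=post.get))
```

With these placeholder likelihoods, the MAP major differs for the two observations, which is the point of the slide's yes/no comparison.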
Summary: Key Points for Today
- Parametric methods
  - Data comes from a distribution
  - Bernoulli, Gaussian, and their parameters
  - How good is a parameter estimate? (bias, variance)
- Bayes estimation
  - ML: use the data
  - MAP: use the prior and the data
  - Bayes estimator: integrated estimate (weighted)
- Parametric classification
  - Maximize the posterior probability
Next Time
Clustering! (read Ch. 7.1-7.4; 7.8 optional)
No reading questions