CS 461: Machine Learning, Lecture 6
2/9/08, Winter 2008
Dr. Kiri [email protected]
Plan for Today
- Note: Room change for 2/16: E&T A129
- Solution to Midterm
- Solution to Homework 3
- Parametric methods
  - Data comes from a distribution
  - Bernoulli, Gaussian, and their parameters
  - How good is a parameter estimate? (bias, variance)
- Bayes estimation
  - ML: use the data
  - MAP: use the prior and the data
  - Bayes estimator: integrated estimate (weighted)
- Parametric classification
  - Maximize the posterior probability
Review from Lecture 5
- Probability Axioms
- Bayesian Learning
  - Classification
  - Bayes's Rule
  - Bayesian Networks
  - Naïve Bayes Classifier
  - Association Rules
Parametric Methods
Chapter 4
Parametric Learning
Assume: data x comes from a distribution p(x)
Model this distribution by selecting parameters θ, e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Model
(Last time: Max Likelihood Classification)

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
[Alpaydin 2004 The MIT Press]
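As a minimal numerical sketch of this definition (not from the slides): hypothetical data assumed to come from an exponential distribution, with a grid search standing in for the analytic maximization.

```python
import numpy as np

# Hypothetical sample, assumed to come from an exponential distribution
# with unknown rate theta: p(x|theta) = theta * exp(-theta * x)
X = np.array([0.8, 1.3, 0.4, 2.1, 0.9])

def log_likelihood(theta, X):
    # L(theta|X) = sum_t log p(x^t|theta)
    return np.sum(np.log(theta) - theta * X)

# theta* = argmax_theta L(theta|X), approximated by a grid search
thetas = np.linspace(0.01, 5.0, 1000)
theta_star = thetas[np.argmax([log_likelihood(th, X) for th in thetas])]
print(theta_star)       # close to the analytic answer below
print(1.0 / X.mean())   # for the exponential, the MLE is 1/mean
```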
Example: do you wear glasses?
Bernoulli: two states, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)

Log likelihood:
L(p₀|X) = log ∏t p₀^(x^t) (1 − p₀)^(1−x^t)

MLE: p̂₀ = ∑t x^t / N
[Alpaydin 2004 The MIT Press]
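A quick sketch of this estimator on hypothetical glasses data (1 = wears glasses), not from the slides:

```python
import numpy as np

# Hypothetical responses: 1 = wears glasses, 0 = does not
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])

# MLE: p0_hat = sum_t x^t / N, i.e., simply the fraction of 1s
p0_hat = x.mean()
print(p0_hat)  # 0.625
```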
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑t x^t / N
s² = ∑t (x^t − m)² / N

[Alpaydin 2004 The MIT Press]
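Both estimators in code, on hypothetical data. Note that the MLE for the variance divides by N, not N−1, which makes it biased (the next slide's topic):

```python
import numpy as np

# Hypothetical 1-D sample
x = np.array([2.3, 1.9, 3.1, 2.7, 2.0])

m = x.mean()                 # m = sum_t x^t / N
s2 = ((x - m) ** 2).mean()   # s^2 = sum_t (x^t - m)^2 / N
print(m, s2)
```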
How good is that estimate?
Let d be the estimate of θ.
- d is also a random variable; technically it is d(X), since it depends on the sample X.
- E[d] is the expected value of d (averaged over samples, so it no longer depends on any particular X).

Bias: bθ(d) = E[d] − θ. How far off the correct value is it?

Variance: Var(d) = E[(d − E[d])²]. How much does it change with different X?

Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
[Alpaydin 2004 The MIT Press]
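A simulation sketch of these three quantities for the variance estimator s² above. The true parameters are assumed for illustration, and averaging over many repeated samples X stands in for the expectation E[·]:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0     # true variance (sigma = 2); an assumed ground truth
N = 10
trials = 100_000

# d(X) = s^2, the MLE of the variance, computed on many independent samples X
d = np.empty(trials)
for i in range(trials):
    x = rng.normal(0.0, 2.0, N)
    d[i] = ((x - x.mean()) ** 2).mean()

bias = d.mean() - theta                  # b_theta(d) = E[d] - theta (about -theta/N here)
var = ((d - d.mean()) ** 2).mean()       # Var(d) = E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()          # r(d, theta) = E[(d - theta)^2]
print(bias, var, mse, bias**2 + var)     # MSE matches Bias^2 + Variance
```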
Bayes Estimator: Using what we already know
- Prior information: p(θ)
- Bayes's rule (get the posterior): p(θ|X) = p(X|θ) p(θ) / p(X)
- Maximum a Posteriori (MAP): θMAP = argmax_θ p(θ|X)
- Maximum Likelihood (ML): θML = argmax_θ p(X|θ)
- Bayes estimator: θBayes = E[θ|X] = ∫ θ p(θ|X) dθ. In a sense, it is the "weighted mean" for θ.
- For our purposes, θMAP = θBayes
[Alpaydin 2004 The MIT Press]
Bayes Estimator: Continuous Example

Assume x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)

θML = m (sample mean)

θMAP = θBayes = E[θ|X] = [(N/σ₀²) / (N/σ₀² + 1/σ²)] m + [(1/σ²) / (N/σ₀² + 1/σ²)] μ

Estimated mean = weighted average of the sample mean m and the prior mean μ. The weights indicate how much you trust the sample vs. the prior.
[Alpaydin 2004 The MIT Press]
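A sketch of this weighted average on hypothetical numbers (the prior and noise variances are assumptions, not from the slides):

```python
import numpy as np

mu, sigma2 = 0.0, 1.0     # assumed prior: theta ~ N(mu, sigma2)
sigma0_2 = 4.0            # assumed noise: x^t ~ N(theta, sigma0_2)

x = np.array([2.9, 3.4, 2.5, 3.1])   # hypothetical observations
N, m = len(x), x.mean()              # theta_ML = m

# Weight on the sample mean: (N/sigma0^2) / (N/sigma0^2 + 1/sigma^2)
w = (N / sigma0_2) / (N / sigma0_2 + 1 / sigma2)
theta_bayes = w * m + (1 - w) * mu   # E[theta|X]
print(m, theta_bayes)  # the Bayes estimate is pulled from m toward mu
```

As N grows, the weight w approaches 1 and the estimate trusts the sample more than the prior.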
Example: Coin flipping

[Figure series, Copyright Terran Lane: the estimated distribution over the coin's bias after 0, 1, 5, 10, 20, 50, and 100 total flips. With 0 flips the bias is entirely unknown ("?"); the distribution sharpens as flips accumulate.]
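The figures themselves are not reproduced, but the computation behind them can be sketched. Assuming a Beta prior on the coin's bias (the slides do not specify the prior, so Beta(1,1) is a hypothetical choice), the posterior after each batch of flips is again a Beta distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
true_bias = 0.7     # assumed ground truth for the simulation
a, b = 1.0, 1.0     # Beta(1,1) prior: every bias equally plausible ("?")

flips_so_far = 0
for total in [0, 1, 5, 10, 20, 50, 100]:
    new = rng.random(total - flips_so_far) < true_bias  # simulate the new flips
    a += new.sum()              # one count per head
    b += len(new) - new.sum()   # one count per tail
    flips_so_far = total
    post_mean = a / (a + b)     # posterior mean of the bias
    post_map = (a - 1) / (a + b - 2) if a + b > 2 else float("nan")  # MAP estimate
    print(total, round(post_mean, 3), round(post_map, 3))
```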
Parametric Classification
Remember Naïve Bayes?

Maximum likelihood estimator (MLE):
C_predict = argmax_c P(X₁ = u₁, …, X_m = u_m | C = c)

Maximum a-posteriori (MAP) classifier:
C_predict = argmax_c P(C = c | X₁ = u₁, …, X_m = u_m)
[Copyright Andrew Moore]
Parametric Classification
Discriminant (take the max over i):
g_i(x) = p(x|C_i) P(C_i)
or equivalently
g_i(x) = log p(x|C_i) + log P(C_i)

If we assume p(x|C_i) are Gaussian:
p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

[Alpaydin 2004 The MIT Press]
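A sketch of this Gaussian discriminant with two hypothetical classes (all parameters invented for illustration):

```python
import numpy as np

def g(x, mu_i, sigma_i, prior_i):
    # g_i(x) = -1/2 log 2pi - log sigma_i - (x - mu_i)^2 / (2 sigma_i^2) + log P(C_i)
    return (-0.5 * np.log(2 * np.pi) - np.log(sigma_i)
            - (x - mu_i) ** 2 / (2 * sigma_i ** 2) + np.log(prior_i))

# Two hypothetical classes; pick the one with the largest discriminant
x = 1.2
scores = [g(x, 0.0, 1.0, 0.5), g(x, 3.0, 1.0, 0.5)]
print(int(np.argmax(scores)))  # 0: x = 1.2 is closer to mu = 0, and priors are equal
```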
Given the sample X = {x^t, r^t}_{t=1}^N, with 1-D data x ∈ ℝ and class indicators
r_i^t = 1 if x^t ∈ C_i, and 0 if x^t ∈ C_j, j ≠ i

ML estimates of the priors, means, and variances are:
P̂(C_i) = ∑t r_i^t / N (observed frequency)
m_i = ∑t x^t r_i^t / ∑t r_i^t (observed mean)
s_i² = ∑t (x^t − m_i)² r_i^t / ∑t r_i^t (observed variance)

The discriminant becomes:
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)

[Alpaydin 2004 The MIT Press]
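These estimates and the resulting classifier in a short sketch, using hypothetical labeled 1-D data:

```python
import numpy as np

# Hypothetical labeled 1-D training data (classes 0 and 1)
x = np.array([1.0, 1.2, 0.8, 3.9, 4.1, 4.3, 3.7])
y = np.array([0,   0,   0,   1,   1,   1,   1])

priors, means, variances = [], [], []
for i in [0, 1]:
    r = (y == i)                    # class indicators r_i^t
    priors.append(r.mean())         # P_hat(C_i) = sum_t r_i^t / N
    means.append(x[r].mean())       # m_i: observed mean
    variances.append(((x[r] - x[r].mean()) ** 2).mean())  # s_i^2: observed variance

def g(z, i):
    # g_i(x) = -1/2 log 2pi - log s_i - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(variances[i])
            - (z - means[i]) ** 2 / (2 * variances[i]) + np.log(priors[i]))

print(max([0, 1], key=lambda i: g(2.0, i)))  # classify a new point x = 2.0
```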
Example: 2 classes (same variance)
- Likelihood = Gaussian with its own mean and std dev
- Posterior = discriminant g(x) normalized by P(x)
- Assume both classes have the same prior

[Alpaydin 2004 The MIT Press]
Example: 2 classes (diff variance)
- Likelihood = Gaussian with its own mean and std dev
- Posterior = discriminant g(x) normalized by P(x)
- Assume both classes have the same prior

[Alpaydin 2004 The MIT Press]
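A sketch of the normalization described above, for two hypothetical Gaussian classes with different variances and equal priors:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-5.0, 10.0, 7)
lik1 = gaussian(x, 0.0, 1.0)   # p(x|C1): hypothetical mean and std dev
lik2 = gaussian(x, 4.0, 2.0)   # p(x|C2): a different variance
prior = 0.5                    # both classes have the same prior

evidence = prior * lik1 + prior * lik2   # P(x)
post1 = prior * lik1 / evidence          # P(C1|x) = likelihood * prior / P(x)
print(np.round(post1, 3))  # near 1 where C1's likelihood dominates, near 0 elsewhere
```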
Example: Predicting student’s major
From HW 3: use the "glasses" feature (yes or no), modeled as a discrete distribution.

Likelihood (of data): P(glasses | major)
Example: Predicting student’s major
Posterior (of class):
- Priors: P(CS) = 0.4, P(Physics) = 0.3, P(EE) = 0.3
- Probability of data: P(yes) = 0.6, P(no) = 0.4
- Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
- The MAP class is computed for both observations: glasses = yes and glasses = no.
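A sketch of this computation. The priors and P(glasses) come from the slide; the HW 3 likelihood table is not reproduced here, so the P(glasses|major) values below are hypothetical placeholders (chosen only to be consistent with P(yes) = 0.6):

```python
priors = {"CS": 0.4, "Physics": 0.3, "EE": 0.3}   # P(major), from the slide
p_yes = {"CS": 0.75, "Physics": 0.6, "EE": 0.4}   # P(yes|major): hypothetical values,
                                                  # chosen so that P(yes) = 0.6 holds
p_glasses = {"yes": 0.6, "no": 0.4}               # P(glasses), from the slide

for obs in ["yes", "no"]:
    # Bayes: P(major|glasses) = P(glasses|major) P(major) / P(glasses)
    lik = {m: (p if obs == "yes" else 1 - p) for m, p in p_yes.items()}
    post = {m: lik[m] * priors[m] / p_glasses[obs] for m in priors}
    print(obs, post, "MAP:", max(post, key=post.get))
```

With these placeholder likelihoods, the MAP major differs for the two observations, which is the point of the slide's yes/no comparison.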
Summary: Key Points for Today
- Parametric methods
  - Data comes from a distribution
  - Bernoulli, Gaussian, and their parameters
  - How good is a parameter estimate? (bias, variance)
- Bayes estimation
  - ML: use the data
  - MAP: use the prior and the data
  - Bayes estimator: integrated estimate (weighted)
- Parametric classification
  - Maximize the posterior probability
Next Time
Clustering! (read Ch. 7.1-7.4; 7.8 optional)
No reading questions