Name: 李政軒
Introduction to Machine Learning: Parametric Methods
Parametric Estimation
Sample: $\mathcal{X} = \{x^t\}_{t=1}^N$ where $x^t \sim p(x)$
Parametric estimation: assume a form for $p(x|\theta)$ and estimate $\theta$, its sufficient statistics, using $\mathcal{X}$.
Example: $p(x) = \mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$
Maximum Likelihood Estimation
Likelihood of $\theta$ given the sample $\mathcal{X}$: $l(\theta|\mathcal{X}) = p(\mathcal{X}|\theta) = \prod_t p(x^t|\theta)$
Log likelihood: $\mathcal{L}(\theta|\mathcal{X}) = \log l(\theta|\mathcal{X}) = \sum_t \log p(x^t|\theta)$
Maximum likelihood estimator: $\theta^* = \arg\max_\theta \mathcal{L}(\theta|\mathcal{X})$
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, $x \in \{0, 1\}$:
$P(x) = p_0^x (1 - p_0)^{1-x}$
$\mathcal{L}(p_0|\mathcal{X}) = \log \prod_t p_0^{x^t} (1 - p_0)^{1-x^t}$
MLE: $\hat{p}_0 = \sum_t x^t / N$
Multinomial: $K > 2$ states, $x_i \in \{0, 1\}$:
$P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$
$\mathcal{L}(p_1, p_2, \ldots, p_K|\mathcal{X}) = \log \prod_t \prod_i p_i^{x_i^t}$
MLE: $\hat{p}_i = \sum_t x_i^t / N$
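Both estimators are just normalized counts. A minimal numpy sketch under invented sample sizes and true parameters (nothing here comes from the slides beyond the two MLE formulas):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: N = 100 draws with an assumed true p0 = 0.7.
x = rng.binomial(1, 0.7, size=100)
p0_hat = x.mean()                  # MLE: sum_t x^t / N

# Multinomial: N one-hot draws over K = 3 states with assumed probabilities.
counts = rng.multinomial(1, [0.2, 0.5, 0.3], size=100)  # shape (100, 3)
p_hat = counts.sum(axis=0) / 100   # MLE: sum_t x_i^t / N per state i

print(p0_hat, p_hat)
```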
Gaussian (Normal) Distribution
$p(x) = \mathcal{N}(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
MLE for $\mu$ and $\sigma^2$:
$m = \frac{\sum_t x^t}{N}$
$s^2 = \frac{\sum_t (x^t - m)^2}{N}$
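As a sketch, the two estimates in numpy (the sample below is synthetic; note the MLE of the variance divides by $N$, not $N-1$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic sample from N(2, 1.5^2)

m = x.sum() / len(x)                  # MLE of the mean
s2 = ((x - m) ** 2).sum() / len(x)    # MLE of the variance (divides by N, not N-1)
print(m, s2)                          # close to 2.0 and 2.25
```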
Bias and Variance
Unknown parameter $\theta$; estimator $d_i = d(\mathcal{X}_i)$ on sample $\mathcal{X}_i$.
Bias: $b_\theta(d) = E[d] - \theta$. Variance: $E[(d - E[d])^2]$.
Mean square error:
$r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$
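For instance, the Gaussian variance MLE $s^2$ above is a biased estimator: $E[s^2] = \frac{N-1}{N}\sigma^2$. A small simulation (sample size and trial count are arbitrary choices) makes both the bias and the variance of the estimator visible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 100_000   # arbitrary choices; true sigma^2 is 1

# Many samples of size N from N(0, 1); compute the variance MLE on each.
X = rng.normal(size=(trials, N))
s2 = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

print(s2.mean())   # about (N-1)/N = 0.9, so the bias is about -0.1
print(s2.var())    # the estimator's own variance across samples
```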
Bayes’ Estimator
Treat $\theta$ as a random variable with prior $p(\theta)$.
Bayes' rule: $p(\theta|\mathcal{X}) = p(\mathcal{X}|\theta)\,p(\theta)\,/\,p(\mathcal{X})$
Full Bayes: $p(x|\mathcal{X}) = \int p(x|\theta)\,p(\theta|\mathcal{X})\,d\theta$
Maximum a posteriori (MAP): $\theta_{MAP} = \arg\max_\theta p(\theta|\mathcal{X})$
Maximum likelihood (ML): $\theta_{ML} = \arg\max_\theta p(\mathcal{X}|\theta)$
Bayes' estimator: $\theta_{Bayes} = E[\theta|\mathcal{X}] = \int \theta\, p(\theta|\mathcal{X})\,d\theta$
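As one concrete case (not from the slides): with a Gaussian likelihood $x^t \sim \mathcal{N}(\theta, \sigma_0^2)$ and a conjugate Gaussian prior $\theta \sim \mathcal{N}(\mu, \sigma^2)$, the posterior is also Gaussian, so the MAP and Bayes' estimates coincide and shrink the sample mean toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: x^t ~ N(theta, sigma0^2), prior theta ~ N(mu, sigma^2).
sigma0, mu, sigma = 1.0, 0.0, 0.5
x = rng.normal(2.0, sigma0, size=20)   # synthetic data with true theta = 2
N, m = len(x), x.mean()

theta_ml = m                                         # ML ignores the prior
w = (N / sigma0**2) / (N / sigma0**2 + 1 / sigma**2)
theta_bayes = w * m + (1 - w) * mu                   # posterior mean = MAP here

print(theta_ml, theta_bayes)   # the Bayes' estimate is pulled toward mu = 0
```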
Parametric Classification
iii
iii
CPCxpxg
CPCxpxg
log| log or
|
ii
iii
i
i
ii
CPx
xg
xCxp
log log log
exp|
2
2
2
2
22
2
1
22
1
Given the sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^N$, where $r_i^t = 1$ if $x^t \in C_i$ and $r_i^t = 0$ if $x^t \in C_j,\ j \neq i$.
ML estimates are:
$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$, $m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}$, $s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$
The discriminant becomes:
$g_i(x) = -\log \sqrt{2\pi}\,s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
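A sketch of this classifier in numpy, on an invented two-class 1D dataset (the function and variable names are mine, not from the slides):

```python
import numpy as np

def fit(x, r):
    """ML estimates per class; r holds integer labels 0..K-1 for 1D inputs x."""
    P, m, s2 = [], [], []
    for i in range(r.max() + 1):
        xi = x[r == i]
        P.append(len(xi) / len(x))                 # P̂(C_i)
        m.append(xi.mean())                        # m_i
        s2.append(((xi - xi.mean()) ** 2).mean())  # s_i^2
    return np.array(P), np.array(m), np.array(s2)

def g(x0, P, m, s2):
    """Discriminant values g_i(x0); predict the class with the largest one."""
    return -np.log(np.sqrt(2 * np.pi * s2)) - (x0 - m) ** 2 / (2 * s2) + np.log(P)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1.5, 50)])
r = np.concatenate([np.zeros(50, int), np.ones(50, int)])
P, m, s2 = fit(x, r)
print(np.argmax(g(0.5, P, m, s2)))   # predicted class at x = 0.5
```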
Parametric Classification
Figure: (a) and (b) for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.
Figure: (a) and (b) for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. In (c), the expected risks are shown for the two classes and for reject with $\lambda = 0.2$.
Regression
$r = f(x) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, so $p(r|x) \sim \mathcal{N}(g(x|\theta), \sigma^2)$
Estimator: $g(x|\theta)$
Log likelihood:
$\mathcal{L}(\theta|\mathcal{X}) = \log \prod_{t=1}^N p(x^t, r^t) = \log \prod_{t=1}^N p(r^t|x^t) + \log \prod_{t=1}^N p(x^t)$
Regression: From LogL to Error
$\mathcal{L}(\theta|\mathcal{X}) = \log \prod_{t=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp\left[-\frac{[r^t - g(x^t|\theta)]^2}{2\sigma^2}\right]$
$= -N \log(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2} \sum_{t=1}^N [r^t - g(x^t|\theta)]^2$
Maximizing the log likelihood is thus equivalent to minimizing the squared error:
$E(\theta|\mathcal{X}) = \frac{1}{2} \sum_{t=1}^N [r^t - g(x^t|\theta)]^2$
Linear Regression
$g(x^t|w_1, w_0) = w_1 x^t + w_0$
Setting the derivatives of the error to zero gives the normal equations:
$\sum_t r^t = N w_0 + w_1 \sum_t x^t$
$\sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2$
In matrix form, $A\mathbf{w} = \mathbf{y}$:
$A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}$, $\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}$, so $\mathbf{w} = A^{-1}\mathbf{y}$
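A direct transcription of the normal equations into numpy, on synthetic data (the true line is an assumption for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 30)
r = 1.5 * x + 0.8 + rng.normal(0, 0.3, 30)   # assumed true line plus noise

A = np.array([[len(x),        x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)   # solve A w = y rather than inverting A
print(w0, w1)                    # should come out near 0.8 and 1.5
```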
Polynomial Regression
$g(x^t|w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$
$D = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix}$, $\mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}$
$\mathbf{w} = (D^T D)^{-1} D^T \mathbf{r}$
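A sketch of the same computation; `np.linalg.lstsq` solves the least-squares problem without forming $(D^T D)^{-1}$ explicitly, which is numerically safer:

```python
import numpy as np

def poly_fit(x, r, k):
    """Least-squares polynomial fit: w minimizing ||D w - r||^2."""
    D = np.vander(x, k + 1, increasing=True)   # columns: 1, x, x^2, ..., x^k
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w                                   # [w0, w1, ..., wk]

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 40)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 40)   # synthetic data
print(poly_fit(x, r, k=3))
```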
Other Error Measures
Square error: $E(\theta|\mathcal{X}) = \frac{1}{2} \sum_{t=1}^N [r^t - g(x^t|\theta)]^2$
Relative square error: $E(\theta|\mathcal{X}) = \frac{\sum_{t=1}^N [r^t - g(x^t|\theta)]^2}{\sum_{t=1}^N (r^t - \bar{r})^2}$
Absolute error: $E(\theta|\mathcal{X}) = \sum_t |r^t - g(x^t|\theta)|$
$\epsilon$-sensitive error: $E(\theta|\mathcal{X}) = \sum_t \mathbb{1}\left(|r^t - g(x^t|\theta)| > \epsilon\right) \left(|r^t - g(x^t|\theta)| - \epsilon\right)$
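A small helper computing all four measures on a vector of targets and predictions (epsilon is a free parameter; the value below is arbitrary):

```python
import numpy as np

def errors(r, g, eps=0.1):
    """The four error measures above, for predictions g against targets r."""
    d = r - g
    square   = 0.5 * (d ** 2).sum()
    relative = (d ** 2).sum() / ((r - r.mean()) ** 2).sum()
    absolute = np.abs(d).sum()
    eps_sens = np.where(np.abs(d) > eps, np.abs(d) - eps, 0.0).sum()
    return square, relative, absolute, eps_sens
```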
Bias and Variance
$E[(r - g(x))^2 \mid x] = \underbrace{E[(r - E[r|x])^2 \mid x]}_{\text{noise}} + \underbrace{(E[r|x] - g(x))^2}_{\text{squared error}}$
$E_{\mathcal{X}}[(E[r|x] - g(x))^2 \mid x] = \underbrace{(E[r|x] - E_{\mathcal{X}}[g(x)])^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}[(g(x) - E_{\mathcal{X}}[g(x)])^2]}_{\text{variance}}$
Estimating Bias and Variance
$M$ samples $\mathcal{X}_i = \{x^t_i, r^t_i\}$, $i = 1, \ldots, M$, are used to fit $g_i(x)$, $i = 1, \ldots, M$:
$\bar{g}(x) = \frac{1}{M} \sum_{i=1}^M g_i(x)$
$\text{Bias}^2(g) = \frac{1}{N} \sum_t [\bar{g}(x^t) - f(x^t)]^2$
$\text{Variance}(g) = \frac{1}{NM} \sum_t \sum_i [g_i(x^t) - \bar{g}(x^t)]^2$
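These three formulas translate directly into a simulation; the target function below is the $2\sin(1.5x)$ example used later in the slides, while the sample size and model order are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2 * np.sin(1.5 * x)
x = np.linspace(0, 5, 20)            # fixed evaluation points x^t
M, order = 100, 3                    # arbitrary: M cubic-fit models

G = np.empty((M, len(x)))
for i in range(M):                   # each model sees its own noisy sample
    r = f(x) + rng.normal(0, 1, len(x))
    G[i] = np.polyval(np.polyfit(x, r, order), x)

g_bar = G.mean(axis=0)                     # g_bar(x) = (1/M) sum_i g_i(x)
bias2 = ((g_bar - f(x)) ** 2).mean()       # (1/N) sum_t [g_bar - f]^2
variance = ((G - g_bar) ** 2).mean()       # (1/NM) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```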
Bias/Variance Dilemma
Example: $g_i(x) = 2$ has no variance and high bias.
$g_i(x) = \sum_t r^t_i / N$ has lower bias, with variance.
As we increase complexity, bias decreases (a better fit to data) and variance increases (the fit varies more with data).
Bias/Variance Dilemma
Figure: (a) the function $f(x) = 2\sin(1.5x)$ and one noisy ($\mathcal{N}(0,1)$) dataset sampled from it. Five samples are taken, each containing twenty instances. (b), (c), (d) show five polynomial fits $g_i(\cdot)$ of order 1, 3, and 5. In each case, the dotted line is the average of the five fits, $\bar{g}(\cdot)$.
Polynomial Regression
Figure: in the same setting as the previous one, but using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5. The best fit is the order with minimum error.
Model Selection
Cross-validation: measure generalization accuracy by testing on data unused during training (a sketch follows this list).
Regularization: penalize complex models: $E' = \text{error on data} + \lambda \cdot \text{model complexity}$
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
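A minimal cross-validation sketch for picking the polynomial order (the split and data are invented; real use would average over several splits):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 60)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 60)

# Hold out the last third as a validation set.
x_tr, r_tr, x_va, r_va = x[:40], r[:40], x[40:], r[40:]

for k in range(1, 6):
    w = np.polyfit(x_tr, r_tr, k)
    e_va = ((r_va - np.polyval(w, x_va)) ** 2).mean()
    print(k, e_va)   # choose the order at the "elbow" of the validation error
```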
Model Selection
Figure: validation error versus model complexity; the best fit is chosen at the "elbow" of the error curve.
Bayesian Model Selection
$p(\text{model}|\text{data}) = \frac{p(\text{data}|\text{model})\,p(\text{model})}{p(\text{data})}$
Prior on models, $p(\text{model})$.
Regularization, when the prior favors simpler models.
Bayes: MAP of the posterior, $p(\text{model}|\text{data})$.
Average over a number of models with high posterior probability.
Regression example
Coefficients increase in magnitude as order increases:
Order 1: [-0.0769, 0.0016]
Order 2: [0.1682, -0.6657, 0.0080]
Order 3: [0.4238, -2.5778, 3.4675, -0.0002]
Order 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Regularization penalizes large weights:
$E(\mathbf{w}|\mathcal{X}) = \frac{1}{2} \sum_t [r^t - g(x^t|\mathbf{w})]^2 + \frac{\lambda}{2} \sum_i w_i^2$
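A sketch of ridge-style regularized polynomial regression, solving $\mathbf{w} = (D^T D + \lambda I)^{-1} D^T \mathbf{r}$; for simplicity it shrinks $w_0$ along with the other weights, which the formula above does not require:

```python
import numpy as np

def ridge_fit(x, r, k, lam):
    """Regularized least squares: w = (D^T D + lam*I)^{-1} D^T r."""
    D = np.vander(x, k + 1, increasing=True)
    # Note: this also shrinks the intercept w0, a simplification.
    return np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 40)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 40)
print(ridge_fit(x, r, k=5, lam=1.0))   # larger lam -> smaller-magnitude weights
```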