CHAPTER 4: Parametric Methods
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Parametric Estimation
X = {x^t}_{t=1}^N where x^t ~ p(x)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X.
E.g., for N(μ, σ²), θ = {μ, σ²}.
Problem: How can we obtain θ from X?
Assumption: X contains samples of a one-dimensional random variable. Later, in multivariate estimation, each sample in X contains multiple measurements, not only a single one.
Maximum Likelihood Estimation
Density function p with parameters θ is given and x^t ~ p(x|θ)
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
We look for the θ that "maximizes the likelihood of the sample"!
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)
Homework: Sample: 0, 3, 3, 4, 5 and x ~ N(μ, σ²)? Use MLE to find (μ, σ²)! (A sketch appears after the Gaussian slide below.)
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)
L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1−x^t)
MLE: p₀ = ∑_t x^t / N
Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x₁, x₂, ..., x_K) = ∏_i p_i^(x_i)
L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
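A minimal sketch of these two MLEs in Python (numpy assumed; the samples are illustrative): the Bernoulli estimate is just the proportion of successes, and the multinomial estimates are the per-state proportions.

import numpy as np

# Bernoulli MLE: p0 = sum_t x^t / N, the proportion of 1s
x = np.array([1, 0, 1, 1, 0, 1])
p0 = x.mean()

# Multinomial MLE: p_i = sum_t x_i^t / N, with each x^t one-hot over K states
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
p = X.mean(axis=0)                 # per-state proportions, sums to 1
print(p0, p)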
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
p(x) = (1/(√(2π) σ)) exp[−(x − μ)² / (2σ²)]
MLE for μ and σ²:
m = (1/N) ∑_t x^t
s² = (1/N) ∑_t (x^t − m)²
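As a minimal sketch (numpy assumed), the two MLE formulas above applied to the homework sample 0, 3, 3, 4, 5:

import numpy as np

x = np.array([0., 3., 3., 4., 5.])  # the homework sample
m = x.mean()                        # MLE of mu: (1/N) sum_t x^t
s2 = ((x - m) ** 2).mean()          # MLE of sigma^2; divides by N, not N-1
print(m, s2)                        # 3.0 and 2.8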
Bias and Variance
Unknown parameter θ; estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ (error in the model itself)
Variance: E[(d − E[d])²] (variation/randomness of the model across samples)
Mean square error of the estimator d:
r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
(See the simulation sketch below.)
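A small simulation sketch (numpy assumed) of these definitions: the MLE variance estimator s² divides by N and is therefore biased downward, yet its mean square error still splits into bias² + variance.

import numpy as np

rng = np.random.default_rng(0)
sigma2, N, M = 4.0, 10, 100_000          # true variance, sample size, repetitions
X = rng.normal(0.0, np.sqrt(sigma2), (M, N))
d = X.var(axis=1)                        # s^2 for each sample (divides by N)

bias = d.mean() - sigma2                 # approximately -sigma2/N = -0.4
variance = d.var()
mse = ((d - sigma2) ** 2).mean()
print(bias, variance, mse, bias ** 2 + variance)  # mse matches bias^2 + variance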
Bayes' Estimator
Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Comments: ML just takes the maximum of the likelihood. Compared with ML, MAP additionally considers the prior. The Bayes' estimator averages over all possible values of θ, each weighted by how probable it is given the sample (the posterior p(θ|X)).
For MAP see: http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation
Bayes' Estimator: Example
x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)
θ_ML = m
θ_MAP = θ_Bayes = E[θ|X] = [N/σ₀² / (N/σ₀² + 1/σ²)] m + [1/σ² / (N/σ₀² + 1/σ²)] μ
As N (or the prior variance σ²) grows, the estimate converges to m.
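A sketch of this weighted combination (numpy assumed; the sample and the prior parameters are illustrative). As N grows, or as the prior variance σ² grows, the weight on the sample mean m approaches 1:

import numpy as np

x = np.array([2.1, 1.7, 2.4, 2.0])         # sample, x^t ~ N(theta, sigma0^2)
sigma0_sq, mu, sigma_sq = 0.25, 0.0, 1.0   # assumed noise variance and prior N(mu, sigma^2)

N, m = len(x), x.mean()
w = (N / sigma0_sq) / (N / sigma0_sq + 1.0 / sigma_sq)  # weight on the sample mean
theta_bayes = w * m + (1.0 - w) * mu       # posterior mean; also the MAP here
print(m, theta_bayes)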
Parametric Classification
Discriminant: g_i(x) = p(x|C_i) P(C_i), or equivalently g_i(x) = log p(x|C_i) + log P(C_i)
If p(x|C_i) = N(μ_i, σ_i²):
p(x|C_i) = (1/(√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
(g_i(x) acts as a kind of unnormalized log posterior for p(C_i|x).)
Given the sample X = {x^t, r^t}_{t=1}^N, where r_i^t = 1 if x^t ∈ C_i and r_i^t = 0 if x^t ∈ C_j, j ≠ i,
ML estimates are:
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t
Discriminant becomes:
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
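A minimal sketch of this classifier (numpy assumed; the data are illustrative): estimate P̂(C_i), m_i, s_i² per class, then pick the class with the largest discriminant.

import numpy as np

def fit(x, y, K):
    """Per-class MLE: prior, mean, variance (one-dimensional input)."""
    return [(np.mean(y == i), x[y == i].mean(), x[y == i].var()) for i in range(K)]

def g(x, prior, m, s2):
    """Discriminant -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i), constants dropped."""
    return -0.5 * np.log(s2) - (x - m) ** 2 / (2 * s2) + np.log(prior)

x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit(x, y, K=2)
print(int(np.argmax([g(2.0, *p) for p in params])))  # classify x = 2.0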
Equal variances
Single boundary at halfway between the means
Variances are different
Two boundaries
Homework!
Regression
Estimator: g(x|θ) for f(x)
r = f(x) + ε, where ε ~ N(0, σ²)
Therefore p(r|x) ~ N(g(x|θ), σ²)
L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
= log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)
Maximizing the probability of the sample again!
Regression: From LogL to Error
L(θ|X) = log ∏_{t=1}^N (1/(√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]
= −N log(√(2π) σ) − (1/(2σ²)) ∑_{t=1}^N (r^t − g(x^t|θ))²
Maximizing the log likelihood is therefore equivalent to minimizing the squared error:
E(θ|X) = (1/2) ∑_{t=1}^N (r^t − g(x^t|θ))²
Skip ahead to the Bias/Variance Dilemma!
Linear Regression
g(x^t|w₁, w₀) = w₁ x^t + w₀
Setting the derivatives of the squared error to zero gives the normal equations:
∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²
In matrix form A w = y:
A = [ N         ∑_t x^t
      ∑_t x^t   ∑_t (x^t)² ],  w = [w₀, w₁]^T,  y = [∑_t r^t, ∑_t r^t x^t]^T
w = A⁻¹ y
Relationship to what we discussed in Topic 2??
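A sketch of solving these normal equations directly (numpy assumed; data illustrative):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.1, 2.9, 5.2, 7.1])
N = len(x)

A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, y)     # w = A^{-1} y
print(w0, w1)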
Polynomial Regression
g(x^t|w_k, ..., w₂, w₁, w₀) = w_k (x^t)^k + ... + w₂ (x^t)² + w₁ x^t + w₀
Collect the inputs into the design matrix D and the outputs into r:
D = [ 1   x¹    (x¹)²   ...  (x¹)^k
      1   x²    (x²)²   ...  (x²)^k
      ...
      1   x^N   (x^N)²  ...  (x^N)^k ],  r = [r¹, r², ..., r^N]^T
w = (D^T D)⁻¹ D^T r
Here we get k+1 equations with k+1 unknowns!
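The same recipe as a sketch for the polynomial case (numpy assumed; data illustrative): build the design matrix D and solve for w, here via least squares rather than an explicit inverse for numerical stability.

import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
r = np.array([0.1, 0.4, 1.1, 2.2, 4.1])
k = 2                                        # polynomial degree

D = np.vander(x, k + 1, increasing=True)     # columns 1, x, x^2, ..., x^k
w, *_ = np.linalg.lstsq(D, r, rcond=None)    # solves the k+1 normal equations
print(w)                                     # w_0, w_1, ..., w_k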
Other Error Measures
Square error: E(θ|X) = (1/2) ∑_{t=1}^N (r^t − g(x^t|θ))²
Relative square error: E(θ|X) = ∑_{t=1}^N (r^t − g(x^t|θ))² / ∑_{t=1}^N (r^t − r̄)²
Absolute error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|
ε-sensitive error: E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
Bias and Variance
Expected squared error at x decomposes as
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
(the first term is noise, the second the squared error of g)
and the squared error itself decomposes over samples X as
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
(bias² + variance)
To be revisited next week!
Estimating Bias and Variance
M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M
ḡ(x) = (1/M) ∑_{i=1}^M g_i(x)
Bias²(g) = (1/N) ∑_t (ḡ(x^t) − f(x^t))²
Variance(g) = (1/(N M)) ∑_t ∑_i (g_i(x^t) − ḡ(x^t))²
Initially skip!
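When revisiting this slide, a small simulation sketch of these estimates (numpy assumed; the target f and noise level are illustrative):

import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                   # known target, for illustration only
x = np.linspace(0.0, np.pi, 20)
M = 200

G = np.empty((M, len(x)))                    # fit M degree-1 polynomials
for i in range(M):
    r = f(x) + rng.normal(0.0, 0.3, len(x))  # fresh noisy sample X_i
    G[i] = np.polyval(np.polyfit(x, r, 1), x)    # g_i(x)

g_bar = G.mean(axis=0)                       # average fit
bias2 = ((g_bar - f(x)) ** 2).mean()         # (1/N) sum_t (gbar - f)^2
variance = ((G - g_bar) ** 2).mean()         # (1/(N M)) sum_t sum_i (g_i - gbar)^2
print(bias2, variance)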
Bias/Variance Dilemma
Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r_i^t / N has lower bias but nonzero variance
As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)
Bias/Variance dilemma (Geman et al., 1992)
[Figure: target function f, individual fits g_i, and their average ḡ, with bias and variance indicated.]
Already visited as Topic 4!
Polynomial Regression
Best fit “min error”
Model Selection
Cross-validation: Measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models, E' = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
Remark: will be discussed in more depth later: Topic 11
Bayesian Model Selection
Prior on models, p(model)
p(model|data) = p(data|model) p(model) / p(data)
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 15)
CHAPTER 5: Multivariate Methods
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples
X = [ X₁¹    X₂¹    ...  X_d¹
      X₁²    X₂²    ...  X_d²
      ...
      X₁^N   X₂^N   ...  X_d^N ]
Multivariate Parameters
Mean: E[x] = μ = [μ₁, ..., μ_d]^T
Covariance: σ_ij ≡ Cov(X_i, X_j)
Correlation: Corr(X_i, X_j) ≡ ρ_ij = σ_ij / (σ_i σ_j)
Σ ≡ Cov(X) = E[(X − μ)(X − μ)^T] =
    [ σ₁²    σ₁₂   ...  σ₁d
      σ₂₁    σ₂²   ...  σ₂d
      ...
      σ_d1   σ_d2  ...  σ_d² ]
Parameter Estimation
Sample mean m: m_i = (1/N) ∑_{t=1}^N x_i^t, i = 1, ..., d
Covariance matrix S: s_ij = (1/N) ∑_{t=1}^N (x_i^t − m_i)(x_j^t − m_j)
Correlation matrix R: r_ij = s_ij / (s_i s_j)
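A sketch of these estimates (numpy assumed; data illustrative). The covariance is computed explicitly here because np.cov divides by N−1 by default, whereas the formula above divides by N:

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 3.5],
              [3.0, 5.1],
              [4.0, 6.8]])                   # N = 4 instances, d = 2 features

m = X.mean(axis=0)                           # sample mean
S = (X - m).T @ (X - m) / len(X)             # covariance matrix, dividing by N
s = np.sqrt(np.diag(S))
R = S / np.outer(s, s)                       # correlation matrix
print(m, S, R, sep="\n")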
Multivariate Normal Distribution
x ~ N_d(μ, Σ)
p(x) = (1/((2π)^(d/2) |Σ|^(1/2))) exp[−(1/2)(x − μ)^T Σ⁻¹ (x − μ)]
(x − μ)^T Σ⁻¹ (x − μ) is the Mahalanobis distance between x and μ
http://www.analyzemath.com/Calculators/inverse_matrix_3by3.html
Multivariate Normal Distribution
Mahalanobis distance: (x − μ)^T Σ⁻¹ (x − μ) measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and correlations)
Bivariate case, d = 2:
Σ = [ σ₁²     ρσ₁σ₂
      ρσ₁σ₂   σ₂² ]
p(x₁, x₂) = (1/(2π σ₁ σ₂ √(1 − ρ²))) exp[−(1/(2(1 − ρ²))) (z₁² − 2ρ z₁ z₂ + z₂²)]
where z_i = (x_i − μ_i)/σ_i is the z-score for x_i
Remark: ρ is the correlation between the two variables
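A sketch of the Mahalanobis distance (numpy assumed; μ and Σ illustrative), compared with the squared Euclidean distance it generalizes:

import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])               # correlated bivariate case, rho = 0.8
x = np.array([1.0, 1.0])

d = x - mu
maha = d @ np.linalg.solve(Sigma, d)         # (x - mu)^T Sigma^{-1} (x - mu)
print(maha, d @ d)                           # Mahalanobis vs. squared Euclidean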
Bivariate Normal
Independent Inputs: Naive Bayes
If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:
p(x) = ∏_{i=1}^d p_i(x_i) = (1/((2π)^(d/2) ∏_i σ_i)) exp[−(1/2) ∑_{i=1}^d ((x_i − μ_i)/σ_i)²]
If the variances are also equal, this reduces to the Euclidean distance
Parametric Classification
If p(x|C_i) ~ N(μ_i, Σ_i):
p(x|C_i) = (1/((2π)^(d/2) |Σ_i|^(1/2))) exp[−(1/2)(x − μ_i)^T Σ_i⁻¹ (x − μ_i)]
Discriminant functions are:
g_i(x) = log p(x|C_i) + log P(C_i)
= −(d/2) log 2π − (1/2) log |Σ_i| − (1/2)(x − μ_i)^T Σ_i⁻¹ (x − μ_i) + log P(C_i)
Estimation of Parameters
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t r_i^t x^t / ∑_t r_i^t
S_i = ∑_t r_i^t (x^t − m_i)(x^t − m_i)^T / ∑_t r_i^t
g_i(x) = −(1/2) log |S_i| − (1/2)(x − m_i)^T S_i⁻¹ (x − m_i) + log P̂(C_i)
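A sketch of these estimates and the resulting discriminant (numpy assumed; data illustrative):

import numpy as np

def fit_class(X):
    """MLE mean and covariance (dividing by N) for one class."""
    m = X.mean(axis=0)
    return m, (X - m).T @ (X - m) / len(X)

def g(x, m, S, prior):
    """-(1/2) log|S_i| - (1/2)(x - m_i)^T S_i^{-1} (x - m_i) + log Phat(C_i)"""
    d = x - m
    return -0.5 * np.linalg.slogdet(S)[1] - 0.5 * d @ np.linalg.solve(S, d) + np.log(prior)

X0 = np.array([[0.0, 0.0], [0.5, 0.3], [-0.2, 0.4], [0.1, -0.3]])
X1 = np.array([[2.0, 2.0], [2.4, 1.7], [1.8, 2.5], [2.2, 2.1]])
params, priors = [fit_class(X0), fit_class(X1)], [0.5, 0.5]
x = np.array([1.0, 1.0])
print(int(np.argmax([g(x, m, S, p) for (m, S), p in zip(params, priors)])))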
Different S_i
Quadratic discriminant:
g_i(x) = x^T W_i x + w_i^T x + w_i0
where
W_i = −(1/2) S_i⁻¹
w_i = S_i⁻¹ m_i
w_i0 = −(1/2) m_i^T S_i⁻¹ m_i − (1/2) log |S_i| + log P̂(C_i)
skip
[Figure: class likelihoods, the posterior for C₁, and the discriminant boundary where P(C₁|x) = 0.5.]
Common Covariance Matrix S
Shared common sample covariance S:
S = ∑_i P̂(C_i) S_i
Discriminant reduces to
g_i(x) = −(1/2)(x − m_i)^T S⁻¹ (x − m_i) + log P̂(C_i)
which is a linear discriminant:
g_i(x) = w_i^T x + w_i0
where w_i = S⁻¹ m_i and w_i0 = −(1/2) m_i^T S⁻¹ m_i + log P̂(C_i)
Initially skip!
Common Covariance Matrix S (initially skip!)
Diagonal S
When x_j, j = 1, ..., d, are independent, Σ is diagonal
p(x|C_i) = ∏_j p(x_j|C_i) (Naive Bayes' assumption)
Classify based on weighted Euclidean distance (in s_j units) to the nearest mean:
g_i(x) = −(1/2) ∑_{j=1}^d ((x_j^t − m_ij)/s_j)² + log P̂(C_i)
Likely covered in April!
Diagonal S
variances may be different
Diagonal S, equal variances
Nearest mean classifier: classify based on Euclidean distance to the nearest mean:
g_i(x) = −‖x − m_i‖² / (2s²) + log P̂(C_i) = −(1/(2s²)) ∑_{j=1}^d (x_j^t − m_ij)² + log P̂(C_i)
Each mean can be considered a prototype or template, and this is template matching
Diagonal S, equal variances
Model Selection
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Assumption                    Covariance matrix        No. of parameters
Shared, hyperspheric          S_i = S = s²I            1
Shared, axis-aligned          S_i = S, with s_ij = 0   d
Shared, hyperellipsoidal      S_i = S                  d(d+1)/2
Different, hyperellipsoidal   S_i                      K · d(d+1)/2
Discrete Features
Binary features: p_ij ≡ p(x_j = 1|C_i)
If the x_j are independent (Naive Bayes'):
p(x|C_i) = ∏_{j=1}^d p_ij^(x_j) (1 − p_ij)^(1−x_j)
the discriminant is linear:
g_i(x) = log p(x|C_i) + log P(C_i)
= ∑_j [x_j log p_ij + (1 − x_j) log(1 − p_ij)] + log P(C_i)
Estimated parameters: p̂_ij = ∑_t x_j^t r_i^t / ∑_t r_i^t
skip!
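For reference, a minimal sketch of this Bernoulli naive Bayes (numpy assumed; data illustrative): the p̂_ij are per-class feature frequencies, and the discriminant is the linear form above.

import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])                    # binary features
y = np.array([0, 0, 1, 1])

priors = np.array([np.mean(y == i) for i in (0, 1)])
P = np.array([X[y == i].mean(axis=0) for i in (0, 1)])  # phat_ij
P = P.clip(1e-9, 1 - 1e-9)                   # guard against log(0)

x = np.array([1, 0, 0])
g = (x * np.log(P) + (1 - x) * np.log(1 - P)).sum(axis=1) + np.log(priors)
print(int(g.argmax()))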
Discrete Features
Multinomial (1-of-n_j) features: x_j ∈ {v₁, v₂, ..., v_nj}
Define indicators z_jk = 1 if x_j = v_k and 0 otherwise
p_ijk ≡ p(z_jk = 1|C_i) = p(x_j = v_k|C_i)
If the x_j are independent:
p(x|C_i) = ∏_{j=1}^d ∏_{k=1}^{n_j} p_ijk^(z_jk)
g_i(x) = ∑_j ∑_k z_jk log p_ijk + log P(C_i)
p̂_ijk = ∑_t z_jk^t r_i^t / ∑_t r_i^t
skip!
Multivariate Regression
Multivariate linear model:
r^t = g(x^t|w₀, w₁, ..., w_d) + ε = w₀ + w₁ x₁^t + w₂ x₂^t + ... + w_d x_d^t + ε
E(w₀, w₁, ..., w_d|X) = (1/2) ∑_t (r^t − w₀ − w₁ x₁^t − ... − w_d x_d^t)²
Multivariate polynomial model: define new higher-order variables
z₁ = x₁, z₂ = x₂, z₃ = x₁², z₄ = x₂², z₅ = x₁x₂
and use the linear model in this new z space (basis functions, kernel trick, SVM: Chapter 10)
skip!