Chapter 4: Non-Parametric Estimation
MIMA Group, Xin-Shun Xu
School of Computer Science and Technology, Shandong University
Contents

- Introduction
- Parzen Windows
- k-Nearest-Neighbor Estimation
- Classification Techniques
  - The nearest-neighbor rule (1-NN)
  - The k-nearest-neighbor rule (k-NN)
- Distance Metrics
Bayes Rule for Classification

P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)

To compute the posterior probability, we need to know the prior probability P(ω_i) and the likelihood p(x | ω_i).

Case I: p(x | ω_i) has a known parametric form
- Maximum-Likelihood Estimation
- Bayesian Parameter Estimation

Problem: the assumed parametric form may not fit the ground-truth density encountered in practice, e.g., assumed form unimodal, ground truth multimodal.
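As a quick numeric illustration of the rule above, here is a minimal Python sketch; the prior and likelihood values are made-up assumptions, chosen only to show the arithmetic:

```python
# Hypothetical two-class problem: the priors P(w_i) and the likelihoods
# p(x | w_i) at one fixed observation x are assumed numbers, not from the slides.
priors = {"w1": 0.6, "w2": 0.4}
likelihood = {"w1": 0.2, "w2": 0.5}

evidence = sum(priors[w] * likelihood[w] for w in priors)            # p(x)
posterior = {w: priors[w] * likelihood[w] / evidence for w in priors}
print(posterior)   # {'w1': 0.375, 'w2': 0.625} -> decide w2
```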
Non-Parametric Estimation

Case II: p(x | ω_i) does not have a known parametric form. How?

Let the data speak for themselves!
- Parzen Windows
- k_n-Nearest-Neighbor
Goals

- Estimate class-conditional densities p(x | ω_i)
- Estimate posterior probabilities P(ω_i | x)
Density Estimation

Assume p(x) is continuous and the region R is small.

Fundamental fact: the probability that a vector x falls into a region R is

P_R = ∫_R p(x') dx' ≈ p(x) V_R,

where V_R is the volume of R.

[Figure: a small region R about the point x, among n samples.]

Given n i.i.d. samples {x_1, x_2, ..., x_n}, let K denote the random variable counting the samples that fall into R. K follows a binomial distribution:

K ~ B(n, P_R),   P(K = k) = C(n, k) P_R^k (1 − P_R)^(n−k).
Density Estimation (continued)

Since E[K] = n P_R and P_R ≈ p(x) V_R, we get

p(x) ≈ E[K] / (n V_R).

Let k_R denote the actual number of samples observed in R; then

p(x) ≈ (k_R / n) / V_R.
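A minimal Python sketch of this estimate, assuming a standard normal source so that the true density at the query point is known (p(0) = 1/√(2π) ≈ 0.399); the region width and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)          # n i.i.d. samples from N(0, 1)

center, h = 0.0, 0.2                     # region R = [center - h/2, center + h/2]
V = h                                    # volume (length) of R, since d = 1
k = np.sum(np.abs(x - center) < h / 2)   # k_R: number of samples falling in R

p_hat = (k / x.size) / V                 # p(x) ~= (k_R / n) / V_R
print(p_hat)                             # close to 1/sqrt(2*pi) ~= 0.399
```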
Density Estimation

Use the subscript n to take the sample size into account:

p_n(x) = (k_n / n) / V_n.

We hope that lim_{n→∞} p_n(x) = p(x). For this, we should have

lim_{n→∞} V_n = 0,   lim_{n→∞} k_n = ∞,   lim_{n→∞} k_n / n = 0.
Density Estimation

In p_n(x) = (k_n / n) / V_n, which quantities can be controlled, and how?

- Fix V_n and determine k_n → Parzen windows
- Fix k_n and determine V_n → k_n-nearest-neighbor
Parzen Windows

Fix V_n and determine k_n. Assume R_n is a d-dimensional hypercube whose edges have length h_n, so that

V_n = h_n^d.

k_n is determined with a window function, a.k.a. kernel function or potential function. (The method is named after Emanuel Parzen, 1929–2016.)
Window Function

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d; 0 otherwise.

It defines a unit hypercube centered at the origin, so φ((x − x_i)/h_n) = 1 iff |x_j − x_ij| ≤ h_n/2 for every coordinate j.

[Figure: the unit hypercube defined by φ(u), and the scaled hypercube with edge length h_n.]
Window Function

φ((x − x_i)/h_n) = 1 means that x_i falls within the hypercube of volume V_n centered at x. Hence k_n, the number of samples inside the hypercube centered at x, is

k_n = Σ_{i=1}^n φ((x − x_i)/h_n).
Parzen Window Estimation

Substituting k_n into p_n(x) = (k_n / n) / V_n gives the Parzen pdf:

p_n(x) = (1/n) Σ_{i=1}^n (1/V_n) φ((x − x_i)/h_n).

φ(·) is not limited to the hypercube window function defined previously; it can be any pdf:

φ(u) ≥ 0,   ∫ φ(u) du = 1.
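A small Python sketch of this estimator with the hypercube window; the 2-d Gaussian toy data and the query point are illustrative assumptions:

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen estimate p_n(x) with the unit-hypercube window phi."""
    x = np.atleast_2d(x)                           # (m, d) query points
    samples = np.atleast_2d(samples)               # (n, d) training points
    n, d = samples.shape
    u = (x[:, None, :] - samples[None, :, :]) / h  # (m, n, d) scaled offsets
    inside = np.all(np.abs(u) <= 0.5, axis=2)      # phi(u) = 1 iff |u_j| <= 1/2
    k = inside.sum(axis=1)                         # k_n per query point
    return k / (n * h**d)                          # (1/n) * k_n / V_n, V_n = h^d

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))               # assumed toy data
print(parzen_hypercube([[0.0, 0.0]], data, h=0.5))
```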
Parzen Window Estimation

Is p_n(x) a pdf? Setting u = (x − x_i)/h_n, so that dx = h_n^d du = V_n du:

∫ p_n(x) dx = (1/n) Σ_{i=1}^n ∫ (1/V_n) φ((x − x_i)/h_n) dx = (1/n) Σ_{i=1}^n ∫ φ(u) du = 1.

So the Parzen pdf is itself a pdf, built from three ingredients: the window function (a pdf), the window width h_n, and the training data.
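A numeric check of this claim, using a Gaussian window (allowed, since any pdf qualifies); the data, the width h, and the grid range are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(200)            # assumed 1-d training data
h = 0.3
grid = np.linspace(-6.0, 6.0, 2001)        # wide enough to cover the mass

def phi(u):                                # Gaussian window, itself a pdf
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

p_n = phi((grid[:, None] - data[None, :]) / h).mean(axis=1) / h
print(p_n.sum() * (grid[1] - grid[0]))     # Riemann sum of p_n ~= 1.0
```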
Parzen Window Estimation

p_n(x) is a superposition of n interpolation functions, where each sample x_i contributes to p_n(x) according to its "distance" from x. Define

δ_n(x) = (1/V_n) φ(x / h_n);

then

p_n(x) = (1/n) Σ_{i=1}^n δ_n(x − x_i).

What is the effect of h_n (the window width) on the Parzen pdf?
Parzen Window Estimation: The Effect of h_n

δ_n(x) = (1/V_n) φ(x / h_n) = (1/h_n^d) φ(x / h_n).

The argument x / h_n affects the width (horizontal scale); the factor 1/h_n^d affects the amplitude (vertical scale).
Parzen Window Estimation

[Figure: the shape of δ_n(x) = (1/V_n) φ(x / h_n) for decreasing values of h_n, with φ(·) a 2-d Gaussian pdf.]
Parzen Window Estimation

When h_n is very large, δ_n(x) is broad with small amplitude: p_n(x) is the superposition of n broad, slowly changing functions, i.e., smooth but of low resolution.

When h_n is very small, δ_n(x) is sharp with large amplitude: p_n(x) is the superposition of n sharp pulses, i.e., variable/unstable but of high resolution.
Parzen Window Estimation

[Figure: Parzen window estimates p_n(x) = (1/n) Σ_i δ_n(x − x_i) for five samples, with φ(·) a 2-d Gaussian pdf.]
Parzen Window Estimation: Convergence Conditions

To ensure convergence, i.e.,

lim_{n→∞} E[p_n(x)] = p(x)   and   lim_{n→∞} Var[p_n(x)] = 0,

we need the following additional constraints:

- sup_u φ(u) < ∞
- lim_{||u||→∞} φ(u) Π_{i=1}^d u_i = 0
- lim_{n→∞} V_n = 0
- lim_{n→∞} n V_n = ∞
Illustrations

One-dimensional case: X ~ N(0, 1), with a Gaussian window

φ(u) = (1/√(2π)) e^{−u²/2},   h_n = h_1/√n,

p_n(x) = (1/n) Σ_{i=1}^n (1/h_n) φ((x − x_i)/h_n).

[Figure: Parzen estimates for various sample sizes n and initial widths h_1.]
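A sketch reproducing this illustration numerically, under the same setup (X ~ N(0, 1), Gaussian window, h_n = h_1/√n); the particular sample sizes are arbitrary:

```python
import numpy as np

def parzen_gauss(x_grid, samples, h1):
    """1-d Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n)."""
    hn = h1 / np.sqrt(samples.size)
    u = (x_grid[:, None] - samples[None, :]) / hn
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=1) / hn

rng = np.random.default_rng(0)
grid = np.linspace(-4.0, 4.0, 201)
for n in (10, 100, 10_000):                # the estimate sharpens as n grows
    x = rng.standard_normal(n)             # X ~ N(0, 1), as on the slide
    p_n = parzen_gauss(grid, x, h1=1.0)
    print(n, p_n[100])                     # value at x = 0 -> ~0.399
```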
Illustrations

Two-dimensional case: Gaussian window, h_n = h_1/√n,

p_n(x) = (1/n) Σ_{i=1}^n (1/h_n²) φ((x − x_i)/h_n).

[Figure: Parzen estimates for various sample sizes n and initial widths h_1.]
Classification Example

[Figure: decision regions of Parzen-window classifiers trained with a smaller window vs. a larger window.]
Choosing the Window Function

V_n must approach zero as n → ∞, but at a rate slower than 1/n, e.g.,

V_n = V_1/√n.

The value of the initial volume V_1 is important. In some cases, a cell volume proper for one region may be unsuitable in a different region.
k_n-Nearest Neighbor

Fix k_n and then determine V_n. To estimate p_n(x) = (k_n / n) / V_n, center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g., k_n = √n.

A principled rule for choosing k_n:

lim_{n→∞} k_n = ∞,   lim_{n→∞} V_n = 0.
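A minimal 1-d sketch of this estimator, assuming Gaussian toy data and k_n = √n; the cell about each query point is the smallest interval containing its k_n nearest samples:

```python
import numpy as np

def knn_density(x_grid, samples, kn):
    """1-d k_n-NN estimate: grow a cell about x until it holds k_n samples."""
    dists = np.abs(x_grid[:, None] - samples[None, :])
    r = np.sort(dists, axis=1)[:, kn - 1]  # distance to the k_n-th neighbor
    V = 2 * r                              # cell [x - r, x + r], 1-d volume
    return (kn / samples.size) / V         # p_n(x) = (k_n / n) / V_n

rng = np.random.default_rng(0)
x = rng.standard_normal(400)               # assumed toy data
kn = int(np.sqrt(x.size))                  # k_n = sqrt(n), as suggested above
print(knn_density(np.linspace(-3.0, 3.0, 7), x, kn))
```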
k_n-Nearest Neighbor

[Figure: Eight points in one dimension (n = 8, d = 1); red curve: k_n = 3, black curve: k_n = 5. Thirty-one points in two dimensions (n = 31, d = 2); black surface: k_n = 5.]
Estimation of the A Posteriori Probability

P_n(ω_i | x) = ? Place a cell of volume V_n about x and suppose it captures k samples, k_i of which belong to class ω_i. Then

p_n(x, ω_i) = (k_i / n) / V_n,   Σ_{j=1}^c p_n(x, ω_j) = (k / n) / V_n,

so

P_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^c p_n(x, ω_j) = k_i / k.
The value of V_n or k_n can be determined with either the Parzen-window or the k_n-nearest-neighbor technique.
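A small sketch of the k_i / k estimate; the two-Gaussian toy data and the choice k = 7 are assumptions for illustration:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k):
    """Estimate P_n(w_i | x) as k_i / k among the k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return {c: float(np.mean(nearest == c)) for c in np.unique(y_train)}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)),   # class 0 centered at (-1, -1)
               rng.normal(+1, 1, (50, 2))])  # class 1 centered at (+1, +1)
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([0.8, 0.8]), X, y, k=7))   # mostly class 1
```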
Nearest Neighbor Classifier

Store all training examples. Given a new example x to be classified, search for the training example (x_i, y_i) whose x_i is most similar (closest) to x, and predict y_i. (Lazy learning.)

This is the estimate P_n(ω_i | x) = k_i / k with k = 1.
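A minimal 1-NN sketch of this rule; the three stored points and their labels are made up:

```python
import numpy as np

def nn_predict(x, X_train, y_train):
    """1-NN: predict the label of the closest stored training example."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array(["a", "b", "b"])
print(nn_predict(np.array([0.2, -0.1]), X_train, y_train))   # -> 'a'
```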
Decision Boundaries

The Voronoi diagram: given a set of points, a Voronoi diagram describes the areas that are nearest to each point. These areas can be viewed as zones of control.
Decision Boundaries

The decision boundary is formed by retaining only those line segments that separate different classes.

The more training examples we store, the more complex the decision boundary can become.
Decision Boundaries

With a large number of examples and noise in the labels, the decision boundary can become nasty! Note the islands in this figure: they are formed by noisy examples. If the nearest neighbor happens to be a noisy point, the prediction will be incorrect.

How do we deal with this?
Effect of k

Different k values give different results:

- A large k produces smoother boundaries: the impacts of individual label noises cancel each other out.
- When k is too large, the boundaries become oversimplified; e.g., with k = N we always predict the majority class.
How to Choose k?

Can we choose k to minimize the mistakes we make on training examples (the training error)? What is the training error of the nearest-neighbor rule?

Can we choose k to minimize the mistakes we make on test examples (the test error)?
How to Choose k?

How do the training error and the test error change as we change the value of k?
Model Selection

Choosing k for k-NN is just one of the many model selection problems we face in machine learning. Model selection is about choosing among different models:

- linear regression vs. quadratic regression
- k-NN vs. decision tree

It is heavily studied in machine learning and of crucial importance in practice. If we use the training error to select models, we will always choose the more complex ones.
Model Selection

We can keep part of the labeled data apart as validation data:

- evaluate different k values by their prediction accuracy on the validation data;
- choose the k that minimizes the validation error (see the sketch below).

Validation can be viewed as another name for testing, but the name "testing" is typically reserved for final evaluation, whereas validation is mostly used for model selection.
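A sketch of this selection procedure on assumed toy data: hold out part of the labeled set, score several values of k on it, and keep the k with the lowest validation error:

```python
import numpy as np

def knn_predict(x, X, y, k):
    """Majority vote among the k nearest neighbors (Euclidean distance)."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    vals, counts = np.unique(y[idx], return_counts=True)
    return vals[np.argmax(counts)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

perm = rng.permutation(len(y))             # hold out 40 points for validation
tr, va = perm[:80], perm[80:]

for k in (1, 3, 5, 9, 15):
    err = np.mean([knn_predict(X[i], X[tr], y[tr], k) != y[i] for i in va])
    print(k, err)                          # pick the k with the lowest error
```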
Model Selection

The impact of the validation set size:

- If we reserve only one point for the validation set, should we trust the validation error as a reliable estimate of the classifier's performance?
- The larger the validation set, the more reliable our model selection choices are.
- When the total labeled set is small, we may not be able to set aside a big enough validation set, leading to unreliable model selection decisions.
Model Selection

K-fold cross validation: perform learning/testing K times; each time, reserve one subset as the validation set and train on the rest.

Special case: leave-one-out cross validation.
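A minimal sketch of the K-fold split itself (the learner is left as a comment); the sizes n = 100 and K = 5 are arbitrary assumptions:

```python
import numpy as np

def kfold_indices(n, K, rng):
    """Split the indices {0, ..., n-1} into K roughly equal random folds."""
    return np.array_split(rng.permutation(n), K)

rng = np.random.default_rng(0)
folds = kfold_indices(100, K=5, rng=rng)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate on val_idx, then average the K scores
    print(i, len(train_idx), len(val_idx))
```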
Other Issues of k-NN

It can be computationally expensive to find the nearest neighbors! The computation can be sped up with smart data structures that quickly search for approximate solutions.

For large data sets, it also requires a lot of memory; unimportant examples can be removed.
Final Words on k-NN

k-NN is what we call lazy learning (vs. eager learning):

- Lazy: learning only occurs when you see the test example.
- Eager: learn a model before you see the test example; the training examples can be thrown away after learning.

Advantages:
- conceptually simple, easy to understand and explain
- very flexible decision boundaries
- not much learning at all!

Disadvantages:
- it can be hard to find a good distance measure
- irrelevant features and noise can be very detrimental
- typically cannot handle more than about 30 attributes
- computational cost: requires a lot of computation and memory
Distance Metrics

The distance measure is an important factor for the nearest-neighbor classifier, e.g., for achieving invariant pattern recognition and data mining results. Consider the effect of changing units of measurement.
Properties of a Distance Metric

- Nonnegativity: D(a, b) ≥ 0
- Reflexivity: D(a, b) = 0 iff a = b
- Symmetry: D(a, b) = D(b, a)
- Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)
Minkowski Metric (L_p Norm)

||x||_p = (Σ_{i=1}^d |x_i|^p)^{1/p}

1. L_1 norm (Manhattan or city-block distance): L_1(a, b) = ||a − b||_1 = Σ_{i=1}^d |a_i − b_i|
2. L_2 norm (Euclidean distance): L_2(a, b) = ||a − b||_2 = (Σ_{i=1}^d |a_i − b_i|²)^{1/2}
3. L_∞ norm (chessboard distance): L_∞(a, b) = ||a − b||_∞ = max_i |a_i − b_i|
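The three cases can share one implementation; here is a small sketch (the test vectors are made up):

```python
import numpy as np

def minkowski(a, b, p):
    """L_p distance between vectors a and b; p = np.inf gives the max norm."""
    d = np.abs(np.asarray(a) - np.asarray(b))
    return d.max() if np.isinf(p) else (d**p).sum() ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))        # 7.0  Manhattan / city block
print(minkowski(a, b, 2))        # 5.0  Euclidean
print(minkowski(a, b, np.inf))   # 4.0  chessboard
```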
Summary

Basic setting for non-parametric techniques:

- Let the data speak for themselves: no parametric form is assumed for the class-conditional pdf.
- Estimate the class-conditional pdf from training examples.
- Make predictions based on Bayes' theorem.

Fundamental result in density estimation:

p_n(x) = (k_n / n) / V_n.
Summary

Parzen windows: fix V_n and then determine k_n:

p_n(x) = (k_n / n) / V_n = (1/n) Σ_{i=1}^n (1/V_n) φ((x − x_i)/h_n),

built from the window function (itself a pdf), the window width, and the training data.
Summary

k_n-nearest-neighbor: fix k_n and then determine V_n. To estimate p(x), center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g., k_n = √n.

A principled rule for choosing k_n: lim_{n→∞} k_n = ∞, lim_{n→∞} V_n = 0.
Any Question?