
Page 1: (source: mima.sdu.edu.cn/Members/xinshunxu/Courses/ML/Chapter4.pdf)

MIMA Group

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University

Chapter 4: Non-Parametric Estimation

M L D M

Page 2:

Contents

  Introduction
  Parzen Windows
  k-Nearest-Neighbor Estimation
  Classification Techniques
    The Nearest-Neighbor Rule (1-NN)
    The Nearest-Neighbor Rule (k-NN)
  Distance Metrics

Page 3:

Bayes Rule for Classification

P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{p(\mathbf{x})}

To compute the posterior probability, we need to know the prior probability P(\omega_i) and the likelihood p(\mathbf{x} \mid \omega_i).

Case I: p(\mathbf{x} \mid \omega_i) has a known parametric form
  Maximum-Likelihood Estimation
  Bayesian Parameter Estimation

Problem: the assumed parametric form may not fit the ground-truth density encountered in practice, e.g., assumed parametric form: unimodal; ground truth: multimodal.

Page 4:

Non-Parametric Estimation

Case II: p(\mathbf{x} \mid \omega_i) does not have a known parametric form. How to estimate it?

Let the data speak for themselves!
  Parzen Windows
  k_n-Nearest-Neighbor

Page 5:

Goals

Estimate class-conditional densities p(\mathbf{x} \mid \omega_i).

Estimate posterior probabilities P(\omega_i \mid \mathbf{x}).

Page 6:

Density Estimation

Assume p(\mathbf{x}) is continuous and the region R is small. Fundamental fact: the probability that a vector \mathbf{x} falls into the region R is

P = \int_R p(\mathbf{x}')\,d\mathbf{x}' \approx p(\mathbf{x})\,V,

where V is the volume of R.

(figure: a region R, marked +\mathbf{x}, among n samples)

Given n i.i.d. samples \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}, let K denote the random variable counting the samples that fall into R. K follows a binomial distribution:

K \sim B(n, P), \qquad P(K = k) = \binom{n}{k} P^k (1 - P)^{n-k}.

Page 7:

Density Estimation (cont.)

P = \int_R p(\mathbf{x}')\,d\mathbf{x}' \approx p(\mathbf{x})\,V, \qquad E[K] = nP \;\Rightarrow\; p(\mathbf{x}) \approx \frac{E[K]/n}{V}.

Let k_R denote the actual number of samples observed in R. Substituting k_R for E[K] gives the estimate

p(\mathbf{x}) \approx \frac{k_R/n}{V}.

Page 8:

Density Estimation (cont.)

Use a subscript n to take the sample size into account:

p_n(\mathbf{x}) = \frac{k_n/n}{V_n}.

We hope that \lim_{n\to\infty} p_n(\mathbf{x}) = p(\mathbf{x}). For this, we should have

\lim_{n\to\infty} V_n = 0, \qquad \lim_{n\to\infty} k_n = \infty, \qquad \lim_{n\to\infty} k_n/n = 0.
Page 9:

Density Estimation (cont.)

In p_n(\mathbf{x}) = \frac{k_n/n}{V_n}, which quantities can be controlled, and how?

Fix V_n and determine k_n: Parzen Windows.
Fix k_n and determine V_n: k_n-Nearest-Neighbor.

Page 10:

Parzen Windows

Fix V_n and determine k_n in p_n(\mathbf{x}) = \frac{k_n/n}{V_n}.

Assume R_n is a d-dimensional hypercube whose edge length is h_n, so that

V_n = h_n^d.

Determine k_n with a window function, a.k.a. kernel function or potential function.

(photo: Emanuel Parzen, 1929–2016)

Page 11:

Window Function

\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 1/2, \; j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}

It defines a unit hypercube centered at the origin.

(figure: the unit hypercube, and a hypercube with edge length h_n)

Page 12:

Window Function (cont.)

\varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) = \begin{cases} 1 & |x_j - x_{ij}| \le h_n/2, \; j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}

\varphi((\mathbf{x} - \mathbf{x}_i)/h_n) = 1 means that \mathbf{x}_i falls within the hypercube of volume V_n centered at \mathbf{x}.

k_n, the number of samples inside the hypercube centered at \mathbf{x}, is therefore

k_n = \sum_{i=1}^{n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right).

Page 13:

Parzen Window Estimation

Substituting k_n into p_n(\mathbf{x}) = \frac{k_n/n}{V_n} gives the Parzen pdf

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right).

\varphi(\cdot) is not limited to the hypercube window function defined previously. It can be any pdf, i.e., any function satisfying

\varphi(\mathbf{u}) \ge 0, \qquad \int \varphi(\mathbf{u})\,d\mathbf{u} = 1.
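The Parzen pdf can be sketched directly. A minimal illustration with the hypercube window (function and variable names are my own, not from the slides):

```python
def hypercube_kernel(u):
    # phi(u) = 1 if every coordinate satisfies |u_j| <= 1/2, else 0
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_estimate(x, data, h):
    # p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h^d
    d, n = len(x), len(data)
    V = h ** d
    k = sum(hypercube_kernel([(xj - xij) / h for xj, xij in zip(x, xi)])
            for xi in data)
    return k / (n * V)

# Three 1-d samples with window width h = 2: the window centered at 0.0
# spans [-1, 1] and captures the samples at -0.5 and 0.5 but not 3.0.
data = [[-0.5], [0.5], [3.0]]
print(parzen_estimate([0.0], data, h=2.0))   # (2/3)/2 = 1/3
```

Swapping `hypercube_kernel` for any other function that is nonnegative and integrates to one (e.g., a Gaussian) leaves the estimator structure unchanged.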

Page 14:

Parzen Window Estimation (cont.)

Is p_n(\mathbf{x}) a valid pdf? It is nonnegative, and setting \mathbf{u} = (\mathbf{x} - \mathbf{x}_i)/h_n (so that d\mathbf{x} = h_n^d\,d\mathbf{u} = V_n\,d\mathbf{u}) shows that it integrates to one:

\int p_n(\mathbf{x})\,d\mathbf{x} = \frac{1}{n} \sum_{i=1}^{n} \int \frac{1}{V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) d\mathbf{x} = \frac{1}{n} \sum_{i=1}^{n} \int \varphi(\mathbf{u})\,d\mathbf{u} = 1.

Ingredients of the Parzen pdf: the window function (itself a pdf), the window width h_n, and the training data.

Page 15:

Parzen Window Estimation (cont.)

Define the interpolation function

\delta_n(\mathbf{x}) = \frac{1}{V_n} \varphi\left(\frac{\mathbf{x}}{h_n}\right),

so that

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(\mathbf{x} - \mathbf{x}_i).

p_n(\mathbf{x}) is thus a superposition of n interpolation functions; each sample \mathbf{x}_i contributes to p_n(\mathbf{x}) according to its "distance" from \mathbf{x}.

What is the effect of h_n (the window width) on the Parzen pdf?

Page 16:

Parzen Window Estimation: the effect of h_n

\delta_n(\mathbf{x}) = \frac{1}{V_n} \varphi\left(\frac{\mathbf{x}}{h_n}\right) = \frac{1}{h_n^d} \varphi\left(\frac{\mathbf{x}}{h_n}\right)

h_n affects both the width (horizontal scale) and the amplitude (vertical scale) of \delta_n.

Page 17:

Parzen Window Estimation (cont.)

(figure: the shape of \delta_n(\mathbf{x}) = \frac{1}{V_n}\varphi(\mathbf{x}/h_n) for decreasing values of h_n, with \varphi(\cdot) a 2-d Gaussian pdf)

Page 18:

Parzen Window Estimation (cont.)

\delta_n(\mathbf{x}) = \frac{1}{V_n} \varphi\left(\frac{\mathbf{x}}{h_n}\right), \qquad p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(\mathbf{x} - \mathbf{x}_i)

When h_n is very large, \delta_n(\mathbf{x}) is broad with small amplitude: p_n(\mathbf{x}) is the superposition of n broad, slowly changing functions, i.e., smooth but with low resolution.

When h_n is very small, \delta_n(\mathbf{x}) is sharp with large amplitude: p_n(\mathbf{x}) is the superposition of n sharp pulses, i.e., variable/unstable but with high resolution.

Page 19:

Parzen Window Estimation (cont.)

(figure: Parzen window estimates for five samples, with \varphi(\cdot) a 2-d Gaussian pdf)

Page 20:

Parzen Window Estimation: convergence conditions

To ensure convergence, i.e.,

\lim_{n\to\infty} E[p_n(\mathbf{x})] = p(\mathbf{x}), \qquad \lim_{n\to\infty} \mathrm{Var}[p_n(\mathbf{x})] = 0,

we have the following additional constraints:

\sup_{\mathbf{u}} \varphi(\mathbf{u}) < \infty

\lim_{\|\mathbf{u}\|\to\infty} \varphi(\mathbf{u}) \prod_{i=1}^{d} u_i = 0

\lim_{n\to\infty} V_n = 0, \qquad \lim_{n\to\infty} n V_n = \infty

Page 21:

Illustrations: one-dimensional case, X \sim N(0, 1)

\varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}, \qquad h_n = h_1/\sqrt{n}

p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \varphi\left(\frac{x - x_i}{h_n}\right)

Page 22:

Illustrations: one-dimensional case (cont.)

(figure: further Parzen estimates with the same Gaussian kernel \varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2} and window widths h_n = h_1/\sqrt{n})

Page 23:

Illustrations: two-dimensional case

h_n = h_1/\sqrt{n}

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n^2} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

Page 24:

Classification Example

(figure: decision regions with a smaller window vs. a larger window)

Page 25:

Choosing Window Function

V_n must approach zero as n \to \infty, but at a rate slower than 1/n, e.g.,

V_n = V_1/\sqrt{n}.

The value of the initial volume V_1 is important: a cell volume that is proper for one region may be unsuitable in a different region.

Page 26:

k_n-Nearest Neighbor

Fix k_n and then determine V_n in p_n(\mathbf{x}) = \frac{k_n/n}{V_n}.

To estimate p(\mathbf{x}), center a cell about \mathbf{x} and let it grow until it captures k_n samples, where k_n is some specified function of n; a principled rule is

k_n = \sqrt{n}.

Then \lim_{n\to\infty} k_n = \infty and \lim_{n\to\infty} V_n = 0.
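The growing-cell idea can be sketched in one dimension (a hypothetical illustration of my own; the interval whose radius equals the distance to the k-th nearest sample plays the role of the cell):

```python
import math

def knn_density(x, data, k):
    # Grow an interval around x until it contains the k nearest samples,
    # then estimate p(x) = (k/n)/V_n, where V_n = 2r and r is the
    # distance to the k-th nearest sample (1-d "volume" = length).
    n = len(data)
    r = sorted(abs(x - xi) for xi in data)[k - 1]
    V = 2 * r
    return (k / n) / V

# k_n = sqrt(n), as suggested above (here n = 4, so k_n = 2)
data = [0.0, 1.0, 2.0, 10.0]
k = int(math.sqrt(len(data)))
print(knn_density(1.0, data, k))   # r = 1, V = 2, p = (2/4)/2 = 0.25
```

Note how the cell automatically shrinks where samples are dense and expands where they are sparse, in contrast to the fixed-volume Parzen window.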

Page 27:

kn-Nearest Neighbor

(figure, left: eight points in one dimension (n = 8, d = 1); red curve: k_n = 3, black curve: k_n = 5)

(figure, right: thirty-one points in two dimensions (n = 31, d = 2); black surface: k_n = 5)

Page 28:

Estimation of the Posterior Probability

P_n(\omega_i \mid \mathbf{x}) = ?

Place a cell of volume V_n around \mathbf{x} and suppose it captures k samples, k_i of which belong to class \omega_i. Then

p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V_n}, \qquad \sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j) = \frac{k/n}{V_n},

so

P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}.

Page 29:

Estimation of the Posterior Probability (cont.)

P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}, \qquad p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V_n}

The value of V_n or k_n can be determined based on the Parzen-window or the k_n-nearest-neighbor technique.

Page 30:

Nearest Neighbor Classifier

Store all training examples. Given a new example \mathbf{x} to be classified, search for the training example (\mathbf{x}_i, y_i) whose \mathbf{x}_i is most similar (closest) to \mathbf{x}, and predict y_i. (Lazy learning.)

This is the posterior estimate P_n(\omega_i \mid \mathbf{x}) = k_i/k with k = 1.
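A minimal sketch of the lazy 1-NN rule (using Euclidean distance; the data and names are illustrative, not from the slides):

```python
import math

def nearest_neighbor_predict(x, train):
    # Lazy learning: no training step -- just store the labeled examples
    # and, at query time, return the label of the closest stored point.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    xi, yi = min(train, key=lambda ex: dist(x, ex[0]))
    return yi

train = [([0.0, 0.0], "red"), ([1.0, 1.0], "blue"), ([0.9, 0.0], "blue")]
print(nearest_neighbor_predict([0.1, 0.1], train))   # -> "red"
```

All the computation happens at query time, which is exactly why the memory and search-cost issues discussed later arise.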

Page 31:

Decision Boundaries

The Voronoi diagram: given a set of points, a Voronoi diagram describes the areas that are nearest to each given point. These areas can be viewed as zones of control.

Page 32:

Decision Boundaries (cont.)

The decision boundary is formed by retaining only those Voronoi line segments that separate different classes.

The more training examples we have stored, the more complex the decision boundaries can become.

Page 33:

Decision Boundaries (cont.)

With a large number of examples and noise in the labels, the decision boundary can become nasty!

Note the "islands" in the figure: they are formed by noisy examples. If the nearest neighbor happens to be a noisy point, the prediction will be incorrect.

How to deal with this?

Page 34:

Effect of k

Different k values give different results:

A larger k produces smoother boundaries; the impacts of individual label noises cancel one another out.

When k is too large, the boundaries become oversimplified; e.g., with k = N we always predict the majority class.

Page 35:

How to Choose k?

Can we choose k to minimize the mistakes we make on training examples (the training error)? What is the training error of the nearest-neighbor rule?

Can we choose k to minimize the mistakes we make on test examples (the test error)?

Page 36:

How to Choose k? (cont.)

How do training error and test error change as we change the value of k?

Page 37:

Model Selection

Choosing k for k-NN is just one of the many model selection problems we face in machine learning.

Model selection is about choosing among different models:
  Linear regression vs. quadratic regression
  k-NN vs. decision tree

It is heavily studied in machine learning and of crucial importance in practice. If we use training error to select models, we will always choose the more complex ones.


Page 39:

Model Selection (cont.)

We can keep part of the labeled data apart as validation data: evaluate different k values by their prediction accuracy on the validation data, and choose the k that minimizes the validation error.

Validation can be viewed as another name for testing, but the name "testing" is typically reserved for final evaluation, whereas "validation" is mostly used for model selection.

Page 40:

Model Selection: the impact of validation set size

If we reserve only one point for the validation set, should we trust the validation error as a reliable estimate of our classifier's performance?

The larger the validation set, the more reliable our model selection choices are. When the total labeled set is small, we might not be able to set aside a big enough validation set, leading to unreliable model selection decisions.

Page 41:

Model Selection: K-fold Cross Validation

Split the labeled data into K subsets and perform learning/testing K times: each time, reserve one subset as the validation set and train on the rest.

Special case: leave-one-out cross validation.
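Choosing k for k-NN by K-fold cross validation can be sketched as follows. The toy 1-d data set is my own construction: class "a" clusters near 0, class "b" near 10, with one mislabeled "b" point planted in the "a" cluster so that k = 3 beats k = 1.

```python
from collections import Counter

def knn_predict(x, train, k):
    # Majority vote among the k nearest training examples (1-d inputs).
    nearest = sorted(train, key=lambda ex: abs(x - ex[0]))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def cross_val_error(data, k, folds=4):
    # K-fold cross validation: each fold in turn is the validation set,
    # the rest is the training set; return the mean error rate.
    size = len(data) // folds
    errors = 0
    for f in range(folds):
        val = data[f * size:(f + 1) * size]
        train = data[:f * size] + data[(f + 1) * size:]
        errors += sum(knn_predict(x, train, k) != y for x, y in val)
    return errors / (folds * size)

# Interleaved so each fold holds one point from each cluster;
# (0.3, "b") is the mislabeled (noisy) point.
data = [(0.0, "a"), (9.8, "b"), (0.2, "a"), (10.0, "b"),
        (0.3, "b"), (10.2, "b"), (0.4, "a"), (10.4, "b")]
best_k = min([1, 3], key=lambda k: cross_val_error(data, k))
print(best_k)   # -> 3
```

With k = 1, the noisy point flips the predictions of its neighbors; with k = 3, the vote smooths it out, so validation error picks the larger k.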

Page 42:

Other Issues of k-NN

It can be computationally expensive to find the nearest neighbors. Speed up the computation with smart data structures that quickly search for approximate solutions.

For large data sets, it requires a lot of memory. Remove unimportant examples.

Page 43:

Final Words on k-NN

k-NN is what we call lazy learning (vs. eager learning):
  Lazy: learning only occurs when you see the test example.
  Eager: learn a model before you see the test example; training examples can be thrown away after learning.

Advantages:
  Conceptually simple, easy to understand and explain
  Very flexible decision boundaries
  Not much learning at all!

Disadvantages:
  It can be hard to find a good distance measure
  Irrelevant features and noise can be very detrimental
  Typically cannot handle more than 30 attributes
  Computational cost: requires a lot of computation and memory

Page 44:

Distance Metrics

The distance measure is an important factor for a nearest-neighbor classifier, e.g., to achieve invariant pattern recognition and data mining results. Consider the effect of changing units.


Page 46:

Properties of a Distance Metric

Nonnegativity:       D(\mathbf{a}, \mathbf{b}) \ge 0
Reflexivity:         D(\mathbf{a}, \mathbf{b}) = 0 \iff \mathbf{a} = \mathbf{b}
Symmetry:            D(\mathbf{a}, \mathbf{b}) = D(\mathbf{b}, \mathbf{a})
Triangle inequality: D(\mathbf{a}, \mathbf{b}) + D(\mathbf{b}, \mathbf{c}) \ge D(\mathbf{a}, \mathbf{c})

Page 47:

Minkowski Metric (L_p Norm)

\|\mathbf{x}\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}

1. L_1 norm (Manhattan or city-block distance):
   L_1(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_1 = \sum_{i=1}^{d} |a_i - b_i|

2. L_2 norm (Euclidean distance):
   L_2(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \left( \sum_{i=1}^{d} |a_i - b_i|^2 \right)^{1/2}

3. L_\infty norm (chessboard distance):
   L_\infty(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_\infty = \lim_{p\to\infty} \left( \sum_{i=1}^{d} |a_i - b_i|^p \right)^{1/p} = \max_i |a_i - b_i|
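The three metrics can be sketched in a few lines (a minimal illustration; the sample points are my own):

```python
def minkowski(a, b, p):
    # L_p distance: (sum_i |a_i - b_i|^p)^(1/p)
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # L_inf distance: the limit p -> infinity is max_i |a_i - b_i|
    return max(abs(ai - bi) for ai, bi in zip(a, b))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))   # L1 (Manhattan):   7.0
print(minkowski(a, b, 2))   # L2 (Euclidean):   5.0
print(chebyshev(a, b))      # L_inf (chessboard): 4.0
```

Plugging any of these in place of the distance function of a nearest-neighbor classifier changes which stored point is "closest", which is why the choice of metric (and of units) matters.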


Page 49:

Summary

Basic setting for non-parametric techniques:
  Let the data speak for themselves: no parametric form is assumed for the class-conditional pdf.
  Estimate the class-conditional pdf from training examples, and make predictions based on Bayes' theorem.

Fundamental result in density estimation:

p_n(\mathbf{x}) = \frac{k_n/n}{V_n}

Page 50:

Summary: Parzen Windows

Fix V_n and then determine k_n in p_n(\mathbf{x}) = \frac{k_n/n}{V_n}:

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

Ingredients: the window function (itself a pdf), the window width h_n, and the training data.

Page 51:

Summary: k_n-Nearest-Neighbor

Fix k_n and then determine V_n. To estimate p(\mathbf{x}), center a cell about \mathbf{x} and let it grow until it captures k_n samples, where k_n is some specified function of n; a principled rule is

k_n = \sqrt{n},

so that \lim_{n\to\infty} k_n = \infty and \lim_{n\to\infty} V_n = 0.

Page 52:

Any Questions?