Chapter 4: Non-Parametric Estimation
MIMA Group, Xin-Shun Xu
School of Computer Science and Technology, Shandong University
Contents

- Introduction
- Parzen Windows
- k-Nearest-Neighbor Estimation
- Classification Techniques
  - The nearest-neighbor rule (1-NN)
  - The k-nearest-neighbor rule (k-NN)
- Distance Metrics
Bayes Rule for Classification

P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)

To compute the posterior probability, we need to know the prior probability P(ω_i) and the likelihood p(x | ω_i).

Case I: p(x | ω_i) has a known parametric form
- Maximum-Likelihood Estimation
- Bayesian Parameter Estimation

Problem: the assumed parametric form may not fit the ground-truth density encountered in practice, e.g., assumed form unimodal, ground truth multimodal.
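As a quick numeric illustration of the rule above, here is a minimal Python sketch; the prior and likelihood values are made-up assumptions, chosen only to show the arithmetic:

```python
# Hypothetical two-class problem: the priors P(w_i) and the likelihoods
# p(x | w_i) at one fixed observation x are assumed numbers, not from the slides.
priors = {"w1": 0.6, "w2": 0.4}
likelihood = {"w1": 0.2, "w2": 0.5}

evidence = sum(priors[w] * likelihood[w] for w in priors)            # p(x)
posterior = {w: priors[w] * likelihood[w] / evidence for w in priors}
print(posterior)   # {'w1': 0.375, 'w2': 0.625} -> decide w2
```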
Non-Parametric Estimation

Case II: p(x | ω_i) does not have a known parametric form. How?

Let the data speak for themselves!
- Parzen Windows
- k_n-Nearest-Neighbor
Goals

- Estimate class-conditional densities p(x | ω_i)
- Estimate posterior probabilities P(ω_i | x)
Density Estimation

Assume p(x) is continuous and the region R is small.

Fundamental fact: the probability that a vector x falls into a region R is

P_R = ∫_R p(x') dx' ≈ p(x) V_R,

where V_R is the volume of R.

[Figure: a small region R about the point x, among n samples.]

Given n i.i.d. samples {x_1, x_2, ..., x_n}, let K denote the random variable counting the samples that fall into R. K follows a binomial distribution:

K ~ B(n, P_R),   P(K = k) = C(n, k) P_R^k (1 − P_R)^(n−k).
Density Estimation (continued)

Since E[K] = n P_R and P_R ≈ p(x) V_R, we get

p(x) ≈ E[K] / (n V_R).

Let k_R denote the actual number of samples observed in R; then

p(x) ≈ (k_R / n) / V_R.
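A minimal Python sketch of this estimate, assuming a standard normal source so that the true density at the query point is known (p(0) = 1/√(2π) ≈ 0.399); the region width and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)          # n i.i.d. samples from N(0, 1)

center, h = 0.0, 0.2                     # region R = [center - h/2, center + h/2]
V = h                                    # volume (length) of R, since d = 1
k = np.sum(np.abs(x - center) < h / 2)   # k_R: number of samples falling in R

p_hat = (k / x.size) / V                 # p(x) ~= (k_R / n) / V_R
print(p_hat)                             # close to 1/sqrt(2*pi) ~= 0.399
```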
Density Estimation

Use the subscript n to take the sample size into account:

p_n(x) = (k_n / n) / V_n.

We hope that lim_{n→∞} p_n(x) = p(x). For this, we should have

lim_{n→∞} V_n = 0,   lim_{n→∞} k_n = ∞,   lim_{n→∞} k_n / n = 0.
Density Estimation

In p_n(x) = (k_n / n) / V_n, which quantities can be controlled, and how?

- Fix V_n and determine k_n → Parzen windows
- Fix k_n and determine V_n → k_n-nearest-neighbor
Parzen Windows

Fix V_n and determine k_n. Assume R_n is a d-dimensional hypercube whose edges have length h_n, so that

V_n = h_n^d.

k_n is determined with a window function, a.k.a. kernel function or potential function. (The method is named after Emanuel Parzen, 1929–2016.)
Window Function

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d; 0 otherwise.

It defines a unit hypercube centered at the origin, so φ((x − x_i)/h_n) = 1 iff |x_j − x_ij| ≤ h_n/2 for every coordinate j.

[Figure: the unit hypercube defined by φ(u), and the scaled hypercube with edge length h_n.]
Window Function

φ((x − x_i)/h_n) = 1 means that x_i falls within the hypercube of volume V_n centered at x. Hence k_n, the number of samples inside the hypercube centered at x, is

k_n = Σ_{i=1}^n φ((x − x_i)/h_n).
Parzen Window Estimation

Substituting k_n into p_n(x) = (k_n / n) / V_n gives the Parzen pdf:

p_n(x) = (1/n) Σ_{i=1}^n (1/V_n) φ((x − x_i)/h_n).

φ(·) is not limited to the hypercube window function defined previously; it can be any pdf:

φ(u) ≥ 0,   ∫ φ(u) du = 1.
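A small Python sketch of this estimator with the hypercube window; the 2-d Gaussian toy data and the query point are illustrative assumptions:

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen estimate p_n(x) with the unit-hypercube window phi."""
    x = np.atleast_2d(x)                           # (m, d) query points
    samples = np.atleast_2d(samples)               # (n, d) training points
    n, d = samples.shape
    u = (x[:, None, :] - samples[None, :, :]) / h  # (m, n, d) scaled offsets
    inside = np.all(np.abs(u) <= 0.5, axis=2)      # phi(u) = 1 iff |u_j| <= 1/2
    k = inside.sum(axis=1)                         # k_n per query point
    return k / (n * h**d)                          # (1/n) * k_n / V_n, V_n = h^d

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 2))               # assumed toy data
print(parzen_hypercube([[0.0, 0.0]], data, h=0.5))
```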
Parzen Window Estimation

Is p_n(x) a pdf? Setting u = (x − x_i)/h_n, so that dx = h_n^d du = V_n du:

∫ p_n(x) dx = (1/n) Σ_{i=1}^n ∫ (1/V_n) φ((x − x_i)/h_n) dx = (1/n) Σ_{i=1}^n ∫ φ(u) du = 1.

So the Parzen pdf is itself a pdf, built from three ingredients: the window function (a pdf), the window width h_n, and the training data.
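A numeric check of this claim, using a Gaussian window (allowed, since any pdf qualifies); the data, the width h, and the grid range are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(200)            # assumed 1-d training data
h = 0.3
grid = np.linspace(-6.0, 6.0, 2001)        # wide enough to cover the mass

def phi(u):                                # Gaussian window, itself a pdf
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

p_n = phi((grid[:, None] - data[None, :]) / h).mean(axis=1) / h
print(p_n.sum() * (grid[1] - grid[0]))     # Riemann sum of p_n ~= 1.0
```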
Parzen Window Estimation

p_n(x) is a superposition of n interpolation functions, where each sample x_i contributes to p_n(x) according to its "distance" from x. Define

δ_n(x) = (1/V_n) φ(x / h_n);

then

p_n(x) = (1/n) Σ_{i=1}^n δ_n(x − x_i).

What is the effect of h_n (the window width) on the Parzen pdf?
Parzen Window Estimation: The Effect of h_n

δ_n(x) = (1/V_n) φ(x / h_n) = (1/h_n^d) φ(x / h_n).

The argument x / h_n affects the width (horizontal scale); the factor 1/h_n^d affects the amplitude (vertical scale).
Parzen Window Estimation

[Figure: the shape of δ_n(x) = (1/V_n) φ(x / h_n) for decreasing values of h_n, with φ(·) a 2-d Gaussian pdf.]
Parzen Window Estimation

When h_n is very large, δ_n(x) is broad with small amplitude: p_n(x) is the superposition of n broad, slowly changing functions, i.e., smooth but of low resolution.

When h_n is very small, δ_n(x) is sharp with large amplitude: p_n(x) is the superposition of n sharp pulses, i.e., variable/unstable but of high resolution.
Parzen Window Estimation

[Figure: Parzen window estimates p_n(x) = (1/n) Σ_i δ_n(x − x_i) for five samples, with φ(·) a 2-d Gaussian pdf.]
Parzen Window Estimation: Convergence Conditions

To ensure convergence, i.e.,

lim_{n→∞} E[p_n(x)] = p(x)   and   lim_{n→∞} Var[p_n(x)] = 0,

we need the following additional constraints:

- sup_u φ(u) < ∞
- lim_{||u||→∞} φ(u) Π_{i=1}^d u_i = 0
- lim_{n→∞} V_n = 0
- lim_{n→∞} n V_n = ∞
Illustrations

One-dimensional case: X ~ N(0, 1), with a Gaussian window

φ(u) = (1/√(2π)) e^{−u²/2},   h_n = h_1/√n,

p_n(x) = (1/n) Σ_{i=1}^n (1/h_n) φ((x − x_i)/h_n).

[Figure: Parzen estimates for various sample sizes n and initial widths h_1.]
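A sketch reproducing this illustration numerically, under the same setup (X ~ N(0, 1), Gaussian window, h_n = h_1/√n); the particular sample sizes are arbitrary:

```python
import numpy as np

def parzen_gauss(x_grid, samples, h1):
    """1-d Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n)."""
    hn = h1 / np.sqrt(samples.size)
    u = (x_grid[:, None] - samples[None, :]) / hn
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=1) / hn

rng = np.random.default_rng(0)
grid = np.linspace(-4.0, 4.0, 201)
for n in (10, 100, 10_000):                # the estimate sharpens as n grows
    x = rng.standard_normal(n)             # X ~ N(0, 1), as on the slide
    p_n = parzen_gauss(grid, x, h1=1.0)
    print(n, p_n[100])                     # value at x = 0 -> ~0.399
```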
Illustrations

Two-dimensional case: Gaussian window, h_n = h_1/√n,

p_n(x) = (1/n) Σ_{i=1}^n (1/h_n²) φ((x − x_i)/h_n).

[Figure: Parzen estimates for various sample sizes n and initial widths h_1.]
Classification Example

[Figure: decision regions of Parzen-window classifiers trained with a smaller window vs. a larger window.]
Choosing the Window Function

V_n must approach zero as n → ∞, but at a rate slower than 1/n, e.g.,

V_n = V_1/√n.

The value of the initial volume V_1 is important. In some cases, a cell volume proper for one region may be unsuitable in a different region.
k_n-Nearest Neighbor

Fix k_n and then determine V_n. To estimate p_n(x) = (k_n / n) / V_n, center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g., k_n = √n.

A principled rule for choosing k_n:

lim_{n→∞} k_n = ∞,   lim_{n→∞} V_n = 0.
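A minimal 1-d sketch of this estimator, assuming Gaussian toy data and k_n = √n; the cell about each query point is the smallest interval containing its k_n nearest samples:

```python
import numpy as np

def knn_density(x_grid, samples, kn):
    """1-d k_n-NN estimate: grow a cell about x until it holds k_n samples."""
    dists = np.abs(x_grid[:, None] - samples[None, :])
    r = np.sort(dists, axis=1)[:, kn - 1]  # distance to the k_n-th neighbor
    V = 2 * r                              # cell [x - r, x + r], 1-d volume
    return (kn / samples.size) / V         # p_n(x) = (k_n / n) / V_n

rng = np.random.default_rng(0)
x = rng.standard_normal(400)               # assumed toy data
kn = int(np.sqrt(x.size))                  # k_n = sqrt(n), as suggested above
print(knn_density(np.linspace(-3.0, 3.0, 7), x, kn))
```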
k_n-Nearest Neighbor

[Figure: Eight points in one dimension (n = 8, d = 1); red curve: k_n = 3, black curve: k_n = 5. Thirty-one points in two dimensions (n = 31, d = 2); black surface: k_n = 5.]
Estimation of the A Posteriori Probability

P_n(ω_i | x) = ? Place a cell of volume V_n about x and suppose it captures k samples, k_i of which belong to class ω_i. Then

p_n(x, ω_i) = (k_i / n) / V_n,   Σ_{j=1}^c p_n(x, ω_j) = (k / n) / V_n,

so

P_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^c p_n(x, ω_j) = k_i / k.
The value of V_n or k_n can be determined with either the Parzen-window or the k_n-nearest-neighbor technique.
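A small sketch of the k_i / k estimate; the two-Gaussian toy data and the choice k = 7 are assumptions for illustration:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k):
    """Estimate P_n(w_i | x) as k_i / k among the k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return {c: float(np.mean(nearest == c)) for c in np.unique(y_train)}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)),   # class 0 centered at (-1, -1)
               rng.normal(+1, 1, (50, 2))])  # class 1 centered at (+1, +1)
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([0.8, 0.8]), X, y, k=7))   # mostly class 1
```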
Nearest Neighbor Classifier

Store all training examples. Given a new example x to be classified, search for the training example (x_i, y_i) whose x_i is most similar (closest) to x, and predict y_i. (Lazy learning.)

This is the estimate P_n(ω_i | x) = k_i / k with k = 1.
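A minimal 1-NN sketch of this rule; the three stored points and their labels are made up:

```python
import numpy as np

def nn_predict(x, X_train, y_train):
    """1-NN: predict the label of the closest stored training example."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array(["a", "b", "b"])
print(nn_predict(np.array([0.2, -0.1]), X_train, y_train))   # -> 'a'
```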
Decision Boundaries

The Voronoi diagram: given a set of points, a Voronoi diagram describes the areas that are nearest to each point. These areas can be viewed as zones of control.
Decision Boundaries

The decision boundary is formed by retaining only those line segments that separate different classes.

The more training examples we store, the more complex the decision boundary can become.
Decision Boundaries

With a large number of examples and noise in the labels, the decision boundary can become nasty! Note the islands in this figure: they are formed by noisy examples. If the nearest neighbor happens to be a noisy point, the prediction will be incorrect.

How do we deal with this?
Effect of k

Different k values give different results:

- A large k produces smoother boundaries: the impacts of individual label noises cancel each other out.
- When k is too large, the boundaries become oversimplified; e.g., with k = N we always predict the majority class.
How to Choose k?

Can we choose k to minimize the mistakes we make on training examples (the training error)? What is the training error of the nearest-neighbor rule?

Can we choose k to minimize the mistakes we make on test examples (the test error)?
How to Choose k?

How do the training error and the test error change as we change the value of k?
Model Selection

Choosing k for k-NN is just one of the many model selection problems we face in machine learning. Model selection is about choosing among different models:

- linear regression vs. quadratic regression
- k-NN vs. decision tree

It is heavily studied in machine learning and of crucial importance in practice. If we use the training error to select models, we will always choose the more complex ones.
Model Selection

We can keep part of the labeled data apart as validation data:

- evaluate different k values by their prediction accuracy on the validation data;
- choose the k that minimizes the validation error (see the sketch below).

Validation can be viewed as another name for testing, but the name "testing" is typically reserved for final evaluation, whereas validation is mostly used for model selection.
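A sketch of this selection procedure on assumed toy data: hold out part of the labeled set, score several values of k on it, and keep the k with the lowest validation error:

```python
import numpy as np

def knn_predict(x, X, y, k):
    """Majority vote among the k nearest neighbors (Euclidean distance)."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    vals, counts = np.unique(y[idx], return_counts=True)
    return vals[np.argmax(counts)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (60, 2)), rng.normal(1, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

perm = rng.permutation(len(y))             # hold out 40 points for validation
tr, va = perm[:80], perm[80:]

for k in (1, 3, 5, 9, 15):
    err = np.mean([knn_predict(X[i], X[tr], y[tr], k) != y[i] for i in va])
    print(k, err)                          # pick the k with the lowest error
```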
Model Selection

The impact of the validation set size:

- If we reserve only one point for the validation set, should we trust the validation error as a reliable estimate of the classifier's performance?
- The larger the validation set, the more reliable our model selection choices are.
- When the total labeled set is small, we may not be able to set aside a big enough validation set, leading to unreliable model selection decisions.
Model Selection

K-fold cross validation: perform learning/testing K times; each time, reserve one subset as the validation set and train on the rest.

Special case: leave-one-out cross validation.
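A minimal sketch of the K-fold split itself (the learner is left as a comment); the sizes n = 100 and K = 5 are arbitrary assumptions:

```python
import numpy as np

def kfold_indices(n, K, rng):
    """Split the indices {0, ..., n-1} into K roughly equal random folds."""
    return np.array_split(rng.permutation(n), K)

rng = np.random.default_rng(0)
folds = kfold_indices(100, K=5, rng=rng)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate on val_idx, then average the K scores
    print(i, len(train_idx), len(val_idx))
```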
Other Issues of k-NN

It can be computationally expensive to find the nearest neighbors! The computation can be sped up with smart data structures that quickly search for approximate solutions.

For large data sets, it also requires a lot of memory; unimportant examples can be removed.
Final Words on k-NN

k-NN is what we call lazy learning (vs. eager learning):

- Lazy: learning only occurs when you see the test example.
- Eager: learn a model before you see the test example; the training examples can be thrown away after learning.

Advantages:
- conceptually simple, easy to understand and explain
- very flexible decision boundaries
- not much learning at all!

Disadvantages:
- it can be hard to find a good distance measure
- irrelevant features and noise can be very detrimental
- typically cannot handle more than about 30 attributes
- computational cost: requires a lot of computation and memory
Distance Metrics

The distance measure is an important factor for the nearest-neighbor classifier, e.g., for achieving invariant pattern recognition and data mining results. Consider the effect of changing units of measurement.
Properties of a Distance Metric

- Nonnegativity: D(a, b) ≥ 0
- Reflexivity: D(a, b) = 0 iff a = b
- Symmetry: D(a, b) = D(b, a)
- Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)
Minkowski Metric (L_p Norm)

||x||_p = (Σ_{i=1}^d |x_i|^p)^{1/p}

1. L_1 norm (Manhattan or city-block distance): L_1(a, b) = ||a − b||_1 = Σ_{i=1}^d |a_i − b_i|
2. L_2 norm (Euclidean distance): L_2(a, b) = ||a − b||_2 = (Σ_{i=1}^d |a_i − b_i|²)^{1/2}
3. L_∞ norm (chessboard distance): L_∞(a, b) = ||a − b||_∞ = max_i |a_i − b_i|
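The three cases can share one implementation; here is a small sketch (the test vectors are made up):

```python
import numpy as np

def minkowski(a, b, p):
    """L_p distance between vectors a and b; p = np.inf gives the max norm."""
    d = np.abs(np.asarray(a) - np.asarray(b))
    return d.max() if np.isinf(p) else (d**p).sum() ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, 1))        # 7.0  Manhattan / city block
print(minkowski(a, b, 2))        # 5.0  Euclidean
print(minkowski(a, b, np.inf))   # 4.0  chessboard
```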
Summary

Basic setting for non-parametric techniques:

- Let the data speak for themselves: no parametric form is assumed for the class-conditional pdf.
- Estimate the class-conditional pdf from training examples.
- Make predictions based on Bayes' theorem.

Fundamental result in density estimation:

p_n(x) = (k_n / n) / V_n.
Summary

Parzen windows: fix V_n and then determine k_n:

p_n(x) = (k_n / n) / V_n = (1/n) Σ_{i=1}^n (1/V_n) φ((x − x_i)/h_n),

built from the window function (itself a pdf), the window width, and the training data.
Summary

k_n-nearest-neighbor: fix k_n and then determine V_n. To estimate p(x), center a cell about x and let it grow until it captures k_n samples, where k_n is some specified function of n, e.g., k_n = √n.

A principled rule for choosing k_n: lim_{n→∞} k_n = ∞, lim_{n→∞} V_n = 0.
Any Question?