Page 1: The Nearest-Neighbor Classifier

The Nearest-Neighbor Classifier

NASA Space Program

Page 2: The Nearest-Neighbor Classifier

Outline
• Training and test data accuracy
• The nearest-neighbor classifier

Page 3: The Nearest-Neighbor Classifier

Classification Accuracy
• Say we have N feature vectors
• Say we know the true class label for each feature vector
• We can measure how accurate a classifier is by how many feature vectors it classifies correctly
• Accuracy = percentage of feature vectors correctly classified
  - training accuracy = accuracy on the training data
  - test accuracy = accuracy on new data not used in training
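As a concrete illustration (not part of the original slides), this definition is a one-liner in MATLAB, assuming predictions and truelabels are both N x 1 vectors of class labels:

    % Illustrative: accuracy as the percentage of correctly classified vectors.
    % predictions and truelabels are assumed N x 1 vectors of class labels.
    accuracy = 100 * sum(predictions == truelabels) / length(truelabels);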

Page 4: The Nearest-Neighbor Classifier

Training Data and Test Data

• Training data: labeled data used to build a classifier
• Test data: new data, not used in the training process, used to evaluate how well a classifier does on new data
• Memorization versus generalization:
  - better training accuracy corresponds to "memorizing" the training data
  - better test accuracy corresponds to "generalizing" to new data
• In general, we would like our classifier to perform well on new test data, not just on the training data, i.e., we would like it to generalize to new data

Page 5: The Nearest-Neighbor Classifier

Examples of Training and Test Data

• Speech recognition
  - Training data: words recorded and labeled in a laboratory
  - Test data: words recorded from new speakers, in new locations
• Zipcode recognition
  - Training data: zipcodes manually selected, scanned, and labeled
  - Test data: actual letters being scanned in a post office
• Credit scoring
  - Training data: historical database of loan applications, with payment history or the decision made at that time
  - Test data: you

Page 6: The Nearest-Neighbor Classifier

Some Notation
• Training data: Dtrain = { [x(1), c(1)], [x(2), c(2)], ..., [x(N), c(N)] }
  - N pairs of feature vectors and class labels
• Feature vectors and class labels:
  - x(i) is the ith training data feature vector; in MATLAB this could be the ith row of an N x d matrix
  - c(i) is the class label of the ith feature vector
  - in general, c(i) can take m different class values, e.g., c = 1, c = 2, ...
• Let y be a new feature vector whose class label we do not know, i.e., we wish to classify it
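To make the notation concrete, here is a minimal MATLAB sketch (the numeric values are invented for illustration):

    % Illustrative sketch: Dtrain stored as a feature matrix plus a label
    % vector, matching the notation above. Values are made up.
    traindata   = [0.2 1.1; 0.5 0.9; 2.3 3.0];  % N x d matrix; row i is x(i)
    trainlabels = [1; 1; 2];                    % N x 1 vector; entry i is c(i)
    y           = [0.4 1.0];                    % a new feature vector to classify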

Page 7: The Nearest-Neighbor Classifier

Nearest Neighbor Classifier

• y is a new feature vector whose class label is unknown
• Search Dtrain for the closest feature vector to y; let this "closest feature vector" be x(j)
• Classify y with the same label as x(j), i.e., y is assigned the label c(j)
• How is the "closest" x determined?
  - typically by minimum Euclidean distance:
    dE(x, y) = sqrt( Σi (xi - yi)^2 ), summing over the d feature dimensions
• Side note: this produces a "Voronoi tessellation" of the d-dimensional space
  - each point "claims" a cell surrounding it
  - cell boundaries are polygons
• Analogous to "memory-based" reasoning in humans
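A minimal MATLAB sketch of this rule (illustrative, not the assignment solution), using the variables traindata, trainlabels, and y introduced on the notation slide:

    % Minimal 1-nearest-neighbor step.
    % traindata is N x d, trainlabels is N x 1, y is a 1 x d row vector.
    diffs = traindata - repmat(y, size(traindata, 1), 1);  % subtract y from every row
    dists = sqrt(sum(diffs.^2, 2));                        % Euclidean distance to each x(i)
    [mindist, j] = min(dists);                             % j = index of the closest x(j)
    predicted_class = trainlabels(j);                      % assign y the label c(j)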

Page 8: The Nearest-Neighbor Classifier

Geometric Interpretation of Nearest Neighbor

[Figure: scatter plot of training points labeled 1 and 2 in the Feature 1 vs. Feature 2 plane]

Page 9: The Nearest-Neighbor Classifier

Regions for Nearest Neighbors

[Figure: the same scatter plot, with the cell around each data point outlined]

Each data point defines a "cell" of space that is closest to it. All points within that cell are assigned that class.

Page 10: The Nearest-Neighbor Classifier

Nearest Neighbor Decision Boundary

[Figure: the scatter plot with the nearest-neighbor decision boundary highlighted]

Overall decision boundary = union of the cell boundaries where the class decision is different on each side.

Page 11: The Nearest-Neighbor Classifier

How should the new point be classified?

[Figure: the scatter plot with a new, unlabeled query point marked "?"]

Page 12: The Nearest-Neighbor Classifier

Local Decision Boundaries

[Figure: the scatter plot with the query point "?" and the local boundary between its nearest class-1 and class-2 neighbors]

Boundary? Points that are equidistant between points of class 1 and class 2.
Note: locally, the boundary is
(1) linear (because of Euclidean distance),
(2) halfway between the two class points, and
(3) at right angles to the connector between them.
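A short derivation of why the local boundary is linear (added here; it is implicit in the slide): the boundary consists of points y equidistant from the two points x(1) and x(2), i.e.,

    ||y - x(1)||^2 = ||y - x(2)||^2

Expanding both sides, the ||y||^2 terms cancel, leaving

    2 (x(2) - x(1)) · y = ||x(2)||^2 - ||x(1)||^2

which is linear in y: a hyperplane at right angles to the connector x(2) - x(1), passing through its midpoint.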

Page 13: The Nearest-Neighbor Classifier

Finding the Decision Boundaries

[Figure: the scatter plot with the query point "?" as local boundary segments are drawn in]


Page 16: The Nearest-Neighbor Classifier

Overall Boundary = Piecewise Linear

[Figure: the complete piecewise-linear boundary separating the decision region for class 1 from the decision region for class 2]

Page 17: The Nearest-Neighbor Classifier

Geometric Interpretation of kNN (k=1)

[Figure: the same scatter plot and query point "?", viewed as kNN with k = 1]

Page 18: The Nearest-Neighbor Classifier

More Data Points

[Figure: a denser scatter plot with many more class-1 and class-2 points in the Feature 1 vs. Feature 2 plane]

Page 19: The Nearest-Neighbor Classifier

More Complex Decision Boundary

[Figure: the denser data set with its more complex decision boundary]

In general: the nearest-neighbor classifier produces piecewise linear decision boundaries.

Page 20: The Nearest-Neighbor Classifier

K-Nearest Neighbor (kNN) Classifier

• Find the k nearest neighbors to y in Dtrain:
  - rank the feature vectors according to Euclidean distance
  - select the k vectors with the smallest distances to y
• Classification:
  - the ranking yields k feature vectors and a set of k class labels
  - pick the class label that is most common in this set (a "vote"), as sketched below
  - classify y as belonging to this class
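A MATLAB sketch of this ranking-and-voting step (illustrative variable names; note that mode breaks ties by returning the smallest of the most frequent labels, which the odd-k note on a later slide makes irrelevant for two classes):

    % Sketch of the kNN ranking-and-voting step.
    dists = sqrt(sum((traindata - repmat(y, size(traindata, 1), 1)).^2, 2));
    [sorted, order] = sort(dists);       % rank training vectors by distance to y
    kclasses = trainlabels(order(1:k));  % labels of the k nearest neighbors
    predicted_class = mode(kclasses);    % majority vote: most common label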

Page 21: The Nearest-Neighbor Classifier

K-Nearest Neighbor (kNN) Classifier

Theoretical considerations: as k increases,
• we are averaging over more neighbors
• the effective decision boundary becomes smoother
• as N increases, the optimal k value tends to increase in proportion to log N (a heuristic sketch follows)
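One simple way to turn the log N guideline into an odd k (a heuristic sketch, not a prescription from the slides):

    % Heuristic sketch: choose an odd k on the order of log N.
    N = size(traindata, 1);
    k = round(log(N));        % natural log of the training set size
    if mod(k, 2) == 0         % force k to be odd to avoid ties
        k = k + 1;
    end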

Page 22: The Nearest-Neighbor Classifier

K-Nearest Neighbor (kNN) Classifier

Notes:
• In effect, the classifier uses the k nearest feature vectors from Dtrain to "vote" on the class label for y
• The single-nearest-neighbor classifier is the special case k = 1
• For two-class problems, if we choose k to be odd (i.e., k = 1, 3, 5, ...) then there can never be any "ties"
• "Training" is trivial for the kNN classifier: we simply use Dtrain as a lookup table when we want to classify a new feature vector

Page 23: The Nearest-Neighbor Classifier

K-Nearest Neighbor (kNN) Classifier

Extensions of the nearest-neighbor classifier:
• weighted distances
  - e.g., if some of the features are more important than others
  - e.g., if some features are irrelevant
• fast search techniques (indexing) to find the k nearest neighbors in d-space

Page 24: The Nearest-Neighbor Classifier

Accuracy on Training Data versus Test Data

Training accuracy = (1/N) Σi∈Dtrain I( o(i), c(i) )

where I( o(i), c(i) ) = 1 if o(i) = c(i), and 0 otherwise; o(i) is the output of the classifier for training feature vector x(i), and c(i) is the true class for training data vector x(i).

Let Dtest be a set of new data, unseen in the training process, but assume that Dtest is generated by the same "mechanism" as Dtrain:

Test accuracy = (1/Ntest) Σj∈Dtest I( o(j), c(j) )

Test accuracy is what we are really interested in; unfortunately, test accuracy is usually lower on average than training accuracy.
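Both sums translate directly into MATLAB; a sketch with illustrative variable names:

    % Direct translation of the two accuracy formulas.
    % o_train, c_train: N x 1 classifier outputs and true labels on Dtrain;
    % o_test,  c_test:  Ntest x 1 outputs and true labels on Dtest.
    train_accuracy = sum(o_train == c_train) / length(c_train);
    test_accuracy  = sum(o_test  == c_test)  / length(c_test);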

Page 25: The Nearest-Neighbor Classifier

Test Accuracy and Generalization

• The accuracy of our classifier on new, unseen data is a fair/honest assessment of its performance
• Why is training accuracy not good enough?
  - Training accuracy is optimistic
  - a classifier like nearest-neighbor can construct boundaries that always separate all training data points, but that do not separate new points
  - e.g., what is the training accuracy of kNN with k = 1? (100%, since each training point is its own nearest neighbor, assuming no duplicate feature vectors with conflicting labels)
• A flexible classifier can "overfit" the training data: in effect it just memorizes the training data, but does not learn the general relationship between x and C

Page 26: The Nearest-Neighbor Classifier

Test Accuracy and Generalization

• We are really interested in how our classifier generalizes to new data
• Test data accuracy is a good estimate of generalization performance

Page 27: The Nearest-Neighbor Classifier

Assignment

The assignment has 3 parts:
• classplot: plot classification data in two dimensions
• knn: implement a nearest-neighbor classifier
• knn_test: test the effect of the value of k on the accuracy of the classifier

Test data: knnclassifier.mat

Page 28: The Nearest-Neighbor Classifier

Plotting Function

function classplot(data, x, y);
% function classplot(data, x, y);
%
% brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%  data: a structure with the same fields as described above
%        (your comment header should describe the structure explicitly)
%  Note that if you are only using certain fields of the structure
%  in the function below, you need only define those fields in the
%  input comments

% -------- Your code goes here --------
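A minimal sketch of what the body might look like, assuming (hypothetically) that the structure has fields data.features (N x d) and data.labels (N x 1), and that x and y index the two feature columns to plot; the real field names are in the assignment handout, not reproduced here:

    % Hypothetical sketch: the field names data.features and data.labels
    % are assumptions, not the assignment's actual structure definition.
    classes = unique(data.labels);
    markers = {'bo', 'rx', 'gs', 'md'};           % one marker/color per class
    hold on;
    for i = 1:length(classes)
        idx = (data.labels == classes(i));        % rows belonging to this class
        plot(data.features(idx, x), data.features(idx, y), ...
             markers{mod(i - 1, length(markers)) + 1});
    end
    xlabel(sprintf('Feature %d', x));
    ylabel(sprintf('Feature %d', y));
    hold off;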

Page 29: The Nearest-Neighbor Classifier

Nearest Neighbor Classifier

function [class_predictions] = knn(traindata, trainlabels, k, testdata)
% function [class_predictions] = knn(traindata, trainlabels, k, testdata)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%  traindata: a N1 x d matrix of feature data (the "memory" for kNN)
%  trainlabels: a N1 x 1 vector of class labels for traindata
%  k: an odd positive integer indicating the number of neighbors to use
%  testdata: a N2 x d matrix of feature data for testing the knn classifier
%
% Outputs
%  class_predictions: N2 x 1 vector of predicted class values

-------- Pseudocode --------
Read in the training data set Dtrain
For each feature vector y in testdata:
    kneighbors = the k nearest neighbors to y in Dtrain
    kclasses = class values of the kneighbors
    kvote = the most common class value in kclasses
    predicted_class(y) = kvote
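One way the pseudocode might be realized in MATLAB (a sketch under the stated input conventions, not the official solution):

    function [class_predictions] = knn(traindata, trainlabels, k, testdata)
    % Sketch implementation of the pseudocode above.
    N2 = size(testdata, 1);
    class_predictions = zeros(N2, 1);
    for i = 1:N2
        y = testdata(i, :);                                    % vector to classify
        diffs = traindata - repmat(y, size(traindata, 1), 1);
        dists = sqrt(sum(diffs.^2, 2));                        % distances to Dtrain
        [sorted, order] = sort(dists);                         % rank by distance
        kclasses = trainlabels(order(1:k));                    % k nearest labels
        class_predictions(i) = mode(kclasses);                 % majority vote
    end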

Page 30: The Nearest-Neighbor Classifier

Accuracy of kNN Classifier as k is varied

function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
% function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
%
% a brief description of what the function does
% ......
% Your Name, ICS 175A, date
%
% Inputs
%  traindata: a N1 x d matrix of feature data (the "memory" for kNN)
%  trainlabels: a N1 x 1 vector of class labels for traindata
%  testdata: a N2 x d matrix of feature data for testing the knn classifier
%  testlabels: a N2 x 1 vector of class labels for testdata
%  kmax: an odd positive integer indicating the maximum number of neighbors
%  plotflag: (optional argument) if 1, the accuracy versus k is plotted;
%            otherwise no plot
%
% Outputs
%  accuracies: r x 1 vector of accuracies on testdata, where r is the
%              number of values of k that are tested

-------- Pseudocode --------
Read in the training data set Dtrain and the test set Dtest
For k = 1, 3, 5, ..., kmax (odd numbers):
    classify each point in Dtest using the k nearest neighbors in Dtrain
    accuracy_k = 100 * (number of points correctly classified) / (number of points in Dtest)
end
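A sketch of how knn_test might drive the knn function above (illustrative, not the official solution):

    function [accuracies] = knn_test(traindata, trainlabels, testdata, testlabels, kmax, plotflag)
    % Sketch implementation of the pseudocode above.
    kvalues = 1:2:kmax;                              % odd k only: 1, 3, 5, ..., kmax
    accuracies = zeros(length(kvalues), 1);
    for r = 1:length(kvalues)
        predictions = knn(traindata, trainlabels, kvalues(r), testdata);
        accuracies(r) = 100 * sum(predictions == testlabels) / length(testlabels);
    end
    if nargin > 5 && plotflag == 1                   % plotflag is optional
        plot(kvalues, accuracies, 'o-');             % accuracy versus k
        xlabel('k'); ylabel('test accuracy (%)');
    end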

Page 31: The Nearest-Neighbor Classifier

Summary

Important concepts:
• classification is an important component in intelligent systems
• a classifier is a mapping from feature space to a class label
• decision boundaries are the boundaries between classes
• classification learning: using training data to define a classifier
• the nearest-neighbor classifier
• training accuracy versus test accuracy