Classification I

Lecturer: Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

Overview

K-Nearest Neighbor Algorithm

Naïve Bayes Classifier

Thomas Bayes

Classification

Definition

Classification is one of the fundamental skills for survival, e.g., Food vs. Predator.

A kind of supervised learning: techniques for deducing a function from data.

<Input, Output>

Input: a vector of features

Output: a Boolean value (binary classification) or integer (multiclass)

“Supervised” means: A teacher or oracle is needed to label each data sample.

We will talk about unsupervised learning later.

Classifiers

[Figure: Jane and Jack separated by height alone, and by height and weight together; the classifier Z = f(x, y) outputs a label in {boy, girl}.]

Training a Classifier

Learning

Lazy Learners

Neighborhood

K-Nearest Neighbor Algorithm

The algorithm procedure: Given a set of n training data in the form of <x, y>.

Given an unknown sample x′.

Calculate the distance d(x′, xi) for i=1 … n.

Select the K samples with the shortest distances.

Assign x′ the label that dominates the K samples.
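The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy height/weight data and the choice of Euclidean distance are assumptions made for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_data, query, k, distance=euclidean):
    """Label `query` by majority vote among its k nearest training samples.

    training_data: list of (feature_vector, label) pairs, i.e. <x, y>.
    """
    # Step 1: order the training samples by distance to the query.
    by_distance = sorted(training_data, key=lambda pair: distance(query, pair[0]))
    # Step 2: keep the k closest samples.
    nearest = by_distance[:k]
    # Step 3: the dominant label wins (ties go to the label seen first).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (height in cm, weight in kg) -> label.
training_data = [
    ((180, 80), "boy"), ((175, 72), "boy"), ((183, 85), "boy"),
    ((160, 50), "girl"), ((165, 55), "girl"), ((158, 48), "girl"),
]
print(knn_classify(training_data, (178, 76), k=3))  # boy
```

Note that there is no training step at all: every call scans the stored samples, which is why the cost grows with the size of the training data.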

It is the simplest classifier you will ever meet (I mean it!).

No training (literally): only a memory of the training data is maintained.

All computation is deferred until classification.

Produces satisfactory results in many cases, so it is worth a try whenever possible.

Properties of KNN

Instance-Based Learning

No explicit description of the target function

Can handle complicated situations.

Properties of KNN

Dependent on the data distribution.

Can make mistakes at boundaries.

K=7 Neighborhood

K=1 Neighborhood

Challenges of KNN

The Value of K: Non-monotonic impact on accuracy

Too Big vs. Too Small

Rules of thumb

Weights: Different features may have different impacts …

Distance: There are many different ways to measure the distance.

Euclidean, Manhattan …

Complexity: Need to calculate the distance between x′ and all training data.

The cost is in proportion to the size of the training data.

Distance Metrics

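Minkowski distance of order k: $L_k(x, y) = \left( \sum_i |x_i - y_i|^k \right)^{1/k}$ (k = 2 gives the Euclidean distance)

Manhattan distance (k = 1): $L_1(x, y) = \sum_i |x_i - y_i|$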
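As a small illustration, the Manhattan and Euclidean distances can both be written as special cases of the Minkowski distance. The function names below are chosen for this sketch only.

```python
def minkowski(x, y, k):
    """Minkowski distance of order k between equal-length vectors."""
    return sum(abs(xi - yi) ** k for xi, yi in zip(x, y)) ** (1.0 / k)

def manhattan(x, y):
    """Order 1: sum of absolute coordinate differences."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Order 2: straight-line distance."""
    return minkowski(x, y, 2)

p, q = (0, 0), (3, 4)
print(manhattan(p, q))  # 7.0
print(euclidean(p, q))  # 5.0
```

The same pair of points is 7 apart under one metric and 5 apart under the other, so the choice of metric can change which neighbors count as nearest.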

Distance Metrics

The shortest path between two points …

Mahalanobis Distance

Distance from a point to a point set

Mahalanobis Distance

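$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$, where $\mu$ is the mean of the point set and $S$ is its covariance matrix.

For identity matrix S: $D_M(x) = \sqrt{(x - \mu)^T (x - \mu)}$ (Euclidean distance)

For diagonal matrix S: $D_M(x) = \sqrt{\sum_i (x_i - \mu_i)^2 / \sigma_i^2}$ (normalized Euclidean distance)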
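A minimal sketch of the Mahalanobis distance for 2-D data follows, assuming the standard definition with the inverse covariance matrix. It is restricted to two dimensions so the matrix inverse can be written out by hand; the function names and the use of the sample covariance (dividing by n - 1) are assumptions of this example.

```python
import math

def mean_and_covariance(points):
    """Sample mean and 2x2 sample covariance matrix of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(x, mu, S):
    """sqrt((x - mu)^T S^{-1} (x - mu)) for a 2x2 matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

# With the identity matrix the result is the ordinary Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis((3.0, 4.0), (0.0, 0.0), identity))  # 5.0

# With a diagonal matrix each axis is rescaled by its variance.
diagonal = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis((2.0, 0.0), (0.0, 0.0), diagonal))  # 1.0
```

To measure the distance from a point to a point set, compute `mu, S = mean_and_covariance(points)` first and pass both to `mahalanobis`.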

Voronoi Diagram

perpendicular bisector

Voronoi Diagram

Structured Data

KD-Tree

Point Set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}

KD-Tree

from collections import namedtuple

Node = namedtuple("Node", ["location", "left_child", "right_child"])

def kdtree(point_list, depth=0):
    if not point_list:
        return None
    # Select axis based on depth so that axis cycles through all valid values
    k = len(point_list[0])
    axis = depth % k
    # Sort point list and choose median as pivot element
    points = sorted(point_list, key=lambda p: p[axis])
    median = len(points) // 2
    # Create node and construct subtrees
    return Node(
        location=points[median],
        left_child=kdtree(points[:median], depth + 1),
        right_child=kdtree(points[median + 1:], depth + 1),
    )
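To show why the tree helps, here is a hedged sketch of a nearest-neighbor query on a KD-tree built from the point set above. The block is self-contained and repeats a median-split construction; the names `build_kdtree`, `sq_dist` and `nearest` are chosen for this example, and the search shown is the usual branch-and-prune scheme rather than anything specific to these slides.

```python
from collections import namedtuple

Node = namedtuple("Node", ["location", "left", "right"])

def build_kdtree(points, depth=0):
    """Median-split construction, cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return Node(points[median],
                build_kdtree(points[:median], depth + 1),
                build_kdtree(points[median + 1:], depth + 1))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to `query`."""
    if node is None:
        return best
    if best is None or sq_dist(node.location, query) < sq_dist(best, query):
        best = node.location
    axis = depth % len(query)
    diff = query[axis] - node.location[axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    # Search the side of the splitting plane that contains the query first.
    best = nearest(near, query, depth + 1, best)
    # Visit the other side only if the plane is closer than the best match.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(tree.location)          # (7, 2)
print(nearest(tree, (9, 2)))  # (8, 1)
```

For the query (9, 2) the search never descends into the left half of the tree, because the splitting plane x = 7 is farther away than the best match already found.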

KD-Tree

Evaluation

Accuracy

Recall what we have learned in the first lecture … Confusion Matrix, ROC Curve

Training Set vs. Test Set

N-fold Cross Validation

[Figure: the data is split into N parts; each part serves as the test set in turn.]

Leave One Out Cross Validation

An extreme case of N-fold cross validation

N=number of available samples

Usually very time consuming but okay for KNN
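A minimal sketch of KNN evaluated with N-fold and leave-one-out cross validation is given below. The toy samples, the labels "M"/"F" and the way folds are formed (every N-th sample) are assumptions made for this illustration.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training samples closest to `query`."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda s: sq_dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def n_fold_accuracy(samples, n_folds, k):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for fold in range(n_folds):
        # Fold `fold` holds every n_folds-th sample; the rest is the training set.
        test = samples[fold::n_folds]
        train = [s for i, s in enumerate(samples) if i % n_folds != fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(samples)

def loocv_accuracy(samples, k):
    """Leave-one-out: N-fold cross validation with N = number of samples."""
    return n_fold_accuracy(samples, len(samples), k)

# Hypothetical toy data: (height in cm, weight in kg) -> label.
samples = [
    ((180, 80), "M"), ((175, 72), "M"), ((183, 85), "M"), ((170, 68), "M"),
    ((160, 50), "F"), ((165, 55), "F"), ((158, 48), "F"), ((168, 60), "F"),
]
for k in (1, 3, 5):
    print(k, loocv_accuracy(samples, k))
```

LOOCV is affordable for KNN because there is no model to retrain: holding out a sample only means skipping it during the distance scan.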

Now, let’s try KNN+LOOCV …

All students in this class are given one of two labels.

Gender: Male vs. Female

Major: CS vs. EE vs. Automation

10 Minutes …

Bayes Theorem

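$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

$\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{evidence}}$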

Bayes Theorem

Fish Example

Salmon vs. Tuna

P(ω1)=P(ω2)

P(ω1)>P(ω2)

Additional information

$P(\omega_i \mid x) = \dfrac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)}$

Shooting Example

Probability of Kill: P(A) = 0.6; P(B) = 0.5

The target is killed with: one shot from A and one shot from B.

What is the probability that it is shot down by A? C: The target is killed.

$P(A \mid C) = \dfrac{P(C \mid A)\,P(A)}{P(C)} = \dfrac{1 \times 0.6}{0.6 \times 0.5 + 0.4 \times 0.5 + 0.6 \times 0.5} = 0.75$
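The shooting example can be checked by enumerating the four joint outcomes. This sketch assumes the two shots are independent, which is what the products 0.6 × 0.5 and 0.4 × 0.5 imply; the variable names are chosen for the example.

```python
from itertools import product

p_hit = {"A": 0.6, "B": 0.5}  # per-shot hit probabilities from the example

p_killed = 0.0        # P(C): at least one shot hits
p_killed_and_a = 0.0  # P(A hits and C)
for a_hits, b_hits in product([True, False], repeat=2):
    # Probability of this joint outcome, assuming independent shots.
    p = (p_hit["A"] if a_hits else 1 - p_hit["A"]) * \
        (p_hit["B"] if b_hits else 1 - p_hit["B"])
    if a_hits or b_hits:
        p_killed += p
        if a_hits:
            p_killed_and_a += p

p_a_given_killed = p_killed_and_a / p_killed
print(round(p_killed, 4), round(p_a_given_killed, 4))  # 0.8 0.75
```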

Cancer Example

ω1: Cancer; ω2: Normal

P(ω1)=0.008; P(ω2)=0.992

Lab Test Outcomes: + vs. –

P(+|ω1)=0.98; P(-|ω1)=0.02

P(+|ω2)=0.03; P(-|ω2)=0.97

Now someone has a positive test result…

Is he/she doomed?

Cancer Example

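$P(+ \mid \omega_1)\,P(\omega_1) = 0.98 \times 0.008 = 0.0078$

$P(+ \mid \omega_2)\,P(\omega_2) = 0.03 \times 0.992 = 0.0298$

$P(\omega_1 \mid +) = \dfrac{0.0078}{0.0078 + 0.0298} = 0.21$

$P(\omega_1 \mid +) < P(\omega_2 \mid +)$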
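The same calculation in code, as a small sketch: the helper `posteriors` and the class names "cancer" and "normal" are labels chosen for this example, while the numbers are those given for the lab test.

```python
def posteriors(priors, likelihoods):
    """Bayes theorem over a finite set of classes.

    priors[c] = P(c); likelihoods[c] = P(observation | c).
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(observation)
    return {c: joint[c] / evidence for c in joint}

priors = {"cancer": 0.008, "normal": 0.992}
p_positive = {"cancer": 0.98, "normal": 0.03}  # P(+ | class)

result = posteriors(priors, p_positive)
print(round(result["cancer"], 2), round(result["normal"], 2))  # 0.21 0.79
```

Even with a positive result, the small prior keeps the posterior probability of cancer at roughly one in five.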

Headache & Flu Example

H=“Having a headache”

F=“Coming down with flu”

P(H)=1/10; P(F)=1/40; P(H|F)=1/2

What does this mean?

One day you wake up with a headache …

Since 50% of flu cases are associated with headaches …

I must have a 50-50 chance of coming down with flu!

Headache & Flu Example

The truth is …

$P(F \mid H) = \dfrac{P(H \mid F)\,P(F)}{P(H)} = \dfrac{1/2 \times 1/40}{1/10} = \dfrac{1}{8}$

Naïve Bayes Classifier

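$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i \mid a_1, a_2, \ldots, a_n)$

MAP: Maximum A Posteriori

$\omega_{MAP} = \arg\max_{\omega_i} \dfrac{P(a_1, a_2, \ldots, a_n \mid \omega_i)\,P(\omega_i)}{P(a_1, a_2, \ldots, a_n)}$

$\omega_{MAP} = \arg\max_{\omega_i} P(a_1, a_2, \ldots, a_n \mid \omega_i)\,P(\omega_i)$

$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i) \prod_j P(a_j \mid \omega_i)$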

Conditionally Independent

Independence

$P(A, B) = P(A)\,P(B)$

$P(A, B) = P(A)\,P(B \mid A)$, with $P(B \mid A) = P(B)$

$P(A, B \mid G) = P(A \mid G)\,P(B \mid G)$

Conditionally Independent

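$P(A \mid G, B) = P(A \mid G)$

$P(A, B \mid G) = \dfrac{P(A, B, G)}{P(G)} = \dfrac{P(A \mid G, B)\,P(B, G)}{P(G)} = P(A \mid G, B)\,P(B \mid G) = P(A \mid G)\,P(B \mid G)$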

Conditional Independence

$P(R, B \mid Y) = P(R \mid Y)\,P(B \mid Y)$

Independent ≠ Uncorrelated

$X \in [-1, 1]$; $Y = X^2$

Cov (X,Y)=0 X and Y are uncorrelated

However, Y is completely determined by X.

X       Y
1       1
0.5     0.25
0.2     0.04
0       0
-0.2    0.04
-0.5    0.25
-1      1

$\rho_{X,Y} = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \dfrac{E\left((X - \mu_X)(Y - \mu_Y)\right)}{\sigma_X \sigma_Y}$
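A quick numerical check of the claim, assuming the tabulated Y values are the squares of X (which the listed pairs suggest). The helper names are chosen for this sketch.

```python
xs = [1, 0.5, 0.2, 0, -0.2, -0.5, -1]
ys = [x ** 2 for x in xs]  # Y is completely determined by X

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance E[(A - mu_A)(B - mu_B)]."""
    mu_a, mu_b = mean(a), mean(b)
    return mean([(ai - mu_a) * (bi - mu_b) for ai, bi in zip(a, b)])

# Zero up to floating-point rounding, although Y depends entirely on X.
print(abs(covariance(xs, ys)) < 1e-12)  # True
```

The covariance vanishes because the X values are symmetric around zero, so positive and negative contributions cancel; correlation only detects linear dependence.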

Estimating P(αj|ωi)

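[Table: training samples with attributes α1, α2, α3 and class label ω]

$P(a_2 = \text{'…'} \mid \omega_1) = 2/3$

$P(a_2 = \text{'…'} \mid \omega_1) = 1/3$

$P(\omega_1) = 3/5$; $P(\omega_2) = 2/5$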

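Laplace Smoothing: $P(a_{jk} \mid \omega_i) = \dfrac{n_{ijk} + 1}{n_i + k_j}$, where $n_{ijk}$ is the number of class-$\omega_i$ samples with $\alpha_j = a_{jk}$, $n_i$ is the number of class-$\omega_i$ samples, and $k_j$ is the number of distinct values of $\alpha_j$.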

How about continuous variables?

Tennis Example

Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No

Tennis Example

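Given: <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>

Predict: PlayTennis (yes or no)

Bayes Solution:

$P(\text{PlayTennis} = \text{yes}) = 9/14$; $P(\text{PlayTennis} = \text{no}) = 5/14$

$P(\text{Wind} = \text{strong} \mid \text{PlayTennis} = \text{yes}) = 3/9$; $P(\text{Wind} = \text{strong} \mid \text{PlayTennis} = \text{no}) = 3/5$; ...

$P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes}) = 0.0053$

$P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no}) = 0.0206$

The conclusion is not to play tennis, with probability $\dfrac{0.0206}{0.0206 + 0.0053} = 0.795$.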
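The tennis calculation can be reproduced with a short frequency-counting sketch. The 14 rows are the ones from the table; the function names `train` and `scores` are chosen for this example, and no smoothing is applied so that the numbers match the hand calculation.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis, Day1 .. Day14.
DATA = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"),
    (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),
    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Strong"), "No"),
]

def train(data):
    """Count class frequencies and (attribute, value, class) frequencies."""
    class_counts = Counter(label for _, label in data)
    value_counts = Counter()
    for features, label in data:
        for j, value in enumerate(features):
            value_counts[(j, value, label)] += 1
    return class_counts, value_counts

def scores(query, class_counts, value_counts):
    """P(label) * prod_j P(a_j | label) for every label (no smoothing)."""
    total = sum(class_counts.values())
    result = {}
    for label, n_label in class_counts.items():
        score = n_label / total
        for j, value in enumerate(query):
            score *= value_counts[(j, value, label)] / n_label
        result[label] = score
    return result

class_counts, value_counts = train(DATA)
s = scores(("Sunny", "Cool", "High", "Strong"), class_counts, value_counts)
print({label: round(v, 4) for label, v in s.items()})  # {'No': 0.0206, 'Yes': 0.0053}
print(round(s["No"] / sum(s.values()), 3))             # 0.795
```

The two scores are unnormalized; dividing by their sum gives the posterior probability of "No".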

Text Classification Example

Interesting? Boring?

Politics? Entertainment? Sports?

Text Representation

α1      α2         α3      α4     …  αn           ω
Long    long       ago     there  …  king         1
New     sanctions  will    be     …  Iran         0
Hidden  Markov     models  are    …  method       0
The     Federal    Court   today  …  investigate  0

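We need to estimate probabilities such as $P(\alpha_j = V_k \mid \omega_i)$: the probability that word $V_k$ appears at position $j$ of a document in class $\omega_i$.

However, there are 2×n×|Vocabulary| such terms in total. For n=100 and a vocabulary of 50,000 distinct words, it adds up to 10 million terms!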

Text Representation

By only considering the probability of encountering a specific word instead of the specific word position, we can reduce the number of probabilities to be estimated.

We only count the frequency of each word.

Now, 2×50,000=100,000 terms need to be estimated.

n: the total number of word positions in all training samples whose target value is ωi.

nk: the number of times word Vk is found among these n positions.

$P(V_k \mid \omega_i) = \dfrac{n_k + 1}{n + |\text{Vocabulary}|}$
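A minimal bag-of-words sketch of this estimate follows. The toy corpus, the meaning of the labels (1 and 0), whitespace tokenization, working in log space and skipping words that never occurred in training are all assumptions of this example, not part of the lecture.

```python
import math
from collections import Counter

def train(documents):
    """documents: list of (text, label). Returns per-class word statistics."""
    vocabulary = set()
    word_counts = {}        # label -> Counter of word frequencies (n_k)
    doc_counts = Counter()  # label -> number of training documents
    for text, label in documents:
        words = text.lower().split()
        vocabulary.update(words)
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
    return vocabulary, word_counts, doc_counts

def classify(text, vocabulary, word_counts, doc_counts):
    """Pick the label maximizing log P(label) + sum of log P(word | label)."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        n = sum(counts.values())  # word positions in this class
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: words unseen in training are skipped
            # Smoothed estimate (n_k + 1) / (n + |Vocabulary|)
            score += math.log((counts[word] + 1) / (n + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus with two classes, 1 and 0.
documents = [
    ("long long ago there lived a king", 1),
    ("the king and the queen lived happily", 1),
    ("new sanctions will be imposed", 0),
    ("the federal court will investigate", 0),
]
model = train(documents)
print(classify("the queen lived long ago", *model))         # 1
print(classify("the court will impose sanctions", *model))  # 0
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long documents; the added 1 in the numerator keeps a word that is absent from one class from forcing the whole product to zero.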

Case Study: Newsgroups

Classification (Joachims, 1996): 20 newsgroups; 20,000 documents; Random Guess: 5%; NB: 89%

Recommendation (Lang, 1995): NewsWeeder; user-rated articles; Interesting vs. Uninteresting; Top 10% selected articles: 16% vs. 59%

Reading Materials

C. C. Aggarwal, A. Hinneburg and D. A. Keim, “On the Surprising Behavior of Distance Metrics in High Dimensional Space,” Proc. the 8th International Conference on Database Theory, LNCS 1973, pp. 420-434, London, UK, 2001.

J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time,” ACM Transactions on Mathematical Software, 3(3):209–226, 1977.

S. M. Omohundro, “Bumptrees for Efficient Function, Constraint, and Classification Learning,” Advances in Neural Information Processing Systems 3, pp. 693-699, Morgan Kaufmann, 1991.

Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill.

Additional reading about Naïve Bayes Classifier: http://www-2.cs.cmu.edu/~tom/NewChapters.html

Software for text classification using Naïve Bayes Classifier: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Review

What is classification?

What is supervised learning?

What does KNN stand for?

What are the major challenges of KNN?

How to accelerate KNN?

What is N-fold cross validation?

What does LOOCV stand for?

What is Bayes Theorem?

What is the key assumption in Naïve Bayes Classifiers?

Next Week’s Class Talk

Volunteers are required for next week’s class talk.

Topic 1: Efficient KNN Implementations

Hints: Ball Trees, Metric Trees, R Trees

Topic 2: Bayesian Belief Networks

Length: 20 minutes plus question time
