Classification I
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Overview
K-Nearest Neighbor Algorithm
Naïve Bayes Classifier
Thomas Bayes
Classification
Definition
Classification is one of the fundamental skills for survival.
Food vs. Predator
A kind of supervised learning
Techniques for deducing a function from data
<Input, Output>
Input: a vector of features
Output: a Boolean value (binary classification) or integer (multiclass)
“Supervised” means:
A teacher or oracle is needed to label each data sample.
We will talk about unsupervised learning later.
Classifiers
[Figure: scatter plot of Weight against Height for Mary, Lisa, Jane, Jack, Peter, Tom, Sam and Helen. A classifier Z=f(x,y) maps each (Height, Weight) point to a label in {boy, girl}.]
Training a Classifier
Learning
Lazy Learners
Car
Truck
K-Nearest Neighbor Algorithm
The algorithm procedure:
Given a set of n training data in the form of <x, y>.
Given an unknown sample x′.
Calculate the distance d(x′, xi) for i=1 … n.
Select the K samples with the shortest distances.
Assign x′ the label that dominates the K samples.
It is the simplest classifier you will ever meet (I mean it!).
No training (literally).
A memory of the training data is maintained.
All computation is deferred until classification.
Produces satisfactory results in many cases.
Should give it a go whenever possible.
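The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: it assumes Euclidean distance, and the function name `knn_classify` and the height/weight samples are made up for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort the training samples by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Keep the labels of the k closest samples.
    nearest_labels = [label for _, label in by_distance[:k]]
    # The label that dominates the k samples wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented (height in cm, weight in kg) samples, for illustration only.
train = [
    ((180, 80), "boy"), ((175, 72), "boy"), ((170, 68), "boy"),
    ((160, 50), "girl"), ((165, 55), "girl"), ((158, 48), "girl"),
]
print(knn_classify(train, (178, 75), k=3))
```

Note that there is no training step: the function simply keeps the training data and does all the work at query time.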
Properties of KNN
Instance-Based Learning
No explicit description of the target function
Can handle complicated situations.
Properties of KNN
Depends on the data distribution.
Can make mistakes at boundaries.
[Figure: an unknown sample "?" classified with a K=1 neighborhood and with a K=7 neighborhood.]
Challenges of KNN
The Value of K
Non-monotonic impact on accuracy
Too Big vs. Too Small
Rules of thumb
Weights
Different features may have different impact …
Distance
There are many different ways to measure the distance.
Euclidean, Manhattan …
Complexity
Need to calculate the distance between x′ and all training data.
The cost grows in proportion to the size of the training data.
[Figure: accuracy plotted against K.]
Distance Metrics
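$$L_k(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^k \right)^{1/k}$$
$$L_2(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^2 \right)^{1/2}$$
$$L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$$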
Distance Metrics
The shortest path between two points …
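A small sketch of the L_k family of distances, with Manhattan (k=1) and Euclidean (k=2) as special cases. The function name `minkowski` and the sample points are assumptions made for the example.

```python
def minkowski(x, y, k):
    """L_k distance between two equal-length vectors (k >= 1)."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)   # |3| + |4|
euclidean = minkowski(x, y, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The same pair of points is 7 apart under L1 and 5 apart under L2, so the choice of metric can change which neighbors count as nearest.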
Mahalanobis Distance
Distance from a point to a point set
Mahalanobis Distance
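$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
For identity matrix S:
$$D_M(x) = \sqrt{(x - \mu)^T (x - \mu)}$$
For diagonal matrix S:
$$D_M(x) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}}$$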
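For a diagonal S, the distance reduces to a per-feature scaling by the variance. The sketch below covers only that special case; the function name `mahalanobis_diagonal` and the numbers are illustrative assumptions.

```python
import math

def mahalanobis_diagonal(x, mean, variances):
    """Mahalanobis distance when the covariance matrix S is diagonal.

    `variances` holds the diagonal entries of S (one variance per feature).
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, variances)))

x, mean = (2.0, 10.0), (0.0, 0.0)
# Identity covariance: identical to the Euclidean distance.
d_identity = mahalanobis_diagonal(x, mean, (1.0, 1.0))
# The second feature has a large variance, so its offset counts for less.
d_scaled = mahalanobis_diagonal(x, mean, (1.0, 100.0))
print(d_identity, d_scaled)
```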
Voronoi Diagram
perpendicular bisector
Structured Data
[Figure: samples plotted in the unit square (axis ticks 0, 0.5, 1) with an unknown sample marked "?".]
KD-Tree
Point Set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}
KD-Tree
function kdtree (list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;

        // Sort point list and choose median as pivot element
        select median by axis from pointList;

        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }
}
Evaluation
Accuracy
Recall what we have learned in the first lecture …
Confusion Matrix
ROC Curve
Training Set vs. Test Set
N-fold Cross Validation
[Figure: N-fold cross validation, with a different fold held out as the test set in each round.]
LOOCV
Leave One Out Cross Validation
An extreme case of N-fold cross validation
N=number of available samples
Usually very time consuming but okay for KNN
Now, let’s try KNN+LOOCV …
All students in this class are given one of two labels.
Gender: Male vs. Female
Major: CS vs. EE vs. Automation
10 Minutes …
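For reference, KNN+LOOCV can be sketched as below. The (height, weight) samples and the M/F labels are invented stand-ins for the class data, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training samples nearest to `query`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_accuracy(samples, k):
    """Hold out each sample once, classify it with the rest, and average."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]   # every sample except the i-th
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(samples)

# Invented (height in cm, weight in kg) samples, for illustration only.
samples = [
    ((180, 80), "M"), ((175, 72), "M"), ((170, 68), "M"), ((183, 85), "M"),
    ((160, 50), "F"), ((165, 55), "F"), ((158, 48), "F"), ((162, 53), "F"),
]
for k in (1, 3, 5):
    print(k, loocv_accuracy(samples, k))
```

LOOCV is cheap for KNN because there is no model to retrain: leaving a sample out only means skipping it in the distance search.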
Bayes Theorem
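[Figure: Venn diagram of events A and B.]
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$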
Bayes Theorem
Fish Example
Salmon vs. Tuna
P(ω1)=P(ω2)
P(ω1)>P(ω2)
Additional information
$$P(\omega_i|x) = \frac{P(x|\omega_i)P(\omega_i)}{P(x)}$$
Shooting Example
Probability of Kill
P(A): 0.6
P(B): 0.5
The target is killed with:
One shot from A
One shot from B
What is the probability that it is shot down by A?
C: The target is killed.
$$P(A|C) = \frac{P(C|A)P(A)}{P(C)} = \frac{1 \times 0.6}{0.6 \times 0.5 + 0.4 \times 0.5 + 0.6 \times 0.5} = \frac{3}{4}$$
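The same number can be checked by enumerating the four hit/miss outcomes. This sketch assumes the two shots are independent, which the example implies but does not state; the variable names are made up for the illustration.

```python
from itertools import product

p_hit = {"A": 0.6, "B": 0.5}

p_killed = 0.0          # P(C): at least one shot hits
p_killed_and_a = 0.0    # P(A hits and C)
for a_hits, b_hits in product([True, False], repeat=2):
    # Probability of this particular hit/miss combination (independent shots).
    p = (p_hit["A"] if a_hits else 1 - p_hit["A"]) * \
        (p_hit["B"] if b_hits else 1 - p_hit["B"])
    if a_hits or b_hits:
        p_killed += p
        if a_hits:
            p_killed_and_a += p

p_a_given_killed = p_killed_and_a / p_killed
print(round(p_a_given_killed, 4))
```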
Cancer Example
ω1: Cancer; ω2: Normal
P(ω1)=0.008; P(ω2)=0.992
Lab Test Outcomes: + vs. –
P(+|ω1)=0.98; P(-|ω1)=0.02
P(+|ω2)=0.03; P(-|ω2)=0.97
Now someone has a positive test result…
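Is he/she doomed?
Cancer Example
$$P(+|\omega_1)P(\omega_1) = 0.98 \times 0.008 = 0.0078$$
$$P(+|\omega_2)P(\omega_2) = 0.03 \times 0.992 = 0.0298$$
$$P(\omega_1|+) = \frac{0.0078}{0.0078 + 0.0298} \approx 0.21$$
$$P(\omega_2|+) > P(\omega_1|+)$$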
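The calculation generalizes to any number of classes: multiply each prior by its likelihood and normalize. A minimal sketch follows, with the helper name `posterior` and the dictionary keys chosen for the example.

```python
def posterior(prior, likelihood):
    """Normalize prior * likelihood over all classes (Bayes theorem)."""
    joint = {c: prior[c] * likelihood[c] for c in prior}
    evidence = sum(joint.values())          # P(observation)
    return {c: joint[c] / evidence for c in joint}

prior = {"cancer": 0.008, "normal": 0.992}
positive_given = {"cancer": 0.98, "normal": 0.03}   # P(+ | class)

post = posterior(prior, positive_given)
print({c: round(p, 2) for c, p in post.items()})
```

Even with a positive test, the small prior keeps the posterior probability of cancer well below one half.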
Headache & Flu Example
H=“Having a headache”
F=“Coming down with flu”
P(H)=1/10; P(F)=1/40; P(H|F)=1/2
What does this mean?
One day you wake up with a headache …
Since 50% flu cases are associated with headaches …
I must have a 50-50 chance of coming down with flu!
Headache & Flu Example
The truth is …
$$P(F|H) = \frac{P(H|F)P(F)}{P(H)} = \frac{1/2 \times 1/40}{1/10} = \frac{1}{8}$$
[Figure: Venn diagram of the Flu and Headache regions.]
Naïve Bayes Classifier
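MAP: Maximum A Posteriori
$$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i | a_1, a_2, \ldots, a_n)$$
$$= \arg\max_{\omega_i} \frac{P(a_1, a_2, \ldots, a_n | \omega_i) P(\omega_i)}{P(a_1, a_2, \ldots, a_n)}$$
$$= \arg\max_{\omega_i} P(a_1, a_2, \ldots, a_n | \omega_i) P(\omega_i)$$
Conditionally Independent
$$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i) \prod_{j} P(a_j | \omega_i)$$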
Independence
$$P(A \cap B) = P(A)P(B)$$
$$P(A \cap B) = P(A)P(B|A), \qquad P(B|A) = P(B)$$
Conditionally Independent
$$P(A, B|G) = P(A|G)P(B|G)$$
$$P(A|G, B) = P(A|G)$$
$$P(A, B|G) = \frac{P(A, B, G)}{P(G)} = \frac{P(A|B, G)P(B, G)}{P(G)} = P(A|B, G)P(B|G)$$
Conditional Independence
$$P(R \cap B | Y) = P(R|Y)P(B|Y)$$
Independent ≠ Uncorrelated
$$X \in [-1, 1]; \quad Y = X^2$$
Cov (X,Y)=0: X and Y are uncorrelated.
However, Y is completely determined by X.
X       Y
1       1
0.5     0.25
0.2     0.04
0       0
-0.2    0.04
-0.5    0.25
-1      1
[Figure: plot of Y against X for X from -1 to 1; Y ranges from 0 to 1.]
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left((X - \mu_X)(Y - \mu_Y)\right)}{\sigma_X \sigma_Y}$$
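The claim can be checked numerically on the seven tabulated points. The sketch uses the population covariance (dividing by the number of points); the helper names are made up for the example.

```python
xs = [1, 0.5, 0.2, 0, -0.2, -0.5, -1]
ys = [x ** 2 for x in xs]          # Y is fully determined by X

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance: mean of the products of deviations."""
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

cov_xy = covariance(xs, ys)
# Zero up to floating-point rounding, even though Y depends on X.
print(abs(cov_xy) < 1e-12)
```

The positive and negative halves of X cancel exactly, which is why a zero covariance says nothing about a symmetric, nonlinear dependence like this one.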
Estimating P(αj|ωi)
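[Table: five training samples with attributes α1, α2, α3 and class label ω; three samples are labeled ω1 and two are labeled ω2.]
$$P(\omega_1) = 3/5; \quad P(\omega_2) = 2/5$$
$$P(a_2 = \text{'…'} \,|\, \omega_1) = 2/3$$
$$P(a_2 = \text{'…'} \,|\, \omega_1) = 1/3$$
Laplace Smoothing
$$P(a_{jk} | \omega_i) = \frac{|a_j = a_{jk} \wedge \omega = \omega_i| + 1}{|\omega = \omega_i| + |a_j|}$$
Here |·| counts the matching training samples and |a_j| is the number of distinct values of attribute a_j.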
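Laplace smoothing adds one to every count so that a value never seen with a class does not force the whole product to zero. A minimal sketch with hypothetical counts follows; the function name and its parameters are assumptions made for the example.

```python
from fractions import Fraction

def laplace_estimate(count, class_total, num_values):
    """Smoothed estimate of P(value | class).

    count       -- samples in the class that have this attribute value
    class_total -- all samples in the class
    num_values  -- number of distinct values the attribute can take
    """
    return Fraction(count + 1, class_total + num_values)

# Hypothetical counts: an attribute with 3 possible values,
# observed 0, 2 and 3 times among the 5 samples of one class.
estimates = [laplace_estimate(c, 5, 3) for c in (0, 2, 3)]
print(estimates[0])          # the unseen value still gets a nonzero share
print(sum(estimates))        # the smoothed estimates still sum to one
```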
How about continuous variables?
Tennis Example
Day     Outlook    Temperature  Humidity  Wind    Play Tennis
Day1    Sunny      Hot          High      Weak    No
Day2    Sunny      Hot          High      Strong  No
Day3    Overcast   Hot          High      Weak    Yes
Day4    Rain       Mild         High      Weak    Yes
Day5    Rain       Cool         Normal    Weak    Yes
Day6    Rain       Cool         Normal    Strong  No
Day7    Overcast   Cool         Normal    Strong  Yes
Day8    Sunny      Mild         High      Weak    No
Day9    Sunny      Cool         Normal    Weak    Yes
Day10   Rain       Mild         Normal    Weak    Yes
Day11   Sunny      Mild         Normal    Strong  Yes
Day12   Overcast   Mild         High      Strong  Yes
Day13   Overcast   Hot          Normal    Weak    Yes
Day14   Rain       Mild         High      Strong  No
Tennis Example
Given: <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
Predict: PlayTennis (yes or no)
Bayes Solution:
$$P(PlayTennis = yes) = 9/14; \quad P(PlayTennis = no) = 5/14$$
$$P(Wind = strong | PlayTennis = yes) = 3/9; \quad P(Wind = strong | PlayTennis = no) = 3/5; \quad \ldots$$
$$P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = 0.0053$$
$$P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = 0.0206$$
The conclusion is not to play tennis, with probability:
$$\frac{0.0206}{0.0206 + 0.0053} = 0.795$$
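The whole calculation can be reproduced by counting over the 14-day table. The sketch below uses plain relative frequencies with no smoothing, matching the numbers in the example; the function names `train` and `score` are made up for the illustration.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for Day1 to Day14.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def train(rows):
    """Count class labels and (attribute index, value, label) triples."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for row in rows:
        for j, value in enumerate(row[:-1]):
            value_counts[(j, value, row[-1])] += 1
    return class_counts, value_counts

def score(query, class_counts, value_counts):
    """Unnormalized P(label) * product of P(value | label) for each label."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        p = n / total
        for j, value in enumerate(query):
            p *= value_counts[(j, value, label)] / n
        scores[label] = p
    return scores

class_counts, value_counts = train(data)
scores = score(("Sunny", "Cool", "High", "Strong"), class_counts, value_counts)
best = max(scores, key=scores.get)
print(best, round(scores[best] / sum(scores.values()), 3))
```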
Text Classification Example
Interesting? Boring?
Politics? Entertainment? Sports?
Text Representation
α1      α2         α3      α4     …  αn           ω
Long    long       ago     there  …  king         1
New     sanctions  will    be     …  Iran         0
Hidden  Markov     models  are    …  method       0
The     Federal    Court   today  …  investigate  0
We need to estimate probabilities such as P(αi = Vk | ωj).
However, there are 2×n×|Vocabulary| terms in total. For n=100 and a vocabulary of 50,000 distinct words, it adds up to 10 million terms!
Text Representation
By only considering the probability of encountering a specific word instead of the specific word position, we can reduce the number of probabilities to be estimated.
We only count the frequency of each word.
Now, 2×50,000=100,000 terms need to be estimated.
n: the total number of word positions in all training samples whose target value is ωi.
nk: the number of times word Vk is found among these n positions.
$$P(V_k | \omega_i) = \frac{n_k + 1}{n + |Vocabulary|}$$
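A sketch of this estimate for the documents of a single class is shown below. The toy documents, the vocabulary and the function name `word_probabilities` are invented for illustration, and whitespace tokenization is assumed.

```python
from collections import Counter

def word_probabilities(documents, vocabulary):
    """Estimate P(V_k | class) from the documents of one class.

    Uses (n_k + 1) / (n + |Vocabulary|), where n counts all word
    positions in `documents` and n_k counts occurrences of word V_k.
    """
    counts = Counter(word for doc in documents for word in doc.split()
                     if word in vocabulary)
    n = sum(counts.values())
    return {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

# Toy documents of one class, invented for illustration.
interesting = ["long long ago there lived a king", "the king and the queen"]
vocabulary = {"long", "ago", "there", "lived", "a", "king", "the", "and",
              "queen", "sanctions"}
probs = word_probabilities(interesting, vocabulary)
print(probs["king"], probs["sanctions"])
```

Word position is ignored entirely: only the frequency of each word within the class matters, and a word that never appears still receives a small nonzero probability.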
Case Study: Newsgroups
Classification
Joachims, 1996
20 newsgroups
20,000 documents
Random Guess: 5%
NB: 89%
Recommendation
Lang, 1995
NewsWeeder
User rated articles
Interesting vs. Uninteresting
Top 10% selected articles
16% vs. 59%
Reading Materials
C. C. Aggarwal, A. Hinneburg and D. A. Keim, “On the Surprising Behavior of Distance Metrics in High Dimensional Space,” Proc. the 8th International Conference on Database Theory, LNCS 1973, pp. 420-434, London, UK, 2001.
J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An Algorithm for Finding Best Matches in Logarithmic Expected Time,” ACM Transactions on Mathematical Software, 3(3):209–226, 1977.
S. M. Omohundro, “Bumptrees for Efficient Function, Constraint, and Classification Learning,” Advances in Neural Information Processing Systems 3, pp. 693-699, Morgan Kaufmann, 1991.
Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill.
Additional reading about Naïve Bayes Classifier http://www-2.cs.cmu.edu/~tom/NewChapters.html
Software for text classification using Naïve Bayes Classifier http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Review
What is classification?
What is supervised learning?
What does KNN stand for?
What are the major challenges of KNN?
How to accelerate KNN?
What is N-fold cross validation?
What does LOOCV stand for?
What is Bayes Theorem?
What is the key assumption in Naïve Bayes Classifiers?
Next Week’s Class Talk
Volunteers are required for next week’s class talk.
Topic 1: Efficient KNN Implementations
Hints: Ball Trees Metric Trees R Trees
Topic 2: Bayesian Belief Networks
Length: 20 minutes plus question time