SEEM4630 2013-2014 Tutorial 2 Classification: Decision tree, Naïve Bayes & k-NN Wentao TIAN, wttian@se.cuhk.edu.hk


Page 1

SEEM4630 2013-2014 Tutorial 2

Classification:

Decision tree, Naïve Bayes & k-NN

Wentao TIAN, wttian@se.cuhk.edu.hk

Page 2

Classification: Definition

Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes, e.g., a decision tree, Naïve Bayes, or k-NN.

Goal: previously unseen records should be assigned a class as accurately as possible.

Page 3

Decision Tree

Goal: construct a tree so that instances belonging to different classes are separated.

Basic algorithm (a greedy algorithm; a sketch of the procedure follows below):
• The tree is constructed in a top-down, recursive manner.
• At the start, all the training examples are at the root.
• Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
• Examples are partitioned recursively based on the selected attributes.
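The greedy procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's code: records are assumed to be dicts such as {"Outlook": "Sunny", ..., "PlayTennis": "No"}, and the attribute-selection measure is passed in as `score`, so any of the measures on the following slides can be plugged in.

```python
from collections import Counter

def build_tree(records, attributes, score, target):
    """Greedy, top-down decision-tree induction (ID3-style sketch)."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:            # all examples in one class: stop
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # pick the attribute that scores best on the current examples
    best = max(attributes, key=lambda a: score(records, a, target))
    subtree = {}
    for value in set(r[best] for r in records):
        partition = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]
        subtree[value] = build_tree(partition, remaining, score, target)
    return {best: subtree}
```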

Page 4

Attribute Selection Measure 1: Information Gain

Let p_i be the probability that a tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
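The three formulas translate directly to Python. Again a hedged sketch, assuming the dict-shaped records from the build_tree sketch above; info_gain can then be passed to build_tree as its score argument.

```python
import math
from collections import Counter

def info(records, target):
    """Info(D): entropy of the class distribution in D."""
    total = len(records)
    counts = Counter(r[target] for r in records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_a(records, attr, target):
    """Info_A(D): weighted entropy after splitting D on attribute A."""
    total = len(records)
    values = set(r[attr] for r in records)
    return sum(len(part) / total * info(part, target)
               for part in ([r for r in records if r[attr] == v] for v in values))

def info_gain(records, attr, target):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(records, target) - info_a(records, attr, target)
```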

Page 5

Attribute Selection Measure 2: Gain Ratio

The information gain measure is biased towards attributes with a large number of values.

C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}

GainRatio(A) = Gain(A) / SplitInfo_A(D)
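A corresponding sketch of the C4.5 normalization; it assumes the info_gain helper from the previous sketch is available.

```python
import math

def split_info(records, attr):
    """SplitInfo_A(D): entropy of the partition sizes themselves."""
    total = len(records)
    fractions = [sum(1 for r in records if r[attr] == v) / total
                 for v in set(r[attr] for r in records)]
    return -sum(p * math.log2(p) for p in fractions)

def gain_ratio(records, attr, target):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); guard against a zero split-info."""
    si = split_info(records, attr)
    return info_gain(records, attr, target) / si if si > 0 else 0.0
```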

Page 6

Attribute Selection Measure 3: Gini Index

If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as

gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)
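The Gini-based measure, sketched under the same assumptions (dict-shaped records); gini_split performs the binary split on a chosen value of A and returns the reduction in impurity.

```python
from collections import Counter

def gini(records, target):
    """gini(D) = 1 - sum_j p_j^2."""
    total = len(records)
    counts = Counter(r[target] for r in records)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(records, attr, value, target):
    """Reduction in impurity for the binary split D1 = {A == value}, D2 = the rest."""
    d1 = [r for r in records if r[attr] == value]
    d2 = [r for r in records if r[attr] != value]
    total = len(records)
    gini_a = len(d1) / total * gini(d1, target) + len(d2) / total * gini(d2, target)
    return gini(records, target) - gini_a
```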

Page 7

Example

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No

Page 8

Tree induction example

Entropy of data S:
Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Split data by attribute Outlook:
S [9+, 5-], Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]

Gain(Outlook) = 0.94 - 5/14[-2/5 log2(2/5) - 3/5 log2(3/5)]
                     - 4/14[-4/4 log2(4/4) - 0/4 log2(0/4)]
                     - 5/14[-3/5 log2(3/5) - 2/5 log2(2/5)]
              = 0.94 - 0.69 = 0.25
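These numbers can be checked with a few lines of arithmetic; a sanity-check sketch using the class counts above:

```python
import math

def h(pos, neg):
    """Entropy of a node containing pos positive and neg negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

info_s = h(9, 5)
gain_outlook = info_s - (5/14 * h(2, 3) + 4/14 * h(4, 0) + 5/14 * h(3, 2))
print(round(info_s, 2), round(gain_outlook, 2))   # 0.94 0.25
```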

Page 9

Tree induction example

Split data by attribute Temperature:
S [9+, 5-], Temperature: <15 [3+, 1-], 15-25 [5+, 1-], >25 [2+, 2-]

Gain(Temperature) = 0.94 - 4/14[-3/4 log2(3/4) - 1/4 log2(1/4)]
                         - 6/14[-5/6 log2(5/6) - 1/6 log2(1/6)]
                         - 4/14[-2/4 log2(2/4) - 2/4 log2(2/4)]
                  = 0.94 - 0.80 = 0.14

Page 10

Tree induction example

Split data by attribute Humidity:
S [9+, 5-], Humidity: High [3+, 4-], Normal [6+, 1-]

Gain(Humidity) = 0.94 - 7/14[-3/7 log2(3/7) - 4/7 log2(4/7)]
                      - 7/14[-6/7 log2(6/7) - 1/7 log2(1/7)]
               = 0.94 - 0.79 = 0.15

Split data by attribute Wind:
S [9+, 5-], Wind: Weak [6+, 2-], Strong [3+, 3-]

Gain(Wind) = 0.94 - 8/14[-6/8 log2(6/8) - 2/8 log2(2/8)]
                  - 6/14[-3/6 log2(3/6) - 3/6 log2(3/6)]
           = 0.94 - 0.89 = 0.05

Page 11

Tree induction example

Gain(Outlook) = 0.25, Gain(Temperature) = 0.14, Gain(Humidity) = 0.15, Gain(Wind) = 0.05

Outlook has the highest gain, so it becomes the root:

Outlook:
  Sunny    -> ??
  Overcast -> Yes
  Rain     -> ??

(The training data table from Page 7 is repeated on this slide for reference.)

Page 12

Entropy of branch Sunny:
Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

Split Sunny branch by attribute Temperature:
Sunny [2+, 3-], Temperature: <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]

Gain(Temperature) = 0.97 - 1/5[-1/1 log2(1/1) - 0/1 log2(0/1)]
                         - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
                  = 0.97 - 0.4 = 0.57

Split Sunny branch by attribute Humidity:
Sunny [2+, 3-], Humidity: High [0+, 3-], Normal [2+, 0-]

Gain(Humidity) = 0.97 - 3/5[-0/3 log2(0/3) - 3/3 log2(3/3)]
                      - 2/5[-2/2 log2(2/2) - 0/2 log2(0/2)]
               = 0.97 - 0 = 0.97

Split Sunny branch by attribute Wind:
Sunny [2+, 3-], Wind: Weak [1+, 2-], Strong [1+, 1-]

Gain(Wind) = 0.97 - 3/5[-1/3 log2(1/3) - 2/3 log2(2/3)]
                  - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
           = 0.97 - 0.95 = 0.02

Page 13

Tree induction example

Humidity has the highest gain (0.97) on the Sunny branch, so the tree so far is:

Outlook:
  Sunny    -> Humidity:
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> ??

Page 14

Entropy of branch Rain:
Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

Split Rain branch by attribute Temperature:
Rain [3+, 2-], Temperature: <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]

Gain(Temperature) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                         - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
                         - 0 (the >25 partition is empty and contributes nothing)
                  = 0.97 - 0.95 = 0.02

Split Rain branch by attribute Humidity:
Rain [3+, 2-], Humidity: High [1+, 1-], Normal [2+, 1-]

Gain(Humidity) = 0.97 - 2/5[-1/2 log2(1/2) - 1/2 log2(1/2)]
                      - 3/5[-2/3 log2(2/3) - 1/3 log2(1/3)]
               = 0.97 - 0.95 = 0.02

Split Rain branch by attribute Wind:
Rain [3+, 2-], Wind: Weak [3+, 0-], Strong [0+, 2-]

Gain(Wind) = 0.97 - 3/5[-3/3 log2(3/3) - 0/3 log2(0/3)]
                  - 2/5[-0/2 log2(0/2) - 2/2 log2(2/2)]
           = 0.97 - 0 = 0.97

Page 15

The final decision tree:

Outlook:
  Sunny    -> Humidity:
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind:
                Strong -> No
                Weak   -> Yes
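For illustration, the induced tree can be written down directly as a nested structure with a tiny classifier; a sketch, not part of the tutorial:

```python
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, record):
    """Follow the branches matching the record's attribute values until a leaf."""
    while isinstance(node, dict):
        attr = next(iter(node))          # the attribute tested at this node
        node = node[attr][record[attr]]  # descend into the matching branch
    return node

print(classify(tree, {"Outlook": "Rain", "Wind": "Strong"}))   # No
```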

Page 16

Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities P(C_i | x_1, x_2, ..., x_n), where x_i is the value of attribute A_i.

Model: compute P(C_i | x_1, x_2, ..., x_n) from the data, and choose the class label that has the highest probability.

Foundation: based on Bayes' theorem,

P(C_i | x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n | C_i) \, P(C_i)}{P(x_1, x_2, \ldots, x_n)}

where P(C_i | x_1, ..., x_n) is the posterior probability, P(x_1, ..., x_n | C_i) is the likelihood, and P(C_i) is the prior probability.

Page 17

Naïve Bayes Classifier

Problem: the joint probabilities P(x_1, x_2, \ldots, x_n | C_i) are difficult to estimate.

Naïve Bayes assumption: attributes are conditionally independent given the class,

P(x_1, x_2, \ldots, x_n | C_i) = P(x_1 | C_i) \cdot P(x_2 | C_i) \cdots P(x_n | C_i)

so that

P(C_i | x_1, x_2, \ldots, x_n) = \frac{\left[\prod_{j=1}^{n} P(x_j | C_i)\right] P(C_i)}{P(x_1, x_2, \ldots, x_n)}
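A minimal Naïve Bayes sketch built from these two formulas (illustrative Python, not the tutorial's code); the denominator is dropped because it is the same for every class. Applied to the table on the next slide with the test record A=m, B=q, it reproduces the conclusion C=t.

```python
from collections import Counter, defaultdict

def train_nb(records, target):
    """Estimate the prior P(C) and the likelihoods P(x_j = v | C) by counting."""
    priors = Counter(r[target] for r in records)
    likelihoods = defaultdict(Counter)        # (attribute, class) -> value counts
    for r in records:
        for attr, value in r.items():
            if attr != target:
                likelihoods[(attr, r[target])][value] += 1
    return priors, likelihoods

def predict_nb(priors, likelihoods, x):
    """Pick the class maximizing P(C) * prod_j P(x_j | C); the evidence term cancels."""
    total = sum(priors.values())
    best_class, best_score = None, -1.0
    for c, count in priors.items():
        score = count / total
        for attr, value in x.items():
            score *= likelihoods[(attr, c)][value] / count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

data = [{"A": a, "B": b, "C": c} for a, b, c in [
    ("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"), ("g", "q", "t"),
    ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"), ("h", "q", "f"), ("m", "b", "f")]]
priors, likelihoods = train_nb(data, target="C")
print(predict_nb(priors, likelihoods, {"A": "m", "B": "q"}))   # t  (2/25 > 1/25)
```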

Page 18

Example: Naïve Bayes Classifier

A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f

Test record: A = m, B = q, C = ?

P(C=t) = 1/2, P(C=f) = 1/2
P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5

Page 19

Example: Naïve Bayes Classifier

For C = t:
P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25
P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)

For C = f:
P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25
P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)

The denominator P(A=m, B=q) is the same for both classes, so only the numerators need to be compared: 2/25 > 1/25, i.e., the posterior for C=t is higher.

Conclusion: for A=m, B=q, predict C=t.

Page 20

Nearest Neighbor Classification

Input:
• A set of stored records
• k: the number of nearest neighbors

To classify an unknown record (see the sketch below):
• Compute the distance to every stored record, e.g., the Euclidean distance

  d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}

• Identify the k nearest neighbors
• Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e., by taking a majority vote)
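A minimal sketch of this procedure in Python (illustrative only), using the Euclidean distance above and a majority vote among the k nearest stored records:

```python
import math
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(training, query, k):
    """training: list of (point, label) pairs; return the majority label among the k nearest."""
    neighbors = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```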

Page 21

Nearest Neighbor Classification: A Discrete Example

Input: 8 training instances
P1 (4, 2)     Orange
P2 (0.5, 2.5) Orange
P3 (2.5, 2.5) Orange
P4 (3, 3.5)   Orange
P5 (5.5, 3.5) Orange
P6 (2, 4)     Black
P7 (4, 5)     Black
P8 (2.5, 5.5) Black

k = 1 and k = 3

New instance: Pn (4, 4), class = ?

Calculate the distances:
d(P1, Pn) = \sqrt{(4-4)^2 + (4-2)^2} = 2
d(P2, Pn) = 3.80
d(P3, Pn) = 2.12
d(P4, Pn) = 1.12
d(P5, Pn) = 1.58
d(P6, Pn) = 2
d(P7, Pn) = 1
d(P8, Pn) = 2.12
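Using the knn_predict sketch from the previous slide on these eight points (the outcomes follow from the distances listed above):

```python
training = [((4, 2), "Orange"), ((0.5, 2.5), "Orange"), ((2.5, 2.5), "Orange"),
            ((3, 3.5), "Orange"), ((5.5, 3.5), "Orange"), ((2, 4), "Black"),
            ((4, 5), "Black"), ((2.5, 5.5), "Black")]          # P1 .. P8

print(knn_predict(training, (4, 4), k=1))   # Black  (nearest neighbor is P7)
print(knn_predict(training, (4, 4), k=3))   # Orange (P7, P4, P5: two Orange vs one Black)
```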

Page 22

Nearest Neighbor Classification

[Figure: scatter plots of P1-P8 and the new instance Pn, showing the neighborhood used for k = 1 and for k = 3.]

Page 23

Nearest Neighbor Classification…

Scaling issues:
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Each attribute should fall in the same range, e.g., via min-max normalization (a sketch follows below)
• Example: two data records a = (1, 1000), b = (0.5, 1); dis(a, b) = ? The raw Euclidean distance is dominated almost entirely by the second attribute.
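A minimal min-max normalization sketch (illustrative), applied to the two records above so that both attributes fall in [0, 1]:

```python
def min_max_normalize(points):
    """Rescale every attribute to [0, 1] so no single attribute dominates the distance."""
    mins = [min(p[i] for p in points) for i in range(len(points[0]))]
    maxs = [max(p[i] for p in points) for i in range(len(points[0]))]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(p, mins, maxs))
            for p in points]

print(min_max_normalize([(1, 1000), (0.5, 1)]))   # [(1.0, 1.0), (0.0, 0.0)]
```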

Page 24

Classification: Lazy & Eager Learning

Two types of learning methodologies:

Lazy learning
• Instance-based learning (e.g., k-NN)

Eager learning
• Decision-tree and Bayesian classification
• ANN & SVM

[Figure: scatter plots of P1-P8 and Pn, repeated from the k-NN example.]

Page 25

Differences Between Lazy & Eager Learning

Lazy learning
a. Does not require model building
b. Less time training but more time predicting
c. Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function

Eager learning
a. Requires model building
b. More time training but less time predicting
c. Must commit to a single hypothesis that covers the entire instance space

Page 26

Thank you & Questions?
