Naïve Bayes Classifier
Adapted from slides by Ke Chen from University of Manchester and YangQiu Song from MSRA
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)

Discriminative classifiers:
1. Assume some functional form for P(Y|X)
2. Estimate parameters of P(Y|X) directly from training data

Generative classifiers (also called "informative" by Rubinstein & Hastie):
1. Assume some functional form for P(X|Y), P(Y)
2. Estimate parameters of P(X|Y), P(Y) directly from training data
3. Use Bayes rule to calculate P(Y|X = xi)
Bayes Formula
P(C|X) = P(X|C) P(C) / P(X)
Generative Model
[Diagram: attributes such as Color, Size, Texture, Weight, ... are generated from the class by a generative model.]

Discriminative Model
[Diagram: attributes such as Color, Size, Texture, Weight, ... feed a discriminative classifier, e.g. Logistic Regression.]
Comparison
• Generative models
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X = x)
• Discriminative models
– Directly assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule
  P(C|X) = P(X|C)P(C) / P(X), i.e. Posterior = Likelihood × Prior / Evidence
Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events may occur: (A) die 1 lands on side "3", (B) die 2 lands on side "1", and (C) the two dice sum to eight. Answer the following questions:
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Is P(A, C) equal to P(A)P(C)?
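One way to check the quiz is to enumerate all 36 equally likely outcomes; the short sketch below does this (illustrative only, not part of the original slides).

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely (die 1, die 2) pairs

def prob(event):
    """Probability of an event given as a predicate over an outcome (d1, d2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3            # die 1 lands on side "3"
B = lambda o: o[1] == 1            # die 2 lands on side "1"
C = lambda o: o[0] + o[1] == 8     # the two dice sum to eight

print(prob(A), prob(B), prob(C))                            # 1/6, 1/6, 5/36
print(prob(lambda o: A(o) and B(o)) / prob(B))              # P(A|B) = 1/6
print(prob(lambda o: A(o) and C(o)) / prob(A))              # P(C|A) = 1/6
print(prob(lambda o: A(o) and B(o)))                        # P(A,B) = 1/36
print(prob(lambda o: A(o) and C(o)))                        # P(A,C) = 1/36
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))   # False: A and C are not independent
```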
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: P(C|X), C = c1, ..., cL, X = (X1, ..., Xn)
[Diagram: a discriminative probabilistic classifier takes an input x = (x1, x2, ..., xn) and outputs the posteriors P(c1|x), P(c2|x), ..., P(cL|x).]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model: P(X|C), C = c1, ..., cL, X = (X1, ..., Xn)
[Diagram: one generative probabilistic model per class; the model for class i takes an input x = (x1, x2, ..., xn) and outputs the likelihood P(x|ci), for i = 1, ..., L.]
Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, ..., cL
• Generative classification with the MAP rule
– Apply the Bayesian rule to convert likelihoods into posterior probabilities:
  P(C = ci|X = x) = P(X = x|C = ci)P(C = ci) / P(X = x) ∝ P(X = x|C = ci)P(C = ci), for i = 1, 2, ..., L
– Then apply the MAP rule
Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C)P(C) = P(X1, ..., Xn|C)P(C)
  Difficulty: learning the joint probability P(X1, ..., Xn|C)
• Naïve Bayes classification
– Assumption that all input attributes are conditionally independent!
  P(X1, X2, ..., Xn|C) = P(X1|X2, ..., Xn; C) P(X2, ..., Xn|C)
                       = P(X1|C) P(X2, ..., Xn|C)
                       = P(X1|C) P(X2|C) ··· P(Xn|C)
– MAP classification rule: for x = (x1, x2, ..., xn),
  [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c), c ≠ c*, c = c1, ..., cL
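To see why learning the full joint P(X1, ..., Xn|C) is the hard part, a rough parameter count helps; the snippet below is only illustrative (function names are mine, not from the slides).

```python
def joint_table_size(n_attributes, n_values):
    """Free parameters in a full table P(X1, ..., Xn | C) for one class (one entry per joint value, minus 1)."""
    return n_values ** n_attributes - 1

def naive_bayes_size(n_attributes, n_values):
    """Free parameters per class when the likelihood factorizes into independent P(Xi | C)."""
    return n_attributes * (n_values - 1)

# e.g. 10 discrete attributes with 3 values each
print(joint_table_size(10, 3))   # 59048 parameters per class
print(naive_bayes_size(10, 3))   # 20 parameters per class
```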
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
– Learning Phase: Given a training set S,
  For each target value ci (ci = c1, ..., cL)
    P̂(C = ci) ← estimate P(C = ci) with examples in S;
  For every attribute value xjk of each attribute Xj (j = 1, ..., n; k = 1, ..., Nj)
    P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
  Output: conditional probability tables; for Xj, Nj × L elements
– Test Phase: Given an unknown instance X' = (a1', ..., an'),
  Look up tables to assign the label c* to X' if
  [P̂(a1'|c*) ··· P̂(an'|c*)] P̂(c*) > [P̂(a1'|c) ··· P̂(an'|c)] P̂(c), c ≠ c*, c = c1, ..., cL
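A minimal learning-phase sketch in Python (the data layout and names are my own assumptions, not from the slides): it builds the prior P̂(C = ci) and the conditional probability tables P̂(Xj = xjk|C = ci) by simple counting.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_dict, label) pairs with discrete attribute values.

    Returns the class prior and a table cond[(attribute, value, label)] = P(value | label).
    """
    class_counts = Counter(label for _, label in examples)
    prior = {c: count / len(examples) for c, count in class_counts.items()}

    value_counts = defaultdict(int)
    for attributes, label in examples:
        for attribute, value in attributes.items():
            value_counts[(attribute, value, label)] += 1

    cond = {key: count / class_counts[key[2]] for key, count in value_counts.items()}
    return prior, cond
```

The test phase then multiplies the looked-up probabilities and the class prior for each class and picks the larger product, exactly as in the Play Tennis example below.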
Example
• Example: Play Tennis
Example
• Learning Phase
Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

P(Play=Yes) = 9/14
P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance, x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– MAP rule
  P(Yes|x') ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
  P(No|x') ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
  Since P(Yes|x') < P(No|x'), we label x' as "No".
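The two scores are easy to verify; a few lines reproducing the slide's arithmetic (sketch only):

```python
from math import prod

yes = prod([2/9, 3/9, 3/9, 3/9, 9/14])  # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Play=Yes)
no  = prod([3/5, 1/5, 4/5, 3/5, 5/14])  # P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(Play=No)
print(round(yes, 4), round(no, 4))      # 0.0053 0.0206
print("No" if no > yes else "Yes")      # predicted label: No
```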
Relevant Issues
• Violation of Independence Assumption
– For many real world tasks, P(X1, ..., Xn|C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, during test P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0
– For a remedy, conditional probabilities are estimated with
  P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
  n: number of training examples for which C = ci
  nc: number of training examples for which Xj = ajk and C = ci
  p: prior estimate (usually p = 1/t for t possible values of Xj)
  m: weight given to the prior (number of "virtual" examples, m ≥ 1)
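A hedged sketch of this m-estimate (the names are mine; with p = 1/t and m = t it reduces to Laplace "add-one" smoothing):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(Xj = ajk | C = ci).

    n_c: training examples with Xj = ajk and C = ci
    n:   training examples with C = ci
    p:   prior estimate, usually 1/t for t possible values of Xj
    m:   weight given to the prior (number of "virtual" examples)
    """
    return (n_c + m * p) / (n + m)

# An attribute value never seen with this class (n_c = 0), with t = 3 possible values:
print(m_estimate(0, 9, p=1/3, m=3))  # 0.0833... instead of a hard zero
```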
Relevant Issues
• Continuous-valued Input Attributes
– An attribute can take unboundedly many (continuous) values
– The conditional probability is modeled with the normal distribution:
  P̂(Xj|C = ci) = (1 / (√(2π) σji)) exp(−(Xj − μji)² / (2σji²))
  μji: mean (average) of attribute values Xj of examples for which C = ci
  σji: standard deviation of attribute values Xj of examples for which C = ci
– Learning Phase: for X = (X1, ..., Xn), C = c1, ..., cL
  Output: n × L normal distributions and P(C = ci), i = 1, ..., L
– Test Phase: for X' = (X1', ..., Xn')
  • Calculate conditional probabilities with all the normal distributions
  • Apply the MAP rule to make a decision
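A compact illustrative sketch of this Gaussian variant (helper names are assumptions, not from the slides): estimate a mean and standard deviation per attribute and class in the learning phase, then apply the MAP rule in the test phase.

```python
import math
from statistics import mean, stdev

def train_gaussian_nb(examples):
    """examples: list of (feature_tuple, label); assumes at least two examples per class so stdev is defined."""
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    prior = {c: len(rows) / len(examples) for c, rows in by_class.items()}
    stats = {c: [(mean(col), stdev(col)) for col in zip(*rows)] for c, rows in by_class.items()}
    return prior, stats

def gaussian(x, mu, sigma):
    """Normal density used as the conditional probability model P(Xj | C = ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(x, prior, stats):
    """MAP rule: choose the class maximizing prior times the product of per-attribute Gaussian likelihoods."""
    def score(c):
        s = prior[c]
        for xj, (mu, sigma) in zip(x, stats[c]):
            s *= gaussian(xj, mu, sigma)
        return s
    return max(prior, key=score)
```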
Conclusions
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast; it only requires estimating each attribute in each class separately
– Testing is straightforward; just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more...
Extra Slides
Naïve Bayes (1)
• Revisit
• Which is equal to
• Naïve Bayes assumes conditional independency
• Then the inference of posterior is
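In symbols, the chain these bullets refer to looks like the following (a sketch consistent with the earlier slides, writing Y for the class and X = (X1, ..., Xn)):

```latex
\begin{align*}
P(Y \mid X) &= \frac{P(X \mid Y)\,P(Y)}{P(X)} && \text{(revisit: Bayes rule)}\\
            &= \frac{P(X \mid Y)\,P(Y)}{\sum_{y} P(X \mid Y=y)\,P(Y=y)} && \text{(expanding the evidence)}\\
P(X_1,\dots,X_n \mid Y) &= \prod_{i=1}^{n} P(X_i \mid Y) && \text{(conditional independence)}\\
P(Y \mid X) &\propto P(Y)\prod_{i=1}^{n} P(X_i \mid Y) && \text{(inference of the posterior)}
\end{align*}
```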
Naïve Bayes (2)
• Training: the observations are multinomial; supervised, with label information
– Maximum Likelihood Estimation (MLE)
– Maximum a Posteriori (MAP): put a Dirichlet prior
• Classification
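As a sketch of what the two training estimators look like for multinomial counts (notation is mine: N(·) are training counts and α_x are Dirichlet hyperparameters; with all α_x = 2 the MAP estimate reduces to add-one smoothing):

```latex
\begin{align*}
\hat{P}_{\mathrm{MLE}}(X_i = x \mid Y = c) &= \frac{N(X_i = x,\, Y = c)}{N(Y = c)}\\
\hat{P}_{\mathrm{MAP}}(X_i = x \mid Y = c) &= \frac{N(X_i = x,\, Y = c) + \alpha_x - 1}{N(Y = c) + \sum_{x'}\left(\alpha_{x'} - 1\right)}
\end{align*}
```

Classification then applies the same MAP rule over classes as in the main slides.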
Naïve Bayes (3)
• What if we have continuous Xi?
• Generative training
• Prediction
Naïve Bayes (4)
• Problems
– Features may overlap
– Features may not be independent
  • e.g., the size and weight of a tiger
– A joint distribution estimate (P(X|Y), P(Y)) is used to solve a conditional problem (P(Y|X = x))
• Can we train discriminatively?
– Logistic regression
– Regularization
– Gradient ascent