Naïve Bayes Classifier
Adapted from slides by Ke Chen from University of Manchester and YangQiu Song from MSRA
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)

Discriminative classifiers:
1. Assume some functional form for P(Y|X)
2. Estimate parameters of P(Y|X) directly from training data

Generative classifiers (also called "informative" by Rubinstein & Hastie):
1. Assume some functional form for P(X|Y), P(Y)
2. Estimate parameters of P(X|Y), P(Y) directly from training data
3. Use Bayes rule to calculate P(Y|X = xi)
Bayes Formula
P(C|X) = P(X|C) P(C) / P(X)
Generative Model
[Diagram: attributes such as Color, Size, Texture, Weight, ... are generated from the class by a generative model.]

Discriminative Model
[Diagram: attributes such as Color, Size, Texture, Weight, ... feed a discriminative classifier, e.g. Logistic Regression.]
Comparison
• Generative models
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X = x)
• Discriminative models
– Directly assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule
  P(C|X) = P(X|C)P(C) / P(X), i.e. Posterior = Likelihood × Prior / Evidence
Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events may occur: (A) die 1 lands on side "3", (B) die 2 lands on side "1", and (C) the two dice sum to eight. Answer the following questions:
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Is P(A, C) equal to P(A)P(C)?
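One way to check the quiz is to enumerate all 36 equally likely outcomes; the short sketch below does this (illustrative only, not part of the original slides).

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely (die 1, die 2) pairs

def prob(event):
    """Probability of an event given as a predicate over an outcome (d1, d2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3            # die 1 lands on side "3"
B = lambda o: o[1] == 1            # die 2 lands on side "1"
C = lambda o: o[0] + o[1] == 8     # the two dice sum to eight

print(prob(A), prob(B), prob(C))                            # 1/6, 1/6, 5/36
print(prob(lambda o: A(o) and B(o)) / prob(B))              # P(A|B) = 1/6
print(prob(lambda o: A(o) and C(o)) / prob(A))              # P(C|A) = 1/6
print(prob(lambda o: A(o) and B(o)))                        # P(A,B) = 1/36
print(prob(lambda o: A(o) and C(o)))                        # P(A,C) = 1/36
print(prob(lambda o: A(o) and C(o)) == prob(A) * prob(C))   # False: A and C are not independent
```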
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: P(C|X), C = c1, ..., cL, X = (X1, ..., Xn)
[Diagram: a discriminative probabilistic classifier takes an input x = (x1, x2, ..., xn) and outputs the posteriors P(c1|x), P(c2|x), ..., P(cL|x).]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model: P(X|C), C = c1, ..., cL, X = (X1, ..., Xn)
[Diagram: one generative probabilistic model per class; the model for class i takes an input x = (x1, x2, ..., xn) and outputs the likelihood P(x|ci), for i = 1, ..., L.]
Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, ..., cL
• Generative classification with the MAP rule
– Apply the Bayesian rule to convert likelihoods into posterior probabilities:
  P(C = ci|X = x) = P(X = x|C = ci)P(C = ci) / P(X = x) ∝ P(X = x|C = ci)P(C = ci), for i = 1, 2, ..., L
– Then apply the MAP rule
Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C)P(C) = P(X1, ..., Xn|C)P(C)
  Difficulty: learning the joint probability P(X1, ..., Xn|C)
• Naïve Bayes classification
– Assumption that all input attributes are conditionally independent!
  P(X1, X2, ..., Xn|C) = P(X1|X2, ..., Xn; C) P(X2, ..., Xn|C)
                       = P(X1|C) P(X2, ..., Xn|C)
                       = P(X1|C) P(X2|C) ··· P(Xn|C)
– MAP classification rule: for x = (x1, x2, ..., xn),
  [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c), c ≠ c*, c = c1, ..., cL
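To see why learning the full joint P(X1, ..., Xn|C) is the hard part, a rough parameter count helps; the snippet below is only illustrative (function names are mine, not from the slides).

```python
def joint_table_size(n_attributes, n_values):
    """Free parameters in a full table P(X1, ..., Xn | C) for one class (one entry per joint value, minus 1)."""
    return n_values ** n_attributes - 1

def naive_bayes_size(n_attributes, n_values):
    """Free parameters per class when the likelihood factorizes into independent P(Xi | C)."""
    return n_attributes * (n_values - 1)

# e.g. 10 discrete attributes with 3 values each
print(joint_table_size(10, 3))   # 59048 parameters per class
print(naive_bayes_size(10, 3))   # 20 parameters per class
```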
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
– Learning Phase: Given a training set S,
  For each target value ci (ci = c1, ..., cL)
    P̂(C = ci) ← estimate P(C = ci) with examples in S;
  For every attribute value xjk of each attribute Xj (j = 1, ..., n; k = 1, ..., Nj)
    P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
  Output: conditional probability tables; for Xj, Nj × L elements
– Test Phase: Given an unknown instance X' = (a1', ..., an'),
  Look up tables to assign the label c* to X' if
  [P̂(a1'|c*) ··· P̂(an'|c*)] P̂(c*) > [P̂(a1'|c) ··· P̂(an'|c)] P̂(c), c ≠ c*, c = c1, ..., cL
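A minimal learning-phase sketch in Python (the data layout and names are my own assumptions, not from the slides): it builds the prior P̂(C = ci) and the conditional probability tables P̂(Xj = xjk|C = ci) by simple counting.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_dict, label) pairs with discrete attribute values.

    Returns the class prior and a table cond[(attribute, value, label)] = P(value | label).
    """
    class_counts = Counter(label for _, label in examples)
    prior = {c: count / len(examples) for c, count in class_counts.items()}

    value_counts = defaultdict(int)
    for attributes, label in examples:
        for attribute, value in attributes.items():
            value_counts[(attribute, value, label)] += 1

    cond = {key: count / class_counts[key[2]] for key, count in value_counts.items()}
    return prior, cond
```

The test phase then multiplies the looked-up probabilities and the class prior for each class and picks the larger product, exactly as in the Play Tennis example below.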
Example
• Example: Play Tennis
Example
• Learning Phase
Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

P(Play=Yes) = 9/14
P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance, x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– MAP rule
  P(Yes|x') ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
  P(No|x') ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
  Since P(Yes|x') < P(No|x'), we label x' as "No".
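The two scores are easy to verify; a few lines reproducing the slide's arithmetic (sketch only):

```python
from math import prod

yes = prod([2/9, 3/9, 3/9, 3/9, 9/14])  # P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) P(Play=Yes)
no  = prod([3/5, 1/5, 4/5, 3/5, 5/14])  # P(Sunny|No)  P(Cool|No)  P(High|No)  P(Strong|No)  P(Play=No)
print(round(yes, 4), round(no, 4))      # 0.0053 0.0206
print("No" if no > yes else "Yes")      # predicted label: No
```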
Relevant Issues
• Violation of Independence Assumption
– For many real world tasks, P(X1, ..., Xn|C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, during test P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0
– For a remedy, conditional probabilities are estimated with
  P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
  n: number of training examples for which C = ci
  nc: number of training examples for which Xj = ajk and C = ci
  p: prior estimate (usually p = 1/t for t possible values of Xj)
  m: weight given to the prior (number of "virtual" examples, m ≥ 1)
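A hedged sketch of this m-estimate (the names are mine; with p = 1/t and m = t it reduces to Laplace "add-one" smoothing):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(Xj = ajk | C = ci).

    n_c: training examples with Xj = ajk and C = ci
    n:   training examples with C = ci
    p:   prior estimate, usually 1/t for t possible values of Xj
    m:   weight given to the prior (number of "virtual" examples)
    """
    return (n_c + m * p) / (n + m)

# An attribute value never seen with this class (n_c = 0), with t = 3 possible values:
print(m_estimate(0, 9, p=1/3, m=3))  # 0.0833... instead of a hard zero
```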
Relevant Issues
• Continuous-valued Input Attributes
– An attribute can take unboundedly many (continuous) values
– The conditional probability is modeled with the normal distribution:
  P̂(Xj|C = ci) = (1 / (√(2π) σji)) exp(−(Xj − μji)² / (2σji²))
  μji: mean (average) of attribute values Xj of examples for which C = ci
  σji: standard deviation of attribute values Xj of examples for which C = ci
– Learning Phase: for X = (X1, ..., Xn), C = c1, ..., cL
  Output: n × L normal distributions and P(C = ci), i = 1, ..., L
– Test Phase: for X' = (X1', ..., Xn')
  • Calculate conditional probabilities with all the normal distributions
  • Apply the MAP rule to make a decision
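A compact illustrative sketch of this Gaussian variant (helper names are assumptions, not from the slides): estimate a mean and standard deviation per attribute and class in the learning phase, then apply the MAP rule in the test phase.

```python
import math
from statistics import mean, stdev

def train_gaussian_nb(examples):
    """examples: list of (feature_tuple, label); assumes at least two examples per class so stdev is defined."""
    by_class = {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    prior = {c: len(rows) / len(examples) for c, rows in by_class.items()}
    stats = {c: [(mean(col), stdev(col)) for col in zip(*rows)] for c, rows in by_class.items()}
    return prior, stats

def gaussian(x, mu, sigma):
    """Normal density used as the conditional probability model P(Xj | C = ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(x, prior, stats):
    """MAP rule: choose the class maximizing prior times the product of per-attribute Gaussian likelihoods."""
    def score(c):
        s = prior[c]
        for xj, (mu, sigma) in zip(x, stats[c]):
            s *= gaussian(xj, mu, sigma)
        return s
    return max(prior, key=score)
```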
Conclusions
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast; it only requires estimating each attribute in each class separately
– Testing is straightforward; just look up tables or calculate conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more...
Extra Slides
Naïve Bayes (1)
• Revisit
• Which is equal to
• Naïve Bayes assumes conditional independency
• Then the inference of posterior is
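In symbols, the chain these bullets refer to looks like the following (a sketch consistent with the earlier slides, writing Y for the class and X = (X1, ..., Xn)):

```latex
\begin{align*}
P(Y \mid X) &= \frac{P(X \mid Y)\,P(Y)}{P(X)} && \text{(revisit: Bayes rule)}\\
            &= \frac{P(X \mid Y)\,P(Y)}{\sum_{y} P(X \mid Y=y)\,P(Y=y)} && \text{(expanding the evidence)}\\
P(X_1,\dots,X_n \mid Y) &= \prod_{i=1}^{n} P(X_i \mid Y) && \text{(conditional independence)}\\
P(Y \mid X) &\propto P(Y)\prod_{i=1}^{n} P(X_i \mid Y) && \text{(inference of the posterior)}
\end{align*}
```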
Naïve Bayes (2)
• Training: the observations are multinomial; supervised, with label information
– Maximum Likelihood Estimation (MLE)
– Maximum a Posteriori (MAP): put a Dirichlet prior
• Classification
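As a sketch of what the two training estimators look like for multinomial counts (notation is mine: N(·) are training counts and α_x are Dirichlet hyperparameters; with all α_x = 2 the MAP estimate reduces to add-one smoothing):

```latex
\begin{align*}
\hat{P}_{\mathrm{MLE}}(X_i = x \mid Y = c) &= \frac{N(X_i = x,\, Y = c)}{N(Y = c)}\\
\hat{P}_{\mathrm{MAP}}(X_i = x \mid Y = c) &= \frac{N(X_i = x,\, Y = c) + \alpha_x - 1}{N(Y = c) + \sum_{x'}\left(\alpha_{x'} - 1\right)}
\end{align*}
```

Classification then applies the same MAP rule over classes as in the main slides.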
Naïve Bayes (3)
• What if we have continuous Xi?
• Generative training
• Prediction
Naïve Bayes (4)
• Problems
– Features may overlap
– Features may not be independent
  • e.g., the size and weight of a tiger
– A joint distribution estimate (P(X|Y), P(Y)) is used to solve a conditional problem (P(Y|X = x))
• Can we train discriminatively?
– Logistic regression
– Regularization
– Gradient ascent