Machine Learning: The Naïve Bayes Classifier

The Naïve Bayes Classifier - svivek.com


Page 1

Machine Learning

The Naïve Bayes Classifier

Page 2

Today's lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical concerns


Page 4

Where are we?

We have seen Bayesian learning
– Using a probabilistic criterion to select a hypothesis
– Maximum a posteriori and maximum likelihood learning

• Question: What is the difference between them?

We could also learn functions that predict probabilities of outcomes
– Different from using a probabilistic criterion to learn

Maximum a posteriori (MAP) prediction, as opposed to MAP learning


Page 6

MAP prediction

Let's use the Bayes rule for predicting y given an input x:

$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$

Posterior probability of the label being y for this input x: $P(y \mid x)$


Page 9

MAP prediction

Let's use the Bayes rule for predicting y given an input x.

Predict y for the input x using

$\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)} = \arg\max_y P(x \mid y)\, P(y)$

(The denominator P(x) does not depend on y, so it can be dropped from the argmax.)

Don't confuse this with MAP learning, which finds a hypothesis by $h_{MAP} = \arg\max_h P(h \mid D)$.

Page 10

MAP prediction

Predict y for the input x using

$\hat{y} = \arg\max_y P(x \mid y)\, P(y)$

Likelihood of observing this input x when the label is y: $P(x \mid y)$

Prior probability of the label being y: $P(y)$

All we need are these two sets of probabilities.

Page 11

Example: Tennis again

Likelihood:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

Without any other information, what is the prior probability that I should play tennis?

On days that I do play tennis, what is the probability that the temperature is T and the wind is W?

On days that I don't play tennis, what is the probability that the temperature is T and the wind is W?



Page 17

Example: Tennis again

Likelihood:

Temperature  Wind    P(T, W | Tennis = Yes)
Hot          Strong  0.15
Hot          Weak    0.4
Cold         Strong  0.1
Cold         Weak    0.35

Temperature  Wind    P(T, W | Tennis = No)
Hot          Strong  0.4
Hot          Weak    0.1
Cold         Strong  0.3
Cold         Weak    0.2

Prior:

Play tennis  P(Play tennis)
Yes          0.3
No           0.7

Input: Temperature = Hot (H), Wind = Weak (W)

Should I play tennis?

argmax_y P(H, W | play?) P(play?)

P(H, W | Yes) P(Yes) = 0.4 × 0.3 = 0.12

P(H, W | No) P(No) = 0.1 × 0.7 = 0.07

MAP prediction = Yes
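To make the arithmetic concrete, here is a minimal Python sketch that evaluates the same MAP decision rule on the likelihood and prior tables above (the dictionary and function names are illustrative, not from the slides):

```python
# MAP prediction for the tennis example: argmax_y P(x | y) P(y)

# Likelihood tables P(Temperature, Wind | Play) copied from the slide
likelihood = {
    "Yes": {("Hot", "Strong"): 0.15, ("Hot", "Weak"): 0.40,
            ("Cold", "Strong"): 0.10, ("Cold", "Weak"): 0.35},
    "No":  {("Hot", "Strong"): 0.40, ("Hot", "Weak"): 0.10,
            ("Cold", "Strong"): 0.30, ("Cold", "Weak"): 0.20},
}

# Prior P(Play)
prior = {"Yes": 0.3, "No": 0.7}

def map_predict(x):
    """Return the label maximizing P(x | y) P(y), plus all the scores."""
    scores = {y: likelihood[y][x] * prior[y] for y in prior}
    return max(scores, key=scores.get), scores

label, scores = map_predict(("Hot", "Weak"))
print(scores)   # {'Yes': 0.12, 'No': 0.07}
print(label)    # Yes
```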


Page 19

How hard is it to learn probabilistic models?

     O  T  H  W  Play?
 1   S  H  H  W   -
 2   S  H  H  S   -
 3   O  H  H  W   +
 4   R  M  H  W   +
 5   R  C  N  W   +
 6   R  C  N  S   -
 7   O  C  N  S   +
 8   S  M  H  W   -
 9   S  C  N  W   +
10   R  M  N  W   +
11   S  M  N  S   +
12   O  M  H  S   +
13   O  H  N  W   +
14   R  M  H  S   -

Outlook: S(unny), O(vercast), R(ainy)
Temperature: H(ot), M(edium), C(ool)
Humidity: H(igh), N(ormal), L(ow)
Wind: S(trong), W(eak)

We need to learn
1. The prior P(Play?)
2. The likelihoods P(X | Play?)

Page 20

How hard is it to learn probabilistic models?

(Training data: the Play? dataset from Page 19)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(x1, x2, x3, x4 | Play?)
• (2^4 – 1) parameters in each case (one for each assignment), if each feature were binary


Page 23

How hard is it to learn probabilistic models?

(Training data: the Play? dataset from Page 19; the four features take 3, 3, 3 and 2 values respectively)

Prior P(Play?)
• A single number (Why only one?)

Likelihood P(X | Play?)
• There are 4 features
• For each value of Play? (+/-), we need a value for each possible assignment: P(x1, x2, x3, x4 | Play?)
• (3 ⋅ 3 ⋅ 3 ⋅ 2 − 1) parameters in each case (one for each assignment)
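As a quick check, the counts above work out as follows:

$(3 \cdot 3 \cdot 3 \cdot 2) - 1 = 53$ likelihood parameters per label, so $2 \times 53 = 106$ likelihood parameters in total, plus one number for the prior, all to be estimated from just 14 training examples.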


Page 25

How hard is it to learn probabilistic models?

(Training data: the Play? dataset from Page 19)

In general:

Prior P(Y)
• If there are k labels, then k – 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features:
• We need a value for each possible P(x1, x2, ..., xd | y), for each y
• k(2^d – 1) parameters

Need a lot of data to estimate this many numbers!
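To see how quickly the full joint model blows up compared with what the naïve Bayes assumption (introduced below) will need, here is a small illustrative Python snippet (not from the slides):

```python
# Number of parameters to estimate for k labels and d Boolean features:
#   full joint likelihood P(x1, ..., xd | y): k * (2**d - 1)
#   naive Bayes (one Bernoulli per feature):  k * d
k = 2
for d in (4, 10, 20, 30):
    full_joint = k * (2**d - 1)
    naive_bayes = k * d
    print(f"d={d:2d}  full joint: {full_joint:>14,}   naive Bayes: {naive_bayes}")
```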


Page 29

How hard is it to learn probabilistic models?

Prior P(Y)
• If there are k labels, then k – 1 parameters (why not k?)

Likelihood P(X | Y)
• If there are d Boolean features:
• We need a value for each possible P(x1, x2, ..., xd | y), for each y
• k(2^d – 1) parameters

Need a lot of data to estimate this many numbers!

High model complexity

If there is very limited data, high variance in the parameters

How can we deal with this?

Answer: Make independence assumptions

Page 30

Recall: Conditional independence

Suppose X, Y and Z are random variables.

X is conditionally independent of Y given Z if the probability distribution of X is independent of the value of Y when Z is observed:

$P(X \mid Y, Z) = P(X \mid Z)$

Or equivalently

$P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$


Page 32

Modeling the features

$P(x_1, x_2, \cdots, x_d \mid y)$ required $k(2^d - 1)$ parameters.

What if all the features were conditionally independent given the label?

That is, $P(x_1, x_2, \cdots, x_d \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_d \mid y)$

Requires only d numbers for each label. kd parameters overall. Not bad!

This is the Naïve Bayes Assumption.


Page 35

The Naïve Bayes Classifier

Assumption: Features are conditionally independent given the label Y

To predict, we need two sets of probabilities
– Prior P(y)
– For each xj, we have the likelihood P(xj | y)

Decision rule

$h_{NB}(\boldsymbol{x}) = \arg\max_y P(y)\, P(x_1, x_2, \cdots, x_d \mid y) = \arg\max_y P(y) \prod_j P(x_j \mid y)$
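A minimal Python sketch of this decision rule (assuming the prior and per-feature likelihood tables have already been estimated; the data layout and names here are illustrative, not from the slides). Summing log probabilities avoids numerical underflow when there are many features:

```python
import math

def nb_predict(x, prior, likelihood):
    """Naive Bayes decision rule: argmax_y  log P(y) + sum_j log P(x_j | y).

    prior:      dict  label -> P(y)
    likelihood: dict  label -> list of dicts, one per feature j,
                mapping a feature value -> P(x_j = value | y)
    x:          list of feature values, one per feature j
    """
    best_label, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for j, value in enumerate(x):
            score += math.log(likelihood[y][j][value])
        if score > best_score:
            best_label, best_score = y, score
    return best_label
```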


Page 37

Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Consider the two class case. We predict the label to be + if

$P(y = +) \prod_j P(x_j \mid y = +) \;>\; P(y = -) \prod_j P(x_j \mid y = -)$

or equivalently, if

$\frac{P(y = +) \prod_j P(x_j \mid y = +)}{P(y = -) \prod_j P(x_j \mid y = -)} > 1$

Page 38

Decision boundaries of naïve Bayes

What is the decision boundary of the naïve Bayes classifier?

Taking the log and simplifying, we get

$\log \frac{P(y = - \mid \boldsymbol{x})}{P(y = + \mid \boldsymbol{x})} = \boldsymbol{w}^T \boldsymbol{x} + b$

This is a linear function of the feature space!

Easy to prove. See the note on the course website.
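A sketch of why this holds for binary features $x_j \in \{0, 1\}$ (the symbols $a_j$, $b_j$, $w_j$ and $b$ below are notation introduced here for the sketch, not taken from the slide): write $a_j = P(x_j = 1 \mid y = +)$ and $b_j = P(x_j = 1 \mid y = -)$. Since the evidence $P(\boldsymbol{x})$ cancels in the ratio of posteriors,

$\log \frac{P(y = - \mid \boldsymbol{x})}{P(y = + \mid \boldsymbol{x})} = \log \frac{P(y = -)}{P(y = +)} + \sum_j \left[ x_j \log \frac{b_j}{a_j} + (1 - x_j) \log \frac{1 - b_j}{1 - a_j} \right] = \boldsymbol{w}^T \boldsymbol{x} + b$

with $w_j = \log \frac{b_j (1 - a_j)}{a_j (1 - b_j)}$ and $b = \log \frac{P(y = -)}{P(y = +)} + \sum_j \log \frac{1 - b_j}{1 - a_j}$, which is linear in $\boldsymbol{x}$.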

Page 39

Today's lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical concerns


Page 43

Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

Suppose we have a dataset $D = \{(\boldsymbol{x}_i, y_i)\}$ with m examples.

A note on convention for this section:
• Examples in the dataset are indexed by the subscript $i$ (e.g. $\boldsymbol{x}_i$)
• Features within an example are indexed by the subscript $j$
• The $j$-th feature of the $i$-th example will be $x_{ij}$

Page 44

Learning the naïve Bayes Classifier

• What is the hypothesis function h defined by?
  – A collection of probabilities
    • Prior for each label: P(y)
    • Likelihoods for feature xj given a label: P(xj | y)

If we have a dataset $D = \{(\boldsymbol{x}_i, y_i)\}$ with m examples, and we want to learn the classifier in a probabilistic way:
– What is a probabilistic criterion to select the hypothesis?

Page 45

Learning the naïve Bayes Classifier

Maximum likelihood estimation:

$h_{ML} = \arg\max_h P(D \mid h)$

Here h is defined by all the probabilities used to construct the naïve Bayes decision.


Page 47

Maximum likelihood estimation

Given a dataset $D = \{(\boldsymbol{x}_i, y_i)\}$ with m examples:

$h_{ML} = \arg\max_h P(D \mid h) = \arg\max_h \prod_{i=1}^{m} P(\boldsymbol{x}_i, y_i \mid h)$

Each example in the dataset is independent and identically distributed, so we can represent P(D | h) as this product.

Each factor asks: "What probability would this particular h assign to the pair $(\boldsymbol{x}_i, y_i)$?"


Page 49

Maximum likelihood estimation

Given a dataset $D = \{(\boldsymbol{x}_i, y_i)\}$ with m examples:

$h_{ML} = \arg\max_h \prod_{i=1}^{m} P(\boldsymbol{x}_i, y_i \mid h) = \arg\max_h \prod_{i=1}^{m} P(y_i \mid h) \prod_j P(x_{ij} \mid y_i, h)$

The second equality uses the Naïve Bayes assumption; $x_{ij}$ is the $j$-th feature of $\boldsymbol{x}_i$.
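Taking the log turns these products into sums, which is the form one typically maximizes:

$\log P(D \mid h) = \sum_{i=1}^{m} \left[ \log P(y_i \mid h) + \sum_j \log P(x_{ij} \mid y_i, h) \right]$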


Page 53

Learning the naïve Bayes Classifier

Maximum likelihood estimation

What next?

We need to make a modeling assumption about the functional form of these probability distributions.


Page 57

Learning the naïve Bayes Classifier

Maximum likelihood estimation

For simplicity, suppose there are two labels 1 and 0 and all features are binary.

• Prior: P(y = 1) = p and P(y = 0) = 1 – p

• Likelihood for each feature given a label:
  • P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj
  • P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 – bj

h consists of p, all the a's and all the b's.


Page 59

Learning the naïve Bayes Classifier

Maximum likelihood estimation

• Prior: P(y = 1) = p and P(y = 0) = 1 – p, which can be written compactly as

$P(y_i) = p^{[y_i = 1]} (1 - p)^{[y_i = 0]}$

[z] is called the indicator function or the Iverson bracket. Its value is 1 if the argument z is true and zero otherwise.

Page 60

Learning the naïve Bayes Classifier

Maximum likelihood estimation

Likelihood for each feature given a label:
• P(xj = 1 | y = 1) = aj and P(xj = 0 | y = 1) = 1 – aj
• P(xj = 1 | y = 0) = bj and P(xj = 0 | y = 0) = 1 – bj

With the same Iverson bracket notation:

$P(x_{ij} \mid y_i) = \left( a_j^{[x_{ij} = 1]} (1 - a_j)^{[x_{ij} = 0]} \right)^{[y_i = 1]} \left( b_j^{[x_{ij} = 1]} (1 - b_j)^{[x_{ij} = 0]} \right)^{[y_i = 0]}$


Page 63

Learning the naïve Bayes Classifier

Substituting and deriving the argmax, we get the count-based estimates

$P(y = 1) = p = \frac{|\{i : y_i = 1\}|}{m}$

$P(x_j = 1 \mid y = 1) = a_j = \frac{|\{i : x_{ij} = 1 \text{ and } y_i = 1\}|}{|\{i : y_i = 1\}|}$

$P(x_j = 1 \mid y = 0) = b_j = \frac{|\{i : x_{ij} = 1 \text{ and } y_i = 0\}|}{|\{i : y_i = 0\}|}$
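A small Python sketch of these estimates (illustrative; the function name and data layout are assumptions, not from the slides):

```python
def mle_estimates(X, y):
    """Maximum likelihood estimates for binary-feature, binary-label naive Bayes.

    X: list of examples, each a list of 0/1 feature values
    y: list of 0/1 labels (assumes both labels appear in the data)
    Returns (p, a, b): prior P(y=1), and per-feature a_j = P(x_j=1 | y=1),
    b_j = P(x_j=1 | y=0), obtained by counting and normalizing.
    Zero counts for individual features are the problem smoothing fixes later.
    """
    m, d = len(X), len(X[0])
    pos = [X[i] for i in range(m) if y[i] == 1]
    neg = [X[i] for i in range(m) if y[i] == 0]
    p = len(pos) / m
    a = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    b = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    return p, a, b
```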


Page 68

Let's learn a naïve Bayes classifier

     O  T  H  W  Play?
 1   S  H  H  W   -
 2   S  H  H  S   -
 3   O  H  H  W   +
 4   R  M  H  W   +
 5   R  C  N  W   +
 6   R  C  N  S   -
 7   O  C  N  S   +
 8   S  M  H  W   -
 9   S  C  N  W   +
10   R  M  N  W   +
11   S  M  N  S   +
12   O  M  H  S   +
13   O  H  N  W   +
14   R  M  H  S   -

P(Play = +) = 9/14    P(Play = -) = 5/14

P(O = S | Play = +) = 2/9

P(O = R | Play = +) = 3/9

P(O = O | Play = +) = 4/9

And so on, for the other attributes and also for Play = -
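The counts above can be reproduced with a few lines of Python (a sketch; the data encoding below is an assumption made for illustration):

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, Play?) rows from the table
data = [("S","H","H","W","-"), ("S","H","H","S","-"), ("O","H","H","W","+"),
        ("R","M","H","W","+"), ("R","C","N","W","+"), ("R","C","N","S","-"),
        ("O","C","N","S","+"), ("S","M","H","W","-"), ("S","C","N","W","+"),
        ("R","M","N","W","+"), ("S","M","N","S","+"), ("O","M","H","S","+"),
        ("O","H","N","W","+"), ("R","M","H","S","-")]

# Priors: fraction of examples with each label
labels = Counter(row[-1] for row in data)
print({y: f"{c}/{len(data)}" for y, c in labels.items()})        # {'-': '5/14', '+': '9/14'}

# Likelihoods for the Outlook feature given Play = +
outlook_given_plus = Counter(row[0] for row in data if row[-1] == "+")
print({v: f"{c}/{labels['+']}" for v, c in outlook_given_plus.items()})
# {'O': '4/9', 'R': '3/9', 'S': '2/9'}
```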

Page 69

Naïve Bayes: Learning and Prediction

• Learning
– Count how often features occur with each label. Normalize to get likelihoods.
– Priors from fraction of examples with each label
– Generalizes to multiclass

• Prediction
– Use learned probabilities to find the highest scoring label

Page 70

Today's lecture

• The naïve Bayes Classifier

• Learning the naïve Bayes Classifier

• Practical concerns + an example

Page 71

Important caveats with Naïve Bayes

1. Features need not be conditionally independent given the label
– Just because we assume that they are doesn't mean that that's how they behave in nature
– We made a modeling assumption because it makes computation and learning easier

2. Not enough training data to get good estimates of the probabilities from counts

Page 72

Important caveats with Naïve Bayes

1. Features are not conditionally independent given the label

All bets are off if the naïve Bayes assumption is not satisfied.

And yet, naïve Bayes is very often used in practice because of its simplicity. It works reasonably well even when the assumption is violated.


Page 75

Important caveats with Naïve Bayes

2. Not enough training data to get good estimates of the probabilities from counts

The basic operation for learning likelihoods is counting how often a feature occurs with a label.

What if we never see a particular feature with a particular label?
Eg: Suppose we never observe Temperature = cold with PlayTennis = Yes.

Should we treat those counts as zero? But that will make the probabilities zero.

Answer: Smoothing
• Add fake counts (very small numbers, so that the counts are not zero)
• The Bayesian interpretation of smoothing: priors on the hypothesis (MAP learning)
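A common concrete choice is add-one (Laplace) smoothing; the sketch below is illustrative, and the particular fake count of 1 is an assumption, since the slides only say to add small fake counts:

```python
def smoothed_likelihood(feature_label_count, label_count, num_feature_values, alpha=1.0):
    """Estimate P(x_j = v | y) with additive (Laplace) smoothing.

    feature_label_count: #{examples with x_j = v and label y}
    label_count:         #{examples with label y}
    num_feature_values:  number of possible values of feature x_j
    alpha:               fake count added to every (value, label) pair
    """
    return (feature_label_count + alpha) / (label_count + alpha * num_feature_values)

# A value never observed with a label, out of 9 examples of that label,
# for a feature with 3 possible values:
print(smoothed_likelihood(0, 9, 3))   # 1/12 instead of 0
```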

Page 76

Example: Classifying text

• Instance space: Text documents
• Labels: Spam or NotSpam

• Goal: To learn a function that can predict whether a new document is Spam or NotSpam

How would you build a Naïve Bayes classifier?

Let us brainstorm:
How to represent documents?
How to estimate probabilities?
How to classify?


Pages 81-82

Example: Classifying text

1. Represent documents by a vector of words
   – A sparse vector consisting of one feature per word

2. Learning from N labeled documents
   1. Priors: how often does each label occur?
   2. For each word w in the vocabulary: how often does the word occur with each label? (estimated with smoothing)
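Putting the pieces together, here is a compact Python sketch of a naïve Bayes text classifier with one presence/absence feature per vocabulary word (illustrative code with add-one smoothing as one concrete choice; none of the names come from the slides):

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of labels (e.g. 'Spam'/'NotSpam')."""
    n = len(docs)
    label_counts = Counter(labels)
    doc_freq = defaultdict(Counter)             # label -> Counter of documents containing each word
    for tokens, y in zip(docs, labels):
        doc_freq[y].update(set(tokens))         # presence/absence features
    vocab = {w for counts in doc_freq.values() for w in counts}
    prior = {y: c / n for y, c in label_counts.items()}
    # Smoothed P(word present | label) for every vocabulary word
    likelihood = {y: {w: (doc_freq[y][w] + alpha) / (label_counts[y] + 2 * alpha)
                      for w in vocab}
                  for y in label_counts}
    return prior, likelihood, vocab

def predict(tokens, prior, likelihood, vocab):
    present = set(tokens)
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for w in vocab:                          # one binary feature per vocabulary word
            p = likelihood[y][w]
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[y] = score
    return max(scores, key=scores.get)
```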

Page 83

Continuous features

• So far, we have been looking at discrete features
– P(xj | y) is a Bernoulli trial (i.e. a coin toss)

• We could model P(xj | y) with other distributions too
– This is a separate assumption from the independence assumption that naive Bayes makes
– Eg: For real valued features, (Xj | Y) could be drawn from a normal distribution

• Exercise: Derive the maximum likelihood estimate when the features are assumed to be drawn from the normal distribution
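As a sketch of how this looks at prediction time (the estimation itself is left as the exercise above; the function names are illustrative and assume the per-class means and variances have already been estimated somehow), the Gaussian density simply replaces the Bernoulli likelihood in the decision rule:

```python
import math

def gaussian_log_likelihood(x_j, mean, var):
    """log N(x_j; mean, var), used as log P(x_j | y) for a real-valued feature."""
    return -0.5 * (math.log(2 * math.pi * var) + (x_j - mean) ** 2 / var)

def gnb_score(x, log_prior, means, variances):
    """Score one label: log P(y) + sum_j log N(x_j; mean_jy, var_jy)."""
    return log_prior + sum(gaussian_log_likelihood(x_j, m, v)
                           for x_j, m, v in zip(x, means, variances))
```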

Page 84

Summary: Naïve Bayes

• Independence assumption
– All features are independent of each other given the label

• Maximum likelihood learning: learning is simple
– Generalizes to real valued features

• Prediction via MAP estimation
– Generalizes beyond binary classification

• Important caveats to remember
– Smoothing
– The independence assumption may not be valid

• Decision boundary is linear for binary classification