43
1 Naïve Bayes Classifier Pradeep Ravikumar Co-instructor: Ziv Bar-Joseph Machine Learning 10-701

Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

1

NaïveBayesClassifier

PradeepRavikumar

Co-instructor:Ziv Bar-Joseph

MachineLearning10-701

Page 2: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

2

Goal:

Classification

SportsScienceNews

Features,X Labels,Y

ProbabilityofError

Page 3: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

0

0.5

1

OptimalClassificationOptimalpredictor:(Bayes classifier)

3

• EventheoptimalclassifiermakesmistakesR(f*)>0• Optimalclassifierdependsonunknown distribution

Bayes risk

X

Page 4: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

OptimalClassifier

Bayes Rule:

Optimalclassifier:

4

Classconditionaldensity

Classprior

Page 5: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

5

Wecannowconsiderappropriatemodelsforthetwoterms

ClassprobabilityP(Y=y),ClassconditionaldistributionoffeaturesP(X=x|Y=y)

Classconditionaldistribution

Classprobability

ModelbasedApproach

= θ = 1 − θ

ModelingClassprobabilityP(Y=y)=Bernoulli(θ)

Likeacoinflip

Page 6: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

ModelingClassConditionalDistributionofFeatures

• Gaussianclassconditionaldensities(1-dimension/feature)

6DecisionBoundary

Page 7: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

• Gaussianclassconditionaldensities (2-dimensions/features)

7

ModelingClassConditionalDistributionofFeatures

DecisionBoundary

µ1

µ1

µ2

µ2

Page 8: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Handwrittendigitrecognition

8Note:8digits shownoutof10(0,1,…,9);

Axesareobtainedbynonlineardimensionality reduction (laterincourse)

φ2(X)

φ1(X)

Multi-classclassification

Page 9: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Handwrittendigitrecognition

9

TrainingData:

GaussianBayesmodel:

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

P(X=x|Y =y)~N(μy,Σy)foreachy μy – d-dimvectorΣy - dxd matrix

1

…ngreyscaleimages

…nlabels

Input,X

Label,Y

Eachimagerepresentedasavectorofintensityvaluesatthedpixels(features)

=

2

664

X1

X2

. . .Xd

3

775

2

X

Page 10: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

GaussianBayesclassifier

10

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

P(X=x|Y =y)~N(μy,Σy)foreachy μy – d-dimvectorΣy - dxd matrix

Page 11: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

11

1p(2⇡)d|⌃y|

• Binaryclassificationwithcontinuousfeaturesdecisionboundaryissetofpointsx:P(Y=1|X=x)=P(Y=0|X=x)

IfclassconditionalfeaturedistributionP(X=x|Y=y)is2-dimGaussianN(μy,Σy)

DecisionBoundaryofGaussianBayes

P (Y = 1|X = x)

P (Y = 0|X = x)=

P (X = x|Y = 1)P (Y = 1)

P (X = x|Y = 0)P (Y = 0)

=

s|⌃0||⌃1|

exp

✓��(x� µ1)⌃

�11 (x� µ1)0

2+

(x� µ0)⌃�10 (x� µ0)0

2

◆✓

1� ✓

Note:Ingeneral,thisimpliesaquadraticequationinx.ButifΣ1=Σ0,thenquadraticpartcancelsoutandequationislinear.

Page 12: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

GaussianBayesclassifier

12

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

P(X=x|Y =y)~N(μy,Σy)foreachy μy – d-dimvectorΣy - dxd matrix

Howtolearnparameterspy,μy,Σy fromdata?

Page 13: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Howmanyparametersdoweneedtolearn?

13

Kd +Kd(d+1)/2=O(Kd2) ifdfeatures

Quadraticindimensiond!Ifd=256x256pixels,~21.5billionparameters!

Classprobability:

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

Classconditionaldistributionoffeatures:

P(X=x|Y =y)~N(μy,Σy)foreachy μy – d-dimvectorΣy - dxd matrix

K-1ifKlabels

Page 14: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Whataboutdiscretefeatures?

1414

TrainingData:

DiscreteBayesmodel:

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

P(X=x|Y =y)~Foreachlabely,maintainprobabilitytablewith2d-1entries

1

…nblack-whiteimages

…nlabels

Input,X

Label,Y

Eachimagerepresentedasavectorofdbinaryfeatures(black1orwhite0)

=

2

664

X1

X2

. . .Xd

3

775

2

X

Page 15: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Howmanyparametersdoweneedtolearn?

15

Classprobability:

P(Y=y)=py forallyin0,1,2,…,9 p0,p1,…,p9 (sumto1)

Classconditionaldistributionoffeatures:

P(X=x|Y =y)~Foreachlabely,maintainprobabilitytablewith2d-1entries

K-1ifKlabels

K(2d – 1)ifdbinaryfeatures

Exponentialindimensiond!

Page 16: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

What’swrongwithtoomanyparameters?

• Howmanytrainingdataneededtolearnoneparameter(biasofacoin)?

• Needlotsoftrainingdatatolearntheparameters!– Trainingdata>numberofparameters

16

Page 17: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NaïveBayesClassifier

17

• BayesClassifierwithadditional“naïve”assumption:– Featuresareindependentgivenclass:

– Moregenerally:

• Ifconditionalindependenceassumptionholds,NBisoptimalclassifier!Butworseotherwise.

X =

X1

X2

=

2

664

X1

X2

. . .Xd

3

775X =

X1

X2

Page 18: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

ConditionalIndependence

18

• Xisconditionallyindependent ofYgivenZ:probabilitydistributiongoverningXisindependentofthevalueofY,giventhevalueofZ

• Equivalentto:

• e.g.,Note: doesNOTmeanThunderisindependentofRain

Page 19: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Conditionalvs.MarginalIndependence

19

Page 20: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Conditionalvs.MarginalIndependence

20

Wearingcoatsisindependentofaccidentsconditionedonthefactthatitrained

Page 21: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NaïveBayesClassifier

21

• BayesClassifierwithadditional“naïve”assumption:– Featuresareindependentgivenclass:

• Howmanyparametersnow?

Page 22: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Handwrittendigitrecognition(continuousfeatures)

22

TrainingData:

Howmanyparameters?

ClassprobabilityP(Y=y)=py forally

Classconditionaldistributionoffeatures(usingNaïveBayesassumption)

P(Xi =xi|Y =y)~N(μ(y)i,σ2i(y))foreachyandeachpixeli

K-1ifKlabels

2Kd

1 2

…ngreyscaleimageswithdpixels

…nlabels

X

Y

=

2

664

X1

X2

. . .Xd

3

775

May not hold

LinearinsteadofQuadraticind!

Page 23: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Handwrittendigitrecognition(discretefeatures)

23

TrainingData:

Howmanyparameters?

ClassprobabilityP(Y=y)=py forally

Classconditionaldistributionoffeatures(usingNaïveBayesassumption)

P(Xi =xi|Y =y)– oneprobabilityvalueforeachy,pixeli

K-1ifKlabels

Kd

1 2

…nblack-white(1/0)imageswithdpixels

…nlabels

X

Y

=

2

664

X1

X2

. . .Xd

3

775

May not hold

LinearinsteadofExponentialind!

Page 24: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NaïveBayesClassifier

24

• BayesClassifierwithadditional“naïve”assumption:– Featuresareindependentgivenclass:

• Hasfewerparameters,andhencerequiresfewertrainingdata,eventhoughassumptionmaybeviolatedinpractice

Page 25: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NaïveBayes Algo – Discretefeatures

• TrainingData

• MaximumLikelihoodEstimates– ForClassprobability

– Forclassconditionaldistribution

• NBPredictionfortestdata

25

Page 26: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

IssueswithNaïveBayes

26

• Issue1: Usually,featuresarenotconditionallyindependent:

Nonetheless,NBisthesinglemostusedclassifierparticularlywhendataislimited,workswell

• Issue2: TypicallyuseMAPestimatesinsteadofMLEsinceinsufficientdatamaycauseMLEtobezero.

Page 27: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

InsufficientdataforMLE

27

• WhatifyouneverseeatraininginstancewhereX1=awhenY=b?– e.g.,b={SpamEmail},a={‘Earn’}– P(X1=a|Y=b)=0

• Thus,nomatterwhatthevaluesX2,…,Xd take:

• Whatnow???

=0

Page 28: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NaïveBayes Algo – Discretefeatures

• TrainingData

• MaximumAPosteriori(MAP)Estimates– addm“virtual”datapts

Assumegivensomepriordistribution(typicallyuniform):

MAPEstimate

Now,evenifyouneverobserveaclass/featureposteriorprobabilityneverzero.

28

#virtualexampleswithY=b

Page 29: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

CaseStudy:TextClassification

29

• Classifye-mails– Y={Spam,NotSpam}

• Classifynewsarticles– Y={whatisthetopicofthearticle?}

• Classifywebpages– Y={Student,professor,project,…}

• WhataboutthefeaturesX?– Thetext!

Page 30: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Bagofwordsapproach

30

aardvark 0

about 2

all 2

Africa 1

apple 0

anxious 0

...

gas 1

...

oil 1

Zaire 0

Page 31: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NBforTextClassification

31

• FeaturesX arethecountofhowmanytimeseachwordinthevocabularyappearsindocument

• ProbabilitytableforP(X|Y)ishuge!!!

• NBassumptionhelpsalot!!!

• Bagofwords+NaïveBayesassumptionimplyP(X|Y=y)isjusttheproduct ofprobabilityofeachword,raisedtoitscount, inadocumentontopicy

Page 32: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Bagofwordsmodel

32

• Typicaladditionalassumption– Positionindocumentdoesn’tmatter– “Bagofwords”model– orderofwordsonthepageignored– Soundsreallysilly,butoftenworksverywell!

inislecturelecture nextoverpersonrememberroomsittingthethe the toto upwakewhenyou

Page 33: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Bagofwordsmodel

33

• Typicaladditionalassumption– Positionindocumentdoesn’tmatter– “Bagofwords”model– orderofwordsonthepageignored– Soundsreallysilly,butoftenworksverywell!

Whenthelectureisover,remembertowakeupthepersonsittingnexttoyouinthelectureroom.

Page 34: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

NBwithBagofWordsfortextclassification

34

• Learningphase:– ClassPriorP(Y):fraction oftimestopicYappearsinthecollectionofdocuments

– P(w|Y):fractionoftimesword wappearsindocumentswithtopicY

• Testphase:– Foreachdocument

• UseBagofwords+naïveBayesdecisionrule

Page 35: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Twentynewsgroupsresults

35

Page 36: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Whatiffeaturesarecontinuous?

36

Eg.,characterrecognition:Xi isintensityatith pixel

GaussianNaïveBayes (GNB):

Differentmeanandvarianceforeachclasskandeachpixeli.

Sometimesassumevariance• isindependentofY(i.e.,σi),• orindependentofXi (i.e.,σk)• orboth(i.e.,σ)

Page 37: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Estimatingparameters:Ydiscrete,Xi continuous

37

Maximumlikelihoodestimates:

jth trainingimageith pixelin

jth trainingimage

kth class

Page 38: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Example:GNBforclassifyingmentalstates

38

~1mmresolution

~2imagespersec.

15,000voxels/image

non-invasive,safe

measuresBloodOxygenLevelDependent(BOLD)response

[Mitchelletal.]

Page 39: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

GaussianNaïveBayes:Learnedµvoxel,word

39

[Mitchelletal.]

15,000voxelsorfeatures

10trainingexamplesorsubjectsperclass(12wordcategories)

Page 40: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

LearnedNaïveBayes Models–MeansforP(BrainActivity |WordCategory)

40

AnimalwordsPeoplewordsPairwise classificationaccuracy:85% [Mitchelletal.]

Page 41: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Whatyoushouldknow…

41

• OptimaldecisionusingBayes Classifier• NaïveBayes classifier

– What’stheassumption– Whyweuseit– Howdowelearnit– WhyisMAPestimationimportant

• Textclassification– Bagofwordsmodel

• GaussianNB– Featuresarestillconditionallyindependent– EachfeaturehasaGaussiandistributiongivenclass

Page 42: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

GaussianNaïveBayes vs.LogisticRegression

42

• Representationequivalence(bothyieldlineardecisionboundaries)– Butonlyinaspecialcase!!!(GNBwithclass-independentvariances)

– LRmakesnoassumptionsabout P(X|Y)inlearning!!!– Optimizedifferentfunctions(MLE/MCLE)or(MAP/MCAP)! Obtaindifferentsolutions

SetofGaussianNaïveBayes parameters

(featurevarianceindependentofclasslabel)

SetofLogisticRegressionparameters

Page 43: Naïve Bayes Classifier · Naïve Bayes Classifier 17 • Bayes Classifier with additional “naïve” assumption: – Features are independent given class: – More generally: •

Discriminativevs GenerativeClassifiers

43

Generative(Modelbased)approach:e.g.NaïveBayes• AssumesomeprobabilitymodelforP(Y)andP(X|Y)• Estimateparametersofprobabilitymodelsfromtrainingdata

Discriminative(Modelfree)approach:e.g.LogisticregressionWhynotlearnP(Y|X)directly?Orbetteryet,whynotlearnthedecisionboundarydirectly?• AssumesomefunctionalformforP(Y|X)orforthedecisionboundary• Estimateparametersoffunctionalformdirectlyfromtrainingdata

OptimalClassifier: