CS4705: Probability Review and Naïve Bayes
Slides from Dragomir Radev and modified
Announcements
• Reading for today: C.4, 4.5 NLP
• Reading for next class: C3, NLP
• Next class will be taught by Chris Kedzie
• For new students in class:
• No laptop policy
• Class participation using Poll Everywhere or in-class comments
Today
• SciKit Learn tutorial
• Wrap up on optimization
• Generative methods
Regularization
• Consider the case where one or more documents are mislabeled
• Text from a novel may be mislabeled as social media if posted as a quote
• The classifier will attempt to learn weights that promote words characteristic of novels as predictors of social media
• Overfitting can also occur when the social media documents in the training set are not representative
Loss
• To prevent overfitting, a regularization term R(Θ) is added to the loss:
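The regularized objective itself did not survive the slide export; a standard reconstruction (with λ as an assumed regularization-strength hyperparameter, n training examples, and loss L) is:

```latex
\hat{\Theta} = \arg\min_{\Theta}\; \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i;\Theta),\, y_i\big) \;+\; \lambda\, R(\Theta)
```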
Two Common Regularizers
• L2 regularization
• Keeps the sum of squares of the parameter values low
• Gaussian prior or weight decay (here W is the weights, not including b)
• Prefers to decrease one parameter with a high weight rather than ten parameters with low weights
• L1 regularization
• Keeps the sum of absolute values of the parameters low
• Punishes high and low values uniformly
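The contrast between the two penalties can be sketched in a few lines; the weight vectors below are made-up examples:

```python
import numpy as np

def l2_penalty(w):
    # Sum of squares of the parameter values (Gaussian prior / weight decay)
    return np.sum(w ** 2)

def l1_penalty(w):
    # Sum of absolute values; penalizes all magnitudes uniformly
    return np.sum(np.abs(w))

# One large weight vs. the same total mass spread over ten small weights
concentrated = np.array([10.0])
spread = np.full(10, 1.0)

# L2 strongly prefers the spread-out configuration...
print(l2_penalty(concentrated), l2_penalty(spread))  # 100.0 vs 10.0
# ...while L1 penalizes both configurations equally
print(l1_penalty(concentrated), l1_penalty(spread))  # 10.0 vs 10.0
```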
Gradient-Based Optimization
• Repeat until L (loss) < margin:
• Compute L over the training set
• Compute the gradient of L with respect to Θ
• Move the parameters in the opposite direction of the gradient
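The loop above can be sketched for least-squares linear regression; the learning rate, margin, and data here are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, margin=1e-6, max_iters=10_000):
    """Repeat until L < margin: compute loss, compute gradient, step against it."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = X @ theta
        loss = np.mean((preds - y) ** 2)       # L over the training set
        if loss < margin:
            break
        grad = 2 * X.T @ (preds - y) / len(y)  # gradient of L w.r.t. theta
        theta -= lr * grad                     # move opposite the gradient
    return theta

# Recover the weights [2, -1] from noise-free synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])
theta = gradient_descent(X, y)
print(theta)  # close to [ 2. -1.]
```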
Stochastic Gradient Descent
Problem
• The error is calculated based on just one training sample
• May not be representative of corpus-wide loss
• Instead, calculate the error based on a set of training examples: a minibatch
• → Minibatch stochastic gradient descent
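A minimal sketch of the minibatch variant, assuming a least-squares loss and made-up data; each step estimates the gradient from a small random subset rather than one sample or the whole set:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=16, lr=0.1, epochs=200, seed=0):
    """Estimate the gradient on a random minibatch at each update step."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))          # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5])
theta = minibatch_sgd(X, y)
print(theta)  # close to [ 1.5 -0.5]
```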
Computing Gradients
Summary
• Smoothing helps to account for zero-valued n-grams
• Text classification using feature vectors representing n-grams and other properties
• Discriminative learning
• Methods for optimization, loss functions, and regularization
Classification Using a Generative Approach
• Start with Naïve Bayes and maximum likelihood estimation
• But we need some background in probability first
Probabilities in NLP
• Very important for language processing
• Example in speech recognition:
• "recognize speech" vs. "wreck a nice beach"
• Example in machine translation:
• "l'avocat général": "the attorney general" vs. "the general avocado"
• Example in information retrieval:
• If a document includes three occurrences of "stir" and one of "rice", what is the probability that it is a recipe?
• Probabilities make it possible to combine evidence from multiple sources systematically
Probabilities
• Probability theory
• predicting how likely it is that something will happen
• Experiment (trial)
• e.g., tossing a coin
• Possible outcomes
• heads or tails
• Sample spaces
• discrete (number of "rice") or continuous (e.g., temperature)
• Events
• Ω is the certain event
• ∅ is the impossible event
• Event space: all possible events
Sample Space
• Random experiment: an experiment with an uncertain outcome
• e.g., flipping a coin, picking a word from text
• Sample space: all possible outcomes, e.g.:
• Tossing 2 fair coins: Ω = {HH, HT, TH, TT}
Events
• Event: a subspace of the sample space
• E ⊆ Ω; E happens iff the outcome is in E, e.g.:
• E = {HH} (all heads)
• E = {HH, TT} (same face)
• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
• P(Ω) = 1 (the outcome is always in Ω)
• P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (e.g., A = same face, B = different face)
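These axioms can be checked directly on the two-coin sample space with a short enumeration:

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two fair coins
omega = set(product("HT", repeat=2))  # {HH, HT, TH, TT} as tuples

def p(event):
    # Uniform probability over equally likely outcomes: |E| / |Omega|
    return Fraction(len(event), len(omega))

all_heads = {("H", "H")}
same_face = {("H", "H"), ("T", "T")}
diff_face = omega - same_face

assert p(omega) == 1                                             # P(Omega) = 1
assert p(set()) == 0                                             # P(empty) = 0
assert p(same_face | diff_face) == p(same_face) + p(diff_face)   # disjoint union
print(p(all_heads), p(same_face))  # 1/4 1/2
```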
Example: Toss a Die
• Sample space: Ω = {1, 2, 3, 4, 5, 6}
• Fair die:
• p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die: p(1) = 0.3, p(2) = 0.2, ...
• N-dimensional die:
• Ω = {1, 2, 3, 4, …, N}
• Example in modeling text:
• Toss a die to decide which word to write in the next position
• Ω = {cat, dog, tiger, …}
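The "die as word chooser" idea can be sketched with an assumed three-word vocabulary and made-up probabilities:

```python
import random

# Hypothetical "N-dimensional die" over a tiny vocabulary
vocab = ["cat", "dog", "tiger"]
probs = [0.5, 0.3, 0.2]  # unfair die: weights assumed for illustration

random.seed(0)
# Each toss of the die picks the word for the next position
words = random.choices(vocab, weights=probs, k=10)
print(" ".join(words))
```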
Example: Flip a Coin
• Ω = {Head, Tail}
• Fair coin:
• p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
• p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
• Sample space: {HH, HT, TH, TT}
• Example in modeling text:
• Flip a coin to decide whether or not to include a word in a document
• Sample space = {appear, absence}
Probabilities
• Probabilities
• numbers between 0 and 1
• Probability distribution
• distributes a probability mass of 1 throughout the sample space Ω
• Example:
• A fair coin is tossed three times.
• What is the probability of 3 heads?
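The question can be answered by enumerating the sample space: each of the 2³ = 8 outcomes is equally likely, so P(3 heads) = 1/8. A quick check:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=3))  # all 8 equally likely outcomes
three_heads = [o for o in omega if o == ("H", "H", "H")]
p = Fraction(len(three_heads), len(omega))
print(p)  # 1/8
```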
Probabilities
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional probability: P(A|B) = P(A ∩ B) / P(B)
• P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
• So, P(A|B) = P(B|A) P(A) / P(B) (Bayes' rule)
• For independent events, P(A ∩ B) = P(A) P(B), so P(A|B) = P(A)
• Total probability: if A1, …, An form a partition of S, then
• P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An)
• So, P(Ai|B) = P(B|Ai) P(Ai) / P(B) = P(B|Ai) P(Ai) / [P(B|A1) P(A1) + … + P(B|An) P(An)]
• This allows us to compute P(Ai|B) based on P(B|Ai)
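A worked instance of total probability and Bayes' rule, using made-up numbers for a two-event partition:

```python
from fractions import Fraction as F

# Hypothetical partition A1, A2 of the sample space (numbers are made up)
p_a = {"A1": F(3, 10), "A2": F(7, 10)}        # priors P(Ai)
p_b_given_a = {"A1": F(4, 5), "A2": F(1, 5)}  # likelihoods P(B|Ai)

# Total probability: P(B) = sum over i of P(B|Ai) P(Ai)
p_b = sum(p_b_given_a[a] * p_a[a] for a in p_a)

# Bayes' rule: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = {a: p_b_given_a[a] * p_a[a] / p_b for a in p_a}
print(p_b, posterior)
```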
Properties of Probabilities
• p(∅) = 0
• P(certain event) = 1
• p(X) ≤ p(Y) if X ⊆ Y
• p(X ∪ Y) = p(X) + p(Y) if X ∩ Y = ∅
Conditional Probability
• Prior and posterior probability
• Conditional probability
• P(A|B) = P(A ∩ B) / P(B)
[Venn diagram: events A and B overlapping inside Ω, with the intersection A ∩ B shaded]
Conditional Probability
• Six-sided fair die
• P(D even) = ?
• P(D ≥ 4) = ?
• P(D even | D ≥ 4) = ?
• P(D odd | D ≥ 4) = ?
• Multiple conditions
• P(D odd | D ≥ 4, D ≤ 5) = ?
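The questions above can be answered by enumerating the six equally likely outcomes; a minimal sketch:

```python
from fractions import Fraction

omega = range(1, 7)  # six-sided fair die

def p(cond):
    # P(A) = |A| / |Omega| for a fair die
    return Fraction(sum(1 for d in omega if cond(d)), 6)

def p_given(a, b):
    # P(A|B) = P(A and B) / P(B), computed by counting outcomes
    num = sum(1 for d in omega if a(d) and b(d))
    den = sum(1 for d in omega if b(d))
    return Fraction(num, den)

even = lambda d: d % 2 == 0
odd = lambda d: d % 2 == 1
ge4 = lambda d: d >= 4

print(p(even))                                    # 1/2
print(p(ge4))                                     # 1/2
print(p_given(even, ge4))                         # 2/3
print(p_given(odd, ge4))                          # 1/3
print(p_given(odd, lambda d: ge4(d) and d <= 5))  # 1/2
```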
Independence
• Two events are independent when P(A ∩ B) = P(A) P(B)
• Unless P(B) = 0, this is equivalent to saying that P(A) = P(A|B)
• If two events are not independent, they are considered dependent
[slide from Brendan O'Connor]
Naïve Bayes Classifier
• We use Bayes' rule:
• P(C|D) = P(D|C) P(C) / P(D), where C = class, D = document
• We can simplify and ignore P(D) since it is independent of the class choice:
• P(C|D) ≅ P(D|C) P(C) ≅ P(C) Π_{i=1..n} P(wi|C)
• This estimates the probability of D being in class C, assuming that D has n tokens and wi is a token in D.
Use Labeled Training Data
• P(C) is estimated as the number of labeled documents in the class divided by the total number of documents:
• P(C) = Dc / D
• P(wi|C) is estimated as the number of times wi occurs with label C divided by the number of times all words in the vocabulary V occur with label C:
• P(wi|C) = Count(wi, C) / Σ_{v∈V} Count(v, C)
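Both estimates can be computed from raw counts; the tiny labeled corpus below is hypothetical:

```python
from collections import Counter
from fractions import Fraction

# Tiny made-up labeled corpus: (tokens, class)
docs = [
    ("stir the rice".split(), "recipe"),
    ("stir stir fry".split(), "recipe"),
    ("the attorney general".split(), "news"),
]

classes = Counter(c for _, c in docs)
word_counts = {c: Counter() for c in classes}
for tokens, c in docs:
    word_counts[c].update(tokens)

def p_class(c):
    # P(C) = Dc / D
    return Fraction(classes[c], len(docs))

def p_word_given_class(w, c):
    # P(wi|C) = Count(wi, C) / sum over v in V of Count(v, C)
    return Fraction(word_counts[c][w], sum(word_counts[c].values()))

print(p_class("recipe"))                     # 2/3
print(p_word_given_class("stir", "recipe"))  # 3/6 = 1/2
```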
Multinomial Naïve Bayes: Independence Assumptions
• Bag-of-words assumption
• Assume position doesn't matter
• Conditional independence
• Assume the feature probabilities P(wi|c) are independent given the class c:
• P(w1, …, wn | c) = Π_{i=1..n} P(wi|c)
[Jurafsky and Martin]
Multinomial Naïve Bayes Classifier
• C_MAP = argmax_C P(w1, …, wn | C) P(C)
• C_NB = argmax_Cj P(Cj) Π_{w∈W} P(w|Cj)
This is why it's naïve!
[Jurafsky and Martin]
Laplace Smoothing: Needed because counts may be zero
P̂(wi | c) = (count(wi, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
           = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)

compared with the unsmoothed estimate:

P̂(wi | c) = count(wi, c) / Σ_{w∈V} count(w, c)
[Jurafsky and Martin]
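The add-1 estimate can be computed directly from counts; the counts and vocabulary below are made up:

```python
from collections import Counter
from fractions import Fraction

def p_hat(w, c, word_counts, vocab):
    # Add-1 (Laplace) estimate: (count(w, c) + 1) / (sum of counts in c + |V|)
    return Fraction(word_counts[c][w] + 1,
                    sum(word_counts[c].values()) + len(vocab))

word_counts = {"recipe": Counter({"stir": 3, "rice": 1})}
vocab = {"stir", "rice", "attorney"}

print(p_hat("stir", "recipe", word_counts, vocab))      # (3+1)/(4+3) = 4/7
print(p_hat("attorney", "recipe", word_counts, vocab))  # (0+1)/(4+3) = 1/7, not 0
```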
Questions?
SciKit Learn
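A minimal scikit-learn sketch of the pipeline this deck discusses: bag-of-words counts feeding a multinomial Naïve Bayes classifier. The training texts are made up, and MultinomialNB's default alpha=1.0 corresponds to add-1 (Laplace) smoothing:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set (made-up examples)
texts = [
    "stir the rice and season",
    "stir fry the vegetables",
    "the attorney general spoke",
    "the general election results",
]
labels = ["recipe", "recipe", "news", "news"]

vectorizer = CountVectorizer()   # bag-of-words count features
X = vectorizer.fit_transform(texts)
clf = MultinomialNB(alpha=1.0)   # alpha=1.0 is add-1 (Laplace) smoothing
clf.fit(X, labels)

test = vectorizer.transform(["stir the rice"])
print(clf.predict(test))  # ['recipe']
```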