Text Classification & Linear Models
CMSC 723 / LING 723 / INST 725
Marine Carpuat
Slides credit: Dan Jurafsky & James Martin, Jacob Eisenstein
Logistics / Reminders
• Homework 1: due Thursday Sep 7 by 12pm.
• Project 1 coming up
• Thursday lecture time: project set-up office hour in CSIC 1121
Recap: Word Meaning
2 core issues from an NLP perspective
• Semantic similarity: given two words, how similar are they in meaning?
  • Key concepts: vector semantics, PPMI and its variants, cosine similarity
• Word sense disambiguation: given a word that has more than one meaning, which one is used in a specific context?
  • Key concepts: word sense, WordNet and sense inventories, unsupervised disambiguation (Lesk), supervised disambiguation
Today
• Text classification problems
  • and their evaluation
• Linear classifiers
  • Features & Weights
  • Bag of words
  • Naïve Bayes
Text classification
Is this spam?
From: "Fabian Starr" <[email protected]>
Subject: Hey! Sofware for the funny prices!
Get the great discounts on popular software today for PC and Macintosh
http://iiled.org/Cj4Lmx
70-90% Discounts from retail price!!!
All sofware is instantly available to download - No Need Wait!
What is the subject of this article?
MEDLINE Article → ? → MeSH Subject Category Hierarchy
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes Y = {y1, y2, …, yJ}
• Output: a predicted class y ∈ Y
Classification Methods: Hand-coded rules
• Rules based on combinations of words or other features
  • spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
  • If rules are carefully refined by an expert
• But building and maintaining these rules is expensive
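The spam rule above can be written directly as a predicate. This is only a minimal sketch: the function name `is_spam`, the `BLACKLIST` set, and the example addresses are hypothetical and chosen for illustration.

```python
# Hypothetical black-list; the address is a placeholder for illustration.
BLACKLIST = {"spammer@example.com"}


def is_spam(sender: str, body: str) -> bool:
    """Hand-coded rule: black-listed sender OR ("dollars" AND "have been selected")."""
    text = body.lower()
    return sender.lower() in BLACKLIST or (
        "dollars" in text and "have been selected" in text
    )


print(is_spam("friend@example.com", "You have been selected to win 1000 dollars"))
print(is_spam("friend@example.com", "Lunch on Thursday?"))
```

Every new spam pattern needs another hand-written clause, which is where the maintenance cost comes from.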
Classification Methods: Supervised Machine Learning
• Input
  • a document d
  • a fixed set of classes Y = {y1, y2, …, yJ}
  • a training set of m hand-labeled documents (d1, y1), ..., (dm, ym)
• Output
  • a learned classifier d → y
Aside: getting examples for supervised learning
• Human annotation
  • By experts or non-experts (crowdsourcing)
• Found data
• How do we know how good a classifier is?
  • Compare classifier predictions with human annotation
  • On held-out test examples
  • Evaluation metrics: accuracy, precision, recall
The 2-by-2 contingency table
              | correct | not correct
selected      | tp      | fp
not selected  | fn      | tn
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected
              | correct | not correct
selected      | tp      | fp
not selected  | fn      | tn
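In terms of the table cells, precision is tp / (tp + fp) and recall is tp / (tp + fn). The sketch below computes both, plus accuracy, from the four counts. The counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than something the slides specify.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of selected items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 on empty selection: assumed convention


def recall(tp: int, fn: int) -> float:
    """Fraction of correct items that are selected."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # same assumed convention


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all items that were classified correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for a spam detector evaluated on 100 held-out emails.
tp, fp, fn, tn = 8, 2, 4, 86
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

With these counts accuracy is high mostly because of the large tn cell, while recall shows that a third of the spam was missed.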
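A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):
  $$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$$
• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½):
  $$F_1 = \frac{2PR}{P + R}$$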
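As a quick numerical check of the F measure, here is a small sketch. The function name `f_measure` is hypothetical, the precision and recall values are made up, and returning 0.0 for a zero denominator is an assumed convention.

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r.

    beta > 1 weights recall more heavily; beta = 1 gives the balanced F1.
    """
    b2 = beta ** 2
    denom = b2 * p + r
    return (b2 + 1) * p * r / denom if denom else 0.0  # assumed convention for 0/0


# Hypothetical precision and recall values.
p, r = 0.8, 0.5
print(f_measure(p, r))            # balanced F1 = 2PR / (P + R)
print(f_measure(p, r, beta=2.0))  # recall-weighted variant
```

Because it is a harmonic mean, F1 sits closer to the smaller of P and R than the arithmetic mean would.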
Linear Classifiers
Bag of words
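A bag-of-words representation keeps how often each word occurs and discards word order. The sketch below assumes lowercasing and whitespace tokenization, which are simplifications made here for illustration; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to word counts, ignoring word order."""
    # Lowercasing and whitespace splitting are simplifying assumptions.
    return Counter(text.lower().split())


print(bag_of_words("the cat saw the dog"))
# Two documents with the same words in a different order get the same bag.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```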
Defining features
Linear classification
Linear Models for Classification
Feature function representation
Weights
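One common way to put the feature function and the weights together is to score each candidate label y by the dot product of a weight vector with a feature vector f(x, y), and predict the highest-scoring label. The sketch below assumes that formulation. The feature design (word counts paired with the label), the weight values, and all function names are hypothetical.

```python
from collections import Counter


def feature_function(text: str, label: str) -> dict:
    """f(x, y): bag-of-words counts conjoined with the candidate label."""
    return {(label, w): c for w, c in Counter(text.lower().split()).items()}


def score(weights: dict, text: str, label: str) -> float:
    """Dot product between the weights and f(x, y); unseen features weigh 0."""
    feats = feature_function(text, label)
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())


def predict(weights: dict, text: str, labels: list) -> str:
    """Return the label with the highest score."""
    return max(labels, key=lambda y: score(weights, text, y))


# Hand-set weights, for illustration only.
weights = {
    ("spam", "discounts"): 2.0,
    ("spam", "software"): 1.0,
    ("ham", "lecture"): 2.0,
    ("ham", "homework"): 1.5,
}
labels = ["spam", "ham"]
print(predict(weights, "great discounts on software", labels))
print(predict(weights, "homework due before lecture", labels))
```

The weights here are set by hand; estimating them from data is the job of methods such as Naïve Bayes or the perceptron.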
How can we learn weights?
• By hand
• Probability
  • e.g., Naïve Bayes
• Discriminative training
  • e.g., perceptron, support vector machines
Generative Story for Multinomial Naïve Bayes
• A hypothetical stochastic process describing how training examples are generated
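One way to make such a generative story concrete is to simulate it: first draw a label from a prior over classes, then draw each word independently from that class's word distribution, so the resulting word counts follow a multinomial. The sketch below assumes that two-step story. The probabilities, the fixed document length, and the function name are all made up for illustration.

```python
import random


def generate_document(prior: dict, word_probs: dict, length: int, rng: random.Random):
    """Sample (label, tokens): label from the prior, then tokens i.i.d. given the label."""
    labels = list(prior)
    label = rng.choices(labels, weights=[prior[y] for y in labels], k=1)[0]
    vocab = list(word_probs[label])
    tokens = rng.choices(vocab, weights=[word_probs[label][w] for w in vocab], k=length)
    return label, tokens


# Hypothetical parameters.
prior = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"discounts": 0.6, "lecture": 0.1, "today": 0.3},
    "ham": {"discounts": 0.1, "lecture": 0.6, "today": 0.3},
}
rng = random.Random(0)  # seeded so repeated runs give the same sample
label, tokens = generate_document(prior, word_probs, 5, rng)
print(label, tokens)
```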
Prediction with Naïve Bayes: Score(x, y)
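A standard way to write the Naïve Bayes score is as a log joint probability: Score(x, y) = log P(y) + Σ_w count(w, x) · log P(w | y), with prediction taking the highest-scoring y. The sketch below assumes that form. The parameter values are made up, and skipping words with no stored probability is a simplifying assumption.

```python
import math
from collections import Counter


def nb_score(text: str, label: str, log_prior: dict, log_likelihood: dict) -> float:
    """log P(y) + sum over words of count * log P(word | y)."""
    counts = Counter(text.lower().split())
    return log_prior[label] + sum(
        c * log_likelihood[label][w]
        for w, c in counts.items()
        if w in log_likelihood[label]  # assumption: words without parameters are skipped
    )


def nb_predict(text: str, log_prior: dict, log_likelihood: dict) -> str:
    """Return the label with the highest Naïve Bayes score."""
    return max(log_prior, key=lambda y: nb_score(text, y, log_prior, log_likelihood))


# Hypothetical parameters, stored as logs.
log_prior = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_likelihood = {
    "spam": {"discounts": math.log(0.6), "lecture": math.log(0.1), "today": math.log(0.3)},
    "ham": {"discounts": math.log(0.1), "lecture": math.log(0.6), "today": math.log(0.3)},
}
print(nb_predict("discounts today", log_prior, log_likelihood))
print(nb_predict("lecture today", log_prior, log_likelihood))
```

Because the score is a sum of per-word terms, it is linear in the bag-of-words counts, which is why Naïve Bayes counts as a linear classifier.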
Parameter Estimation
• “count and normalize”
• Parameters of a multinomial distribution
  • Relative frequency estimator
  • Formally: this is the maximum likelihood estimate
  • See CIML for derivation
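"Count and normalize" can be sketched in a few lines: count labels and per-label word occurrences, then divide by the totals. The training set and the function name `estimate` below are made up for illustration, and whitespace tokenization is an assumption.

```python
from collections import Counter, defaultdict


def estimate(training_set):
    """Relative-frequency estimates of P(y) and P(word | y)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in training_set:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())

    n_docs = sum(label_counts.values())
    prior = {y: c / n_docs for y, c in label_counts.items()}

    likelihood = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())  # number of word tokens seen with label y
        likelihood[y] = {w: c / total for w, c in counts.items()}
    return prior, likelihood


# Hypothetical three-document training set.
training_set = [
    ("cheap software discounts", "spam"),
    ("cheap discounts today", "spam"),
    ("homework due today", "ham"),
]
prior, likelihood = estimate(training_set)
print(prior)
print(likelihood["spam"])
```

Words never seen with a label get no entry at all, which amounts to an estimated probability of zero.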
Smoothing (add-alpha / Laplace)
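Add-alpha smoothing is usually written as P(w | y) = (count(w, y) + α) / (Σ_w' count(w', y) + α·V), where V is the vocabulary size; α = 1 is Laplace smoothing. The sketch below assumes that form. The counts, the vocabulary, and the function name are made up for illustration.

```python
from collections import Counter


def smoothed_likelihood(counts: dict, vocabulary: list, alpha: float = 1.0) -> dict:
    """Add-alpha estimate of P(word | y) over a fixed vocabulary."""
    total = sum(counts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocabulary}


# Hypothetical word counts for the "spam" class and a six-word vocabulary.
spam_counts = Counter({"cheap": 2, "discounts": 2, "software": 1, "today": 1})
vocabulary = ["cheap", "discounts", "software", "today", "homework", "due"]
probs = smoothed_likelihood(spam_counts, vocabulary, alpha=1.0)
print(probs["cheap"], probs["homework"])
```

Words never seen with a class now get a small nonzero probability, so a single unseen word no longer forces the whole document's score to log 0.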
Naïve Bayes recap
Today
• Text classification problems
  • and their evaluation
• Linear classifiers
  • Features & Weights
  • Bag of words
  • Naïve Bayes