Intro to Information Extraction - Sentiment Lexicon Induction as an example

Many slides adapted from Ellen Riloff and Dan Jurafsky

What is Information Extraction?

• Information extraction (IE) is an umbrella term for NLP tasks that involve extracting pieces of information from text and assigning some meaning to the information.

• Many IE applications aim to turn unstructured text into a “structured” representation.

• IE problems typically involve:

– identifying text snippets to extract

– assigning semantic meaning to entities or concepts

– finding relations between entities or concepts

IE Applications

• Biological Processes (Genomics)

• Clinical Medicine

• Question Answering / Web Search

• Query Expansion / Semantic Sets

• Extracting Entity Profiles

• Tracking Events (Violent, Diseases, Business, etc.)

• Tracking Opinions (Political, Product Reputation, Financial Prediction, On-line Reviews, etc.)

General Techniques

• Syntactic Analysis

– Phrase Identification

– Feature Extraction

• Semantic Analysis

• Statistical Measures

• Machine Learning

– Supervised & Weakly Supervised

• Graph Algorithms

Named Entity Recognition (NER)

Mars One announced Monday that it has picked 1,058 aspiring spaceflyers to move on to the next round in its search for the first humans to live and die on the Red Planet.

The Wall Street Journal reports that Google plans to partner with Toyota to develop Android software for their hybrid cars.

NER typically involves extracting and labeling certain types of entities, such as proper names and dates.

Domain-specific NER

Biomedical systems must recognize genes and proteins:

IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.

Clinical medical systems must recognize problems and treatments:

Adrenal-sparing surgery is safe and effective, and may become the treatment of choice in patients with hereditary phaeochromocytoma.

Semantic Class Identification

Mars One announced Monday that it has picked 1,058 aspiring spaceflyers to move on to the next round in its search for the first humans to live and die on the Red Planet.

The Wall Street Journal reports that Google plans to partner with Toyota to develop Android software for their hybrid cars.

Semantic Lexicon Induction

• Although some general semantic dictionaries exist (e.g., WordNet), domain-specific applications often have specialized vocabulary.

• Semantic Lexicon Induction techniques learn lists of words that belong to a semantic class.

Vehicles: car, jeep, helicopter, bike, tricycle, scooter, …

Animal: tiger, zebra, wolverine, platypus, echidna, …

Symptoms: cough, sneeze, pain, pu/pd, elevated bp, …

Products: camera, laptop, iPad, tablet, GPS device, …

Domain-specific Vocabulary

A 14yo m/n doxy owned by a reputable breeder is being treated for IBD with pred.

ANIMAL: doxy

HUMAN: breeder

DISEASE: IBD

DRUG: pred

Domain-specific meanings: lab, mix, m/n = ANIMAL

Semantic Taxonomy Induction

• Ideally, we want semantic concepts to be organized in a taxonomy, to support generalization but to distinguish different subtypes.

Animal
  Mammal
    Feline
      Lion, Panthera Leo
      Tiger, Panthera Tigris, Felis Tigris
      Cougar, Mountain Lion, Puma, Panther, Catamount
    Canine
      Wolf, Canis Lupus
      Coyote, Prairie Wolf, Brush Wolf, American Jackal
      Dog, Puppy, Canis Lupus Familiaris, Mongrel

Challenges in Taxonomy Induction

• But there are often many ways to organize a conceptual space!

• Strict hierarchies are rare in real data: graphs/networks are more realistic than tree structures.

• For example, animals could be subcategorized based on:

– carnivore vs. herbivore

– water-dwelling vs. land-dwelling

– wild vs. pets vs. agricultural

– physical characteristics (e.g., baleen vs. toothed whales)

– habitat (e.g., arctic vs. desert)

Relation Extraction

In Salzburg, little Mozart grew up in a loving middle-class environment.

Birthplace(Mozart, Salzburg)

Steve Ballmer is an American businessman who has been serving as the CEO of Microsoft since January 2000.

Employed-By(Steve Ballmer, Microsoft)

CEO(Steve Ballmer, Microsoft)

Relations for Web Search

Paraphrasing

• Relations can often be expressed with a multitude of different expressions.

• Paraphrasing systems try to explicitly learn phrases that represent the same type of relation.

• Examples:

– X was born in Y

– Y is the birthplace of X

– X’s birthplace is Y

– X’s hometown is Y

– X grew up in Y
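A minimal sketch of how paraphrase patterns like the ones listed on this slide could be applied once they are known: each pattern is written as a regular expression and mapped to the same relation. The relation name `Birthplace`, the helper `extract_birthplace`, and the approximation of an entity as a run of capitalized words are assumptions for illustration only; a real paraphrasing system learns its patterns rather than hard-coding them.

```python
import re

# Assumption: an entity is approximated as one or more capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# One regex per paraphrase; named groups keep X and Y straight even when
# the surface order of the arguments flips. Both ' and ’ are accepted.
PATTERNS = [
    rf"(?P<x>{NAME}) was born in (?P<y>{NAME})",
    rf"(?P<y>{NAME}) is the birthplace of (?P<x>{NAME})",
    rf"(?P<x>{NAME})['’]s birthplace is (?P<y>{NAME})",
    rf"(?P<x>{NAME})['’]s hometown is (?P<y>{NAME})",
    rf"(?P<x>{NAME}) grew up in (?P<y>{NAME})",
]


def extract_birthplace(sentence):
    """Return ("Birthplace", X, Y) for the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = re.search(pattern, sentence)
        if match:
            return ("Birthplace", match.group("x"), match.group("y"))
    return None


print(extract_birthplace("Mozart was born in Salzburg."))
print(extract_birthplace("Salzburg is the birthplace of Mozart."))
```

Both calls map to the same tuple even though the argument order differs on the surface. The last two patterns (hometown, grew up) are looser evidence of a birthplace than the first three.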

Event Extraction

Goal: extract facts about events from unstructured documents

Example: extracting information about terrorism events in news articles:

December 29, Pakistan - The U.S. embassy in Islamabad was damaged this morning by a car bomb. Three diplomats were injured in the explosion. Al Qaeda has claimed responsibility for the attack.

EVENT
Type: bombing
Target: U.S. embassy
Location: Islamabad, Pakistan
Date: December 29
Weapon: car bomb
Victim: three diplomats
Perpetrator: Al Qaeda

Event Extraction

Another example: extracting information about disease outbreak events.

Document Text
New Jersey, February 26. An outbreak of swine flu has been confirmed in Mercer County, NJ. Five teenage boys appear to have contracted the deadly virus from an unknown source. The CDC is investigating the cases and is taking measures to prevent the spread. . .

Event
Disease: swine flu
Location: Mercer County, NJ
Victim: Five teenage boys
Date: February 26
Status: confirmed

Large-Scale IE from the Web

• Some researchers have been developing IE systems for large-scale extraction of facts and relations from the Web.

• These systems exploit the massive amount of text and redundancy available on the Web and use weakly supervised, iterative learning to harvest information for automated knowledge base construction.

• The KnowItAll project at UW and the NELL project at CMU are well-known research groups pursuing this work.

Opinion Extraction

I just bought a Powershot a few days ago. I took some pictures using the camera. Colors are so beautiful even when flash is used. Also easy to grip since the body has a grip handle. [Kobayashi et al., 2007]

Source: <writer>
Target: Powershot
Aspect: pictures, colors
Evaluation: beautiful, easy to grip

Opinion Extraction from News [Wilson & Wiebe, 2009]

Italian senator Renzo Gubert praised the Chinese Government’s efforts.

Source: Italian senator Renzo Gubert
Target: the Chinese Government
Evaluation: praised (POSITIVE)

African observers generally approved of his victory while Western governments denounced it.

Source: African observers
Target: his victory
Evaluation: approved (POSITIVE)

Source: Western governments
Target: it (his victory)
Evaluation: denounced (NEGATIVE)

Summary

• Information extraction systems frequently rely on low-level NLP tools for basic language analysis, often in a pipeline architecture.

• There are a wide variety of applications for IE, including both broad-coverage and domain-specific applications.

• Some IE tasks are relatively well-understood (e.g., named entity recognition), while others are still quite challenging!

• We’ve only scratched the surface of possible IE tasks… nearly endless possibilities.

Turney Algorithm to learn a Sentiment Lexicon

1. Extract a phrasal lexicon from reviews

2. Learn polarity of each phrase

3. Rate a review by the average polarity of its phrases

Turney (2002): Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews

Extract two-word phrases with adjectives

First Word         Second Word         Third Word (not extracted)
JJ                 NN or NNS           anything
RB, RBR, or RBS    JJ                  not NN nor NNS
JJ                 JJ                  not NN nor NNS
NN or NNS          JJ                  not NN nor NNS
RB, RBR, or RBS    VB, VBD, VBN, VBG   anything
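The tag patterns on this slide can be sketched as a small filter over POS-tagged text. This is an illustrative sketch, not Turney's implementation: it assumes the input is already tagged with Penn Treebank tags as `(word, tag)` pairs, and the name `extract_phrases` is hypothetical.

```python
ADJ = {"JJ"}
NOUN = {"NN", "NNS"}
ADV = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

# (tags allowed for word 1, tags allowed for word 2,
#  True if the third word must NOT be a noun)
PATTERNS = [
    (ADJ, NOUN, False),
    (ADV, ADJ, True),
    (ADJ, ADJ, True),
    (NOUN, ADJ, True),
    (ADV, VERB, False),
]


def extract_phrases(tagged):
    """Return two-word phrases whose tags match one of the five patterns."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        # The third word is only inspected, never extracted.
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        for first, second, third_not_noun in PATTERNS:
            if t1 in first and t2 in second and not (third_not_noun and t3 in NOUN):
                phrases.append(f"{w1} {w2}")
                break
    return phrases


tagged = [("the", "DT"), ("online", "JJ"), ("service", "NN"), ("is", "VBZ"),
          ("very", "RB"), ("handy", "JJ"), (".", ".")]
print(extract_phrases(tagged))
```

The third-word check keeps a pair such as JJ JJ from being extracted when it is really the start of a longer noun phrase.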

How to measure polarity of a phrase?

• Positive phrases co-occur more with “excellent” -> “great”

• Negative phrases co-occur more with “poor”

• But how to measure co-occurrence?

Pointwise Mutual Information

• Pointwise mutual information:

– How much more do events x and y co-occur than if they were independent?

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• PMI between two words:

– How much more do two words co-occur than if they were independent?

$$\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \frac{P(\mathit{word}_1, \mathit{word}_2)}{P(\mathit{word}_1)\,P(\mathit{word}_2)}$$
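As a worked illustration of the PMI definition, the sketch below estimates the three probabilities from raw counts and plugs them into the formula. The function name `pmi` and the example counts are assumptions made up for this illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), probabilities from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts out of 1000 observations.
# x and y each occur 100 times; independence predicts 10 co-occurrences.
print(pmi(count_xy=10, count_x=100, count_y=100, total=1000))  # close to 0
print(pmi(count_xy=40, count_x=100, count_y=100, total=1000))  # close to 2
```

A PMI near 0 means the two events co-occur about as often as independence predicts; positive values mean more often, negative values less often.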

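How to Estimate Pointwise Mutual Information

– Query search engine (Altavista)

• P(word) estimated by hits(word)/N

• P(word1, word2) by hits(word1 NEAR word2)/N

– (More correctly the bigram denominator should be kN, because there are a total of N consecutive bigrams (word1, word2), but kN bigrams that are k words apart, but we just use N on the rest of this slide and the next.)

$$\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{word}_1 \text{ NEAR } \mathit{word}_2)}{\frac{1}{N}\,\mathrm{hits}(\mathit{word}_1)\cdot\frac{1}{N}\,\mathrm{hits}(\mathit{word}_2)}$$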

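Does phrase appear more with “poor” or “excellent”?

$$
\begin{aligned}
\mathrm{Polarity}(\mathit{phrase}) &= \mathrm{PMI}(\mathit{phrase}, \text{“excellent”}) - \mathrm{PMI}(\mathit{phrase}, \text{“poor”}) \\
&= \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase} \text{ NEAR “excellent”})}{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase})\cdot\frac{1}{N}\,\mathrm{hits}(\text{“excellent”})} - \log_2 \frac{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase} \text{ NEAR “poor”})}{\frac{1}{N}\,\mathrm{hits}(\mathit{phrase})\cdot\frac{1}{N}\,\mathrm{hits}(\text{“poor”})} \\
&= \log_2 \left( \frac{\mathrm{hits}(\mathit{phrase} \text{ NEAR “excellent”})}{\mathrm{hits}(\mathit{phrase})\,\mathrm{hits}(\text{“excellent”})} \cdot \frac{\mathrm{hits}(\mathit{phrase})\,\mathrm{hits}(\text{“poor”})}{\mathrm{hits}(\mathit{phrase} \text{ NEAR “poor”})} \right) \\
&= \log_2 \left( \frac{\mathrm{hits}(\mathit{phrase} \text{ NEAR “excellent”})\,\mathrm{hits}(\text{“poor”})}{\mathrm{hits}(\mathit{phrase} \text{ NEAR “poor”})\,\mathrm{hits}(\text{“excellent”})} \right)
\end{aligned}
$$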
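The final form of the polarity formula needs only four hit counts, because N and hits(phrase) cancel. The sketch below computes it directly. The function name `polarity`, the example counts, and the small `smoothing` constant added to the NEAR counts (to avoid dividing by zero when a phrase never appears near a seed word) are assumptions of this sketch; the slides do not mention smoothing.

```python
import math


def polarity(near_excellent, near_poor, hits_excellent, hits_poor, smoothing=0.01):
    """log2( hits(phrase NEAR "excellent") * hits("poor")
             / (hits(phrase NEAR "poor") * hits("excellent")) )

    `smoothing` is added to both NEAR counts (an assumption of this sketch).
    """
    numerator = (near_excellent + smoothing) * hits_poor
    denominator = (near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical hit counts, not taken from any real search engine.
score = polarity(near_excellent=80, near_poor=20, hits_excellent=1000, hits_poor=1000)
print(score)  # positive: the phrase leans toward "excellent"
```

Swapping the two NEAR counts flips the sign of the score, which is what makes it usable as a polarity measure.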

Phrases from a thumbs-up review

Phrase                   POS tags   Polarity
online service           JJ NN      2.8
online experience        JJ NN      2.3
direct deposit           JJ NN      1.3
local branch             JJ NN      0.42
…
low fees                 JJ NNS     0.33
true service             JJ NN      -0.73
other bank               JJ NN      -0.85
inconveniently located   RB VBN     -1.5
Average                             0.32

Phrases from a thumbs-down review

Phrase                   POS tags   Polarity
direct deposits          JJ NNS     5.8
online web               JJ NN      1.9
very handy               RB JJ      1.4
…
virtual monopoly         JJ NN      -2.0
lesser evil              RBR JJ     -2.3
other problems           JJ NNS     -2.8
low funds                JJ NNS     -6.8
unethical practices      JJ NNS     -8.5
Average                             -1.2
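Rating a review by the average polarity of its phrases can be sketched as follows. The function name `rate_review` and the decision threshold of 0 are assumptions of this sketch. The input lists contain only the rows visible in the two tables above, so their means differ from the slide averages (0.32 and -1.2), which also include the elided rows.

```python
from statistics import mean


def rate_review(polarities, threshold=0.0):
    """Return (average polarity, label); the threshold of 0 is an assumption."""
    average = mean(polarities)
    label = "thumbs up" if average > threshold else "thumbs down"
    return average, label


# Visible rows only; the rows hidden behind "…" are not included.
thumbs_up_rows = [2.8, 2.3, 1.3, 0.42, 0.33, -0.73, -0.85, -1.5]
thumbs_down_rows = [5.8, 1.9, 1.4, -2.0, -2.3, -2.8, -6.8, -8.5]

print(rate_review(thumbs_up_rows))
print(rate_review(thumbs_down_rows))
```

Note that the thumbs-down review contains the single most positive phrase in either table ("direct deposits", 5.8) and is still rated negative, because the average is dominated by strongly negative phrases.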

Results of Turney algorithm

• 410 reviews from Epinions

– 170 (41%) negative

– 240 (59%) positive

• Majority class baseline: 59%

• Turney algorithm: 74%

• Phrases rather than words

• Learns domain-specific information
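The majority class baseline on this slide is the accuracy of always predicting the more frequent label. A minimal sketch, using only the class counts given above; the function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class counts from the slide: 240 positive and 170 negative reviews.
labels = ["positive"] * 240 + ["negative"] * 170
accuracy = majority_baseline_accuracy(labels)
print(f"{accuracy:.0%}")
```

Any classifier has to beat this number to show that it has learned something beyond the class distribution; the reported 74% does.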
