32
POS tagging CMSC 723 / LING 723 / INST 725 Marine Carpuat

slides 07 neurallm - UMD Department of Computer Science · 3rd person pronouns neutral to both gender and number No features distinguishing verbs from nouns 漂亮: beautiful/to be

Embed Size (px)

Citation preview

POStaggingCMSC723/LING723/INST725

MarineCarpuat

PartsofSpeech

• “Equivalenceclass”oflinguisticentities• “Categories”or“types”ofwords

• StudydatesbacktotheancientGreeks• DionysiusThrax ofAlexandria(c. 100BC)• 8partsofspeech:noun,verb,pronoun,preposition,adverb,conjunction,participle,article

• Remarkablyenduringlist!

2

HowcanwedefinePOS?

• Bymeaning?• Verbsareactions• Adjectivesareproperties• Nounsarethings

• Bythesyntacticenvironment• Whatoccursnearby?• Whatdoesitactas?

• Bywhatmorphologicalprocessesaffectit• Whataffixesdoesittake?

• Typicallycombinationofsyntactic+morphology

PartsofSpeech

• Openclass• Impossibletocompletelyenumerate• Newwordscontinuouslybeinginvented,borrowed,etc.

• Closedclass• Closed,fixedmembership• Reasonablyeasytoenumerate• Generally,shortfunctionwordsthat“structure”sentences

OpenClassPOS

• FourmajoropenclassesinEnglish• Nouns• Verbs• Adjectives• Adverbs

• Alllanguageshavenounsandverbs...butmaynothavetheothertwo

Nouns

• Openclass• Newinventionsallthetime:muggle,webinar,...

• Semantics:• Generally,wordsforpeople,places,things• Butnotalways(bandwidth,energy,...)

• Syntacticenvironment:• Occurringwithdeterminers• Pluralizable,possessivizable

• Othercharacteristics:• Massvs.countnouns

Verbs

• Openclass• Newinventionsallthetime:google,tweet,...

• Semantics• Generally,denoteactions,processes,etc.

• Syntacticenvironment• E.g.,Intransitive,transitive

• Othercharacteristics• Mainvs.auxiliaryverbs• Gerunds(verbsbehavinglikenouns)• Participles(verbsbehavinglikeadjectives)

AdjectivesandAdverbs

• Adjectives• Generallymodifynouns,e.g.,tall girl

• Adverbs• Asemanticandformalhodge-podge…• Sometimesmodifyverbs,e.g.,sangbeautifully• Sometimesmodifyadjectives,e.g.,extremely hot

ClosedClassPOS

• Prepositions• InEnglish,occurringbeforenounphrases• Specifyingsometypeofrelation(spatial,temporal,…)• Examples:on theshelf,before noon

• Particles• Resemblesapreposition,butusedwithaverb(“phrasalverbs”)• Examples:findout,turnover,goon

Particlevs.Prepositions

Hecameby theofficeinahurryHecameby hisfortunehonestly

Weranup thephonebillWeranup thesmallhill

Heliveddown theblockHeneverliveddown thenicknames

(by=preposition)(by=particle)

(up=particle)(up=preposition)

(down=preposition)(down=particle)

MoreClosedClassPOS

• Determiners• Establishreferenceforanoun• Examples:a,an,the (articles),that,this,many,such,…

• Pronouns• Refertopersonorentities:he,she,it• Possessivepronouns:his,her,its• Wh-pronouns:what,who

ClosedClassPOS:Conjunctions

• Coordinatingconjunctions• Jointwoelementsof“equalstatus”• Examples:catsand dogs,salador soup

• Subordinatingconjunctions• Jointwoelementsof“unequalstatus”• Examples:We’llleaveafter youfinisheating.While Iwaswaitinginline,Isawmyfriend.

• Complementizers areaspecialcase:Ithinkthat youshouldfinishyourassignment

BeyondEnglish…ChineseNoverb/adjectivedistinction!

RiauIndonesian/MalayNoArticlesNoTenseMarking3rdpersonpronounsneutraltobothgenderandnumberNofeaturesdistinguishingverbsfromnouns

漂亮:beautiful/tobebeautiful

Ayam (chicken) Makan (eat)

The chicken is eatingThe chicken ate

The chicken will eatThe chicken is being eatenWhere the chicken is eatingHow the chicken is eating

Somebody is eating the chickenThe chicken that is eating

POStagging

POSTagging:What’sthetask?

• Processofassigningpart-of-speechtagstowords

• Butwhattagsarewegoingtoassign?• Coarsegrained:noun,verb,adjective,adverb,…• Finegrained:{proper,common}noun• Evenfiner-grained:{proper,common}noun± animate

• Importantissuestoremember• Choiceoftagsencodescertaindistinctions/non-distinctions• Tagsets willdifferacrosslanguages!

• ForEnglish,PennTreebankisthemostcommontagset

PennTreebankTagset:45Tags

PennTreebankTagset:Choices

• Example:• The/DTgrand/JJjury/NNcommmented/VBDon/INa/DTnumber/NNof/INother/JJtopics/NNS./.

• Distinctionsandnon-distinctions• Prepositionsandsubordinatingconjunctionsaretagged“IN”(“Although/INI/PRP..”)

• Exceptthepreposition/complementizer “to”istagged“TO”

WhydoPOStagging?

• OneofthemostbasicNLPtasks• NicelyillustratesprinciplesofstatisticalNLP

• Usefulforhigher-levelanalysis• Neededforsyntacticanalysis• Neededforsemanticanalysis

• SampleapplicationsthatrequirePOStagging• Machinetranslation• Informationextraction• Lotsmore…

Tryyourhandattagging…

• Theback door• Onmyback• Winthevotersback• Promisedtoback thebill

Tryyourhandattagging…

• Ihopethat shewins• That daywasnice• Youcangothat far

WhyisPOStagginghard?

• Ambiguity!

• AmbiguityinEnglish• 11.5%ofwordtypesambiguousinBrowncorpus• 40%ofwordtokensambiguousinBrowncorpus• AnnotatordisagreementinPennTreebank:3.5%

POStagging:howtodoit?

• GivenPennTreebank,howwouldyoubuildasystemthatcanPOStagnewtext?

• Baseline:pickmostfrequenttagforeachwordtype• 90%accuracyiftrain+test setsaredrawnfromPennTreebank

• Canwedobetter?

HowtoPOStagautomatically?

HowcanwePOStagautomatically?

• POStaggingasmulticlassclassification• Whatisx?Whatisy?

• POStaggingassequencelabeling• Modelssequencesofpredictions

LinearModelsforClassification

Featurefunction

representation

Weights

Multiclassperceptron

POStaggingSequencelabelingwiththeperceptron

Sequencelabelingproblem

• Input:• sequenceoftokensx=[x1 … xK]• VariablelengthK

• Output(akalabel):• sequenceoftagsy=[y1 … yK]• Sizeofoutputspace?

StructuredPerceptron• Perceptronalgorithmcanbeusedforsequencelabeling

• Buttherearechallenges• Howtocomputeargmax efficiently?• Whatareappropriatefeatures?

• Approach:leveragestructureofoutputspace

Featurefunctionsforsequencelabeling

• Examplefeatures?• Numberoftimes“monsters”istaggedasnoun

• Numberoftimes“noun”isfollowedby“verb”

• Numberoftimes“tasty”istaggedas“verb”

• Numberoftimestwoverbsareadjacent• …

Featurefunctionsforsequencelabeling

• StandardfeaturesofPOStagging

• Unaryfeatures: #timeswordwhasbeenlabeledwithtaglforallwordswandalltagsl

• Markovfeatures: #timestaglisadjacenttotagl’inoutputforalltagslandl’

• Sizeoffeaturerepresentationisconstantwrtinputlength

Solvingtheargmax problemforsequences

• Efficientalgorithmspossibleifthefeaturefunctiondecomposesovertheinput

• Thisholdsforunaryandmarkovfeatures

Solvingtheargmax problemforsequences

• Trellissequencelabeling• Anypathrepresentsalabelingofinputsentence

• Goldstandardpathinred

• Eachedgereceivesaweightsuchthataddingweightsalongthepathcorrespondstoscoreforinput/ouputconfiguration

• Anymax-weightmax-weightpathalgorithmcanfindtheargmax

• e.g.ViterbialgorithmO(LK2)

POStaggingCMSC723/LING723/INST725

MarineCarpuat