TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 13: Dependency Syntax/Parsing & Review for Midterm

Announcement
• project proposal due today
• email me to set up a 15-minute meeting next week to discuss your project proposal
• times posted on course web page
• let me know if none of those work for you

Announcement
• midterm is Thursday, room #530
• closed-book, but you can bring an 8.5x11 sheet (though I don't think you'll need to)
• we will start at 10:35am, finish at 11:50am

Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• neural network methods in NLP
• syntax and syntactic parsing
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications

What is Syntax?
• rules, principles, and processes that govern the sentence structure of a language
• can differ widely among languages
• but every language has systematic structural principles
Constituent Parse (Bracketing/Tree)
(S (NP the man) (VP walked (PP to (NP the park))))

[tree diagram of "the man walked to the park": S over NP and VP, with PP and NP inside the VP, and preterminals DT NN VBD IN DT NN over the words]

Key: S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, DT = determiner, NN = noun, VBD = verb (past tense), IN = preposition

Constituent Parse (Bracketing/Tree)
(S (NP the man) (VP walked (PP to (NP the park))))

[same tree diagram, annotated: S, NP, VP, PP are nonterminals; DT, NN, VBD, IN are preterminals; the words are terminals]

Penn Treebank Nonterminals
[table of Penn Treebank nonterminal labels]
Probabilistic Context-Free Grammar (PCFG)
• assign probabilities to rewrite rules:

  NP → DT NN    0.5
  NP → NNS      0.3
  NP → NP PP    0.2

  NN → man      0.01
  NN → park     0.0004
  NN → walk     0.002
  NN → …

• given a treebank, estimate these probabilities using MLE ("count and normalize")
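To make "count and normalize" concrete, here is a minimal sketch of MLE estimation of PCFG rule probabilities; it assumes a toy treebank of nested (label, child, child, ...) tuples, a hypothetical format chosen only for illustration:

```python
from collections import Counter

def pcfg_mle(treebank):
    """Estimate rule probabilities by counting rules and normalizing per left-hand side."""
    rule_counts, lhs_counts = Counter(), Counter()

    def collect(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            rhs = (children[0],)                          # lexical rule, e.g. NN -> man
        else:
            rhs = tuple(child[0] for child in children)   # e.g. NP -> DT NN
            for child in children:
                collect(child)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1

    for tree in treebank:
        collect(tree)
    # p(rhs | lhs) = count(lhs -> rhs) / count(lhs)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# toy one-tree "treebank" for the example sentence
tree = ("S", ("NP", ("DT", "the"), ("NN", "man")),
             ("VP", ("VBD", "walked"),
                    ("PP", ("IN", "to"), ("NP", ("DT", "the"), ("NN", "park")))))
probs = pcfg_mle([tree])
print(probs[("NP", ("DT", "NN"))])   # 1.0 in this tiny treebank
```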
How well does a PCFG work?
• a PCFG learned from the Penn Treebank with MLE gets about 73% F1 score
• state-of-the-art parsers are around 92%
• simple modifications can improve PCFGs:
  – smoothing
  – tree transformations (selective flattening)
  – parent annotation
Parent Annotation
VP → V NP PP   becomes   VP^S → V NP^VP PP^VP

• adds more information, but also fragments counts, making parameter estimates noisier (since we're just using MLE)
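A sketch of the parent-annotation transform on the same nested-tuple trees used above (hypothetical format); each nonterminal gets its parent's label appended, so extracting rules from the transformed tree yields, e.g., VP^S → VBD NP^VP PP^VP:

```python
def parent_annotate(node, parent="TOP"):
    """Append the parent's label to each nonterminal, e.g. VP becomes VP^S."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return node                                  # preterminal: leave as-is
    annotated_children = tuple(parent_annotate(child, label) for child in children)
    return (label + "^" + parent,) + annotated_children

# reusing `tree` from the PCFG sketch above
print(parent_annotate(tree)[0])        # 'S^TOP'
print(parent_annotate(tree)[2][0])     # 'VP^S'
```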
How well does a PCFG work?
• a PCFG learned from the Penn Treebank with MLE gets about 73% F1 score
• state-of-the-art parsers are around 92%
• simple modifications can improve PCFGs:
  – smoothing
  – tree transformations (selective flattening)
  – parent annotation
  – lexicalization

Collins (1997)
[figure from the paper]

Lexicalized PCFGs
[figure: nonterminals are decorated with the head word of the subtree]

Lexicalization
• this adds a lot more rules!
• many more parameters to estimate → smoothing becomes much more important
  – e.g., the right-hand side of a rule might be factored into several steps
• but it's worth it, because head words are really useful for constituent parsing

Results (Collins, 1997)
[table of parsing results]
Head Rules
• how are heads decided?
• most researchers use deterministic head rules (Magerman/Collins)
• for a PCFG rule A → B1 … BN, these head rules say which of B1 … BN is the head of the rule
• examples (see the sketch below):
  S → NP VP
  VP → VBD NP PP
  NP → DT JJ NN
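A sketch of how deterministic head rules can be applied; the rule table below is a small, simplified illustration in the spirit of the Magerman/Collins rules, not the actual tables:

```python
# Simplified head rules: for each parent label, a search direction and a
# priority list of child labels.  (Illustrative only; the real
# Magerman/Collins tables are much larger.)
HEAD_RULES = {
    "S":  ("right", ["VP", "S", "NP"]),
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP", "NP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left",  ["IN", "TO"]),
}

def head_child_index(parent, children):
    """Return the index of the head child for the rule parent -> children."""
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    order = list(range(len(children)) if direction == "left"
                 else reversed(range(len(children))))
    for label in priorities:               # search by priority, then by direction
        for i in order:
            if children[i] == label:
                return i
    return order[0]                        # default: first child in the search direction

print(head_child_index("S", ["NP", "VP"]))          # 1 (VP is the head)
print(head_child_index("VP", ["VBD", "NP", "PP"]))  # 0 (VBD is the head)
print(head_child_index("NP", ["DT", "JJ", "NN"]))   # 2 (NN is the head)
```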
Head Annotation
[figure from Noah Smith]

Lexical Head Annotation
[figure from Noah Smith]

Lexical Head Annotation → Dependencies
• remove non-lexical parts:
[figure from Noah Smith]

Dependencies
• merge redundant nodes:
[figure from Noah Smith]

[figure: constituent parse and the corresponding dependency parse of the same sentence]
[figure: constituent parse and labeled dependency parse of the same sentence, with edge labels nsubj, det, dobj, prep, pobj]

Key: nsubj = "nominal subject", dobj = "direct object", prep = "preposition modifier", pobj = "object of preposition", det = "determiner"

• a labeled dependency parse captures some semantic relationships
• how (unlabeled) dependency trees are typically drawn:
  – the root of the tree is represented by $ (the "wall" symbol)
  – arrows are drawn entirely above (or below) the sentence
  – arrows are directed from child to parent (or from parent to child); you will see both in practice, so don't get confused!

source:    $ konnten sie es übersetzen ?
reference: $ could you translate it ?
Crossing Dependencies
• if dependencies cross ("nonprojective"), the tree no longer corresponds to a PCFG
[figure from Noah Smith]

Projective vs. Nonprojective Dependency Parsing
• English dependency treebanks are mostly projective
  – but when focusing more on semantic relationships, trees often become more nonprojective
• some (relatively) free word order languages, like Czech, are fairly nonprojective
• nonprojective parsing can be formulated as a minimum spanning tree problem
• projective parsing cannot
Dependency Parsing
• several widely-used algorithms
• different guarantees, but similar performance in practice
• graph-based:
  – dynamic programming (Eisner, 1997)
  – minimum spanning tree (McDonald et al., 2005)
• transition-based:
  – shift-reduce (Nivre, inter alia)

Dependency Parsers
• Stanford parser
• TurboParser
• Joakim Nivre's MaltParser
• Ryan McDonald's MSTParser
• and many others, for many non-English languages

Complexity Comparison
• constituent parsing: O(Gn^3)
  – parsing complexity depends on grammar structure ("grammar constant" G)
  – since it has lots of nonterminal-only rules at the top of the tree, there are many rule probabilities to estimate
• dependency parsing: O(n^3)
  – operates directly on words, so parsing complexity has no grammar constant
  – features are designed on possible dependencies (pairs of words) and larger structures
  – transition-based parsing algorithms are O(n), though not optimal; also, nonprojective parsing is faster
Applications of Dependency Parsing
• widely used for NLP tasks because:
  – faster than constituent parsing
  – captures more semantic information
• text classification (features on dependencies)
• syntax-based machine translation
• relation extraction
  – e.g., extract the relation between Sam Smith and AI Tech: "Sam Smith was named new CEO of AI Tech."
  – use the dependency path between Sam Smith and AI Tech:
    • Smith → named, named ← CEO, CEO ← of, of ← AI Tech
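A sketch of extracting such a dependency path, assuming the parse is given as an array of head indices (with -1 for the root); the example head indices and the arrow notation are illustrative, and "AI Tech" is split into two tokens here:

```python
def dependency_path(heads, words, i, j):
    """Path between tokens i and j in a dependency tree.

    `heads[k]` is the index of the head of token k (or -1 for the root).
    Returns strings like "Smith->named" (child to head) and "named<-CEO"
    (head to child), in the spirit of the path on the slide.
    """
    def ancestors(k):
        chain = [k]
        while heads[k] != -1:
            k = heads[k]
            chain.append(k)
        return chain

    up_i, up_j = ancestors(i), ancestors(j)
    common = next(a for a in up_i if a in up_j)              # lowest common ancestor
    path = []
    for child, head in zip(up_i, up_i[1:]):                  # climb from i to the LCA
        if child == common:
            break
        path.append(f"{words[child]}->{words[head]}")
    down = up_j[:up_j.index(common) + 1]                     # [j, ..., common]
    for child, head in reversed(list(zip(down, down[1:]))):  # descend from the LCA to j
        path.append(f"{words[head]}<-{words[child]}")
    return path

words = ["Sam", "Smith", "was", "named", "new", "CEO", "of", "AI", "Tech", "."]
heads = [1, 3, 3, -1, 5, 3, 5, 8, 6, 3]   # hypothetical parse; "named" is the root
print(dependency_path(heads, words, 1, 8))
# ['Smith->named', 'named<-CEO', 'CEO<-of', 'of<-Tech']
```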
Summary: two types of grammars
• phrase structure / constituent grammars
  – inspired mostly by Chomsky and others
  – only appropriate for certain languages (e.g., English)
• dependency grammars
  – closer to a semantic representation; some have made this more explicit
  – problematic for certain syntactic structures (e.g., conjunctions, nesting of noun phrases, etc.)
• both are widely used in NLP
• you can find constituent parsers and dependency parsers for several languages online

Review
Modeling, Inference, Learning
• Modeling: How do we assign a score to an (x, y) pair using parameters θ?

  modeling: define the score function  score(x, y, θ)

Modeling, Inference, Learning
• Inference: How do we efficiently search over the space of all labels?

  modeling: define the score function  score(x, y, θ)
  inference: solve  argmax_y score(x, y, θ)

Modeling, Inference, Learning
• Learning: How do we choose θ?

  modeling: define the score function  score(x, y, θ)
  inference: solve  argmax_y score(x, y, θ)
  learning: choose θ
Applications

Applications of our Classification Framework
text classification:

  x: "the hulk is an anger fueled monster with incredible strength and resistance to damage."   →   y: objective
  x: "in trying to be daring and original, it comes off as only occasionally satirical and never fresh."   →   y: subjective

  output space = {objective, subjective}

Applications of our Classification Framework
word sense classifier for bass:

  x: "he's a bass in the choir."   →   y: bass3
  x: "our bass is line-caught from the Atlantic."   →   y: bass4

  output space = {bass1, bass2, …, bass8}
Applications of our Classification Framework
skip-gram model as a classifier:

  x: agriculture   →   y: <s>
  x: agriculture   →   y: is
  x: agriculture   →   y: the

  output space = V (the entire vocabulary)

corpus (English Wikipedia):
  agriculture is the traditional mainstay of the cambodian economy .
  but benares has been destroyed by an earthquake .
  …
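A sketch of how these (x, y) training pairs are generated from a corpus, assuming a symmetric context window of two words and an explicit <s> boundary token:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) classification pairs.

    Each pair treats the center word as the input x and one nearby word
    as the label y, matching the "skip-gram model as a classifier" view.
    """
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["<s>"] + "agriculture is the traditional mainstay of the cambodian economy .".split()
for x, y in skipgram_pairs(sentence):
    if x == "agriculture":
        print(x, "->", y)
# agriculture -> <s>
# agriculture -> is
# agriculture -> the
```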
Part-of-Speech Tagging

  Some        questioned    if     Tim     Cook    's     first   product
  determiner  verb (past)   prep.  proper  proper  poss.  adj.    noun

  would   be     a      breakaway   hit    for     Apple    .
  modal   verb   det.   adjective   noun   prep.   proper   punc.

Simplest kind of structured prediction: Sequence Labeling
Named Entity Recognition

  Some  questioned  if  Tim       Cook      's  first  product
  O     O           O   B-PERSON  I-PERSON  O   O      O

  would  be  a  breakaway  hit  for  Apple            .
  O      O   O  O          O    O    B-ORGANIZATION   O

B = "begin", I = "inside", O = "outside"
Formulating segmentation tasks as sequence labeling via B-I-O labeling (see the sketch below):
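A minimal sketch of B-I-O conversion, assuming entity spans are given as (start, end, label) with an exclusive end index:

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into B-I-O tags.

    `spans` is a list of (start, end, label) with `end` exclusive,
    e.g. (3, 5, "PERSON") for the tokens "Tim Cook".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label                 # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label                 # remaining tokens of the span
    return tags

tokens = "Some questioned if Tim Cook 's first product".split()
print(spans_to_bio(tokens, [(3, 5, "PERSON")]))
# ['O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```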
Applications of our Classifier Framework so far

task: text classification
  input (x): a sentence
  output (y): gold standard label for x
  output space: pre-defined, small label set (e.g., {positive, negative})
  size of output space: 2-10

task: word sense disambiguation
  input (x): instance of a particular word (e.g., bass) with its context
  output (y): gold standard word sense of x
  output space: pre-defined sense inventory from WordNet (for bass)
  size of output space: 2-30

task: learning skip-gram word embeddings
  input (x): instance of a word in a corpus
  output (y): a word in the context of x in a corpus
  output space: vocabulary
  size of output space: |V|

task: part-of-speech tagging
  input (x): a sentence
  output (y): gold standard part-of-speech tags for x
  output space: all possible part-of-speech tag sequences with the same length as x
  size of output space: |P|^|x|

Applications of Classifier Framework (continued)

task: named entity recognition
  input (x): a sentence
  output (y): gold standard named entity labels for x (BIO tags)
  output space: all possible BIO label sequences with the same length as x
  size of output space: |P|^|x|

task: constituent parsing
  input (x): a sentence
  output (y): gold standard constituent parse (labeled bracketing) of x
  output space: all possible labeled bracketings of x
  size of output space: exponential in the length of x (Catalan number)

task: dependency parsing
  input (x): a sentence
  output (y): gold standard dependency parse (labeled directed spanning tree) of x
  output space: all possible labeled directed spanning trees of x
  size of output space: exponential in the length of x

• each application draws from particular linguistic concepts and must address different kinds of linguistic ambiguity/variability:
  – word sense: sense granularity, relationships among senses, word sense ambiguity
  – word vectors: distributional properties, sense ambiguity, different kinds of similarity
  – part-of-speech: tag granularity, tag ambiguity
  – parsing: constituent/dependency relationships, attachment & coordination ambiguities
Modeling

model families
• linear models
  – lots of freedom in defining features, though feature engineering is required for best performance
  – learning uses optimization of a loss function
  – one can (try to) interpret learned feature weights
• stochastic/generative models
  – linear models with simple "features" (counts of events)
  – learning is easy: count & normalize (but smoothing is needed)
  – easy to generate samples
• neural networks
  – can usually get away with less feature engineering
  – learning uses optimization of a loss function
  – hard to interpret (though we try!), but often works best
special case of linear models: stochastic/generative models

model: n-gram language models
  tasks: language modeling (for MT, ASR, etc.)
  context expansion: increase n

model: hidden Markov models
  tasks: part-of-speech tagging, named entity recognition, word clustering
  context expansion: increase the order of the HMM (e.g., bigram HMM → trigram HMM)

model: probabilistic context-free grammars
  tasks: constituent parsing
  context expansion: increase the size of rules, e.g., flattening, parent annotation, etc.

• all use MLE + smoothing (though probably different kinds of smoothing)
• all assign probability to sentences (some assign probability jointly to pairs of <sentence, something else>)
• all have the same trade-off between increasing "context" (feature size) and needing more data / better smoothing
Feature Engineering for Text Classification
• two features: [feature definitions shown as equations on the slide]
• what should the weights be?

Higher-Order Binary Feature Templates
• unigram binary template: [equation on the slide]
• bigram binary template: [equation on the slide]
• trigram binary features …

Unigram Count Features
• a "count" feature returns the count of a particular word in the text
• unigram count feature template: [equation on the slide]

Feature Count Cutoffs
• problem: some features are extremely rare
• solution: only keep features that appear at least k times in the training data
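A sketch of instantiating a unigram feature template (binary or count version) and applying a count cutoff; the feature-naming scheme is made up for illustration:

```python
from collections import Counter

def unigram_features(text, binary=True):
    """Instantiate the unigram feature template for one text.

    Binary features fire with value 1 if the word appears at all;
    count features return how many times the word appears.
    """
    counts = Counter(text.split())
    if binary:
        return {f"unigram={w}": 1 for w in counts}
    return {f"unigram_count={w}": c for w, c in counts.items()}

def apply_count_cutoff(training_texts, k=2):
    """Keep only features that fire at least k times in the training data."""
    totals = Counter()
    for text in training_texts:
        totals.update(unigram_features(text, binary=True))
    return {name for name, count in totals.items() if count >= k}

train = ["the movie was great", "the plot was thin", "great acting , great movie"]
kept = apply_count_cutoff(train, k=2)
print(sorted(kept))  # features for 'great', 'movie', 'the', 'was' survive the cutoff
```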
2-transformation (1-layer) network
• we'll call this a "2-transformation" neural network, or a "1-layer" neural network
• the input is a vector x; the output is a vector of label scores
• one hidden vector ("hidden layer") in between
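A numpy sketch of the two transformations, with tanh as the nonlinearity and illustrative dimensions:

```python
import numpy as np

def one_layer_scores(x, W1, b1, W2, b2):
    """Two transformations: input -> hidden vector -> vector of label scores."""
    hidden = np.tanh(W1 @ x + b1)      # first transformation + nonlinearity
    return W2 @ hidden + b2            # second transformation: one score per label

rng = np.random.default_rng(0)
x = rng.normal(size=50)                           # e.g. an averaged word-vector input
W1, b1 = rng.normal(size=(20, 50)), np.zeros(20)  # hidden layer of size 20
W2, b2 = rng.normal(size=(3, 20)), np.zeros(3)    # 3 labels
print(one_layer_scores(x, W1, b1, W2, b2))        # 3 label scores
```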
1-layer neural network for sentiment classification
[figure: network diagram]

Neural Networks for Twitter Part-of-Speech Tagging

  ikr smh he asked fir yo last name so he can
[figure: the tweet above with its part-of-speech tags, drawn from a coarse tag set (intj, pronoun, prep, adj, verb, other, det, noun)]

• let's use the center word + two words to the right:
  [figure: the vectors for yo, last, and name concatenated into the input x]
• if name is to the right of yo, then yo is probably a form of your
• but our x above uses separate dimensions for each position!
  – i.e., name is two words to the right
  – what if name is one word to the right?

Convolution
[figure: the vectors for yo, last, and name, with a filter applied at each position]
• the result is a "feature map", which has an entry for each word position in the context window/sentence
Pooling
[figure: the vectors for yo, last, and name, and the resulting feature map with an entry for each word position in the context window/sentence]

• how do we convert this into a fixed-length vector? use pooling:
  – max-pooling: returns the maximum value in the feature map
  – average pooling: returns the average of the values in the feature map
• this single filter then produces a single feature value (the output of some kind of pooling); in practice, we use many filters of many different lengths (e.g., n-grams rather than words)
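A numpy sketch of one (single-word) filter convolved over a window of word vectors, followed by max or average pooling; the dimensions and the dot-product filter form are illustrative:

```python
import numpy as np

def conv_feature(word_vectors, filt, pooling="max"):
    """Apply one filter at every word position, then pool the feature map.

    `word_vectors` is an (n_words, dim) matrix; `filt` is a length-`dim`
    filter (a single-word filter, for simplicity).  The feature map has one
    entry per word position; pooling turns it into a single feature value.
    """
    feature_map = word_vectors @ filt              # one entry per position
    if pooling == "max":
        return feature_map.max()
    return feature_map.mean()                      # average pooling

rng = np.random.default_rng(0)
window = rng.normal(size=(3, 50))                  # vectors for "yo", "last", "name"
filt = rng.normal(size=50)
print(conv_feature(window, filt, pooling="max"))
print(conv_feature(window, filt, pooling="average"))
# in practice many filters (and wider, n-gram filters) are used,
# giving one pooled feature value per filter
```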
Convolutional Neural Networks
• convolutional neural networks (convnets or CNNs) use filters that are "convolved with" (matched against all positions of) the input
• think of convolution as "perform the same operation everywhere on the input in some systematic order"
• "convolutional layer" = a set of filters that are convolved with the input vector (whether x or a hidden vector)
• could be followed by more convolutional layers, or by a type of pooling
• often used in NLP to convert a sentence into a feature vector

Recurrent Neural Networks
[figure: an RNN unrolled over the input sequence, with a "hidden vector" at each position]
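A numpy sketch of the basic recurrence behind an RNN, where each hidden vector depends on the current input vector and the previous hidden vector (a plain Elman-style RNN, not the LSTM discussed next):

```python
import numpy as np

def rnn_hidden_states(inputs, W, U, b):
    """Run a simple RNN over a sequence of input vectors.

    Each hidden vector depends on the current input and the previous
    hidden vector: h_t = tanh(W x_t + U h_{t-1} + b).
    """
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
inputs = [rng.normal(size=50) for _ in range(6)]   # six word vectors
W, U, b = rng.normal(size=(20, 50)), rng.normal(size=(20, 20)), np.zeros(20)
print(rnn_hidden_states(inputs, W, U, b)[-1].shape)  # final hidden vector, shape (20,)
```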
Long Short-Term Memory (LSTM) Recurrent Neural Networks
[figure: LSTM diagram]

Backward & Bidirectional LSTMs
• bidirectional: if shallow, just use forward and backward LSTMs in parallel, concatenate the final two hidden vectors, and feed them to the softmax

Deep LSTM (2-layer)
[figure: a stacked LSTM with layer 1 and layer 2]
Recursive Neural Networks for NLP
• first, run a constituent parser on the sentence
• convert the constituent tree to a binary tree (each rewrite has exactly two children)
• construct a vector for the sentence recursively at each rewrite ("split point"), as in the sketch below:
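A numpy sketch of the recursive composition, assuming a binarized tree given as nested pairs and a single shared composition matrix (one simple variant of a recursive network):

```python
import numpy as np

def compose(tree, word_vecs, W, b):
    """Recursively build a vector for each split point of a binarized tree.

    `tree` is either a word (a leaf) or a pair (left_subtree, right_subtree);
    at each rewrite, the two child vectors are concatenated and transformed.
    """
    if isinstance(tree, str):
        return word_vecs[tree]
    left, right = tree
    children = np.concatenate([compose(left, word_vecs, W, b),
                               compose(right, word_vecs, W, b)])
    return np.tanh(W @ children + b)

rng = np.random.default_rng(0)
dim = 10
word_vecs = {w: rng.normal(size=dim) for w in ["the", "man", "walked"]}
W, b = rng.normal(size=(dim, 2 * dim)), np.zeros(dim)
sentence_vec = compose((("the", "man"), "walked"), word_vecs, W, b)
print(sentence_vec.shape)  # (10,)
```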
Learning

Cost Functions
• a cost function scores an output against a gold standard
• it should reflect the evaluation metric for your task
• usual conventions: cost(y, y) = 0 and cost(y, y′) ≥ 0
• for classification, what cost should we use?
Empirical Risk Minimization (Vapnik et al.)
• replace the expectation over the true data distribution with a sum over training examples:

  min_θ Σ_i cost(y_i, argmax_y score(x_i, y, θ))

• problem: this is NP-hard even for binary classification with linear models
Empirical Risk Minimization with Surrogate Loss Functions
• given training data: pairs ⟨x_i, y_i⟩ for i = 1, …, |T|, where each y_i is a label
• we want to solve the following:

  min_θ Σ_i loss(x_i, y_i, θ)

• there are many possible loss functions to consider optimizing
Loss Functions
(losses written using the score and cost notation above)

name: cost ("0-1")
  loss: cost(y_i, argmax_y score(x_i, y, θ))
  where used: intractable, but underlies "direct error minimization"

name: perceptron
  loss: max_y score(x_i, y, θ) - score(x_i, y_i, θ)
  where used: perceptron algorithm (Rosenblatt, 1958)

name: hinge
  loss: max_y [score(x_i, y, θ) + cost(y_i, y)] - score(x_i, y_i, θ)
  where used: support vector machines, other large-margin algorithms

name: log
  loss: log Σ_y exp(score(x_i, y, θ)) - score(x_i, y_i, θ)
  where used: logistic regression, conditional random fields, maximum entropy models
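A numpy sketch computing the perceptron, hinge, and log losses for one example under a linear model score(x, y, θ) = θ · f(x, y), with 0-1 cost; the feature vectors here are random placeholders:

```python
import numpy as np

def losses(theta, feats, gold, cost=None):
    """Perceptron, hinge, and log losses for one example under a linear model.

    `feats[y]` is the feature vector f(x, y) for each candidate label y,
    so score(x, y, theta) = theta @ feats[y].  `cost[y]` defaults to 0-1 cost.
    """
    labels = list(feats)
    scores = {y: theta @ feats[y] for y in labels}
    if cost is None:
        cost = {y: float(y != gold) for y in labels}          # 0-1 cost
    perceptron = max(scores.values()) - scores[gold]
    hinge = max(scores[y] + cost[y] for y in labels) - scores[gold]
    log_loss = np.log(sum(np.exp(s) for s in scores.values())) - scores[gold]
    return perceptron, hinge, log_loss

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
feats = {y: rng.normal(size=4) for y in ["positive", "negative"]}  # f(x, y) per label
print(losses(theta, feats, gold="positive"))   # all three losses are non-negative
```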
(Sub)gradients of Losses for Linear Models
(entry j of the (sub)gradient of each loss, for a linear model score(x, y, θ) = θ · f(x, y))

cost ("0-1"): not subdifferentiable in general
perceptron: f_j(x_i, ŷ) - f_j(x_i, y_i), where ŷ = argmax_y score(x_i, y, θ)
hinge: f_j(x_i, ŷ) - f_j(x_i, y_i), where ŷ = argmax_y [score(x_i, y, θ) + cost(y_i, y)]
log: E_{p_θ(y | x_i)}[f_j(x_i, y)] - f_j(x_i, y_i)

• the first term of the log-loss gradient is the expectation of the feature value with respect to the distribution over y (where the distribution is defined by θ)
Visualization
[figure, built up over several slides: a bar for each of five possible outputs showing its score, its cost, and its score + cost; the gold standard output is marked]
perceptron loss:
[figure: the scores of the five possible outputs, with the gold standard marked]
• effect of learning: the gold standard will have the highest score
hinge loss:
[figure: score + cost for the five possible outputs, with the gold standard marked]
• effect of learning: the score of the gold standard will be higher than the score + cost of all others
Regularized Empirical Risk Minimization
• given training data: pairs ⟨x_i, y_i⟩ for i = 1, …, |T|, where each y_i is a label
• we want to solve the following:

  min_θ Σ_i loss(x_i, y_i, θ) + λ R(θ)

  where R(θ) is the regularization term and λ is the regularization strength
Regularization Terms
• most common: penalize large parameter values
• intuition: large parameters might be instances of overfitting
• examples:
  – L2 regularization: R(θ) = Σ_j θ_j^2 (also called Tikhonov regularization or ridge regression)
  – L1 regularization: R(θ) = Σ_j |θ_j| (also called basis pursuit or LASSO)
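A small numpy sketch of the two penalties and their (sub)gradients, which get added to the loss as λR(θ):

```python
import numpy as np

def l2_penalty(theta):
    """L2 (Tikhonov/ridge) regularization term and its gradient."""
    return np.sum(theta ** 2), 2.0 * theta

def l1_penalty(theta):
    """L1 (LASSO) regularization term and a subgradient."""
    return np.sum(np.abs(theta)), np.sign(theta)

theta = np.array([0.5, -2.0, 0.0, 3.0])
print(l2_penalty(theta)[0])   # 13.25
print(l1_penalty(theta)[0])   # 5.5
# the regularized objective adds lambda * R(theta) to the summed loss
```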
Dropout
• popular regularization method for neural networks
• randomly "drop out" (set to zero) some of the vector entries in the layers
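A numpy sketch of applying a dropout mask to a hidden vector during training; the 1/(1 - p) "inverted dropout" scaling is a common convention assumed here, not necessarily the one from lecture:

```python
import numpy as np

def dropout(h, p_drop=0.5, rng=None):
    """Randomly zero out entries of a hidden vector during training.

    Surviving entries are scaled by 1/(1 - p_drop) so that expected
    activations match those used at test time (inverted dropout).
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h, p_drop=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```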
Inference
Exponentially-Large Search Problems

  inference: solve  argmax_y score(x, y, θ)

• when the output is a sequence or tree, this argmax requires iterating over an exponentially-large set
Learning requires solving exponentially-hard problems too!
• recall the (sub)gradients of the perceptron, hinge, and log losses for linear models (above)
• computing each of these terms requires iterating through every possible output: an argmax over outputs for the perceptron and hinge losses, and an expectation over outputs for the log loss
Dynamic Programming (DP)
• what is dynamic programming?
  – a family of algorithms that break problems into smaller pieces and reuse solutions for those pieces
  – only applicable when the problem has certain properties (optimal substructure and overlapping sub-problems)
• in this class, we use DP to iterate over exponentially-large output spaces in polynomial time
• we focus on a particular type of DP algorithm: memoization

Implementing DP algorithms
• even if your goal is to compute a sum or a max, focus first on counting mode (count the number of unique outputs for an input); see the sketch below
• memoization = recursion + saving/reusing solutions
  – start by defining recursive equations
  – "memoize" by creating a table to store all intermediate results from the recursive equations, and use them when requested
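A sketch of counting mode for sequence labeling, using the same recursion structure as the Viterbi equations (the number of sequences ending in label y at position m, summed over the previous label) with memoization:

```python
from functools import lru_cache

def count_label_sequences(n_words, labels):
    """Count label sequences for an n-word sentence via recursion + memoization.

    With no constraints this just equals len(labels) ** n_words, which makes
    it easy to check that the DP is iterating over the right output space.
    """
    @lru_cache(maxsize=None)
    def count(m, y):
        if m == 1:
            return 1                                    # base case: one way to start with y
        return sum(count(m - 1, y_prev) for y_prev in labels)

    return sum(count(n_words, y) for y in labels)

print(count_label_sequences(4, ("NOUN", "VERB", "DET")))   # 81 == 3**4
```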
Inference in HMMs
• since the output is a sequence, this argmax requires iterating over an exponentially-large set
• last week we talked about using dynamic programming (DP) to solve these problems
• for HMMs (and other sequence models), the DP algorithm for solving this is called the Viterbi algorithm
Viterbi Algorithm
• recursive equations + memoization:
  – base case: returns the probability of the sequence starting with label y for the first word
  – recursive case: computes the probability of the max-probability label sequence that ends with label y at position m
  – the final value is obtained from the entries for the last position
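A sketch of the Viterbi recursion with memoization for a bigram HMM; the transition/emission dictionaries are hypothetical inputs, and backpointers (for recovering the best label sequence) and stop probabilities are omitted for brevity:

```python
from functools import lru_cache

def viterbi_best_score(words, labels, p_trans, p_emit, start="<s>"):
    """Probability of the max-probability label sequence for `words`.

    p_trans[(prev_label, label)] and p_emit[(label, word)] are HMM
    probabilities (smoothing omitted for brevity).
    """
    @lru_cache(maxsize=None)
    def V(m, y):
        # probability of the best label sequence for words[0..m] ending in label y
        if m == 0:
            return p_trans[(start, y)] * p_emit[(y, words[0])]   # base case
        best_prev = max(V(m - 1, y_prev) * p_trans[(y_prev, y)] for y_prev in labels)
        return best_prev * p_emit[(y, words[m])]                 # recursive case

    return max(V(len(words) - 1, y) for y in labels)             # final value

# tiny example with hypothetical probabilities
labels = ("D", "N")
p_trans = {("<s>", "D"): 0.7, ("<s>", "N"): 0.3,
           ("D", "N"): 0.9, ("D", "D"): 0.1,
           ("N", "D"): 0.4, ("N", "N"): 0.6}
p_emit = {("D", "the"): 0.8, ("D", "dog"): 0.2,
          ("N", "the"): 0.1, ("N", "dog"): 0.9}
print(viterbi_best_score(("the", "dog"), labels, p_trans, p_emit))   # 0.4536
```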
Viterbi Algorithm
• space and time complexity? both can be read off from the recursive equations:
  – space complexity: the size of the memoization table, which is the number of unique indices of the recursive equations; so space is O(|x| |L|), where |x| is the length of the sentence and |L| is the number of labels
  – time complexity: the size of the memoization table times the complexity of computing each entry; each entry requires iterating through the labels, so time is O(|x| |L| |L|) = O(|x| |L|^2)
Feature Locality
• feature locality: how "big" are your features?
• when designing efficient inference algorithms (whether with DP or other methods), we need to be mindful of this
• features can be arbitrarily big in terms of the input, but not in terms of the output!
• the features in HMMs are small in both the input and output sequences (only two pieces at a time)