TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 13: Dependency Syntax/Parsing & Review for Midterm

Announcement
• project proposal due today
• email me to set up a 15-minute meeting next week to discuss your project proposal
• times posted on course web page
• let me know if none of those work for you

Announcement
• midterm is Thursday, room #530
• closed-book, but you can bring an 8.5x11 sheet (though I don't think you'll need to)
• we will start at 10:35am, finish at 11:50am

Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• neural network methods in NLP
• syntax and syntactic parsing
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications

What is Syntax?
• rules, principles, and processes that govern the sentence structure of a language
• can differ widely among languages
• but every language has systematic structural principles
Constituent Parse (Bracketing/Tree)
(S (NP the man) (VP walked (PP to (NP the park))))

[tree diagram of "the man walked to the park": S over NP and VP, with PP and NP inside the VP, and preterminals DT NN VBD IN DT NN over the words]

Key: S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, DT = determiner, NN = noun, VBD = verb (past tense), IN = preposition

Constituent Parse (Bracketing/Tree)
(S (NP the man) (VP walked (PP to (NP the park))))

[same tree diagram, annotated: S, NP, VP, PP are nonterminals; DT, NN, VBD, IN are preterminals; the words are terminals]

Penn Treebank Nonterminals
[table of Penn Treebank nonterminal labels]
Probabilistic Context-Free Grammar (PCFG)
• assign probabilities to rewrite rules:

  NP → DT NN    0.5
  NP → NNS      0.3
  NP → NP PP    0.2

  NN → man      0.01
  NN → park     0.0004
  NN → walk     0.002
  NN → …

• given a treebank, estimate these probabilities using MLE ("count and normalize")
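To make "count and normalize" concrete, here is a minimal sketch of MLE estimation of PCFG rule probabilities; it assumes a toy treebank of nested (label, child, child, ...) tuples, a hypothetical format chosen only for illustration:

```python
from collections import Counter

def pcfg_mle(treebank):
    """Estimate rule probabilities by counting rules and normalizing per left-hand side."""
    rule_counts, lhs_counts = Counter(), Counter()

    def collect(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            rhs = (children[0],)                          # lexical rule, e.g. NN -> man
        else:
            rhs = tuple(child[0] for child in children)   # e.g. NP -> DT NN
            for child in children:
                collect(child)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1

    for tree in treebank:
        collect(tree)
    # p(rhs | lhs) = count(lhs -> rhs) / count(lhs)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

# toy one-tree "treebank" for the example sentence
tree = ("S", ("NP", ("DT", "the"), ("NN", "man")),
             ("VP", ("VBD", "walked"),
                    ("PP", ("IN", "to"), ("NP", ("DT", "the"), ("NN", "park")))))
probs = pcfg_mle([tree])
print(probs[("NP", ("DT", "NN"))])   # 1.0 in this tiny treebank
```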
How well does a PCFG work?
• a PCFG learned from the Penn Treebank with MLE gets about 73% F1 score
• state-of-the-art parsers are around 92%
• simple modifications can improve PCFGs:
  – smoothing
  – tree transformations (selective flattening)
  – parent annotation
Parent Annotation
VP → V NP PP   becomes   VP^S → V NP^VP PP^VP

• adds more information, but also fragments counts, making parameter estimates noisier (since we're just using MLE)
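A sketch of the parent-annotation transform on the same nested-tuple trees used above (hypothetical format); each nonterminal gets its parent's label appended, so extracting rules from the transformed tree yields, e.g., VP^S → VBD NP^VP PP^VP:

```python
def parent_annotate(node, parent="TOP"):
    """Append the parent's label to each nonterminal, e.g. VP becomes VP^S."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return node                                  # preterminal: leave as-is
    annotated_children = tuple(parent_annotate(child, label) for child in children)
    return (label + "^" + parent,) + annotated_children

# reusing `tree` from the PCFG sketch above
print(parent_annotate(tree)[0])        # 'S^TOP'
print(parent_annotate(tree)[2][0])     # 'VP^S'
```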
How well does a PCFG work?
• a PCFG learned from the Penn Treebank with MLE gets about 73% F1 score
• state-of-the-art parsers are around 92%
• simple modifications can improve PCFGs:
  – smoothing
  – tree transformations (selective flattening)
  – parent annotation
  – lexicalization

Collins (1997)
[figure from the paper]

Lexicalized PCFGs
[figure: nonterminals are decorated with the head word of the subtree]

Lexicalization
• this adds a lot more rules!
• many more parameters to estimate → smoothing becomes much more important
  – e.g., the right-hand side of a rule might be factored into several steps
• but it's worth it, because head words are really useful for constituent parsing

Results (Collins, 1997)
[table of parsing results]
Head Rules
• how are heads decided?
• most researchers use deterministic head rules (Magerman/Collins)
• for a PCFG rule A → B1 … BN, these head rules say which of B1 … BN is the head of the rule
• examples (see the sketch below):
  S → NP VP
  VP → VBD NP PP
  NP → DT JJ NN
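A sketch of how deterministic head rules can be applied; the rule table below is a small, simplified illustration in the spirit of the Magerman/Collins rules, not the actual tables:

```python
# Simplified head rules: for each parent label, a search direction and a
# priority list of child labels.  (Illustrative only; the real
# Magerman/Collins tables are much larger.)
HEAD_RULES = {
    "S":  ("right", ["VP", "S", "NP"]),
    "VP": ("left",  ["VBD", "VBZ", "VB", "VP", "NP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left",  ["IN", "TO"]),
}

def head_child_index(parent, children):
    """Return the index of the head child for the rule parent -> children."""
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    order = list(range(len(children)) if direction == "left"
                 else reversed(range(len(children))))
    for label in priorities:               # search by priority, then by direction
        for i in order:
            if children[i] == label:
                return i
    return order[0]                        # default: first child in the search direction

print(head_child_index("S", ["NP", "VP"]))          # 1 (VP is the head)
print(head_child_index("VP", ["VBD", "NP", "PP"]))  # 0 (VBD is the head)
print(head_child_index("NP", ["DT", "JJ", "NN"]))   # 2 (NN is the head)
```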
Head Annotation
[figure from Noah Smith]

Lexical Head Annotation
[figure from Noah Smith]

Lexical Head Annotation → Dependencies
• remove non-lexical parts:
[figure from Noah Smith]

Dependencies
• merge redundant nodes:
[figure from Noah Smith]

[figure: constituent parse and the corresponding dependency parse of the same sentence]
[figure: constituent parse and labeled dependency parse of the same sentence, with edge labels nsubj, det, dobj, prep, pobj]

Key: nsubj = "nominal subject", dobj = "direct object", prep = "preposition modifier", pobj = "object of preposition", det = "determiner"

• a labeled dependency parse captures some semantic relationships
• how (unlabeled) dependency trees are typically drawn:
  – the root of the tree is represented by $ (the "wall" symbol)
  – arrows are drawn entirely above (or below) the sentence
  – arrows are directed from child to parent (or from parent to child); you will see both in practice, so don't get confused!

source:    $ konnten sie es übersetzen ?
reference: $ could you translate it ?
Crossing Dependencies
• if dependencies cross ("nonprojective"), the tree no longer corresponds to a PCFG
[figure from Noah Smith]

Projective vs. Nonprojective Dependency Parsing
• English dependency treebanks are mostly projective
  – but when focusing more on semantic relationships, trees often become more nonprojective
• some (relatively) free word order languages, like Czech, are fairly nonprojective
• nonprojective parsing can be formulated as a minimum spanning tree problem
• projective parsing cannot
Dependency Parsing
• several widely-used algorithms
• different guarantees, but similar performance in practice
• graph-based:
  – dynamic programming (Eisner, 1997)
  – minimum spanning tree (McDonald et al., 2005)
• transition-based:
  – shift-reduce (Nivre, inter alia)

Dependency Parsers
• Stanford parser
• TurboParser
• Joakim Nivre's MaltParser
• Ryan McDonald's MSTParser
• and many others, for many non-English languages

Complexity Comparison
• constituent parsing: O(Gn^3)
  – parsing complexity depends on grammar structure ("grammar constant" G)
  – since it has lots of nonterminal-only rules at the top of the tree, there are many rule probabilities to estimate
• dependency parsing: O(n^3)
  – operates directly on words, so parsing complexity has no grammar constant
  – features are designed on possible dependencies (pairs of words) and larger structures
  – transition-based parsing algorithms are O(n), though not optimal; also, nonprojective parsing is faster
Applications of Dependency Parsing
• widely used for NLP tasks because:
  – faster than constituent parsing
  – captures more semantic information
• text classification (features on dependencies)
• syntax-based machine translation
• relation extraction
  – e.g., extract the relation between Sam Smith and AI Tech: "Sam Smith was named new CEO of AI Tech."
  – use the dependency path between Sam Smith and AI Tech:
    • Smith → named, named ← CEO, CEO ← of, of ← AI Tech
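A sketch of extracting such a dependency path, assuming the parse is given as an array of head indices (with -1 for the root); the example head indices and the arrow notation are illustrative, and "AI Tech" is split into two tokens here:

```python
def dependency_path(heads, words, i, j):
    """Path between tokens i and j in a dependency tree.

    `heads[k]` is the index of the head of token k (or -1 for the root).
    Returns strings like "Smith->named" (child to head) and "named<-CEO"
    (head to child), in the spirit of the path on the slide.
    """
    def ancestors(k):
        chain = [k]
        while heads[k] != -1:
            k = heads[k]
            chain.append(k)
        return chain

    up_i, up_j = ancestors(i), ancestors(j)
    common = next(a for a in up_i if a in up_j)              # lowest common ancestor
    path = []
    for child, head in zip(up_i, up_i[1:]):                  # climb from i to the LCA
        if child == common:
            break
        path.append(f"{words[child]}->{words[head]}")
    down = up_j[:up_j.index(common) + 1]                     # [j, ..., common]
    for child, head in reversed(list(zip(down, down[1:]))):  # descend from the LCA to j
        path.append(f"{words[head]}<-{words[child]}")
    return path

words = ["Sam", "Smith", "was", "named", "new", "CEO", "of", "AI", "Tech", "."]
heads = [1, 3, 3, -1, 5, 3, 5, 8, 6, 3]   # hypothetical parse; "named" is the root
print(dependency_path(heads, words, 1, 8))
# ['Smith->named', 'named<-CEO', 'CEO<-of', 'of<-Tech']
```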
Summary: two types of grammars
• phrase structure / constituent grammars
  – inspired mostly by Chomsky and others
  – only appropriate for certain languages (e.g., English)
• dependency grammars
  – closer to a semantic representation; some have made this more explicit
  – problematic for certain syntactic structures (e.g., conjunctions, nesting of noun phrases, etc.)
• both are widely used in NLP
• you can find constituent parsers and dependency parsers for several languages online

Review
Modeling, Inference, Learning
• Modeling: How do we assign a score to an (x, y) pair using parameters θ?

  modeling: define the score function  score(x, y, θ)

Modeling, Inference, Learning
• Inference: How do we efficiently search over the space of all labels?

  modeling: define the score function  score(x, y, θ)
  inference: solve  argmax_y score(x, y, θ)

Modeling, Inference, Learning
• Learning: How do we choose θ?

  modeling: define the score function  score(x, y, θ)
  inference: solve  argmax_y score(x, y, θ)
  learning: choose θ
Applications

Applications of our Classification Framework
text classification:

  x: "the hulk is an anger fueled monster with incredible strength and resistance to damage."   →   y: objective
  x: "in trying to be daring and original, it comes off as only occasionally satirical and never fresh."   →   y: subjective

  output space = {objective, subjective}

Applications of our Classification Framework
word sense classifier for bass:

  x: "he's a bass in the choir."   →   y: bass3
  x: "our bass is line-caught from the Atlantic."   →   y: bass4

  output space = {bass1, bass2, …, bass8}
Applications of our Classification Framework
skip-gram model as a classifier:

  x: agriculture   →   y: <s>
  x: agriculture   →   y: is
  x: agriculture   →   y: the

  output space = V (the entire vocabulary)

corpus (English Wikipedia):
  agriculture is the traditional mainstay of the cambodian economy .
  but benares has been destroyed by an earthquake .
  …
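A sketch of how these (x, y) training pairs are generated from a corpus, assuming a symmetric context window of two words and an explicit <s> boundary token:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center word, context word) classification pairs.

    Each pair treats the center word as the input x and one nearby word
    as the label y, matching the "skip-gram model as a classifier" view.
    """
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["<s>"] + "agriculture is the traditional mainstay of the cambodian economy .".split()
for x, y in skipgram_pairs(sentence):
    if x == "agriculture":
        print(x, "->", y)
# agriculture -> <s>
# agriculture -> is
# agriculture -> the
```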
Part-of-Speech Tagging

  Some        questioned    if     Tim     Cook    's     first   product
  determiner  verb (past)   prep.  proper  proper  poss.  adj.    noun

  would   be     a      breakaway   hit    for     Apple    .
  modal   verb   det.   adjective   noun   prep.   proper   punc.

Simplest kind of structured prediction: Sequence Labeling
Named Entity Recognition

  Some  questioned  if  Tim       Cook      's  first  product
  O     O           O   B-PERSON  I-PERSON  O   O      O

  would  be  a  breakaway  hit  for  Apple            .
  O      O   O  O          O    O    B-ORGANIZATION   O

B = "begin", I = "inside", O = "outside"
Formulating segmentation tasks as sequence labeling via B-I-O labeling (see the sketch below):
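A minimal sketch of B-I-O conversion, assuming entity spans are given as (start, end, label) with an exclusive end index:

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into B-I-O tags.

    `spans` is a list of (start, end, label) with `end` exclusive,
    e.g. (3, 5, "PERSON") for the tokens "Tim Cook".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label                 # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label                 # remaining tokens of the span
    return tags

tokens = "Some questioned if Tim Cook 's first product".split()
print(spans_to_bio(tokens, [(3, 5, "PERSON")]))
# ['O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'O', 'O', 'O']
```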
Applications of our Classifier Framework so far

task: text classification
  input (x): a sentence
  output (y): gold standard label for x
  output space: pre-defined, small label set (e.g., {positive, negative})
  size of output space: 2-10

task: word sense disambiguation
  input (x): instance of a particular word (e.g., bass) with its context
  output (y): gold standard word sense of x
  output space: pre-defined sense inventory from WordNet (for bass)
  size of output space: 2-30

task: learning skip-gram word embeddings
  input (x): instance of a word in a corpus
  output (y): a word in the context of x in a corpus
  output space: vocabulary
  size of output space: |V|

task: part-of-speech tagging
  input (x): a sentence
  output (y): gold standard part-of-speech tags for x
  output space: all possible part-of-speech tag sequences with the same length as x
  size of output space: |P|^|x|

Applications of Classifier Framework (continued)

task: named entity recognition
  input (x): a sentence
  output (y): gold standard named entity labels for x (BIO tags)
  output space: all possible BIO label sequences with the same length as x
  size of output space: |P|^|x|

task: constituent parsing
  input (x): a sentence
  output (y): gold standard constituent parse (labeled bracketing) of x
  output space: all possible labeled bracketings of x
  size of output space: exponential in the length of x (Catalan number)

task: dependency parsing
  input (x): a sentence
  output (y): gold standard dependency parse (labeled directed spanning tree) of x
  output space: all possible labeled directed spanning trees of x
  size of output space: exponential in the length of x

• each application draws from particular linguistic concepts and must address different kinds of linguistic ambiguity/variability:
  – word sense: sense granularity, relationships among senses, word sense ambiguity
  – word vectors: distributional properties, sense ambiguity, different kinds of similarity
  – part-of-speech: tag granularity, tag ambiguity
  – parsing: constituent/dependency relationships, attachment & coordination ambiguities
Modeling

model families
• linear models
  – lots of freedom in defining features, though feature engineering is required for best performance
  – learning uses optimization of a loss function
  – one can (try to) interpret learned feature weights
• stochastic/generative models
  – linear models with simple "features" (counts of events)
  – learning is easy: count & normalize (but smoothing is needed)
  – easy to generate samples
• neural networks
  – can usually get away with less feature engineering
  – learning uses optimization of a loss function
  – hard to interpret (though we try!), but often works best
special case of linear models: stochastic/generative models

model: n-gram language models
  tasks: language modeling (for MT, ASR, etc.)
  context expansion: increase n

model: hidden Markov models
  tasks: part-of-speech tagging, named entity recognition, word clustering
  context expansion: increase the order of the HMM (e.g., bigram HMM → trigram HMM)

model: probabilistic context-free grammars
  tasks: constituent parsing
  context expansion: increase the size of rules, e.g., flattening, parent annotation, etc.

• all use MLE + smoothing (though probably different kinds of smoothing)
• all assign probability to sentences (some assign probability jointly to pairs of <sentence, something else>)
• all have the same trade-off between increasing "context" (feature size) and needing more data / better smoothing
Feature Engineering for Text Classification
• two features: [feature definitions shown as equations on the slide]
• what should the weights be?

Higher-Order Binary Feature Templates
• unigram binary template: [equation on the slide]
• bigram binary template: [equation on the slide]
• trigram binary features …

Unigram Count Features
• a "count" feature returns the count of a particular word in the text
• unigram count feature template: [equation on the slide]

Feature Count Cutoffs
• problem: some features are extremely rare
• solution: only keep features that appear at least k times in the training data
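A sketch of instantiating a unigram feature template (binary or count version) and applying a count cutoff; the feature-naming scheme is made up for illustration:

```python
from collections import Counter

def unigram_features(text, binary=True):
    """Instantiate the unigram feature template for one text.

    Binary features fire with value 1 if the word appears at all;
    count features return how many times the word appears.
    """
    counts = Counter(text.split())
    if binary:
        return {f"unigram={w}": 1 for w in counts}
    return {f"unigram_count={w}": c for w, c in counts.items()}

def apply_count_cutoff(training_texts, k=2):
    """Keep only features that fire at least k times in the training data."""
    totals = Counter()
    for text in training_texts:
        totals.update(unigram_features(text, binary=True))
    return {name for name, count in totals.items() if count >= k}

train = ["the movie was great", "the plot was thin", "great acting , great movie"]
kept = apply_count_cutoff(train, k=2)
print(sorted(kept))  # features for 'great', 'movie', 'the', 'was' survive the cutoff
```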
2-transformation (1-layer) network
• we'll call this a "2-transformation" neural network, or a "1-layer" neural network
• the input is a vector x; the output is a vector of label scores
• one hidden vector ("hidden layer") in between
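A numpy sketch of the two transformations, with tanh as the nonlinearity and illustrative dimensions:

```python
import numpy as np

def one_layer_scores(x, W1, b1, W2, b2):
    """Two transformations: input -> hidden vector -> vector of label scores."""
    hidden = np.tanh(W1 @ x + b1)      # first transformation + nonlinearity
    return W2 @ hidden + b2            # second transformation: one score per label

rng = np.random.default_rng(0)
x = rng.normal(size=50)                           # e.g. an averaged word-vector input
W1, b1 = rng.normal(size=(20, 50)), np.zeros(20)  # hidden layer of size 20
W2, b2 = rng.normal(size=(3, 20)), np.zeros(3)    # 3 labels
print(one_layer_scores(x, W1, b1, W2, b2))        # 3 label scores
```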
1-layer neural network for sentiment classification
[figure: network diagram]

Neural Networks for Twitter Part-of-Speech Tagging

  ikr smh he asked fir yo last name so he can
[figure: the tweet above with its part-of-speech tags, drawn from a coarse tag set (intj, pronoun, prep, adj, verb, other, det, noun)]

• let's use the center word + two words to the right:
  [figure: the vectors for yo, last, and name concatenated into the input x]
• if name is to the right of yo, then yo is probably a form of your
• but our x above uses separate dimensions for each position!
  – i.e., name is two words to the right
  – what if name is one word to the right?

Convolution
[figure: the vectors for yo, last, and name, with a filter applied at each position]
• the result is a "feature map", which has an entry for each word position in the context window/sentence
Pooling
[figure: the vectors for yo, last, and name, and the resulting feature map with an entry for each word position in the context window/sentence]

• how do we convert this into a fixed-length vector? use pooling:
  – max-pooling: returns the maximum value in the feature map
  – average pooling: returns the average of the values in the feature map
• this single filter then produces a single feature value (the output of some kind of pooling); in practice, we use many filters of many different lengths (e.g., n-grams rather than words)
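A numpy sketch of one (single-word) filter convolved over a window of word vectors, followed by max or average pooling; the dimensions and the dot-product filter form are illustrative:

```python
import numpy as np

def conv_feature(word_vectors, filt, pooling="max"):
    """Apply one filter at every word position, then pool the feature map.

    `word_vectors` is an (n_words, dim) matrix; `filt` is a length-`dim`
    filter (a single-word filter, for simplicity).  The feature map has one
    entry per word position; pooling turns it into a single feature value.
    """
    feature_map = word_vectors @ filt              # one entry per position
    if pooling == "max":
        return feature_map.max()
    return feature_map.mean()                      # average pooling

rng = np.random.default_rng(0)
window = rng.normal(size=(3, 50))                  # vectors for "yo", "last", "name"
filt = rng.normal(size=50)
print(conv_feature(window, filt, pooling="max"))
print(conv_feature(window, filt, pooling="average"))
# in practice many filters (and wider, n-gram filters) are used,
# giving one pooled feature value per filter
```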
Convolutional Neural Networks
• convolutional neural networks (convnets or CNNs) use filters that are "convolved with" (matched against all positions of) the input
• think of convolution as "perform the same operation everywhere on the input in some systematic order"
• "convolutional layer" = a set of filters that are convolved with the input vector (whether x or a hidden vector)
• could be followed by more convolutional layers, or by a type of pooling
• often used in NLP to convert a sentence into a feature vector

Recurrent Neural Networks
[figure: an RNN unrolled over the input sequence, with a "hidden vector" at each position]
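A numpy sketch of the basic recurrence behind an RNN, where each hidden vector depends on the current input vector and the previous hidden vector (a plain Elman-style RNN, not the LSTM discussed next):

```python
import numpy as np

def rnn_hidden_states(inputs, W, U, b):
    """Run a simple RNN over a sequence of input vectors.

    Each hidden vector depends on the current input and the previous
    hidden vector: h_t = tanh(W x_t + U h_{t-1} + b).
    """
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
inputs = [rng.normal(size=50) for _ in range(6)]   # six word vectors
W, U, b = rng.normal(size=(20, 50)), rng.normal(size=(20, 20)), np.zeros(20)
print(rnn_hidden_states(inputs, W, U, b)[-1].shape)  # final hidden vector, shape (20,)
```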
Long Short-Term Memory (LSTM) Recurrent Neural Networks
[figure: LSTM diagram]

Backward & Bidirectional LSTMs
• bidirectional: if shallow, just use forward and backward LSTMs in parallel, concatenate the final two hidden vectors, and feed them to the softmax

Deep LSTM (2-layer)
[figure: a stacked LSTM with layer 1 and layer 2]
Recursive Neural Networks for NLP
• first, run a constituent parser on the sentence
• convert the constituent tree to a binary tree (each rewrite has exactly two children)
• construct a vector for the sentence recursively at each rewrite ("split point"), as in the sketch below:
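A numpy sketch of the recursive composition, assuming a binarized tree given as nested pairs and a single shared composition matrix (one simple variant of a recursive network):

```python
import numpy as np

def compose(tree, word_vecs, W, b):
    """Recursively build a vector for each split point of a binarized tree.

    `tree` is either a word (a leaf) or a pair (left_subtree, right_subtree);
    at each rewrite, the two child vectors are concatenated and transformed.
    """
    if isinstance(tree, str):
        return word_vecs[tree]
    left, right = tree
    children = np.concatenate([compose(left, word_vecs, W, b),
                               compose(right, word_vecs, W, b)])
    return np.tanh(W @ children + b)

rng = np.random.default_rng(0)
dim = 10
word_vecs = {w: rng.normal(size=dim) for w in ["the", "man", "walked"]}
W, b = rng.normal(size=(dim, 2 * dim)), np.zeros(dim)
sentence_vec = compose((("the", "man"), "walked"), word_vecs, W, b)
print(sentence_vec.shape)  # (10,)
```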
Learning

Cost Functions
• a cost function scores an output against a gold standard
• it should reflect the evaluation metric for your task
• usual conventions: cost(y, y) = 0 and cost(y, y′) ≥ 0
• for classification, what cost should we use?
Empirical Risk Minimization (Vapnik et al.)
• replace the expectation over the true data distribution with a sum over training examples:

  min_θ Σ_i cost(y_i, argmax_y score(x_i, y, θ))

• problem: this is NP-hard even for binary classification with linear models
Empirical Risk Minimization with Surrogate Loss Functions
• given training data: pairs ⟨x_i, y_i⟩ for i = 1, …, |T|, where each y_i is a label
• we want to solve the following:

  min_θ Σ_i loss(x_i, y_i, θ)

• there are many possible loss functions to consider optimizing
Loss Functions
(losses written using the score and cost notation above)

name: cost ("0-1")
  loss: cost(y_i, argmax_y score(x_i, y, θ))
  where used: intractable, but underlies "direct error minimization"

name: perceptron
  loss: max_y score(x_i, y, θ) - score(x_i, y_i, θ)
  where used: perceptron algorithm (Rosenblatt, 1958)

name: hinge
  loss: max_y [score(x_i, y, θ) + cost(y_i, y)] - score(x_i, y_i, θ)
  where used: support vector machines, other large-margin algorithms

name: log
  loss: log Σ_y exp(score(x_i, y, θ)) - score(x_i, y_i, θ)
  where used: logistic regression, conditional random fields, maximum entropy models
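A numpy sketch computing the perceptron, hinge, and log losses for one example under a linear model score(x, y, θ) = θ · f(x, y), with 0-1 cost; the feature vectors here are random placeholders:

```python
import numpy as np

def losses(theta, feats, gold, cost=None):
    """Perceptron, hinge, and log losses for one example under a linear model.

    `feats[y]` is the feature vector f(x, y) for each candidate label y,
    so score(x, y, theta) = theta @ feats[y].  `cost[y]` defaults to 0-1 cost.
    """
    labels = list(feats)
    scores = {y: theta @ feats[y] for y in labels}
    if cost is None:
        cost = {y: float(y != gold) for y in labels}          # 0-1 cost
    perceptron = max(scores.values()) - scores[gold]
    hinge = max(scores[y] + cost[y] for y in labels) - scores[gold]
    log_loss = np.log(sum(np.exp(s) for s in scores.values())) - scores[gold]
    return perceptron, hinge, log_loss

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
feats = {y: rng.normal(size=4) for y in ["positive", "negative"]}  # f(x, y) per label
print(losses(theta, feats, gold="positive"))   # all three losses are non-negative
```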
(Sub)gradients of Losses for Linear Models
(entry j of the (sub)gradient of each loss, for a linear model score(x, y, θ) = θ · f(x, y))

cost ("0-1"): not subdifferentiable in general
perceptron: f_j(x_i, ŷ) - f_j(x_i, y_i), where ŷ = argmax_y score(x_i, y, θ)
hinge: f_j(x_i, ŷ) - f_j(x_i, y_i), where ŷ = argmax_y [score(x_i, y, θ) + cost(y_i, y)]
log: E_{p_θ(y | x_i)}[f_j(x_i, y)] - f_j(x_i, y_i)

• the first term of the log-loss gradient is the expectation of the feature value with respect to the distribution over y (where the distribution is defined by θ)
Visualization
[figure, built up over several slides: a bar for each of five possible outputs showing its score, its cost, and its score + cost; the gold standard output is marked]
perceptron loss:
[figure: the scores of the five possible outputs, with the gold standard marked]
• effect of learning: the gold standard will have the highest score
hinge loss:
[figure: score + cost for the five possible outputs, with the gold standard marked]
• effect of learning: the score of the gold standard will be higher than the score + cost of all others
Regularized Empirical Risk Minimization
• given training data: pairs ⟨x_i, y_i⟩ for i = 1, …, |T|, where each y_i is a label
• we want to solve the following:

  min_θ Σ_i loss(x_i, y_i, θ) + λ R(θ)

  where R(θ) is the regularization term and λ is the regularization strength
Regularization Terms
• most common: penalize large parameter values
• intuition: large parameters might be instances of overfitting
• examples:
  – L2 regularization: R(θ) = Σ_j θ_j^2 (also called Tikhonov regularization or ridge regression)
  – L1 regularization: R(θ) = Σ_j |θ_j| (also called basis pursuit or LASSO)
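A small numpy sketch of the two penalties and their (sub)gradients, which get added to the loss as λR(θ):

```python
import numpy as np

def l2_penalty(theta):
    """L2 (Tikhonov/ridge) regularization term and its gradient."""
    return np.sum(theta ** 2), 2.0 * theta

def l1_penalty(theta):
    """L1 (LASSO) regularization term and a subgradient."""
    return np.sum(np.abs(theta)), np.sign(theta)

theta = np.array([0.5, -2.0, 0.0, 3.0])
print(l2_penalty(theta)[0])   # 13.25
print(l1_penalty(theta)[0])   # 5.5
# the regularized objective adds lambda * R(theta) to the summed loss
```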
Dropout
• popular regularization method for neural networks
• randomly "drop out" (set to zero) some of the vector entries in the layers
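A numpy sketch of applying a dropout mask to a hidden vector during training; the 1/(1 - p) "inverted dropout" scaling is a common convention assumed here, not necessarily the one from lecture:

```python
import numpy as np

def dropout(h, p_drop=0.5, rng=None):
    """Randomly zero out entries of a hidden vector during training.

    Surviving entries are scaled by 1/(1 - p_drop) so that expected
    activations match those used at test time (inverted dropout).
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones(8)
print(dropout(h, p_drop=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```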
Inference
Exponentially-Large Search Problems

  inference: solve  argmax_y score(x, y, θ)

• when the output is a sequence or tree, this argmax requires iterating over an exponentially-large set
Learning requires solving exponentially-hard problems too!
• recall the (sub)gradients of the perceptron, hinge, and log losses for linear models (above)
• computing each of these terms requires iterating through every possible output: an argmax over outputs for the perceptron and hinge losses, and an expectation over outputs for the log loss
Dynamic Programming (DP)
• what is dynamic programming?
  – a family of algorithms that break problems into smaller pieces and reuse solutions for those pieces
  – only applicable when the problem has certain properties (optimal substructure and overlapping sub-problems)
• in this class, we use DP to iterate over exponentially-large output spaces in polynomial time
• we focus on a particular type of DP algorithm: memoization

Implementing DP algorithms
• even if your goal is to compute a sum or a max, focus first on counting mode (count the number of unique outputs for an input); see the sketch below
• memoization = recursion + saving/reusing solutions
  – start by defining recursive equations
  – "memoize" by creating a table to store all intermediate results from the recursive equations, and use them when requested
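A sketch of counting mode for sequence labeling, using the same recursion structure as the Viterbi equations (the number of sequences ending in label y at position m, summed over the previous label) with memoization:

```python
from functools import lru_cache

def count_label_sequences(n_words, labels):
    """Count label sequences for an n-word sentence via recursion + memoization.

    With no constraints this just equals len(labels) ** n_words, which makes
    it easy to check that the DP is iterating over the right output space.
    """
    @lru_cache(maxsize=None)
    def count(m, y):
        if m == 1:
            return 1                                    # base case: one way to start with y
        return sum(count(m - 1, y_prev) for y_prev in labels)

    return sum(count(n_words, y) for y in labels)

print(count_label_sequences(4, ("NOUN", "VERB", "DET")))   # 81 == 3**4
```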
Inference in HMMs
• since the output is a sequence, this argmax requires iterating over an exponentially-large set
• last week we talked about using dynamic programming (DP) to solve these problems
• for HMMs (and other sequence models), the DP algorithm for solving this is called the Viterbi algorithm
Viterbi Algorithm
• recursive equations + memoization:
  – base case: returns the probability of the sequence starting with label y for the first word
  – recursive case: computes the probability of the max-probability label sequence that ends with label y at position m
  – the final value is obtained from the entries for the last position
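A sketch of the Viterbi recursion with memoization for a bigram HMM; the transition/emission dictionaries are hypothetical inputs, and backpointers (for recovering the best label sequence) and stop probabilities are omitted for brevity:

```python
from functools import lru_cache

def viterbi_best_score(words, labels, p_trans, p_emit, start="<s>"):
    """Probability of the max-probability label sequence for `words`.

    p_trans[(prev_label, label)] and p_emit[(label, word)] are HMM
    probabilities (smoothing omitted for brevity).
    """
    @lru_cache(maxsize=None)
    def V(m, y):
        # probability of the best label sequence for words[0..m] ending in label y
        if m == 0:
            return p_trans[(start, y)] * p_emit[(y, words[0])]   # base case
        best_prev = max(V(m - 1, y_prev) * p_trans[(y_prev, y)] for y_prev in labels)
        return best_prev * p_emit[(y, words[m])]                 # recursive case

    return max(V(len(words) - 1, y) for y in labels)             # final value

# tiny example with hypothetical probabilities
labels = ("D", "N")
p_trans = {("<s>", "D"): 0.7, ("<s>", "N"): 0.3,
           ("D", "N"): 0.9, ("D", "D"): 0.1,
           ("N", "D"): 0.4, ("N", "N"): 0.6}
p_emit = {("D", "the"): 0.8, ("D", "dog"): 0.2,
          ("N", "the"): 0.1, ("N", "dog"): 0.9}
print(viterbi_best_score(("the", "dog"), labels, p_trans, p_emit))   # 0.4536
```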
Viterbi Algorithm
• space and time complexity? both can be read off from the recursive equations:
  – space complexity: the size of the memoization table, which is the number of unique indices of the recursive equations; so space is O(|x| |L|), where |x| is the length of the sentence and |L| is the number of labels
  – time complexity: the size of the memoization table times the complexity of computing each entry; each entry requires iterating through the labels, so time is O(|x| |L| |L|) = O(|x| |L|^2)
Feature Locality
• feature locality: how "big" are your features?
• when designing efficient inference algorithms (whether with DP or other methods), we need to be mindful of this
• features can be arbitrarily big in terms of the input, but not in terms of the output!
• the features in HMMs are small in both the input and output sequences (only two pieces at a time)