
Bottom-up Parsing

Micha Elsner

February 21, 2013

Why parsing?

- As linguistic inquiry
  - What do we need to put in our model to do a good job?
  - Early on, lots of connections between syntactic theorists and comp ling parsing people
  - Now not so much
- To add long-distance dependencies
  - Correct some errors in tagging, LMs
- To aid information extraction, translation
  - Via compositional semantics
  - An active area; results not yet amazing
- To model writing style, information structure, processing
  - Parser as model of processing

Parsing: basics of representation

- Parsing: find syntactic structure for an input sentence
- What kind of structure?
- Most popular answer given by the Penn Treebank project
  - Marcus, Santorini and Marcinkiewicz '93
  - Many person-hours of grad student labor
- The PTB uses a phrase structure grammar...
  - Loosely based on Government and Binding theory
  - Plus some null elements (traces) for movement
- Many researchers use this formalism by default
  - Better than making your own treebank!

"To the degree that most grammatical formalisms tend to capture the same regularities, this can still be a successful strategy even if no one formalism is widely preferred over the rest." –Charniak

What PTB trees look like

( (S
    (NP-SBJ (NNP Ms.) (NNP Haag))
    (VP (VBZ plays)
        (NP (NNP Elianti)))
    (. .) ))


Parsing the PTB

- Most of the influential statistical parsing projects don't do traces or movement
  - Algorithms that actually move things are horribly inefficient
  - Real problem (still open) is to get long-distance dependencies
- Nor do they do the function annotations (like -SBJ)
  - Although these can be done fairly easily
- Without these, the grammar is context-free

This lecture: context-free grammar parsing and some important ingredients for a good parser

Review: CFGs

- Produce a string of terminals
  - By expanding nonterminal symbols
  - Using rewriting rules

S → NP VP
NP → DT NN
NP → NNP
DT → the
NN → cat
NN → mat
VP → V
VP → V PP
V → sat
PP → IN NP
IN → on
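For concreteness, this toy grammar could be written down in code roughly as follows (a minimal sketch in Python; the variable name and encoding are my own, not part of the lecture):

```python
# The toy CFG from this slide: each rule pairs a left-hand side
# with one possible right-hand-side expansion.
CFG_RULES = [
    ("S",  ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("NNP",)),
    ("DT", ("the",)),
    ("NN", ("cat",)),
    ("NN", ("mat",)),
    ("VP", ("V",)),
    ("VP", ("V", "PP")),
    ("V",  ("sat",)),
    ("PP", ("IN", "NP")),
    ("IN", ("on",)),
]
```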

Bottom-up Recognizer for CFG

Relies on a data structure called the chart
- Like the trellis for HMMs

[Figure: the empty chart for "the cat sat on the mat": rows indexed by span length (1 to 6), columns by start position (@1 to @6)]

Step 1: terminal rules

[Figure: chart after applying terminal rules. Length-1 cells: the→D, cat→N, sat→V and VP, on→P, the→D, mat→N]
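In code, the chart can be a simple map from (start, length) spans to the nonterminals found there; here is a minimal sketch of the terminal-rule step, using the D/N/V/P tags shown in the chart figures (the LEXICON table and function names are illustrative assumptions):

```python
from collections import defaultdict

# Hypothetical lexical table: word -> nonterminals that can cover it.
# "sat" gets both V and VP because of the unary rule VP -> V.
LEXICON = {
    "the": {"D"}, "cat": {"N"}, "mat": {"N"},
    "sat": {"V", "VP"},
    "on": {"P"},
}

def init_chart(words):
    """Chart cell (start, length) -> set of nonterminals for that span.
    Starts are 1-based, as on the slides."""
    chart = defaultdict(set)
    for i, word in enumerate(words, start=1):
        chart[(i, 1)] |= LEXICON.get(word, set())
    return chart

chart = init_chart("the cat sat on the mat".split())
print(chart[(3, 1)])  # {'V', 'VP'} for "sat"
```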

Next step

[Figure: chart as above, plus NP in the length-2 cell starting at @1]

NP (length 2) from 1-2 because NP → D N, with D from 1-1 and N from 2-2

A few steps forward

[Figure: chart now also contains NP (length 2) at @1, NP (length 2) at @5, and S (length 3) at @1]

S (length 3) from 1-3 because S → NP VP, with NP from 1-2 and VP from 3-3

Probabilistic CFGs (PCFGs)

Each rule gets a probability P(right side | left side)
- Since these are conditional, they sum to 1 for each LHS

S → NP VP    1
NP → DT NN   .9
NP → NNP     .1
DT → the     1
NN → cat     .5
NN → mat     .5
VP → V       .4
VP → V PP    .6
V → sat      1
PP → IN NP   1
IN → on      1

The model

A tree T is made by repeatedly applying rules:
- The probability of a tree T and string W is the product of all the rule probs
  - These include rules that insert nonterminals
  - And rules that insert words

P(T, W_1:n) = ∏_{A → B C ∈ T} P(B C | A) × ∏_{A → w ∈ T} P(w_i | A)

(S (NP (D the) (N cat)) (VP (V sat)))

P(NP VP | S) P(D N | NP) P(the | D) P(cat | N) P(V | VP) P(sat | V)
= 1 × .9 × 1 × .5 × .4 × 1 = .18
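The product above is easy to compute recursively over the tree; here is a minimal sketch using the D/N tags of this example (the nested-tuple tree encoding and names are my own):

```python
# Rule probabilities P(RHS | LHS) needed for this example tree.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("D", "N")): 0.9,
    ("D", ("the",)): 1.0, ("N", ("cat",)): 0.5,
    ("VP", ("V",)): 0.4, ("V", ("sat",)): 1.0,
}

def tree_prob(tree):
    """Tree is a nested tuple (label, child, ...); leaves are plain strings."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p

t = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "sat")))
print(tree_prob(t))  # 1 * .9 * 1 * .5 * .4 * 1 = 0.18
```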

CKY algorithm with numbers

[Figure: chart with the length-1 cells filled in: the→D: 1, cat→N: .5, sat→V: 1 and VP: .4, on→P: 1, the→D: 1, mat→N: 1]

Each word's cell gets the rule probability P(word | NT); for "sat", VP: .4 comes from P(V | VP) × P(sat | V)

CKY algorithm with numbers II

[Figure: chart as above, plus NP: .45 in the length-2 cell starting at @1]

When we apply a rule, we multiply the rule prob and the probs of the component parts from the chart
- P(NP → D N) = .9, P(D at 1-1) = 1 and P(N at 2-2) = .5, so NP at 1-2 gets .9 × 1 × .5 = .45

More formally

CKY: the Cocke-Kasami-Younger algorithm
A max-product algorithm for PCFGs
- It is a dynamic program
  - Quite similar to Viterbi for HMMs
- Works for grammars with binary rules
  - Discuss unary and n-ary rules in a second
- Also has a sum-product version
  - Which computes P(W_1:n)

Definition

We operate by computing µ_A(i, l), the best way to parse a nonterminal of type A of length l starting at position i:

µ_A(i, l) = max_{tree T} P(W_i:i+l, T, root of T = A)

- To do so, we must apply a rule to make A from some things B, C
- These things must cover the span (i, l)
  - We divide the span into (i, j) and (i + j, l - j)
  - For instance, span (1, 4) can be split into (1, 2) and (3, 2), or into (1, 3) and (4, 1)
- We have to search over potential split points j

µ_A(i, l) = max_{0 < j < l, A → B C} P(A → B C) · µ_B(i, j) · µ_C(i + j, l - j)
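The recursion above maps almost line-for-line onto code; here is a minimal max-product CKY sketch, assuming a binarized grammar with unary rules already folded into the lexical probabilities (table formats and names are my own, not the lecture's):

```python
def cky(words, lex_prob, bin_prob):
    """Max-product CKY for a binarized PCFG.
    lex_prob: dict (A, word) -> P(word | A)
    bin_prob: dict (A, B, C) -> P(B C | A)
    Returns mu: dict (start, length, A) -> best probability, with 1-based starts.
    """
    n = len(words)
    mu = {}
    # Length-1 spans: terminal rules.
    for i, w in enumerate(words, start=1):
        for (A, word), p in lex_prob.items():
            if word == w and p > mu.get((i, 1, A), 0.0):
                mu[(i, 1, A)] = p
    # Longer spans: combine (i, j) and (i + j, l - j) with a binary rule.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            for j in range(1, l):
                for (A, B, C), p in bin_prob.items():
                    score = (p * mu.get((i, j, B), 0.0)
                               * mu.get((i + j, l - j, C), 0.0))
                    if score > mu.get((i, l, A), 0.0):
                        mu[(i, l, A)] = score
    return mu
```

For the example sentence, mu[(1, 6, "S")] would then hold µ_S(1, n), the quantity on the next slide; keeping backpointers alongside mu would recover the best tree itself.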

The solution

As with HMMs, notice that the last chart entry is the answer:

µ_S(1, n) = max_{tree T} P(W_1:n, T, root of T = S)

Joint probability of the sentence and the best tree rooted at S
- The conditional is proportional to the joint
  - By a factor of P(W_1:n)

Efficiency

For an n-word sentence and G grammar rules, CKY runs in O(G n³) time
- The chart is (less than) n² in size
- To fill in cell i-k, search for split point j
  - No more than n places to look
  - Must look once per each rule

Binarizing the grammar

CKY searches for a single split point:
- So we need the grammar to be binary
  - Each rule has (one or) two children
- Unary (one-child) rules also cause problems, but see Charniak for these
- There is an equivalent binary grammar for any CFG
  - Chomsky Normal Form

(NP (DT an) (NNP Oct.) (CD 19) (NN review))

becomes

(NP (DT an)
    (NNP-CD-NN (NNP Oct.)
               (CD-NN (CD 19) (NN review))))
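The NP example above can be produced mechanically; here is a rough sketch of one such right-branching binarization, where the intermediate symbols simply join the remaining child labels (the naming convention mirrors the example but is otherwise my choice):

```python
def binarize_rule(lhs, rhs):
    """Turn an n-ary rule lhs -> rhs into a chain of binary rules by
    introducing intermediate symbols, e.g.
    NP -> DT NNP CD NN  becomes
    NP -> DT NNP-CD-NN,  NNP-CD-NN -> NNP CD-NN,  CD-NN -> CD NN."""
    rules = []
    while len(rhs) > 2:
        rest = "-".join(rhs[1:])               # symbol for the remaining children
        rules.append((lhs, (rhs[0], rest)))
        lhs, rhs = rest, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules

print(binarize_rule("NP", ["DT", "NNP", "CD", "NN"]))
# [('NP', ('DT', 'NNP-CD-NN')), ('NNP-CD-NN', ('NNP', 'CD-NN')), ('CD-NN', ('CD', 'NN'))]
```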

Building a lousy parser

We now know how to build a parser:
- Take the Penn Treebank, convert to Chomsky Normal (binary) form
- Estimate rule probs by smoothed max likelihood:

P̂(A → B C) = (#(A → B C) + λ) / (#(A → •) + |B, C| λ)

- Parse with CKY

This parser isn't very good
- 70% F-score... modern parsers >90%

Why not?
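Here is a rough sketch of the smoothed relative-frequency estimate above, given rule counts read off the binarized treebank; the |B, C| term is approximated by the number of distinct expansions observed for each left-hand side, and the count format and λ value are placeholders:

```python
from collections import Counter

def estimate_rule_probs(rule_counts, lam=0.1):
    """rule_counts: Counter of (lhs, rhs) pairs observed in the treebank.
    Returns add-lambda smoothed estimates of P(rhs | lhs)."""
    lhs_totals = Counter()
    expansions = {}
    for (lhs, rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
        expansions.setdefault(lhs, set()).add(rhs)
    probs = {}
    for (lhs, rhs), c in rule_counts.items():
        denom = lhs_totals[lhs] + lam * len(expansions[lhs])
        probs[(lhs, rhs)] = (c + lam) / denom
    return probs
```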

Stupid parser assumptions

This parser makes stupid independence assumptions:
- Decision to produce NT independent of expansion of NT
  - E.g., "expects [S sales to increase]" (non-finite clause)
  - vs. "hopes [S sales will increase]" (finite clause)
  - Or "a [NN chair]" (count)
  - vs. "*a [NN pork]" (mass)
- No subcategories, feature checking, etc.

On the other hand, it's too restrictive:
- Probs of "[ADJP very good]", "[ADJP very very good]", "[ADJP very very very good]" unrelated
  - Depend on very specific training counts
- No concept of adjunction, productive processes, etc.

Making the grammar more specific

Capture subcategorization effects by annotating heads and parent categories:
- (S (NP (DT the) (JJ recent) (NNS sale)) ...)
- (NP^S-NNS (DT the) (JJ recent) (NNS sale))
- "Vertical Markovization"

(S (NP^S-NNS (DT the) (JJ recent) (NNS sale)) ...)

- NP^S-NNP (subject proper NP)
- vs. NP^PP-PRO (pronoun in prep. phrase)
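Parent annotation itself is a small tree transform; here is a minimal sketch over nested-tuple trees, annotating every nonterminal with its parent's category (head-tag annotation is left out to keep it short):

```python
def annotate_parents(tree, parent=None):
    """Second-order "vertical Markovization": relabel X as X^PARENT.
    Trees are (label, child, ...) tuples; leaves are plain word strings."""
    if isinstance(tree, str):                  # a word: leave it alone
        return tree
    label, children = tree[0], tree[1:]
    new_label = f"{label}^{parent}" if parent else label
    return (new_label,) + tuple(annotate_parents(c, label) for c in children)

t = ("S", ("NP", ("DT", "the"), ("JJ", "recent"), ("NNS", "sale")))
print(annotate_parents(t))
# ('S', ('NP^S', ('DT^NP', 'the'), ('JJ^NP', 'recent'), ('NNS^NP', 'sale')))
```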

Making the grammar less specific

Use X-bar-like structure to allow adjunction:
- "Horizontal Markovization"

(NP-NNS
  (@N-+DT-NNS (DT the)
              (@N-+JJ-NNS (JJ recent)
                          (@N-NNS+PP (@N-NNS (NNS sale))
                                     (PP-IN of shares)))))

- Any @N symbol can produce adjuncts
- The "+CONTEXT" version represents modifier orders

Lexicalized grammars

Sometimes useful to incorporate words into the grammar rules directly:
- Verbs subcategorize for prepositions
- Collocations
- Crude semantics

For example, might have:

(VP-VB-deliver (VB deliver)
               (PP-TO-to (TO to)
                         (NP-NN-store the store)))

Uses grammar rule VP-VB-deliver → VB PP-TO-to
- Checks if "deliver" takes "to" as preposition

Smoothing lexicalized parsers

Lexicalized parsers have very large grammars:
- Grammar is never written out explicitly
- Rules are made up when needed
- Probability of a rule is hard to estimate
  - A rule like "VP-deliver → VB PP-to" is rare

Typical approach: interpolated estimation (like LM)
- Use the PCFG without words as backoff

P̂(VP-VB-deliver → VB PP-TO-to) ∝ #(VP-VB-deliver → VB PP-TO-to) + β P(VP → VB PP)

Exact backoff strategy depends on which parser you look at
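In sketch form, that interpolation is just a count plus a weighted backoff probability; the encoding of lexicalized rules below and the β value are illustrative assumptions, and the score is unnormalized (hence the ∝ above):

```python
def lexicalized_rule_score(lex_rule, lex_counts, unlex_probs, beta=5.0):
    """Unnormalized score:  #(lexicalized rule) + beta * P(delexicalized rule).
    lex_rule example (hypothetical encoding):
      (("VP", "deliver"), (("VB", "deliver"), ("PP", "to")))
    """
    lhs, rhs = lex_rule
    skeleton = (lhs[0], tuple(cat for cat, word in rhs))   # strip the words
    return lex_counts.get(lex_rule, 0) + beta * unlex_probs.get(skeleton, 0.0)
```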

Discriminative methods

- Like an HMM, a PCFG is a generative model
- Obvious question: is there a discriminative (CRF-like) model?
  - Early answer: yes, but it takes weeks to train
- And the features we want to use often violate the Markov property
  - For instance, constructions
  - Checking agreement across multiple nodes
  - ...in which case you can't train at all!

Reranking

A combined discriminative/generative system:
- Print out the top 100 parses
- Use max-ent to pick the best one (using arbitrary features)
  - Not necessarily the one with the highest original prob
- Max-ent features can violate the Markov property
  - Because the label set size is reduced to 100
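A reranker of this kind can be sketched very compactly, assuming the base parser already returns its top candidates and we have a learned log-linear feature weight vector (both hypothetical here):

```python
def rerank(candidates, extract_features, weights):
    """candidates: list of (tree, base_log_prob) pairs from the generative parser.
    extract_features may use arbitrary, non-local features of the whole tree,
    since only ~100 candidates need to be scored."""
    def score(item):
        tree, base_log_prob = item
        feats = extract_features(tree)
        return base_log_prob + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=score)
```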

Parsing results

- Naive PCFG from Treebank: 70%
- Markovization, lexicalize a few prep/comp/conj: 87% (short sentences)
  - Klein+Manning 2003
- Fully lexicalized CFG with backoff smoothing: 87-88% (short sentences)
  - Charniak 1997, Collins 1999
- Reranking with large max-ent feature set: 90-91% (all sentences)
  - Collins 2000, Charniak+Johnson 2005
- Further improvements to about 92.5%
  - Notably the Petrov+Klein Berkeley parser, which we will discuss later

So far

- Recognizer/parser for PCFGs via dynamic programming
- A modified CFG is capable of parsing accurately
- Smoothing and reranking allow use of many feature types

Next time?
- Other grammar formalisms?
- Incremental algorithms (more cognitively plausible)