
Introduction to NLP

Syntax and Parsing (Part II)

Prof. Reut Tsarfaty, Bar Ilan University

November 24, 2020

Introducing the Parsing Problem

Previously: What is a Parse Tree?
- Hierarchical
- Phrase-Structure (all labeled spans)
- Dependency-Structure (all lexical relations)

Today: The Parsing Task

Sentence, Grammar → Parser → Parse-Tree

- The Parsing Objective
- Training: Learn the grammar
- Prediction: Find the best tree

Previously on NLP@BIU

Formal Grammar
A finite generative device that allows us to generate all & only the sentences in a language.

Definition: A formal grammar G is a tuple

G = ⟨T, N, S, R⟩

- T is a finite set of terminal symbols (alphabet)
- N is a finite set of non-terminal symbols
- S ∈ N is the start symbol
- R is a finite set of rules

Formal Languages Hierarchy

Formal Languages Hierarchy (Chomsky 1959)
- a, b, ... ∈ T
- A, B, ... ∈ N
- α, β, γ ∈ (N ∪ T)*

Language            Rules
Regular             A → aB, A → Ba
Context-Free        A → α
Context-Sensitive   βAγ → βαγ
Unrestricted        α → β

Natural Language is at least Context-Free!

Context-Free Grammars

A Context-Free Grammar
A context-free grammar G is a tuple

G = ⟨T, N, S, R⟩

- T is a finite set of terminal symbols
- N is a finite set of non-terminal symbols
- S ∈ N is the start symbol
- R is a finite set of context-free rules:

R = {A → α | A ∈ N and α ∈ (N ∪ T)*}

Interpretation: A → α ∈ R rewrites A into α, independently of A's context.

Context-Free Grammars

Example: A Toy Context-Free Grammar for English

G:
T = {sleeps, saw, man, woman, dog, with, in, the}
N = {S, NP, VP, PP, V0, V1, NN, DT, IN}
S ∈ N
R:
  S → NP VP
  VP → V0
  VP → V1 NP
  NP → DT NN
  NP → NP PP
  PP → IN NP
  V0 → sleeps
  V1 → saw
  NN → man
  NN → woman
  NN → dog
  DT → the
  IN → with
  IN → in

What Can We Do with Context-Free Grammars?

Context-Free Derivation
- s1 = S                  // applying S → NP VP
- s2 = NP VP              // applying NP → DT NN
- s3 = DT NN VP           // applying DT → the
- s4 = the NN VP          // applying NN → man
- s5 = the man VP         // applying VP → V0
- s6 = the man V0         // applying V0 → sleeps
- s7 = the man sleeps

We say that s1...s7 derives the sentence "the man sleeps".

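The derivation above is mechanical enough to script. Below is a minimal sketch (not part of the lecture) that encodes the toy grammar as a Python dictionary and replays the left-most derivation of "the man sleeps"; the rule table and helper names are illustrative assumptions.

```python
# A minimal sketch (not from the slides): the toy CFG as a Python dict,
# plus the left-most derivation of "the man sleeps" replayed step by step.
RULES = {
    "S":  [["NP", "VP"]],
    "VP": [["V0"], ["V1", "NP"]],
    "NP": [["DT", "NN"], ["NP", "PP"]],
    "PP": [["IN", "NP"]],
    "V0": [["sleeps"]], "V1": [["saw"]],
    "NN": [["man"], ["woman"], ["dog"]],
    "DT": [["the"]], "IN": [["with"], ["in"]],
}
NONTERMINALS = set(RULES)

def apply_leftmost(form, rhs):
    """Rewrite the left-most non-terminal in the sentential form with the given RHS."""
    for i, sym in enumerate(form):
        if sym in NONTERMINALS:
            return form[:i] + rhs + form[i + 1:]
    return form  # no non-terminal left: the form is already a sentence

# The derivation s1..s7 from the slide, as a sequence of chosen right-hand sides.
choices = [["NP", "VP"], ["DT", "NN"], ["the"], ["man"], ["V0"], ["sleeps"]]
form = ["S"]
print(form)                      # s1
for rhs in choices:
    form = apply_leftmost(form, rhs)
    print(form)                  # s2 ... s7; ends with ['the', 'man', 'sleeps']
```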

Left-Most Derivations and Parse-Trees

Every left-most context-free derivation corresponds to a single parse-tree:

- s1 = S
- s2 = NP VP
- s3 = DT NN VP
- s4 = the NN VP
- s5 = the man VP
- s6 = the man V0
- s7 = the man sleeps

Resulting parse-tree: (S (NP (DT the) (NN man)) (VP (V0 sleeps)))


Left-Most Derivations and Parse-Trees

Derivation of a String
We say that G derives s and write G ⇒* s iff there exists a derivation S ⇒* s.

Derivation of a Tree
We say that G generates tree t and write G ⇒* t iff there exists a derivation S ⇒* s that corresponds to t.

Generative Capacity
- Weak Generative Capacity: LG = {s | G ⇒* s}
- Strong Generative Capacity: TG = {t | G ⇒* t}

Coming Up: How do we parse using a CFG?

Statistical Parsing

→ The Statistical Parsing Objective
- Probabilistic Context-Free Grammars (PCFGs)
- Search with PCFGs
- Training with PCFGs

Strings and Parse-Trees

The Yield Function
A function from trees to the sequence of terminals at the leaves:

Y : TG → LG

String s is Grammatical iff ∃t1 such that G ⇒* t1 and Y(t1) = s.

String s is Ambiguous iff ∃t1, t2 (t1 ≠ t2) such that G ⇒* t1, G ⇒* t2 and Y(t1) = Y(t2) = s.

Example: two parse-trees that yield "time flies fast":
(S (NP (NN time)) (VP (VB flies) (RB fast)))
(S (NP (NN time) (NN flies)) (VP (VB fast)))

The number of possible binary trees over a sentence grows with the Catalan numbers:
http://en.wikipedia.org/wiki/Catalan_number


The Parsing Problem: Our Objective Function

Deriving the objective function:

t* = argmax_{t | Y(t)=x} P(t | x)
   = argmax_{t | Y(t)=x} P(t, x) / P(x)    // P(x) is constant for a fixed sentence x
   = argmax_{t | Y(t)=x} P(t, x)
   = argmax_{t | Y(t)=x} P(t)              // since Y(t) = x, the pair (t, x) is determined by t


The Parsing Problem: Our Objective

- We will assign probabilities P(t) to parse-trees such that:

  ∀t : G ⇒* t : P(t) > 0, and Σ_{t : G ⇒* t} P(t) = 1

- The probability P(t) ranks parse-trees:

  t* = argmax_{t | G ⇒* t} P(t)

  This is the objective function of the parsing problem.
- Key Challenges: How do we find t*? How do we calculate P(t)?


Coming Up Next:

✓ The Statistical Parsing Objective
→ Probabilistic Context-Free Grammars (PCFGs)
- Search with PCFGs
- Training with PCFGs

Probabilistic Context-Free Grammars

Definition
A probabilistic context-free grammar G is a tuple

G = ⟨T, N, S, R, P⟩

Where:
- ⟨T, N, S, R⟩ is a CFG
- P is a function assigning probabilities to rules, P : R → (0, 1], such that for every A ∈ N:

  Σ_{α : A → α ∈ R} P(A → α) = 1

Interpretation: Given A, it rewrites into (or emits) α with probability P(A → α), independently of context. Notation: P(A → α) = P(α | A).

Probabilistic Context-Free Grammars

The probability of a tree
- Given a probabilistic context-free grammar G = ⟨T, N, S, R, P⟩
- Given a parse-tree t composed of k rule applications A1 → α1, A2 → α2, ..., Ak → αk ∈ R
- We define:

  P(t) = P(A1 → α1) × ... × P(Ak → αk) = ∏_{i=1}^{k} P(Ai → αi)

given the independence assumptions of the CFG.


Probabilistic Context-Free Grammars: Example

A Toy Grammar for English

G:
T = {sleeps, saw, man, woman, dog, with, in, the}
N = {S, NP, VP, PP, V0, V1, NN, DT, IN}
S ∈ N
R = Rgram ∪ Rlex
P:
  S → NP VP      1.0
  VP → V0        0.3
  VP → V1 NP     0.7
  NP → DT NN     0.6
  NP → NP PP     0.4
  PP → IN NP     1.0
  V0 → sleeps    1.0
  V1 → saw       1.0
  NN → man       0.2
  NN → woman     0.2
  NN → dog       0.6
  DT → the       1.0
  IN → with      0.6
  IN → in        0.4

Probabilistic Context-Free Grammars: Example

Derivation Probability
- s1 = S                   1.0
- s2 = NP VP               × P(S → NP VP)
- s3 = DT NN VP            × P(NP → DT NN)
- s4 = the NN VP           × P(DT → the)
- s5 = the man VP          × P(NN → man)
- s6 = the man V0          × P(VP → V0)
- s7 = the man sleeps      × P(V0 → sleeps)

The PCFG generates a tree for "the man sleeps" with probability:

1.0 × 0.6 × 1.0 × 0.2 × 0.3 × 1.0 = 0.036


Probabilistic Context-Free Grammars: Example

Probability of the derivation == Probability of the tree:

p( (S (NP (DT the) (NN man)) (VP (V0 sleeps))) ) = 1.0 × 0.6 × 1.0 × 0.2 × 0.3 × 1.0 = 0.036

Probability of the string =? Probability of the tree
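To make the tree-probability computation concrete, here is a minimal sketch (not from the slides) that stores the toy PCFG as a rule-to-probability table and multiplies the probabilities of the rules used in the example tree; the tuple-based tree encoding and helper names are assumptions for illustration.

```python
# A minimal sketch: P(t) as the product of the probabilities of the rules
# used in the tree, under the toy PCFG defined above.
from math import prod

RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0, ("VP", ("V0",)): 0.3, ("VP", ("V1", "NP")): 0.7,
    ("NP", ("DT", "NN")): 0.6, ("NP", ("NP", "PP")): 0.4, ("PP", ("IN", "NP")): 1.0,
    ("V0", ("sleeps",)): 1.0, ("V1", ("saw",)): 1.0,
    ("NN", ("man",)): 0.2, ("NN", ("woman",)): 0.2, ("NN", ("dog",)): 0.6,
    ("DT", ("the",)): 1.0, ("IN", ("with",)): 0.6, ("IN", ("in",)): 0.4,
}

# Trees as nested tuples: (label, child, child, ...); leaves are plain strings.
tree = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", ("V0", "sleeps")))

def rules_of(t):
    """Yield the (LHS, RHS) rule applications that build tree t."""
    label, *children = t
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)

p_tree = prod(RULE_PROBS[r] for r in rules_of(tree))
print(p_tree)  # 1.0 * 0.6 * 1.0 * 0.2 * 0.3 * 1.0 = 0.036
```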

Probabilistic Context-Free Grammars

The Probability of a String
Given a probabilistic context-free grammar G = ⟨T, N, S, R, P⟩ and a string s yielded by (possibly several) different trees {t | Y(t) = s}:

p(s) = Σ_{t : G ⇒* t, Y(t) = s} p(t)

Note that p(s) defines a language model. It can be formally shown that

Σ_{s ∈ L(G)} p(s) = 1


Coming Up Next:

✓ Context-Free Grammars (CFGs)
✓ Probabilistic Context-Free Grammars (PCFGs)
→ Search with PCFGs
- Training PCFGs

Assume a Sentence s and a CFG G

What Questions Can We Answer?
- Is s derived by G?
- What are all the trees t of s derived by G?
- What is the best tree t for s derived by G?

Fortunately, they are all answered by the same algorithm!


The CYK Algorithm

demo deck

The CYK Algorithm

Observation
The probability of a tree rooted in X is made up of three terms:
- The probability of the rule X → Y Z
- The probability of the subtree rooted at Y
- The probability of the subtree rooted at Z

[Diagram: X → Y Z, where Y spans x_i...x_s and Z spans x_{s+1}...x_j]

The Dynamic Engine

p(i, j, X) = P(X → Y Z) × p(i, s, Y) × p(s+1, j, Z)

The CKY Algorithm (Probabilistic Version)

- Input: sentence S = x1...xn, PCFG G = ⟨T, N, S, R, P⟩

- Initialization: // lexical rules
  For i ∈ 1...n, for X ∈ N:
    if X → xi ∈ R then CKY(i−1, i, X) = P(X → xi)  (otherwise 0)

- Algorithm: // grammar rules
  For len = 2...n
    For i = 0...(n − len)
      Set j = i + len
      For all X ∈ N
        CKY(i, j, X) = max_{X→YZ ∈ R, s ∈ (i+1)...(j−1)} P(X → YZ) × CKY(i, s, Y) × CKY(s, j, Z)
        BP(i, j, X)  = argmax_{X→YZ ∈ R, s ∈ (i+1)...(j−1)} P(X → YZ) × CKY(i, s, Y) × CKY(s, j, Z)

The CYK Algorithm

- Output:
  - Recognition: if CKY(0, n, S) > 0, return true
  - Probability: CKY(0, n, S)
  - Tree: follow the back-pointers from BP(0, n, S)
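The pseudocode above translates almost directly into code. The following is a minimal sketch of probabilistic CKY, assuming a PCFG already in Chomsky Normal Form; the tiny rule tables and function names are illustrative assumptions, not the lecture's grammar.

```python
# A minimal sketch of probabilistic CKY over a CNF PCFG.
from collections import defaultdict

# Binary rules X -> Y Z and lexical rules X -> word, each with a probability.
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0}
LEXICAL = {("DT", "the"): 1.0, ("NN", "man"): 1.0, ("VP", "sleeps"): 1.0}

def cky(words):
    n = len(words)
    chart = defaultdict(float)   # (i, j, X) -> best probability of X over span (i, j)
    back = {}                    # (i, j, X) -> (split, Y, Z) or word, for tree recovery

    # Initialization with lexical rules: spans of length 1.
    for i, w in enumerate(words):
        for (X, word), p in LEXICAL.items():
            if word == w:
                chart[i, i + 1, X] = p
                back[i, i + 1, X] = w

    # Dynamic program over increasing span lengths.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for (X, Y, Z), p_rule in BINARY.items():
                for s in range(i + 1, j):
                    p = p_rule * chart[i, s, Y] * chart[s, j, Z]
                    if p > chart[i, j, X]:
                        chart[i, j, X] = p
                        back[i, j, X] = (s, Y, Z)
    return chart, back

def build_tree(back, i, j, X):
    """Recover the best tree for (i, j, X) from the back-pointers."""
    entry = back[i, j, X]
    if isinstance(entry, str):
        return (X, entry)
    s, Y, Z = entry
    return (X, build_tree(back, i, s, Y), build_tree(back, s, j, Z))

chart, back = cky(["the", "man", "sleeps"])
print(chart[0, 3, "S"])            # probability of the best S over the whole sentence
print(build_tree(back, 0, 3, "S"))
```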


The ParseEval Evaluation Metrics

Assume the following trees. How can we compare them?

Gold tree (g): (A (B (C w1) (D w2)) (E (F w3)))
Labeled spans: (0, A, 3), (0, B, 2), (0, C, 1), (1, D, 2), (2, E, 3), (2, F, 3)

Hypothesized tree (h): (A (C w1) (B (D w2) (E w3)))
Labeled spans: (0, A, 3), (1, B, 3), (0, C, 1), (1, D, 2), (2, E, 3)

Spans shared by both trees: (0, A, 3), (0, C, 1), (1, D, 2), (2, E, 3)

→ Recall = 4/6
→ Precision = 4/5

The ParseEval Evaluation Metrics
Let Tg be the set of tuples of the gold tree.
Let Th be the set of tuples of the hypothesized tree.
Let Tintersect = Tg ∩ Th be their set intersection.

Precision

P = |Tintersect| / |Th|

Recall

R = |Tintersect| / |Tg|

F1

F1 = (2 × P × R) / (P + R)
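A minimal sketch of these metrics on the example trees above (the span sets are the ones listed earlier; the function name is an assumption, and this is not the official evalb implementation):

```python
# Labeled-span precision, recall, and F1 for the gold/hypothesized example trees.
gold = {(0, "A", 3), (0, "B", 2), (0, "C", 1), (1, "D", 2), (2, "E", 3), (2, "F", 3)}
hypo = {(0, "A", 3), (1, "B", 3), (0, "C", 1), (1, "D", 2), (2, "E", 3)}

def parseval(gold_spans, hypo_spans):
    intersect = gold_spans & hypo_spans
    precision = len(intersect) / len(hypo_spans)
    recall = len(intersect) / len(gold_spans)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(parseval(gold, hypo))  # (0.8, 0.666..., 0.727...) = (4/5, 4/6, their harmonic mean)
```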

The ParseEval Evaluation Metrics

The Standard Practice: Discard the Root and the POS Tags

Gold tree (g): (A (B (C w1) (D w2)) (E (F w3)))
→ remaining spans: (0, B, 2), (2, E, 3)

Hypothesized tree (h): (A (C w1) (B (D w2) (E w3)))
→ remaining spans: (1, B, 3)

→ Recall = 0/2, Precision = 0/1

Probabilistic Context-Free Grammars

Where do the rules/probabilities come from?

- Make them up ourselves
- Ask genius linguists to write rules
- Ask smart linguistics grad-students to annotate trees
- And then, easy: read off a PCFG from the trees


Treebank Grammars

A Treebank
- A set of sentences annotated with their correct parse trees

Examples
- English: the WSJ Treebank for English (40K sentences)
- Hebrew: the Haaretz Treebank (6.5K sentences)
- Swedish: the TalBanken Treebank (5K sentences)

Also Arabic, Basque, Chinese, French, Czech, German, Hindi, Hungarian, Korean, Portuguese, Spanish, and more...

Treebank Grammars

Constituency Treebanks: Bracketed Format

( (S (NP (DT the) (NN man)) (VP (VB sleeps))))

Treebank Grammars

Training
- T: all observed terminals
- N: all observed non-terminals
- R: all observed rules in the derivations
- P: Maximum Likelihood Estimation (MLE). We define a count function over the corpus,

  Count : R → ℕ

  and calculate the relative frequencies:

  P̂(A → α) = Count(A → α) / Σ_γ Count(A → γ) = Count(A → α) / Count(A)


Treebank Grammars

Training Example

Treebank tree: (S (NP (NNP John)) (VP (VB likes) (NP (NNP Mary))))

Rules read off, with their MLE probabilities:
S → NP VP (1)
VP → VB NP (1)
NP → NNP (1)
NNP → Mary (0.5)
NNP → John (0.5)
VB → likes (1)

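Reading off a PCFG from the trees is just relative-frequency counting. A minimal sketch over the one-tree treebank above (the nested-tuple tree encoding and helper names are assumptions for illustration):

```python
# Read off a PCFG from a tiny "treebank" by relative-frequency (MLE) estimation.
from collections import Counter

# One-tree treebank, as nested tuples: (label, child, ...); leaves are strings.
treebank = [
    ("S", ("NP", ("NNP", "John")),
          ("VP", ("VB", "likes"), ("NP", ("NNP", "Mary")))),
]

def rules_of(t):
    """Yield the (LHS, RHS) rule applications used in tree t."""
    label, *children = t
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)

rule_counts = Counter(r for tree in treebank for r in rules_of(tree))
lhs_counts = Counter()
for (lhs, rhs), c in rule_counts.items():
    lhs_counts[lhs] += c

pcfg = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  ({p})")
# NNP -> John (0.5), NNP -> Mary (0.5), and every other observed rule with probability 1.0
```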

How Good are Treebank Grammars?

Charniak 1996: A CYK statistical parser based on a treebank grammar trained on the WSJ Penn Treebank scored F = 75 on an unseen test set.

Limitations of Treebank Grammars

Challenges with “Vanilla PCFGs”
- Challenge: Independence assumptions are too strong
- Challenge: Independence assumptions are too weak
- Challenge: Lack of sensitivity to lexical information

(1) Independence Assumptions are too Strong

Problem (1): Locality

Observed tree: (S (NP he) (VP (V likes) (NP her)))

Grammar read off:
S → NP VP (1)
VP → V NP (1)
NP → he (1/2)
NP → her (1/2)
V → likes (1)

⇒ the grammar also generates: (S (NP her) (VP (V likes) (NP he)))

(1) Independence Assumptions are too Strong

Solution (1): Vertical Markovization (v = 0, 1, 2, ...)

Parent-annotated tree: (S (NP@S he) (VP@S (V@VP likes) (NP@VP her)))

S → NP@S VP@S (1)
VP@S → V@VP NP@VP (1)
NP@S → he (1)
NP@VP → her (1)
V@VP → likes (1)

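Vertical markovization is a simple tree transformation: relabel each non-terminal with its parent's category before reading off the grammar. A minimal sketch for v = 1 (the tree encoding and function name are assumptions):

```python
# First-order vertical markovization: annotate each non-terminal with its parent.
def parent_annotate(tree, parent=None):
    """Return a copy of the tree with labels rewritten as LABEL@PARENT (v = 1)."""
    if isinstance(tree, str):          # leaf: a word, left unchanged
        return tree
    label, *children = tree
    new_label = f"{label}@{parent}" if parent else label
    return (new_label, *(parent_annotate(c, label) for c in children))

tree = ("S", ("NP", "he"), ("VP", ("V", "likes"), ("NP", "her")))
print(parent_annotate(tree))
# ('S', ('NP@S', 'he'), ('VP@S', ('V@VP', 'likes'), ('NP@VP', 'her')))
```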

(2) Independence Assumptions are too Weak

Problem (2): Specificity

Observed tree: (S (NP john) (VP (RB really) (V likes) (NP mary)))

Grammar read off:
S → NP VP (1)
VP → RB V NP (1)
NP → john (1/2)
NP → mary (1/2)
RB → really (1)
V → likes (1)

⇏ the grammar does not generate: (S (NP john) (VP (RB really) (RB really) (V likes) (NP mary)))

(2) Independence Assumptions are too Weak

Solution (2): Horizontal Markovization (h = 0, 1, 2, ..., ∞)

Binarized tree: (S (NP john) (VP (RB really) (VP (V likes) (NP mary))))

Grammar read off:
S → NP VP (1)
VP → RB VP (1/2)
VP → V NP (1/2)
NP → john (1/2)
NP → mary (1/2)
RB → really (1)
V → likes (1)

⇒ the grammar now also generates: (S (NP john) (VP (RB really) (VP (RB really) (VP (V likes) (NP mary)))))

Two-Dimensional Parameterization
Solution (2): Binarization and Markovization, h = 0

Original flat VP: (VP (V saw) (NN mary) (PP in-class) (PP on-tuesday) (PP last-week))

Binarized, h = 0:
(VP (V saw) (-@VP (NN mary) (-@VP (PP in-class) (-@VP (PP on-tuesday) (-@VP (PP last-week))))))

Two-Dimensional Parameterization
Solution (2): Binarization and Markovization, h = 1

Original flat VP: (VP (V saw) (NN mary) (PP in-class) (PP on-tuesday) (PP last-week))

Binarized, h = 1:
(VP (V saw) (V@VP (NN mary) (NN@VP (PP in-class) (PP@VP (PP on-tuesday) (PP@VP (PP last-week))))))

Two-Dimensional Parameterization
Solution (2): Binarization and Markovization, h = 2

Original flat VP: (VP (V saw) (NN mary) (PP in-class) (PP on-tuesday) (PP last-week))

Binarized, h = 2:
(VP (V saw) (-,V@VP (NN mary) (V,NN@VP (PP in-class) (NN,PP@VP (PP on-tuesday) (PP,PP@VP (PP last-week))))))
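The binarized labels above follow a mechanical scheme: each intermediate node records the last h sibling labels already generated, padded with "-" on the left. A minimal sketch that reproduces the three binarizations, operating on node labels only (the function name and encoding are assumptions):

```python
# Right-binarization of a flat rule with horizontal markovization of order h.
def binarize(lhs, child_labels, h):
    """Return the binarized structure as nested tuples of node labels."""
    def intermediate(i):
        # Intermediate node introduced after the first i children were generated.
        if h == 0:
            history = ["-"]
        else:
            seen = child_labels[max(0, i - h):i]        # last h generated siblings
            history = ["-"] * (h - len(seen)) + seen    # pad on the left
        label = ",".join(history) + "@" + lhs
        rest = child_labels[i:]
        if len(rest) == 1:
            return (label, rest[0])
        return (label, rest[0], intermediate(i + 1))
    if len(child_labels) <= 2:
        return (lhs, *child_labels)
    return (lhs, child_labels[0], intermediate(1))

flat_vp = ["V", "NN", "PP", "PP", "PP"]   # saw mary in-class on-tuesday last-week
for h in (0, 1, 2):
    print(h, binarize("VP", flat_vp, h))
# h=0 uses -@VP throughout; h=1 gives V@VP, NN@VP, PP@VP; h=2 gives -,V@VP, V,NN@VP, NN,PP@VP, PP,PP@VP
```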

Limitations of Treebank PCFGs

Challenges with “Vanilla PCFGs”

✓ Challenge: Independence assumptions are too strong
✓ Challenge: Independence assumptions are too weak
→ Challenge: Lack of sensitivity to lexical information

(3) Lack of sensitivity to Lexical Information

Problem (3): No Lexical Dependencies

Two parses of "John ate pizza with a-fork":
(t1) (S (NP John) (VP (VP (V ate) (NP pizza)) (PP (P with) (NP a-fork))))   [the PP attaches to the VP]
(t2) (S (NP John) (VP (V ate) (NP (NP pizza) (PP (P with) (NP a-fork)))))   [the PP attaches to the NP]

The same structural ambiguity arises for "John ate pizza with olives", where the NP attachment is the correct one; a vanilla PCFG cannot distinguish the two sentences, because its rule probabilities ignore the words.


(3) Lack of sensitivity to Lexical Information

Solution to (3): Lexicalized Grammars

Unlexicalized tree:
(S (NP workers) (VP (VP (V dumped) (NP sacks)) (PP (P into) (NP bins))))

Each non-terminal is annotated with its lexical head, propagated bottom-up from the words:
(S/dumped (NP/workers workers) (VP/dumped (VP/dumped (V/dumped dumped) (NP/sacks sacks)) (PP/bins (P/into into) (NP/bins bins))))


(3) Lack of sensitivity to Lexical Information

Solution to (3): Lexicalized CKY

Extending CKY:
X(head) → Y(head) Z(dep), where Y spans x_i...x_s and Z spans x_{s+1}...x_j

Maximizing CKY[i, j, head, X]:
- For all symbols X
- For all splits s
- For all head possibilities h on the right
- For all head possibilities h on the left

Variations of Phrase Structure Parsing

Challenges with “Vanilla PCFGs”

✓ Challenge: Independence assumptions are too strong
✓ Challenge: Independence assumptions are too weak
✓ Challenge: Lack of sensitivity to lexical information

What's Next... Are there better/other/quicker alternatives?