
Grammar Induction

Regina Barzilay

EECS Department

MIT

November 15, 2004

Unsupervised Grammar Induction

• Goal: Learn syntactic structure from raw text

• Motivation for unsupervised learning:

– Huge amounts of raw text are readily available (compare with the roughly 50K sentences of the Penn Treebank)

– We can easily adapt to new domains and tasks

– Humans can learn language with little supervision

Grammar Induction 1/27

What a Crazy Idea!

• Gold (1967): Negative learnability results

– No superfinite class of languages is learnable without negative examples

• Chomsky (1980): The poverty of stimulus

– Children are born with a hard-wired language acquisition device (LAD)

– Free parameters of the LAD are set through interaction with other humans

Grammar Induction 2/27

Response on Negative Learnability

• If texts are generated by some stochastic process, language acquisition can be guided by observed frequencies

– Constructions that occur with very low frequency serve as pseudo-negative examples

• Horning (1969): stochastic context-free grammars are learnable from positive examples alone

Grammar Induction 3/27

Response on Poverty of the Stimulus

• There’s a lot of stimulus: Baayen’s estimate of 200 million words by adulthood

• Language acquisition is emergent result of the interaction between cognition, learning abilities, and the structure of the linguistic data the child receives (Jackendoff, 2002)

Grammar Induction 4/27

Learning Grammars

• A discrete structure (the topology of the HMM, the context-free backbone of a PCFG)

• A set of continuous parameters which determine the probability of the sentences described by the grammar

Grammar Induction 5/27
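To make the two components above concrete, here is a minimal sketch (not from the lecture) of one way to represent them in code: a set of rules as the discrete backbone and a dictionary of rule probabilities as the continuous parameters. The Rule and PCFG classes and the toy rules are illustrative assumptions.

```python
# Minimal illustrative representation: a discrete backbone (set of rules)
# plus continuous parameters (one probability per rule).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str          # left-hand-side nonterminal
    rhs: tuple        # right-hand-side sequence of terminals/nonterminals

@dataclass
class PCFG:
    rules: set = field(default_factory=set)    # discrete structure
    probs: dict = field(default_factory=dict)  # continuous parameters

    def add(self, lhs, rhs, p):
        r = Rule(lhs, tuple(rhs))
        self.rules.add(r)
        self.probs[r] = p

g = PCFG()
g.add("S", ["NP", "VP"], 1.0)
g.add("NP", ["Det", "Noun"], 1.0)
```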

Learn PCFGs with EM

• (Lari&Young 1990): Learning PCFGs with EM

– Full binary grammar over n symbols

– Parse randomly at first

– Re-estimate rule probabilities of parses

– Repeat

Grammar Induction 6/27
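A small sketch of the first two steps above, under the assumption that rules are stored as (lhs, rhs) pairs and probabilities in a plain dict: build a full binary grammar over n nonterminal symbols and give every rule a random probability, normalized per left-hand side. The function name and the toy terminals are illustrative; the re-estimation step is spelled out on the following slides.

```python
# Illustrative initialization: every binary rule X -> Y Z over n nonterminal
# symbols, plus unary rules X -> w for each terminal, with random
# probabilities normalized per left-hand side.
import random

def init_full_binary_grammar(n, terminals, seed=0):
    random.seed(seed)
    nonterminals = [f"X{i}" for i in range(n)]
    probs = {}
    for lhs in nonterminals:
        rules = [(lhs, (b, c)) for b in nonterminals for c in nonterminals]
        rules += [(lhs, (w,)) for w in terminals]
        weights = [random.random() for _ in rules]
        z = sum(weights)
        for rule, w in zip(rules, weights):
            probs[rule] = w / z      # rules with the same lhs sum to 1
    return probs

theta = init_full_binary_grammar(5, ["the", "cat", "sat"])
```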

Preliminaries

• R is the set of rules in the context-free grammar

• N is the set of non-terminals in the grammar

• Θ_r for r ∈ R is the parameter for rule r

• Let R(λ) ⊂ R be the rules of the form λ → β for some β

• The parameter space Ω is the set of Θ ∈ [0, 1]^|R| such that for all α ∈ N:

Σ_{r ∈ R(α)} Θ_r = 1

Grammar Induction 7/27
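As a quick illustration of the constraint defining Ω, the following sketch checks that a parameter dictionary sums to one over each R(α). Rules are written as (lhs, rhs) pairs and the toy probabilities are made up.

```python
# Check the constraint above: for every nonterminal alpha, the probabilities
# of rules with alpha on the left-hand side sum to 1.
from collections import defaultdict

theta = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
}

def is_valid_parameter_vector(theta, tol=1e-9):
    totals = defaultdict(float)
    for (lhs, _), p in theta.items():
        if not 0.0 <= p <= 1.0:
            return False
        totals[lhs] += p
    return all(abs(t - 1.0) < tol for t in totals.values())

print(is_valid_parameter_vector(theta))   # True
```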

Preliminaries (cont.)

P(T | Θ) = Π_{r ∈ R} Θ_r^{Count(T, r)}

where Count(T, r) is the number of times rule r is seen in the tree T

log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

Grammar Induction 8/27
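A tiny sketch of the log form above, assuming the tree T is already summarized by its rule counts; the counts and probabilities below are made up.

```python
# Illustrative computation of log P(T | Theta) from rule counts,
# following the formula above.
import math

def log_prob_tree(counts, theta):
    """counts: {rule: Count(T, r)}, theta: {rule: Theta_r}."""
    return sum(c * math.log(theta[r]) for r, c in counts.items())

counts = {("S", ("NP", "VP")): 1, ("NP", ("Det", "Noun")): 2}
theta = {("S", ("NP", "VP")): 1.0, ("NP", ("Det", "Noun")): 0.7}
print(log_prob_tree(counts, theta))
```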

EM for PCFGs

• A PCFG defines a distribution P (S, T |Θ) over tree/sentence pairs (S, T )

• If we had tree/sentence pairs (fully observed data) then

L(Θ) = Σ_i log P(S_i, T_i | Θ)

• If we only have raw sentences S1, . . . , Sn, then trees are hidden variables

L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

Grammar Induction 9/27

EM for PCFGs (cont.)

• Trees are hidden variables:

L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

• The EM algorithm is then Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where

Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)

Grammar Induction 10/27

EM for PCFGs (cont.)

log P(T, S_i | Θ) = Σ_{r ∈ R} Count(S_i, T, r) log Θ_r,

where Count(S, T, r) is the number of times rule r is seen in the sentence/tree pair (S, T)

Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)

= Σ_i Σ_T P(T | S_i, Θ^{t−1}) Σ_{r ∈ R} Count(S_i, T, r) log Θ_r

= Σ_i Σ_{r ∈ R} Count(S_i, r) log Θ_r

where the expected counts are

Count(S_i, r) = Σ_T P(T | S_i, Θ^{t−1}) Count(S_i, T, r)

Grammar Induction 11/27

EM for PCFGs (cont.)

• Solving Θ^{ML} = argmax_{Θ ∈ Ω} Q(Θ, Θ^{t−1}) gives:

Θ_{α → β} = [ Σ_i Count(S_i, α → β) ] / [ Σ_i Σ_{r ∈ R(α)} Count(S_i, r) ]

• Use the Inside-Outside algorithm to compute the expected counts efficiently (see the handout)

Grammar Induction 12/27
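The re-estimation formula above can be sketched as follows. Here `expected_counts(sentence, theta)` is a stand-in for the Inside-Outside computation and is only assumed to return the expected Count(S_i, r) for each rule; everything else follows the formula directly.

```python
# Sketch of the M-step: expected counts are pooled over sentences and
# normalized per left-hand-side nonterminal, as in the formula above.
from collections import defaultdict

def m_step(sentences, theta, expected_counts):
    num = defaultdict(float)      # summed expected counts per rule
    den = defaultdict(float)      # summed expected counts per nonterminal
    for s in sentences:
        for rule, c in expected_counts(s, theta).items():
            num[rule] += c
            den[rule[0]] += c
    return {rule: num[rule] / den[rule[0]] if den[rule[0]] > 0 else p
            for rule, p in theta.items()}

def em(sentences, theta, expected_counts, iterations=20):
    for _ in range(iterations):
        theta = m_step(sentences, theta, expected_counts)
    return theta
```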

Example Parse

[Figure: two parse trees for the sentence "The screen was a sea of red", comparing a parse under the induced grammar with the Treebank parse]

Grammar Induction 13/27

Local Maxima

Carroll&Charniak, 1992:

• Constraints on grammar topology

– Limit on rule length

– Non-lexicalized

– Rules are added while processing sentences in incremental order

– Rare rules are removed

• Generate a corpus using a PCFG

• Initialize rules with random probabilities (repeated 300 times)

• None of the derived grammars matched the original

Grammar Induction 14/27

Grammar Format

• Lari&Young, 1990: Satisfactory grammar learning requires more nonterminals than are theoretically needed to describe the language at hand

• There is no guarantee that the nonterminals that the algorithm learns will have any resemblance to nonterminals motivated in linguistic analysis

• Constraints on the grammar format may simplify the reestimation procedure

– Carroll&Charniak, 1992: Specify constraints on non-terminals that may appear together on the right-hand side of the rule

Grammar Induction 15/27

Partially Unsupervised Learning
Pereira&Schabes, 1992

• Idea: Encourage the probabilities into a good region of the parameter space

• Implementation: modify the Inside-Outside algorithm to consider only parses that do not cross the provided bracketing (a small sketch of this span test appears after the results below)

• Experiments: 15 nonterminals over 45 part-of-speech tags. The algorithm uses the Treebank bracketing but ignores the labels

• Evaluation measure: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)

• Results:

Grammar Induction 16/27

– Constrained and unconstrained grammars have similar cross-entropy

– But very different bracketing accuracy: 37% vs. 90%

Grammar Induction 17/27
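Returning to the implementation bullet on the previous slide, the bracketing constraint amounts to a simple span test. A minimal sketch, assuming spans are (start, end) pairs with the end index exclusive; the example brackets are made up.

```python
# A candidate span is allowed only if it does not cross any provided bracket.
def crosses(span, bracket):
    (i, j), (k, l) = span, bracket
    return (i < k < j < l) or (k < i < l < j)

def allowed(span, brackets):
    return not any(crosses(span, b) for b in brackets)

brackets = [(0, 2), (2, 5)]          # e.g. a partial bracketing of a sentence
print(allowed((0, 2), brackets))     # True: identical to a given bracket
print(allowed((1, 3), brackets))     # False: crosses (0, 2)
```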

Linguistic Constituency

• Look at Noun Phrases:

you, the man, a cat with a limp, three black cats

They may appear in the following contexts: ___ saw me, He saw ___, The elephant sat on ___

• EM assumptions/greediness may not work:

– The important tests for linguistic constituency emphasize external distribution, not internal structure

Grammar Induction 18/27

Finding Topology
Stolcke&Omohundro, 1994: Bayesian model merging

• Data incorporation: Given a body of data X, build an initial model M0 by explicitly accommodating each data point individually such that M0 maximizes the likelihood P(X|M0).

• Generalization: Build a sequence of new models, obtaining Mi+1 from Mi by applying a merging operator m that coalesces substructures in Mi: Mi+1 = m(Mi), i = 0, 1, ...

• Optimization: Maximize posterior probability

• Search strategy: Greedy or beam search through the space of possible merges

Grammar Induction 19/27

HMM Topology Induction

• Data incorporation: For each observed sample create a unique path between the initial and final states by assigning a new state to each symbol token in the sample

• Generalization: Two HMM states are replaced by a single new state, which inherits the union of the transitions and emissions from the old states.

Grammar Induction 20/27
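A sketch of the two operations above under a simplified representation (emission counts per state plus a transition count table); the state numbering and the sample strings are illustrative assumptions, not taken from the paper.

```python
# Data incorporation: one fresh state per symbol token, one I -> ... -> F
# path per sample. Merging: one state absorbs another, inheriting the
# union (sum) of its transition and emission counts.
from collections import defaultdict, Counter

def incorporate(samples):
    emit, trans, next_id = {}, defaultdict(int), 0
    for sample in samples:
        prev = "I"
        for symbol in sample:
            state, next_id = next_id, next_id + 1
            emit[state] = Counter([symbol])
            trans[(prev, state)] += 1
            prev = state
        trans[(prev, "F")] += 1
    return emit, trans

def merge(emit, trans, keep, drop):
    new_trans = defaultdict(int)
    for (a, b), c in trans.items():
        a = keep if a == drop else a
        b = keep if b == drop else b
        new_trans[(a, b)] += c
    new_emit = {s: c for s, c in emit.items() if s != drop}
    new_emit[keep] = emit[keep] + emit[drop]
    return new_emit, new_trans

emit, trans = incorporate(["ab", "abab"])
emit, trans = merge(emit, trans, 0, 2)   # merge the first two states that emit 'a'
```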

HMM Topology Induction

• Prior distribution: Choose uninformative priors for a model M with topology M_S and parameters θ_M:

P(M) = P(M_S) P(θ_M | M_S)

P(M_S) ∝ exp(−l(M_S))

where l(M_S) is the number of bits required to encode M_S.

• Search: Greedy merging strategy.

Grammar Induction 21/27
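A sketch of the posterior-guided greedy search described above. `description_length` (standing in for the encoding length l(M_S)) and `log_likelihood` are placeholder callables whose exact definitions are not given here; only their interfaces are assumed.

```python
# Greedy merging: keep applying the single merge that most improves
# log P(M) + log P(X | M), and stop when no merge improves it.
def log_posterior(model, data, description_length, log_likelihood):
    # log prior is -l(M_S) up to a constant, per the slide above
    return -description_length(model) + log_likelihood(data, model)

def greedy_merge(model, data, candidate_merges, apply_merge,
                 description_length, log_likelihood):
    best = log_posterior(model, data, description_length, log_likelihood)
    while True:
        best_merged = None
        for pair in candidate_merges(model):
            merged = apply_merge(model, *pair)
            score = log_posterior(merged, data, description_length, log_likelihood)
            if score > best:
                best, best_merged = score, merged
        if best_merged is None:
            return model            # no merge improves the posterior
        model = best_merged
```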

Example

[Figure: HMM model merging example over the symbols a and b. The initial model (one state per symbol token, one path from I to F per sample) is successively collapsed by state merges into a compact model; transition probabilities such as 0.5, 0.67, and 0.33 are re-estimated after merging.]

Grammar Induction 22/27

PCFG Induction

• Data Incorporation: Add a top-level production that covers the sample precisely. Create one nonterminal for each observed terminal.

• Merging and Chunking: During merging, two nonterminals are replaced by a single new state. Chunking takes a given sequence of nonterminals and abbreviates it using a newly created nonterminal.

• Prior distribution: Similar to HMM.

• Search: Beam search.

Grammar Induction 23/27
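A sketch of the chunking operator just described, with rules stored as (lhs, rhs-tuple) pairs (an assumed representation). Applied to the initial grammar, it reproduces the first chunking step of the example on the next slide.

```python
# Chunking: abbreviate a sequence of nonterminals by a new nonterminal,
# replacing the sequence wherever it occurs in rule right-hand sides.
def chunk(rules, sequence, new_nt):
    seq, out = tuple(sequence), set()
    for lhs, rhs in rules:
        rhs = list(rhs)
        i = 0
        while i + len(seq) <= len(rhs):
            if tuple(rhs[i:i + len(seq)]) == seq:
                rhs[i:i + len(seq)] = [new_nt]
            i += 1
        out.add((lhs, tuple(rhs)))
    out.add((new_nt, seq))       # the new rule new_nt -> sequence
    return out

rules = {("S", ("A", "B")), ("S", ("A", "A", "B", "B")),
         ("S", ("A", "A", "A", "B", "B", "B")), ("A", ("a",)), ("B", ("b",))}
rules = chunk(rules, ("A", "B"), "X")
# -> S -> X | A X B | A A X B B,  X -> A B,  A -> a,  B -> b
```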

Example
Input: ab, aabb, aaabbb

S → A B | A A B B | A A A B B B
A → a    B → b

Chunk(AB) → X:
S → X | A X B | A A X B B
X → A B

Chunk(AXB) → Y:
S → X | Y | A Y B
X → A B    Y → A X B

Merge S, Y:
S → X | A S B
X → A B

Merge S, X:
S → A B | A S B

Grammar Induction 24/27

Results for PCFGs

• Formal language experiments

– Successfully learned simple grammars:

Language              Sample no.   Grammar                   Search
Parentheses           8            S → () | (S) | SS         BF
a^(2n)                5            S → aa | SS               BF
(ab)^n                5            S → ab | aSb              BF
w c w^R, w ∈ {a, b}   7            S → c | aSa | bSb         BS(3)
Addition strings      23           S → a | b | (S) | S + S   BS(4)

• Natural Language syntax

– Mixed results (issues related to data sparseness)

Grammar Induction 25/27

Example of Learned Grammar

Target Grammar              Learned Grammar
S → NP VP                   S → NP VP
VP → Verb NP                VP → V NP
NP → Det Noun               NP → Det N
NP → Det Noun RC            NP → NP RC
RC → Rel VP                 RC → REL VP
Verb → saw | heard          V → saw | heard
Noun → cat | dog | mouse    N → cat | dog | mouse
Det → a | the               Det → a | the
Rel → that                  Rel → that

Grammar Induction 26/27

Other approaches

• Recent approaches in learning constituency:

– Clark 2001: Mutual-information filters detect constituents, then an MDL-guided search assembles them

– van Zaanen, 2000: Aligns similar sentences and extracts their differences

• Results on bracketing:

Clark                      34.6
van Zaanen                 35.6
Right-branching baseline   46.4

Grammar Induction 27/27