
Grammar Induction

Regina Barzilay

EECS Department

MIT

November 15, 2004

Unsupervised Grammar Induction

• Goal: Learn syntactic structure from raw text

• Motivation for unsupervised learning:

– Huge amounts of raw text are readily available (compare with the roughly 50K sentences of the Penn Treebank)

– We can easily adapt to new domains and tasks

– Humans can learn language with little supervision

Grammar Induction 1/27

What a Crazy Idea!

• Gold (1967): Negative learnability results

– No superfinite class of languages is learnable without negative examples

• Chomsky (1980): The poverty of stimulus

– Children are born with a hard-wired language acquisition device (LAD)

– Free parameters of the LAD are set through interaction with other humans

Grammar Induction 2/27

Response on Negative Learnability

• If texts are generated by some stochastic process, language acquisition can be guided by observed frequencies

– Constructions that occur with very low frequency serve as pseudo-negative examples

• Horning (1969): stochastic context-free grammars are learnable from positive examples alone

Grammar Induction 3/27

Response on Poverty of the Stimulus

• There’s a lot of stimulus: Baayen’s estimate of 200 million words by adulthood

• Language acquisition is emergent result of the interaction between cognition, learning abilities, and the structure of the linguistic data the child receives (Jackendoff, 2002)

Grammar Induction 4/27

Learning Grammars

• A discrete structure (the topology of the HMM, the context-free backbone of a PCFG)

• A set of continuous parameters which determine the probability of the sentences described by the grammar

Grammar Induction 5/27
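To make the two components above concrete, here is a minimal sketch (not from the lecture) of one way to represent them in code: a set of rules as the discrete backbone and a dictionary of rule probabilities as the continuous parameters. The Rule and PCFG classes and the toy rules are illustrative assumptions.

```python
# Minimal illustrative representation: a discrete backbone (set of rules)
# plus continuous parameters (one probability per rule).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str          # left-hand-side nonterminal
    rhs: tuple        # right-hand-side sequence of terminals/nonterminals

@dataclass
class PCFG:
    rules: set = field(default_factory=set)    # discrete structure
    probs: dict = field(default_factory=dict)  # continuous parameters

    def add(self, lhs, rhs, p):
        r = Rule(lhs, tuple(rhs))
        self.rules.add(r)
        self.probs[r] = p

g = PCFG()
g.add("S", ["NP", "VP"], 1.0)
g.add("NP", ["Det", "Noun"], 1.0)
```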

Learn PCFGs with EM

• (Lari&Young 1990): Learning PCFGs with EM

– Full binary grammar over n symbols

– Parse randomly at first

– Re-estimate rule probabilities of parses

– Repeat

Grammar Induction 6/27
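A small sketch of the first two steps above, under the assumption that rules are stored as (lhs, rhs) pairs and probabilities in a plain dict: build a full binary grammar over n nonterminal symbols and give every rule a random probability, normalized per left-hand side. The function name and the toy terminals are illustrative; the re-estimation step is spelled out on the following slides.

```python
# Illustrative initialization: every binary rule X -> Y Z over n nonterminal
# symbols, plus unary rules X -> w for each terminal, with random
# probabilities normalized per left-hand side.
import random

def init_full_binary_grammar(n, terminals, seed=0):
    random.seed(seed)
    nonterminals = [f"X{i}" for i in range(n)]
    probs = {}
    for lhs in nonterminals:
        rules = [(lhs, (b, c)) for b in nonterminals for c in nonterminals]
        rules += [(lhs, (w,)) for w in terminals]
        weights = [random.random() for _ in rules]
        z = sum(weights)
        for rule, w in zip(rules, weights):
            probs[rule] = w / z      # rules with the same lhs sum to 1
    return probs

theta = init_full_binary_grammar(5, ["the", "cat", "sat"])
```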

Preliminaries

• R is the set of rules in the context-free grammar

• N is the set of non-terminals in the grammar

• Θ_r for r ∈ R is the parameter for rule r

• Let R(λ) ⊂ R be the rules of the form λ → β for some β

• The parameter space Ω is the set of Θ ∈ [0, 1]^|R| such that for all α ∈ N:

Σ_{r ∈ R(α)} Θ_r = 1

Grammar Induction 7/27
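As a quick illustration of the constraint defining Ω, the following sketch checks that a parameter dictionary sums to one over each R(α). Rules are written as (lhs, rhs) pairs and the toy probabilities are made up.

```python
# Check the constraint above: for every nonterminal alpha, the probabilities
# of rules with alpha on the left-hand side sum to 1.
from collections import defaultdict

theta = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
}

def is_valid_parameter_vector(theta, tol=1e-9):
    totals = defaultdict(float)
    for (lhs, _), p in theta.items():
        if not 0.0 <= p <= 1.0:
            return False
        totals[lhs] += p
    return all(abs(t - 1.0) < tol for t in totals.values())

print(is_valid_parameter_vector(theta))   # True
```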

Preliminaries (cont.)

P(T | Θ) = Π_{r ∈ R} Θ_r^{Count(T, r)}

where Count(T, r) is the number of times rule r is seen in the tree T

log P(T | Θ) = Σ_{r ∈ R} Count(T, r) log Θ_r

Grammar Induction 8/27
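A tiny sketch of the log form above, assuming the tree T is already summarized by its rule counts; the counts and probabilities below are made up.

```python
# Illustrative computation of log P(T | Theta) from rule counts,
# following the formula above.
import math

def log_prob_tree(counts, theta):
    """counts: {rule: Count(T, r)}, theta: {rule: Theta_r}."""
    return sum(c * math.log(theta[r]) for r, c in counts.items())

counts = {("S", ("NP", "VP")): 1, ("NP", ("Det", "Noun")): 2}
theta = {("S", ("NP", "VP")): 1.0, ("NP", ("Det", "Noun")): 0.7}
print(log_prob_tree(counts, theta))
```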

EM for PCFGs

• A PCFG defines a distribution P (S, T |Θ) over tree/sentence pairs (S, T )

• If we had tree/sentence pairs (fully observed data) then

L(Θ) = Σ_i log P(S_i, T_i | Θ)

• If we only have raw sentences S1, . . . , Sn, then trees are hidden variables

L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

Grammar Induction 9/27

EM for PCFGs (cont.)

• Trees are hidden variables:

L(Θ) = Σ_i log Σ_T P(S_i, T | Θ)

• The EM algorithm is then Θ^t = argmax_Θ Q(Θ, Θ^{t−1}), where

Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)

Grammar Induction 10/27

EM for PCFGs (cont.)

log P(T, S_i | Θ) = Σ_{r ∈ R} Count(S_i, T, r) log Θ_r,

where Count(S, T, r) is the number of times rule r is seen in the sentence/tree pair (S, T)

Q(Θ, Θ^{t−1}) = Σ_i Σ_T P(T | S_i, Θ^{t−1}) log P(S_i, T | Θ)

= Σ_i Σ_T P(T | S_i, Θ^{t−1}) Σ_{r ∈ R} Count(S_i, T, r) log Θ_r

= Σ_i Σ_{r ∈ R} Count(S_i, r) log Θ_r

where the expected counts are

Count(S_i, r) = Σ_T P(T | S_i, Θ^{t−1}) Count(S_i, T, r)

Grammar Induction 11/27

EM for PCFGs (cont.)

• Solving Θ^{ML} = argmax_{Θ ∈ Ω} Q(Θ, Θ^{t−1}) gives:

Θ_{α → β} = [ Σ_i Count(S_i, α → β) ] / [ Σ_i Σ_{r ∈ R(α)} Count(S_i, r) ]

• Use the Inside-Outside algorithm to compute the expected counts efficiently (see the handout)

Grammar Induction 12/27
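The re-estimation formula above can be sketched as follows. Here `expected_counts(sentence, theta)` is a stand-in for the Inside-Outside computation and is only assumed to return the expected Count(S_i, r) for each rule; everything else follows the formula directly.

```python
# Sketch of the M-step: expected counts are pooled over sentences and
# normalized per left-hand-side nonterminal, as in the formula above.
from collections import defaultdict

def m_step(sentences, theta, expected_counts):
    num = defaultdict(float)      # summed expected counts per rule
    den = defaultdict(float)      # summed expected counts per nonterminal
    for s in sentences:
        for rule, c in expected_counts(s, theta).items():
            num[rule] += c
            den[rule[0]] += c
    return {rule: num[rule] / den[rule[0]] if den[rule[0]] > 0 else p
            for rule, p in theta.items()}

def em(sentences, theta, expected_counts, iterations=20):
    for _ in range(iterations):
        theta = m_step(sentences, theta, expected_counts)
    return theta
```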

Example Parse

[Figure: two parse trees for the sentence "The screen was a sea of red", comparing a parse under the induced grammar with the Treebank parse]

Grammar Induction 13/27

Local Maxima

Carroll&Charniak, 1992:

• Constraints on grammar topology

– Limit on rule length

– Non-lexicalized

– Rules are added while processing sentences in incremental order

– Rare rules are removed

• Generate a corpus using a PCFG

• Initialize rules with random probabilities (repeated 300 times)

• None of the derived grammars matched the original

Grammar Induction 14/27

Grammar Format

• Lari&Young, 1990: Satisfactory grammar learning requires more nonterminals than are theoretically needed to describe the language at hand

• There is no guarantee that the nonterminals that the algorithm learns will have any resemblance to nonterminals motivated in linguistic analysis

• Constraints on the grammar format may simplify the reestimation procedure

– Carroll&Charniak, 1992: Specify constraints on non-terminals that may appear together on the right-hand side of the rule

Grammar Induction 15/27

Partially Unsupervised Learning
Pereira&Schabes, 1992

• Idea: Encourage the probabilities into a good region of the parameter space

• Implementation: modify the Inside-Outside algorithm to consider only parses that do not cross the provided bracketing (a small sketch of this span test appears after the results below)

• Experiments: 15 nonterminals over 45 part-of-speech tags. The algorithm uses the Treebank bracketing but ignores the labels

• Evaluation measure: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)

• Results:

Grammar Induction 16/27

– Constrained and unconstrained grammars have similar cross-entropy

– But very different bracketing accuracy: 37% vs. 90%

Grammar Induction 17/27
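Returning to the implementation bullet on the previous slide, the bracketing constraint amounts to a simple span test. A minimal sketch, assuming spans are (start, end) pairs with the end index exclusive; the example brackets are made up.

```python
# A candidate span is allowed only if it does not cross any provided bracket.
def crosses(span, bracket):
    (i, j), (k, l) = span, bracket
    return (i < k < j < l) or (k < i < l < j)

def allowed(span, brackets):
    return not any(crosses(span, b) for b in brackets)

brackets = [(0, 2), (2, 5)]          # e.g. a partial bracketing of a sentence
print(allowed((0, 2), brackets))     # True: identical to a given bracket
print(allowed((1, 3), brackets))     # False: crosses (0, 2)
```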

Linguistic Constituency

• Look at Noun Phrases:

you, the man, a cat with a limp, three black cats

They may appear in the following contexts: ___ saw me, He saw ___, The elephant sat on ___

• EM assumptions/greediness may not work:

– The important tests for linguistic constituency emphasize external distribution, not internal structure

Grammar Induction 18/27

Finding Topology
Stolcke&Omohundro, 1994: Bayesian model merging

• Data incorporation: Given a body of data X, build an initial model M0 by explicitly accommodating each data point individually such that M0 maximizes the likelihood P(X|M0).

• Generalization: Build a sequence of new models, obtaining Mi+1 from Mi by applying a merging operator m that coalesces substructures in Mi: Mi+1 = m(Mi), i = 0, 1, ...

• Optimization: Maximize posterior probability

• Search strategy: Greedy or beam search through the space of possible merges

Grammar Induction 19/27

HMM Topology Induction

• Data incorporation: For each observed sample create a unique path between the initial and final states by assigning a new state to each symbol token in the sample

• Generalization: Two HMM states are replaced by a single new state, which inherits the union of the transitions and emissions from the old states.

Grammar Induction 20/27
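A sketch of the two operations above under a simplified representation (emission counts per state plus a transition count table); the state numbering and the sample strings are illustrative assumptions, not taken from the paper.

```python
# Data incorporation: one fresh state per symbol token, one I -> ... -> F
# path per sample. Merging: one state absorbs another, inheriting the
# union (sum) of its transition and emission counts.
from collections import defaultdict, Counter

def incorporate(samples):
    emit, trans, next_id = {}, defaultdict(int), 0
    for sample in samples:
        prev = "I"
        for symbol in sample:
            state, next_id = next_id, next_id + 1
            emit[state] = Counter([symbol])
            trans[(prev, state)] += 1
            prev = state
        trans[(prev, "F")] += 1
    return emit, trans

def merge(emit, trans, keep, drop):
    new_trans = defaultdict(int)
    for (a, b), c in trans.items():
        a = keep if a == drop else a
        b = keep if b == drop else b
        new_trans[(a, b)] += c
    new_emit = {s: c for s, c in emit.items() if s != drop}
    new_emit[keep] = emit[keep] + emit[drop]
    return new_emit, new_trans

emit, trans = incorporate(["ab", "abab"])
emit, trans = merge(emit, trans, 0, 2)   # merge the first two states that emit 'a'
```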

HMM Topology Induction

• Prior distribution: Choose uninformative priors for a model M with topology M_S and parameters θ_M:

P(M) = P(M_S) P(θ_M | M_S)

P(M_S) ∝ exp(−l(M_S))

where l(M_S) is the number of bits required to encode M_S.

• Search: Greedy merging strategy.

Grammar Induction 21/27
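A sketch of the posterior-guided greedy search described above. `description_length` (standing in for the encoding length l(M_S)) and `log_likelihood` are placeholder callables whose exact definitions are not given here; only their interfaces are assumed.

```python
# Greedy merging: keep applying the single merge that most improves
# log P(M) + log P(X | M), and stop when no merge improves it.
def log_posterior(model, data, description_length, log_likelihood):
    # log prior is -l(M_S) up to a constant, per the slide above
    return -description_length(model) + log_likelihood(data, model)

def greedy_merge(model, data, candidate_merges, apply_merge,
                 description_length, log_likelihood):
    best = log_posterior(model, data, description_length, log_likelihood)
    while True:
        best_merged = None
        for pair in candidate_merges(model):
            merged = apply_merge(model, *pair)
            score = log_posterior(merged, data, description_length, log_likelihood)
            if score > best:
                best, best_merged = score, merged
        if best_merged is None:
            return model            # no merge improves the posterior
        model = best_merged
```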

Example

[Figure: HMM model merging example over the symbols a and b. The initial model (one state per symbol token, one path from I to F per sample) is successively collapsed by state merges into a compact model; transition probabilities such as 0.5, 0.67, and 0.33 are re-estimated after merging.]

Grammar Induction 22/27

PCFG Induction

• Data Incorporation: Add a top-level production that covers the sample precisely. Create one nonterminal for each observed terminal.

• Merging and Chunking: During merging, two nonterminals are replaced by a single new state. Chunking takes a given sequence of nonterminals and abbreviates it using a newly created nonterminal.

• Prior distribution: Similar to HMM.

• Search: Beam search.

Grammar Induction 23/27
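A sketch of the chunking operator just described, with rules stored as (lhs, rhs-tuple) pairs (an assumed representation). Applied to the initial grammar, it reproduces the first chunking step of the example on the next slide.

```python
# Chunking: abbreviate a sequence of nonterminals by a new nonterminal,
# replacing the sequence wherever it occurs in rule right-hand sides.
def chunk(rules, sequence, new_nt):
    seq, out = tuple(sequence), set()
    for lhs, rhs in rules:
        rhs = list(rhs)
        i = 0
        while i + len(seq) <= len(rhs):
            if tuple(rhs[i:i + len(seq)]) == seq:
                rhs[i:i + len(seq)] = [new_nt]
            i += 1
        out.add((lhs, tuple(rhs)))
    out.add((new_nt, seq))       # the new rule new_nt -> sequence
    return out

rules = {("S", ("A", "B")), ("S", ("A", "A", "B", "B")),
         ("S", ("A", "A", "A", "B", "B", "B")), ("A", ("a",)), ("B", ("b",))}
rules = chunk(rules, ("A", "B"), "X")
# -> S -> X | A X B | A A X B B,  X -> A B,  A -> a,  B -> b
```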

Example
Input: ab, aabb, aaabbb

S → A B | A A B B | A A A B B B
A → a    B → b

Chunk(AB) → X:
S → X | A X B | A A X B B
X → A B

Chunk(AXB) → Y:
S → X | Y | A Y B
X → A B    Y → A X B

Merge S, Y:
S → X | A S B
X → A B

Merge S, X:
S → A B | A S B

Grammar Induction 24/27

Results for PCFGs

• Formal language experiments

– Successfully learned simple grammars:

Language              Sample no.   Grammar                   Search
Parentheses           8            S → () | (S) | SS         BF
a^(2n)                5            S → aa | SS               BF
(ab)^n                5            S → ab | aSb              BF
w c w^R, w ∈ {a, b}   7            S → c | aSa | bSb         BS(3)
Addition strings      23           S → a | b | (S) | S + S   BS(4)

• Natural Language syntax

– Mixed results (issues related to data sparseness)

Grammar Induction 25/27

Example of Learned Grammar

Target Grammar              Learned Grammar
S → NP VP                   S → NP VP
VP → Verb NP                VP → V NP
NP → Det Noun               NP → Det N
NP → Det Noun RC            NP → NP RC
RC → Rel VP                 RC → REL VP
Verb → saw | heard          V → saw | heard
Noun → cat | dog | mouse    N → cat | dog | mouse
Det → a | the               Det → a | the
Rel → that                  Rel → that

Grammar Induction 26/27

Other approaches

• Recent approaches in learning constituency:

– Clark 2001: Mutual-information filters detect constituents, then an MDL-guided search assembles them

– van Zaanen, 2000: Aligns similar sentences and extracts their differences

• Results on bracketing:

Clark                      34.6
van Zaanen                 35.6
Right-branching baseline   46.4

Grammar Induction 27/27