
1

Declarative Specification of NLP Systems

Jason Eisner

IBM, May 2006

student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble


2

An Anecdote from ACL’05

-Michael Jordan


3

An Anecdote from ACL’05

Just draw a model that actually makes sense for your problem.

-Michael Jordan

Just do Gibbs sampling. Um, it’s only 6 lines in Matlab…


4

Conclusions to draw from that talk

1. Mike & his students are great.

2. Graphical models are great. (because they’re flexible)

3. Gibbs sampling is great. (because it works with nearly any graphical model)

4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)


5

1. Mike & his students are great.

2. Graphical models are great. (because they’re flexible)

3. Gibbs sampling is great. (because it works with nearly any graphical model)

4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)


6

Parts of it already are …
  Language modeling
  Binary classification (e.g., SVMs)
  Finite-state transductions
  Linear-chain graphical models

Toolkits available; you don’t have to be an expert

Efficient parsers and MT systems are complicated and painful to write

But other parts aren’t …
  Context-free and beyond
  Machine translation


7

This talk: A toolkit that’s general enough for these cases.

(stretches from finite-state to Turing machines)

“Dyna”

Efficient parsers and MT systems are complicated and painful to write

But other parts aren’t …
  Context-free and beyond
  Machine translation


8

Warning

Lots more beyond this talk

see the EMNLP’05 and FG’06 papers

see http://dyna.org (download + documentation)

sign up for updates by email

wait for the totally revamped next version


9

the case for Little Languages

declarative programming

small is beautiful


10

Sapir-Whorf hypothesis

Language shapes thought. At least, it shapes conversation.

Computer language shapes thought. At least, it shapes experimental research.
  Lots of cute ideas that we never pursue
  Or if we do pursue them, it takes 6-12 months to implement on large-scale data
  Have we turned into a lab science?


11

Declarative Specifications

State what is to be done

(How should the computer do it? Turn that over to a general “solver” that handles the specification language.)

Hundreds of domain-specific “little languages” out there. Some have sophisticated solvers.


12

dot (www.graphviz.org)

digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];

  "node0" [label = "<f0> 0x10ba8| <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380| <f1> | <f2> |-1", shape = "record"];
  …
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
  …
}

nodes

edges

What’s the hard part? Making a nice layout! Actually, it’s NP-hard …


13

dot (www.graphviz.org)


14

LilyPond (www.lilypond.org)

c4

<<c4 d4 e4>>

{ f4 <<c4 d4 e4>> }

<< g2 \\ { f4 <<c4 d4 e4>> } >>


15

LilyPond (www.lilypond.org)


16

Declarative Specs in NLP

Regular expression (for a FST toolkit)
Grammar (for a parser)
Feature set (for a maxent distribution, SVM, etc.)
Graphical model (DBNs for ASR, IE, etc.)

Claim of this talk:

Sometimes it’s best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.


17

Regular expression (for a FST toolkit) Grammar (for a parser)

Feature set (for a maxent distribution, SVM, etc.)

Not always flexible enough … Need to open up the parser and rejigger it. Declarative specification of algorithms.

Not always flexible enough … Need to open up the learner and rejigger it. Declarative specification of objective functions.

New toolkit

Existing toolkits

Declarative Specs in NLP


18

Declarative Specification of Algorithms


19

How you build a system (“big picture” slide)

cool model

tuned C++ implementation

(data structures, etc.)

practical equations

pseudocode(execution order)

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1

PCFG


20

Wait a minute …

Didn’t I just implement something like this last month?

chart management / indexing
cache-conscious data structures
prioritization of partial solutions (best-first, A*)
parameter management
inside-outside formulas
different algorithms for training and decoding
conjugate gradient, annealing, ...
parallelization?

I thought computers were supposed to automate drudgery


21

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1

How you build a system (“big picture” slide)

cool model

tuned C++ implementation

(data structures, etc.)

pseudocode(execution order)

PCFG

Dyna language specifies these equations.

Most programs just need to compute some values from other values. Any order is ok.

Some programs also need to update the outputs if the inputs change:
  spreadsheets, makefiles, email readers
  dynamic graph algorithms
  EM and other iterative optimization
  leave-one-out training of smoothing params

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

practical equations


22

How you build a system (“big picture” slide)

cool model

practical equations

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

PCFG

Compilation strategies (we’ll come back to this)

tuned C++ implementation

(data structures, etc.)

pseudocode(execution order)

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1


23

Writing equations in Dyna

int a.
a = b * c.

a will be kept up to date if b or c changes.

b += x.
b += y.     equivalent to b = x+y.

b is a sum of two variables. Also kept up to date.

c += z(1). c += z(2). c += z(3). c += z(“four”). c += z(foo(bar,5)).

c is a sum of all nonzero z(…) values.

At compile time, we don’t know how many!

a “pattern”: the capitalized N matches anything

c += z(N).


24

More interesting use of patterns

a = b * c.                    scalar multiplication

a(I) = b(I) * c(I).           pointwise multiplication

a += b(I) * c(I).             means a = Σ_I b(I)*c(I)
                              dot product; could be sparse
                              ... + b(“yetis”)*c(“yetis”) + b(“zebra”)*c(“zebra”)
                              sparse dot product of query & document

a(I,K) += b(I,J) * c(J,K).    means a(I,K) = Σ_J b(I,J)*c(J,K)
                              matrix multiplication; could be sparse
                              J is free on the right-hand side, so we sum over it
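To make the summation-over-free-variables reading concrete, here is a minimal plain-Python sketch of what the last two rules compute; the dictionaries and their entries below are invented for illustration and are not Dyna syntax:

# sparse vectors b(I), c(I) and sparse matrices b(I,J), c(J,K) as Python dicts
b_vec = {"yetis": 2.0, "zebra": 3.0}
c_vec = {"zebra": 5.0, "aardvark": 7.0}

# a += b(I) * c(I).   -- I is free on the right-hand side, so we sum over it
a = sum(v * c_vec[i] for i, v in b_vec.items() if i in c_vec)   # sparse dot product: 15.0

b_mat = {("x", "y"): 1.0, ("x", "z"): 2.0}
c_mat = {("y", "k"): 3.0, ("z", "k"): 4.0}

# a(I,K) += b(I,J) * c(J,K).   -- J is summed over; I and K index the result
a_mat = {}
for (i, j), bv in b_mat.items():
    for (j2, k), cv in c_mat.items():
        if j == j2:
            a_mat[(i, k)] = a_mat.get((i, k), 0.0) + bv * cv

print(a, a_mat)    # 15.0 {('x', 'k'): 11.0}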


25

By now you may see what we’re up to!

Prolog has Horn clauses:

a(I,K) :- b(I,J) , c(J,K).

Dyna has “Horn equations”:

a(I,K) += b(I,J) * c(J,K).

Dyna vs. Prolog

has a value, e.g., a real number

definition from other values

Like Prolog:
  Allows nested terms
  Syntactic sugar for lists, etc.
  Turing-complete

Unlike Prolog:
  Charts, not backtracking!
  Compiles to efficient C++ classes
  Integrates with your C++ code


26

using namespace cky;
chart c;

c[rewrite(“s”,”np”,”vp”)] = 0.7;
c[word(“Pierre”,0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin

cout << c[goal];        // print total weight of all parses

The CKY inside algorithm in Dyna:

:- double item = 0.

:- bool length = false.

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

put in axioms (values not defined by the above program)

theorem pops out
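For readers who want to see the computation this program denotes, here is a short Python rendering of the three rules, run bottom-up; the tiny grammar and two-word sentence are invented for illustration:

from collections import defaultdict

rewrite_bin = {("s", "np", "vp"): 0.7}                          # rewrite(X,Y,Z)
rewrite_lex = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}    # rewrite(X,W)
words = [("Pierre", 0, 1), ("sleeps", 1, 2)]                    # word(W,I,J) axioms
n = 2                                                           # length(N)

constit = defaultdict(float)
# constit(X,I,J) += word(W,I,J) * rewrite(X,W).
for (w, i, j) in words:
    for (x, w2), p in rewrite_lex.items():
        if w2 == w:
            constit[(x, i, j)] += p
# constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
for width in range(2, n + 1):
    for i in range(0, n - width + 1):
        k = i + width
        for mid in range(i + 1, k):
            for (x, y, z), p in rewrite_bin.items():
                constit[(x, i, k)] += constit[(y, i, mid)] * constit[(z, mid, k)] * p
# goal += constit("s",0,N) if length(N).
goal = constit[("s", 0, n)]
print(goal)    # 0.7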


27

visual debugger – browse the proof forest

ambiguity

shared substructure


28

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).


29

Related algorithms in Dyna?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

Viterbi parsing: same program, but with each += replaced by max=.


30

Related algorithms in Dyna?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

Logarithmic domain: each * becomes +, and each += becomes log+= (or max= for Viterbi).


31

Lattice parsing: just change the axioms, e.g. c[ word(“Pierre”, state(5), state(9)) ] = 0.2 in place of c[ word(“Pierre”, 0, 1) ] = 1.

Related algorithms in Dyna?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

(figure: a fragment of a word lattice with states 5, 8, 9 and arcs Pierre/0.2, P/0.5, air/0.3)


32

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Just add words one at a time to the chart. Check at any time what can be derived from the words so far.

Similarly, dynamic grammars


33

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Again, no change to the Dyna program


34

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

Basically, just add extra arguments to the terms above


35

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).


36

Earley’s algorithm in Dyna

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).

need(“s”,0) = true.

need(Nonterm,J) |= ?constit(_/[Nonterm|_],_,J).

constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) if need(Nonterm,I).

constit(Nonterm/Needed,I,K) += constit(Nonterm/[W|Needed],I,J) * word(W,J,K).

constit(Nonterm/Needed,I,K) += constit(Nonterm/[X|Needed],I,J) * constit(X/[],J,K).

goal += constit(“s”/[],0,N) if length(N).

magic templates transformation (as noted by Minnen 1996)


37

pseudocode(execution order)

Program transformations

tuned C++ implementation

(data structures, etc.)

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1

Blatz & Eisner (FG 2006):

Lots of equivalent ways to write a system of equations!

Transforming from one to another may improve efficiency.

Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!

cool model

practical equations

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

PCFG


38

Related algorithms in Dyna?

Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?

constit(X,I,J) += word(W,I,J) * rewrite(X,W).

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

goal += constit(“s”,0,N) if length(N).


39

Rule binarization

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).

constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).

folding transformation: asymp. speedup!

(figure: spans for the folded rule — Y covers [I, Mid], Z covers [Mid, J]; the new item X\Y covers [Mid, J], and X over [I, J] is built from Y plus X\Y)


40

Rule binarization

constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).

constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).

folding transformation: asymp. speedup!

Σ_{Y,Z,Mid}  constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z)
    =  Σ_{Y,Mid}  constit(Y,I,Mid) * [ Σ_Z  constit(Z,Mid,J) * rewrite(X,Y,Z) ]
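Since folding just regroups a sum, a quick numeric check can illustrate why the two programs compute the same value; the toy nonterminals, positions, and random weights below are made up:

import itertools, random

random.seed(0)
NT, POS = ["s", "np", "vp"], range(4)
constit = {(x, i, j): random.random() for x in NT for i in POS for j in POS if i < j}
rewrite = {(x, y, z): random.random() for x in NT for y in NT for z in NT}

def direct(x, i, j):
    # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
    return sum(constit[(y, i, m)] * constit[(z, m, j)] * rewrite[(x, y, z)]
               for y, z, m in itertools.product(NT, NT, range(i + 1, j)))

def folded(x, i, j):
    # temp plays the role of constit(X\Y,Mid,J): the sum over Z is done only once
    temp = {(y, m): sum(constit[(z, m, j)] * rewrite[(x, y, z)] for z in NT)
            for y in NT for m in range(i + 1, j)}
    return sum(constit[(y, i, m)] * temp[(y, m)] for y in NT for m in range(i + 1, j))

print(abs(direct("s", 0, 3) - folded("s", 0, 3)) < 1e-12)    # True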

graphical models, constraint programming, multi-way database join


41

More program transformations

Examples that add new semantics:
  Compute gradient (e.g., derive outside algorithm from inside)
  Compute upper bounds for A* (e.g., Klein & Manning ACL’03)
  Coarse-to-fine (e.g., Johnson & Charniak NAACL’06)

Examples that preserve semantics:
  On-demand computation – by analogy with Earley’s algorithm
    On-the-fly composition of FSTs
    Left-corner filter for parsing
  Program specialization as unfolding – e.g., compile out the grammar
  Rearranging computations – by analogy with categorial grammar
    Folding reinterpreted as slashed categories
    “Speculative computation” using slashed categories – abstract away repeated computation to do it once only, by analogy with unary rule closure or epsilon-closure; derives the Eisner & Satta ACL’99 O(n³) bilexical parser


42

Propagate updates from right-to-left through the equations.
a.k.a. “agenda algorithm,” “forward chaining,” “bottom-up inference,” “semi-naïve bottom-up”

How you build a system (“big picture” slide)

cool model

tuned C++ implementation

(data structures, etc.)

practical equations

pseudocode(execution order)

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1

PCFG

use a general method


43

Bottom-up inference

rules of program:                               s(I,K) += np(I,J) * vp(J,K)        pp(I,K) += prep(I,J) * np(J,K)
chart of derived items with current values:     np(3,5) = 0.1,  vp(5,7) = 0.7,  vp(5,9) = 0.5,  prep(2,3) = 1.0
agenda of pending updates:                      np(3,5) += 0.3

Pop the update np(3,5) += 0.3 from the agenda; the chart value becomes 0.1 + 0.3 = 0.4.
(If np(3,5) hadn’t been in the chart already, we would have added it.)
We updated np(3,5); what else must therefore change?
  Query vp(5,K)?  matches vp(5,7) and vp(5,9), pushing s(3,7) += 0.21 and s(3,9) += 0.15; no more matches to this query.
  Query prep(I,3)?  matches prep(2,3), pushing pp(2,5) += 0.3.
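Here is a minimal Python sketch of that propagation step for the single rule s(I,K) += np(I,J) * vp(J,K), using the same made-up numbers; it is only meant to illustrate the idea, not the Dyna runtime:

from collections import defaultdict, deque

chart = defaultdict(float)
chart[("np", 3, 5)] = 0.1
chart[("vp", 5, 7)] = 0.7
chart[("vp", 5, 9)] = 0.5

agenda = deque([(("np", 3, 5), 0.3)])            # pending update np(3,5) += 0.3

while agenda:
    (pred, i, j), delta = agenda.popleft()
    chart[(pred, i, j)] += delta                 # apply the update to the chart
    if pred == "np":                             # np(I,J) changed: query vp(J,K)?
        for (p2, j2, k), v in list(chart.items()):
            if p2 == "vp" and j2 == j:
                agenda.append((("s", i, k), delta * v))
    elif pred == "vp":                           # vp(J,K) changed: query np(I,J)?
        for (p2, i2, j2), v in list(chart.items()):
            if p2 == "np" and j2 == i:
                agenda.append((("s", i2, j), delta * v))

print(chart[("np", 3, 5)], chart[("s", 3, 7)], chart[("s", 3, 9)])   # 0.4 0.21 0.15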


44

inside(X, i, k)  +=  p(X → Y Z) · inside(Y, i, j) · inside(Z, j, k)      for 0 ≤ i < j < k ≤ n

How you build a system (“big picture” slide)

cool model

practical equations

pseudocode(execution order)

for width from 2 to n for i from 0 to n-width k = i+width for j from i+1 to k-1

PCFG

What’s going on under the hood?

tuned C++ implementation (data structures, etc.)


45

Compiler provides …

  copy, compare, & hash terms fast, via integerization (interning)
  efficient storage of terms (use native C++ types, “symbiotic” storage, garbage collection, serialization, …)
  automatic indexing for O(1) lookup (e.g., for the query vp(5,K)?)
  hard-coded pattern matching for the rules of the program (e.g., s(I,K) += np(I,J) * vp(J,K))
  an efficient priority queue for the agenda of pending updates
  the chart of derived items with current values (e.g., np(3,5) += 0.3)


46

Beware double-counting!

rule of program:                                n(I,K) += n(I,J) * n(J,K)
chart of derived items with current values:     n(5,5) = 0.2
agenda of pending updates:                      n(5,5) += 0.3

Popping the update n(5,5) += 0.3 triggers the query n(5,K)?, which matches n(5,5) itself:
an epsilon constituent combining with itself to make another copy of itself.


47

Parameter training

  Maximize some objective function.
  Use Dyna to compute the function.
  Then how do you differentiate it?
    … for gradient ascent, conjugate gradient, etc.
    … gradient also tells us the expected counts for EM!

  model parameters (and input sentence) as axiom values
  objective function as a theorem’s value
    e.g., inside algorithm computes likelihood of the sentence

  Two approaches:
    Program transformation – automatically derive the “outside” formulas.
    Back-propagation – run the agenda algorithm “backwards.”

works even with pruning, early stopping, etc.

DynaMITE: training toolkit


48

What can Dyna do beyond CKY?


49

Some examples from my lab … Parsing using …

factored dependency models (Dreyer, Smith, & Smith CONLL’06) with annealed risk minimization (Smith and Eisner EMNLP’06)

constraints on dependency length (Eisner & Smith IWPT’05) unsupervised learning of deep transformations (see Eisner EMNLP’02) lexicalized algorithms (see Eisner & Satta ACL’99, etc.)

Grammar induction using … partial supervision (Dreyer & Eisner EMNLP’06) structural annealing (Smith & Eisner ACL’06) contrastive estimation (Smith & Eisner GIA’05) deterministic annealing (Smith & Eisner ACL’04)

Machine translation using … Very large neighborhood search of permutations (Eisner & Tromble, NAACL-W’06) Loosely syntax-based MT (Smith & Eisner in prep.) Synchronous cross-lingual parsing (Smith & Smith EMNLP’04)

Finite-state methods for morphology, phonology, IE, even syntax … Unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06) Unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05) Context-based morph. disambiguation (Smith, Smith & Tromble EMNLP’05) Trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …) Finite-state machines with very large alphabets (see Eisner ACL’97) Finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)

Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04)

Easy to try stuff out!

Programs are very short & easy to change! (see also Eisner ACL’03)


50

Can it express everything in NLP? Remember, it integrates tightly with C++,

so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful.

We’re currently extending the class of allowed formulas “beyond the semiring” cf. Goodman (1999) will be able to express smoothing, neural nets, etc.

Of course, it is Turing complete …


51

Smoothing in Dyna

mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).   % context

smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z)

+ (1-lambda)*mle_prob(Y,Z). % for arbitrary n-grams, can use lists

count_count(N) += 1 whenever N is count(Anything).

% updates automatically during leave-one-out jackknifing


52

Information retrieval in Dyna

score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word).

idf(Word) = 1/log(df(Word)).
df(Word) += 1 whenever tf(Doc,Word) > 0.
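A plain-Python sketch of the same scoring, with invented document and query counts; the guard that skips words appearing in only one document (to avoid log(1) = 0) is ours, not part of the Dyna program:

import math
from collections import defaultdict

tf = {("d1", "zebra"): 3, ("d1", "yeti"): 1,           # tf(Doc,Word)
      ("d2", "zebra"): 1,
      ("query", "zebra"): 1, ("query", "yeti"): 2}     # tf(Query,Word)

# df(Word) += 1 whenever tf(Doc,Word) > 0.
df = defaultdict(int)
for (doc, word), count in tf.items():
    if doc != "query" and count > 0:
        df[word] += 1

# idf(Word) = 1/log(df(Word)).   (skip df == 1 so we never divide by log(1) = 0)
idf = {w: 1.0 / math.log(n) for w, n in df.items() if n > 1}

# score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word).
score = defaultdict(float)
for (doc, word), count in tf.items():
    if doc != "query" and word in idf:
        score[doc] += count * tf[("query", word)] * idf[word]

print(dict(score))    # {'d1': ~4.33, 'd2': ~1.44}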


53

Neural networks in Dyna

out(Node) = sigmoid(in(Node)).
in(Node) += input(Node).
in(Node) += weight(Node,Kid)*out(Kid).
error += (out(Node)-target(Node))**2 if ?target(Node).

Recurrent neural net is ok

(figure: a small network with inputs x1–x4, hidden nodes h1–h3, and output y)
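A small Python reading of these four rules on an invented two-input, one-hidden-node network; the weights, inputs, and target are hypothetical, and Dyna would determine the evaluation order itself:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

inputs  = {"x1": 0.5, "x2": -1.0}                     # input(Node)
weights = {("h1", "x1"): 2.0, ("h1", "x2"): 1.0,      # weight(Node,Kid)
           ("y", "h1"): 3.0}
targets = {"y": 1.0}                                  # target(Node)

out = {}
for node in ["x1", "x2", "h1", "y"]:                  # a topological order of the net
    in_val = inputs.get(node, 0.0)                    # in(Node) += input(Node).
    for (n, kid), w in weights.items():
        if n == node:
            in_val += w * out[kid]                    # in(Node) += weight(Node,Kid)*out(Kid).
    out[node] = sigmoid(in_val)                       # out(Node) = sigmoid(in(Node)).

error = sum((out[n] - t) ** 2 for n, t in targets.items())   # error += (out-target)**2 if ?target.
print(out["y"], error)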


54

Game-tree analysis in Dyna

goal = best(Board) if start(Board).

best(Board) max= stop(player1, Board).
best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).

worst(Board) min= stop(player2, Board).
worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
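The same max=/min= recursion written as two mutually recursive Python functions over an invented little game tree; leaf payoffs stand in for the stop(...) values, and move costs are taken to be 0:

tree = {"root": ["a", "b"],        # player1 moves from root
        "a": ["a1", "a2"],         # player2 moves from a
        "b": ["b1", "b2"]}
leaf = {"a1": 3, "a2": -1, "b1": 0, "b2": 5}

def best(board):                   # best(Board) max= ... worst(NewBoard)
    if board in leaf:
        return leaf[board]
    return max(worst(nb) for nb in tree[board])

def worst(board):                  # worst(Board) min= ... best(NewBoard)
    if board in leaf:
        return leaf[board]
    return min(best(nb) for nb in tree[board])

print(best("root"))    # max(min(3, -1), min(0, 5)) = 0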


55

Weighted FST composition in Dyna (epsilon-free case)

:- bool item = false.

start(A o B, Q x R) |= start(A, Q) & start(B, R).
stop(A o B, Q x R)  |= stop(A, Q) & stop(B, R).
arc(A o B, Q1 x R1, Q2 x R2, In, Out)
    |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).

Inefficient? How do we fix this?


56

Constraint programming (arc consistency)

:- bool indomain = false.
:- bool consistent = true.

variable(Var) |= indomain(Var:Val).
possible(Var:Val) &= indomain(Var:Val).
possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).


57

Edit distance in Dyna: version 1

letter1(“c”,0,1). letter1(“l”,1,2). letter1(“a”,2,3). …   % clara
letter2(“c”,0,1). letter2(“a”,1,2). letter2(“c”,2,3). …   % caca
end1(5). end2(4).
delcost := 1. inscost := 1. substcost := 1.

align(0,0) = 0.

align(I1,J2) min= align(I1,I2) + letter2(L2,I2,J2) + inscost(L2).
align(J1,I2) min= align(I1,I2) + letter1(L1,I1,J1) + delcost(L1).
align(J1,J2) min= align(I1,I2) + letter1(L1,I1,J1) + letter2(L2,I2,J2) + subcost(L1,L2).
align(J1,J2) min= align(I1,I2) + letter1(L,I1,J1) + letter2(L,I2,J2).

goal = align(N1,N2) whenever end1(N1) & end2(N2).

align(I1,I2): cost of the best alignment of the first I1 characters of string 1 with the first I2 characters of string 2.

Insertion rule: next letter is L2; add it to string 2 only.

Last rule: same L on both sides; free move!
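The prefix-based recurrence of version 1 is just the classic edit-distance DP; here it is as a short Python function with unit costs — a conventional rendering for comparison, not the Dyna solver itself:

def edit_distance(s1, s2, delcost=1, inscost=1, subcost=1):
    n1, n2 = len(s1), len(s2)
    align = [[float("inf")] * (n2 + 1) for _ in range(n1 + 1)]
    align[0][0] = 0                                            # align(0,0) = 0.
    for i1 in range(n1 + 1):
        for i2 in range(n2 + 1):
            a = align[i1][i2]
            if i2 < n2:                                        # insert next letter of string 2
                align[i1][i2 + 1] = min(align[i1][i2 + 1], a + inscost)
            if i1 < n1:                                        # delete next letter of string 1
                align[i1 + 1][i2] = min(align[i1 + 1][i2], a + delcost)
            if i1 < n1 and i2 < n2:                            # substitute, or free move on a match
                step = 0 if s1[i1] == s2[i2] else subcost
                align[i1 + 1][i2 + 1] = min(align[i1 + 1][i2 + 1], a + step)
    return align[n1][n2]                                       # goal = align(N1,N2)

print(edit_distance("clara", "caca"))    # 2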


58

Edit distance in Dyna: version 2

input([“c”, “l”, “a”, “r”, “a”], [“c”, “a”, “c”, “a”]) := 0.
delcost := 1. inscost := 1. substcost := 1.

alignupto(Xs,Ys) min= input(Xs,Ys).
alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
goal min= alignupto([], []).

Xs and Ys are still-unaligned suffixes. This item’s value is supposed to be the cost of aligning everything up to but not including them.

How about different costs for different letters?


59

input([“c”, “l”, “a”, “r”, “a”], [“c”, “a”, “c”, “a”]) := 0. delcost := 1. inscost := 1. substcost := 1.

alignupto(Xs,Ys) min= input(Xs,Ys). alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost. alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost. alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys])+substcost. alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]). goal min= alignupto([], []).

Edit distance in Dyna: version 2

Xs and Ys are still-unaligned suffixes.This item’s value is supposed to be cost ofaligning everything up to but not including them.

The changes: delcost(X), inscost(Y), and substcost(X,Y) now depend on the letters involved, and the free-move rule becomes … + nocost(L,L).


60

Is it fast enough? (sort of)
  Asymptotically efficient
  4 times slower than Mark Johnson’s inside-outside
  4-11 times slower than Klein & Manning’s Viterbi parser


61

Are you going to make it faster?

(yup!)

  Currently rewriting the term classes to match hand-tuned code
  Will support “mix-and-match” implementation strategies:
    store X in an array
    store Y in a hash
    don’t store Z (compute on demand)
  Eventually, choose strategies automatically by execution profiling


62

Synopsis: your idea → experimental results, fast!

Dyna is a language for computation (no I/O). Especially good for dynamic programming. It tries to encapsulate the black art of NLP.

Much prior work in this vein …
  Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel…
  Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, …
  Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer … (also: efficient Prologish languages)


63

Dyna contributors!

Jason Eisner
Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
Noah A. Smith (parameter training)
Markus Dreyer, David Smith (compiler frontend)
Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
John Blatz (program transformations)
Asheesh Laroia (web services)


64

New examples of dynamic programming in NLP


65

Some examples from my lab … Parsing using …

factored dependency models (Dreyer, Smith, & Smith CONLL’06) with annealed risk minimization (Smith and Eisner EMNLP’06)

constraints on dependency length (Eisner & Smith IWPT’05) unsupervised learning of deep transformations (see Eisner EMNLP’02) lexicalized algorithms (see Eisner & Satta ACL’99, etc.)

Grammar induction using … partial supervision (Dreyer & Eisner EMNLP’06) structural annealing (Smith & Eisner ACL’06) contrastive estimation (Smith & Eisner GIA’05) deterministic annealing (Smith & Eisner ACL’04)

Machine translation using … Very large neighborhood search of permutations (Eisner & Tromble, NAACL-W’06) Loosely syntax-based MT (Smith & Eisner in prep.) Synchronous cross-lingual parsing (Smith & Smith EMNLP’04)

Finite-state methods for morphology, phonology, IE, even syntax … Unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06) Unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05) Context-based morph. disambiguation (Smith, Smith & Tromble EMNLP’05) Trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …) Finite-state machines with very large alphabets (see Eisner ACL’97) Finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)

Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04)

(see also Eisner ACL’03)


66

New examples of dynamic programming in NLP

Parameterized finite-state machines


67

Parameterized FSMs An FSM whose arc probabilities depend on

parameters: they are formulas.

(figure: an FSM whose arc probabilities are formulas — e.g. arcs a/q, b/(1-q)r, a/r, a/q*exp(t+u), a/exp(t+v), and stopping weights p, 1-p, 1-s)


68

(figure: the same FSM with the formulas evaluated to numbers — e.g. arcs a/.2, b/.8, a/.3, a/.44, a/.56, and stopping weights .1, .9, .7)

Parameterized FSMs An FSM whose arc probabilities depend on

parameters: they are formulas.


69

Parameterized FSMs An FSM whose arc probabilities depend on

parameters: they are formulas.

(figure: the parameterized FSM again, with formula-valued arc weights)

Expert first: Construct the FSM (topology & parameterization).

Automatic takes over: Given training data, find parameter values that optimize arc probs.


70

Parameterized FSMs: Knight & Graehl 1997 – transliteration

p(English text)
  o  p(English text → English phonemes)
  o  p(English phonemes → Japanese phonemes)
  o  p(Japanese phonemes → Japanese text)

“/t/ and /d/ are similar …”

Loosely coupled probabilities:

/t/:/tt/   exp p+q+r   (coronal, stop, unvoiced)
/d/:/dd/   exp p+q+s   (coronal, stop, voiced)


71

Parameterized FSMs: Knight & Graehl 1997 – transliteration

p(English text)
  o  p(English text → English phonemes)
  o  p(English phonemes → Japanese phonemes)
  o  p(Japanese phonemes → Japanese text)

“Would like to get some of that expert knowledge in here”

Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) …

If the probabilities are variables (a*x b) +y (ab*z) … then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)


80

New examples of dynamic programming in NLP

Parameterized infinite-state machines


81

Universal grammar as a parameterized FSA over an infinite state space


82

New examples of dynamic programming in NLP

More abuses of finite-state machines


83

Huge-alphabet FSAs for OT phonology

(figure: an underlying form (VCCVC, with a [voi] autosegment on its own tier) and several candidate surface forms proposed by Gen (e.g. CCVC, CCC with [velar], CC C V), each drawn as underlying tiers over surface tiers; Gen proposes all candidates that include this input)


84

Huge-alphabet FSAs for OT phonology

(figure: one candidate — segments CCVC / CC C V with [voi] and [velar] autosegments)

encode this candidate as a string

at each moment, need to describe what’s going on on many tiers


85

Directional Best Paths construction
  Keep “best” output string for each input string
  Yields a new transducer (size 3n)

For input abc: abc  axc
For input abd: axd

Must allow red arc just if next input is d

(figure: a 7-state transducer with arcs a:a, b:b, b:x, c:c, c:c, d:d illustrating the construction)


86

Minimization of semiring-weighted FSAs

Compute λ(q) in O(1) time as soon as we visit q. Whole alg. is linear.

New definition of λ for pushing:
  λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols
  Computation is simple, well-defined, independent of the semiring (K, ⊗)
  Breadth-first search back from final states:  λ(q) = k ⊗ λ(r)  along the chosen arc from q to r with weight k
  Faster than finding the min-weight path à la Mohri.

(figure: a small automaton with arcs labeled a, b, c, d, searched breadth-first back from the final states; a state at distance 2 is marked)


87

New examples of dynamic programming in NLP

Tree-to-tree alignment


Two training trees, showing a free translation from French to English.

Synchronous Tree Substitution Grammar

(figure: the French tree for “beaucoup d’enfants donnent un baiser à Sam” and the English tree for “kids kiss Sam quite often,” with glossed nodes such as enfants (“kids”), beaucoup (“lots”), donnent (“give”), baiser (“kiss”))

“beaucoup d’enfants donnent un baiser à Sam” “kids kiss Sam quite often”


(figure: the same tree pair, now with NP, Adv, Start, and null-adjunct nodes marked)

“beaucoup d’enfants donnent un baiser à Sam”    “kids kiss Sam quite often”

Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.


(figure: the aligned tree pair again, decomposed into “little trees”)

“beaucoup d’enfants donnent un baiser à Sam”    “kids kiss Sam quite often”

Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from “little trees” ...


91

New examples of dynamic programming in NLP

Bilexical parsing in O(n³)

(with Giorgio Satta)


92

Lexicalized CKY

(figure: the sentence “Mary loves the girl outdoors,” bracketed into head-marked constituents)


93

Lexicalized CKY is O(n⁵), not O(n³)

(figure: B over [i, j] headed at h combines with C over [j+1, k] headed at h′ to form A over [i, k] headed at h — O(n³) combinations of spans, O(n⁵) combinations once the two head positions are counted too)

... hug visiting relatives... advocate visiting relatives


94

Idea #1

(figure: B over [i, j] with head h combines with C over [j+1, k] with head h′, forming A over [i, k] headed at h′)

Combine B with what C?

must try different-width C’s (vary k)

must try differently-headed C’s (vary h’)

Separate these!


95

Idea #1

(figure: Idea #1 — the old CKY combination, which varied the width and the head of C at once, is split into two cheaper steps)


96

Idea #2

Some grammars allow

a constituent A headed at h to be split at its head word into a left half over [i, h] and a right half over [h, k] (figure).


97

Idea #2

(figure: B over [i, j] headed at h and C over [j+1, k] headed at h′ combine into A over [i, k] headed at h)

Combine what B and C?

must try different-width C’s (vary k)

must try different midpoints j

Separate these!


98

Idea #2

(figure: Idea #2 — the old CKY combination is again split into two cheaper steps, separating the choice of width from the choice of midpoint)


99

Idea #2

(figure: the same Idea #2 decomposition, drawn with split-head halves)


100

An O(n³) algorithm (with G. Satta)

(figure: the sentence “Mary loves the girl outdoors,” bracketed into split-head constituents)


101

3 parsers: log-log plot

(figure: log-log plot of parsing time vs. sentence length for three parsers — NAIVE, IWPT-97, and ACL-99 — each run in pruned and exhaustive modes)


102

New examples of dynamic programming in NLP

O(n)-time partial parsing by limiting dependency length

(with Noah A. Smith)


Short-Dependency Preference

A word’s dependents (adjuncts, arguments) tend to fall near it in the string.


(figure: an example dependency tree whose arcs have lengths 1, 1, 1, and 3)

length of a dependency ≈ surface distance


(figure: fraction of all dependencies vs. dependency length, plotted for English, Chinese, and German, on linear and log scales)

50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...


Related Ideas

• Score parses based on what’s between a head and child

(Collins, 1997; Zeman, 2004; McDonald et al., 2005)

• Assume short → faster human processing (Church, 1980; Gibson, 1998)

• “Attach low” heuristic for PPs (English)(Frazier, 1979; Hobbs and Bear, 1990)

• Obligatory and optional re-orderings (English)(see paper)


Going to Extremes

(figure: the same length distributions for English, Chinese, and German, on a log scale)

Longer dependencies are less likely.

What if we eliminate them completely?


Hard Constraints

Disallow dependencies between words of distance > b ...

Risk: best parse contrived, or no parse at all!

Solution: allow fragments (partial parsing; Hindle, 1990 inter alia).

Why not model the sequence of fragments?


Building a Vine SBG Parser

Grammar: generates sequence of trees from $

Parser: recognizes sequences of trees without long dependencies

Need to modify training data so the model is consistent with the parser.


(figure: a Penn Treebank sentence — “According to some estimates, the rule changes would cut insider filings by more than a third.” — drawn as a dependency tree rooted at $, with each arc labeled by its length)


(figure: the same sentence with b = 4 — dependencies longer than 4 are cut, and the freed subtrees attach directly to the vine at $)


(figure: the same sentence with b = 3)


(figure: the same sentence with b = 2)


(figure: the same sentence with b = 1)


(figure: the same sentence with b = 0 — every word hangs directly from the vine at $)


Vine Grammar is Regular

• Even for small b, “bunches” can grow to arbitrary size:
• But arbitrary center embedding is out:


119

Linear-time partial parsing:

Limiting dependency length

Finite-state model of the root sequence (e.g. NP  S  NP along the vine)

Bounded dependency length within each chunk (but a chunk could be arbitrarily wide: right- or left-branching)

Natural-language dependencies tend to be short. So even if you don’t have enough data to model what the heads are … … you might want to keep track of where they are.


120

Limiting dependency length Linear-time partial parsing:

Don’t convert into an FSA!
  Less structure sharing
  Explosion of states for different stack configurations
  Hard to get your parse back

Finite-state model of the root sequence (e.g. NP  S  NP along the vine)

Bounded dependency length within each chunk (but a chunk could be arbitrarily wide: right- or left-branching)


121

Limiting dependency length Linear-time partial parsing:

(figure: chunks NP  S  NP along the vine, with one long dependency between pieces crossed out)

Each piece is at most k words wide
No dependencies between pieces
Finite-state model of the sequence
Linear time! O(k²n)


122

Limiting dependency length Linear-time partial parsing:

(figure: the chunks again)

Each piece is at most k words wide
No dependencies between pieces
Finite-state model of the sequence
Linear time! O(k²n)


Parsing Algorithm

• Same grammar constant as Eisner and Satta (1999)

• O(n³) → O(nb²) runtime

• Includes some overhead (low-order term) for constructing the vine – Reality check ... is it worth it?


126

F-measure & runtime of a limited-dependency-length parser (POS seqs)


127

Precision & recall of a limited-dependency-length parser (POS seqs)


142

New examples of dynamic programming in NLP

Grammar induction by initially limiting dependency length

(with Noah A. Smith)


144

Soft bias toward short dependencies

(figure: a δ axis running from -∞ to +∞; δ = 0 is the MLE baseline, and large δ prefers linear structure)

Multiply parse probability by exp(-δS), where S is the total length of all dependencies.

Then renormalize probabilities.


145

Structural Annealing

(figure: the same δ axis, with the MLE baseline at δ = 0)

Start here; train a model.
Increase δ and retrain.
Repeat ...
Until performance stops improving on a small validation dataset.


146

(figure: bar chart of grammar-induction accuracy for German, English, Bulgarian, Mandarin, Turkish, and Portuguese, comparing MLE, contrastive estimation with deletion & transposition neighborhoods, and structural annealing)

Other structural biases can be annealed. We tried annealing on connectivity (# of fragments), and got similar results.


147

A 6/9-Accurate Parse

(figure: two dependency parses of “the gene thus can prevent a plant from fertilizing itself” — the Treebank parse and the parse found by MLE with a locality bias)

Errors of the MLE-with-locality-bias parse:
  preposition misattachment
  misattachment of adverb “thus”
  verb instead of modal as root

These errors look like ones made by a supervised parser in 2000!


148

Accuracy Improvements

language      random tree    Klein & Manning (2004)    Smith & Eisner (2006)    state-of-the-art, supervised
German        27.5%          50.3                      70.0                     82.6¹
English       30.3           41.6                      61.8                     90.9²
Bulgarian     30.4           45.6                      58.4                     85.9¹
Mandarin      22.6           50.1                      57.2                     84.6¹
Turkish       29.8           48.0                      62.4                     69.6¹
Portuguese    30.6           42.3                      71.8                     86.5¹

¹ CoNLL-X shared task, best system.   ² McDonald et al., 2005


149

Combining with Contrastive Estimation

This generally gives us our best results …


150

New examples of dynamic programming in NLP

Contrastive estimation for HMM and grammar induction. Uses lattice parsing …

(with Noah A. Smith)


Contrastive Estimation: Training Log-Linear Models on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

Page 125:

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

Page 126:

Nutshell Version

[Diagram: contrastive estimation with lattice neighborhoods brings together tractable training, unannotated text, "max ent" features, and sequence models.]

Experiments on unlabeled data:
POS tagging: 46% error rate reduction (relative to EM). "Max ent" features make it possible to survive damage to the tag dictionary.
Dependency parsing: 21% attachment error reduction (relative to EM).

Page 127:

“Red leaves don’t hide blue jays.”

Page 128:

Maximum Likelihood Estimation (Supervised)

[Figure: the numerator is the observed pair, x = "red leaves don't hide blue jays" with tags y = JJ NNS MD VB JJ NNS; the denominator sums p over all of Σ* × Λ*, every word sequence paired with every tag sequence.]

Page 129:

Maximum Likelihood Estimation (Unsupervised)

[Figure: the numerator is the observed sentence x = "red leaves don't hide blue jays" with unknown tags (? ? ? ? ? ?), summed over all taggings; the denominator again sums p over all of Σ* × Λ*.]

This is what EM does.

Page 130:

Focusing Probability Mass

[Figure: probability mass being focused onto the numerator set, at the expense of the rest of the denominator set.]

Page 131:

Conditional Estimation (Supervised)

[Figure: the numerator is still the observed pair, "red leaves don't hide blue jays" tagged JJ NNS MD VB JJ NNS; the denominator is now (x) × Λ*, the same word sequence under all possible taggings.]

A different denominator!

Page 132:

Objective Functions

Objective                    Optimization algorithm         Numerator          Denominator
MLE                          Count & Normalize*             tags & words       Σ* × Λ*
MLE with hidden variables    EM*                            words              Σ* × Λ*
Conditional Likelihood       Iterative Scaling              tags & words       (words) × Λ*
Perceptron                   Backprop                       tags & words       hypothesized tags & words
Contrastive Estimation       generic numerical solvers      observed data      ?
                             (in this talk, LMVM L-BFGS)    (in this talk, the raw word
                                                            sequence, summed over all
                                                            possible taggings)

* For generative models.
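One way to read the table: every row picks, for each training example, a numerator event N_i and a denominator event D_i, and then maximizes

    ∏_i  p(N_i) / p(D_i).

The estimators differ only in which events they choose for the numerator and the denominator.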

Page 133:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 134:

Language Learning (Syntax)

"red leaves don't hide blue jays"   (At last! My own language learning device!)

EM asks: Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words?

Another question: Why did he pick that sequence for those words? Why not say "leaves red ..." or "... hide don't ..." or ...?

Page 135:

What is a syntax model supposed to explain?

Each learning hypothesis corresponds to a denominator / neighborhood.

Page 136:

The Job of Syntax: "Explain why each word is necessary."  →  DEL1WORD neighborhood

red leaves don't hide blue jays
leaves don't hide blue jays
red don't hide blue jays
red leaves hide blue jays
red leaves don't blue jays
red leaves don't hide jays
red leaves don't hide blue

Page 137:

The Job of Syntax: "Explain the (local) order of the words."  →  TRANS1 neighborhood

red leaves don't hide blue jays
leaves red don't hide blue jays
red leaves hide don't blue jays
red don't leaves hide blue jays
red leaves don't hide jays blue
red leaves don't blue hide jays

Page 138:

[Figure: the numerator is the observed sentence "red leaves don't hide blue jays", summed over all taggings (? ? ? ? ? ?); the denominator is every sentence in its TRANS1 neighborhood, each likewise summed over all taggings.]

Page 139:

[Figure: the same numerator and denominator, but the TRANS1 neighborhood (with any tagging) is now packed into a word lattice instead of being listed sentence by sentence.]

www.dyna.org   (shameless self-promotion)

Page 140:

The New Modeling Imperative

[Figure: the observed sentence is the numerator; its neighborhood of perturbed sentences is the denominator.]

"Make the good sentence likely, at the expense of those bad neighbors." A good sentence hints that a set of bad ones is nearby.

Page 141:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 142:

Log-Linear Models

Each (x, y) gets a score; the partition function Z normalizes it. Computing Z is undesirable: it sums over all possible taggings of all possible sentences!

Conditional Estimation (Supervised): sum over 1 sentence instead.
Contrastive Estimation (Unsupervised): sum over a few sentences.
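Schematically (writing u_θ(x, y) = exp(θ · f(x, y)) for the unnormalized log-linear score and N(x_i) for the neighborhood of the i-th observed sentence), contrastive estimation maximizes

    Σ_i  log [ Σ_y u_θ(x_i, y)  /  Σ_{x′ ∈ N(x_i)} Σ_y u_θ(x′, y) ].

The full partition function Z never has to be computed: the denominator sums over only the few sentences in N(x_i) and their taggings, which the lattice representation keeps tractable.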

Page 143:

A Big Picture: Sequence Model Estimation

[Diagram: sequence-model estimation methods arranged by three desiderata: tractable sums, overlapping features, and use of unannotated data.
  generative, MLE: p(x, y)
  log-linear, conditional estimation: p(y | x)
  generative, EM: p(x)
  log-linear, MLE: p(x, y)
  log-linear, EM: p(x)
  log-linear, CE with lattice neighborhoods (the combination this talk argues for)]

Page 144:

Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.

• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Page 145:

Neighborhoods

neighborhood       size    lattice arcs   perturbations
DEL1WORD           n+1     O(n)           delete up to 1 word
TRANS1             n       O(n)           transpose any bigram
DELORTRANS1        O(n)    O(n)           delete up to 1 word or transpose any bigram (DEL1WORD ∪ TRANS1)
DEL1SUBSEQUENCE    O(n²)   O(n²)          delete any contiguous subsequence
Σ* (EM)            ∞       -              replace each word with anything
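As a toy illustration only (the actual system packs each neighborhood into a finite-state lattice rather than listing sentences), the two O(n)-sized neighborhoods in the table can be enumerated directly; the function names below are invented for this sketch:

    # Toy sketch: list the DEL1WORD and TRANS1 neighborhoods of a sentence.
    # The real system instead encodes each neighborhood as a lattice with O(n) arcs.
    def del1word(words):
        """The sentence itself plus every sentence with one word deleted: n+1 strings."""
        yield list(words)
        for i in range(len(words)):
            yield words[:i] + words[i + 1:]

    def trans1(words):
        """The sentence itself plus every adjacent transposition: n strings."""
        yield list(words)
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            yield swapped

    sentence = "red leaves don't hide blue jays".split()
    assert len(list(del1word(sentence))) == len(sentence) + 1   # n + 1
    assert len(list(trans1(sentence))) == len(sentence)         # n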

Page 146:

The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (which tells all possible tags for each word type), learn to tag. The dictionary is a form of supervision.

Page 147:

Trigram Tagging Model

red leaves don't hide blue jays
JJ   NNS   MD   VB   JJ   NNS

feature set: tag trigrams; tag/word pairs from a POS dictionary

Page 148:

[Bar chart: tagging accuracy on ambiguous words. Unsupervised conditions: random baseline (35.1), EM (Merialdo, 1994), DA (Smith & Eisner, 2004; ≈ log-linear EM), and CE with the DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH neighborhoods; the best neighborhoods reach about 79. Supervised HMM and CRF trained on 10× the data score 97.2 and 99.5.]

Setup: 96K words; full POS dictionary; uninformative initializer; best of 8 smoothing conditions.

Page 149:

What if we damage the POS dictionary?

[Bar chart: tagging accuracy on all words for DELORTRANS1, LENGTH, EM, and a random baseline as the tag dictionary is degraded: built from all words, from the words in the 1st half of the corpus, from words with count ≥ 2, and from words with count ≥ 3. The dictionary excludes OOV words, which can then receive any tag. Accuracy falls for every method as the dictionary degrades.]

Setup: 96K words; 17 coarse POS tags; uninformative initializer.

Page 150:

Trigram Tagging Model + Spelling

red leaves don't hide blue jays
JJ   NNS   MD   VB   JJ   NNS

feature set: tag trigrams; tag/word pairs from a POS dictionary; spelling features (1- to 3-character suffixes, contains a hyphen, contains a digit)
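For concreteness, here is a toy sketch, not the authors' code, of counting the feature set named on this slide for one tagged sentence; the feature-name strings are invented, and in the real model the tag/word features are restricted to pairs licensed by the POS dictionary:

    # Toy sketch: count tag-trigram, tag/word, and spelling features for one sentence.
    from collections import Counter

    def features(words, tags):
        f = Counter()
        padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
        for i in range(len(padded) - 2):
            f["trigram=" + "_".join(padded[i:i + 3])] += 1          # tag trigrams
        for w, t in zip(words, tags):
            f["tagword=%s_%s" % (t, w)] += 1                        # tag/word pairs
            for k in (1, 2, 3):
                f["suffix%d=%s_%s" % (k, t, w[-k:])] += 1           # 1- to 3-character suffixes
            if "-" in w:
                f["hyphen=" + t] += 1                               # word contains a hyphen
            if any(c.isdigit() for c in w):
                f["digit=" + t] += 1                                # word contains a digit
        return f

    print(features("red leaves don't hide blue jays".split(),
                   ["JJ", "NNS", "MD", "VB", "JJ", "NNS"]))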

Page 151:

[Bar chart: the dictionary-damage experiment repeated with spelling features added, comparing DELORTRANS1, LENGTH, EM, and random against DELORTRANS1 + spelling and LENGTH + spelling across the same four dictionary conditions.]

Log-linear spelling features aided recovery ... but only with a smart neighborhood.

Page 152:

The model need not be finite-state.

Page 153:

Unsupervised Dependency Parsing

[Bar chart: directed-attachment accuracy for EM, LENGTH, and TRANS1 under two initializers (clever vs. uninformative), with the Klein & Manning (2004) result marked for reference; values range from 23.6 to 48.7.]

Page 154:

To Sum Up ...

Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both).

Now we can use the task to guide the unsupervised learner (as discriminative techniques do for supervised learners).

It is a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.

Page 155: