1
Declarative Specification of NLP Systems
Jason Eisner
IBM, May 2006
student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
2
An Anecdote from ACL’05
-Michael Jordan
3
An Anecdote from ACL’05
“Just draw a model that actually makes sense for your problem. Just do Gibbs sampling. Um, it’s only 6 lines in Matlab…”
– Michael Jordan
4
Conclusions to draw from that talk
1. Mike & his students are great.
2. Graphical models are great. (because they’re flexible)
3. Gibbs sampling is great. (because it works with nearly any graphical model)
4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
6
Parts of it already are …
• Language modeling
• Binary classification (e.g., SVMs)
• Finite-state transductions
• Linear-chain graphical models
Toolkits available; you don’t have to be an expert.

But other parts aren’t …
• Context-free and beyond
• Machine translation
Efficient parsers and MT systems are complicated and painful to write.
7
This talk: A toolkit that’s general enough for these cases.
(stretches from finite-state to Turing machines)
“Dyna”
8
Warning
Lots more beyond this talk
see the EMNLP’05 and FG’06 papers
see http://dyna.org (download + documentation)
sign up for updates by email
wait for the totally revamped next version
9
the case for Little Languages
declarative programming
small is beautiful
10
Sapir-Whorf hypothesis
Language shapes thought
• At least, it shapes conversation
Computer language shapes thought
• At least, it shapes experimental research
• Lots of cute ideas that we never pursue
• Or if we do pursue them, it takes 6-12 months to implement on large-scale data
Have we turned into a lab science?
11
Declarative Specifications
State what is to be done
(How should the computer do it? Turn that over to a general “solver” that handles the specification language.)
Hundreds of domain-specific “little languages” out there. Some have sophisticated solvers.
12
dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  // nodes
  "node0" [label = "<f0> 0x10ba8| <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380| <f1> | <f2> |-1", shape = "record"];
  …
  // edges
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
  …
}
What’s the hard part? Making a nice layout! Actually, it’s NP-hard …
13
dot (www.graphviz.org)
14
LilyPond (www.lilypond.org)
c4
<<c4 d4 e4>>
{ f4 <<c4 d4 e4>> }
<< g2 \\ { f4 <<c4 d4 e4>> } >>
15
LilyPond (www.lilypond.org)
16
Declarative Specs in NLP
• Regular expression (for a FST toolkit)
• Grammar (for a parser)
• Feature set (for a maxent distribution, SVM, etc.)
• Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk:
Sometimes it’s best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.
17
Declarative Specs in NLP
Existing toolkits:
• Regular expression (for a FST toolkit)
• Grammar (for a parser)
• Feature set (for a maxent distribution, SVM, etc.)
Not always flexible enough … Need to open up the parser and rejigger it. Declarative specification of algorithms.
Not always flexible enough … Need to open up the learner and rejigger it. Declarative specification of objective functions.
New toolkit
18
Declarative Specification of Algorithms
19
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
20
Wait a minute …
Didn’t I just implement something like this last month?
• chart management / indexing
• cache-conscious data structures
• prioritization of partial solutions (best-first, A*)
• parameter management
• inside-outside formulas
• different algorithms for training and decoding
• conjugate gradient, annealing, ...
• parallelization?
I thought computers were supposed to automate drudgery
21
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
pseudocode (execution order)
PCFG
Dyna language specifies these equations.
Most programs just need to compute some values from other values. Any order is ok.
Some programs also need to update the outputs if the inputs change:
• spreadsheets, makefiles, email readers
• dynamic graph algorithms
• EM and other iterative optimization
• leave-one-out training of smoothing params
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
22
How you build a system (“big picture” slide)
cool model
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
PCFG
Compilation strategies (we’ll come back to this)
tuned C++ implementation
(data structures, etc.)
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
23
Writing equations in Dyna
int a.
a = b * c.
  a will be kept up to date if b or c changes.
b += x.
b += y.
  equivalent to b = x+y.  b is a sum of two variables. Also kept up to date.
c += z(1).  c += z(2).  c += z(3).  c += z(“four”).  c += z(foo(bar,5)).
  c is a sum of all nonzero z(…) values. At compile time, we don’t know how many!
c += z(N).
  a “pattern”: the capitalized N matches anything
24
More interesting use of patterns
a = b * c.                     scalar multiplication
a(I) = b(I) * c(I).            pointwise multiplication
a += b(I) * c(I).              dot product; could be sparse
                               means a = the sum of b(I)*c(I) over all I,
                               e.g. ... + b(“yetis”)*c(“yetis”) + b(“zebra”)*c(“zebra”)
                               (sparse dot product of query & document)
a(I,K) += b(I,J) * c(J,K).     matrix multiplication; could be sparse
                               J is free on the right-hand side, so we sum over it
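For concreteness, a minimal Python sketch of the last two rules over sparse dictionaries; the toy vectors and matrices here are invented, and Dyna's chart maintenance is ignored.

from collections import defaultdict

# a += b(I) * c(I): sparse dot product, summing over the shared free variable I.
def dot(b_vec, c_vec):
    return sum(v * c_vec[i] for i, v in b_vec.items() if i in c_vec)

# a(I,K) += b(I,J) * c(J,K): sparse matrix multiplication, summing over the free variable J.
def matmul(b_mat, c_mat):
    a = defaultdict(float)
    for (i, j), bv in b_mat.items():
        for (j2, k), cv in c_mat.items():
            if j == j2:                      # J must take the same value in both factors
                a[(i, k)] += bv * cv
    return dict(a)

query = {"yetis": 2.0, "zebra": 1.0}         # invented sparse vectors
doc   = {"yetis": 0.5, "zebra": 3.0, "aardvark": 7.0}
print(dot(query, doc))                        # 4.0

b = {("doc1", "yetis"): 2.0, ("doc1", "zebra"): 1.0}
c = {("yetis", "q"): 0.5, ("zebra", "q"): 3.0}
print(matmul(b, c))                           # {('doc1', 'q'): 4.0}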
25
By now you may see what we’re up to!
Prolog has Horn clauses:
a(I,K) :- b(I,J) , c(J,K).
Dyna has “Horn equations”:
a(I,K) += b(I,J) * c(J,K).
Dyna vs. Prolog
Each item has a value, e.g., a real number, with a definition from other values.
Like Prolog:
• Allow nested terms
• Syntactic sugar for lists, etc.
• Turing-complete
Unlike Prolog:
• Charts, not backtracking!
• Compile efficient C++ classes
• Integrates with your C++ code
26
The CKY inside algorithm in Dyna:
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).

C++ driver: put in axioms (values not defined by the above program); the theorem pops out.
using namespace cky;
chart c;
c[rewrite(“s”,”np”,”vp”)] = 0.7;
c[word(“Pierre”,0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses
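As a cross-check of what those three rules compute, here is a minimal Python version of the inside algorithm; the toy grammar and sentence are invented, and the real system's chart, agenda, and C++ integration are not modeled.

from collections import defaultdict

def inside(words, unary, binary):
    """Return the total weight of all parses of `words` rooted in "s"."""
    n = len(words)
    constit = defaultdict(float)
    # constit(X,I,J) += word(W,I,J) * rewrite(X,W).
    for i, w in enumerate(words):
        for (x, rhs), p in unary.items():
            if rhs == w:
                constit[(x, i, i + 1)] += p
    # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for mid in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    constit[(x, i, k)] += constit[(y, i, mid)] * constit[(z, mid, k)] * p
    # goal += constit("s",0,N) if length(N).
    return constit[("s", 0, n)]

unary  = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}   # invented axioms
binary = {("s", "np", "vp"): 0.7}
print(inside(["Pierre", "sleeps"], unary, binary))         # 0.7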
27
visual debugger – browse the proof forest (showing ambiguity and shared substructure)
28
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
29
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
Viterbi parsing: just change each += above to max=.
30
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
To work in the logarithmic domain, also change each * to + and each += to log+= (or max= for Viterbi).
31
Lattice parsing: replace integer string positions by lattice states, e.g. c[ word(“Pierre”, 0, 1) ] = 1 becomes c[ word(“Pierre”, state(5), state(9)) ] = 0.2.
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
[Figure: a word lattice with states 5, 8, 9 and weighted arcs Pierre/0.2, P/0.5, air/0.3.]
32
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Just add words one at a time to the chart. Check at any time what can be derived from the words so far.
Similarly, dynamic grammars
33
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Again, no change to the Dyna program
34
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Basically, just add extra arguments to the terms above
35
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
36
Earley’s algorithm in Dyna
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
need(“s”,0) = true.
need(Nonterm,J) |= ?constit(_/[Nonterm|_],_,J).
constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) if need(Nonterm,I).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[W|Needed],I,J) * word(W,J,K).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[X|Needed],I,J) * constit(X/[],J,K).
goal += constit(“s”/[],0,N) if length(N).
magic templates transformation (as noted by Minnen 1996)
37
pseudocode (execution order)
Program transformations
tuned C++ implementation
(data structures, etc.)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
Blatz & Eisner (FG 2006):
Lots of equivalent ways to write a system of equations!
Transforming from one to another may improve efficiency.
Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!
cool model
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
PCFG
38
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
39
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).
folding transformation: asymp. speedup!
[Figure: the unbinarized rule builds X over (I,J) directly from Y over (I,Mid), Z over (Mid,J), and rewrite(X,Y,Z); after binarization, X\Y over (Mid,J) is built first from Z and the rule, and is then combined with Y over (I,Mid) to give X over (I,J).]
40
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).
folding transformation: asymp. speedup!
Σ_{Y,Z,Mid} constit(Y,I,Mid) · constit(Z,Mid,J) · rewrite(X,Y,Z)
  = Σ_{Y,Mid} constit(Y,I,Mid) · ( Σ_Z constit(Z,Mid,J) · rewrite(X,Y,Z) )
graphical models / constraint programming / multi-way database join
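A small Python sketch of the distributive-law rewrite above, on invented random weights: the folded version first sums out Z into a temporary item (playing the role of constit(X\Y,Mid,J)) and gets the same totals with a shallower loop nest; applied across all spans, that is the source of the asymptotic speedup.

import itertools, random

random.seed(0)
NT = ["s", "np", "vp", "pp"]                                    # toy nonterminals
rewrite = {xyz: random.random() for xyz in itertools.product(NT, repeat=3)}
constit = {(y, i, j): random.random() for y in NT for i in range(4) for j in range(4) if i < j}
I, J = 0, 3

# Unfolded rule: sum over Y, Z, and Mid together.
unfolded = {x: sum(constit[(y, I, m)] * constit[(z, m, J)] * rewrite[(x, y, z)]
                   for y in NT for z in NT for m in range(I + 1, J))
            for x in NT}

# Folded: temp[(x, y, m)] ~ constit(X\Y, Mid, J); Z is summed out once, then reused.
temp = {(x, y, m): sum(constit[(z, m, J)] * rewrite[(x, y, z)] for z in NT)
        for x in NT for y in NT for m in range(I + 1, J)}
folded = {x: sum(constit[(y, I, m)] * temp[(x, y, m)] for y in NT for m in range(I + 1, J))
          for x in NT}

assert all(abs(unfolded[x] - folded[x]) < 1e-9 for x in NT)     # same values either way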
41
More program transformations
Examples that add new semantics:
• Compute gradient (e.g., derive outside algorithm from inside)
• Compute upper bounds for A* (e.g., Klein & Manning ACL’03)
• Coarse-to-fine (e.g., Johnson & Charniak NAACL’06)
Examples that preserve semantics:
• On-demand computation – by analogy with Earley’s algorithm
  - On-the-fly composition of FSTs
  - Left-corner filter for parsing
• Program specialization as unfolding – e.g., compile out the grammar
• Rearranging computations – by analogy with categorial grammar
  - Folding reinterpreted as slashed categories
  - “Speculative computation” using slashed categories: abstract away repeated computation to do it once only – by analogy with unary rule closure or epsilon-closure; derives the Eisner & Satta ACL’99 O(n³) bilexical parser
42
Propagate updates from right-to-left through the equations.
a.k.a. “agenda algorithm”, “forward chaining”, “bottom-up inference”, “semi-naïve bottom-up”
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
use a general method
43
Bottom-up inference
rules of program:  s(I,K) += np(I,J) * vp(J,K)    pp(I,K) += prep(I,J) * np(J,K)
chart of derived items with current values:  np(3,5) = 0.1, vp(5,7) = 0.7, vp(5,9) = 0.5, prep(2,3) = 1.0, …
agenda of pending updates:  np(3,5) += 0.3
We updated np(3,5) (to 0.1+0.3 = 0.4); what else must therefore change? If np(3,5) hadn’t been in the chart already, we would have added it. Match the update against the rules: the query vp(5,K)? yields s(3,7) += 0.21 and s(3,9) += 0.15 (then no more matches to this query); the query prep(I,3)? yields pp(2,5) += 0.3.
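A minimal Python sketch of this agenda loop for the two rules shown; the chart contents are copied from the diagram, and real Dyna uses indexing and a priority queue rather than the linear scans here.

from collections import defaultdict

chart = defaultdict(float, {("np", 3, 5): 0.1, ("vp", 5, 7): 0.7,
                            ("vp", 5, 9): 0.5, ("prep", 2, 3): 1.0})
agenda = [(("np", 3, 5), 0.3)]                     # pending update: np(3,5) += 0.3

while agenda:
    (label, i, j), delta = agenda.pop()
    chart[(label, i, j)] += delta                  # apply the update (adds the item if it was absent)
    if label == "np":
        for (l2, a, b), v in list(chart.items()):
            if l2 == "vp" and a == j:              # s(I,K) += np(I,J) * vp(J,K): query vp(j,K)?
                agenda.append((("s", i, b), delta * v))
            if l2 == "prep" and b == i:            # pp(I,K) += prep(I,J) * np(J,K): query prep(I,j)?
                agenda.append((("pp", a, j), v * delta))
    if label == "vp":
        for (l2, a, b), v in list(chart.items()):
            if l2 == "np" and b == i:              # an np ending where this vp starts
                agenda.append((("s", a, j), v * delta))
    # (a fuller version would also match updates to prep, s, pp against the rules)

print({k: round(v, 2) for k, v in chart.items()})
# np(3,5)=0.4, s(3,7)=0.21, s(3,9)=0.15, pp(2,5)=0.3, as in the diagram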
44
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
How you build a system (“big picture” slide)
cool model
practical equations
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
What’s going on under the hood?
tuned C++ implementation (data structures, etc.)
45
Compiler provides …
• copy, compare, & hash terms fast, via integerization (interning)
• efficient storage of terms (use native C++ types, “symbiotic” storage, garbage collection, serialization, …)
• automatic indexing for O(1) lookup (e.g., the query vp(5,K)?)
• hard-coded pattern matching for the rules of the program (e.g., s(I,K) += np(I,J) * vp(J,K))
• an efficient priority queue for the agenda of pending updates
• a chart of derived items with current values (e.g., np(3,5) += 0.3)
46
Beware double-counting!
rule of program:  n(I,K) += n(I,J) * n(J,K)
chart: n(5,5) = 0.2, …    agenda: n(5,5) += 0.3
An epsilon constituent n(5,5) can combine with itself to make another copy of itself: the query n(5,K)? matches the very item being updated, so the pending update n(5,5) += ? must be handled carefully.
47
Parameter training
Maximize some objective function. Use Dyna to compute the function. Then how do you differentiate it?
… for gradient ascent, conjugate gradient, etc.
… gradient also tells us the expected counts for EM!
Model parameters (and the input sentence) go in as axiom values; the objective function comes out as a theorem’s value (e.g., the inside algorithm computes the likelihood of the sentence).
Two approaches:
• Program transformation – automatically derive the “outside” formulas.
• Back-propagation – run the agenda algorithm “backwards.” (works even with pruning, early stopping, etc.)
DynaMITE: training toolkit
48
What can Dyna do beyond CKY?
49
Some examples from my lab …
Parsing using …
• factored dependency models (Dreyer, Smith, & Smith CONLL’06)
• with annealed risk minimization (Smith & Eisner EMNLP’06)
• constraints on dependency length (Eisner & Smith IWPT’05)
• unsupervised learning of deep transformations (see Eisner EMNLP’02)
• lexicalized algorithms (see Eisner & Satta ACL’99, etc.)
Grammar induction using …
• partial supervision (Dreyer & Eisner EMNLP’06)
• structural annealing (Smith & Eisner ACL’06)
• contrastive estimation (Smith & Eisner GIA’05)
• deterministic annealing (Smith & Eisner ACL’04)
Machine translation using …
• very large neighborhood search of permutations (Eisner & Tromble NAACL-W’06)
• loosely syntax-based MT (Smith & Eisner, in prep.)
• synchronous cross-lingual parsing (Smith & Smith EMNLP’04)
Finite-state methods for morphology, phonology, IE, even syntax …
• unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06)
• unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05)
• context-based morphological disambiguation (Smith, Smith & Tromble EMNLP’05)
• trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …)
• finite-state machines with very large alphabets (see Eisner ACL’97)
• finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)
Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04; see also Eisner ACL’03)
Easy to try stuff out! Programs are very short & easy to change!
50
Can it express everything in NLP? Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful.
We’re currently extending the class of allowed formulas “beyond the semiring” (cf. Goodman 1999); this will be able to express smoothing, neural nets, etc.
Of course, it is Turing complete …
51
Smoothing in Dyna
mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).   % (X,Y) is the context
smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
  % for arbitrary n-grams, can use lists
count_count(N) += 1 whenever N is count(Anything).
  % updates automatically during leave-one-out jackknifing
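In ordinary Python, the interpolation above looks like this; the counts and lambda are invented, and the jackknifing and list-based n-grams are left out.

from collections import Counter

trigrams = Counter({("red", "leaves", "fall"): 3, ("red", "leaves", "hide"): 1})
bigrams  = Counter({("red", "leaves"): 4, ("leaves", "fall"): 3, ("leaves", "hide"): 1})
unigrams = Counter({("leaves",): 4})
lam = 0.75

def mle_prob3(x, y, z):            # mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y)
    return trigrams[(x, y, z)] / bigrams[(x, y)]

def mle_prob2(y, z):               # backoff estimate from the shorter context
    return bigrams[(y, z)] / unigrams[(y,)]

def smoothed_prob(x, y, z):        # lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z)
    return lam * mle_prob3(x, y, z) + (1 - lam) * mle_prob2(y, z)

print(smoothed_prob("red", "leaves", "hide"))   # 0.75*0.25 + 0.25*0.25 = 0.25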
52
Information retrieval in Dyna
score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word).
idf(Word) = 1/log(df(Word)).
df(Word) += 1 whenever tf(Doc,Word) > 0.
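The same scoring rule in a few lines of Python over an invented toy collection; the df > 1 guard is only there because 1/log(df) is undefined for a word that appears in a single document of this tiny collection.

import math
from collections import Counter

docs = {"d1": "blue leaves hide blue jays",
        "d2": "blue jays sing",
        "d3": "red leaves fall"}
query = "blue jays"

tf  = {d: Counter(text.split()) for d, text in docs.items()}
qtf = Counter(query.split())
df  = Counter(w for counts in tf.values() for w in counts)   # df(Word) += 1 whenever tf(Doc,Word) > 0

def idf(w):
    return 1.0 / math.log(df[w])                              # idf(Word) = 1/log(df(Word))

def score(d):                                                 # score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word)
    return sum(tf[d][w] * qtf[w] * idf(w) for w in qtf if df[w] > 1)

print({d: round(score(d), 2) for d in docs})                  # d1 ≈ 4.33, d2 ≈ 2.89, d3 = 0.0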
53
Neural networks in Dyna
out(Node) = sigmoid(in(Node)).
in(Node) += input(Node).
in(Node) += weight(Node,Kid)*out(Kid).
error += (out(Node)-target(Node))**2 if ?target(Node).
A recurrent neural net is ok.
[Figure: a small network with inputs x1–x4, hidden units h1–h3, and output y.]
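A Python sketch that evaluates those equations once on a tiny feed-forward net; the weights, inputs, and target below are invented, and the recurrent case (which needs iteration to a fixed point) is omitted.

import math

input_val = {"x1": 1.0, "x2": -2.0}                              # input(Node)
weight = {("h1", "x1"): 0.5, ("h1", "x2"): 0.25,                 # weight(Node, Kid)
          ("h2", "x1"): -1.0, ("h2", "x2"): 0.75,
          ("y", "h1"): 2.0, ("y", "h2"): -1.5}
target = {"y": 1.0}

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

out = {}
for node in ["x1", "x2", "h1", "h2", "y"]:                       # topological order, so no recurrence
    in_val = input_val.get(node, 0.0) \
           + sum(w * out[kid] for (n, kid), w in weight.items() if n == node)
    out[node] = sigmoid(in_val)                                  # out(Node) = sigmoid(in(Node))

error = sum((out[n] - t) ** 2 for n, t in target.items())        # error += (out-target)**2 if ?target(Node)
print(round(out["y"], 3), round(error, 3))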
54
Game-tree analysis in Dyna
goal = best(Board) if start(Board).
best(Board) max= stop(player1, Board).
best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
worst(Board) min= stop(player2, Board).
worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
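The same recursions in Python on an invented, acyclic toy game; stop gives the payoff of stopping at a board, and move lists successor boards with additive move costs.

stop = {("p1", "B0"): 0, ("p1", "B1"): 3, ("p1", "B2"): -1, ("p1", "B3"): 4,
        ("p2", "B1"): 2, ("p2", "B2"): 5}
move = {("p1", "B0"): [("B1", 0), ("B2", 1)],     # player 1 to move
        ("p2", "B1"): [("B3", 0)],                # player 2 to move
        ("p2", "B2"): []}
start = "B0"

def best(board):    # best(Board) max= stop(p1,Board);  max= move(p1,Board,NB) + worst(NB).
    vals = [stop[("p1", board)]] if ("p1", board) in stop else []
    vals += [cost + worst(nb) for nb, cost in move.get(("p1", board), [])]
    return max(vals)

def worst(board):   # worst(Board) min= stop(p2,Board);  min= move(p2,Board,NB) + best(NB).
    vals = [stop[("p2", board)]] if ("p2", board) in stop else []
    vals += [cost + best(nb) for nb, cost in move.get(("p2", board), [])]
    return min(vals)

print(best(start))  # goal = best(Board) if start(Board); here 6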
55
Weighted FST composition in Dyna(epsilon-free case)
:- bool item = false.
start(A o B, Q x R) |= start(A, Q) & start(B, R).
stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
Inefficient? How do we fix this?
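Here is the same construction written directly in Python for two invented boolean (unweighted) machines; like the rule above, it naively pairs every arc of A with every arc of B, which is exactly the inefficiency being asked about (indexing arcs by the shared Match symbol fixes it).

# Each machine: (start_states, stop_states, arcs) with arcs as (q1, q2, in_sym, out_sym).
A = ({"q0"}, {"q1"}, [("q0", "q1", "a", "x"), ("q0", "q0", "b", "y")])
B = ({"r0"}, {"r1"}, [("r0", "r1", "x", "1"), ("r0", "r0", "y", "2")])

def compose(A, B):
    a_starts, a_stops, a_arcs = A
    b_starts, b_stops, b_arcs = B
    starts = {(q, r) for q in a_starts for r in b_starts}   # start(AoB, QxR) |= start(A,Q) & start(B,R)
    stops  = {(q, r) for q in a_stops  for r in b_stops}    # stop(AoB, QxR)  |= stop(A,Q)  & stop(B,R)
    arcs = []
    for q1, q2, a_in, mid in a_arcs:                        # arc(AoB, Q1xR1, Q2xR2, In, Out) |=
        for r1, r2, b_in, out in b_arcs:                    #   arc(A,Q1,Q2,In,Match) & arc(B,R1,R2,Match,Out)
            if mid == b_in:
                arcs.append(((q1, r1), (q2, r2), a_in, out))
    return starts, stops, arcs

print(compose(A, B))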
56
Constraint programming (arc consistency)
:- bool indomain = false.
:- bool consistent = true.
variable(Var) |= indomain(Var:Val).
possible(Var:Val) &= indomain(Var:Val).
possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).
57
Edit distance in Dyna: version 1
letter1(“c”,0,1). letter1(“l”,1,2). letter1(“a”,2,3). …   % clara
letter2(“c”,0,1). letter2(“a”,1,2). letter2(“c”,2,3). …   % caca
end1(5). end2(4).
delcost := 1. inscost := 1. substcost := 1.

align(0,0) = 0.
align(I1,J2) min= align(I1,I2) + letter2(L2,I2,J2) + inscost(L2).   % next letter is L2; add it to string 2 only
align(J1,I2) min= align(I1,I2) + letter1(L1,I1,J1) + delcost(L1).
align(J1,J2) min= align(I1,I2) + letter1(L1,I1,J1) + letter2(L2,I2,J2) + subcost(L1,L2).
align(J1,J2) min= align(I1,I2) + letter1(L,I1,J1) + letter2(L,I2,J2).   % same L; free move!
goal = align(N1,N2) whenever end1(N1) & end2(N2).

align(I1,I2) is the cost of the best alignment of the first I1 characters of string 1 with the first I2 characters of string 2.
58
Edit distance in Dyna: version 2
input([“c”,“l”,“a”,“r”,“a”], [“c”,“a”,“c”,“a”]) := 0.
delcost := 1. inscost := 1. substcost := 1.

alignupto(Xs,Ys) min= input(Xs,Ys).
alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
goal min= alignupto([], []).

Xs and Ys are still-unaligned suffixes. This item’s value is supposed to be the cost of aligning everything up to but not including them.
How about different costs for different letters?
59
Edit distance in Dyna: version 2, with letter-specific costs
input([“c”,“l”,“a”,“r”,“a”], [“c”,“a”,“c”,“a”]) := 0.
alignupto(Xs,Ys) min= input(Xs,Ys).
alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost(X).
alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost(Y).
alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost(X,Y).
alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]) + nocost(L,L).
goal min= alignupto([], []).
Xs and Ys are still-unaligned suffixes. This item’s value is supposed to be the cost of aligning everything up to but not including them.
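The same recurrence as a memoized Python function over still-unaligned suffixes; the default unit costs stand in for whatever delcost, inscost, and substcost tables one actually has.

from functools import lru_cache

def edit_distance(s1, s2,
                  delcost=lambda x: 1.0, inscost=lambda y: 1.0, substcost=lambda x, y: 1.0):
    @lru_cache(maxsize=None)
    def align(i, j):
        # Min cost of aligning the still-unaligned suffixes s1[i:] and s2[j:].
        if i == len(s1) and j == len(s2):
            return 0.0
        costs = []
        if i < len(s1):
            costs.append(delcost(s1[i]) + align(i + 1, j))               # delete s1[i]
        if j < len(s2):
            costs.append(inscost(s2[j]) + align(i, j + 1))               # insert s2[j]
        if i < len(s1) and j < len(s2):
            step = 0.0 if s1[i] == s2[j] else substcost(s1[i], s2[j])    # free move if letters match
            costs.append(step + align(i + 1, j + 1))
        return min(costs)
    return align(0, 0)

print(edit_distance("clara", "caca"))   # 2.0 with unit costs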
60
Is it fast enough? (sort of)
• Asymptotically efficient
• 4 times slower than Mark Johnson’s inside-outside
• 4-11 times slower than Klein & Manning’s Viterbi parser
61
Are you going to make it faster?
(yup!)
• Currently rewriting the term classes to match hand-tuned code
• Will support “mix-and-match” implementation strategies: store X in an array, store Y in a hash, don’t store Z (compute on demand)
• Eventually, choose strategies automatically by execution profiling
62
Synopsis: your idea → experimental results, fast!
Dyna is a language for computation (no I/O). Especially good for dynamic programming. It tries to encapsulate the black art of NLP.
Much prior work in this vein …
• Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel, …
• Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, …
• Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer, … (also: efficient Prologish languages)
63
Dyna contributors!
Jason Eisner
Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
Noah A. Smith (parameter training)
Markus Dreyer, David Smith (compiler frontend)
Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
John Blatz (program transformations)
Asheesh Laroia (web services)
64
New examples of dynamic programming in NLP
65
Some examples from my lab … (same list as on the earlier outline slide)
66
New examples of dynamic programming in NLP
Parameterized finite-state machines
67
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: an FSM whose arc weights are formulas, e.g. /p, /1-p, a/q, a/q*exp(t+u), b/(1-q)r, a/r, a/exp(t+v), 1-s.]
68
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: the same FSM with the formulas evaluated to numbers, e.g. /.1, /.9, a/.2, a/.44, a/.56, a/.3, b/.8, .7.]
69
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: the parameterized FSM again.]
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
70
Parameterized FSMs: Knight & Graehl 1997, transliteration
Cascade of composed models: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text)
“/t/ and /d/ are similar …”
Loosely coupled probabilities:
  /t/:/tt/ with weight exp p+q+r (coronal, stop, unvoiced)
  /d/:/dd/ with weight exp p+q+s (coronal, stop, voiced)
71
Parameterized FSMs: Knight & Graehl 1997, transliteration
[The same cascade of composed models as above.]
“Would like to get some of that expert knowledge in here”
Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) …
If the probabilities are variables (a*x b) +y (ab*z) … then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
80
New examples of dynamic programming in NLP
Parameterized infinite-state machines
81
Universal grammar as a parameterized FSA over an infinite state space
82
New examples of dynamic programming in NLP
More abuses of finite-state machines
83
Huge-alphabet FSAs for OT phonology
[Figure: an underlying form (e.g., VCCVC with a voicing tier) and the candidate surface forms that Gen proposes for it, each represented on several tiers (CCVC, CCC, velar, voi, V, …). Gen proposes all candidates that include this input.]
84
Huge-alphabet FSAs for OT phonology
[Figure: one candidate, with its segments and tiers (CCVC, CCC, velar, voi, …).]
Encode this candidate as a string: at each moment, we need to describe what’s going on on many tiers.
85
Directional Best Paths construction
• Keep “best” output string for each input string
• Yields a new transducer (size 3n)
For input abc: abc axc. For input abd: axd.
Must allow the red arc just if the next input is d.
[Figure: a transducer with states 1–7 and arcs a:a, b:b, b:x, c:c, c:c, d:d.]
86
Minimization of semiring-weighted FSAs
New definition of λ for pushing: λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols.
Computation is simple, well-defined, independent of the semiring (K, ⊗). Breadth-first search back from final states: compute λ(q) in O(1) time as soon as we visit q, via λ(q) = k ⊗ λ(r) for an arc q → r with weight k. The whole algorithm is linear. Faster than finding the min-weight path à la Mohri.
[Figure: a small automaton with arcs labeled a, b, c, d; the states shown are at distance 2 from the final states.]
87
New examples of dynamic programming in NLP
Tree-to-tree alignment
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English: “beaucoup d’enfants donnent un baiser à Sam” ↔ “kids kiss Sam quite often”.
[Figure: the French tree (beaucoup “lots”, d’ “of”, enfants “kids”, donnent “give”, un “a”, baiser “kiss”, à “to”, Sam) beside the English tree (kids, kiss, Sam, quite, often).]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the same tree pair decomposed into aligned “little trees” (e.g., beaucoup d’enfants ↔ kids, donnent un baiser à ↔ kiss, Sam ↔ Sam, null Adv ↔ quite / often), joined at NP, Adv, and Start substitution sites.]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from “little trees” ...
[Figure: the aligned “little trees” again, with the Start symbol and the NP and Adv substitution sites marked.]
91
New examples of dynamic programming in NLP
Bilexical parsing in O(n³)
(with Giorgio Satta)
92
Lexicalized CKY
[Figure: bracketed spans over “Mary loves the girl outdoors”, each span headed by a word.]
93
Lexicalized CKY is O(n⁵) not O(n³)
[Figure: B spans i..j with head h and C spans j+1..k with head h′; they combine into A spanning i..k with head h. Without heads there are O(n³) combinations; with heads, O(n⁵).]
... hug visiting relatives  vs.  ... advocate visiting relatives
94
Idea #1
[Figure: B spans i..j with head h; C spans j+1..k with head h′; they combine into A spanning i..k with head h′.]
Combine B with what C?
• must try different-width C’s (vary k)
• must try differently-headed C’s (vary h′)
Separate these!
95
Idea #1
[Figure: instead of combining B (i..j, head h) with a fully specified C (j+1..k, head h′) in one step (the old CKY way), the combination is split into two steps, so that the width k and the head h′ of C need not be enumerated at the same time.]
96
Idea #2
Some grammars allow a constituent A spanning i..k with head h to be split at the head into two halves, i..h and h..k.
[Figure: the split constituent.]
97
Idea #2
[Figure: B spans i..j with head h; C spans j+1..k with head h′; they combine into A spanning i..k with head h.]
Combine what B and C?
• must try different-width C’s (vary k)
• must try different midpoints j
Separate these!
98
Idea #2
[Figure: instead of combining B (i..j, head h) and C (j+1..k, head h′) in one step (the old CKY way), the combination is again split into two steps, so that the width of C (vary k) and the midpoint j need not be enumerated at the same time.]
100
An O(n³) algorithm (with G. Satta)
[Figure: bracketed spans over “Mary loves the girl outdoors”, now split at head positions.]
101
3 parsers: log-log plot
[Plot: runtime vs. sentence length (log-log scale) for the NAIVE, IWPT-97, and ACL-99 parsers, in both pruned and exhaustive modes.]
102
New examples of dynamic programming in NLP
O(n)-time partial parsing by limiting dependency length
(with Noah A. Smith)
Short-Dependency Preference
A word’s dependents (adjuncts, arguments) tend to fall near it in the string.
length of a dependency ≈ surface distance
[Plot: fraction of all dependencies vs. dependency length, for English, Chinese, and German.]
50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...
Related Ideas
• Score parses based on what’s between a head and child
(Collins, 1997; Zeman, 2004; McDonald et al., 2005)
• Assume short → faster human processing (Church, 1980; Gibson, 1998)
• “Attach low” heuristic for PPs (English)(Frazier, 1979; Hobbs and Bear, 1990)
• Obligatory and optional re-orderings (English)(see paper)
Going to Extremes
[Plot: fraction of all dependencies vs. dependency length (log scale), for English, Chinese, and German.]
Longer dependencies are less likely. What if we eliminate them completely?
Hard Constraints
Disallow dependencies between words of distance > b ...
Risk: best parse contrived, or no parse at all!
Solution: allow fragments (partial parsing; Hindle, 1990 inter alia).
Why not model the sequence of fragments?
Building a Vine SBG Parser
Grammar: generates sequence of trees from $
Parser: recognizes sequences of trees without long dependencies
Need to modify training data so the model is consistent with the parser.
[Figure: a dependency parse of “According to some estimates, the rule changes would cut insider filings by more than a third.” (from the Penn Treebank), with each dependency labeled by its length and the roots attached to $. As the bound b decreases from 4 down to 0, every dependency longer than b is cut, leaving a longer and longer sequence of parse fragments hanging from $.]
Vine Grammar is Regular
• Even for small b, “bunches” can grow to arbitrary size.
• But arbitrary center embedding is out.
119
Linear-time partial parsing: limiting dependency length
[Figure: a finite-state model of the root sequence (NP S NP …), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide: right- or left-branching).]
Natural-language dependencies tend to be short. So even if you don’t have enough data to model what the heads are … you might want to keep track of where they are.
120
Linear-time partial parsing: limiting dependency length
Don’t convert into an FSA!
• Less structure sharing
• Explosion of states for different stack configurations
• Hard to get your parse back
[Figure: the same finite-state model of the root sequence, with bounded dependency length within each chunk.]
121
Linear-time partial parsing: limiting dependency length
[Figure: root sequence NP S NP, with one dependency highlighted.]
• Each piece is at most k words wide
• No dependencies between pieces
• Finite-state model of the sequence
• Linear time! O(k²n)
Parsing Algorithm
• Same grammar constant as Eisner and Satta (1999)
• O(n³) → O(nb²) runtime
• Includes some overhead (low-order term) for constructing the vine
  – Reality check ... is it worth it?
126
F-measure & runtime of a limited-dependency-length parser (POS seqs)
127
Precision & recall of a limited-dependency-length parser (POS seqs)
142
New examples of dynamic programming in NLP
Grammar induction by initially limiting dependency length
(with Noah A. Smith)
144
Soft bias toward short dependencies
Multiply the parse probability by exp(-δS), where S is the total length of all dependencies; then renormalize the probabilities.
[Axis: δ runs from -∞ to +∞, with the MLE baseline at δ = 0 and linear structure preferred as δ grows.]
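Numerically, with invented parse probabilities and dependency-length totals S, the reweighting looks like this:

import math

# Invented candidate parses of one sentence: (model probability, total dependency length S).
parses = {"flat": (0.2, 12), "deep": (0.5, 7), "linear": (0.3, 5)}

def rescore(delta):
    # Multiply parse probability by exp(-delta * S), then renormalize.
    weights = {p: prob * math.exp(-delta * s) for p, (prob, s) in parses.items()}
    z = sum(weights.values())
    return {p: w / z for p, w in weights.items()}

print(rescore(0.0))   # delta = 0: the MLE baseline distribution, unchanged
print(rescore(1.0))   # larger delta: mass shifts toward the short-dependency ("linear") structure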
145
Structural Annealing
[Axis: δ from -∞ to +∞, with the MLE baseline at δ = 0.]
Start here; train a model. Increase δ and retrain. Repeat ... until performance stops improving on a small validation dataset.
146
Grammar Induction
[Bar chart: directed attachment accuracy (20-70%) for German, English, Bulgarian, Mandarin, Turkish, and Portuguese, comparing MLE, CE (deletions & transpositions), and Structural Annealing.]
Other structural biases can be annealed. We tried annealing on connectivity (# of fragments), and got similar results.
147
A 6/9-Accurate Parse
[Figure: two dependency parses of “the gene thus can prevent a plant from fertilizing itself”: the Treebank parse vs. the parse from MLE with a locality bias.]
Errors: preposition misattachment; misattachment of the adverb “thus”; verb instead of modal as root.
These errors look like ones made by a supervised parser in 2000!
148
Accuracy Improvements

language     random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German       27.5%         50.3                     70.0                    82.6¹
English      30.3          41.6                     61.8                    90.9²
Bulgarian    30.4          45.6                     58.4                    85.9¹
Mandarin     22.6          50.1                     57.2                    84.6¹
Turkish      29.8          48.0                     62.4                    69.6¹
Portuguese   30.6          42.3                     71.8                    86.5¹

¹ CoNLL-X shared task, best system.  ² McDonald et al., 2005
149
Combining with Contrastive Estimation
This generally gives us our best results …
150
New examples of dynamic programming in NLP
Contrastive estimation for HMM and grammar induction. Uses lattice parsing …
(with Noah A. Smith)
Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Nutshell Version
Ingredients: tractable training + unannotated text + “max ent” features + sequence models, via contrastive estimation with lattice neighborhoods.
Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• “Max ent” features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)
Running example: “Red leaves don’t hide blue jays.”
Maximum Likelihood Estimation (Supervised)
[Diagram: put probability mass p on the observed pair x = “red leaves don’t hide blue jays”, y = JJ NNS MD VB JJ NNS, relative to p* over the whole space Σ* × Λ* of (sentence, tagging) pairs.]
Maximum Likelihood Estimation (Unsupervised)
[Diagram: put probability mass on the observed sentence x = “red leaves don’t hide blue jays” under any tagging (? ? ? ? ? ?), again relative to all of Σ* × Λ*.]
This is what EM does.
Focusing Probability Mass: numerator vs. denominator
Conditional Estimation (Supervised)
[Diagram: the numerator is the observed pair (“red leaves don’t hide blue jays”, JJ NNS MD VB JJ NNS); the denominator is (x) × Λ*, i.e., the same sentence under every possible tagging.]
A different denominator!
Objective Functions (Objective | Optimization Algorithm | Numerator | Denominator)
• MLE | Count & Normalize* | tags & words | Σ* × Λ*
• MLE with hidden variables | EM* | words | Σ* × Λ*
• Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
• Perceptron | Backprop | tags & words | hypothesized tags & words
• Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?
* For generative models.

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.
Language Learning (Syntax)
At last! My own language learning device! It said: “red leaves don’t hide blue jays”
EM asks: Why didn’t he say “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words?
Instead we could ask: Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...?
What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.
The Job of Syntax: “Explain why each word is necessary.”
→ DEL1WORD neighborhood of “red leaves don’t hide blue jays”:
  leaves don’t hide blue jays
  red don’t hide blue jays
  red leaves hide blue jays
  red leaves don’t blue jays
  red leaves don’t hide jays
  red leaves don’t hide blue
The Job of Syntax: “Explain the (local) order of the words.”
→ TRANS1 neighborhood of “red leaves don’t hide blue jays”:
  leaves red don’t hide blue jays
  red leaves hide don’t blue jays
  red don’t leaves hide blue jays
  red leaves don’t hide jays blue
  red leaves don’t blue hide jays
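For concreteness, a few lines of Python that enumerate these two neighborhoods explicitly (each includes the original sentence); the real system represents them compactly as lattices rather than as lists.

def del1word(words):
    """DEL1WORD: the sentence with up to one word deleted (n+1 strings)."""
    return [words] + [words[:i] + words[i + 1:] for i in range(len(words))]

def trans1(words):
    """TRANS1: the sentence with any one adjacent pair transposed (n strings)."""
    out = [words]
    for i in range(len(words) - 1):
        w = list(words)
        w[i], w[i + 1] = w[i + 1], w[i]
        out.append(w)
    return out

sent = "red leaves don't hide blue jays".split()
for neighbor in trans1(sent):
    print(" ".join(neighbor))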
[Diagram: the numerator is the observed sentence “red leaves don’t hide blue jays” summed over all taggings (? ? ? ? ? ?); the denominator is every sentence in its TRANS1 neighborhood, under any tagging, represented compactly as a lattice.]
www.dyna.org (shameless self promotion)
The New Modeling Imperative
numerator vs. denominator (“neighborhood”)
“Make the good sentence likely, at the expense of those bad neighbors.”
A good sentence hints that a set of bad ones is nearby.
This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.
Log-Linear Models
The model assigns each (x, y) a score; the normalizer is the partition function. Computing the partition function is undesirable: it sums over all possible taggings of all possible sentences!
Conditional Estimation (Supervised): sum over the taggings of 1 sentence.
Contrastive Estimation (Unsupervised): sum over the taggings of a few sentences.
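In symbols (notation here is mine, not the slides'): writing u_θ(x, y) = exp(θ · f(x, y)) for the unnormalized score and N(x_i) for the neighborhood of the i-th observed sentence, contrastive estimation maximizes

\prod_i \frac{\sum_{y} u_\theta(x_i, y)}{\sum_{x' \in N(x_i)} \sum_{y} u_\theta(x', y)}

so the denominator sums only over the neighborhood (a few sentences, encoded as a lattice) instead of over all of Σ* × Λ*.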
A Big Picture: Sequence Model Estimation
[Diagram comparing estimators on three desiderata (tractable sums, overlapping features, unannotated data): generative MLE p(x, y); log-linear conditional estimation p(y | x); generative EM p(x); log-linear MLE p(x, y); log-linear EM p(x); log-linear CE with lattice neighborhoods.]
Contrastive Neighborhoods
• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.
Neighborhoods

neighborhood        size    lattice arcs   perturbations
DEL1WORD            n+1     O(n)           delete up to 1 word
TRANS1              n       O(n)           transpose any bigram
DELORTRANS1         O(n)    O(n)           DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE     O(n²)   O(n²)          delete any contiguous subsequence
Σ* (EM)             ∞       –              replace each word with anything
The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.
A form of supervision.
Trigram Tagging Model
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
[Bar chart: tagging accuracy on ambiguous words in the Merialdo (1994) setup. Unsupervised learners (random, EM, DA (Smith & Eisner 2004), and CE with the DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH (≈ log-linear EM) neighborhoods) are compared against supervised HMM and CRF baselines and against training on 10 × data.]
• 96K words • full POS dictionary • uninformative initializer • best of 8 smoothing conditions
What if we damage the POS dictionary?
[Bar chart: tagging accuracy (all words) for DELORTRANS1, LENGTH, EM, and random as the dictionary is degraded. The dictionary includes: all words; words from the 1st half of the corpus; words with count ≥ 2; words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.]
• 96K words • 17 coarse POS tags • uninformative initializer
Trigram Tagging Model + Spelling
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
• 1- to 3-character suffixes, contains hyphen, contains digit
[Bar chart: tagging accuracy (all words) with degraded POS dictionaries, comparing DELORTRANS1 and LENGTH with and without spelling features against EM and random.]
... but only with a smart neighborhood.
Log-linear spelling features aided recovery ...
The model need not be finite-state.
Unsupervised Dependency Parsing
[Bar chart: attachment accuracy for EM, LENGTH, and TRANS1 with clever vs. uninformative initializers, compared to Klein & Manning (2004).]
To Sum Up ...
Contrastive Estimation means picking your own denominator: for tractability, or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.