1
Declarative Specification of NLP Systems
Jason Eisner
IBM, May 2006
student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
2
An Anecdote from ACL’05
-Michael Jordan
3
An Anecdote from ACL’05
“Just draw a model that actually makes sense for your problem. Just do Gibbs sampling. Um, it’s only 6 lines in Matlab…”
– Michael Jordan
4
Conclusions to draw from that talk
1. Mike & his students are great.
2. Graphical models are great. (because they’re flexible)
3. Gibbs sampling is great. (because it works with nearly any graphical model)
4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
6
Parts of it already are …
• Language modeling
• Binary classification (e.g., SVMs)
• Finite-state transductions
• Linear-chain graphical models
Toolkits available; you don’t have to be an expert.

But other parts aren’t …
• Context-free and beyond
• Machine translation
Efficient parsers and MT systems are complicated and painful to write.
7
This talk: A toolkit that’s general enough for these cases.
(stretches from finite-state to Turing machines)
“Dyna”
8
Warning
Lots more beyond this talk
see the EMNLP’05 and FG’06 papers
see http://dyna.org (download + documentation)
sign up for updates by email
wait for the totally revamped next version
9
the case for Little Languages
declarative programming
small is beautiful
10
Sapir-Whorf hypothesis
Language shapes thought
• At least, it shapes conversation
Computer language shapes thought
• At least, it shapes experimental research
• Lots of cute ideas that we never pursue
• Or if we do pursue them, it takes 6-12 months to implement on large-scale data
Have we turned into a lab science?
11
Declarative Specifications
State what is to be done
(How should the computer do it? Turn that over to a general “solver” that handles the specification language.)
Hundreds of domain-specific “little languages” out there. Some have sophisticated solvers.
12
dot (www.graphviz.org)
digraph g {
  graph [rankdir = "LR"];
  node [fontsize = "16", shape = "ellipse"];
  edge [];
  // nodes
  "node0" [label = "<f0> 0x10ba8| <f1>", shape = "record"];
  "node1" [label = "<f0> 0xf7fc4380| <f1> | <f2> |-1", shape = "record"];
  …
  // edges
  "node0":f0 -> "node1":f0 [id = 0];
  "node0":f1 -> "node2":f0 [id = 1];
  "node1":f0 -> "node3":f0 [id = 2];
  …
}
What’s the hard part? Making a nice layout! Actually, it’s NP-hard …
13
dot (www.graphviz.org)
14
LilyPond (www.lilypond.org)
c4
<<c4 d4 e4>>
{ f4 <<c4 d4 e4>> }
<< g2 \\ { f4 <<c4 d4 e4>> } >>
15
LilyPond (www.lilypond.org)
16
Declarative Specs in NLP
• Regular expression (for a FST toolkit)
• Grammar (for a parser)
• Feature set (for a maxent distribution, SVM, etc.)
• Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk:
Sometimes it’s best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.
17
Declarative Specs in NLP
Existing toolkits:
• Regular expression (for a FST toolkit)
• Grammar (for a parser)
• Feature set (for a maxent distribution, SVM, etc.)
Not always flexible enough … Need to open up the parser and rejigger it. Declarative specification of algorithms.
Not always flexible enough … Need to open up the learner and rejigger it. Declarative specification of objective functions.
New toolkit
18
Declarative Specification of Algorithms
19
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
20
Wait a minute …
Didn’t I just implement something like this last month?
• chart management / indexing
• cache-conscious data structures
• prioritization of partial solutions (best-first, A*)
• parameter management
• inside-outside formulas
• different algorithms for training and decoding
• conjugate gradient, annealing, ...
• parallelization?
I thought computers were supposed to automate drudgery
21
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
pseudocode (execution order)
PCFG
Dyna language specifies these equations.
Most programs just need to compute some values from other values. Any order is ok.
Some programs also need to update the outputs if the inputs change:
• spreadsheets, makefiles, email readers
• dynamic graph algorithms
• EM and other iterative optimization
• leave-one-out training of smoothing params
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
22
How you build a system (“big picture” slide)
cool model
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
PCFG
Compilation strategies (we’ll come back to this)
tuned C++ implementation
(data structures, etc.)
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
23
Writing equations in Dyna
int a.
a = b * c.
  a will be kept up to date if b or c changes.
b += x.
b += y.
  equivalent to b = x+y.  b is a sum of two variables. Also kept up to date.
c += z(1).  c += z(2).  c += z(3).  c += z(“four”).  c += z(foo(bar,5)).
  c is a sum of all nonzero z(…) values. At compile time, we don’t know how many!
c += z(N).
  a “pattern”: the capitalized N matches anything
24
More interesting use of patterns
a = b * c.                     scalar multiplication
a(I) = b(I) * c(I).            pointwise multiplication
a += b(I) * c(I).              dot product; could be sparse
                               means a = the sum of b(I)*c(I) over all I,
                               e.g. ... + b(“yetis”)*c(“yetis”) + b(“zebra”)*c(“zebra”)
                               (sparse dot product of query & document)
a(I,K) += b(I,J) * c(J,K).     matrix multiplication; could be sparse
                               J is free on the right-hand side, so we sum over it
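For concreteness, a minimal Python sketch of the last two rules over sparse dictionaries; the toy vectors and matrices here are invented, and Dyna's chart maintenance is ignored.

from collections import defaultdict

# a += b(I) * c(I): sparse dot product, summing over the shared free variable I.
def dot(b_vec, c_vec):
    return sum(v * c_vec[i] for i, v in b_vec.items() if i in c_vec)

# a(I,K) += b(I,J) * c(J,K): sparse matrix multiplication, summing over the free variable J.
def matmul(b_mat, c_mat):
    a = defaultdict(float)
    for (i, j), bv in b_mat.items():
        for (j2, k), cv in c_mat.items():
            if j == j2:                      # J must take the same value in both factors
                a[(i, k)] += bv * cv
    return dict(a)

query = {"yetis": 2.0, "zebra": 1.0}         # invented sparse vectors
doc   = {"yetis": 0.5, "zebra": 3.0, "aardvark": 7.0}
print(dot(query, doc))                        # 4.0

b = {("doc1", "yetis"): 2.0, ("doc1", "zebra"): 1.0}
c = {("yetis", "q"): 0.5, ("zebra", "q"): 3.0}
print(matmul(b, c))                           # {('doc1', 'q'): 4.0}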
25
By now you may see what we’re up to!
Prolog has Horn clauses:
a(I,K) :- b(I,J) , c(J,K).
Dyna has “Horn equations”:
a(I,K) += b(I,J) * c(J,K).
Dyna vs. Prolog
Each item has a value, e.g., a real number, with a definition from other values.
Like Prolog:
• Allow nested terms
• Syntactic sugar for lists, etc.
• Turing-complete
Unlike Prolog:
• Charts, not backtracking!
• Compile efficient C++ classes
• Integrates with your C++ code
26
The CKY inside algorithm in Dyna:
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).

C++ driver: put in axioms (values not defined by the above program); the theorem pops out.
using namespace cky;
chart c;
c[rewrite(“s”,”np”,”vp”)] = 0.7;
c[word(“Pierre”,0,1)] = 1;
c[length(30)] = true;   // 30-word sentence
cin >> c;               // get more axioms from stdin
cout << c[goal];        // print total weight of all parses
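As a cross-check of what those three rules compute, here is a minimal Python version of the inside algorithm; the toy grammar and sentence are invented, and the real system's chart, agenda, and C++ integration are not modeled.

from collections import defaultdict

def inside(words, unary, binary):
    """Return the total weight of all parses of `words` rooted in "s"."""
    n = len(words)
    constit = defaultdict(float)
    # constit(X,I,J) += word(W,I,J) * rewrite(X,W).
    for i, w in enumerate(words):
        for (x, rhs), p in unary.items():
            if rhs == w:
                constit[(x, i, i + 1)] += p
    # constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for mid in range(i + 1, k):
                for (x, y, z), p in binary.items():
                    constit[(x, i, k)] += constit[(y, i, mid)] * constit[(z, mid, k)] * p
    # goal += constit("s",0,N) if length(N).
    return constit[("s", 0, n)]

unary  = {("np", "Pierre"): 1.0, ("vp", "sleeps"): 1.0}   # invented axioms
binary = {("s", "np", "vp"): 0.7}
print(inside(["Pierre", "sleeps"], unary, binary))         # 0.7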
27
visual debugger – browse the proof forest (showing ambiguity and shared substructure)
28
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
29
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
Viterbi parsing: just change each += above to max=.
30
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
To work in the logarithmic domain, also change each * to + and each += to log+= (or max= for Viterbi).
31
Lattice parsing: replace integer string positions by lattice states, e.g. c[ word(“Pierre”, 0, 1) ] = 1 becomes c[ word(“Pierre”, state(5), state(9)) ] = 0.2.
Related algorithms in Dyna?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
[Figure: a word lattice with states 5, 8, 9 and weighted arcs Pierre/0.2, P/0.5, air/0.3.]
32
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Just add words one at a time to the chart. Check at any time what can be derived from the words so far.
Similarly, dynamic grammars
33
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Again, no change to the Dyna program
34
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
Basically, just add extra arguments to the terms above
35
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
36
Earley’s algorithm in Dyna
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
need(“s”,0) = true.
need(Nonterm,J) |= ?constit(_/[Nonterm|_],_,J).
constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) if need(Nonterm,I).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[W|Needed],I,J) * word(W,J,K).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[X|Needed],I,J) * constit(X/[],J,K).
goal += constit(“s”/[],0,N) if length(N).
magic templates transformation (as noted by Minnen 1996)
37
pseudocode (execution order)
Program transformations
tuned C++ implementation
(data structures, etc.)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
Blatz & Eisner (FG 2006):
Lots of equivalent ways to write a system of equations!
Transforming from one to another may improve efficiency.
Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!
cool model
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
PCFG
38
Related algorithms in Dyna?
Viterbi parsing? Logarithmic domain? Lattice parsing? Incremental (left-to-right) parsing? Log-linear parsing? Lexicalized or synchronous parsing? Earley’s algorithm? Binarized CKY?
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit(“s”,0,N) if length(N).
39
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).
folding transformation: asymp. speedup!
[Figure: the unbinarized rule builds X over (I,J) directly from Y over (I,Mid), Z over (Mid,J), and rewrite(X,Y,Z); after binarization, X\Y over (Mid,J) is built first from Z and the rule, and is then combined with Y over (I,Mid) to give X over (I,J).]
40
Rule binarization
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).
folding transformation: asymp. speedup!
Σ_{Y,Z,Mid} constit(Y,I,Mid) · constit(Z,Mid,J) · rewrite(X,Y,Z)
  = Σ_{Y,Mid} constit(Y,I,Mid) · ( Σ_Z constit(Z,Mid,J) · rewrite(X,Y,Z) )
graphical models / constraint programming / multi-way database join
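A small Python sketch of the distributive-law rewrite above, on invented random weights: the folded version first sums out Z into a temporary item (playing the role of constit(X\Y,Mid,J)) and gets the same totals with a shallower loop nest; applied across all spans, that is the source of the asymptotic speedup.

import itertools, random

random.seed(0)
NT = ["s", "np", "vp", "pp"]                                    # toy nonterminals
rewrite = {xyz: random.random() for xyz in itertools.product(NT, repeat=3)}
constit = {(y, i, j): random.random() for y in NT for i in range(4) for j in range(4) if i < j}
I, J = 0, 3

# Unfolded rule: sum over Y, Z, and Mid together.
unfolded = {x: sum(constit[(y, I, m)] * constit[(z, m, J)] * rewrite[(x, y, z)]
                   for y in NT for z in NT for m in range(I + 1, J))
            for x in NT}

# Folded: temp[(x, y, m)] ~ constit(X\Y, Mid, J); Z is summed out once, then reused.
temp = {(x, y, m): sum(constit[(z, m, J)] * rewrite[(x, y, z)] for z in NT)
        for x in NT for y in NT for m in range(I + 1, J)}
folded = {x: sum(constit[(y, I, m)] * temp[(x, y, m)] for y in NT for m in range(I + 1, J))
          for x in NT}

assert all(abs(unfolded[x] - folded[x]) < 1e-9 for x in NT)     # same values either way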
41
More program transformations
Examples that add new semantics:
• Compute gradient (e.g., derive outside algorithm from inside)
• Compute upper bounds for A* (e.g., Klein & Manning ACL’03)
• Coarse-to-fine (e.g., Johnson & Charniak NAACL’06)
Examples that preserve semantics:
• On-demand computation – by analogy with Earley’s algorithm
  - On-the-fly composition of FSTs
  - Left-corner filter for parsing
• Program specialization as unfolding – e.g., compile out the grammar
• Rearranging computations – by analogy with categorial grammar
  - Folding reinterpreted as slashed categories
  - “Speculative computation” using slashed categories: abstract away repeated computation to do it once only – by analogy with unary rule closure or epsilon-closure; derives the Eisner & Satta ACL’99 O(n³) bilexical parser
42
Propagate updates from right-to-left through the equations.
a.k.a. “agenda algorithm”, “forward chaining”, “bottom-up inference”, “semi-naïve bottom-up”
How you build a system (“big picture” slide)
cool model
tuned C++ implementation
(data structures, etc.)
practical equations
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
use a general method
43
Bottom-up inference
rules of program:  s(I,K) += np(I,J) * vp(J,K)    pp(I,K) += prep(I,J) * np(J,K)
chart of derived items with current values:  np(3,5) = 0.1, vp(5,7) = 0.7, vp(5,9) = 0.5, prep(2,3) = 1.0, …
agenda of pending updates:  np(3,5) += 0.3
We updated np(3,5) (to 0.1+0.3 = 0.4); what else must therefore change? If np(3,5) hadn’t been in the chart already, we would have added it. Match the update against the rules: the query vp(5,K)? yields s(3,7) += 0.21 and s(3,9) += 0.15 (then no more matches to this query); the query prep(I,3)? yields pp(2,5) += 0.3.
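A minimal Python sketch of this agenda loop for the two rules shown; the chart contents are copied from the diagram, and real Dyna uses indexing and a priority queue rather than the linear scans here.

from collections import defaultdict

chart = defaultdict(float, {("np", 3, 5): 0.1, ("vp", 5, 7): 0.7,
                            ("vp", 5, 9): 0.5, ("prep", 2, 3): 1.0})
agenda = [(("np", 3, 5), 0.3)]                     # pending update: np(3,5) += 0.3

while agenda:
    (label, i, j), delta = agenda.pop()
    chart[(label, i, j)] += delta                  # apply the update (adds the item if it was absent)
    if label == "np":
        for (l2, a, b), v in list(chart.items()):
            if l2 == "vp" and a == j:              # s(I,K) += np(I,J) * vp(J,K): query vp(j,K)?
                agenda.append((("s", i, b), delta * v))
            if l2 == "prep" and b == i:            # pp(I,K) += prep(I,J) * np(J,K): query prep(I,j)?
                agenda.append((("pp", a, j), v * delta))
    if label == "vp":
        for (l2, a, b), v in list(chart.items()):
            if l2 == "np" and b == i:              # an np ending where this vp starts
                agenda.append((("s", a, j), v * delta))
    # (a fuller version would also match updates to prep, s, pp against the rules)

print({k: round(v, 2) for k, v in chart.items()})
# np(3,5)=0.4, s(3,7)=0.21, s(3,9)=0.15, pp(2,5)=0.3, as in the diagram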
44
p(N^x_{i,k}) = Σ_{y,z,j} p(N^x → N^y N^z | N^x) · p(N^y_{i,j}) · p(N^z_{j,k}),   0 ≤ i < j < k ≤ n
How you build a system (“big picture” slide)
cool model
practical equations
pseudocode (execution order)
for width from 2 to n
  for i from 0 to n-width
    k = i+width
    for j from i+1 to k-1
      …
PCFG
What’s going on under the hood?
tuned C++ implementation (data structures, etc.)
45
Compiler provides …
• copy, compare, & hash terms fast, via integerization (interning)
• efficient storage of terms (use native C++ types, “symbiotic” storage, garbage collection, serialization, …)
• automatic indexing for O(1) lookup (e.g., the query vp(5,K)?)
• hard-coded pattern matching for the rules of the program (e.g., s(I,K) += np(I,J) * vp(J,K))
• an efficient priority queue for the agenda of pending updates
• a chart of derived items with current values (e.g., np(3,5) += 0.3)
46
Beware double-counting!
rule of program:  n(I,K) += n(I,J) * n(J,K)
chart: n(5,5) = 0.2, …    agenda: n(5,5) += 0.3
An epsilon constituent n(5,5) can combine with itself to make another copy of itself: the query n(5,K)? matches the very item being updated, so the pending update n(5,5) += ? must be handled carefully.
47
Parameter training
Maximize some objective function. Use Dyna to compute the function. Then how do you differentiate it?
… for gradient ascent, conjugate gradient, etc.
… gradient also tells us the expected counts for EM!
Model parameters (and the input sentence) go in as axiom values; the objective function comes out as a theorem’s value (e.g., the inside algorithm computes the likelihood of the sentence).
Two approaches:
• Program transformation – automatically derive the “outside” formulas.
• Back-propagation – run the agenda algorithm “backwards.” (works even with pruning, early stopping, etc.)
DynaMITE: training toolkit
48
What can Dyna do beyond CKY?
49
Some examples from my lab …
Parsing using …
• factored dependency models (Dreyer, Smith, & Smith CONLL’06)
• with annealed risk minimization (Smith & Eisner EMNLP’06)
• constraints on dependency length (Eisner & Smith IWPT’05)
• unsupervised learning of deep transformations (see Eisner EMNLP’02)
• lexicalized algorithms (see Eisner & Satta ACL’99, etc.)
Grammar induction using …
• partial supervision (Dreyer & Eisner EMNLP’06)
• structural annealing (Smith & Eisner ACL’06)
• contrastive estimation (Smith & Eisner GIA’05)
• deterministic annealing (Smith & Eisner ACL’04)
Machine translation using …
• very large neighborhood search of permutations (Eisner & Tromble NAACL-W’06)
• loosely syntax-based MT (Smith & Eisner, in prep.)
• synchronous cross-lingual parsing (Smith & Smith EMNLP’04)
Finite-state methods for morphology, phonology, IE, even syntax …
• unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06)
• unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05)
• context-based morphological disambiguation (Smith, Smith & Tromble EMNLP’05)
• trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …)
• finite-state machines with very large alphabets (see Eisner ACL’97)
• finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)
Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04; see also Eisner ACL’03)
Easy to try stuff out! Programs are very short & easy to change!
50
Can it express everything in NLP? Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful.
We’re currently extending the class of allowed formulas “beyond the semiring” (cf. Goodman 1999); this will be able to express smoothing, neural nets, etc.
Of course, it is Turing complete …
51
Smoothing in Dyna
mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).   % (X,Y) is the context
smoothed_prob(X,Y,Z) = lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z).
  % for arbitrary n-grams, can use lists
count_count(N) += 1 whenever N is count(Anything).
  % updates automatically during leave-one-out jackknifing
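In ordinary Python, the interpolation above looks like this; the counts and lambda are invented, and the jackknifing and list-based n-grams are left out.

from collections import Counter

trigrams = Counter({("red", "leaves", "fall"): 3, ("red", "leaves", "hide"): 1})
bigrams  = Counter({("red", "leaves"): 4, ("leaves", "fall"): 3, ("leaves", "hide"): 1})
unigrams = Counter({("leaves",): 4})
lam = 0.75

def mle_prob3(x, y, z):            # mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y)
    return trigrams[(x, y, z)] / bigrams[(x, y)]

def mle_prob2(y, z):               # backoff estimate from the shorter context
    return bigrams[(y, z)] / unigrams[(y,)]

def smoothed_prob(x, y, z):        # lambda*mle_prob(X,Y,Z) + (1-lambda)*mle_prob(Y,Z)
    return lam * mle_prob3(x, y, z) + (1 - lam) * mle_prob2(y, z)

print(smoothed_prob("red", "leaves", "hide"))   # 0.75*0.25 + 0.25*0.25 = 0.25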
52
Information retrieval in Dyna
score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word).
idf(Word) = 1/log(df(Word)).
df(Word) += 1 whenever tf(Doc,Word) > 0.
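The same scoring rule in a few lines of Python over an invented toy collection; the df > 1 guard is only there because 1/log(df) is undefined for a word that appears in a single document of this tiny collection.

import math
from collections import Counter

docs = {"d1": "blue leaves hide blue jays",
        "d2": "blue jays sing",
        "d3": "red leaves fall"}
query = "blue jays"

tf  = {d: Counter(text.split()) for d, text in docs.items()}
qtf = Counter(query.split())
df  = Counter(w for counts in tf.values() for w in counts)   # df(Word) += 1 whenever tf(Doc,Word) > 0

def idf(w):
    return 1.0 / math.log(df[w])                              # idf(Word) = 1/log(df(Word))

def score(d):                                                 # score(Doc) += tf(Doc,Word)*tf(Query,Word)*idf(Word)
    return sum(tf[d][w] * qtf[w] * idf(w) for w in qtf if df[w] > 1)

print({d: round(score(d), 2) for d in docs})                  # d1 ≈ 4.33, d2 ≈ 2.89, d3 = 0.0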
53
Neural networks in Dyna
out(Node) = sigmoid(in(Node)).
in(Node) += input(Node).
in(Node) += weight(Node,Kid)*out(Kid).
error += (out(Node)-target(Node))**2 if ?target(Node).
A recurrent neural net is ok.
[Figure: a small network with inputs x1–x4, hidden units h1–h3, and output y.]
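A Python sketch that evaluates those equations once on a tiny feed-forward net; the weights, inputs, and target below are invented, and the recurrent case (which needs iteration to a fixed point) is omitted.

import math

input_val = {"x1": 1.0, "x2": -2.0}                              # input(Node)
weight = {("h1", "x1"): 0.5, ("h1", "x2"): 0.25,                 # weight(Node, Kid)
          ("h2", "x1"): -1.0, ("h2", "x2"): 0.75,
          ("y", "h1"): 2.0, ("y", "h2"): -1.5}
target = {"y": 1.0}

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

out = {}
for node in ["x1", "x2", "h1", "h2", "y"]:                       # topological order, so no recurrence
    in_val = input_val.get(node, 0.0) \
           + sum(w * out[kid] for (n, kid), w in weight.items() if n == node)
    out[node] = sigmoid(in_val)                                  # out(Node) = sigmoid(in(Node))

error = sum((out[n] - t) ** 2 for n, t in target.items())        # error += (out-target)**2 if ?target(Node)
print(round(out["y"], 3), round(error, 3))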
54
Game-tree analysis in Dyna
goal = best(Board) if start(Board).
best(Board) max= stop(player1, Board).
best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).
worst(Board) min= stop(player2, Board).
worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).
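The same recursions in Python on an invented, acyclic toy game; stop gives the payoff of stopping at a board, and move lists successor boards with additive move costs.

stop = {("p1", "B0"): 0, ("p1", "B1"): 3, ("p1", "B2"): -1, ("p1", "B3"): 4,
        ("p2", "B1"): 2, ("p2", "B2"): 5}
move = {("p1", "B0"): [("B1", 0), ("B2", 1)],     # player 1 to move
        ("p2", "B1"): [("B3", 0)],                # player 2 to move
        ("p2", "B2"): []}
start = "B0"

def best(board):    # best(Board) max= stop(p1,Board);  max= move(p1,Board,NB) + worst(NB).
    vals = [stop[("p1", board)]] if ("p1", board) in stop else []
    vals += [cost + worst(nb) for nb, cost in move.get(("p1", board), [])]
    return max(vals)

def worst(board):   # worst(Board) min= stop(p2,Board);  min= move(p2,Board,NB) + best(NB).
    vals = [stop[("p2", board)]] if ("p2", board) in stop else []
    vals += [cost + best(nb) for nb, cost in move.get(("p2", board), [])]
    return min(vals)

print(best(start))  # goal = best(Board) if start(Board); here 6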
55
Weighted FST composition in Dyna(epsilon-free case)
:- bool item = false.
start(A o B, Q x R) |= start(A, Q) & start(B, R).
stop(A o B, Q x R) |= stop(A, Q) & stop(B, R).
arc(A o B, Q1 x R1, Q2 x R2, In, Out) |= arc(A, Q1, Q2, In, Match) & arc(B, R1, R2, Match, Out).
Inefficient? How do we fix this?
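Here is the same construction written directly in Python for two invented boolean (unweighted) machines; like the rule above, it naively pairs every arc of A with every arc of B, which is exactly the inefficiency being asked about (indexing arcs by the shared Match symbol fixes it).

# Each machine: (start_states, stop_states, arcs) with arcs as (q1, q2, in_sym, out_sym).
A = ({"q0"}, {"q1"}, [("q0", "q1", "a", "x"), ("q0", "q0", "b", "y")])
B = ({"r0"}, {"r1"}, [("r0", "r1", "x", "1"), ("r0", "r0", "y", "2")])

def compose(A, B):
    a_starts, a_stops, a_arcs = A
    b_starts, b_stops, b_arcs = B
    starts = {(q, r) for q in a_starts for r in b_starts}   # start(AoB, QxR) |= start(A,Q) & start(B,R)
    stops  = {(q, r) for q in a_stops  for r in b_stops}    # stop(AoB, QxR)  |= stop(A,Q)  & stop(B,R)
    arcs = []
    for q1, q2, a_in, mid in a_arcs:                        # arc(AoB, Q1xR1, Q2xR2, In, Out) |=
        for r1, r2, b_in, out in b_arcs:                    #   arc(A,Q1,Q2,In,Match) & arc(B,R1,R2,Match,Out)
            if mid == b_in:
                arcs.append(((q1, r1), (q2, r2), a_in, out))
    return starts, stops, arcs

print(compose(A, B))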
56
Constraint programming (arc consistency)
:- bool indomain = false.
:- bool consistent = true.
variable(Var) |= indomain(Var:Val).
possible(Var:Val) &= indomain(Var:Val).
possible(Var:Val) &= support(Var:Val, Var2) whenever variable(Var2).
support(Var:Val, Var2) |= possible(Var2:Val2) & consistent(Var:Val, Var2:Val2).
57
Edit distance in Dyna: version 1
letter1(“c”,0,1). letter1(“l”,1,2). letter1(“a”,2,3). …   % clara
letter2(“c”,0,1). letter2(“a”,1,2). letter2(“c”,2,3). …   % caca
end1(5). end2(4).
delcost := 1. inscost := 1. substcost := 1.

align(0,0) = 0.
align(I1,J2) min= align(I1,I2) + letter2(L2,I2,J2) + inscost(L2).   % next letter is L2; add it to string 2 only
align(J1,I2) min= align(I1,I2) + letter1(L1,I1,J1) + delcost(L1).
align(J1,J2) min= align(I1,I2) + letter1(L1,I1,J1) + letter2(L2,I2,J2) + subcost(L1,L2).
align(J1,J2) min= align(I1,I2) + letter1(L,I1,J1) + letter2(L,I2,J2).   % same L; free move!
goal = align(N1,N2) whenever end1(N1) & end2(N2).

align(I1,I2) is the cost of the best alignment of the first I1 characters of string 1 with the first I2 characters of string 2.
58
Edit distance in Dyna: version 2
input([“c”,“l”,“a”,“r”,“a”], [“c”,“a”,“c”,“a”]) := 0.
delcost := 1. inscost := 1. substcost := 1.

alignupto(Xs,Ys) min= input(Xs,Ys).
alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost.
alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost.
alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost.
alignupto(Xs,Ys) min= alignupto([A|Xs],[A|Ys]).
goal min= alignupto([], []).

Xs and Ys are still-unaligned suffixes. This item’s value is supposed to be the cost of aligning everything up to but not including them.
How about different costs for different letters?
59
Edit distance in Dyna: version 2, with letter-specific costs
input([“c”,“l”,“a”,“r”,“a”], [“c”,“a”,“c”,“a”]) := 0.
alignupto(Xs,Ys) min= input(Xs,Ys).
alignupto(Xs,Ys) min= alignupto([X|Xs],Ys) + delcost(X).
alignupto(Xs,Ys) min= alignupto(Xs,[Y|Ys]) + inscost(Y).
alignupto(Xs,Ys) min= alignupto([X|Xs],[Y|Ys]) + substcost(X,Y).
alignupto(Xs,Ys) min= alignupto([L|Xs],[L|Ys]) + nocost(L,L).
goal min= alignupto([], []).
Xs and Ys are still-unaligned suffixes. This item’s value is supposed to be the cost of aligning everything up to but not including them.
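The same recurrence as a memoized Python function over still-unaligned suffixes; the default unit costs stand in for whatever delcost, inscost, and substcost tables one actually has.

from functools import lru_cache

def edit_distance(s1, s2,
                  delcost=lambda x: 1.0, inscost=lambda y: 1.0, substcost=lambda x, y: 1.0):
    @lru_cache(maxsize=None)
    def align(i, j):
        # Min cost of aligning the still-unaligned suffixes s1[i:] and s2[j:].
        if i == len(s1) and j == len(s2):
            return 0.0
        costs = []
        if i < len(s1):
            costs.append(delcost(s1[i]) + align(i + 1, j))               # delete s1[i]
        if j < len(s2):
            costs.append(inscost(s2[j]) + align(i, j + 1))               # insert s2[j]
        if i < len(s1) and j < len(s2):
            step = 0.0 if s1[i] == s2[j] else substcost(s1[i], s2[j])    # free move if letters match
            costs.append(step + align(i + 1, j + 1))
        return min(costs)
    return align(0, 0)

print(edit_distance("clara", "caca"))   # 2.0 with unit costs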
60
Is it fast enough? (sort of)
• Asymptotically efficient
• 4 times slower than Mark Johnson’s inside-outside
• 4-11 times slower than Klein & Manning’s Viterbi parser
61
Are you going to make it faster?
(yup!)
• Currently rewriting the term classes to match hand-tuned code
• Will support “mix-and-match” implementation strategies: store X in an array, store Y in a hash, don’t store Z (compute on demand)
• Eventually, choose strategies automatically by execution profiling
62
Synopsis: your idea → experimental results, fast!
Dyna is a language for computation (no I/O). Especially good for dynamic programming. It tries to encapsulate the black art of NLP.
Much prior work in this vein …
• Deductive parsing schemata (preferably weighted): Goodman, Nederhof, Pereira, Warren, Shieber, Schabes, Sikkel, …
• Deductive databases (preferably with aggregation): Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, …
• Probabilistic programming languages (implemented): Zhao, Sato, Pfeffer, … (also: efficient Prologish languages)
63
Dyna contributors!
Jason Eisner
Eric Goldlust, Eric Northup, Johnny Graettinger (compiler backend)
Noah A. Smith (parameter training)
Markus Dreyer, David Smith (compiler frontend)
Mike Kornbluh, George Shafer, Gordon Woodhull, Constantinos Michael, Ray Buse (visual debugger)
John Blatz (program transformations)
Asheesh Laroia (web services)
64
New examples of dynamic programming in NLP
65
Some examples from my lab … (same list as on the earlier outline slide)
66
New examples of dynamic programming in NLP
Parameterized finite-state machines
67
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: an FSM whose arc weights are formulas, e.g. /p, /1-p, a/q, a/q*exp(t+u), b/(1-q)r, a/r, a/exp(t+v), 1-s.]
68
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: the same FSM with the formulas evaluated to numbers, e.g. /.1, /.9, a/.2, a/.44, a/.56, a/.3, b/.8, .7.]
69
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[Figure: the parameterized FSM again.]
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize arc probs.
70
Parameterized FSMs: Knight & Graehl 1997, transliteration
Cascade of composed models: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text)
“/t/ and /d/ are similar …”
Loosely coupled probabilities:
  /t/:/tt/ with weight exp p+q+r (coronal, stop, unvoiced)
  /d/:/dd/ with weight exp p+q+s (coronal, stop, voiced)
71
Parameterized FSMs: Knight & Graehl 1997, transliteration
[The same cascade of composed models as above.]
“Would like to get some of that expert knowledge in here”
Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) …
If the probabilities are variables (a*x b) +y (ab*z) … then arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
80
New examples of dynamic programming in NLP
Parameterized infinite-state machines
81
Universal grammar as a parameterized FSA over an infinite state space
82
New examples of dynamic programming in NLP
More abuses of finite-state machines
83
Huge-alphabet FSAs for OT phonology
[Figure: an underlying form (e.g., VCCVC with a voicing tier) and the candidate surface forms that Gen proposes for it, each represented on several tiers (CCVC, CCC, velar, voi, V, …). Gen proposes all candidates that include this input.]
84
Huge-alphabet FSAs for OT phonology
[Figure: one candidate, with its segments and tiers (CCVC, CCC, velar, voi, …).]
Encode this candidate as a string: at each moment, we need to describe what’s going on on many tiers.
85
Directional Best Paths construction
• Keep “best” output string for each input string
• Yields a new transducer (size 3n)
For input abc: abc axc. For input abd: axd.
Must allow the red arc just if the next input is d.
[Figure: a transducer with states 1–7 and arcs a:a, b:b, b:x, c:c, c:c, d:d.]
86
Minimization of semiring-weighted FSAs
New definition of λ for pushing: λ(q) = weight of the shortest path from q, breaking ties alphabetically on input symbols.
Computation is simple, well-defined, independent of the semiring (K, ⊗). Breadth-first search back from final states: compute λ(q) in O(1) time as soon as we visit q, via λ(q) = k ⊗ λ(r) for an arc q → r with weight k. The whole algorithm is linear. Faster than finding the min-weight path à la Mohri.
[Figure: a small automaton with arcs labeled a, b, c, d; the states shown are at distance 2 from the final states.]
87
New examples of dynamic programming in NLP
Tree-to-tree alignment
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English: “beaucoup d’enfants donnent un baiser à Sam” ↔ “kids kiss Sam quite often”.
[Figure: the French tree (beaucoup “lots”, d’ “of”, enfants “kids”, donnent “give”, un “a”, baiser “kiss”, à “to”, Sam) beside the English tree (kids, kiss, Sam, quite, often).]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
[Figure: the same tree pair decomposed into aligned “little trees” (e.g., beaucoup d’enfants ↔ kids, donnent un baiser à ↔ kiss, Sam ↔ Sam, null Adv ↔ quite / often), joined at NP, Adv, and Start substitution sites.]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. The alignment shows how the trees are generated synchronously from “little trees” ...
[Figure: the aligned “little trees” again, with the Start symbol and the NP and Adv substitution sites marked.]
91
New examples of dynamic programming in NLP
Bilexical parsing in O(n³)
(with Giorgio Satta)
92
Lexicalized CKY
[Figure: bracketed spans over “Mary loves the girl outdoors”, each span headed by a word.]
93
Lexicalized CKY is O(n⁵) not O(n³)
[Figure: B spans i..j with head h and C spans j+1..k with head h′; they combine into A spanning i..k with head h. Without heads there are O(n³) combinations; with heads, O(n⁵).]
... hug visiting relatives  vs.  ... advocate visiting relatives
94
Idea #1
[Figure: B spans i..j with head h; C spans j+1..k with head h′; they combine into A spanning i..k with head h′.]
Combine B with what C?
• must try different-width C’s (vary k)
• must try differently-headed C’s (vary h′)
Separate these!
95
Idea #1
[Figure: instead of combining B (i..j, head h) with a fully specified C (j+1..k, head h′) in one step (the old CKY way), the combination is split into two steps, so that the width k and the head h′ of C need not be enumerated at the same time.]
96
Idea #2
Some grammars allow a constituent A spanning i..k with head h to be split at the head into two halves, i..h and h..k.
[Figure: the split constituent.]
97
Idea #2
[Figure: B spans i..j with head h; C spans j+1..k with head h′; they combine into A spanning i..k with head h.]
Combine what B and C?
• must try different-width C’s (vary k)
• must try different midpoints j
Separate these!
98
Idea #2
[Figure: instead of combining B (i..j, head h) and C (j+1..k, head h′) in one step (the old CKY way), the combination is again split into two steps, so that the width of C (vary k) and the midpoint j need not be enumerated at the same time.]
100
An O(n³) algorithm (with G. Satta)
[Figure: bracketed spans over “Mary loves the girl outdoors”, now split at head positions.]
101
3 parsers: log-log plot
[Plot: runtime vs. sentence length (log-log scale) for the NAIVE, IWPT-97, and ACL-99 parsers, in both pruned and exhaustive modes.]
102
New examples of dynamic programming in NLP
O(n)-time partial parsing by limiting dependency length
(with Noah A. Smith)
Short-Dependency Preference
A word’s dependents (adjuncts, arguments) tend to fall near it in the string.
length of a dependency ≈ surface distance
[Plot: fraction of all dependencies vs. dependency length, for English, Chinese, and German.]
50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...
Related Ideas
• Score parses based on what’s between a head and child
(Collins, 1997; Zeman, 2004; McDonald et al., 2005)
• Assume short → faster human processing (Church, 1980; Gibson, 1998)
• “Attach low” heuristic for PPs (English)(Frazier, 1979; Hobbs and Bear, 1990)
• Obligatory and optional re-orderings (English)(see paper)
Going to Extremes
[Plot: fraction of all dependencies vs. dependency length (log scale), for English, Chinese, and German.]
Longer dependencies are less likely. What if we eliminate them completely?
Hard Constraints
Disallow dependencies between words of distance > b ...
Risk: best parse contrived, or no parse at all!
Solution: allow fragments (partial parsing; Hindle, 1990 inter alia).
Why not model the sequence of fragments?
Building a Vine SBG Parser
Grammar: generates sequence of trees from $
Parser: recognizes sequences of trees without long dependencies
Need to modify training data so the model is consistent with the parser.
[Figure: a dependency parse of “According to some estimates, the rule changes would cut insider filings by more than a third.” (from the Penn Treebank), with each dependency labeled by its length and the roots attached to $. As the bound b decreases from 4 down to 0, every dependency longer than b is cut, leaving a longer and longer sequence of parse fragments hanging from $.]
Vine Grammar is Regular
• Even for small b, “bunches” can grow to arbitrary size.
• But arbitrary center embedding is out.
119
Linear-time partial parsing: limiting dependency length
[Figure: a finite-state model of the root sequence (NP S NP …), with bounded dependency length within each chunk (but a chunk could be arbitrarily wide: right- or left-branching).]
Natural-language dependencies tend to be short. So even if you don’t have enough data to model what the heads are … you might want to keep track of where they are.
120
Linear-time partial parsing: limiting dependency length
Don’t convert into an FSA!
• Less structure sharing
• Explosion of states for different stack configurations
• Hard to get your parse back
[Figure: the same finite-state model of the root sequence, with bounded dependency length within each chunk.]
121
Linear-time partial parsing: limiting dependency length
[Figure: root sequence NP S NP, with one dependency highlighted.]
• Each piece is at most k words wide
• No dependencies between pieces
• Finite-state model of the sequence
• Linear time! O(k²n)
Parsing Algorithm
• Same grammar constant as Eisner and Satta (1999)
• O(n³) → O(nb²) runtime
• Includes some overhead (low-order term) for constructing the vine
  – Reality check ... is it worth it?
126
F-measure & runtime of a limited-dependency-length parser (POS seqs)
127
Precision & recall of a limited-dependency-length parser (POS seqs)
142
New examples of dynamic programming in NLP
Grammar induction by initially limiting dependency length
(with Noah A. Smith)
144
Soft bias toward short dependencies
Multiply the parse probability by exp(-δS), where S is the total length of all dependencies; then renormalize the probabilities.
[Axis: δ runs from -∞ to +∞, with the MLE baseline at δ = 0 and linear structure preferred as δ grows.]
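Numerically, with invented parse probabilities and dependency-length totals S, the reweighting looks like this:

import math

# Invented candidate parses of one sentence: (model probability, total dependency length S).
parses = {"flat": (0.2, 12), "deep": (0.5, 7), "linear": (0.3, 5)}

def rescore(delta):
    # Multiply parse probability by exp(-delta * S), then renormalize.
    weights = {p: prob * math.exp(-delta * s) for p, (prob, s) in parses.items()}
    z = sum(weights.values())
    return {p: w / z for p, w in weights.items()}

print(rescore(0.0))   # delta = 0: the MLE baseline distribution, unchanged
print(rescore(1.0))   # larger delta: mass shifts toward the short-dependency ("linear") structure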
145
Structural Annealing
[Axis: δ from -∞ to +∞, with the MLE baseline at δ = 0.]
Start here; train a model. Increase δ and retrain. Repeat ... until performance stops improving on a small validation dataset.
146
Grammar Induction
[Bar chart: directed attachment accuracy (20-70%) for German, English, Bulgarian, Mandarin, Turkish, and Portuguese, comparing MLE, CE (deletions & transpositions), and Structural Annealing.]
Other structural biases can be annealed. We tried annealing on connectivity (# of fragments), and got similar results.
147
A 6/9-Accurate Parse
[Figure: two dependency parses of “the gene thus can prevent a plant from fertilizing itself”: the Treebank parse vs. the parse from MLE with a locality bias.]
Errors: preposition misattachment; misattachment of the adverb “thus”; verb instead of modal as root.
These errors look like ones made by a supervised parser in 2000!
148
Accuracy Improvements

language     random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German       27.5%         50.3                     70.0                    82.6¹
English      30.3          41.6                     61.8                    90.9²
Bulgarian    30.4          45.6                     58.4                    85.9¹
Mandarin     22.6          50.1                     57.2                    84.6¹
Turkish      29.8          48.0                     62.4                    69.6¹
Portuguese   30.6          42.3                     71.8                    86.5¹

¹ CoNLL-X shared task, best system.  ² McDonald et al., 2005
149
Combining with Contrastive Estimation
This generally gives us our best results …
150
New examples of dynamic programming in NLP
Contrastive estimation for HMM and grammar induction. Uses lattice parsing …
(with Noah A. Smith)
Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Nutshell Version
Ingredients: tractable training + unannotated text + “max ent” features + sequence models, via contrastive estimation with lattice neighborhoods.
Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• “Max ent” features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)
Running example: “Red leaves don’t hide blue jays.”
Maximum Likelihood Estimation (Supervised)
[Diagram: put probability mass p on the observed pair x = “red leaves don’t hide blue jays”, y = JJ NNS MD VB JJ NNS, relative to p* over the whole space Σ* × Λ* of (sentence, tagging) pairs.]
Maximum Likelihood Estimation (Unsupervised)
[Diagram: put probability mass on the observed sentence x = “red leaves don’t hide blue jays” under any tagging (? ? ? ? ? ?), again relative to all of Σ* × Λ*.]
This is what EM does.
Focusing Probability Mass: numerator vs. denominator
Conditional Estimation (Supervised)
[Diagram: the numerator is the observed pair (“red leaves don’t hide blue jays”, JJ NNS MD VB JJ NNS); the denominator is (x) × Λ*, i.e., the same sentence under every possible tagging.]
A different denominator!
Objective Functions (Objective | Optimization Algorithm | Numerator | Denominator)
• MLE | Count & Normalize* | tags & words | Σ* × Λ*
• MLE with hidden variables | EM* | words | Σ* × Λ*
• Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
• Perceptron | Backprop | tags & words | hypothesized tags & words
• Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?
* For generative models.

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.
Language Learning (Syntax)
At last! My own language learning device! It said: “red leaves don’t hide blue jays”
EM asks: Why didn’t he say “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words?
Instead we could ask: Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...?
What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.
The Job of Syntax: “Explain why each word is necessary.”
→ DEL1WORD neighborhood of “red leaves don’t hide blue jays”:
  leaves don’t hide blue jays
  red don’t hide blue jays
  red leaves hide blue jays
  red leaves don’t blue jays
  red leaves don’t hide jays
  red leaves don’t hide blue
The Job of Syntax: “Explain the (local) order of the words.”
→ TRANS1 neighborhood of “red leaves don’t hide blue jays”:
  leaves red don’t hide blue jays
  red leaves hide don’t blue jays
  red don’t leaves hide blue jays
  red leaves don’t hide jays blue
  red leaves don’t blue hide jays
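For concreteness, a few lines of Python that enumerate these two neighborhoods explicitly (each includes the original sentence); the real system represents them compactly as lattices rather than as lists.

def del1word(words):
    """DEL1WORD: the sentence with up to one word deleted (n+1 strings)."""
    return [words] + [words[:i] + words[i + 1:] for i in range(len(words))]

def trans1(words):
    """TRANS1: the sentence with any one adjacent pair transposed (n strings)."""
    out = [words]
    for i in range(len(words) - 1):
        w = list(words)
        w[i], w[i + 1] = w[i + 1], w[i]
        out.append(w)
    return out

sent = "red leaves don't hide blue jays".split()
for neighbor in trans1(sent):
    print(" ".join(neighbor))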
[Diagram: the numerator is the observed sentence “red leaves don’t hide blue jays” summed over all taggings (? ? ? ? ? ?); the denominator is every sentence in its TRANS1 neighborhood, under any tagging, represented compactly as a lattice.]
www.dyna.org (shameless self promotion)
The New Modeling Imperative
numerator vs. denominator (“neighborhood”)
“Make the good sentence likely, at the expense of those bad neighbors.”
A good sentence hints that a set of bad ones is nearby.
This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.
Log-Linear Models
The model assigns each (x, y) a score; the normalizer is the partition function. Computing the partition function is undesirable: it sums over all possible taggings of all possible sentences!
Conditional Estimation (Supervised): sum over the taggings of 1 sentence.
Contrastive Estimation (Unsupervised): sum over the taggings of a few sentences.
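In symbols (notation here is mine, not the slides'): writing u_θ(x, y) = exp(θ · f(x, y)) for the unnormalized score and N(x_i) for the neighborhood of the i-th observed sentence, contrastive estimation maximizes

\prod_i \frac{\sum_{y} u_\theta(x_i, y)}{\sum_{x' \in N(x_i)} \sum_{y} u_\theta(x', y)}

so the denominator sums only over the neighborhood (a few sentences, encoded as a lattice) instead of over all of Σ* × Λ*.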
A Big Picture: Sequence Model Estimation
[Diagram comparing estimators on three desiderata (tractable sums, overlapping features, unannotated data): generative MLE p(x, y); log-linear conditional estimation p(y | x); generative EM p(x); log-linear MLE p(x, y); log-linear EM p(x); log-linear CE with lattice neighborhoods.]
Contrastive Neighborhoods
• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.
Neighborhoods

neighborhood        size    lattice arcs   perturbations
DEL1WORD            n+1     O(n)           delete up to 1 word
TRANS1              n       O(n)           transpose any bigram
DELORTRANS1         O(n)    O(n)           DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE     O(n²)   O(n²)          delete any contiguous subsequence
Σ* (EM)             ∞       –              replace each word with anything
The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.
A form of supervision.
Trigram Tagging Model
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
[Bar chart: tagging accuracy on ambiguous words in the Merialdo (1994) setup. Unsupervised learners (random, EM, DA (Smith & Eisner 2004), and CE with the DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH (≈ log-linear EM) neighborhoods) are compared against supervised HMM and CRF baselines and against training on 10 × data.]
• 96K words • full POS dictionary • uninformative initializer • best of 8 smoothing conditions
What if we damage the POS dictionary?
[Bar chart: tagging accuracy (all words) for DELORTRANS1, LENGTH, EM, and random as the dictionary is degraded. The dictionary includes: all words; words from the 1st half of the corpus; words with count ≥ 2; words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.]
• 96K words • 17 coarse POS tags • uninformative initializer
Trigram Tagging Model + Spelling
red leaves don’t hide blue jays
JJ NNS MD VB JJ NNS
feature set:
• tag trigrams
• tag/word pairs from a POS dictionary
• 1- to 3-character suffixes, contains hyphen, contains digit
[Bar chart: tagging accuracy (all words) with degraded POS dictionaries, comparing DELORTRANS1 and LENGTH with and without spelling features against EM and random.]
... but only with a smart neighborhood.
Log-linear spelling features aided recovery ...
The model need not be finite-state.
Unsupervised Dependency Parsing
[Bar chart: attachment accuracy for EM, LENGTH, and TRANS1 with clever vs. uninformative initializers, compared to Klein & Manning (2004).]
To Sum Up ...
Contrastive Estimation means picking your own denominator: for tractability, or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.