Spectral Learning Methods for Finite-State Machines,
with Applications to Natural Language Processing
Borja Balle, Xavier Carreras, Franco Luque, Ariadna Quattoni
September 2011
B. Balle, X. Carreras, F. Luque, A. Quattoni Spectral Learning of FSM Sept. 2011 1 / 1
Overview
Probabilistic Transducers
- Model input-output relations with hidden states
- As a conditional distribution Pr[ y | x ] over strings
- With certain independence assumptions

[Figure: graphical model with hidden states H_1, ..., H_4, inputs X_1, ..., X_4 and outputs Y_1, ..., Y_4]

- Used in many applications: NLP, biology, ...
- Hard to learn in general; usually the EM algorithm is used
Spectral Learning Probabilistic Transducers
Our contribution:
I Fast learning algorithm for probabilistic FST
I With PAC-style theoretical guarantees
I Based on Observable Operator Model for FST
I Using spectral methods (Chang ’96, Mossel-Roch ’05, Hsu et al. ’09,Siddiqi et al. ’10)
I Performing better than EM in experiments with real data
Observable Operators for FST
Deriving Observable Operator Models
Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

Pr[ y | x ] = Σ_{h ∈ H^t} Pr[ y, h | x ]            (marginalize states)
            = Σ_{h_{t+1} ∈ H} Pr[ y, h_{t+1} | x ]  (independence assumptions)
            = 1^T α_{t+1}                           (vector form, α_{t+1} ∈ R^m)
            = 1^T A^{y_t}_{x_t} α_t                 (forward-backward equations)
            = 1^T A^{y_t}_{x_t} ··· A^{y_1}_{x_1} α (induction on t)

The choice of an operator A^b_a depends only on the observable symbols
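The forward computation above is easy to sketch numerically. Below is a minimal illustration in Python with randomly generated toy parameters (the dimensions and variable names here are mine, not from the slides): operators A^b_a = T_a D_b are built from column-stochastic transitions and emission probabilities, and summing Pr[ y | x ] over all outputs for a fixed input gives 1.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, k, l = 2, 2, 2  # hidden states, input symbols, output symbols (toy sizes)

# Column-stochastic transitions T_a and emission matrix O (columns sum to 1).
T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]
O = rng.dirichlet(np.ones(l), size=m).T          # l x m
alpha = rng.dirichlet(np.ones(m))                # initial state distribution

def operator(a, b):
    """Factorized observable operator A_a^b = T_a D_b."""
    return T[a] @ np.diag(O[b, :])

def cond_prob(x, y):
    """Pr[y | x] = 1^T A_{x_t}^{y_t} ... A_{x_1}^{y_1} alpha."""
    v = alpha
    for a, b in zip(x, y):
        v = operator(a, b) @ v
    return v.sum()

# Sanity check: for a fixed input x, probabilities over all outputs sum to 1.
x = (0, 1, 0)
total = sum(cond_prob(x, y) for y in product(range(l), repeat=len(x)))
```

Since each T_a is column-stochastic and Σ_b D_b = I, the sum telescopes to 1^T T_{x_t} ··· T_{x_1} α = 1.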
Observable Operator Model Parameters
Given X = {a_1, ..., a_k}, Y = {b_1, ..., b_l}, H = {c_1, ..., c_m}, then

Pr[ y | x ] = 1^T A^{y_t}_{x_t} ··· A^{y_1}_{x_1} α with parameters:

A^b_a = T_a D_b ∈ R^{m×m}                                         (factorized operator)
T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j] ∈ R^{m×m}  (state transition)
D_b(i, j) = δ_{i,j} Pr[Y_s = b | H_s = c_j] ∈ R^{m×m}             (observation emission)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                     (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                        (initial probabilities)

The choice of an operator A^b_a depends only on observable symbols ...
... but the operator parameters are conditioned on hidden states
A Learnable Set of Observable Operators

Note that for any invertible Q ∈ R^{m×m}

Pr[ y | x ] = 1^T Q^{-1} (Q A^{y_t}_{x_t} Q^{-1}) ··· (Q A^{y_1}_{x_1} Q^{-1}) Q α

Idea (subspace identification methods for linear systems, '80s)
Find a basis for the state space such that operators in the new basis are related to observable quantities

Following multiplicity automata and spectral HMM learning ...
A Learnable Set of Observable Operators
Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities

ρ(i) = Pr[Y_1 = b_i] ∈ R^l
P(i, j) = Pr[Y_1 = b_j, Y_2 = b_i] ∈ R^{l×l}
P^b_a(i, j) = Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i | X_2 = a] ∈ R^{l×l}

Theorem (ρ, P and P^b_a are sufficient statistics)
Let P = U Σ V* be a thin SVD, then Q = U^T O yields (under certain assumptions)

Q α = U^T ρ
1^T Q^{-1} = ρ^T (U^T P)^+
Q A^b_a Q^{-1} = (U^T P^b_a)(U^T P)^+
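The theorem can be checked numerically. The sketch below (my own construction, not code from the paper) computes the exact population statistics ρ, P, P^b_a implied by a small random FST, recovers operators in the observable basis via SVD and pseudoinverse, and verifies that both parameterizations assign the same Pr[ y | x ]; it assumes the rank conditions hold, which they do generically for random parameters with l ≥ m.

```python
import numpy as np
from numpy.linalg import pinv, svd

rng = np.random.default_rng(1)
m, k, l = 2, 2, 3                      # hidden states, |X|, |Y| (needs l >= m)

T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]   # column-stochastic
O = rng.dirichlet(np.ones(l), size=m).T                       # l x m emissions
alpha = rng.dirichlet(np.ones(m))                             # initial states
pi = rng.dirichlet(np.ones(k))                                # input distribution
Tbar = sum(pi[a] * T[a] for a in range(k))                    # "mean" transition

# Exact population statistics implied by the model.
rho = O @ alpha
P = O @ Tbar @ np.diag(alpha) @ O.T
Pba = [[O @ T[a] @ np.diag(O[b, :]) @ Tbar @ np.diag(alpha) @ O.T
        for b in range(l)] for a in range(k)]

U = svd(P)[0][:, :m]                   # top-m left singular vectors of P
b1 = U.T @ rho                         # = Q alpha
binf = pinv(U.T @ P).T @ rho           # = (1^T Q^{-1})^T
B = [[(U.T @ Pba[a][b]) @ pinv(U.T @ P) for b in range(l)] for a in range(k)]

def true_prob(x, y):
    """Pr[y | x] from the original (hidden-state) parameters."""
    v = alpha
    for a, b in zip(x, y):
        v = T[a] @ np.diag(O[b, :]) @ v
    return v.sum()

def spectral_prob(x, y):
    """Pr[y | x] from the recovered observable operators."""
    v = b1
    for a, b in zip(x, y):
        v = B[a][b] @ v
    return float(binf @ v)

x, y = (0, 1, 1), (2, 0, 1)
```

The key step is that (U^T P)(U^T P)^+ = I_m when U^T P has rank m, so the basis change Q = U^T O cancels telescopically.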
Spectral Learning Algorithm
Given
- Input alphabet X and output alphabet Y
- Number of hidden states m
- Training sample S = {(x_1, y_1), ..., (x_n, y_n)}

Do
- Compute unigram ρ, bigram P and trigram P^b_a relative frequencies in S
- Perform SVD on P and take U with the top m left singular vectors
- Return operators computed using ρ, P, P^b_a and U

In Time
- O(n) to compute relative frequencies
- O(|Y|^3) to compute the SVD
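The O(n) counting step might look like this (a sketch; for simplicity it assumes every training pair has length at least 3, and the function and variable names are mine, not from the slides):

```python
import numpy as np

def estimate_statistics(S, k, l):
    """One pass over a sample S = [(x, y), ...] of aligned sequences,
    returning relative frequencies: rho for Y1, P for (Y1, Y2), and
    Pba[a, b] for (Y1, Y3) triples with Y2 = b, conditioned on X2 = a."""
    rho = np.zeros(l)
    P = np.zeros((l, l))
    Pba = np.zeros((k, l, l, l))    # [a, b, i, j] ~ Pr[Y1=j, Y2=b, Y3=i | X2=a]
    n2 = np.zeros(k)                # how often X2 = a in the sample
    for x, y in S:
        rho[y[0]] += 1
        P[y[1], y[0]] += 1
        Pba[x[1], y[1], y[2], y[0]] += 1
        n2[x[1]] += 1
    rho /= len(S)
    P /= len(S)
    for a in range(k):
        if n2[a] > 0:               # avoid dividing by zero for unseen X2 = a
            Pba[a] /= n2[a]
    return rho, P, Pba

S = [((0, 1, 0), (0, 1, 1)), ((1, 1, 0), (1, 1, 0))]   # toy aligned pairs
rho, P, Pba = estimate_statistics(S, k=2, l=2)
```

Indexing follows the convention on the previous slide: rows of P and P^b_a correspond to the later symbol, columns to the earlier one.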
PAC-Style Result
- Input distribution D_X over X* with λ = E[|X|], µ = min_a Pr[X_2 = a]
- Conditional distributions D_{Y|x} on Y* given x ∈ X*, modeled by an FST with m states (satisfying certain rank assumptions)
- Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

  n ≥ O( (λ^2 m |Y|) / (ε^4 µ σ_O^2 σ_P^4) · log(|X|/δ) )

(σ_O and σ_P are the m-th singular values of O and P in the target), then with probability at least 1 − δ the hypothesis D̂_{Y|x} satisfies

  E_X Σ_{y ∈ Y*} | D̂_{Y|X}(y) − D_{Y|X}(y) | ≤ ε

(the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ D̂_{Y|X}).
Experimental Evaluation
Synthetic Experiments
Goal: Compare against baselines when the learning assumptions hold
Target: Randomly generated with |X | = 3, |Y| = 3, |H| = 2
[Figure: L1 distance vs. number of training samples (in thousands, 32 to 32768); curves for HMM, k-HMM and FST]

- HMM: model input-output jointly
- k-HMM: one model for each input symbol
- Results averaged over 5 runs
Transliteration Experiments
Goal: Compare against EM in a real task (where modeling assumptions fail)
Task: English to Russian transliteration (brooklyn → бруклин)
[Figure: normalized edit distance vs. number of training sequences (75 to 6000); curves for Spectral (m=2, m=3) and EM (m=2, m=3)]

Training times:
  Spectral          26 s
  EM (iteration)    37 s
  EM (best)       1133 s

- Sequence alignment done in preprocessing
- Standard techniques used for inference
- Test size: 943, |X| = 82, |Y| = 34
Dependency Parsing

[Figure: dependency tree for the sentence "* John saw a new movie today"]
- Directed arcs represent dependencies between a head word and a modifier word.
- E.g.: John modifies saw, movie modifies saw, new modifies movie
Probabilistic Dependency Parsing

[Figure: incremental construction of the dependency tree for "* John saw a new movie today"]

Pr[tree] = Pr[saw | *, RIGHT]
         × Pr[John | saw, LEFT]
         × Pr[movie, today | saw, RIGHT]
         × Pr[a, new | movie, LEFT]
         × Pr[ε | movie, RIGHT] × Pr[ε | John, RIGHT] × ...
Probabilistic Dependency Parsing

- Dependency trees factor into head-modifier sequences:

  Pr[tree] = ∏_{⟨h, d, m_{1:T}⟩ ∈ y} Pr[m_{1:T} | h, d]

- State-of-the-art models further assume that:

  Pr[m_{1:T} | h, d] = ∏_{t=1}^T Pr[m_t | h, d]

- In this work:
  - We model the dynamics of modifier sequences with hidden structure
  - We use spectral methods to induce the hidden structure
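The two factorizations can be contrasted in a toy sketch (all probabilities and names below are hypothetical, chosen only for illustration): a tree is a set of ⟨head, direction, modifier-sequence⟩ triples, and Pr[tree] is the product of per-sequence probabilities, here under the first-order independence assumption.

```python
def tree_prob(tree, seq_prob):
    """Pr[tree] as a product over <head, direction, modifier-sequence> triples."""
    p = 1.0
    for head, direction, mods in tree:
        p *= seq_prob(mods, head, direction)
    return p

def first_order(cond):
    """First-order baseline: modifiers independent given (head, direction);
    cond[(m, h, d)] = Pr[m | h, d]."""
    def seq_prob(mods, h, d):
        p = 1.0
        for m in mods:
            p *= cond[(m, h, d)]
        return p
    return seq_prob

cond = {('movie', 'saw', 'R'): 0.25, ('John', 'saw', 'L'): 0.5}
tree = [('saw', 'R', ('movie',)), ('saw', 'L', ('John',))]
p = tree_prob(tree, first_order(cond))   # 0.25 * 0.5
```

Swapping `first_order` for a sequence model with hidden state (the observable operator model on the next slide) changes only `seq_prob`, not the tree-level factorization.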
Probabilistic Dependency Parsing
- Observable Operator Model:

  Pr(m_{1:T} | h, d) = α_∞^T A^{h,d}_{m_T} ··· A^{h,d}_{m_1} α_1

- Statistics for estimating the model:

  unigrams: p(i) = Pr[m_t = i]
  bigrams:  P(i, j) = Pr[m_t = j, m_{t+1} = i]
  trigrams: P^{h,d}_x(i, j) = Pr[m_{t-1} = j, m_t = x, m_{t+1} = i | h, d]
Spectral Algorithm
1. Use TRAIN to estimate p, P, and P^{h,d}_m
2. Compute a singular value decomposition of P. Take U to be the matrix of top S left singular vectors of P.
3. Create operators, for each h, d, and m:

   α_1 = U^T p                              (1)
   A^{h,d}_m = (U^T P^{h,d}_m)(U^T P)^+     (2)
   α_∞^T = p^T (U^T P)^+                    (3)
Parsing Sentences
- Task: given a sentence x_{0:n}, recover its dependency tree
- We compute

  ŷ = argmax_{y ∈ Y(x_{0:n})} ∏_{(x_h, x_m) ∈ y} µ_{h,m}

  where µ_{i,j} = Pr[(x_i, x_j) ∈ y | x_{0:n}] = Σ_{y ∈ Y(x_{0:n}) : (x_i, x_j) ∈ y} Pr[y]

- ŷ and the µ_{i,j} can be computed in O(n^3) time with variants of the CKY algorithm for context-free grammars.
Experiment 1: Spectral vs. EM
Fully unlexicalized parsing
Score curve on English development set.
Results on Multiple Languages
Fully unlexicalized parsing
Language     Det    Spectral  Det+F  Spectral+F
English      63.76  78.25     72.93  80.83
Danish       61.70  77.25     69.70  78.16
Dutch        54.30  61.32     60.90  64.01
Portuguese   71.39  85.33     81.47  85.71
Slovene      60.31  66.71     65.63  68.65
Swedish      69.09  79.55     76.94  80.54
Turkish      53.79  62.56     57.56  63.04
Conclusion
Summary of Contributions
- Fast spectral method for learning input-output OOMs
- Strong theoretical guarantees with few assumptions on the input distribution
- Outperforms previous spectral algorithms on FSTs
- New tools for inducing sequential latent structure for parsing
- Faster and better than EM in some real tasks
Future (actually, current) work
- Problem: For large alphabets, we might need unrealistically large samples to compute the necessary statistics.
  Solution: Apply smoothing techniques.
- Problem: In many real tasks symbols are not atomic (e.g. words).
  Solution: Define OOM models that handle features of the symbols (e.g. morphology, POS tags).
- Problem: Combining multiple models.
  Solution: Boosting.
Technical Assumptions
X = {a_1, ..., a_k}, Y = {b_1, ..., b_l}, H = {c_1, ..., c_m}

Parameters

T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j] ∈ R^{m×m}  (state transition)
T = Σ_a T_a Pr[X_1 = a] ∈ R^{m×m}                                 ("mean" transition matrix)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                     (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                        (initial probabilities)

Assumptions
1. l ≥ m
2. α > 0
3. rank(T) = rank(O) = m
4. min_a Pr[X_2 = a] > 0