
Spectral Learning Methods for Finite-State Machines,
with Applications to Natural Language Processing

Borja Balle, Xavier Carreras, Franco Luque, Ariadna Quattoni

September 2011

B. Balle, X. Carreras, F. Luque, A. Quattoni. Spectral Learning of FSM. Sept. 2011.


Overview

Probabilistic Transducers
- Model input-output relations with hidden states
- As conditional distribution Pr[y | x] over strings
- With certain independence assumptions

[Graphical model: a hidden-state chain H1 → H2 → H3 → H4 → ···, with an input Xt and an output Yt attached to each hidden state Ht]

- Used in many applications: NLP, biology, ...
- Hard to learn in general — usually the EM algorithm is used


Overview

Spectral Learning Probabilistic Transducers

Our contribution:

- Fast learning algorithm for probabilistic FST
- With PAC-style theoretical guarantees
- Based on an Observable Operator Model for FST
- Using spectral methods (Chang ’96, Mossel-Roch ’05, Hsu et al. ’09, Siddiqi et al. ’10)
- Performing better than EM in experiments with real data

B. Balle, X. Carreras, F. Luque, A. Quattoni Spectral Learning of FSM Sept. 2011 3 / 1



Observable Operators for FST

Deriving Observable Operator Models

Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

Pr[y | x] = Σ_{h ∈ H^t} Pr[y, h | x]                  (marginalize states)
          = Σ_{h_{t+1} ∈ H} Pr[y, h_{t+1} | x]        (independence assumptions)
          = 1^T α_{t+1}                               (vector form, α_{t+1} ∈ R^m)
          = 1^T A^{y_t}_{x_t} α_t                     (forward-backward equations)
          = 1^T A^{y_t}_{x_t} ··· A^{y_1}_{x_1} α     (induction on t)

The choice of an operator A^b_a depends only on observable symbols


Observable Operators for FST

Observable Operator Model Parameters
Given X = {a1, ..., ak}, Y = {b1, ..., bl}, H = {c1, ..., cm}, then

Pr[y | x] = 1^T A^{y_t}_{x_t} ··· A^{y_1}_{x_1} α   with parameters:

A^b_a = T_a D_b ∈ R^{m×m}   (factorized operator)
T_a(i,j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
D_b(i,j) = δ_{i,j} Pr[Y_s = b | H_s = c_j] ∈ R^{m×m}   (observation emission)
O(i,j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}   (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m   (initial probabilities)

The choice of an operator A^b_a depends only on observable symbols ...
... but the operator parameters are conditioned on hidden states
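The factorized operators above are easy to exercise on a toy model. Below is a minimal numpy sketch (the random model and all names are illustrative, not the authors' code): it builds a small FST, evaluates Pr[y | x] as a product of operators, and checks that the conditional distribution sums to 1 over all outputs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k, l, m = 2, 3, 2              # |X|, |Y|, number of hidden states

# T[a][i, j] = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j]  (columns sum to 1)
T = [rng.dirichlet(np.ones(m), size=m).T for a in range(k)]
# O[i, j] = Pr[Y_s = b_i | H_s = c_j]  (columns sum to 1)
O = rng.dirichlet(np.ones(l), size=m).T
alpha = rng.dirichlet(np.ones(m))      # alpha(i) = Pr[H_1 = c_i]

def operator(a, b):
    """Factorized observable operator A^b_a = T_a D_b."""
    return T[a] @ np.diag(O[b])

def prob(y, x):
    """Pr[y | x] = 1^T A^{y_t}_{x_t} ... A^{y_1}_{x_1} alpha."""
    v = alpha
    for a, b in zip(x, y):
        v = operator(a, b) @ v
    return float(v.sum())

# Sanity check: for a fixed input x the conditional sums to 1 over outputs,
# because sum_b D_b = I, hence sum_b A^b_a = T_a is column-stochastic.
x = (0, 1, 0)
total = sum(prob(y, x) for y in itertools.product(range(l), repeat=len(x)))
print(round(total, 10))   # 1.0
```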


Observable Operators for FST

A Learnable Set of Observable Operators

Note that for any invertible Q ∈ R^{m×m}

Pr[y | x] = 1^T Q^{-1} (Q A^{y_t}_{x_t} Q^{-1}) ··· (Q A^{y_1}_{x_1} Q^{-1}) Q α

Idea (subspace identification methods for linear systems, ’80s)
Find a basis for the state space such that operators in the new basis are related to observable quantities

Following multiplicity automata and spectral HMM learning ...
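The change-of-basis invariance can be verified numerically. In the sketch below, random matrices stand in for the operators A^{y_s}_{x_s}; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
ops = [rng.random((m, m)) * 0.3 for _ in range(4)]   # stand-ins for A^{y_s}_{x_s}
alpha = rng.dirichlet(np.ones(m))                    # initial vector
ones = np.ones(m)

# original computation: 1^T A_4 A_3 A_2 A_1 alpha
v = alpha
for A in ops:
    v = A @ v
p_orig = ones @ v

# conjugated computation: (1^T Q^{-1}) (Q A_s Q^{-1}) ... (Q alpha)
Q = rng.random((m, m)) + np.eye(m)                   # invertible with high probability
Qinv = np.linalg.inv(Q)
w = Q @ alpha
for A in ops:
    w = (Q @ A @ Qinv) @ w
p_conj = (ones @ Qinv) @ w

print(abs(p_orig - p_conj) < 1e-8)   # True
```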


Observable Operators for FST

A Learnable Set of Observable Operators
Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities

ρ(i) = Pr[Y1 = b_i] ∈ R^l
P(i,j) = Pr[Y1 = b_j, Y2 = b_i] ∈ R^{l×l}
P^b_a(i,j) = Pr[Y1 = b_j, Y2 = b, Y3 = b_i | X2 = a] ∈ R^{l×l}

Theorem (ρ, P and P^b_a are sufficient statistics)
Let P = UΣV* be a thin SVD decomposition; then Q = U^T O yields (under certain assumptions)

Q α = U^T ρ
1^T Q^{-1} = ρ^T (U^T P)^+
Q A^b_a Q^{-1} = (U^T P^b_a)(U^T P)^+


Learning Observable Operator Models

Spectral Learning Algorithm

Given
- Input alphabet X and output alphabet Y
- Number of hidden states m
- Training sample S = {(x1, y1), ..., (xn, yn)}

Do
- Compute unigram ρ, bigram P and trigram P^b_a relative frequencies in S
- Perform SVD on P and take U with the top m left singular vectors
- Return operators computed using ρ, P, P^b_a and U

In Time
- O(n) to compute relative frequencies
- O(|Y|³) to compute SVD
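The estimation step might look as follows. This is a hedged sketch: the function and variable names are made up, and the trigram statistic's conditioning on X2 = a is left unnormalized (the real statistic divides by Pr[X2 = a]).

```python
import numpy as np

def estimate(sample, k, l, m):
    """sample: list of (x, y) pairs of equal-length tuples over range(k) / range(l)."""
    n = len(sample)
    rho = np.zeros(l)                   # unigram relative frequencies
    P = np.zeros((l, l))                # bigram: P[i, j] ~ Pr[Y1 = b_j, Y2 = b_i]
    Pba = np.zeros((k, l, l, l))        # trigram counts, indexed [a, b, i, j]
    for x, y in sample:                 # one O(n) pass over the data
        rho[y[0]] += 1 / n
        if len(y) >= 2:
            P[y[1], y[0]] += 1 / n
        if len(y) >= 3:
            # NOTE: conditioning on X2 = a (division by Pr[X2 = a]) is omitted here
            Pba[x[1], y[1], y[2], y[0]] += 1 / n
    U = np.linalg.svd(P)[0][:, :m]      # O(|Y|^3) SVD; top-m left singular vectors
    return rho, P, Pba, U

sample = [((0, 1, 0), (2, 0, 1)), ((1, 0, 1), (0, 0, 2)), ((0, 0, 0), (1, 2, 2))]
rho, P, Pba, U = estimate(sample, k=2, l=3, m=2)
print(round(rho.sum(), 12), round(P.sum(), 12), U.shape)   # 1.0 1.0 (3, 2)
```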


Learning Observable Operator Models

PAC-Style Result
- Input distribution D_X over X* with λ = E[|X|], μ = min_a Pr[X2 = a]
- Conditional distributions D_{Y|x} on Y* given x ∈ X*, modeled by an FST with m states (satisfying certain rank assumptions)
- Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

n ≥ O( (λ² m |Y|) / (ε⁴ μ σ_O² σ_P⁴) · log(|X|/δ) )

(σ_O and σ_P are the mth singular values of O and P in the target), then with probability at least 1 − δ the hypothesis D̂_{Y|x} satisfies

E_X [ Σ_{y ∈ Y*} | D̂_{Y|X}(y) − D_{Y|X}(y) | ] ≤ ε

(the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ D̂_{Y|X})


Experimental Evaluation

Synthetic Experiments

Goal: Compare against baselines when the learning assumptions hold

Target: Randomly generated with |X | = 3, |Y| = 3, |H| = 2

[Plot: L1 distance vs. # training samples in thousands (32 to 32768), comparing HMM, k-HMM and FST]

- HMM: model input-output jointly
- k-HMM: one model for each input symbol
- Results averaged over 5 runs


Experimental Evaluation

Transliteration Experiments

Goal: Compare against EM in a real task (where modeling assumptions fail)

Task: English to Russian transliteration (brooklyn → бруклин)

[Plot: normalized edit distance vs. # training sequences (75 to 6000), for Spectral (m=2,3) and EM (m=2,3)]

Training times:
- Spectral: 26 s
- EM (one iteration): 37 s
- EM (best): 1133 s

- Sequence alignment done in preprocessing
- Standard techniques used for inference
- Test size: 943, |X| = 82, |Y| = 34


Dependency Parsing

Dependency Parsing

[Dependency tree over the sentence: * John saw a new movie today]

- Directed arcs represent dependencies between a head word and a modifier word.
- E.g.: John modifies saw, movie modifies saw, new modifies movie

B. Balle, X. Carreras, F. Luque, A. Quattoni Spectral Learning of FSM Sept. 2011 13 / 1

Dependency Parsing

Probabilistic Dependency Parsing

[Dependency tree built up arc by arc over the sentence: * John saw a new movie today]

Pr[tree] = Pr[saw | *, RIGHT]
         × Pr[John | saw, LEFT]
         × Pr[movie, today | saw, RIGHT]
         × Pr[a, new | movie, LEFT]
         × Pr[ε | movie, RIGHT] × Pr[ε | John, RIGHT] × ...


Dependency Parsing

Probabilistic Dependency Parsing

- Dependency trees factor into head-modifier sequences:

  Pr[tree] = ∏_{⟨h,d,m_{1:T}⟩ ∈ y} Pr[m_{1:T} | h, d]

- State-of-the-art models further assume that:

  Pr[m_{1:T} | h, d] = ∏_{t=1}^{T} Pr[m_t | h, d]

- In this work:
  - We model the dynamics of modifier sequences with hidden structure
  - We use spectral methods to induce the hidden structure
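The factorization can be made concrete with a toy example. In the sketch below, the probability tables are invented for the running sentence, and Pr[ε | h, d] is realized as an explicit STOP event, one common implementation choice not spelled out on the slides.

```python
from math import prod

# Hypothetical tables Pr[m | h, d]; STOP plays the role of epsilon.
p_mod = {
    ("*", "R"):     {"saw": 0.9, "STOP": 0.1},
    ("*", "L"):     {"STOP": 1.0},
    ("saw", "L"):   {"John": 0.8, "STOP": 0.2},
    ("saw", "R"):   {"movie": 0.5, "today": 0.3, "STOP": 0.2},
    ("movie", "L"): {"a": 0.4, "new": 0.4, "STOP": 0.2},
    ("movie", "R"): {"STOP": 1.0},
    ("John", "L"):  {"STOP": 1.0}, ("John", "R"):  {"STOP": 1.0},
    ("a", "L"):     {"STOP": 1.0}, ("a", "R"):     {"STOP": 1.0},
    ("new", "L"):   {"STOP": 1.0}, ("new", "R"):   {"STOP": 1.0},
    ("today", "L"): {"STOP": 1.0}, ("today", "R"): {"STOP": 1.0},
}

def seq_prob(mods, h, d):
    # first-order assumption: independent modifier draws, ended by a STOP event
    table = p_mod[(h, d)]
    return prod(table[m] for m in mods) * table["STOP"]

# head-modifier sequences of the tree for "* John saw a new movie today"
tree = {
    ("*", "R"):     ["saw"],
    ("*", "L"):     [],
    ("saw", "L"):   ["John"],
    ("saw", "R"):   ["movie", "today"],
    ("movie", "L"): ["a", "new"],
    ("movie", "R"): [],
    ("John", "L"):  [], ("John", "R"):  [],
    ("a", "L"):     [], ("a", "R"):     [],
    ("new", "L"):   [], ("new", "R"):   [],
    ("today", "L"): [], ("today", "R"): [],
}
p_tree = prod(seq_prob(mods, h, d) for (h, d), mods in tree.items())
print(0 < p_tree < 1)   # True
```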


Dependency Parsing

Probabilistic Dependency Parsing

- Observable Operator Model:

  Pr(m_{1:T} | h, d) = α_∞^T A^{h,d}_{m_T} ··· A^{h,d}_{m_1} α_1

- Statistics for estimating the model:

  unigrams: p(i) = Pr[m_t = i]
  bigrams:  P(i,j) = Pr[m_t = j, m_{t+1} = i]
  trigrams: P^{h,d}_x(i,j) = Pr[m_{t−1} = j, m_t = x, m_{t+1} = i | h, d]


Dependency Parsing

Spectral Algorithm

1. Use TRAIN to estimate p, P, and P^{h,d}_m

2. Compute a singular value decomposition of P.
   Take U to be the matrix of top S left singular vectors of P.

3. Create operators, for each h, d, and m:

   α_1 = U^T p                            (1)
   A^{h,d}_m = (U^T P^{h,d}_m)(U^T P)^+   (2)
   α_∞^T = p^T (U^T P)^+                  (3)


Dependency Parsing

Parsing Sentences

- Task: given a sentence x_{0:n}, recover its dependency tree
- We compute

  ŷ = argmax_{y ∈ Y(x_{0:n})} ∏_{(x_h, x_m) ∈ y} μ_{h,m}

  where

  μ_{i,j} = Pr[(x_i, x_j) ∈ y | x_{0:n}] = Σ_{y ∈ Y(x_{0:n}) : (x_i, x_j) ∈ y} Pr[y]

- ŷ and μ_{i,j} can be computed in O(n³) with variants of the CKY algorithm for context-free grammars.
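For a tiny sentence these quantities can be computed by brute force, which makes the definitions concrete. The deck's actual inference uses the O(n³) CKY-style algorithm; the tree distribution below is made up for illustration.

```python
# hypothetical distribution over dependency trees for "x0=* x1=John x2=saw";
# a tree is encoded as (head of word 1, head of word 2)
trees = {
    (2, 0): 0.7,   # John <- saw, saw <- *
    (0, 0): 0.2,   # John <- *,   saw <- *
    (2, 1): 0.1,   # John <- saw, saw <- John
}

def mu(i, j):
    """Arc marginal mu_{i,j}: total Pr[y] over trees y containing arc (head i, modifier j)."""
    return sum(p for heads, p in trees.items() if heads[j - 1] == i)

def score(heads):
    """Product of arc marginals for a candidate tree: the argmax objective above."""
    s = 1.0
    for j, i in enumerate(heads, start=1):
        s *= mu(i, j)
    return s

best = max(trees, key=score)
print(best)   # (2, 0)
```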


Dependency Parsing

Experiment 1: Spectral vs. EM

Fully unlexicalized parsing

[Plot: score curve on the English development set, Spectral vs. EM]


Dependency Parsing

Results on Multiple Languages

Fully unlexicalized parsing

Language    |   Det | Spectral | Det+F | Spectral+F
English     | 63.76 |    78.25 | 72.93 |      80.83
Danish      | 61.70 |    77.25 | 69.70 |      78.16
Dutch       | 54.30 |    61.32 | 60.90 |      64.01
Portuguese  | 71.39 |    85.33 | 81.47 |      85.71
Slovene     | 60.31 |    66.71 | 65.63 |      68.65
Swedish     | 69.09 |    79.55 | 76.94 |      80.54
Turkish     | 53.79 |    62.56 | 57.56 |      63.04


Conclusion

Summary of Contributions

- Fast spectral method for learning input-output OOM
- Strong theoretical guarantees with few assumptions on the input distribution
- Outperforming previous spectral algorithms on FST
- New tools for inducing sequential latent structure for parsing
- Faster and better than EM in some real tasks


Conclusion

Future (actually, current) work

- Problem: For large alphabets, we might need unrealistically large samples to compute the necessary statistics.
  Solution: Apply smoothing techniques.

- Problem: In many real tasks symbols are not atomic (e.g. words).
  Solution: Define OOM models that handle features of the symbols (e.g. morphology, POS tags).

- Problem: Combining multiple models.
  Solution: Boosting.


Technical Assumptions
X = {a1, ..., ak}, Y = {b1, ..., bl}, H = {c1, ..., cm}

Parameters

T_a(i,j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j] ∈ R^{m×m}   (state transition)
T = Σ_a T_a Pr[X1 = a] ∈ R^{m×m}   (“mean” transition matrix)
O(i,j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}   (collected emissions)
α(i) = Pr[H1 = c_i] ∈ R^m   (initial probabilities)

Assumptions
1. l ≥ m
2. α > 0
3. rank(T) = rank(O) = m
4. min_a Pr[X2 = a] > 0
