Page 1

Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars

Borja Balle♦ Ariadna Quattoni♥ Xavier Carreras♥

(♦) McGill University    (♥) Xerox Research Centre Europe

TUTORIAL @ EMNLP 2014

Page 2

Latest Version Available from:

http://www.lsi.upc.edu/~bballe/slides/tutorial-emnlp14.pdf

Page 3

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 4

Compositional Functions and Bilinear Operators

§ Compositional functions are defined in terms of recurrence relations
§ Consider the sequence abaccb:

    f(abaccb) = α_f(ab) · β_f(accb)
              = α_f(ab) · A_a · β_f(ccb)
              = α_f(aba) · A_c · β_f(cb)

where

§ n is the dimension of the model
§ α_f maps prefixes to R^n
§ β_f maps suffixes to R^n
§ A_a is a bilinear operator in R^{n×n}

Problem

How to estimate α_f, β_f and A_a, A_b, … from "samples" of f?

Page 5

Weighted Finite Automata (WFA)

An algebraic model for compositional functions on strings

Page 6

Weighted Finite Automata (WFA) — Example with 2 states and alphabet Σ = {a, b}

[Diagram: a 2-state automaton over states q0, q1; transition weights are the entries of A_a and A_b below, q0 has initial weight 1.0, and q1 has final weight 0.6]

Operator Representation

    α_0 = [1.0, 0.0]^T        α_∞ = [0.0, 0.6]^T

    A_a = [0.4  0.2]          A_b = [0.1  0.3]
          [0.1  0.1]                [0.1  0.1]

    f(ab) = α_0^T A_a A_b α_∞

Page 7

Weighted Finite Automata (WFA) — Notation:

§ Σ: alphabet – finite set
§ n: number of states – positive integer
§ α_0: initial weights – vector in R^n (features of the empty prefix)
§ α_∞: final weights – vector in R^n (features of the empty suffix)
§ A_σ: transition weights – matrix in R^{n×n} (∀σ ∈ Σ)

Definition: WFA with n states over Σ

    A = ⟨α_0, α_∞, {A_σ}⟩

Compositional Function: Every WFA A defines a function f_A : Σ* → R

    f_A(x) = f_A(x_1 … x_T) = α_0^T A_{x_1} ⋯ A_{x_T} α_∞ = α_0^T A_x α_∞
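To make the operator view concrete, here is a minimal sketch (mine, not code from the tutorial) that evaluates f_A(x) with NumPy, reusing the 2-state example weights from the previous slide.

```python
import numpy as np

# Two-state WFA from the example slide: alphabet {a, b}.
alpha0 = np.array([1.0, 0.0])                      # initial weights
alpha_inf = np.array([0.0, 0.6])                   # final weights
A = {"a": np.array([[0.4, 0.2], [0.1, 0.1]]),
     "b": np.array([[0.1, 0.3], [0.1, 0.1]])}

def f_A(x):
    """f_A(x) = alpha0^T A_{x_1} ... A_{x_T} alpha_inf."""
    v = alpha0
    for sigma in x:
        v = v @ A[sigma]                           # forward vector after reading sigma
    return float(v @ alpha_inf)

print(f_A("ab"))                                   # alpha0^T A_a A_b alpha_inf
```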

Page 8

Example – Hidden Markov Model

§ Assigns probabilities to strings: f(x) = P[x]
§ Emission and transition are conditionally independent given the state

    α_0^T = [0.3  0.3  0.4]        α_∞^T = [1  1  1]

    A_a = O_a · T

    T = [0  0.7   0.3 ]        O_a = [0.3  0    0  ]
        [0  0.75  0.25]              [0    0.9  0  ]
        [0  0.4   0.6 ]              [0    0    0.5]

[Diagram: a 3-state HMM with transition probabilities given by T and per-state emission probabilities (a 0.3 / b 0.7, a 0.9 / b 0.1, a 0.5 / b 0.5) given by the diagonals of O_a and O_b]

Page 9

Example – Probabilistic Transducer

§ Σ = X × Y, where X is the input alphabet and Y the output alphabet
§ Assigns conditional probabilities f(x, y) = P[y | x] to pairs (x, y) ∈ Σ*

    X = {A, B}        Y = {a, b}

    α_0^T = [0.3  0  0.7]        α_∞^T = [1  1  1]

    A^b_B = [0.2  0.4   0]
            [0    0     1]
            [0    0.75  0]

[Diagram: a 3-state transducer; arcs are labeled input/output with probabilities (e.g. A/a 0.1, A/b 0.9, B/a 0.25, B/b 0.75, A/a 0.75, B/b 1), and two states carry initial weights 0.3 and 0.7]

Page 10

Other Examples of WFA

Automata-theoretic:

§ Probabilistic Finite Automata (PFA)

§ Deterministic Finite Automata (DFA)

Dynamical Systems:

§ Observable Operator Models (OOM)

§ Predictive State Representations (PSR)

Disclaimer: All weights in R with the usual addition and multiplication (no semi-rings!)

Page 11

Applications of WFA

WFA Can Model:

§ Probability distributions: f_A(x) = P[x]
§ Binary classifiers: g(x) = sign(f_A(x) + θ)
§ Real predictors: f_A(x)
§ Sequence predictors: g(x) = argmax_y f_A(x, y) (with Σ = X × Y)

Used In Several Applications:

§ Speech recognition [Mohri, Pereira, and Riley ’08]

§ Image processing [Albert and Kari ’09]

§ OCR systems [Knight and May ’09]

§ System testing [Baier, Grosser, and Ciesinski ’09]

§ etc.

Page 12

Useful Intuitions About f_A

    f_A(x) = f_A(x_1 … x_T) = α_0^T A_{x_1} ⋯ A_{x_T} α_∞ = α_0^T A_x α_∞

§ Sum-Product: f_A(x) is a sum–product computation

    ∑_{i_0, i_1, …, i_T ∈ [n]}  α_0(i_0) ( ∏_{t=1}^{T} A_{x_t}(i_{t-1}, i_t) ) α_∞(i_T)

§ Forward-Backward: f_A(x) is the dot product between forward and backward vectors

    f_A(p · s) = (α_0^T A_p) · (A_s α_∞) = α_p · β_s

§ Compositional Features: f_A(x) is a linear model

    f_A(x) = (α_0^T A_x) · α_∞ = φ(x) · α_∞

    where φ : Σ* → R^n are compositional features (i.e. φ(xσ) = φ(x) A_σ)

Page 13

Forward–Backward Equations for A_σ

Any WFA A defines forward and backward maps α_A, β_A : Σ* → R^n such that for any splitting x = p · s one has

    α_A(p) = α_0^T A_{p_1} ⋯ A_{p_T}        β_A(s) = A_{s_1} ⋯ A_{s_{T'}} α_∞

    f_A(x) = α_A(p) · β_A(s)

Example

§ In an HMM or PFA one has, for every i ∈ [n],

    [α_A(p)]_i = P[p, h_{|p|+1} = i]        [β_A(s)]_i = P[s | h_{|p|+1} = i]

Page 14

Forward–Backward Equations for A_σ

Any WFA A defines forward and backward maps α_A, β_A : Σ* → R^n such that for any splitting x = p · s one has

    α_A(p) = α_0^T A_{p_1} ⋯ A_{p_T}        β_A(s) = A_{s_1} ⋯ A_{s_{T'}} α_∞

    f_A(x) = α_A(p) · β_A(s)

Key Observation

If f_A(pσs), α_A(p), and β_A(s) were known for many p, s, then A_σ could be recovered from equations of the form

    f_A(pσs) = α_A(p) · A_σ · β_A(s)

Hankel matrices help organize these equations!

Page 15

The Hankel Matrix

Two Equivalent Representations

§ Functional: f : Σ* → R
§ Matricial: H_f ∈ R^{Σ*×Σ*}, the Hankel matrix of f

Definition: p prefix, s suffix ⇒ H_f(p, s) = f(p · s)

Properties

§ |x| + 1 entries for f(x)
§ Depends on the ordering of Σ*
§ Captures structure

           ε   a   b   aa   ⋯
    ε    [ 0   1   0   2    ⋯ ]
    a    [ 1   2   1   3      ]
    b    [ 0   1   0   2      ]
    aa   [ 2   3   2   4      ]
    ⋮    [ ⋮                ⋱ ]

    H_f(ε, aa) = H_f(a, a) = H_f(aa, ε) = 2
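A small sketch (my own, not from the tutorial) that materializes a finite block of H_f for the function f(x) = number of a's in x, which reproduces the example table above.

```python
import numpy as np

def f(x):
    return x.count("a")                  # f(x) = number of a's in x

prefixes = ["", "a", "b", "aa"]
suffixes = ["", "a", "b", "aa"]
H = np.array([[f(p + s) for s in suffixes] for p in prefixes])

print(H)                                 # H[eps,aa] = H[a,a] = H[aa,eps] = 2
```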

Page 16

A Fundamental Theorem about WFA

Relates the rank of H_f and the number of states of a WFA computing f

Theorem [Carlyle and Paz '71, Fliess '74]

Let f : Σ* → R be any function.

1. If f = f_A for some WFA A with n states ⇒ rank(H_f) ≤ n
2. If rank(H_f) = n ⇒ there exists a WFA A with n states s.t. f = f_A

Why Fundamental?

Because the proof of (2) gives an algorithm for "recovering" A from the Hankel matrix of f_A.
Example: one can "recover" an HMM from the probabilities it assigns to sequences of observations.

Page 17

Structure of Low-rank Hankel Matrices

    H_f ∈ R^{Σ*×Σ*}        P ∈ R^{Σ*×n}        S ∈ R^{n×Σ*}

[Diagram: H_f = P S; the entry H_f(p, s) is the product of row p of P with column s of S]

    f(p_1 ⋯ p_T · s_1 ⋯ s_{T'}) = α_0^T A_{p_1} ⋯ A_{p_T} · A_{s_1} ⋯ A_{s_{T'}} α_∞
                                  └──── α_A(p) ────┘   └────── β_A(s) ──────┘

    α_A(p) = P(p, ·)        β_A(s) = S(·, s)

Page 18

Hankel Factorizations and Operators

    H_σ ∈ R^{Σ*×Σ*}        P ∈ R^{Σ*×n}        A_σ ∈ R^{n×n}        S ∈ R^{n×Σ*}

[Diagram: the entry H_σ(p, s) is the product of row p of P, the operator A_σ, and column s of S]

    f(p_1 ⋯ p_T · σ · s_1 ⋯ s_{T'}) = α_0^T A_{p_1} ⋯ A_{p_T} · A_σ · A_{s_1} ⋯ A_{s_{T'}} α_∞
                                      └──── α_A(p) ────┘           └────── β_A(s) ──────┘

    H_σ = P A_σ S   ⟹   A_σ = P^+ H_σ S^+

Note: Works with finite sub-blocks as well (assuming rank(P) = rank(S) = n)
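A quick numeric sketch (assumed setup, not the authors' code) of the identity H_σ = P A_σ S ⟹ A_σ = P^+ H_σ S^+ on a random finite sub-block.

```python
import numpy as np

n = 2
P = np.random.randn(6, n)                  # stand-in forward-vector block (rank n)
S = np.random.randn(n, 5)                  # stand-in backward-vector block (rank n)
A_sigma_true = np.random.randn(n, n)
H_sigma = P @ A_sigma_true @ S             # H_sigma = P A_sigma S

A_sigma = np.linalg.pinv(P) @ H_sigma @ np.linalg.pinv(S)
print(np.allclose(A_sigma, A_sigma_true))  # True when rank(P) = rank(S) = n
```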

Page 19

General Learning Algorithm for WFA

    Data  --(low-rank matrix estimation)-->  Hankel matrix  --(factorization and linear algebra)-->  WFA

Key Idea: The Hankel Trick

1. Learn a low-rank Hankel matrix that implicitly induces "latent" states
2. Recover the states from a decomposition of the Hankel matrix

Page 20

Limitations of WFA

Invariance Under Change of Basis

For any invertible matrix Q the following WFA are equivalent:

§ A = ⟨α_0, α_∞, {A_σ}⟩
§ B = ⟨Q^T α_0, Q^{-1} α_∞, {Q^{-1} A_σ Q}⟩

    f_A(x) = α_0^T A_{x_1} ⋯ A_{x_T} α_∞
           = (α_0^T Q)(Q^{-1} A_{x_1} Q) ⋯ (Q^{-1} A_{x_T} Q)(Q^{-1} α_∞) = f_B(x)

Example

    A_a = [0.5 0.1; 0.2 0.3]      Q = [0 1; -1 0]      Q^{-1} A_a Q = [0.3 -0.2; -0.1 0.5]

Consequences

§ There is no unique parametrization for WFA
§ Given A it is undecidable whether ∀x f_A(x) ≥ 0
§ Cannot expect to recover a probabilistic parametrization
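A quick numeric check (a sketch of mine) of the change-of-basis invariance, reusing the 2-state operators from the earlier example and the Q above.

```python
import numpy as np

alpha0 = np.array([1.0, 0.0]); alpha_inf = np.array([0.0, 0.6])
Aa = np.array([[0.4, 0.2], [0.1, 0.1]]); Ab = np.array([[0.1, 0.3], [0.1, 0.1]])
Q = np.array([[0.0, 1.0], [-1.0, 0.0]]); Qi = np.linalg.inv(Q)

f_A = alpha0 @ Aa @ Ab @ alpha_inf
f_B = (alpha0 @ Q) @ (Qi @ Aa @ Q) @ (Qi @ Ab @ Q) @ (Qi @ alpha_inf)
print(np.isclose(f_A, f_B))               # True: A and B compute the same function
```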

Page 21

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 22

Spectral Learning of Probabilistic Automata

    Data  --(low-rank matrix estimation)-->  Hankel matrix  --(factorization and linear algebra)-->  WFA

Basic Setup:

§ Data are strings sampled from a probability distribution on Σ*
§ The Hankel matrix is estimated by empirical probabilities
§ Factorization and low-rank approximation are computed using SVD

Page 23

The Empirical Hankel Matrix

Suppose S = (x_1, …, x_N) is a sample of N i.i.d. strings

Empirical distribution:

    f_S(x) = (1/N) ∑_{i=1}^{N} I[x_i = x]

Empirical Hankel matrix:

    H_S(p, s) = f_S(p·s)

Example:

    S = { aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa }

              a     b
        ε    .19   .25
    →   a    .31   .06
        b    .06   .00
        ba   .00   .13

(Hankel block with rows P = {ε, a, b, ba} and columns S = {a, b})
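A sketch (mine, not the tutorial's code) of estimating the empirical Hankel block H_S(p, s) = f_S(p·s) from the sample above; it reproduces the table up to rounding.

```python
from collections import Counter

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa",
          "ba", "b", "aa", "a", "aa", "bab", "b", "aa"]
counts, N = Counter(sample), len(sample)

def f_S(x):
    return counts[x] / N                      # empirical probability of the string x

prefixes, suffixes = ["", "a", "b", "ba"], ["a", "b"]
for p in prefixes:
    print(p or "eps", [round(f_S(p + s), 2) for s in suffixes])
```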

Page 24

Finite Sub-blocks of Hankel Matrices

Parameters:

§ Set of rows (prefixes) P ⊂ Σ*
§ Set of columns (suffixes) S ⊂ Σ*

          λ      a      b      aa      ab     ⋯
    λ    1      0.3    0.7    0.05    0.25    ⋯
    a    0.3    0.05   0.25   0.02    0.03    ⋯
    b    0.7    0.6    0.1    0.03    0.2     ⋯
    aa   0.05   0.02   0.03   0.017   0.003   ⋯
    ab   0.25   0.23   0.02   0.11    0.12    ⋯
    ⋮     ⋮      ⋮      ⋮      ⋮       ⋮      ⋱

[Diagram: H and H_a are finite sub-blocks of this infinite matrix; H_a is the block whose columns are shifted by the symbol a]

§ H for finding P and S
§ H_σ for finding A_σ
§ h_{λ,S} for finding α_0
§ h_{P,λ} for finding α_∞

Page 25

Low-rank Approximation and Factorization

Parameters:

§ Desired number of states n
§ Block H ∈ R^{P×S} of the empirical Hankel matrix

Low-rank Approximation: compute the truncated SVD of rank n

    H ≈ U_n Λ_n V_n^T        (H is P×S, U_n is P×n, Λ_n is n×n, V_n^T is n×S)

Factorization: H ≈ P S is already given by the SVD

    P = U_n Λ_n   ⇒   P^+ = Λ_n^{-1} U_n^T  ( = (H V_n)^+ )
    S = V_n^T     ⇒   S^+ = V_n

Page 26

Computing the WFA

Parameters:

§ Factorization H ≈ (U Λ) V^T
§ Hankel blocks H_σ, h_{λ,S}, h_{P,λ}

    A_σ = Λ^{-1} U^T H_σ V   ( = (H V)^+ H_σ V )

    α_0 = V^T h_{λ,S}

    α_∞ = Λ^{-1} U^T h_{P,λ}   ( = (H V)^+ h_{P,λ} )

Page 27

Computational and Statistical Complexity

Running Time:

§ Empirical Hankel matrix: O(|PS| · N)
§ SVD and linear algebra: O(|P| · |S| · n)

Statistical Consistency:

§ By the law of large numbers, H_S → E[H] as N → ∞
§ If E[H] is the Hankel matrix of some WFA A, then the learned automaton Â → A
§ Works for data coming from PFA and HMM

PAC Analysis: (assuming data from A with n states)

§ With high probability, ‖H_S − H‖ ≤ O(N^{-1/2})
§ When N ≥ O( n |Σ|² T⁴ / (ε² s_n(H)⁴) ), then

    ∑_{|x| ≤ T} |f_Â(x) − f_A(x)| ≤ ε

Proofs can be found in [Hsu, Kakade, and Zhang '09, Bailly '11, Balle '13]

Page 28

Practical Considerations

    Data  --(low-rank matrix estimation)-->  Hankel matrix  --(factorization and linear algebra)-->  WFA

Basic Setup:

§ Data are strings sampled from a probability distribution on Σ*
§ The Hankel matrix is estimated by empirical probabilities
§ Factorization and low-rank approximation are computed using SVD

Advanced Implementations:

§ Choice of parameters P and S

§ Scalable estimation and factorization of Hankel matrices

§ Smoothing and variance normalization

§ Use of prefix and substring statistics

Page 29

Choosing the Basis

Definition: The pair pP, Sq defining the sub-block is called a basis

Intuitions:

§ Basis should be chosen such that E[H] has full rank
§ P must contain strings reaching each possible state of the WFA
§ S must contain strings producing different outcomes for each pair of states in the WFA

Popular Approaches:

§ Set P = S = Σ^{≤k} for some k ≥ 1 [Hsu, Kakade, and Zhang '09]
§ Choose P and S to contain the K most frequent prefixes and suffixes in the sample [Balle, Quattoni, and Carreras '12]
§ Take all prefixes and suffixes appearing in the sample [Bailly, Denis, and Ralaivola '09]

Page 30

Scalable Implementations

Problem: When |Σ| is large, even the simplest bases become huge

Hankel Matrix Representation:

§ Use hash functions to map P (S) to row (column) indices

§ Use sparse matrix data structures because statistics are usually sparse

§ Never store the full Hankel matrix in memory

Efficient SVD Computation:

§ SVD for sparse matrices [Berry ’92]

§ Approximate randomized SVD [Halko, Martinsson, and Tropp '11]

§ On-line SVD with rank 1 updates [Brand ’06]
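A sketch of the scalable variant (assumed SciPy usage, not from the tutorial): keep the Hankel block sparse and take a truncated SVD without ever densifying it. The toy entries below are hypothetical.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical nonzero Hankel statistics at (row, col) positions.
rows, cols, vals = [0, 1, 3], [0, 1, 1], [0.19, 0.31, 0.13]
H = sp.csr_matrix((vals, (rows, cols)), shape=(4, 2))

U, s, Vt = svds(H, k=1)        # rank-k truncated SVD of a sparse matrix
```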

Page 31

Refining the Statistics in the Hankel Matrix

Smoothing the Estimates

§ Empirical probabilities f_S(x) tend to be sparse

§ Like in n-gram models, smoothing can help when Σ is large

§ Should take into account that strings in PS have different lengths

§ Open Problem: How to smooth empirical Hankels properly

Row and Column Weighting

§ More frequent prefixes (suffixes) have better estimated rows (columns)

§ Can scale rows and columns to reflect that

§ Will lead to more reliable SVD decompositions

§ See [Cohen, Stratos, Collins, Foster, and Ungar ’13] for details

Page 32

Substring Statistics

Problem: If the sample contains strings with a wide range of lengths, a small basis will ignore most of the examples

String statistics (occurrence probability):

    S = { aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a }

              a     b
        ε    .19   .06
    →   a    .06   .06
        b    .00   .06
        ba   .06   .06

Substring statistics (expected number of occurrences as a substring):

    Empirical expectation = (1/N) ∑_{i=1}^{N} [number of occurrences of x in x_i]

    S = { aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a }

              a      b
        ε    1.31   1.56
    →   a    .19    .62
        b    .56    .50
        ba   .06    .31

Page 33

Substring Statistics

Theorem [Balle, Carreras, Luque, and Quattoni ’14]

If a probability distribution f is computed by a WFA with n states, then the corresponding substring statistics are also computed by a WFA with n states

Learning from Substring Statistics

§ Can work with smaller Hankel matrices

§ But estimating the matrix takes longer
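A sketch (mine) of the substring statistic: the empirical expectation of the number of occurrences of x as a substring, computed on the sample from the previous slide.

```python
sample = ["aa", "b", "bab", "a", "bbab", "abb", "babba", "abbb",
          "ab", "a", "aabba", "baa", "abbab", "baba", "bb", "a"]

def occurrences(x, s):
    """Number of (possibly overlapping) occurrences of x as a substring of s."""
    return sum(1 for i in range(len(s) - len(x) + 1) if s[i:i + len(x)] == x)

def substring_stat(x):
    return sum(occurrences(x, s) for s in sample) / len(sample)

print(round(substring_stat("a"), 2), round(substring_stat("b"), 2))   # 1.31 1.56
```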

Page 34

Experiment: PoS-tag Sequence Models

[Plot: word error rate (%) vs. number of states, comparing spectral models with the Σ basis and bases of the k = 25, 50, 100, 300, 500 most frequent substrings against unigram and bigram baselines]

§ PTB sequences of simplified PoS tags [Petrov, Das, and McDonald 2012]
§ Configuration: expectations on frequent substrings
§ Metric: error rate on predicting the next symbol in test sequences

Page 35

Experiment: PoS-tag Sequence Models

[Plot: word error rate (%) vs. number of states, comparing the spectral method (Σ basis and k = 500 basis) against EM, unigram, and bigram baselines]

§ Comparison with a bigram baseline and EM
§ Metric: error rate on predicting the next symbol in test sequences
§ At training time, the spectral method is > 100× faster than EM

Page 36

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 37

Sequence Tagging and Transduction

§ Many applications involve pairs of input-output sequences:
  § Sequence tagging (one output tag per input token), e.g. part-of-speech tagging:

        output: NNP NNP  VBZ   NNP     .
        input:  Ms. Haag plays Elianti .

  § Transductions (sequence lengths might differ), e.g. spelling correction:

        output: a p p l e
        input:  a p   l e

§ Finite-state automata are classic methods to model these relations. Spectral methods apply naturally to this setting.

Page 38

Sequence Tagging

§ Notation:
  § Input alphabet X
  § Output alphabet Y
  § Joint alphabet Σ = X × Y

§ Goal: map input sequences to output sequences of the same length

§ Approach: learn a function f : (X × Y)* → R. Then, given an input x ∈ X^T, return

    argmax_{y ∈ Y^T} f(x, y)

  (note: this maximization is not tractable in general)

Page 39

Weighted Finite Tagger

§ Notation:
  § X × Y: joint alphabet – finite set
  § n: number of states – positive integer
  § α_0: initial weights – vector in R^n (features of the empty prefix)
  § α_∞: final weights – vector in R^n (features of the empty suffix)
  § A^b_a: transition weights – matrix in R^{n×n} (∀a ∈ X, b ∈ Y)

§ Definition: WFTagger with n states over X × Y

    A = ⟨α_0, α_∞, {A^b_a}⟩

§ Compositional Function: Every WFTagger defines a function f_A : (X × Y)* → R

    f_A(x_1 … x_T, y_1 … y_T) = α_0^T A^{y_1}_{x_1} ⋯ A^{y_T}_{x_T} α_∞ = α_0^T A^y_x α_∞

Page 40

The Spectral Method for WFTaggers

    Data  --(low-rank matrix estimation)-->  Hankel matrix  --(factorization and linear algebra)-->  WFA

§ Assume f(x, y) = P(x, y)
§ Same mechanics as for WFA, with Σ = X × Y
§ In a nutshell:
  1. Choose a set of prefixes and suffixes to define the Hankel matrix → in this case they are bistrings
  2. Estimate the Hankel matrix with prefix-suffix training statistics
  3. Factorize the Hankel matrix using SVD
  4. Compute the α and β projections, and compute the operators ⟨α_0, α_∞, {A_σ}⟩
§ Other cases:
  § f_A(x, y) = P(y | x) — see [Balle et al., 2011]
  § f_A(x, y) non-probabilistic — see [Quattoni et al., 2014]

Page 41

Prediction with WFTaggers

§ Assume f_A(x, y) = P(x, y)
§ Given x_{1:T}, compute the most likely output tag at position t:

    argmax_{a ∈ Y}  μ(t, a)

where

    μ(t, a) := P(y_t = a | x) ∝ ∑_{y : y_t = a} P(x, y)
             = ∑_{y : y_t = a} α_0^T A^y_x α_∞
             = α_0^T ( ∑_{y_1 … y_{t-1}} A^{y_{1:t-1}}_{x_{1:t-1}} ) A^a_{x_t} ( ∑_{y_{t+1} … y_T} A^{y_{t+1:T}}_{x_{t+1:T}} ) α_∞
                     └──────── α*_A(x_{1:t-1}) ────────┘          └──────── β*_A(x_{t+1:T}) ────────┘

The forward and backward vectors satisfy the recursions

    α*_A(x_{1:t}) = α*_A(x_{1:t-1}) ( ∑_{b ∈ Y} A^b_{x_t} )

    β*_A(x_{t:T}) = ( ∑_{b ∈ Y} A^b_{x_t} ) β*_A(x_{t+1:T})

Page 42

Prediction with WFTaggers (II)

§ Assume f_A(x, y) = P(x, y)
§ Given x_{1:T}, compute the most likely output bigram ab at position t:

    argmax_{a,b ∈ Y}  μ(t, a, b)

where

    μ(t, a, b) = P(y_t = a, y_{t+1} = b | x)
               ∝ α*_A(x_{1:t-1}) A^a_{x_t} A^b_{x_{t+1}} β*_A(x_{t+2:T})

§ Computing the most likely full sequence y is intractable.
  In practice, use Minimum Bayes-Risk decoding:

    argmax_{y ∈ Y^T}  ∑_t μ(t, y_t, y_{t+1})
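A sketch (an assumed implementation of the recursions above, not the authors' code) of the per-position marginals μ(t, a) used for MBR decoding.

```python
import numpy as np

def tag_marginals(x, alpha0, alpha_inf, A, tags):
    """x: input sequence; A[(b, a)]: operator for tag b on input symbol a.
    Returns mu[(t, b)] proportional to P(y_t = b | x)."""
    T = len(x)
    fwd = [alpha0]                                    # fwd[t] = alpha*_A(x_{1:t})
    for t in range(T):
        M = sum(A[(b, x[t])] for b in tags)
        fwd.append(fwd[-1] @ M)
    bwd = [alpha_inf]                                 # built from the right end
    for t in reversed(range(T)):
        M = sum(A[(b, x[t])] for b in tags)
        bwd.append(M @ bwd[-1])
    bwd = bwd[::-1]                                   # bwd[t] = beta*_A(x_{t+1:T})
    return {(t, b): float(fwd[t] @ A[(b, x[t])] @ bwd[t + 1])
            for t in range(T) for b in tags}
```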

Page 43

Finite State Transducers

[Diagram: the pair (ab, cde) aligned one-to-one as a–c, ε–d, b–e]

§ A WFTransducer evaluates aligned strings, using the empty symbol ε to produce one-to-one alignments:

    f( (a,c)(ε,d)(b,e) ) = α_0^T A^c_a A^d_ε A^e_b α_∞

§ Then a function can be defined on unaligned strings by aggregating over alignments:

    g(ab, cde) = ∑_{π ∈ Π(ab, cde)} f(π)

Page 44

Finite State Transducers: Main Problems

§ Inference: given an FST A, how to …
  § Compute g(x, y) for unaligned strings? → using edit-distance recursions
  § Compute marginal quantities μ(edge) = P(edge | x)? → also using edit-distance recursions
  § Compute the most likely y for a given x? → use MBR decoding with marginal scores

§ Unsupervised Learning: learn an FST from pairs of unaligned strings
  § Unlike EM, the spectral method cannot recover latent structure such as alignments
    (recall: alignments are needed to estimate Hankel entries)
  § See [Bailly et al., 2013b] for a solution based on Hankel matrix completion

Page 45

Spectral Learning of Tree Automata and Grammars

[Parse tree: S → NP(noun Mary) VP(verb plays, NP(det the, noun guitar))]

Some References:

§ Tree series: [Bailly et al., 2010]
§ Latent-annotated PCFG: [Cohen et al., 2012, Cohen et al., 2013b]
§ Dependency parsing: [Luque et al., 2012, Dhillon et al., 2012]
§ Unsupervised learning of WCFG: [Bailly et al., 2013a, Parikh et al., 2014]
§ Synchronous grammars: [Saluja et al., 2014]

Page 46

Compositional Functions over Trees

Consider the tree t = a[b, a[c[b, b], c]], written in bracket notation, where σ[t_1, t_2] is a node labeled σ with children t_1 and t_2, and * marks an insertion point. Its value decomposes as:

    f(t) = α_A( a[b, *] )^T β_A( a[c[b, b], c] )

         = α_A( a[b, *] )^T A_a ( β_A(c[b, b]) ⊗ β_A(c) )

         = α_A( a[b, a[*, c]] )^T A_c ( β_A(b) ⊗ β_A(b) )

Page 47

Inside-Outside Composition of Trees

[Diagram: a tree t is split at a node into an outside tree t_o (containing the foot node *) and an inside tree t_i, written t = t_o ⋄ t_i]

Note: inside-outside composition generalizes the notion of concatenation in strings, i.e., outside trees play the role of prefixes and inside trees the role of suffixes.

Page 48

Weighted Finite Tree Automata (WFTA)

An algebraic model for compositional functions on trees

Page 49

WFTA Notation (I)

Labeled Trees

§ {Σ_k} = {Σ_0, Σ_1, …, Σ_r} – ranked alphabet
§ T – space of labeled trees over some ranked alphabet

Tree:

§ t ∈ T, t = ⟨V, E, l(v)⟩: a labeled tree
§ V = {1, …, m}: the set of vertices
§ E = {⟨i, j⟩}: the set of edges, forming a tree
§ l(v) → {Σ_k}: returns the label of v (i.e. a symbol in {Σ_k})

Page 50

WFTA Notation (II)

Labeled Trees

§ {Σ_k} = {Σ_0, Σ_1, …, Σ_r} – ranked alphabet
§ T – space of labeled trees over some ranked alphabet

Leaf Trees and Inside Compositions:

    leaf tree              unary composition           binary composition
    σ ∈ Σ_0                σ ∈ Σ_1, t_1 ∈ T            σ ∈ Σ_2, t_1, t_2 ∈ T
    t = σ                  t = σ[t_1]                  t = σ[t_1, t_2]

Page 51

WFTA Notation (III)

Labeled Trees

§ {Σ_k} = {Σ_0, Σ_1, …, Σ_r} – ranked alphabet
§ T – space of labeled trees over some ranked alphabet

Useful functions (to access the nodes of a tree t):

§ r(t): returns the root node of t
§ p(t, v): returns the parent of v
§ a(t, v): returns the arity of v (number of children of v)
§ c(t, v): returns the children of v
  § if c(t, v) = [v_1, …, v_k] we use c_i(t, v) for the i-th child
  § children are assumed to be ordered from left to right

Page 52

Notation for WFTA (IV): Tensors

Kronecker product:

§ For v_1 ∈ R^n and v_2 ∈ R^n, the vector v_1 ⊗ v_2 ∈ R^{n²} contains all products between elements of v_1 and v_2
§ Example: v_1 = [a, b], v_2 = [c, d]  ⇒  v_1 ⊗ v_2 = [ac, ad, bc, bd]

Simplifying assumption:

§ We consider trees with a(t, v) ≤ 2, i.e. tensors of order at most 3 (two children per parent)

Page 53

Weighted Finite Tree Automata (WFTA)

Σ = {Σ_0, Σ_1, Σ_2}: ranked alphabet of order 2 – finite set

Definition: WFTA with n states over Σ

    A = ⟨α_*, {β_σ}, {A¹_σ}, {A²_σ}⟩

§ n: number of states – positive integer
§ α_* ∈ R^n: root weights
§ β_σ ∈ R^n: leaf weights (∀σ ∈ Σ_0)
§ A¹_σ ∈ R^{n×n}: transition weights (∀σ ∈ Σ_1)
§ A²_σ ∈ R^{n×n²}: transition weights (∀σ ∈ Σ_2)
§ Note: A²_σ is a tensor in R^{n×n×n} packed as a matrix

Page 54

WFTA: Inside Function

Definition: Any WFTA A defines an inside function β_A : T → R^n, mapping a tree to a vector in R^n:

§ if t = σ is a leaf:

    β_A(t) = β_σ

§ if t = σ[t_1] results from a unary composition:

    β_A(t) = A¹_σ β_A(t_1)

§ if t = σ[t_1, t_2] results from a binary composition:

    β_A(t) = A²_σ ( β_A(t_1) ⊗ β_A(t_2) )
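A sketch (mine) of the inside recursion, with trees encoded as nested tuples: a leaf is a string "sigma", a unary node is ("sigma", t1), and a binary node is ("sigma", t1, t2).

```python
import numpy as np

def inside(t, beta_leaf, A1, A2):
    """Compute beta_A(t); A2[sigma] is an n x n^2 matrix (packed tensor)."""
    if isinstance(t, str):                            # leaf: beta_A(sigma) = beta_sigma
        return beta_leaf[t]
    if len(t) == 2:                                   # unary: A1_sigma beta_A(t1)
        sigma, t1 = t
        return A1[sigma] @ inside(t1, beta_leaf, A1, A2)
    sigma, t1, t2 = t                                 # binary: A2_sigma (beta(t1) kron beta(t2))
    b1 = inside(t1, beta_leaf, A1, A2)
    b2 = inside(t2, beta_leaf, A1, A2)
    return A2[sigma] @ np.kron(b1, b2)

def f_A(t, alpha_star, beta_leaf, A1, A2):
    return float(alpha_star @ inside(t, beta_leaf, A1, A2))
```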

Page 55

WFTA Function:

Every WFTA A defines a function f_A : T → R computed as

    f_A(t) = α_*^T β_A(t)

Page 56

Weighted Finite Tree Automaton (WFTA)

Example of inside computation for the tree t = a[b, a[c[b], c]]:

    f_A(t) = α_*^T A²_a( β_b ⊗ A²_a( A¹_c(β_b) ⊗ β_c ) )

Page 57

Useful Intuition: Latent-variable Models as WFTA

    f_A(t) = α_*^T β_A(t)

§ Each labeled node v is decorated with a latent variable h_v ∈ [n]
§ f_A(t) is a sum–product computation:

    ∑_{h_1,…,h_|V| ∈ [n]}  α_*(h_root)  ∏_{v : a(v)=0} β_{l(v)}[h_v]  ∏_{v : a(v)=1} A¹_{l(v)}[h_v, h_{c(t,v)}]  ∏_{v : a(v)=2} A²_{l(v)}[h_v, h_{c_1(t,v)}, h_{c_2(t,v)}]

§ f_A(t) is a linear model in the latent space defined by β_A : T → R^n:

    f_A(t) = ∑_{i=1}^{n} α_*[i] β_A(t)[i]

Page 58

Inside/Outside Decomposition

[Diagram: a tree t with a marked node v, split into the outside tree t\v and the inside tree t[v]]

Consider a tree t and one node v:

§ Inside tree t[v]: the subtree of t rooted at v; t[v] ∈ T
§ Outside tree t\v: the rest of t when t[v] is removed
  § T_*: the space of outside trees, i.e. t\v ∈ T_*
  § Foot node *: a tree insertion point (a special symbol * ∉ {Σ_k})
  § An outside tree has exactly one foot node among its leaves

Page 59

Inside/Outside Composition

[Diagram: a tree formed by plugging an inside tree into the foot node of an outside tree, t = t_o ⋄ t_i]

§ A tree is formed by composing an outside tree with an inside tree → generalizes prefix/suffix concatenation in strings
§ There are multiple ways to decompose a full tree into inside/outside trees → as many as there are nodes in the tree

Page 60

Outside Trees

§ Outside trees t_* ∈ T_* are defined recursively using compositions:

    foot node          unary composition              binary composition
                       t_o ∈ T_*, σ ∈ Σ_1             t_o ∈ T_*, σ ∈ Σ_2, t_i ∈ T

    t_* = *            t_* = t_o ⋄ σ[*]               t_* = t_o ⋄ σ[*, t_i]   or   t_* = t_o ⋄ σ[t_i, *]

Page 61

WFTA: Outside Function

Definition: Any WFTA A defines an outside function α_A : T_* → R^n, mapping an outside tree to a vector in R^n:

§ if t_* = * is a foot node:

    α_A(t_*) = α_*

§ if t_* = t_o ⋄ σ[*] results from a unary composition:

    α_A(t_*)^T = α_A(t_o)^T A¹_σ

§ if t_* = t_o ⋄ σ[t_i, *] results from a binary composition:

    α_A(t_*)^T = α_A(t_o)^T A²_σ ( β_A(t_i) ⊗ I_n )

  (note: the expression for t_* = t_o ⋄ σ[*, t_i] is analogous)

Page 62

WFTA are fully compositional

[Diagram: a tree split into an outside tree t_o and an inside tree t_i, t = t_o ⋄ t_i]

For any inside-outside decomposition of a tree:

    f_A(t) = α_A(t_o)^T β_A(t_i)                           (let t = t_o ⋄ t_i)
           = α_A(t_o)^T A²_σ ( β_A(t_1) ⊗ β_A(t_2) )       (let t_i = σ[t_1, t_2])

Consequences:

§ We can isolate the α_A and β_A vector spaces
§ Given α_A and β_A, we can isolate the operators A^k_σ

Page 63

Hankel Matrices of Functions over Labeled Trees

Two Equivalent Representations

§ Functional: f_A : T → R
§ Matricial: H_{f_A} ∈ R^{T_*×T} (the Hankel matrix of f_A), with rows indexed by outside trees and columns by inside trees

[Table: an example block with rows indexed by outside trees (*, a[*], b[*], a[*, b], …) and columns by inside trees (a, b, a[a], …)]

§ Definition: H(t_o, t_i) = f(t_o ⋄ t_i)
§ Sub-block for σ: H_σ(t_o, σ[t_1, t_2]) = f(t_o ⋄ σ[t_1, t_2])
§ Properties:
  § |V| + 1 entries for f(t)
  § Depends on the ordering of T_* and T
  § Captures structure

Page 64

A Fundamental Theorem about WFTA

Relates the rank of H_f and the number of states of a WFTA computing f

Let f : T → R be any function over labeled trees.

1. If f = f_A for some WFTA A with n states ⇒ rank(H_f) ≤ n
2. If rank(H_f) = n ⇒ there exists a WFTA A with n states s.t. f = f_A

Why Fundamental?

The proof of (2) gives an algorithm for "recovering" A from the Hankel matrix of f_A

Page 65

Structure of Low-rank Hankel Matrices

    H_f ∈ R^{T_*×T}        O ∈ R^{T_*×n}        I ∈ R^{n×T}

[Diagram: H_f = O I; the entry H_f(t_o, t_i) is the product of row t_o of O with column t_i of I]

    f(t_o ⋄ t_i) = α_A(t_o)^T β_A(t_i)

    α_A(t_o) = O(t_o, ·)        β_A(t_i) = I(·, t_i)

Page 66

Hankel Factorizations and Operators

    H_σ ∈ R^{T_*×T}        O ∈ R^{T_*×n}        A²_σ ∈ R^{n×n²}        I ∈ R^{n×T}

[Diagram: the entry H_σ(t_o, σ[t_1, t_2]) is the product of row t_o of O, the operator A²_σ, and the Kronecker product of the columns of I for t_1 and t_2]

    f(t_o ⋄ σ[t_1, t_2]) = α_A(t_o)^T A²_σ ( β_A(t_1) ⊗ β_A(t_2) )
             └─ t_i ─┘                     └────── β_A(t_i) ──────┘

Page 67

Hankel Factorizations and Operators

    H_σ = O A²_σ (I ⊗ I)   ⟹   A²_σ = O^+ H_σ (I ⊗ I)^+

Note: Works with finite sub-blocks as well (assuming rank(O) = rank(I) = n)

Page 68

WFTA: Application to Parsing

[Parse tree: S → NP(noun Mary) VP(verb plays, NP(det the, noun guitar))]

Some intuitions:

§ Derivation = Labeled Tree
§ Learning compositional functions over derivations ⟹ learning functions over trees
§ We are interested in functions computed by WFTA

Page 69

WFTA for Parsing: Key Questions

§ What is the latent state representing?
  § For example: latent real-valued embeddings of words and phrases

§ What form of supervision do we get?
  § Full derivations (labeled trees), i.e., supervised learning of latent-variable grammars
  § Derivation skeletons (unlabeled trees), e.g. [Pereira and Schabes, 1992]
  § Yields from the grammar (only the leaves), i.e., grammar induction

Page 70

Parsing and Tree Automaton

[Parse tree: S → NP(Mary) VP(plays, NP(the guitar))]

Bottom-up inside computation:

    β_Mary, β_plays, β_the, β_guitar
    A_NP[β_Mary]
    A_NP[β_the ⊗ β_guitar]
    A_VP[β_plays ⊗ A_NP[β_the ⊗ β_guitar]]
    A_S[A_NP[β_Mary] ⊗ A_VP[β_plays ⊗ A_NP[β_the ⊗ β_guitar]]]

    f_A(t) = α_*^T A_S[A_NP[β_Mary] ⊗ A_VP[β_plays ⊗ A_NP[β_the ⊗ β_guitar]]]

Page 71

Phrase Embeddings using WFTA

Assume a WCFG in Chomsky Normal Form

§ n – number of states, i.e. the dimensionality of the embedding
§ Ranked alphabet:
  § Σ_0 = {the, Mary, plays, …} – terminal words
  § Σ_1 = {noun, verb, det, NP, VP, …} – unary non-terminals
  § Σ_2 = {S, NP, VP, …} – binary non-terminals
§ α_* – root weights
§ {β_w} for all w ∈ Σ_0 – word embeddings
§ {A¹_{N_1}} for all N_1 ∈ Σ_1 – compute phrase embeddings
§ {A²_{N_2}} for all N_2 ∈ Σ_2 – compute phrase embeddings

Page 72

Phrase Embeddings using WFTA

§ A = ⟨α_*, {β_w}, {A¹_{N_1}}, {A²_{N_2}}⟩ – WFTA
§ f_A(t) = α_*^T β_A(t) – scores a derivation t
§ β_A(t) – the n-dimensional embedding of derivation t

Page 73

Spectral Learning Algorithm for WFTA

Assume A is stochastic – i.e. it computes a distribution over derivations

§ General Algorithm:
  § Choose a basis – i.e. a set of inside and outside trees
  § Estimate their empirical probabilities from a sample of derivations
  § Compute H and {H_N} – one block for each non-terminal N
  § Perform SVD on H
  § Recover the parameters of A using the WFTA theorem
  § Note: We can also use sub-tree statistics

Page 74

Spectral Learning of Tree Automata

§ WFTA are a general algebraic framework for compositional functions

§ WFTA can exploit real-valued embeddings

§ There are simple algorithms for learning WFTAs from samples

Page 75

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 76

Learning WFA in More General Settings

    Data  --(low-rank matrix estimation)-->  Hankel matrix  --(factorization and linear algebra)-->  WFA

Question: How do we use this approach to learn f : Σ* → R when f(x) does not have a probabilistic interpretation?

Examples:

§ Classification: f : Σ* → {+1, −1}
§ Unconstrained real-valued predictions: f : Σ* → R
§ General scoring functions for tagging: f : (Σ × Δ)* → R

Page 77

Example: Hankel Matrices with Missing Entries

When learning probabilistic functions, entries in H_f are estimated from empirical counts, e.g. f(x) = P[x]:

    { aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa }

              a     b
        ε    .19   .25
    →   a    .31   .06
        b    .06   .00
        ba   .00   .13

But in a general regression setting, entries in H_f are labels observed in the sample, and many may be missing:

    { (bab,1), (bbb,0), (aaa,3), (a,1), (ab,1), (aa,2), (aba,2), (bb,0) }

              ε   a   b
        a    1   2   1
        b    ?   ?   0
    →   aa   2   3   ?
        ab   1   2   ?
        ba   ?   ?   1
        bb   0   ?   0

Page 78

Inference of Hankel Matrices

Goal: Learn a Hankel matrix H ∈ R^{P×S} from partial information, then apply the Hankel trick

Information Models:

§ Subset of entries: {H(p, s) | (p, s) ∈ I}
§ Linear measurements: {Hv | v ∈ V}
§ Bilinear measurements: {u^T H v | u ∈ U, v ∈ V}
§ Constraints between entries: {H(p, s) ≥ H(p', s') | (p, s, p', s') ∈ I}
§ Noisy versions of all the above

Constraints and Inductive Bias:

§ Hankel constraints: H(p, s) = H(p', s') if p·s = p'·s'
§ Constraints on entries: |H(p, s)| ≤ C
§ Low-rank constraints / regularization via rank(H)

Page 79

Empirical Risk Minimization Approach

Data: {(x_i, y_i)}_{i=1}^{N}, with x_i ∈ Σ* and y_i ∈ R

Parameters:

§ Rows and columns P, S ⊂ Σ*
§ (Convex) loss function ℓ : R × R → R_+
§ Regularization parameter λ / rank bound R

Optimization (constrained formulation):

    argmin_{H ∈ R^{P×S}}  (1/N) ∑_{i=1}^{N} ℓ(y_i, H(x_i))    subject to  rank(H) ≤ R

Optimization (regularized formulation):

    argmin_{H ∈ R^{P×S}}  (1/N) ∑_{i=1}^{N} ℓ(y_i, H(x_i)) + λ rank(H)

Note: These optimization problems are non-convex!

Page 80

Nuclear Norm Relaxation

Nuclear norm: for a matrix M, ‖M‖_* = ∑_i s_i(M), the sum of its singular values

In machine learning, minimizing the nuclear norm is a commonly used convex surrogate for minimizing the rank

Convex optimization for Hankel matrix estimation:

    argmin_{H ∈ R^{P×S}}  (1/N) ∑_{i=1}^{N} ℓ(y_i, H(x_i)) + λ ‖H‖_*
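A sketch (mine, assuming a squared loss on the observed entries) of one proximal-gradient step for this objective, using singular value thresholding as the prox of the nuclear norm.

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * ||.||_*: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_grad_step(H, observed, lam, eta):
    """observed: dict (i, j) -> y_ij.  One step on
    (1/N) sum (y - H[i,j])^2 + lam * ||H||_*  with step size eta."""
    G = np.zeros_like(H)
    for (i, j), y in observed.items():
        G[i, j] = 2.0 * (H[i, j] - y) / len(observed)
    return svt(H - eta * G, eta * lam)
```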

Page 81

Optimization Algorithms for Hankel Matrix Estimation

Optimizing the Nuclear Norm Surrogate

§ Projected/proximal sub-gradient (e.g. [Duchi and Singer '09])
§ Frank–Wolfe [Jaggi and Sulovský '10]
§ Singular value thresholding [Cai, Candès, and Shen '10]

Non-Convex "Heuristics"

§ Alternating minimization (e.g. [Jain, Netrapalli, and Sanghavi '13])

Page 82

Applications of Hankel Matrix Estimation

§ Max-margin taggers [Quattoni, Balle, Carreras, and Globerson ’14]

§ Unsupervised transducers [Bailly, Quattoni, and Carreras ’13]

§ Unsupervised WCFG [Bailly, Carreras, Luque, Quattoni ’13]

Page 83

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 84

Conclusion

§ Spectral methods provide new tools to learn compositional functions by means of algebraic operations

§ Key result: forward-backward recursions ⟺ low-rank Hankel matrices

§ Applicable to a wide range of compositional formalisms: finite-state automata and transducers, context-free grammars, …

§ Relation to loss-regularized methods, by means of matrix-completion techniques

Page 85

Spectral Learning Techniques for Weighted Automata, Transducers, and Grammars

Borja Balle♦ Ariadna Quattoni♥ Xavier Carreras♥

(♦) McGill University    (♥) Xerox Research Centre Europe

TUTORIAL @ EMNLP 2014

Page 86

Outline

1. Weighted Automata and Hankel Matrices

2. Spectral Learning of Probabilistic Automata

3. Spectral Methods for Transducers and Grammars: Sequence Tagging, Finite-State Transductions, Tree Automata

4. Hankel Matrices with Missing Entries

5. Conclusion

6. References

Page 87

Anandkumar, A., Chaudhuri, K., Hsu, D., Kakade, S. M., Song, L.,and Zhang, T. (2011).Spectral methods for learning multivariate latent tree structure.In NIPS.

Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y.(2012a).A spectral algorithm for latent Dirichlet allocation.In NIPS.

Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. (2013a).A tensor spectral approach to learning mixed membership communitymodels.In COLT.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M.(2012b).Tensor decompositions for learning latent variable models.CoRR, abs/1210.7559.

Anandkumar, A., Hsu, D., Huang, F., and Kakade, S. M. (2012c).Learning mixtures of tree graphical models.

Page 88

In NIPS.

Anandkumar, A., Hsu, D., and Kakade, S. M. (2012d).A method of moments for mixture models and hidden markov models.In 25th Annual Conference on Learning Theory.

Anandkumar, A., Hsu, D., and Kakade, S. M. (2012e).A method of moments for mixture models and hidden Markov models.In COLT.

Anandkumar, A., Hsu, D., and Kakade, S. M. (2013b).Tutorial: Tensor decomposition algorithms for latent variable modelestimation.In ICML.

Bailly, R. (2011).Quadratic weighted automata: Spectral algorithm and likelihoodmaximization.In ACML.

Bailly, R., Carreras, X., Luque, F., and Quattoni, A. (2013a).Unsupervised spectral learning of WCFG as low-rank matrixcompletion.

Page 89

In EMNLP.

Bailly, R., Carreras, X., and Quattoni, A. (2013b).Unsupervised spectral learning of finite state transducers.In NIPS.

Bailly, R., Denis, F., and Ralaivola, L. (2009).Grammatical inference as a principal component analysis problem.In ICML.

Bailly, R., Habrard, A., and Denis, F. (2010).A spectral approach for probabilistic grammatical inference on trees.In ALT.

Balle, B. (2013).Learning Finite-State Machines: Algorithmic and Statistical Aspects.PhD thesis, Universitat Politecnica de Catalunya.

Balle, B., Carreras, X., Luque, F., and Quattoni, A. (2013).Spectral learning of weighted automata: A forward-backwardperspective.Machine Learning.

Page 90

Balle, B. and Mohri, M. (2012).Spectral learning of general weighted automata via constrained matrixcompletion.In NIPS.

Balle, B., Quattoni, A., and Carreras, X. (2011).A spectral learning algorithm for finite state transducers.In ECML-PKDD.

Balle, B., Quattoni, A., and Carreras, X. (2012).Local loss optimization in operator models: A new insight intospectral learning.In ICML.

Berry, M. W. (1992).Large-scale sparse singular value computations.

Boots, B. and Gordon, G. (2011).An online spectral learning algorithm for partially observabledynamical systems.In Association for the Advancement of Artificial Intelligence.

Page 91

Boots, B., Gretton, A., and Gordon, G. (2013).Hilbert space embeddings of predictive state representations.In UAI.

Boots, B., Siddiqi, S., and Gordon, G. (2009).Closing the learning-planning loop with predictive staterepresentations.In Proceedings of Robotics: Science and Systems VI.

Boots, B., Siddiqi, S., and Gordon, G. (2011).Closing the learning planning loop with predictive staterepresentations.International Journal of Robotics Research.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011).Distributed optimization and statistical learning via the alternatingdirection method of multipliers.Foundations and Trends in Machine Learning.

Brand, M. (2006).Fast low-rank modifications of the thin singular value decomposition.Linear algebra and its applications, 415(1):20–30.

Page 92

Carlyle, J. W. and Paz, A. (1971).Realizations by stochastic finite automata.Journal of Computer Systems Science.

Chaganty, A. T. and Liang, P. (2013).Spectral experts for estimating mixtures of linear regressions.In ICML.

Cohen, S. B., Collins, M., Foster, D. P., Stratos, K., and Ungar, L.(2013a).Tutorial: Spectral learning algorithms for natural language processing.In NAACL.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L.(2012).Spectral learning of latent-variable PCFGs.ACL.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L.(2013b).Experiments with spectral learning of latent-variable PCFGs.In NAACL-HLT.

Page 93

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L.(2014).Spectral learning of latent-variable PCFGs: Algorithms and samplecomplexity.Journal of Machine Learning Research.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977).Maximum likelihood from incomplete data via the EM algorithm.Journal of the Royal Statistical Society.

Denis, F. and Esposito, Y. (2008).On rational stochastic languages.Fundamenta Informaticae.

Dhillon, P. S., Rodu, J., Collins, M., Foster, D. P., and Ungar, L. H.(2012).Spectral dependency parsing with latent variables.In EMNLP-CoNLL.

Dupont, P., Denis, F., and Esposito, Y. (2005).Links between probabilistic automata and hidden Markov models:probability distributions, learning models and induction algorithms.

Page 94

Pattern Recognition.

Fliess, M. (1974).Matrices de Hankel.Journal de Mathematiques Pures et Appliquees.

Foster, D. P., Rodu, J., and Ungar, L. H. (2012).Spectral dimensionality reduction for HMMs.CoRR, abs/1203.6130.

Gordon, G. J., Song, L., and Boots, B. (2012).Tutorial: Spectral approaches to learning latent variable models.In ICML.

Halko, N., Martinsson, P., and Tropp, J. (2011).Finding structure with randomness: Probabilistic algorithms forconstructing approximate matrix decompositions.SIAM Review, 53(2):217–288.

Hamilton, W., Fard, M., and Pineau, J. (2013a).Modelling sparse dynamical systems with compressed predictive staterepresentations.

Page 95

In ICML.

Hamilton, W. L., Fard, M. M., and Pineau, J. (2013b).Modelling sparse dynamical systems with compressed predictive staterepresentations.In Proceedings of the 30th International Conference on MachineLearning.

Hsu, D. and Kakade, S. M. (2013).Learning mixtures of spherical Gaussians: moment methods andspectral decompositions.In ITCS.

Hsu, D., Kakade, S. M., and Zhang, T. (2009).A spectral algorithm for learning hidden Markov models.In COLT.

Hulden, M. (2012).Treba: Efficient numerically stable EM for PFA.In ICGI.

Jaeger, H. (2000).Observable operator models for discrete stochastic time series.

Page 96

Neural Computation, 12(6):1371–1398.

Littman, M., Sutton, R. S., and Singh, S. (2002).Predictive representations of state.In Advances In Neural Information Processing Systems.

Luque, F., Quattoni, A., Balle, B., and Carreras, X. (2012).Spectral learning in non-deterministic dependency parsing.In EACL.

Marcus, M., Marcinkiewicz, M., and Santorini, B. (1993).Building a large annotated corpus of english: The Penn Treebank.Computational linguistics, 19(2):313–330.

Parikh, A., Song, L., Ishteva, M., Teodoru, G., and Xing, E. (2012).A spectral algorithm for latent junction trees.In UAI.

Parikh, A. P., Cohen, S. B., and Xing, E. (2014).Spectral unsupervised parsing with additive tree metrics.In Proceedings of ACL.

Parikh, A. P., Song, L., and Xing, E. (2011).

Page 97

A spectral algorithm for latent tree graphical models.In ICML.

Pereira, F. and Schabes, Y. (1992).Inside-outside reestimation from partially bracketed corpora.In Proceedings of the 30th annual meeting on Association forComputational Linguistics, pages 128–135. Association forComputational Linguistics.

Quattoni, A., Balle, B., Carreras, X., and Globerson, A. (2014).Spectral regularization for max-margin sequence tagging.In ICML.

Rabiner, L. R. (1990).A tutorial on hidden markov models and selected applications inspeech recognition.In Waibel, A. and Lee, K., editors, Readings in speech recognition,pages 267–296.

Recasens, A. and Quattoni, A. (2013).Spectral learning of sequence taggers over continuous sequences.In ECML-PKDD.

Page 98

Rosencrantz, M., Gordon, G., and Thrun, S. (2004).Learning low dimensional predictive representations.In Proceedings of the 21st International Conference on Machinelearning.

Saluja, A., Dyer, C., and Cohen, S. B. (2014).Latent-variable synchronous CFGs for hierarchical translation.Proceedings of EMNLP.

Siddiqi, S. M., Boots, B., and Gordon, G. (2010).Reduced-rank hidden Markov models.In AISTATS.

Singh, S., James, M., and Rudary, M. (2004).Predictive state representations: a new theory for modeling dynamicalsystems.In Proceedings of the 20th Conference on Uncertainty in ArtificialIntelligence.

Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. (2010).Hilbert space embeddings of hidden Markov models.

Page 99

In ICML.

Song, L., Ishteva, M., Parikh, A., Xing, E., and Park, H. (2013).Hierarchical tensor decomposition of latent tree graphical models.In ICML.

Stratos, K., Rush, A. M., Cohen, S. B., and Collins, M. (2013).Spectral learning of refinement hmms.In CoNLL.

Verwer, S., Eyraud, R., and Higuera, C. (2012).Results of the PAutomaC probabilistic automaton learningcompetition.In ICGI.

Wiewiora, E. (2007).Modeling probability distributions with predictive staterepresentations.PhD thesis, University of California at San Diego.