Learning Weighted Automata
Mehryar Mohri (Courant Institute & Google Research)
Joint work with Borja Balle (Amazon Research)
LICS 2017 - Mohri
Weighted Automata (WFAs)

[Figure: example of a weighted finite automaton with weighted transitions, initial and final weights.]
Motivation

Weighted automata (WFAs) appear in:
• image processing (Kari, 1993).
• automatic speech recognition (MM, Pereira, Riley, 1996, 2008).
• speech synthesis (Sproat, 1995; Allauzen, MM, Riley, 2004).
• machine translation (e.g., Iglesias et al., 2011).
• many other NLP tasks (very long list of refs).
• bioinformatics (Durbin et al., 1998).
• optical character recognition (Breuel, 2008).
• model checking (Baier et al., 2009; Aminof et al., 2011).
• machine learning (Cortes, Kuznetsov, MM, Warmuth, 2015).
Motivation

Theory: rational power series, extensively studied (Eilenberg, 1993; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986; Berstel and Reutenauer, 1988).

Algorithms (see survey chapter: MM, 2009): rational operations, intersection or composition, epsilon-removal, determinization, minimization, disambiguation.
Learning Automata

Classical results:
• passive learning (Gold, 1978; Angluin, 1978; Pitt and Warmuth, 1993).
• active learning (model with membership and equivalence queries) (Angluin, 1987; Bergadano and Varricchio, 1994, 1996, 2000).

Spectral learning:
• algorithms (Hsu et al., 2009; Bailly et al., 2009; Balle and MM, 2012).
• natural language processing (Balle et al., 2014).
• reinforcement learning (Boots et al., 2009; Hamilton et al., 2013).
Learning Guarantees

Existing analyses:
• (Hsu et al., 2009; Denis et al., 2016): statistical consistency, finite-sample guarantees in the realizable case.
• (Balle and MM, 2012): algorithm-dependent finite-sample guarantee based on a stability analysis.
• (Kulesza et al., 2014): algorithm-dependent guarantees under a distributional assumption (data drawn from some WFA).

Can we derive general theoretical guarantees for learning WFAs?
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Learning Scenario

Training data: sample $S = \big((x_1, y_1), \ldots, (x_m, y_m)\big)$ drawn i.i.d. from $X \times Y$ according to some distribution $D$.

Problem: find a WFA $A$ in a hypothesis set $H$ with small expected loss
$$L(A) = \mathbb{E}_{(x,y) \sim D}\big[L(A(x), y)\big].$$

• note: the problem is not assumed realizable (the distribution need not be generated by a probabilistic WFA).
Empirical Rademacher Complexity

Definition:
• $G$: family of functions mapping from a set $Z$ to $[a, b]$.
• sample $S = (z_1, \ldots, z_m)$.
• $\sigma_i$s (Rademacher variables): independent uniform random variables taking values in $\{-1, +1\}$.

The empirical Rademacher complexity measures the correlation of the class with random noise:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big].$$

The Rademacher complexity of $G$ is its expectation over samples:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$
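The definition above can be illustrated numerically: for a small finite class, the empirical Rademacher complexity can be estimated by Monte Carlo over the $\sigma$ vectors. This is a toy sketch (the class $G$ and the sample below are made-up illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample of m points in Z = R, and a small finite class G of three functions.
S = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
G = [np.sign, np.tanh, lambda z: np.zeros_like(z)]

m = len(S)
values = np.stack([g(S) for g in G])          # shape (|G|, m): entries g(z_i)

# hat{R}_S(G) = E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ], estimated by sampling.
sigmas = rng.choice([-1.0, 1.0], size=(100_000, m))
sups = (sigmas @ values.T / m).max(axis=1)    # sup over g, for each sigma draw
rad = sups.mean()

# Since the zero function is in G, the sup is always >= 0, so rad >= 0;
# all functions take values in [-1, 1], so rad <= 1.
print(f"estimated empirical Rademacher complexity: {rad:.3f}")
```

The zero function pins the supremum at zero or above, which is why adding it makes the estimate a clean nonnegative quantity.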
Rademacher Complexity Bound

Theorem (Koltchinskii and Panchenko, 2002; MM et al., 2012): let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in G$:
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\mathfrak{R}_m(G) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]$.
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Learning Automata

Classical formulation:
• sample $S = \big((x_1, y_1), \ldots, (x_m, y_m)\big) \in (\Sigma^* \times \{0, 1\})^m$.
• find the smallest automaton $A$ consistent with the sample:
$$\min_A \|A\|_0 \quad \text{s.t.} \quad \forall i \in [m],\ A(x_i) = y_i.$$
• NP-complete problem (Gold, 1978; Angluin, 1978); even polynomial approximation is NP-hard (Pitt and Warmuth, 1993).

Not the right formulation.
Analogy: Linear Classifiers

Sparse learning formulation:
$$\min_{w \in \mathbb{R}^N} \|w\|_0 \quad \text{s.t.} \quad Aw = b.$$
• non-convex optimization problem.
• NP-hard problem.

Not the right formulation; use an alternative norm instead (e.g. norm-1).
Questions

What is the appropriate norm to use for learning WFAs?

Which hypothesis sets should we consider?
• description in terms of the Hankel matrix.
• description in terms of transition matrices.
• description in terms of a function norm.
WFA - Definition

A WFA $A$ over a semiring $(\mathbb{S}, \oplus, \otimes, 0, 1)$ and alphabet $\Sigma$ with a finite set of states $Q_A$ is defined by
• an initial weight vector $\alpha_A \in \mathbb{S}^{Q_A}$;
• a final weight vector $\beta_A \in \mathbb{S}^{Q_A}$;
• transition weight matrices $A_a \in \mathbb{S}^{Q_A \times Q_A}$, $a \in \Sigma$.

Function defined: for any $x = x_1 \cdots x_k \in \Sigma^*$,
$$A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k}\, \beta_A.$$

Notation: $A_x = A_{x_1} \cdots A_{x_k}$.
WFA - Illustration

[Figure: a three-state WFA over $\{a, b\}$; each state is labeled with its state number, initial weight, and final weight.]

$$\alpha_A = \begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix}, \qquad
\beta_A = \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}, \qquad
A_a = \begin{bmatrix} 0 & 0 & 3 \\ 0 & 0 & 3 \\ 1 & 0 & 0 \end{bmatrix}, \qquad
A_b = \begin{bmatrix} 0 & 1 & 0 \\ 2 & 0 & 0 \\ 0 & 0 & 4 \end{bmatrix}.$$
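The formula $A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k} \beta_A$ translates directly into code. The following minimal sketch (plain NumPy, using the weights of the example above over the real semiring; it is an illustration, not the talk's code) evaluates the example WFA on a string:

```python
import numpy as np

# Weights of the example three-state WFA over the alphabet {a, b}.
alpha = np.array([1.0, 3.0, 4.0])              # initial weight vector
beta = np.array([2.0, 1.0, 1.0])               # final weight vector
A = {
    "a": np.array([[0.0, 0.0, 3.0],
                   [0.0, 0.0, 3.0],
                   [1.0, 0.0, 0.0]]),
    "b": np.array([[0.0, 1.0, 0.0],
                   [2.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0]]),
}

def wfa_value(x: str) -> float:
    """A(x) = alpha^T A_{x_1} ... A_{x_k} beta, over the real semiring."""
    v = alpha
    for symbol in x:
        v = v @ A[symbol]                      # left-to-right matrix products
    return float(v @ beta)

print(wfa_value(""))    # empty string: alpha . beta
print(wfa_value("ab"))
```

For the empty string the product of transition matrices is the identity, so $A(\varepsilon) = \alpha_A^\top \beta_A$.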
Hankel Matrix

Definition: the Hankel matrix of a function $f \colon \Sigma^* \to \mathbb{R}$ is the infinite matrix $H_f$, with rows indexed by prefixes $u$ and columns by suffixes $v$, defined by
$$\forall u, v \in \Sigma^*, \quad H_f(u, v) = f(uv).$$

• redundancy: $f(x)$ appears in all entries $(u, v)$ with $x = uv$.
Theorem of Fliess

Theorem (Fliess, 1974): $\operatorname{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\operatorname{rank}(H_f)$ states.

Proof: for any $u, v \in \Sigma^*$, if $H$ is the Hankel matrix of $A$, then
$$H(u, v) = A(uv) = (\alpha_A^\top A_u)(A_v\, \beta_A).$$
Thus, $H = P_A S_A^\top$ with
$$P_A = \begin{bmatrix} \vdots \\ \alpha_A^\top A_u \\ \vdots \end{bmatrix} \in \mathbb{R}^{\Sigma^* \times Q_A},
\qquad
S_A = \begin{bmatrix} \vdots \\ \beta_A^\top A_v^\top \\ \vdots \end{bmatrix} \in \mathbb{R}^{\Sigma^* \times Q_A}.$$
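The theorem can be checked numerically on the three-state example from the illustration: any finite block of its Hankel matrix has rank at most 3. This toy sketch (not from the talk) builds the block indexed by all strings of length at most 2 and checks its rank:

```python
import itertools
import numpy as np

# The three-state example WFA from the illustration slide.
alpha = np.array([1.0, 3.0, 4.0])
beta = np.array([2.0, 1.0, 1.0])
A = {
    "a": np.array([[0.0, 0.0, 3.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]),
    "b": np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 4.0]]),
}

def f(x: str) -> float:
    """The rational function computed by the example WFA."""
    v = alpha
    for s in x:
        v = v @ A[s]
    return float(v @ beta)

# All strings over {a, b} of length <= 2, used as both prefixes and suffixes.
words = ["".join(w) for k in range(3) for w in itertools.product("ab", repeat=k)]
H = np.array([[f(u + v) for v in words] for u in words])

# By the theorem of Fliess, rank(H_f) is bounded by the number of states (3).
print(np.linalg.matrix_rank(H))
```

Enlarging the block with longer prefixes and suffixes cannot raise the rank past 3, since the full Hankel matrix factors as $P_A S_A^\top$ with inner dimension $|Q_A| = 3$.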
Standardization

[Figure: standardization of a WFA, step by step.]

(Schützenberger, 1961; Cardon and Crochemore, 1980)
Hypothesis Sets

In view of the theorem of Fliess, a natural choice is
$$H_0 = \big\{A : \operatorname{rank}(H_A) < r\big\}$$
for some $r < +\infty$.

But rank does not define a convex function (it is the analogue of norm-0 for column vectors). Instead, we use a definition based on the nuclear norm and, more generally, on Schatten $p$-norms:
$$H_p = \big\{A : \|H_A\|_p < r\big\}, \qquad \text{with} \qquad \|H_A\|_p = \Big(\sum_i \sigma_i^p(H_A)\Big)^{\frac{1}{p}}.$$
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Schatten Norms

Common choices for $p$:
• $p = 1$: nuclear norm (or trace norm) $\|A\|_1 = \operatorname{Tr}\big[\sqrt{A^\top A}\big]$.
• $p = 2$: Frobenius norm $\|A\|_2 = \sqrt{\operatorname{Tr}[A^\top A]}$.
• $p = +\infty$: spectral norm $\|A\|_{+\infty} = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)$.

Properties:
• Hölder's inequality: for $p, p^* \ge 1$ with $\frac{1}{p} + \frac{1}{p^*} = 1$, $\ |\langle A, B \rangle| \le \|A\|_p\, \|B\|_{p^*}$.
• von Neumann's trace inequality: $|\langle A, B \rangle| \le \sum_i \sigma_i(A)\, \sigma_i(B)$.
Empirical Rademacher Complexity

By definition of the dual norm (or Hölder's inequality), for a sample $S = (x_1, \ldots, x_m)$ and any decomposition $x_i = u_i v_i$,
$$\widehat{\mathfrak{R}}_S(H_p)
= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{A \in H_p} \sum_{i=1}^m \sigma_i\, e_{u_i}^\top H_A\, e_{v_i}\Big]
= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{\|H_A\|_p \le r} \Big\langle \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top,\ H_A \Big\rangle\Big]
\le \frac{r}{m}\, \mathbb{E}_\sigma\Big[\Big\| \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top \Big\|_{p^*}\Big].$$
Rademacher Complexity for p = 2

Lemma: $\widehat{\mathfrak{R}}_S(H_2) \le \dfrac{r}{\sqrt{m}}$.

Proof: since $p^* = 2$ for $p = 2$,
$$\widehat{\mathfrak{R}}_S(H_2)
\le \frac{r}{m}\, \mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big]
\le \frac{r}{m} \sqrt{\mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]}
= \frac{r}{m} \sqrt{\mathbb{E}\Big[\sum_{i,j=1}^m \sigma_i \sigma_j \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_j} e_{v_j}^\top \big\rangle\Big]}$$
$$= \frac{r}{m} \sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_i} e_{v_i}^\top \big\rangle}
= \frac{r}{\sqrt{m}}.$$
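The crucial step above is that the cross terms ($i \neq j$) vanish in expectation, so $\mathbb{E}_\sigma\big[\|\sum_i \sigma_i e_{u_i} e_{v_i}^\top\|_2^2\big] = m$ exactly, even when prefix/suffix pairs repeat. A quick Monte Carlo sketch (the index pairs below are made-up toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy prefix/suffix index pairs (u_i, v_i), with deliberate repetitions.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (2, 0), (2, 2)]
m = len(pairs)
n = 3                                       # ambient dimension of the index sets

sq_norms = []
for _ in range(40_000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    M = np.zeros((n, n))
    for s, (u, v) in zip(sigma, pairs):
        M[u, v] += s                        # accumulate sigma_i e_{u_i} e_{v_i}^T
    sq_norms.append((M ** 2).sum())         # squared Frobenius norm

print(np.mean(sq_norms))                    # concentrates around m = 6
```

Each repeated position contributes $(\sum_{i \in \text{group}} \sigma_i)^2$, whose expectation is the group size; summing over groups recovers $m$.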
Lower Bound

By the Khintchine-Kahane inequality,
$$\frac{r}{m}\, \mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big]
\ge \frac{r}{\sqrt{2}\, m} \sqrt{\mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]}
= \frac{r}{\sqrt{2}\, m} \sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_i} e_{v_i}^\top \big\rangle}
= \frac{1}{\sqrt{2}}\, \frac{r}{\sqrt{m}}.$$
Generalization Bound

Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_2$,
$$L(A) \le \widehat{L}_S(A) + \frac{2 \mu_p r}{\sqrt{m}} + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$.
Proof

By Talagrand's contraction lemma,
$$\widehat{\mathfrak{R}}\big(\{(x, y) \mapsto L(A(x), y) : A \in H_2\}\big)
= \frac{1}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, |A(x_i) - y_i|^p\Big]$$
$$\le \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, (A(x_i) - y_i)\Big]
\qquad (x \mapsto |x|^p \text{ is } \mu_p\text{-Lipschitz})$$
$$= \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big]
+ \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sum_{i=1}^m \sigma_i\, y_i\Big]$$
$$= \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big]
= \mu_p\, \widehat{\mathfrak{R}}_S(H_2),$$
where the $y_i$ term vanishes since $\mathbb{E}[\sigma_i] = 0$.
Rademacher Complexity for p = 1

Lemma:
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{r}{m} \Big(c_1 \log(2m + 1) + c_2 \sqrt{W_S \log(2m + 1)}\Big),$$
where $W_S = \min_{\text{decomp.}} \max\{U_S, V_S\}$, with $U_S = \max_{u \in \Sigma^*} |\{i : u_i = u\}|$ and $V_S = \max_{v \in \Sigma^*} |\{i : v_i = v\}|$.

Proof: apply the matrix Bernstein bound with $M = 1$, $d \le 2m$, $\mathbf{V}_1 = \sum_i e_{u_i} e_{u_i}^\top$, $\mathbf{V}_2 = \sum_i e_{v_i} e_{v_i}^\top$, $\|\mathbf{V}_1\|_{\mathrm{op}} = U_S$, $\|\mathbf{V}_2\|_{\mathrm{op}} = V_S$.
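The quantity $W_S$ depends on how the sample strings are split into prefix/suffix pairs $x_i = u_i v_i$; for a small sample it can be computed exactly by brute force over all decompositions. An illustrative sketch (the helper name and the example strings are made up for illustration):

```python
import itertools

def w_s(sample: list[str]) -> int:
    """W_S = min over decompositions x_i = u_i v_i of max(U_S, V_S),
    where U_S (resp. V_S) is the largest multiplicity of any prefix
    (resp. suffix) under the chosen decomposition."""
    best = len(sample)                       # max(U_S, V_S) <= m always holds
    # One split point per string: x_i -> (x_i[:k], x_i[k:]), 0 <= k <= |x_i|.
    for cuts in itertools.product(*[range(len(x) + 1) for x in sample]):
        prefixes = [x[:k] for x, k in zip(sample, cuts)]
        suffixes = [x[k:] for x, k in zip(sample, cuts)]
        u = max(prefixes.count(p) for p in prefixes)
        v = max(suffixes.count(s) for s in suffixes)
        best = min(best, max(u, v))
    return best

print(w_s(["ab", "ab", "ba"]))
```

Even a sample with repeated strings can achieve a small $W_S$ by choosing different split points for the repeats, which is exactly what makes $W_S$ a potentially much smaller, data-dependent quantity than $m$.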
Matrix Bernstein Bound

Corollary (Minsker, 2011; Tropp, 2015): let $\mathbf{M} = \sum_i \mathbf{M}_i$ be a finite sum of i.i.d. random matrices with
• $\mathbb{E}[\mathbf{M}] = 0$ and $\|\mathbf{M}_i\|_{\mathrm{op}} \le M$ for all $i$;
• $\sum_i \mathbb{E}[\mathbf{M}_i \mathbf{M}_i^\top] \preceq \mathbf{V}_1$ and $\sum_i \mathbb{E}[\mathbf{M}_i^\top \mathbf{M}_i] \preceq \mathbf{V}_2$.
Then,
$$\mathbb{E}\big[\|\mathbf{M}\|_{\mathrm{op}}\big] \le c_1 M \log(d + 1) + c_2 \sqrt{\nu \log(d + 1)},$$
where $c_1 = \frac{2 + 8/\log 2}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log 2}$, $\mathbf{V} = \operatorname{diag}(\mathbf{V}_1, \mathbf{V}_2)$, $\nu = \|\mathbf{V}\|_{\mathrm{op}}$, and $d = \frac{\operatorname{Tr}(\mathbf{V})}{\|\mathbf{V}\|_{\mathrm{op}}}$.
Generalization Bound

Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_1$,
$$L(A) \le \widehat{L}_S(A) + \frac{2 \mu_p c_1 r \log(2m + 1)}{m} + \frac{2 \mu_p c_2 r \sqrt{W_S \log(2m + 1)}}{m} + 3M \sqrt{\frac{\log \frac{2}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$, $c_1 = \frac{2 + 8/\log 2}{3}$, and $c_2 = \sqrt{2} + 4/\sqrt{\log 2}$.
Conclusion

Theory of learning WFAs:
• data-dependent learning guarantees.
• data-dependent combinatorial quantities (e.g. $W_S$).
• can help guide the design of algorithms.
• key role of the notion of Hankel matrix (cf. spectral methods (Hsu et al., 2009; Balle and MM, 2012)).
Questions

• can we use learning bounds (e.g. $W_S$) to select the prefixes/suffixes defining sub-blocks of the Hankel matrix?
• can we derive learning guarantees for more general algorithms than (Balle and MM, 2012)?
• computational challenges.
Hypothesis Sets

Definition based on the matrix representation:
$$\mathcal{A}_{n,p,r} = \Big\{A : |Q_A| = n,\ \|\alpha_A\|_p \le r_\alpha,\ \|\beta_A\|_{p^*} \le r_\beta,\ \max_{a \in \Sigma} \|A_a\|_{p^*} \le r_\Sigma\Big\}.$$
Rademacher Complexities

Corollary:
$$\widehat{\mathfrak{R}}_S(\mathcal{A}_{n,p,r}) \le \sqrt{\frac{n(n|\Sigma| + 2)}{m}}\; r_\alpha r_\beta \Big(C + \sqrt{\log(L_S + 2)}\Big),$$
$$\mathfrak{R}_m(\mathcal{A}_{n,p,r}) \le \sqrt{\frac{n(n|\Sigma| + 2)}{m}}\; r_\alpha r_\beta \Big(C + \sqrt{\log(L_m + 2)}\Big),$$
where $L_S = \max_i |x_i|$, $L_m = \mathbb{E}_{S \sim D^m}[L_S]$, and
$$C = \sqrt{\tfrac{(\log(r_\alpha r_\beta))_+}{n|\Sigma| + 2}} + \sqrt{\log_+(r_\Sigma)} + \sqrt{\log_+(\tilde{r})} + 3\sqrt{\log 2},
\qquad
\tilde{r} = \max\Big\{\sqrt{r_\alpha / r_\beta},\ \sqrt{r_\beta / r_\alpha},\ \sqrt{r_\alpha r_\beta}\,/\, r_\Sigma\Big\}.$$