Learning Weighted Automata
Mehryar Mohri (Courant Institute & Google Research)
Joint work with Borja Balle (Amazon Research)
LICS 2017 - Mohri
Weighted Automata (WFAs)

[Figure: example of a weighted finite automaton with weighted transitions, initial and final weights.]
Motivation

Weighted automata (WFAs) appear in:
• image processing (Kari, 1993).
• automatic speech recognition (MM, Pereira, Riley, 1996, 2008).
• speech synthesis (Sproat, 1995; Allauzen, MM, Riley, 2004).
• machine translation (e.g., Iglesias et al., 2011).
• many other NLP tasks (very long list of refs).
• bioinformatics (Durbin et al., 1998).
• optical character recognition (Breuel, 2008).
• model checking (Baier et al., 2009; Aminof et al., 2011).
• machine learning (Cortes, Kuznetsov, MM, Warmuth, 2015).
Motivation

Theory: rational power series, extensively studied (Eilenberg, 1993; Salomaa and Soittola, 1978; Kuich and Salomaa, 1986; Berstel and Reutenauer, 1988).

Algorithms (see survey chapter: MM, 2009): rational operations, intersection or composition, epsilon-removal, determinization, minimization, disambiguation.
Learning Automata

Classical results:
• passive learning (Gold, 1978; Angluin, 1978; Pitt and Warmuth, 1993).
• active learning (model with membership and equivalence queries) (Angluin, 1987; Bergadano and Varricchio, 1994, 1996, 2000).

Spectral learning:
• algorithms (Hsu et al., 2009; Bailly et al., 2009; Balle and MM, 2012).
• natural language processing (Balle et al., 2014).
• reinforcement learning (Boots et al., 2009; Hamilton et al., 2013).
Learning Guarantees

Existing analyses:
• (Hsu et al., 2009; Denis et al., 2016): statistical consistency, finite-sample guarantees in the realizable case.
• (Balle and MM, 2012): algorithm-dependent finite-sample guarantee based on a stability analysis.
• (Kulesza et al., 2014): algorithm-dependent guarantees under a distributional assumption (data drawn from some WFA).

Can we derive general theoretical guarantees for learning WFAs?
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Learning Scenario

Training data: sample $S = \big((x_1, y_1), \ldots, (x_m, y_m)\big)$ drawn i.i.d. from $X \times Y$ according to some distribution $D$.

Problem: find a WFA $A$ in a hypothesis set $H$ with small expected loss
$$L(A) = \mathbb{E}_{(x,y) \sim D}\big[L(A(x), y)\big].$$

• note: the problem is not assumed realizable (the distribution need not be generated by a probabilistic WFA).
Empirical Rademacher Complexity

Definition:
• $G$: family of functions mapping from a set $Z$ to $[a, b]$.
• sample $S = (z_1, \ldots, z_m)$.
• $\sigma_i$s (Rademacher variables): independent uniform random variables taking values in $\{-1, +1\}$.

The empirical Rademacher complexity measures the correlation of the class with random noise:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big].$$

The Rademacher complexity of $G$ is its expectation over samples:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$
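The definition above can be illustrated numerically: for a small finite class, the empirical Rademacher complexity can be estimated by Monte Carlo over the $\sigma$ vectors. This is a toy sketch (the class $G$ and the sample below are made-up illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample of m points in Z = R, and a small finite class G of three functions.
S = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
G = [np.sign, np.tanh, lambda z: np.zeros_like(z)]

m = len(S)
values = np.stack([g(S) for g in G])          # shape (|G|, m): entries g(z_i)

# hat{R}_S(G) = E_sigma[ sup_g (1/m) sum_i sigma_i g(z_i) ], estimated by sampling.
sigmas = rng.choice([-1.0, 1.0], size=(100_000, m))
sups = (sigmas @ values.T / m).max(axis=1)    # sup over g, for each sigma draw
rad = sups.mean()

# Since the zero function is in G, the sup is always >= 0, so rad >= 0;
# all functions take values in [-1, 1], so rad <= 1.
print(f"estimated empirical Rademacher complexity: {rad:.3f}")
```

The zero function pins the supremum at zero or above, which is why adding it makes the estimate a clean nonnegative quantity.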
Rademacher Complexity Bound

Theorem (Koltchinskii and Panchenko, 2002; MM et al., 2012): let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in G$:
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\mathfrak{R}_m(G) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G} \mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]$.
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Learning Automata

Classical formulation:
• sample $S = \big((x_1, y_1), \ldots, (x_m, y_m)\big) \in (\Sigma^* \times \{0, 1\})^m$.
• find the smallest automaton $A$ consistent with the sample:
$$\min_A \|A\|_0 \quad \text{s.t.} \quad \forall i \in [m],\ A(x_i) = y_i.$$
• NP-complete problem (Gold, 1978; Angluin, 1978); even polynomial approximation is NP-hard (Pitt and Warmuth, 1993).

Not the right formulation.
Analogy: Linear Classifiers

Sparse learning formulation:
$$\min_{w \in \mathbb{R}^N} \|w\|_0 \quad \text{s.t.} \quad Aw = b.$$
• non-convex optimization problem.
• NP-hard problem.

Not the right formulation; use an alternative norm instead (e.g. norm-1).
Questions

What is the appropriate norm to use for learning WFAs?

Which hypothesis sets should we consider?
• description in terms of the Hankel matrix.
• description in terms of transition matrices.
• description in terms of a function norm.
WFA - Definition

A WFA $A$ over a semiring $(\mathbb{S}, \oplus, \otimes, 0, 1)$ and alphabet $\Sigma$ with a finite set of states $Q_A$ is defined by
• an initial weight vector $\alpha_A \in \mathbb{S}^{Q_A}$;
• a final weight vector $\beta_A \in \mathbb{S}^{Q_A}$;
• transition weight matrices $A_a \in \mathbb{S}^{Q_A \times Q_A}$, $a \in \Sigma$.

Function defined: for any $x = x_1 \cdots x_k \in \Sigma^*$,
$$A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k}\, \beta_A.$$

Notation: $A_x = A_{x_1} \cdots A_{x_k}$.
WFA - Illustration

[Figure: a three-state WFA over $\{a, b\}$; each state is labeled with its state number, initial weight, and final weight.]

$$\alpha_A = \begin{bmatrix} 1 \\ 3 \\ 4 \end{bmatrix}, \qquad
\beta_A = \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}, \qquad
A_a = \begin{bmatrix} 0 & 0 & 3 \\ 0 & 0 & 3 \\ 1 & 0 & 0 \end{bmatrix}, \qquad
A_b = \begin{bmatrix} 0 & 1 & 0 \\ 2 & 0 & 0 \\ 0 & 0 & 4 \end{bmatrix}.$$
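The formula $A(x) = \alpha_A^\top A_{x_1} \cdots A_{x_k} \beta_A$ translates directly into code. The following minimal sketch (plain NumPy, using the weights of the example above over the real semiring; it is an illustration, not the talk's code) evaluates the example WFA on a string:

```python
import numpy as np

# Weights of the example three-state WFA over the alphabet {a, b}.
alpha = np.array([1.0, 3.0, 4.0])              # initial weight vector
beta = np.array([2.0, 1.0, 1.0])               # final weight vector
A = {
    "a": np.array([[0.0, 0.0, 3.0],
                   [0.0, 0.0, 3.0],
                   [1.0, 0.0, 0.0]]),
    "b": np.array([[0.0, 1.0, 0.0],
                   [2.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0]]),
}

def wfa_value(x: str) -> float:
    """A(x) = alpha^T A_{x_1} ... A_{x_k} beta, over the real semiring."""
    v = alpha
    for symbol in x:
        v = v @ A[symbol]                      # left-to-right matrix products
    return float(v @ beta)

print(wfa_value(""))    # empty string: alpha . beta
print(wfa_value("ab"))
```

For the empty string the product of transition matrices is the identity, so $A(\varepsilon) = \alpha_A^\top \beta_A$.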
Hankel Matrix

Definition: the Hankel matrix of a function $f \colon \Sigma^* \to \mathbb{R}$ is the infinite matrix $H_f$, with rows indexed by prefixes $u$ and columns by suffixes $v$, defined by
$$\forall u, v \in \Sigma^*, \quad H_f(u, v) = f(uv).$$

• redundancy: $f(x)$ appears in all entries $(u, v)$ with $x = uv$.
Theorem of Fliess

Theorem (Fliess, 1974): $\operatorname{rank}(H_f) < +\infty$ iff $f$ is rational. In that case, there exists a (minimal) WFA $A$ representing $f$ with $\operatorname{rank}(H_f)$ states.

Proof: for any $u, v \in \Sigma^*$, if $H$ is the Hankel matrix of $A$, then
$$H(u, v) = A(uv) = (\alpha_A^\top A_u)(A_v\, \beta_A).$$
Thus, $H = P_A S_A^\top$ with
$$P_A = \begin{bmatrix} \vdots \\ \alpha_A^\top A_u \\ \vdots \end{bmatrix} \in \mathbb{R}^{\Sigma^* \times Q_A},
\qquad
S_A = \begin{bmatrix} \vdots \\ \beta_A^\top A_v^\top \\ \vdots \end{bmatrix} \in \mathbb{R}^{\Sigma^* \times Q_A}.$$
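The theorem can be checked numerically on the three-state example from the illustration: any finite block of its Hankel matrix has rank at most 3. This toy sketch (not from the talk) builds the block indexed by all strings of length at most 2 and checks its rank:

```python
import itertools
import numpy as np

# The three-state example WFA from the illustration slide.
alpha = np.array([1.0, 3.0, 4.0])
beta = np.array([2.0, 1.0, 1.0])
A = {
    "a": np.array([[0.0, 0.0, 3.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]]),
    "b": np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 4.0]]),
}

def f(x: str) -> float:
    """The rational function computed by the example WFA."""
    v = alpha
    for s in x:
        v = v @ A[s]
    return float(v @ beta)

# All strings over {a, b} of length <= 2, used as both prefixes and suffixes.
words = ["".join(w) for k in range(3) for w in itertools.product("ab", repeat=k)]
H = np.array([[f(u + v) for v in words] for u in words])

# By the theorem of Fliess, rank(H_f) is bounded by the number of states (3).
print(np.linalg.matrix_rank(H))
```

Enlarging the block with longer prefixes and suffixes cannot raise the rank past 3, since the full Hankel matrix factors as $P_A S_A^\top$ with inner dimension $|Q_A| = 3$.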
Standardization

[Figure: standardization of a WFA, step by step.]

(Schützenberger, 1961; Cardon and Crochemore, 1980)
Hypothesis Sets

In view of the theorem of Fliess, a natural choice is
$$H_0 = \big\{A : \operatorname{rank}(H_A) < r\big\}$$
for some $r < +\infty$.

But rank does not define a convex function (it is the analogue of norm-0 for column vectors). Instead, we use a definition based on the nuclear norm and, more generally, on Schatten $p$-norms:
$$H_p = \big\{A : \|H_A\|_p < r\big\}, \qquad \text{with} \qquad \|H_A\|_p = \Big(\sum_i \sigma_i^p(H_A)\Big)^{\frac{1}{p}}.$$
This Talk

• Learning scenario, complexity tools.
• Hypothesis sets.
• Learning guarantees.
Schatten Norms

Common choices for $p$:
• $p = 1$: nuclear norm (or trace norm) $\|A\|_1 = \operatorname{Tr}\big[\sqrt{A^\top A}\big]$.
• $p = 2$: Frobenius norm $\|A\|_2 = \sqrt{\operatorname{Tr}[A^\top A]}$.
• $p = +\infty$: spectral norm $\|A\|_{+\infty} = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)$.

Properties:
• Hölder's inequality: for $p, p^* \ge 1$ with $\frac{1}{p} + \frac{1}{p^*} = 1$, $\ |\langle A, B \rangle| \le \|A\|_p\, \|B\|_{p^*}$.
• von Neumann's trace inequality: $|\langle A, B \rangle| \le \sum_i \sigma_i(A)\, \sigma_i(B)$.
Empirical Rademacher Complexity

By definition of the dual norm (or Hölder's inequality), for a sample $S = (x_1, \ldots, x_m)$ and any decomposition $x_i = u_i v_i$,
$$\widehat{\mathfrak{R}}_S(H_p)
= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{A \in H_p} \sum_{i=1}^m \sigma_i\, e_{u_i}^\top H_A\, e_{v_i}\Big]
= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{\|H_A\|_p \le r} \Big\langle \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top,\ H_A \Big\rangle\Big]
\le \frac{r}{m}\, \mathbb{E}_\sigma\Big[\Big\| \sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top \Big\|_{p^*}\Big].$$
Rademacher Complexity for p = 2

Lemma: $\widehat{\mathfrak{R}}_S(H_2) \le \dfrac{r}{\sqrt{m}}$.

Proof: since $p^* = 2$ for $p = 2$,
$$\widehat{\mathfrak{R}}_S(H_2)
\le \frac{r}{m}\, \mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big]
\le \frac{r}{m} \sqrt{\mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]}
= \frac{r}{m} \sqrt{\mathbb{E}\Big[\sum_{i,j=1}^m \sigma_i \sigma_j \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_j} e_{v_j}^\top \big\rangle\Big]}$$
$$= \frac{r}{m} \sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_i} e_{v_i}^\top \big\rangle}
= \frac{r}{\sqrt{m}}.$$
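The crucial step above is that the cross terms ($i \neq j$) vanish in expectation, so $\mathbb{E}_\sigma\big[\|\sum_i \sigma_i e_{u_i} e_{v_i}^\top\|_2^2\big] = m$ exactly, even when prefix/suffix pairs repeat. A quick Monte Carlo sketch (the index pairs below are made-up toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy prefix/suffix index pairs (u_i, v_i), with deliberate repetitions.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (2, 0), (2, 2)]
m = len(pairs)
n = 3                                       # ambient dimension of the index sets

sq_norms = []
for _ in range(40_000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    M = np.zeros((n, n))
    for s, (u, v) in zip(sigma, pairs):
        M[u, v] += s                        # accumulate sigma_i e_{u_i} e_{v_i}^T
    sq_norms.append((M ** 2).sum())         # squared Frobenius norm

print(np.mean(sq_norms))                    # concentrates around m = 6
```

Each repeated position contributes $(\sum_{i \in \text{group}} \sigma_i)^2$, whose expectation is the group size; summing over groups recovers $m$.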
Lower Bound

By the Khintchine-Kahane inequality,
$$\frac{r}{m}\, \mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2\Big]
\ge \frac{r}{\sqrt{2}\, m} \sqrt{\mathbb{E}\Big[\Big\|\sum_{i=1}^m \sigma_i\, e_{u_i} e_{v_i}^\top\Big\|_2^2\Big]}
= \frac{r}{\sqrt{2}\, m} \sqrt{\sum_{i=1}^m \big\langle e_{u_i} e_{v_i}^\top,\ e_{u_i} e_{v_i}^\top \big\rangle}
= \frac{1}{\sqrt{2}}\, \frac{r}{\sqrt{m}}.$$
Generalization Bound

Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_2$,
$$L(A) \le \widehat{L}_S(A) + \frac{2 \mu_p r}{\sqrt{m}} + M \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$.
Proof

By Talagrand's contraction lemma,
$$\widehat{\mathfrak{R}}\big(\{(x, y) \mapsto L(A(x), y) : A \in H_2\}\big)
= \frac{1}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, |A(x_i) - y_i|^p\Big]$$
$$\le \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, (A(x_i) - y_i)\Big]
\qquad (x \mapsto |x|^p \text{ is } \mu_p\text{-Lipschitz})$$
$$= \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big]
+ \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sum_{i=1}^m \sigma_i\, y_i\Big]$$
$$= \frac{\mu_p}{m}\, \mathbb{E}_\sigma\Big[\sup_{A \in H_2} \sum_{i=1}^m \sigma_i\, A(x_i)\Big]
= \mu_p\, \widehat{\mathfrak{R}}_S(H_2),$$
where the $y_i$ term vanishes since $\mathbb{E}[\sigma_i] = 0$.
Rademacher Complexity for p = 1

Lemma:
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{r}{m} \Big(c_1 \log(2m + 1) + c_2 \sqrt{W_S \log(2m + 1)}\Big),$$
where $W_S = \min_{\text{decomp.}} \max\{U_S, V_S\}$, with $U_S = \max_{u \in \Sigma^*} |\{i : u_i = u\}|$ and $V_S = \max_{v \in \Sigma^*} |\{i : v_i = v\}|$.

Proof: apply the matrix Bernstein bound with $M = 1$, $d \le 2m$, $\mathbf{V}_1 = \sum_i e_{u_i} e_{u_i}^\top$, $\mathbf{V}_2 = \sum_i e_{v_i} e_{v_i}^\top$, $\|\mathbf{V}_1\|_{\mathrm{op}} = U_S$, $\|\mathbf{V}_2\|_{\mathrm{op}} = V_S$.
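The quantity $W_S$ depends on how the sample strings are split into prefix/suffix pairs $x_i = u_i v_i$; for a small sample it can be computed exactly by brute force over all decompositions. An illustrative sketch (the helper name and the example strings are made up for illustration):

```python
import itertools

def w_s(sample: list[str]) -> int:
    """W_S = min over decompositions x_i = u_i v_i of max(U_S, V_S),
    where U_S (resp. V_S) is the largest multiplicity of any prefix
    (resp. suffix) under the chosen decomposition."""
    best = len(sample)                       # max(U_S, V_S) <= m always holds
    # One split point per string: x_i -> (x_i[:k], x_i[k:]), 0 <= k <= |x_i|.
    for cuts in itertools.product(*[range(len(x) + 1) for x in sample]):
        prefixes = [x[:k] for x, k in zip(sample, cuts)]
        suffixes = [x[k:] for x, k in zip(sample, cuts)]
        u = max(prefixes.count(p) for p in prefixes)
        v = max(suffixes.count(s) for s in suffixes)
        best = min(best, max(u, v))
    return best

print(w_s(["ab", "ab", "ba"]))
```

Even a sample with repeated strings can achieve a small $W_S$ by choosing different split points for the repeats, which is exactly what makes $W_S$ a potentially much smaller, data-dependent quantity than $m$.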
Matrix Bernstein Bound

Corollary (Minsker, 2011; Tropp, 2015): let $\mathbf{M} = \sum_i \mathbf{M}_i$ be a finite sum of i.i.d. random matrices with
• $\mathbb{E}[\mathbf{M}] = 0$ and $\|\mathbf{M}_i\|_{\mathrm{op}} \le M$ for all $i$;
• $\sum_i \mathbb{E}[\mathbf{M}_i \mathbf{M}_i^\top] \preceq \mathbf{V}_1$ and $\sum_i \mathbb{E}[\mathbf{M}_i^\top \mathbf{M}_i] \preceq \mathbf{V}_2$.
Then,
$$\mathbb{E}\big[\|\mathbf{M}\|_{\mathrm{op}}\big] \le c_1 M \log(d + 1) + c_2 \sqrt{\nu \log(d + 1)},$$
where $c_1 = \frac{2 + 8/\log 2}{3}$, $c_2 = \sqrt{2} + 4/\sqrt{\log 2}$, $\mathbf{V} = \operatorname{diag}(\mathbf{V}_1, \mathbf{V}_2)$, $\nu = \|\mathbf{V}\|_{\mathrm{op}}$, and $d = \frac{\operatorname{Tr}(\mathbf{V})}{\|\mathbf{V}\|_{\mathrm{op}}}$.
Generalization Bound

Theorem: assume that the loss $L$ is the $L_p$ loss and is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$, for all $A \in H_1$,
$$L(A) \le \widehat{L}_S(A) + \frac{2 \mu_p c_1 r \log(2m + 1)}{m} + \frac{2 \mu_p c_2 r \sqrt{W_S \log(2m + 1)}}{m} + 3M \sqrt{\frac{\log \frac{2}{\delta}}{2m}},$$
where $\mu_p = p M^{p-1}$, $c_1 = \frac{2 + 8/\log 2}{3}$, and $c_2 = \sqrt{2} + 4/\sqrt{\log 2}$.
Conclusion

Theory of learning WFAs:
• data-dependent learning guarantees.
• data-dependent combinatorial quantities (e.g. $W_S$).
• can help guide the design of algorithms.
• key role of the notion of Hankel matrix (cf. spectral methods (Hsu et al., 2009; Balle and MM, 2012)).
Questions

• can we use learning bounds (e.g. $W_S$) to select the prefixes/suffixes defining sub-blocks of the Hankel matrix?
• can we derive learning guarantees for more general algorithms than (Balle and MM, 2012)?
• computational challenges.
Hypothesis Sets

Definition based on the matrix representation:
$$\mathcal{A}_{n,p,r} = \Big\{A : |Q_A| = n,\ \|\alpha_A\|_p \le r_\alpha,\ \|\beta_A\|_{p^*} \le r_\beta,\ \max_{a \in \Sigma} \|A_a\|_{p^*} \le r_\Sigma\Big\}.$$
Rademacher Complexities

Corollary:
$$\widehat{\mathfrak{R}}_S(\mathcal{A}_{n,p,r}) \le \sqrt{\frac{n(n|\Sigma| + 2)}{m}}\; r_\alpha r_\beta \Big(C + \sqrt{\log(L_S + 2)}\Big),$$
$$\mathfrak{R}_m(\mathcal{A}_{n,p,r}) \le \sqrt{\frac{n(n|\Sigma| + 2)}{m}}\; r_\alpha r_\beta \Big(C + \sqrt{\log(L_m + 2)}\Big),$$
where $L_S = \max_i |x_i|$, $L_m = \mathbb{E}_{S \sim D^m}[L_S]$, and
$$C = \sqrt{\tfrac{(\log(r_\alpha r_\beta))_+}{n|\Sigma| + 2}} + \sqrt{\log_+(r_\Sigma)} + \sqrt{\log_+(\tilde{r})} + 3\sqrt{\log 2},
\qquad
\tilde{r} = \max\Big\{\sqrt{r_\alpha / r_\beta},\ \sqrt{r_\beta / r_\alpha},\ \sqrt{r_\alpha r_\beta}\,/\, r_\Sigma\Big\}.$$