
Tutorial Hand-Outs – Weighted Finite-State Transducers in

Computational Biology

Corinna Cortes, Google Research

Mehryar Mohri, Courant Institute, NYU

This document contains detailed notes supplementing the slides distributed for the tutorial Weighted Finite-State Transducers in Computational Biology.

The notes are composed of two technical papers. The first paper,

• Mehryar Mohri. Weighted Finite-State Transducer Algorithms: An Overview. In Carlos Martín-Vide, Victor Mitrana, and Gheorghe Păun, editors, Formal Languages and Applications, volume 148, VIII, 620 p. Springer, Berlin, 2004.

presents an overview of several of the fundamental algorithms related to weighted finite-state transducers that are presented in the tutorial.

The second paper,

• Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5:1035–1062, 2004.

describes a unifying framework for kernels relevant to computational biology (rational kernels), the computation algorithm, and their applications.


Weighted Finite-State Transducer Algorithms

An Overview

Mehryar Mohri

AT&T Labs – Research

Shannon Laboratory

180 Park Avenue, Florham Park, NJ 07932, USA

[email protected]

May 14, 2004

Abstract

Weighted finite-state transducers are used in many applications such as text, speech and image processing. This chapter gives an overview of several recent weighted transducer algorithms, including composition of weighted transducers, determinization of weighted automata, a weight pushing algorithm, and minimization of weighted automata. It briefly describes these algorithms, discusses their running-time complexity and conditions of application, and shows examples illustrating their application.

1 Introduction

Weighted transducers are used in many applications such as text, speech and image processing [9, 12, 5]. They are automata in which each transition, in addition to its usual input label, is augmented with an output label from a possibly new alphabet, and carries some weight element of a semiring. Transducers can be used to define a mapping between two different types of information sources, e.g., word and phoneme sequences. The weights are critically needed to model the uncertainty or the variability of such information sources. Weighted transducers can be used, for example, to assign different pronunciations to the same word but with different ranks or probabilities. Their weights are typically derived from large data sets using various sophisticated statistical learning techniques.

Much of the theory of weighted transducers and rational power series was developed more than two decades ago [15, 7, 4]. However, many essential weighted transducer algorithms such as determinization and minimization of weighted transducers [9] are recent and raise new questions, both theoretical and algorithmic. This chapter overviews several recent weighted transducer algorithms, including composition of weighted transducers, determinization of weighted automata, a weight pushing algorithm, and minimization of weighted automata.


Semiring      Set               ⊕      ⊗    0     1
Boolean       {0, 1}            ∨      ∧    0     1
Probability   R+                +      ×    0     1
Log           R ∪ {−∞, +∞}      ⊕log   +    +∞    0
Tropical      R ∪ {−∞, +∞}      min    +    +∞    0

Table 1: Semiring examples. ⊕log is defined by: x ⊕log y = −log(e^−x + e^−y).

2 Preliminaries

This section introduces the definitions and notation used in the following.

Definition 1 ([7]) A system (K, ⊕, ⊗, 0, 1) is a semiring if: (K, ⊕, 0) is a commutative monoid with identity element 0; (K, ⊗, 1) is a monoid with identity element 1; ⊗ distributes over ⊕; and 0 is an annihilator for ⊗: for all a ∈ K, a ⊗ 0 = 0 ⊗ a = 0.

Thus, a semiring is a ring that may lack negation. Table 1 lists some familiar semirings. A semiring is said to be commutative when the multiplicative operation ⊗ is commutative. It is said to be left divisible if for any x ≠ 0, there exists y ∈ K such that y ⊗ x = 1, that is, if all elements of K admit a left inverse. (K, ⊕, ⊗, 0, 1) is said to be weakly left divisible if for any x and y in K such that x ⊕ y ≠ 0, there exists at least one z such that x = (x ⊕ y) ⊗ z. When the ⊗ operation is cancellative, z is unique and we can then write: z = (x ⊕ y)^−1 x. When z is not unique, we can still assume that we have an algorithm to find one of the possible z and call it (x ⊕ y)^−1 x. Furthermore, we will assume that z can be found in a consistent way, that is: ((u ⊗ x) ⊕ (u ⊗ y))^−1 (u ⊗ x) = (x ⊕ y)^−1 x for any x, y, u ∈ K such that u ≠ 0. A semiring is zero-sum-free if for any x and y in K, x ⊕ y = 0 implies x = y = 0. In the following definitions, K is assumed to be a left semiring or a semiring.
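To make these operations concrete, here is a minimal sketch (the names and the (plus, times, zero, one) representation are ours, not the paper's) of the semirings of Table 1 in Python, together with the ⊗-product used for path weights and the tropical-semiring residue (x ⊕ y)^−1 ⊗ x:

```python
import math

def log_plus(x, y):
    """x ⊕log y = -log(exp(-x) + exp(-y)) from Table 1, computed stably."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))

# Each semiring as a (plus, times, zero, one) tuple, following Table 1.
TROPICAL    = (min, lambda a, b: a + b, math.inf, 0.0)
PROBABILITY = (lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
LOG         = (log_plus, lambda a, b: a + b, math.inf, 0.0)

def path_weight(semiring, weights):
    """Weight of a path: the ⊗-product of its transition weights."""
    _, times, _, one = semiring
    w = one
    for x in weights:
        w = times(w, x)
    return w

def tropical_residue(x, y):
    """(x ⊕ y)^-1 ⊗ x in the tropical semiring: x - min(x, y)."""
    return x - min(x, y)
```

Note that the tropical residue is exactly the "remaining weight" that the determinization algorithm of Section 4 stores alongside states.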

Definition 2 A weighted finite-state transducer T over a semiring K is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where: Σ is the finite input alphabet of the transducer; ∆ is the finite output alphabet; Q is a finite set of states; I ⊆ Q the set of initial states; F ⊆ Q the set of final states; E ⊆ Q × (Σ ∪ {ε}) × (∆ ∪ {ε}) × K × Q a finite set of transitions; λ : I → K the initial weight function; and ρ : F → K the final weight function mapping F to K.

We denote by |T| the sum of the number of states and transitions of T. Weighted automata are defined in a similar way by simply omitting the input or output labels.

Given a transition e ∈ E, we denote by p[e] its origin or previous state, n[e] its destination or next state, and w[e] its weight. A path π = e1 · · · ek is an element of E* with consecutive transitions: n[e_{i−1}] = p[e_i], i = 2, . . . , k. We extend n and p to paths by setting: n[π] = n[ek] and p[π] = p[e1]. The weight function w can also be extended to paths by defining the weight of a path as the ⊗-product of the weights of its constituent transitions: w[π] = w[e1] ⊗ · · · ⊗ w[ek]. We denote by P(q, q′) the set of paths from q to q′ and by P(q, x, y, q′) the set of paths from q to q′ with input label x ∈ Σ* and output label y. These definitions can be extended to subsets R, R′ ⊆ Q by: P(R, x, y, R′) = ∪_{q∈R, q′∈R′} P(q, x, y, q′). A transducer T is regulated if the output weight associated by T to any pair of input-output strings (x, y) by:

    [[T]](x, y) = ⊕_{π∈P(I,x,y,F)} λ[p[π]] ⊗ w[π] ⊗ ρ[n[π]]    (1)

is well-defined and in K. [[T]](x, y) = 0 when P(I, x, y, F) = ∅. If for all q ∈ Q, ⊕_{π∈P(q,ε,ε,q)} w[π] ∈ K, then T is regulated. In particular, when T does not have any ε-cycle, it is regulated. We define the domain of T as: Dom(T) = {(x, y) : [[T]](x, y) ≠ 0}.
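For an acyclic, ε-free transducer over the tropical semiring, equation (1) can be evaluated directly by enumerating accepting paths. The sketch below uses a toy representation of our own (a dict mapping each state to its outgoing transitions); it is meant only to make the definition concrete:

```python
import math

def transducer_weight(arcs, init, final, x, y):
    """[[T]](x, y) over the tropical semiring, for an acyclic, ε-free T.
    arcs: state -> [(in_label, out_label, weight, next_state)];
    init: state -> initial weight; final: state -> final weight."""
    best = math.inf                      # ⊕ = min over accepting paths
    stack = [(q, 0, 0, w) for q, w in init.items()]
    while stack:
        q, i, j, w = stack.pop()
        if i == len(x) and j == len(y) and q in final:
            best = min(best, w + final[q])   # λ ⊗ w[π] ⊗ ρ, with ⊗ = +
        for (a, b, wt, n) in arcs.get(q, []):
            if i < len(x) and a == x[i] and j < len(y) and b == y[j]:
                stack.append((n, i + 1, j + 1, w + wt))
    return best
```

When P(I, x, y, F) is empty, the function returns math.inf, the tropical 0, matching the convention [[T]](x, y) = 0.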

3 Composition of Weighted Transducers

Composition is a fundamental algorithm used to create complex weighted transducers from simpler ones. Let K be a commutative semiring and let T1 and T2 be two weighted transducers defined over K such that the input alphabet of T2 coincides with the output alphabet of T1. Assume that the infinite sum ⊕_z T1(x, z) ⊗ T2(z, y) is well-defined and in K for all (x, y) ∈ Σ* × Ω*. This condition holds for all transducers defined over a closed semiring [8, 11], such as the Boolean semiring and the tropical semiring, and for all acyclic transducers defined over an arbitrary semiring. Then, the result of the composition of T1 and T2 is a weighted transducer denoted by T1 ∘ T2 and defined for all x, y by [3, 6, 15, 7]:¹

    [[T1 ∘ T2]](x, y) = ⊕_z T1(x, z) ⊗ T2(z, y)    (2)

There exists a general and efficient composition algorithm for weighted transducers [13, 12]. States in the composition T1 ∘ T2 of two weighted transducers T1 and T2 are identified with pairs of a state of T1 and a state of T2. Leaving aside transitions with ε inputs or outputs, the following rule specifies how to compute a transition of T1 ∘ T2 from appropriate transitions of T1 and T2:

    (q1, a, b, w1, q2) and (q′1, b, c, w2, q′2) =⇒ ((q1, q′1), a, c, w1 ⊗ w2, (q2, q′2))    (3)

See [13, 12] for a detailed presentation of the algorithm, including the use of a transducer filter for dealing with ε-multiplicity in the case of non-idempotent semirings. In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q′1, thus the space and time complexity of composition is quadratic: O(|T1||T2|). However, an on-the-fly implementation of composition can be used to construct just the part of the composed transducer that is needed. Figure 1(c) illustrates the result of the algorithm when applied to the transducers of Figures 1(a)-(b) defined over the tropical semiring.

¹ Note that we use a matrix notation for the definition of composition, as opposed to a functional notation. This is a deliberate choice, motivated in many cases by improved readability.
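Leaving ε-transitions aside, rule (3) translates almost directly into code. The sketch below (the dict representation and single initial states are our assumptions, not the paper's) builds the reachable part of T1 ∘ T2 on the fly over the tropical semiring, where ⊗ is +:

```python
from collections import deque

def compose(t1, t2, init1, init2):
    """ε-free composition of t1 and t2, each given as a dict
    state -> [(in_label, out_label, weight, next_state)], starting
    from the single initial states init1 and init2."""
    start = (init1, init2)
    arcs, seen, queue = {}, {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        out = arcs.setdefault((q1, q2), [])
        for (a, b, w1, n1) in t1.get(q1, []):
            for (b2, c, w2, n2) in t2.get(q2, []):
                if b == b2:  # output label of T1 matches input label of T2
                    dest = (n1, n2)
                    out.append((a, c, w1 + w2, dest))  # ⊗ is + (tropical)
                    if dest not in seen:
                        seen.add(dest)
                        queue.append(dest)
    return arcs
```

As in the text, the worst case is quadratic, but only state pairs reachable from the start pair are ever constructed, which is exactly the benefit of the on-the-fly implementation.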


Figure 1 (diagram not reproduced): (a) Weighted transducer T1 over the tropical semiring. (b) Weighted transducer T2 over the tropical semiring. (c) Composition of T1 and T2.

Intersection (or Hadamard product) of weighted automata and composition of finite-state transducers are both special cases of composition of weighted transducers. Intersection corresponds to the case where input and output labels of transitions are identical, and composition of unweighted transducers is obtained simply by omitting the weights.

In general, the definition of composition cannot be extended to the case of non-commutative semirings because the composite transduction cannot always be represented by a weighted finite-state transducer. Consider, for example, the case of two transducers T1 and T2, with Dom(T1) = Dom(T2) = (a, a)*, with [[T1]](a, a) = x ∈ K and [[T2]](a, a) = y ∈ K, and let τ be the composite of the transductions corresponding to T1 and T2. Then, for any non-negative integer n, τ(a^n, a^n) = x^n ⊗ y^n, which in general is different from (x ⊗ y)^n if x and y do not commute. An argument similar to the classical pumping lemma can then be used to show that τ cannot be represented by a weighted finite-state transducer.

When T1 and T2 are acyclic, composition can be extended to the case of non-commutative semirings. The algorithm then consists of matching paths of T1 and T2 directly rather than matching their constituent transitions. The termination of the algorithm is guaranteed by the fact that the number of paths of T1 and T2 is finite. However, the time and space complexity of the algorithm is then exponential.

The weights of matching transitions and paths are ⊗-multiplied in composition. One might wonder if another operation, ×, can be used instead of ⊗, in particular when K is not commutative. The following proposition proves that it cannot.

Proposition 1 Let (K, ×, e) be a monoid. Assume that × is used instead of ⊗ in composition. Then, × coincides with ⊗ and (K, ⊕, ⊗, 0, 1) is a commutative semiring.


Proof. Consider two sets of consecutive transitions of two paths: π1 = (p1, a, a, x, q1)(q1, b, b, y, r1) and π2 = (p2, a, a, u, q2)(q2, b, b, v, r2). Matching these transitions using × results in the following:

    ((p1, p2), a, a, x × u, (q1, q2)) and ((q1, q2), b, b, y × v, (r1, r2))    (4)

Since the weight of the path obtained by matching π1 and π2 must also correspond to the ×-multiplication of the weight of π1, x ⊗ y, and the weight of π2, u ⊗ v, we have:

    (x × u) ⊗ (y × v) = (x ⊗ y) × (u ⊗ v)    (5)

This identity must hold for all x, y, u, v ∈ K. Setting u = y = e and v = 1 leads to x = x ⊗ e, and similarly x = e ⊗ x for all x. Since the identity element of ⊗ is unique, this proves that e = 1.

With u = y = 1, identity (5) can be rewritten as x ⊗ v = x × v for all x and v, which shows that × coincides with ⊗. Finally, setting x = v = 1 gives u ⊗ y = y × u for all y and u, which shows that ⊗ is commutative.

4 Determinization

A weighted automaton is said to be deterministic or subsequential [16] if it has a unique initial state and if no two transitions leaving any state share the same input label.

There exists a natural extension of the classical subset construction to the case of weighted automata over a weakly left divisible semiring, called determinization [9].² The algorithm is generic: it works with any weakly left divisible semiring. Figures 2(a)-(b) illustrate the determinization of a weighted automaton over the tropical semiring. A state r of the output automaton that can be reached from the start state by a path π corresponds to the set of pairs (q, x) ∈ Q × K such that q can be reached from an initial state of the original machine by a path σ with l[σ] = l[π] and λ[p[σ]] ⊗ w[σ] = λ[p[π]] ⊗ w[π] ⊗ x. Thus, x is the remaining weight at state q. The worst-case complexity of determinization is exponential even in the unweighted case. However, in many practical cases, such as for weighted automata used in large-vocabulary speech recognition, this blow-up does not occur. It is also important to notice that, just like composition, determinization admits a natural on-the-fly implementation [9], which can be useful for saving space.

Unlike the unweighted case, determinization does not halt for some input weighted automata. In fact, some weighted automata, non-subsequentiable automata, do not even admit equivalent subsequential machines. We say that a weighted automaton A is determinizable if the determinization algorithm halts for the input A. With a determinizable input, the algorithm outputs an equivalent subsequential weighted automaton [9].

² We assume that the weighted automata considered are all such that for any string x ∈ Σ*, w[P(I, x, Q)] ≠ 0. This condition is always satisfied with trim machines over the tropical semiring or any zero-sum-free semiring.
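As an illustration, the weighted subset construction with residual weights can be sketched as follows for the tropical semiring (the representation, and the crude `max_states` guard against non-halting inputs, are our own assumptions):

```python
def determinize(arcs, initial, max_states=1000):
    """Weighted subset construction over the tropical semiring.
    arcs: state -> [(label, weight, next_state)]. Output states are
    frozensets of (state, residual_weight) pairs."""
    start = frozenset({(initial, 0.0)})
    result, queue, seen = {}, [start], {start}
    while queue:
        subset = queue.pop()
        by_label = {}
        for (q, r) in subset:
            for (a, w, n) in arcs.get(q, []):
                by_label.setdefault(a, []).append((n, r + w))
        out = result.setdefault(subset, [])
        for a, pairs in by_label.items():
            wmin = min(x for _, x in pairs)        # ⊕ = min over all arcs
            dest = {}
            for n, x in pairs:                     # remaining weights
                dest[n] = min(dest.get(n, float("inf")), x - wmin)
            dest = frozenset(dest.items())
            out.append((a, wmin, dest))
            if dest not in seen:
                if len(seen) >= max_states:        # determinization may not halt
                    raise RuntimeError("state budget exceeded; "
                                       "input may not be determinizable")
                seen.add(dest)
                queue.append(dest)
    return result, start
```

On an automaton shaped like Figure 2(a), this reproduces the subsets and residuals of Figure 2(b), including the d-transition of weight 7.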


Figure 2 (diagram not reproduced): Determinization of weighted automata. (a) Weighted automaton A over the tropical semiring. (b) Equivalent weighted automaton B obtained by determinization of A. (c) Non-determinizable weighted automaton over the tropical semiring; states 1 and 2 are non-twin siblings.

There exists a general property, the twins property, first formulated for finite-state transducers by C. Choffrut [3] and later generalized to weighted automata over the tropical semiring by [9], that provides a characterization of determinizable weighted automata under some general conditions.

Let A be a weighted automaton over a weakly left divisible left semiring K. Two states q and q′ of A are said to be siblings if there exist two strings x and y in Σ* such that both q and q′ can be reached from I by paths labeled with x, and there is a cycle at q and a cycle at q′ both labeled with y. When K is a commutative and cancellative semiring, two sibling states are said to be twins iff for any string y:

w[P (q, y, q)] = w[P (q′, y, q′)] (6)

A has the twins property if any two sibling states of A are twins. Figure 2(c) shows an unambiguous weighted automaton over the tropical semiring that does not have the twins property: states 1 and 2 can be reached by paths labeled with a from the initial state and admit cycles with the same label b, but the weights of these cycles (3 and 4) are different.

Theorem 1 ([9]) Let A be a weighted automaton over the tropical semiring. If A has the twins property, then A is determinizable.

With trim unambiguous weighted automata, the condition is also necessary.

Theorem 2 ([9]) Let A be a trim unambiguous weighted automaton over the tropical semiring. Then the three following properties are equivalent:

1. A is determinizable.

2. A has the twins property.

3. A is subsequentiable.

There exists an efficient algorithm for testing the twins property for weighted automata [2]. The test of the twins property for finite-state transducers and weighted automata over other semirings is also discussed by [2]. Note that any acyclic weighted automaton over a zero-sum-free semiring has the twins property and is determinizable.


5 Weight Pushing

The choice of the distribution of the total weight along each successful path of a weighted automaton does not affect the definition of the function realized by that automaton, but it may have a critical impact on efficiency in many applications, e.g., natural language processing applications, when a heuristic pruning is used to visit only a subpart of the automaton. There exists an algorithm, weight pushing, for normalizing the distribution of the weights along the paths of a weighted automaton, or more generally a weighted directed graph.

Let A be a weighted automaton over a semiring K. Assume that K is zero-sum-free and weakly left divisible. For any state q ∈ Q, assume that the following sum is well-defined and in K:

    d[q] = ⊕_{π∈P(q,F)} (w[π] ⊗ ρ[n[π]])    (7)

d[q] is the shortest-distance from q to F [11]. d[q] is well-defined for all q ∈ Q when K is a k-closed semiring [11]. The weight pushing algorithm consists of computing each shortest-distance d[q] and of reweighting the transition weights, initial weights and final weights in the following way:

    ∀e ∈ E s.t. d[p[e]] ≠ 0,  w[e] ← d[p[e]]^−1 ⊗ w[e] ⊗ d[n[e]]    (8)

    ∀q ∈ I,  λ[q] ← λ[q] ⊗ d[q]    (9)

    ∀q ∈ F s.t. d[q] ≠ 0,  ρ[q] ← d[q]^−1 ⊗ ρ[q]    (10)

Each of these operations can be assumed to be done in constant time, thus reweighting can be done in linear time O(T⊗|A|), where T⊗ denotes the worst cost of an ⊗-operation. The complexity of the computation of the shortest-distances depends on the semiring. In the case of k-closed semirings such as the tropical semiring, d[q], q ∈ Q, can be computed using a generic shortest-distance algorithm [11]. The complexity of the algorithm is linear in the case of an acyclic automaton, O(|Q| + (T⊕ + T⊗)|E|), where T⊕ denotes the worst cost of an ⊕-operation. In the case of a general weighted automaton over the tropical semiring, the complexity of the algorithm is O(|E| + |Q| log |Q|).

In the case of closed semirings such as (R+, +, ×, 0, 1), a generalization of the Floyd-Warshall algorithm for computing all-pairs shortest-distances can be used. The complexity of the algorithm is Θ(|Q|^3 (T⊕ + T⊗ + T∗)), where T∗ denotes the worst cost of the closure operation. The space complexity of these algorithms is Θ(|Q|^2). These complexities make it impractical to use the Floyd-Warshall algorithm for computing d[q], q ∈ Q, for relatively large graphs or automata of several hundred million states or transitions. An approximate version of the shortest-distance algorithm of [11] can be used instead to compute d[q] efficiently [10].

Roughly speaking, the algorithm pushes the weight of each path as much as possible towards the initial states. Figures 3(a)-(c) illustrate the application of the algorithm in a special case, both for the tropical and probability semirings.
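Over the tropical semiring, where ⊗ is + and x^−1 is −x, equations (7)-(10) can be sketched as below. The naive relaxation loop is our stand-in for the generic shortest-distance algorithm of [11] and assumes no negative cycles; the representation is likewise our own:

```python
import math

def push_weights(states, arcs, rho):
    """Weight pushing over the tropical semiring.
    arcs: state -> [(label, weight, next_state)]; rho: final weights.
    Returns the shortest distances d and the reweighted arcs/finals.
    (Eq. (9), λ[q] <- λ[q] + d[q], would be applied analogously.)"""
    d = {q: rho.get(q, math.inf) for q in states}   # eq. (7), ⊕ = min
    changed = True                                  # naive relaxation
    while changed:
        changed = False
        for q in states:
            for (_, w, n) in arcs.get(q, []):
                if w + d[n] < d[q]:                 # ⊗ = +
                    d[q] = w + d[n]
                    changed = True
    # Eq. (8): w[e] <- d[p[e]]^-1 ⊗ w[e] ⊗ d[n[e]]  (here: -d[p] + w + d[n])
    new_arcs = {q: [(a, w + d[n] - d[q], n) for (a, w, n) in arcs.get(q, [])]
                for q in states if d[q] != math.inf}
    # Eq. (10): ρ[q] <- d[q]^-1 ⊗ ρ[q]
    new_rho = {q: r - d[q] for q, r in rho.items() if d[q] != math.inf}
    return d, new_arcs, new_rho
```

After pushing, the minimum outgoing weight at every state is 0, which is the tropical-semiring form of the stochasticity property stated in Proposition 2.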


Figure 3 (diagram not reproduced): Weight pushing algorithm. (a) Weighted automaton A. (b) Equivalent weighted automaton B obtained by weight pushing in the tropical semiring. (c) Weighted automaton C obtained from A by weight pushing in the probability semiring. (d) Minimal weighted automaton over the tropical semiring equivalent to A.

Note that if d[q] = 0, then, since K is zero-sum-free, the weight of all paths from q to F is 0.

Let A be a weighted automaton over the semiring K. Assume that K is closed or k-closed and that the shortest-distances d[q] are all well-defined and in K − {0}. Note that in both cases we can use the distributivity over the infinite sums defining shortest distances. Let e′ (π′) denote the transition e (path π) after application of the weight pushing algorithm; e′ (π′) differs from e (resp. π) only by its weight. Let λ′ denote the new initial weight function, and ρ′ the new final weight function.

Proposition 2 Let B = (Σ, Q, I, F, E′, λ′, ρ′) be the result of the weight pushing algorithm applied to the weighted automaton A. Then

1. the weight of a successful path π is unchanged after application of weight pushing:

    λ′[p[π′]] ⊗ w[π′] ⊗ ρ′[n[π′]] = λ[p[π]] ⊗ w[π] ⊗ ρ[n[π]]    (11)

2. the weighted automaton B is stochastic, i.e.

    ∀q ∈ Q,  ⊕_{e′∈E′[q]} w[e′] = 1    (12)

Proof. Let π′ = e′1 · · · e′k. By definition of λ′ and ρ′,

    λ′[p[π′]] ⊗ w[π′] ⊗ ρ′[n[π′]]
      = λ[p[e1]] ⊗ d[p[e1]] ⊗ d[p[e1]]^−1 ⊗ w[e1] ⊗ d[n[e1]] ⊗ · · ·
          ⊗ d[p[ek]]^−1 ⊗ w[ek] ⊗ d[n[ek]] ⊗ d[n[ek]]^−1 ⊗ ρ[n[π]]
      = λ[p[π]] ⊗ w[e1] ⊗ · · · ⊗ w[ek] ⊗ ρ[n[π]]

which proves the first statement of the proposition. Let q ∈ Q,

    ⊕_{e′∈E′[q]} w[e′]
      = ⊕_{e∈E[q]} d[q]^−1 ⊗ w[e] ⊗ d[n[e]]
      = d[q]^−1 ⊗ ⊕_{e∈E[q]} (w[e] ⊗ d[n[e]])
      = d[q]^−1 ⊗ ⊕_{e∈E[q]} (w[e] ⊗ ⊕_{π∈P(n[e],F)} (w[π] ⊗ ρ[n[π]]))
      = d[q]^−1 ⊗ ⊕_{e∈E[q], π∈P(n[e],F)} (w[e] ⊗ w[π] ⊗ ρ[n[π]])
      = d[q]^−1 ⊗ d[q] = 1

where we used the distributivity of the multiplicative operation over infinite sums in closed or k-closed semirings. This proves the second statement of the proposition.

These two properties of weight pushing are illustrated by Figures 3(a)-(c): the total weight of a successful path is unchanged after pushing; at each state of the weighted automaton of Figure 3(b), the minimum weight of the outgoing transitions is 0, and at each state of the weighted automaton of Figure 3(c), the weights of outgoing transitions sum to 1. Weight pushing can also be used to test the equivalence of two weighted automata [9].

6 Minimization

A deterministic weighted automaton is said to be minimal if there exists no other deterministic weighted automaton with a smaller number of states realizing the same function. Two states of a deterministic weighted automaton are said to be equivalent if exactly the same set of strings with the same weights label paths from these states to a final state, the final weights being included. Thus, two equivalent states of a deterministic weighted automaton can be merged without affecting the function realized by that automaton. A weighted automaton is minimal when it admits no two distinct equivalent states after any redistribution of the weights along its paths.

There exists a general algorithm for computing a minimal deterministic automaton equivalent to a given weighted automaton [9]. The algorithm consists of first applying the weight pushing algorithm to normalize the distribution of the weights along the paths of the input automaton, and then of treating each pair (label, weight) as a single label and applying the classical (unweighted) automata minimization.

Theorem 3 ([9]) Let A be a deterministic weighted automaton over a semiring K. Assume that the conditions of application of the weight pushing algorithm hold. Then the execution of the following steps:

1. weight pushing,

2. (unweighted) automata minimization,

lead to a minimal weighted automaton equivalent to A.
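After pushing, each (label, weight) pair acts as an ordinary symbol, so any classical DFA minimizer applies. Below is a compact sketch using Moore-style partition refinement (the representation is our own; in practice Hopcroft's O(|E| log |Q|) algorithm [1] would be used):

```python
def minimize_blocks(delta, finals, states):
    """Partition refinement on a deterministic automaton whose transition
    function delta maps state -> {(label, weight): next_state}. Returns a
    dict assigning each state to its equivalence block."""
    block = {q: int(q in finals) for q in states}   # split final/non-final
    while True:
        # A state's signature: its block plus the blocks its symbols reach.
        sig = {q: (block[q],
                   tuple(sorted((sym, block[n])
                                for sym, n in delta.get(q, {}).items())))
               for q in states}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_block = {q: ids[sig[q]] for q in states}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block                        # partition is stable
        block = new_block
```

On an automaton shaped like the pushed machine of Figure 3(b), states 1 and 2 carry identical (label, weight) signatures and fall into the same block, which is exactly the merge that yields Figure 3(d).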


Figure 4 (diagram not reproduced): Minimization of weighted automata. (a) Weighted automaton A′ over the probability semiring. (b) Minimal weighted automaton B′ equivalent to A′. (c) Minimal weighted automaton C′ equivalent to A′.

The complexity of automata minimization is linear in the case of acyclic automata, O(|Q| + |E|) [14], and O(|E| log |Q|) in the general case [1]. Thus, in view of the complexity results given in the previous section, in the case of the tropical semiring the total complexity of the weighted minimization algorithm is linear in the acyclic case, O(|Q| + |E|), and O(|E| log |Q|) in the general case.

Figures 3(a), 3(b), and 3(d) illustrate the application of the algorithm in the tropical semiring. The automaton of Figure 3(a) cannot be further minimized using the classical unweighted automata minimization since no two states are equivalent in that machine. After weight pushing, the automaton (Figure 3(b)) has two states (1 and 2) that can be merged by the classical unweighted automata minimization.

Figures 4(a)-(c) illustrate the minimization of an automaton defined over the probability semiring. Unlike the unweighted case, a minimal weighted automaton is not unique, but all minimal weighted automata have the same graph topology; they only differ in the way the weights are distributed along each path. The weighted automata B′ and C′ are both minimal and equivalent to A′. B′ is obtained from A′ using the algorithm described above in the probability semiring, and it is thus a stochastic weighted automaton in the probability semiring.

For a deterministic weighted automaton, the first operation of the semiring can be arbitrarily chosen without affecting the definition of the function it realizes. This is because, by definition, a deterministic weighted automaton admits at most one path labeled with any given string. Thus, in the algorithm described in Theorem 3, the weight pushing step can be executed in any semiring K′ whose multiplicative operation matches that of K. The minimal weighted automaton obtained by pushing the weights in K′ is also minimal in K, since it can be interpreted as a (deterministic) weighted automaton over K.

In particular, A′ can be interpreted as a weighted automaton over the (max, ×)-semiring (R+, max, ×, 0, 1). The application of the weighted minimization algorithm to A′ in this semiring leads to the minimal weighted automaton C′ of Figure 4(c). C′ is also a stochastic weighted automaton in the sense that, at any state, the maximum weight of all outgoing transitions is one.


This fact has several interesting consequences. One is related to the complexity of the algorithms. Indeed, we can choose a semiring K′ in which the complexity of weight pushing is better than in K. The resulting automaton is still minimal in K and has the additional property of being stochastic in K′. It only differs from the weighted automaton obtained by pushing weights in K in the way weights are distributed along the paths. They can be obtained from each other by application of weight pushing in the appropriate semiring. In the particular case of a weighted automaton over the probability semiring, it may be preferable to use weight pushing in the (max, ×)-semiring, since the complexity of the algorithm is then equivalent to that of classical single-source shortest-paths algorithms. The corresponding algorithm is a special instance of the generic shortest-distance algorithm given by [11].

Another important point is that the weight pushing algorithm may not be defined in K, because the machine is not zero-sum-free or for other reasons. But an alternative semiring K′ can sometimes be used to minimize the input weighted automaton.

The results just presented were all related to the minimization of the number of states of a deterministic weighted automaton. The following proposition shows that minimizing the number of states coincides with minimizing the number of transitions.

Proposition 3 ([9]) Let A be a minimal deterministic weighted automaton. Then A has the minimal number of transitions.

Proof. Let A be a deterministic weighted automaton with the minimal number of transitions. If two distinct states of A were equivalent, they could be merged, thereby strictly reducing the number of its transitions. Thus, A must be a minimal deterministic automaton. Since minimal deterministic automata have the same topology, and in particular the same number of states and transitions, this proves the proposition.

7 Conclusion

We surveyed several recent weighted finite-state transducer algorithms. These algorithms can be used in a variety of applications to create efficient and complex systems. They have been used with weighted transducers of several hundred million states and transitions to create large-vocabulary speech recognition and complex spoken-dialog systems. Other algorithms such as ε-removal and synchronization of weighted transducers also play a critical role in the design of such large-scale systems.

References

[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley: Reading, MA, 1974.


[2] Cyril Allauzen and Mehryar Mohri. Efficient Algorithms for Testing theTwins Property. Journal of Automata, Languages and Combinatorics, 8(2),2003.

[3] Jean Berstel. Transductions and Context-Free Languages. Teubner Studienbücher: Stuttgart, 1979.

[4] Jean Berstel and Christophe Reutenauer. Rational Series and Their Languages. Springer-Verlag: Berlin-New York, 1988.

[5] Karel Culik II and Jarkko Kari. Digital Images and Formal Languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 3, pages 599–616. Springer, 1997.

[6] Samuel Eilenberg. Automata, Languages and Machines, volume A. Academic Press, 1974.

[7] Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany, 1986.

[8] Daniel J. Lehmann. Algebraic Structures for Transitive Closures. Theoretical Computer Science, 4:59–76, 1977.

[9] Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997.

[10] Mehryar Mohri. General Algebraic Frameworks and Algorithms for Shortest-Distance Problems. Technical Memorandum 981210-10TM, AT&T Labs – Research, 62 pages, 1998.

[11] Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

[12] Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language, Budapest, Hungary. ECAI, 1996.

[13] Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, pages 431–453. MIT Press, Cambridge, Massachusetts, 1997.

[14] Dominique Revuz. Minimisation of acyclic deterministic automata in linear time. Theoretical Computer Science, 92:181–189, 1992.

[15] Arto Salomaa and Matti Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag: New York, 1978.


[16] Marcel Paul Schützenberger. Sur une variante des fonctions séquentielles. Theoretical Computer Science, 4(1):47–57, 1977.


Journal of Machine Learning Research 1 (2004) 1-50 Submitted 12/03; Published 01/04

Rational Kernels: Theory and Algorithms

Corinna Cortes [email protected]

Google Labs, 1440 Broadway, New York, NY 10018

Patrick Haffner [email protected]

AT&T Labs – Research, 180 Park Avenue, Florham Park, NJ 07932

Mehryar Mohri [email protected]

AT&T Labs – Research, 180 Park Avenue, Florham Park, NJ 07932

Editors: Kristin Bennett and Nicolò Cesa-Bianchi

Abstract

Many classification algorithms were originally designed for fixed-size vectors. Recent applications in text and speech processing and computational biology require however the analysis of variable-length sequences and more generally weighted automata. An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that extend kernel methods to the analysis of variable-length sequences or more generally weighted automata. We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm.

Not all rational kernels are positive definite and symmetric (PDS), or equivalently verify the Mercer condition, a condition that guarantees the convergence of training for discriminant classification algorithms such as SVMs. We present several theoretical results related to PDS rational kernels. We show that under some general conditions these kernels are closed under sum, product, or Kleene-closure and give a general method for constructing a PDS rational kernel from an arbitrary transducer defined on some non-idempotent semirings. We give the proof of several characterization results that can be used to guide the design of PDS rational kernels. We also show that some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler, and some string kernels used in the context of computational biology are specific instances of rational kernels. Our results include the proof that the edit-distance over a non-trivial alphabet is not negative definite, which, to the best of our knowledge, was never stated or proved before.

Rational kernels can be combined with SVMs to form efficient and powerful techniques for a variety of classification tasks in text and speech processing, or computational biology. We describe examples of general families of PDS rational kernels that are useful in many of these applications and report the result of our experiments illustrating the use of rational kernels in several difficult large-vocabulary spoken-dialog classification tasks based on

©2004 Corinna Cortes, Patrick Haffner, and Mehryar Mohri.


Cortes, Haffner, and Mohri

deployed spoken-dialog systems. Our results show that rational kernels are easy to design and implement and lead to substantial improvements of the classification accuracy.

1. Introduction

Many classification algorithms were originally designed for fixed-length vectors. Recent applications in text and speech processing and computational biology require however the analysis of variable-length sequences and more generally weighted automata. Indeed, the output of a large-vocabulary speech recognizer for a particular input speech utterance, or that of a complex information extraction system combining several knowledge sources for a specific input query, is typically a weighted automaton compactly representing a large set of alternative sequences. The weights assigned by the system to each sequence are used to rank different alternatives according to the models the system is based on. The error rate of such complex systems is still too high in many tasks to rely only on their one-best output; thus it is preferable instead to use the full weighted automata, which contain the correct result in most cases.

An approach widely used in statistical learning techniques such as Support Vector Machines (SVMs) (Boser et al., 1992, Cortes and Vapnik, 1995, Vapnik, 1998) is that of kernel methods, due to their computational efficiency in high-dimensional feature spaces. We introduce a general family of kernels based on weighted transducers or rational relations, rational kernels, that extend kernel methods to the analysis of variable-length sequences or more generally weighted automata.1 We show that rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers and a general single-source shortest-distance algorithm.

Not all rational kernels are positive definite and symmetric (PDS), or equivalently verify the Mercer condition (Berg et al., 1984), a condition that guarantees the convergence of training for discriminant classification algorithms such as SVMs. We present several theoretical results related to PDS rational kernels. We show that under some general conditions these kernels are closed under sum, product, or Kleene-closure and give a general method for constructing a PDS rational kernel from an arbitrary transducer defined on some non-idempotent semirings. We give the proof of several characterization results that can be used to guide the design of PDS rational kernels.

We also study the relationship between rational kernels and some commonly used string kernels or similarity measures such as the edit-distance, the convolution kernels of Haussler (Haussler, 1999), and some string kernels used in the context of computational biology (Leslie et al., 2003). We show that these kernels are all specific instances of rational kernels. In each case, we explicitly describe the corresponding weighted transducer. These transducers are often simple and efficient for computing kernels. Their diagram provides more insight into the definition of kernels and can guide the design of new kernels. Our results also include the proof of the fact that the edit-distance over a non-trivial alphabet is not negative definite, which, to the best of our knowledge, was never stated or proved before.

Rational kernels can be combined with SVMs to form efficient and powerful techniques for a variety of applications to text and speech processing, or to computational biology. We describe examples of general families of PDS rational kernels that are useful in many of

1. We have described in shorter publications part of the material presented here (Cortes et al., 2003a,b,c,d).


Semiring      Set               ⊕      ⊗   0    1
Boolean       {0, 1}            ∨      ∧   0    1
Probability   R+                +      ×   0    1
Log           R ∪ {−∞, +∞}      ⊕log   +   +∞   0
Tropical      R ∪ {−∞, +∞}      min    +   +∞   0

Table 1: Semiring examples. ⊕log is defined by: x ⊕log y = −log(e^−x + e^−y).

these applications. We report the result of our experiments illustrating the use of rational kernels in several difficult large-vocabulary spoken-dialog classification tasks based on deployed spoken-dialog systems. Our results show that rational kernels are easy to design and implement and lead to substantial improvements of the classification accuracy.

The paper is organized as follows. In the following section, we introduce the notation and some preliminary algebraic and automata-theoretic definitions used in the remaining sections. Section 3 introduces the definition of rational kernels. In Section 4, we present general algorithms that can be used to compute rational kernels efficiently. Section 5 introduces the classical definitions of positive definite and negative definite kernels and gives a number of novel theoretical results, including the proof of some general closure properties of PDS rational kernels, a general construction of PDS rational kernels starting from an arbitrary weighted transducer, a characterization of acyclic PDS rational kernels, and the proof of the closure properties of a very general class of PDS rational kernels. Section 6 studies the relationship between some commonly used kernels and rational kernels. Finally, the results of our experiments in several spoken-dialog classification tasks are reported in Section 7.

2. Preliminaries

In this section, we present the algebraic definitions and notation needed to introduce rational kernels.

A system (K, ·, e) is a monoid if it is closed under ·: a · b ∈ K for all a, b ∈ K; · is associative: (a · b) · c = a · (b · c) for all a, b, c ∈ K; and e is an identity for ·: a · e = e · a = a, for all a ∈ K. When additionally · is commutative: a · b = b · a for all a, b ∈ K, then (K, ·, e) is said to be a commutative monoid.

Definition 1 (Kuich and Salomaa (1986)) A system (K, ⊕, ⊗, 0, 1) is a semiring if: (K, ⊕, 0) is a commutative monoid with identity element 0; (K, ⊗, 1) is a monoid with identity element 1; ⊗ distributes over ⊕; and 0 is an annihilator for ⊗: for all a ∈ K, a ⊗ 0 = 0 ⊗ a = 0.

Thus, a semiring is a ring that may lack negation. Table 1 lists some familiar semirings. In addition to the Boolean semiring and the probability semiring, two semirings often used in applications are the log semiring, which is isomorphic to the probability semiring via a log morphism, and the tropical semiring, which is derived from the log semiring using the Viterbi approximation.
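To make the table concrete, here is a minimal sketch (not from the paper) of how the probability, log, and tropical semirings of Table 1 can be packaged as (⊕, ⊗, 0, 1) records; the `log_add` helper and the dictionary layout are illustrative assumptions.

```python
import math

def log_add(x, y):
    """⊕log from Table 1: x ⊕log y = -log(e^-x + e^-y), computed stably."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    m = min(x, y)
    return m - math.log1p(math.exp(-abs(x - y)))

# Each semiring bundles its (⊕, ⊗, 0, 1) so generic algorithms can be
# written once and parameterized by the semiring.
PROBABILITY = {"plus": lambda x, y: x + y, "times": lambda x, y: x * y,
               "zero": 0.0, "one": 1.0}
LOG = {"plus": log_add, "times": lambda x, y: x + y,
       "zero": math.inf, "one": 0.0}
TROPICAL = {"plus": min, "times": lambda x, y: x + y,
            "zero": math.inf, "one": 0.0}
```

The tropical semiring is obtained from the log semiring by replacing ⊕log with min, mirroring the Viterbi approximation mentioned above.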


Definition 2 A weighted finite-state transducer T over a semiring K is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where: Σ is the finite input alphabet of the transducer; ∆ is the finite output alphabet; Q is a finite set of states; I ⊆ Q the set of initial states; F ⊆ Q the set of final states; E ⊆ Q × (Σ ∪ {ε}) × (∆ ∪ {ε}) × K × Q a finite set of transitions; λ : I → K the initial weight function; and ρ : F → K the final weight function mapping F to K.

Weighted automata can be formally defined in a similar way by simply omitting the input or output labels.

Given a transition e ∈ E, we denote by p[e] its origin or previous state, n[e] its destination or next state, and w[e] its weight. A path π = e1 · · · ek is an element of E∗ with consecutive transitions: n[ei−1] = p[ei], i = 2, . . . , k. We extend n and p to paths by setting: n[π] = n[ek] and p[π] = p[e1]. A cycle π is a path whose origin and destination coincide: p[π] = n[π]. A weighted automaton or transducer is said to be acyclic if it admits no cycle. A successful path in a weighted automaton or transducer M is a path from an initial state to a final state. The weight function w can also be extended to paths by defining the weight of a path as the ⊗-product of the weights of its constituent transitions: w[π] = w[e1] ⊗ · · · ⊗ w[ek]. We denote by P(q, q′) the set of paths from q to q′ and by P(q, x, y, q′) the set of paths from q to q′ with input label x ∈ Σ∗ and output label y ∈ ∆∗. These definitions can be extended to subsets R, R′ ⊆ Q by: P(R, x, y, R′) = ∪_{q∈R, q′∈R′} P(q, x, y, q′). We denote by w[M] the ⊕-sum of the weights of all the successful paths of the automaton or transducer M, when that sum is well-defined and in K. A transducer T is regulated if the output weight associated by T to any pair of input-output strings (x, y) by:

[[T]](x, y) = ⊕_{π ∈ P(I,x,y,F)} λ(p[π]) ⊗ w[π] ⊗ ρ[n[π]]   (1)

is well-defined and in K. [[T]](x, y) = 0 when P(I, x, y, F) = ∅. If for all q ∈ Q the sum ⊕_{π∈P(q,ε,ε,q)} w[π] is in K, then T is regulated. In particular, when T does not have any ε-cycle, that is, a cycle labeled with ε (both input and output labels), it is regulated. In the following, we will assume that all the transducers considered are regulated. Regulated weighted transducers are closed under the rational operations: ⊕-sum, ⊗-product, and Kleene-closure, which are defined as follows for all transducers T1 and T2 and (x, y) ∈ Σ∗ × ∆∗:

[[T1 ⊕ T2]](x, y) = [[T1]](x, y) ⊕ [[T2]](x, y)   (2)

[[T1 ⊗ T2]](x, y) = ⊕_{x=x1x2, y=y1y2} [[T1]](x1, y1) ⊗ [[T2]](x2, y2)   (3)

[[T∗]](x, y) = ⊕_{n=0}^{∞} [[T^n]](x, y)   (4)

where T^n stands for the (n − 1)-⊗-product of T with itself. For any transducer T, we denote by T−1 its inverse, that is, the transducer obtained from T by transposing the input and output labels of each transition and the input and output alphabets.
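As an illustration, a regulated transducer with finite support can be represented naively as a dictionary mapping (input, output) string pairs to weights in the probability semiring; the helpers below, a sketch with the invented names `t_sum` and `t_product`, mirror Equations 2 and 3.

```python
from collections import defaultdict

# Illustrative sketch: a finite-support transducer as a dict
# {(input string, output string): weight} over the probability semiring.

def t_sum(t1, t2):
    """⊕-sum: [[T1 ⊕ T2]](x, y) = [[T1]](x, y) + [[T2]](x, y)."""
    out = defaultdict(float)
    for t in (t1, t2):
        for pair, w in t.items():
            out[pair] += w
    return dict(out)

def t_product(t1, t2):
    """⊗-product: sum over all splittings x = x1x2, y = y1y2."""
    out = defaultdict(float)
    for (x1, y1), w1 in t1.items():
        for (x2, y2), w2 in t2.items():
            out[(x1 + x2, y1 + y2)] += w1 * w2
    return dict(out)

T1 = {("a", "b"): 0.5, ("ab", "b"): 0.25}   # toy example transducers
T2 = {("c", "d"): 0.2}
```

For the toy transducers above, `t_product(T1, T2)` assigns weight 0.5 × 0.2 to the pair ("ac", "bd"), as Equation 3 prescribes.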

Composition is a fundamental operation on weighted transducers that can be used in many applications to create complex weighted transducers from simpler ones. Let


T1 = (Σ, ∆, Q1, I1, F1, E1, λ1, ρ1) and T2 = (∆, Ω, Q2, I2, F2, E2, λ2, ρ2) be two weighted transducers defined over a commutative semiring K such that ∆, the output alphabet of T1, coincides with the input alphabet of T2. Then, the result of the composition of T1 and T2 is a weighted transducer T1 ◦ T2 which, when it is regulated, is defined for all x, y by (Berstel, 1979, Eilenberg, 1974, Salomaa and Soittola, 1978, Kuich and Salomaa, 1986):2

[[T1 ◦ T2]](x, y) = ⊕_{z ∈ ∆∗} [[T1]](x, z) ⊗ [[T2]](z, y)   (5)

Note that a transducer can be viewed as a matrix over a countable set Σ∗ × ∆∗ and composition as the corresponding matrix-multiplication.

The definition of composition extends naturally to weighted automata since a weighted automaton can be viewed as a weighted transducer with identical input and output labels for each transition. The corresponding transducer associates [[A]](x) to the pair (x, x), and 0 to all other pairs. Thus, the composition of a weighted automaton A1 = (∆, Q1, I1, F1, E1, λ1, ρ1) and a weighted transducer T2 = (∆, Ω, Q2, I2, F2, E2, λ2, ρ2) is simply defined for all x, y in ∆∗ × Ω∗ by:

[[A1 ◦ T2]](x, y) = [[A1]](x) ⊗ [[T2]](x, y)   (6)

when these weights are well-defined and in K. Intersection of two weighted automata is the special case of composition where both operands are weighted automata, or equivalently weighted transducers with identical input and output labels for each transition.

3. Definitions

Let X and Y be non-empty sets. A function K : X × Y → R is said to be a kernel over X × Y. This section introduces rational kernels, which are kernels defined over sets of strings or weighted automata.

Definition 3 A kernel K over Σ∗ × ∆∗ is said to be rational if there exist a weighted transducer T = (Σ, ∆, Q, I, F, E, λ, ρ) over the semiring K and a function ψ : K → R such that for all x ∈ Σ∗ and y ∈ ∆∗:3

K(x, y) = ψ([[T]](x, y))   (7)

K is then said to be defined by the pair (ψ, T).

This definition and many of the results presented in this paper can be generalized by replacing the free monoids Σ∗ and ∆∗ with arbitrary monoids M1 and M2. Also, note that we are not making any particular assumption about the function ψ in this definition. In general, it is an arbitrary function mapping K to R.

Figure 1 shows an example of a weighted transducer over the probability semiring corresponding to the gappy n-gram kernel with decay factor λ as defined by (Lodhi et al., 2001). Such gappy n-gram kernels are rational kernels (Cortes et al., 2003c).
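For intuition, one common variant of such a gappy bigram kernel can be computed by brute force: count the matching, possibly gapped, bigrams of two strings, discounting each skipped symbol by the decay factor λ. This sketch only approximates the quantity the transducer of Figure 1 computes, since normalization and discounting conventions for gappy kernels differ across definitions.

```python
def gappy_bigram_kernel(x, y, lam=0.1):
    """Brute-force gappy bigram count: for every pair of positions i < j
    in x and k < l in y whose symbols match as a bigram, add lam raised
    to the total number of skipped symbols inside the two gaps."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for k in range(len(y)):
                for l in range(k + 1, len(y)):
                    if x[i] == y[k] and x[j] == y[l]:
                        total += lam ** ((j - i - 1) + (l - k - 1))
    return total
```

For example, `gappy_bigram_kernel("aab", "ab")` finds the contiguous occurrence of "ab" (weight 1) plus a gapped occurrence skipping one symbol (weight λ).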

2. We use a matrix notation for the definition of composition as opposed to a functional notation.
3. We chose to call these kernels “rational” because their definition is based on rational relations or rational transductions (Salomaa and Soittola, 1978, Kuich and Salomaa, 1986) represented by a weighted transducer. The mathematical counterpart of weighted automata and transducers are also called rational power series (Berstel and Reutenauer, 1988), which further justifies our terminology.


[Figure 1: weighted transducer diagram with six states (0–5), states 4 and 5 final with weight 1; transitions carry labels such as a:ε/1, b:ε/1, ε:a/1, ε:b/1, a:a/0.01, b:b/0.01, a:ε/0.1, ε:a/0.1, and ε:b/0.1.]

Figure 1: Gappy bigram rational kernel with decay factor λ = .1. Bold face circles represent initial states and double circles indicate final states. Inside each circle, the first number indicates the state number, the second, at final states only, the value of the final weight function ρ at that state. Arrows represent transitions. They are labeled with an input and an output symbol separated by a colon and followed by their corresponding weight after the slash symbol.

Rational kernels can be naturally extended to kernels over weighted automata. Let A be a weighted automaton defined over the semiring K and the alphabet Σ and B a weighted automaton defined over the semiring K and the alphabet ∆. K(A, B) is defined by:

K(A, B) = ψ( ⊕_{(x,y)∈Σ∗×∆∗} [[A]](x) ⊗ [[T]](x, y) ⊗ [[B]](y) )   (8)

for all weighted automata A and B such that the ⊕-sum

⊕_{(x,y)∈Σ∗×∆∗} [[A]](x) ⊗ [[T]](x, y) ⊗ [[B]](y)

is well-defined and in K. This sum is always defined and in K when A and B are acyclic weighted automata, since the sum then runs over a finite set. It is defined for all weighted automata in all closed semirings (Kuich and Salomaa, 1986) such as the tropical semiring. In the probability semiring, the sum is well-defined for all A, B, and T representing probability distributions. When K(A, B) is defined, Equation 8 can be equivalently written as:

K(A, B) = ψ( ⊕_{(x,y)∈Σ∗×∆∗} [[A ◦ T ◦ B]](x, y) ) = ψ(w[A ◦ T ◦ B])   (9)

The next section presents a general algorithm for computing rational kernels.

4. Algorithms

The algorithm for computing K(x, y), or K(A, B), for any two acyclic weighted automata, or for any two weighted automata such that the sum above is well-defined, is based on two general algorithms that we briefly present: composition of weighted transducers to combine A, T, and B, and a general shortest-distance algorithm in a semiring K to compute the ⊕-sum of the weights of all successful paths of the composed transducer.


[Figure 2: three weighted transducer diagrams, panels (a), (b), and (c); transition weights include a:a/1.61, b:b/0.22, a:b/0, b:a/0.69 in (a); a:a/1.2, a:b/2.3, b:a/0.51, b:b/0.92, a:a/0.51 in (b); and a:a/2.81, a:b/3.91, b:a/0.73, a:a/0.51, a:b/0.92, b:a/1.2 in (c).]

Figure 2: (a) Weighted transducer T1 over the log semiring. (b) Weighted transducer T2 over the log semiring. (c) T1 ◦ T2, result of the composition of T1 and T2.

4.1 Composition of weighted transducers

There exists a general and efficient composition algorithm for weighted transducers which takes advantage of the sparseness of the input transducers (Pereira and Riley, 1997, Mohri et al., 1996). States in the composition T1 ◦ T2 of two weighted transducers T1 and T2 are identified with pairs of a state of T1 and a state of T2. Leaving aside transitions with ε inputs or outputs, the following rule specifies how to compute a transition of T1 ◦ T2 from appropriate transitions of T1 and T2:4

(q1, a, b, w1, q2) and (q′1, b, c, w2, q′2) =⇒ ((q1, q′1), a, c, w1 ⊗ w2, (q2, q′2))   (10)

In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q′1; thus, the space and time complexity of composition is quadratic: O((|Q1| + |E1|)(|Q2| + |E2|)). Figure 2(c) illustrates the algorithm when applied to the transducers of Figures 2(a)-(b) defined over the log semiring.
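The ε-free case of this rule takes only a few lines of code. The sketch below, with the invented helper name `compose`, pairs transitions of T1 and T2 that share the middle label, using real multiplication for ⊗ (probability semiring); transitions are tuples (src, a, b, w, dst) as in Equation 10.

```python
def compose(e1, e2):
    """ε-free composition of transition lists per Equation 10: a transition
    (q1, a, b, w1, r1) of T1 pairs with (q2, b, c, w2, r2) of T2 to yield
    ((q1, q2), a, c, w1*w2, (r1, r2))."""
    out = []
    for (q1, a, b, w1, r1) in e1:
        for (q2, b2, c, w2, r2) in e2:
            if b == b2:                     # middle labels must match
                out.append(((q1, q2), a, c, w1 * w2, (r1, r2)))
    return out
```

The double loop makes the quadratic worst-case complexity noted above directly visible; practical implementations index transitions by label to exploit sparseness.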

4.2 Single-source shortest distance algorithm over a semiring

Given a weighted automaton or transducer M, the shortest-distance from state q to the set of final states F is defined as the ⊕-sum of all the paths from q to F:

d[q] = ⊕_{π∈P(q,F)} w[π] ⊗ ρ[n[π]]   (11)

when this sum is well-defined and in K. This is always the case when the semiring is k-closed or when M is acyclic (Mohri, 2002), the case of interest in our experiments. There exists a general algorithm for computing the shortest-distance d[q] (Mohri, 2002). The algorithm is based on a generalization to k-closed semirings of the relaxation technique used in classical

4. See (Pereira and Riley, 1997, Mohri et al., 1996) for a detailed presentation of the algorithm, including the use of a transducer filter for dealing with ε-multiplicity in the case of non-idempotent semirings.


single-source shortest-paths algorithms. When M is acyclic, the complexity of the algorithm is linear: O(|Q| + (T⊕ + T⊗)|E|), where T⊕ denotes the maximum time to compute ⊕ and T⊗ the time to compute ⊗ (Mohri, 2002). The algorithm can then be viewed as a generalization of Lawler's algorithm (Lawler, 1976) to the case of an arbitrary semiring K. It is then based on a generalized relaxation of the outgoing transitions of each state of M visited in reverse topological order (Mohri, 2002).
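For an acyclic machine whose states happen to be numbered in topological order, the relaxation just described reduces to a single pass over the transitions. The function below is only a sketch under that numbering assumption, parameterized by the semiring operations so the same code serves the tropical, log, or probability semiring.

```python
import math

def shortest_distance(n_states, edges, source, plus, times, zero, one):
    """Single-source shortest distance in an acyclic machine.
    edges: list of (src, dst, weight) with src < dst (topological numbering).
    Returns d, where d[q] is the ⊕-sum over all paths source -> q of the
    ⊗-product of the transition weights along the path."""
    d = [zero] * n_states
    d[source] = one
    # Sorting by source state processes states in topological order, so
    # d[src] is final before its outgoing transitions are relaxed.
    for src, dst, w in sorted(edges):
        d[dst] = plus(d[dst], times(d[src], w))
    return d

# Tropical semiring example: ⊕ = min, ⊗ = +, 0 = +∞, 1 = 0.
edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0)]
dist = shortest_distance(3, edges, 0, min, lambda a, b: a + b, math.inf, 0.0)
```

In the tropical semiring this is exactly the classical DAG shortest-path recurrence; swapping in the probability semiring's (+, ×) computes the total path weight instead.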

Let K be a rational kernel and let T be the associated weighted transducer. Let A and B be two acyclic weighted automata or, more generally, two weighted automata such that the sum in Equation 9 is well-defined and in K. A and B may represent just two strings x, y ∈ Σ∗ or may be any complex weighted automata. By definition of rational kernels (Equation 9) and the shortest-distance (Equation 11), K(A, B) can be computed by:

1. Constructing the composed transducer N = A ◦ T ◦ B.

2. Computing w[N] by determining the shortest-distance from the initial states of N to its final states using the shortest-distance algorithm just described.

3. Computing ψ(w[N ]).

When A and B are acyclic, the shortest-distance algorithm is linear and the total complexity of the algorithm is O(|T||A||B| + Φ), where |T|, |A|, and |B| denote respectively the sizes of T, A, and B, and Φ the worst-case complexity of computing ψ(x), x ∈ K. If we assume that Φ can be computed in constant time, as in many applications, then the complexity of the computation of K(A, B) is quadratic with respect to A and B: O(|T||A||B|).

5. Theory of PDS and NDS Rational Kernels

In learning techniques such as those based on SVMs, we are particularly interested in kernels that are positive definite symmetric (PDS), or, equivalently, kernels verifying Mercer's condition, which guarantee the existence of a Hilbert space and a dot product associated to the kernel considered. This ensures the convergence of the training algorithm to a unique optimum. Thus, in what follows, we will focus on theoretical results related to the construction of rational kernels that are PDS. Due to the symmetry condition, the input and output alphabets Σ and ∆ will coincide for the underlying transducers associated to the kernels.

This section reviews a number of results related to general PDS kernels, that is, the class of all kernels that have the Mercer property (Berg et al., 1984). It also gives novel proofs and results in the specific case of PDS rational kernels. These results can be used to combine PDS rational kernels to design new PDS rational kernels or to construct a PDS rational kernel. Our proofs and results are original and are not just straightforward extensions of those existing in the case of general PDS kernels. This is because, for example, a closure property for PDS rational kernels must guarantee not just that the PDS property is preserved but also that the rational property is retained. Our original results include a general construction of PDS rational kernels from arbitrary weighted transducers, a number of theorems related to the converse, and a study of the negative definiteness of some rational kernels.


Definition 4 Let X be a non-empty set. A function K : X × X → R is said to be a PDS kernel if it is symmetric (K(x, y) = K(y, x) for all x, y ∈ X) and

∑_{i,j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0   (12)

for all n ≥ 1, {x1, . . . , xn} ⊆ X and {c1, . . . , cn} ⊆ R.

It is clear from classical results of linear algebra that K is a PDS kernel iff the matrix (K(xi, xj))_{i,j≤n} for all n ≥ 1 and all {x1, . . . , xn} ⊆ X is symmetric and all its eigenvalues are non-negative.
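This characterization is easy to check numerically on finite samples. The sketch below (using NumPy, with the invented helper name `is_pds_matrix`) tests symmetry and the sign of the smallest eigenvalue of a kernel matrix; the two matrices are toy examples.

```python
import numpy as np

def is_pds_matrix(K, tol=1e-10):
    """Check the finite-sample PDS characterization: the Gram matrix must
    be symmetric with all eigenvalues non-negative (up to a tolerance)."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

K_good = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
K_bad = np.array([[0.0, 1.0], [1.0, 0.0]])    # eigenvalues -1 and 1
```

Such a numerical check on sampled strings cannot prove a kernel PDS, but a single failing sample suffices to show a candidate kernel is not.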

PDS kernels can be used to construct other families of kernels that also meet these conditions (Scholkopf and Smola, 2002). Polynomial kernels of degree p are formed from the expression (K + a)^p, and Gaussian kernels can be formed as exp(−d²/σ²) with d²(x, y) = K(x, x) + K(y, y) − 2K(x, y). The following sections will provide other ways of constructing PDS rational kernels.
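These two constructions translate directly into code. The sketch below builds the polynomial and Gaussian kernels from an arbitrary base PDS kernel `k`; the helper names are invented, and the ordinary dot product serves as the base kernel in the example.

```python
import math

def polynomial_kernel(k, a=1.0, p=2):
    """Degree-p polynomial kernel (k(x, y) + a)^p built from a base kernel k."""
    return lambda x, y: (k(x, y) + a) ** p

def gaussian_kernel(k, sigma=1.0):
    """Gaussian kernel exp(-d^2 / sigma^2) with the squared distance
    d^2(x, y) = k(x, x) + k(y, y) - 2 k(x, y) induced by the base kernel."""
    def g(x, y):
        d2 = k(x, x) + k(y, y) - 2.0 * k(x, y)
        return math.exp(-d2 / sigma ** 2)
    return g

# Base PDS kernel for the example: the ordinary dot product on tuples.
dot = lambda x, y: sum(a * b for a, b in zip(x, y))
```

Any PDS rational kernel computed as in Section 4 could be substituted for `dot` here, yielding, e.g., a Gaussian kernel over the distance induced by a rational kernel.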

5.1 General Closure Properties of PDS Kernels

The following theorem summarizes general closure properties of PDS kernels (Berg et al., 1984).

Theorem 5 Let X and Y be two non-empty sets.

1. Closure under sum: Let K1, K2 : X × X → R be PDS kernels; then K1 + K2 : X × X → R is a PDS kernel.

2. Closure under product: Let K1, K2 : X × X → R be PDS kernels; then K1 · K2 : X × X → R is a PDS kernel.

3. Closure under tensor product: Let K1 : X × X → R and K2 : Y × Y → R be PDS kernels; then their tensor product K1 ⊗ K2 : (X × Y) × (X × Y) → R, where K1 ⊗ K2((x1, y1), (x2, y2)) = K1(x1, x2) · K2(y1, y2), is a PDS kernel.

4. Closure under pointwise limit: Let Kn : X × X → R be a PDS kernel for all n ∈ N and assume that lim_{n→∞} Kn(x, y) exists for all x, y ∈ X; then K defined by K(x, y) = lim_{n→∞} Kn(x, y) is a PDS kernel.

5. Closure under composition with a power series: Let K : X × X → R be a PDS kernel such that |K(x, y)| < ρ for all (x, y) ∈ X × X. Then, if the radius of convergence of the power series S = ∑_{n=0}^{∞} a_n x^n is ρ and a_n ≥ 0 for all n ≥ 0, the composed kernel S ◦ K is a PDS kernel. In particular, if K : X × X → R is a PDS kernel, then so is exp(K).

In particular, these closure properties all apply to PDS kernels that are rational; e.g., the sum or product of two PDS rational kernels is a PDS kernel. However, Theorem 5 does not guarantee the result to be a rational kernel. In the next section, we examine precisely the question of the closure properties of PDS rational kernels (under rational operations).


5.2 Closure Properties of PDS Rational Kernels

In this section, we assume that a fixed function ψ is used in the definition of all the rational kernels mentioned. We denote by KT the rational kernel corresponding to the transducer T, defined for all x, y ∈ Σ∗ by KT(x, y) = ψ([[T]](x, y)).

Theorem 6 Let Σ be a non-empty alphabet. The following closure properties hold for PDS rational kernels.

1. Closure under ⊕-sum: Assume that ψ : (K, ⊕, 0) → (R, +, 0) is a monoid morphism.5 Let KT1, KT2 : Σ∗ × Σ∗ → R be PDS rational kernels; then KT1⊕T2 : Σ∗ × Σ∗ → R is a PDS rational kernel and KT1⊕T2 = KT1 + KT2.

2. Closure under ⊗-product: Assume that ψ : (K, ⊕, ⊗, 0, 1) → (R, +, ×, 0, 1) is a semiring morphism. Let KT1, KT2 : Σ∗ × Σ∗ → R be PDS rational kernels; then KT1⊗T2 : Σ∗ × Σ∗ → R is a PDS rational kernel.

3. Closure under Kleene-closure: Assume that ψ : (K, ⊕, ⊗, 0, 1) → (R, +, ×, 0, 1) is a continuous semiring morphism. Let KT : Σ∗ × Σ∗ → R be a PDS rational kernel; then KT∗ : Σ∗ × Σ∗ → R is a PDS rational kernel.

Proof The closure under ⊕-sum follows directly from Theorem 5 and the fact that for all x, y ∈ Σ∗:

ψ([[T1]](x, y) ⊕ [[T2]](x, y)) = ψ([[T1]](x, y)) + ψ([[T2]](x, y))

when ψ : (K, ⊕, 0) → (R, +, 0) is a monoid morphism. For the closure under ⊗-product, when ψ is a semiring morphism, for all x, y ∈ Σ∗:

ψ([[T1 ⊗ T2]](x, y)) = ∑_{x1x2=x, y1y2=y} ψ([[T1]](x1, y1)) · ψ([[T2]](x2, y2))   (13)
                     = ∑_{x1x2=x, y1y2=y} (KT1 ⊗ KT2)((x1, x2), (y1, y2))

By Theorem 5, since KT1 and KT2 are PDS kernels, their tensor product KT1 ⊗ KT2 is a PDS kernel, and there exist a Hilbert space H and a mapping u → φ_u such that (KT1 ⊗ KT2)(u, v) = ⟨φ_u, φ_v⟩ (Berg et al., 1984). Thus

ψ([[T1 ⊗ T2]](x, y)) = ∑_{x1x2=x, y1y2=y} ⟨φ_(x1,x2), φ_(y1,y2)⟩   (14)
                     = ⟨ ∑_{x1x2=x} φ_(x1,x2), ∑_{y1y2=y} φ_(y1,y2) ⟩

Since a dot product is positive definite, this equality implies that KT1⊗T2 is a PDS kernel. A similar proof is given by Haussler (1999). The closure under Kleene-closure is a direct

5. A monoid morphism ψ : (K, ⊕, 0) → (R, +, 0) is a function verifying ψ(x ⊕ y) = ψ(x) + ψ(y) for all x, y ∈ K, and ψ(0) = 0. A semiring morphism is a function ψ : (K, ⊕, ⊗, 0, 1) → (R, +, ×, 0, 1) further verifying ψ(x ⊗ y) = ψ(x) · ψ(y) for all x, y ∈ K, and ψ(1) = 1.


consequence of the closure under ⊕-sum and ⊗-product of PDS rational kernels and the closure under pointwise limit of PDS kernels (Theorem 5).

Theorem 6 provides a general method for constructing complex PDS rational kernels from simpler ones. PDS rational kernels defined to model specific prior knowledge sources can be combined using rational operations to create more general PDS kernels.
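As a quick numerical illustration of the ⊕-sum closure (a sketch with made-up feature vectors, not actual transducer weights): any kernel obtained as a dot product of feature vectors is PDS by construction, and the sum of two such Gram matrices again has non-negative quadratic forms.

```python
import random

def gram(feats):
    """Gram matrix K[i][j] = <feats[i], feats[j]> -- PDS by construction."""
    n = len(feats)
    return [[sum(a * b for a, b in zip(feats[i], feats[j])) for j in range(n)]
            for i in range(n)]

def quad_form(K, c):
    """Quadratic form sum_ij c_i c_j K[i][j]."""
    n = len(K)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

random.seed(0)
n, d = 6, 4
K1 = gram([[random.random() for _ in range(d)] for _ in range(n)])
K2 = gram([[random.random() for _ in range(d)] for _ in range(n)])
Ksum = [[K1[i][j] + K2[i][j] for j in range(n)] for i in range(n)]

# The sum of two PDS kernels is PDS: every quadratic form is non-negative
# (the 1e-9 tolerance only absorbs floating-point error).
for _ in range(100):
    c = [random.uniform(-1, 1) for _ in range(n)]
    assert quad_form(Ksum, c) >= -1e-9
```

The analogous check for the ⊗-product would require the concatenation-based definition of the product kernel rather than a pointwise product, which is why it is not sketched here.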

In contrast to Theorem 6, PDS rational kernels are not closed under composition. This is clear since ordinary matrix multiplication does not preserve positive definiteness in general.

The next section studies a general construction of PDS rational kernels using composition.

5.3 A General Construction of PDS Rational Kernels

In this section, we assume that ψ : (K,⊕,⊗,0,1) → (R,+,×,0,1) is a continuous semiring morphism. This limits the choice of the semiring associated with the weighted transducer defining a rational kernel, since it needs in particular to be commutative and non-idempotent.⁶

Our study of PDS rational kernels in this section is thereby limited to such semirings. This should not leave the reader with the incorrect impression that all PDS rational kernels are defined over non-idempotent semirings, though. As already mentioned, in general the function ψ can be chosen arbitrarily and need not impose any algebraic property on the semiring used.

We show that there exists a general way of constructing a PDS rational kernel from any weighted transducer T. The construction is based on composing T with its inverse T⁻¹.

Proposition 7 Let T = (Σ,∆,Q,I,F,E,λ,ρ) be a weighted finite-state transducer defined over the semiring (K,⊕,⊗,0,1). Assume that the weighted transducer T ◦ T⁻¹ is regulated. Then (ψ, T ◦ T⁻¹) defines a PDS rational kernel over Σ∗ × Σ∗.

Proof  Denote by S the composed transducer T ◦ T⁻¹, and let K be the rational kernel defined by S. By definition of composition,

K(x, y) = ψ([[S]](x, y)) = ψ( ⊕_{z∈∆∗} [[T]](x, z) ⊗ [[T]](y, z) )   (15)

for all x, y ∈ Σ∗. Since ψ is a continuous semiring morphism, for all x, y ∈ Σ∗:

K(x, y) = ψ([[S]](x, y)) = Σ_{z∈∆∗} ψ([[T]](x, z)) · ψ([[T]](y, z))   (16)

For all n ∈ N and x, y ∈ Σ∗, define Kₙ(x, y) by:

Kₙ(x, y) = Σ_{|z|≤n} ψ([[T]](x, z)) · ψ([[T]](y, z))   (17)

6. If K is idempotent, then for any x ∈ K, ψ(x) = ψ(x ⊕ x) = ψ(x) + ψ(x) = 2ψ(x), which implies that ψ(x) = 0 for all x.


Cortes, Haffner, and Mohri

where the sum runs over all strings z ∈ ∆∗ of length at most n. Clearly, Kₙ defines a symmetric kernel. For any l ≥ 1 and any x₁, ..., x_l ∈ Σ∗, define the matrix Mₙ by Mₙ = (Kₙ(xᵢ, xⱼ))_{i≤l, j≤l}. Let z₁, z₂, ..., z_m be an arbitrary ordering of the strings of length at most n, and define the matrix A by:

A = (ψ([[T]](xᵢ, zⱼ)))_{i≤l; j≤m}   (18)

By definition of Kₙ, Mₙ = AAᵗ. The eigenvalues of AAᵗ are non-negative for any rectangular matrix A, thus Kₙ is a PDS kernel. Since K is the pointwise limit of the Kₙ, K(x, y) = lim_{n→∞} Kₙ(x, y), by Theorem 5, K is a PDS kernel. This ends the proof of the proposition.
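The key step of the proof, namely that Mₙ = AAᵗ has non-negative quadratic forms, can be checked numerically. In the sketch below, the matrix A stands for the values ψ([[T]](xᵢ, zⱼ)); its entries are arbitrary illustrative numbers, not weights of an actual transducer.

```python
import random

random.seed(1)
l, m = 5, 8              # l strings x_i, m strings z_j of length <= n
# A[i][j] stands for psi([[T]](x_i, z_j)) -- arbitrary real values here.
A = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(l)]

# M_n = A A^t, i.e., M_n[i][j] = sum_z psi(T(x_i, z)) psi(T(x_j, z))  (cf. Eq. 17)
M = [[sum(A[i][k] * A[j][k] for k in range(m)) for j in range(l)] for i in range(l)]

def quad_form(K, c):
    n = len(K)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

# c^t M c = ||A^t c||^2 >= 0 for every coefficient vector c.
for _ in range(200):
    c = [random.uniform(-1, 1) for _ in range(l)]
    v = [sum(c[i] * A[i][k] for i in range(l)) for k in range(m)]  # A^t c
    assert abs(quad_form(M, c) - sum(x * x for x in v)) < 1e-9
    assert quad_form(M, c) >= -1e-9
```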

The next propositions provide results related to the converse of Proposition 7. We denote by Id_R the identity function over R.

Proposition 8 Let S = (Σ,Σ,Q,I,F,E,λ,ρ) be an acyclic weighted finite-state transducer over (K,⊕,⊗,0,1) such that (ψ, S) defines a PDS rational kernel on Σ∗ × Σ∗. Then there exists a weighted transducer T over the probability semiring such that (Id_R, T ◦ T⁻¹) defines the same rational kernel.

Proof  Let S be an acyclic weighted transducer verifying the hypotheses of the proposition, and let X ⊂ Σ∗ be the finite set of strings accepted by S. Since S is symmetric, X × X is the set of pairs of strings (x, y) defining the rational relation associated with S. Let x₁, x₂, ..., xₙ be an arbitrary numbering of the elements of X. Define the matrix M by:

M = (ψ([[S]](xᵢ, xⱼ)))_{1≤i≤n, 1≤j≤n}   (19)

Since S defines a PDS rational kernel, M is a symmetric matrix with non-negative eigenvalues, that is, M is symmetric positive semi-definite. The Cholesky decomposition extends to the case of semi-definite matrices (Dongarra et al., 1979): there exists an upper triangular matrix R = (Rᵢⱼ) with non-negative diagonal elements such that M = RRᵗ. Let Y = {y₁, ..., yₙ} be an arbitrary subset of n distinct strings of Σ∗. Define the weighted transducer T over X × Y by:

[[T]](xᵢ, yⱼ) = Rᵢⱼ   (20)

for all i, j with 1 ≤ i, j ≤ n. By definition of composition, [[T ◦ T⁻¹]](xᵢ, xⱼ) = ψ([[S]](xᵢ, xⱼ)) for all i, j with 1 ≤ i, j ≤ n. Thus, T ◦ T⁻¹ = ψ(S), which proves the claim of the proposition.

Note that when the matrix M introduced in the proof is positive definite, that is, when the eigenvalues of the matrix associated with S are all positive, the Cholesky decomposition, and thus the weights associated to the input strings of T, are unique.
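The factorization used in the proof can be illustrated with a small pure-Python Cholesky decomposition. For simplicity, this sketch uses the standard lower-triangular form M = LLᵗ and a strictly positive definite M built for the purpose; the rows of L then play the role of the transducer weights [[T]](xᵢ, yⱼ).

```python
import math
import random

def cholesky(M):
    """Return lower-triangular L with M = L L^t (M symmetric positive definite)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

random.seed(2)
n = 4
# Build a positive definite M = B B^t + I from a random matrix B (illustrative data).
B = [[random.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
M = [[sum(B[i][k] * B[j][k] for k in range(n)) + (1.0 if i == j else 0.0)
      for j in range(n)] for i in range(n)]

L = cholesky(M)
# Reconstruction check: M is recovered exactly (up to rounding) from L L^t.
for i in range(n):
    for j in range(n):
        rec = sum(L[i][k] * L[j][k] for k in range(n))
        assert abs(rec - M[i][j]) < 1e-9
```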

Assume now that the same continuous semiring morphism ψ is used in the definition of all the rational kernels considered.

Proposition 9 Let Θ be the subset of the weighted transducers over (K,⊕,⊗,0,1) such that for any S ∈ Θ, (ψ, S) defines a PDS rational kernel and there exists a weighted transducer T = (Σ,∆,Q,I,F,E,λ,ρ) over the probability semiring such that (Id_R, T ◦ T⁻¹) defines the same rational kernel as (ψ, S). Then Θ is closed under ⊕-sum, ⊗-product, and Kleene-closure.


Proof  Let S₁, S₂ ∈ Θ; we will show that S₁ ⊕ S₂ ∈ Θ, S₁ ⊗ S₂ ∈ Θ, and S₁∗ ∈ Θ. By definition, there exist T₁ = (Σ,∆₁,Q₁,I₁,F₁,E₁,λ₁,ρ₁) and T₂ = (Σ,∆₂,Q₂,I₂,F₂,E₂,λ₂,ρ₂) such that:

K₁ = T₁ ◦ T₁⁻¹ and K₂ = T₂ ◦ T₂⁻¹   (21)

where K₁ (resp. K₂) is the PDS rational kernel defined by (ψ, S₁) (resp. (ψ, S₂)). Let u be an alphabetic morphism mapping ∆₂ to a new alphabet ∆′₂ such that ∆₁ ∩ ∆′₂ = ∅. u is clearly a rational transduction (Berstel, 1979) and can be represented by a finite-state transducer U. Thus, we can define a new weighted transducer T′₂ by T′₂ = T₂ ◦ U = (Σ,∆′₂,Q₂,I₂,F₂,E′₂,λ₂,ρ₂), which only differs from T₂ by some renaming of its output labels. This does not affect the result of the composition with the inverse transducer, since U ◦ U⁻¹ is the identity mapping over ∆₂∗:

T′₂ ◦ T′₂⁻¹ = T₂ ◦ U ◦ (U⁻¹ ◦ T₂⁻¹) = T₂ ◦ T₂⁻¹ = K₂   (22)

Since T₁ and T′₂ have distinct output alphabets, their output labels cannot match, thus:

T₁ ◦ T′₂⁻¹ = ∅ and T′₂ ◦ T₁⁻¹ = ∅   (23)

Let T = T₁ + T′₂. In view of Equations 22 and 23:

T ◦ T⁻¹ = (T₁ + T′₂) ◦ (T₁ + T′₂)⁻¹ = (T₁ ◦ T₁⁻¹) + (T′₂ ◦ T′₂⁻¹) = K₁ + K₂   (24)

Since the same continuous semiring morphism ψ is used for the definition of all the rational kernels in Θ, by Theorem 6, K₁ + K₂ is a PDS rational kernel defined by (ψ, S₁ ⊕ S₂), and S₁ ⊕ S₂ is in Θ. Similarly, define T′ as T′ = T₁ · T′₂. Then:

T′ ◦ T′⁻¹ = (T₁ · T′₂) ◦ (T₁ · T′₂)⁻¹ = (T₁ ◦ T₁⁻¹) · (T′₂ ◦ T′₂⁻¹)   (25)

Thus, S₁ ⊗ S₂ is in Θ. Let x be a symbol not in ∆₁ and let ∆′₁ = ∆₁ ∪ {x}. Let V be the finite-state transducer accepting as input only ε and mapping ε to x, and define T′₁ by T′₁ = V · T₁. Since x does not match any of the output labels of T₁, T′₁ ◦ T′₁⁻¹ = T₁ ◦ T₁⁻¹ and (T′₁ ◦ T′₁⁻¹)∗ = T′₁∗ ◦ (T′₁⁻¹)∗:

(T₁ ◦ T₁⁻¹)∗ = (T′₁ ◦ T′₁⁻¹)∗ = T′₁∗ ◦ (T′₁⁻¹)∗   (26)

Thus, by Theorem 6, S₁∗ defines a PDS rational kernel that is in Θ.

Proposition 9 raises the following question: under the same assumptions, are all PDS rational kernels defined by a pair of the form (ψ, T ◦ T⁻¹)? A natural conjecture is that this is the case and that this property provides a characterization of the weighted transducers defining PDS rational kernels. Propositions 8 and 9 both favor that conjecture: Proposition 8 shows that it holds in the acyclic case, and Proposition 9 may be useful for extending it to the general case.

In the case of PDS rational kernels defined by (Id_R, S), with S a weighted transducer over the probability semiring, the conjecture can be reformulated as: is S of the form S = T ◦ T⁻¹? If true, this could be viewed as a generalization of Cholesky's decomposition theorem to the case of infinite matrices given by weighted transducers over the probability semiring.

This ends our discussion of PDS rational kernels. In the next section, we examine negative definite kernels and their relationship with PDS rational kernels.


5.4 Negative Definite Kernels

As mentioned before, given a set X and a distance or dissimilarity measure d : X × X → R₊, a common method for defining a kernel K is the following. For all x, y ∈ X,

K(x, y) = exp(−t d²(x, y))   (27)

where t > 0 is a constant typically used for normalization. Gaussian kernels are defined in this way. However, such kernels K are not necessarily positive definite, e.g., for X = R, d(x, y) = |x − y|^p, p > 1 and t = 1, K is not positive definite. The positive definiteness of K depends on t and on the properties of the function d. The classical results presented in this section address exactly such questions (Berg et al., 1984). They include a characterization of PDS kernels based on negative definite kernels, which may be viewed as distances with some specific properties.⁷

The results we present are general, but we are particularly interested in the case where d can be represented by a rational kernel. We will use these results later when dealing with the case of the edit-distance.

Definition 10 Let X be a non-empty set. A function K : X × X → R is said to be a negative definite symmetric kernel (NDS kernel) if it is symmetric (K(x, y) = K(y, x) for all x, y ∈ X) and

Σ_{i,j=1}^{n} cᵢ cⱼ K(xᵢ, xⱼ) ≤ 0   (28)

for all n ≥ 1, {x₁, ..., xₙ} ⊆ X and {c₁, ..., cₙ} ⊆ R with Σ_{i=1}^{n} cᵢ = 0.

Clearly, if K is a PDS kernel, then −K is an NDS kernel; however, the converse does not hold in general. Negative definite kernels often correspond to distances, e.g., K(x, y) = |x − y|^α with x, y ∈ R and 0 < α ≤ 2 is a negative definite kernel.
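This family of examples can be spot-checked numerically for α = 1 and α = 2, sampling random points and random coefficients forced to sum to zero (a sketch, not a proof; the 1e-9 tolerance only absorbs floating-point error):

```python
import random

def nds_form(K, xs, cs):
    """sum_ij c_i c_j K(x_i, x_j), for coefficients summing to zero."""
    return sum(ci * cj * K(xi, xj)
               for ci, xi in zip(cs, xs) for cj, xj in zip(cs, xs))

random.seed(3)
for alpha in (1.0, 2.0):
    K = lambda x, y: abs(x - y) ** alpha
    for _ in range(200):
        xs = [random.uniform(-5, 5) for _ in range(6)]
        cs = [random.uniform(-1, 1) for _ in range(6)]
        cs[-1] -= sum(cs)          # enforce sum(c) = 0
        assert nds_form(K, xs, cs) <= 1e-9
```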

The next theorem summarizes general closure properties of NDS kernels (Berg et al., 1984).

Theorem 11 Let X be a non-empty set.

1. Closure under sum: Let K₁, K₂ : X × X → R be NDS kernels. Then K₁ + K₂ : X × X → R is an NDS kernel.

2. Closure under log and exponentiation: Let K : X × X → R be an NDS kernel with K ≥ 0, and let α be a real number with 0 < α < 1. Then log(1 + K) and K^α : X × X → R are NDS kernels.

3. Closure under pointwise limit: Let Kₙ : X × X → R be an NDS kernel for all n ∈ N. Then K defined by K(x, y) = lim_{n→∞} Kₙ(x, y) is an NDS kernel.

7. Many of the results given by Berg et al. (1984) are presented in (Schölkopf, 2001) with the terminology of conditionally positive definite instead of negative definite kernels. We adopt the original terminology used by Berg et al. (1984).


The following theorem clarifies the relation between NDS and PDS kernels and provides, in particular, a way of constructing PDS kernels from NDS ones (Berg et al., 1984).

Theorem 12 Let X be a non-empty set, x₀ ∈ X, and let K : X × X → R be a symmetric kernel.

1. K is negative definite iff exp(−tK) is positive definite for all t > 0.

2. Let K′ be the function defined by:

K′(x, y) = K(x, x₀) + K(y, x₀) − K(x, y) − K(x₀, x₀)   (29)

Then K is negative definite iff K′ is positive definite.

The theorem gives two ways of constructing a positive definite kernel from a negative definite kernel. The first construction is similar to the way Gaussian kernels are defined. The second construction was put forward by Schölkopf (2001).
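Part 1 of the theorem can be spot-checked with the NDS kernel K(x, y) = (x − y)², for which exp(−tK) is the Gaussian kernel; all of its quadratic forms are non-negative for every t > 0 (a sampling sketch, not a proof):

```python
import math
import random

random.seed(4)
K = lambda x, y: (x - y) ** 2          # an NDS kernel on R

for t in (0.1, 1.0, 10.0):
    G = lambda x, y: math.exp(-t * K(x, y))   # Gaussian kernel exp(-t K)
    for _ in range(100):
        xs = [random.uniform(-3, 3) for _ in range(5)]
        cs = [random.uniform(-1, 1) for _ in range(5)]  # no zero-sum constraint here
        q = sum(ci * cj * G(xi, xj)
                for ci, xi in zip(cs, xs) for cj, xj in zip(cs, xs))
        assert q >= -1e-9
```

Note the contrast with negative definiteness: for positive definiteness the coefficients cᵢ are arbitrary, with no zero-sum constraint.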

6. Relationship with some commonly used kernels or similarity measures

This section studies the relationships between several families of kernels or similarity measures and rational kernels.

6.1 Edit-Distance

A common similarity measure in many applications is the edit-distance, that is, the minimal cost of a series of edit operations (symbol insertions, deletions, or substitutions) transforming one string into another (Levenshtein, 1966). We denote by dₑ(x, y) the edit-distance between two strings x and y over the alphabet Σ, with cost 1 assigned to all edit operations.

Proposition 13 Let Σ be a non-empty finite alphabet and let dₑ be the edit-distance over Σ. Then dₑ is a symmetric rational kernel. Furthermore, (1) dₑ is not a PDS kernel, and (2) dₑ is an NDS kernel iff |Σ| = 1.

Proof  The edit-distance between two strings, or weighted automata, can be represented by a simple weighted transducer over the tropical semiring (Mohri, 2003). Since the edit-distance is symmetric, dₑ is a symmetric rational kernel. Figure 3(a) shows the corresponding transducer when the alphabet is Σ = {a, b}. The cost of the alignment between two sequences can also be computed by a weighted transducer over the probability semiring (Mohri, 2003); see Figure 3(b).

Let a ∈ Σ. Then the matrix (dₑ(xᵢ, xⱼ))_{1≤i,j≤2} with x₁ = ε and x₂ = a has a negative eigenvalue (−1), thus dₑ is not a PDS kernel.
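This 2×2 counter-example can be made fully explicit: the matrix in question is [[0, 1], [1, 0]], and a single coefficient vector already gives a negative quadratic form.

```python
# Edit-distance matrix over x1 = "" (epsilon) and x2 = "a":
#   d_e(e, e) = 0, d_e(e, a) = 1, d_e(a, e) = 1, d_e(a, a) = 0
D = [[0, 1], [1, 0]]

# For c = (1, -1), the quadratic form sum_ij c_i c_j D[i][j] is negative,
# so D is not positive semi-definite (its eigenvalues are 1 and -1).
c = [1, -1]
q = sum(c[i] * c[j] * D[i][j] for i in range(2) for j in range(2))
assert q == -2
```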

When |Σ| = 1, the edit-distance simply measures the absolute value of the difference of length between two strings. A string x ∈ Σ∗ can then be viewed as a vector of the Hilbert space R^∞. Denote by ‖·‖ the corresponding norm. For all x, y ∈ Σ∗:

dₑ(x, y) = ‖x − y‖


[Figure 3 omitted: transition diagrams of the two transducers.]

Figure 3: (a) Weighted transducer over the tropical semiring representing the edit-distance over the alphabet Σ = {a, b}. (b) Weighted transducer over the probability semiring computing the cost of alignments over the alphabet Σ = {a, b}.

The squared distance ‖·‖² is negative definite, thus by Theorem 11, dₑ = (‖·‖²)^{1/2} is also negative definite.

Assume now that |Σ| > 1. We show that exp(−dₑ) is not PDS. By Theorem 12, this implies that dₑ is not negative definite. Let x₁, ..., x_{2ⁿ} be any ordering of the strings of length n over the alphabet {a, b}. Define the matrix Mₙ by:

Mₙ = (exp(−dₑ(xᵢ, xⱼ)))_{1≤i,j≤2ⁿ}   (30)

Figure 4(a) shows the smallest eigenvalue αₙ of Mₙ as a function of n. Clearly, there are values of n for which αₙ < 0, thus the edit-distance is not negative definite. Table 4(b) provides a simple example with five strings of length 3 over the alphabet Σ = {a, b, c, d} showing directly that the edit-distance is not negative definite. Indeed, it is easy to verify that:

Σ_{i=1}^{5} Σ_{j=1}^{5} cᵢ cⱼ dₑ(xᵢ, xⱼ) = 2/3 > 0.

To our knowledge, this is the first statement and proof of the fact that dₑ is not NDS for |Σ| > 1. This result has a direct consequence for the design of kernels in computational biology, where the edit-distance and other related similarity measures are often used. When |Σ| > 1, Proposition 13 shows that dₑ is not NDS. Thus, there exists t > 0 for which exp(−t dₑ) is not PDS. Similarly, dₑ² is not NDS, since otherwise, by Theorem 11, dₑ = (dₑ²)^{1/2} would be NDS.


[Figure 4(a) omitted: plot of the smallest eigenvalue of Mₙ against n, showing negative values for some n.]

i    1    2    3    4    5
xᵢ   abc  bad  dab  adc  bcd
cᵢ   1    1    −2/3 −2/3 −2/3

Figure 4: (a) Smallest eigenvalue of the matrix Mₙ = (exp(−dₑ(xᵢ, xⱼ)))_{1≤i,j≤2ⁿ} as a function of n. (b) Example demonstrating that the edit-distance is not negative definite.

6.2 Haussler’s Convolution Kernels for Strings

D. Haussler describes a class of kernels for strings built by iteratively applying convolution kernels (Haussler, 1999). We show that these convolution kernels for strings are specific instances of rational kernels. Haussler (1999) defines the convolution of two string kernels K₁ and K₂ over the alphabet Σ as the kernel denoted by K₁ ⋆ K₂ and defined for all x, y ∈ Σ∗ by:

K₁ ⋆ K₂(x, y) = Σ_{x₁x₂=x, y₁y₂=y} K₁(x₁, y₁) · K₂(x₂, y₂)   (31)

Clearly, when K₁ and K₂ are given by weighted transducers over the probability semiring, this definition coincides with that of the product (or concatenation) of transducers (Equation 3). Haussler (1999) also introduces, for 0 ≤ γ < 1, the γ-infinite iteration of a mapping H : Σ∗ × Σ∗ → R by:

H∗_γ = (1 − γ) Σ_{n=1}^{∞} γ^{n−1} H^{(n)}   (32)

where H^{(n)} = H ⋆ H^{(n−1)} is the result of the convolution of H with itself n − 1 times. Note that H∗_γ = 0 for γ = 0.
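Equation 31 can be implemented literally for kernels given as Python functions, summing over all ways of splitting x and y into two parts. The two toy kernels below are hypothetical, chosen only to make the sum easy to check by hand.

```python
def convolve(K1, K2):
    """Convolution (Eq. 31): sum over all splits x = x1 x2, y = y1 y2."""
    def K(x, y):
        return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
                   for i in range(len(x) + 1) for j in range(len(y) + 1))
    return K

# Toy kernels (illustrative only): K1 gives weight 1 to any pair of single
# symbols, K2 gives weight 1 to every pair of strings.
K1 = lambda x, y: 1.0 if (len(x) == 1 and len(y) == 1) else 0.0
K2 = lambda x, y: 1.0

K = convolve(K1, K2)
# For x = "ab", y = "cd", only the split with both first parts of length 1
# contributes, so the convolution evaluates to 1.
assert K("ab", "cd") == 1.0
```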

Lemma 14 For 0 < γ < 1, the γ-infinite iteration of a rational transduction H : Σ∗ × Σ∗ → R can be defined in the following way with respect to the Kleene †-operator:

H∗_γ = ((1 − γ)/γ) (γH)†   (33)

Proof Haussler’s convolution simply corresponds to the product (or concatenation) in thecase of rational transductions. Thus, for 0 < γ < 1, by definition of the †-operator:

(γH)† =

∞∑

n=1

(γH)n =

∞∑

n=1

γnHn =γ

1− γ

∞∑

n=1

(1− γ)γn−1Hn =γ

1− γH∗

γ
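For a scalar h with |γh| < 1, both sides of Equation 33 reduce to ordinary geometric series, which makes the identity easy to check numerically (a scalar stand-in for the transduction H, truncating both series at the same order):

```python
# Scalar sanity check of Lemma 14 (Eq. 33): H replaced by a number h, and the
# dagger-operator replaced by the geometric series sum_{n>=1} (gamma*h)^n.
gamma, h = 0.3, 0.8
N = 200  # truncation order; both series converge since |gamma * h| < 1

h_star = (1 - gamma) * sum(gamma ** (n - 1) * h ** n for n in range(1, N))
dagger = sum((gamma * h) ** n for n in range(1, N))

# H*_gamma = ((1 - gamma) / gamma) (gamma H)^dagger, term by term.
assert abs(h_star - (1 - gamma) / gamma * dagger) < 1e-12
```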


[Figure 5 omitted: two-state transducer diagram.]

Figure 5: Haussler's convolution kernels K_H for strings: specific instances of rational kernels. K₁ (resp. K₂) corresponds to a specific weighted transducer over the probability semiring modeling substitutions (resp. insertions).

Given a probability distribution p over all symbols of Σ, Haussler's convolution kernels for strings are defined by:

K_H(x, y) = γ K₂ ⋆ (K₁ ⋆ K₂)∗_γ + (1 − γ) K₂

where K₁ is the specific polynomial PDS rational transduction over the probability semiring defined by K₁(x, y) = Σ_{a∈Σ} p(x|a) p(y|a) p(a), which models substitutions, and K₂ is another specific PDS rational transduction over the probability semiring modeling insertions.

Proposition 15 For any 0 ≤ γ < 1, Haussler's convolution kernels K_H coincide with the following special cases of rational kernels:

K_H = (1 − γ)[K₂ (γ K₁ K₂)∗]   (34)

Proof  As mentioned above, Haussler's convolution simply corresponds to concatenation in this context. When γ = 0, K_H reduces by definition to K₂, which is a rational transduction, and the proposition's formula is satisfied. Assume now that γ ≠ 0. By Lemma 14, K_H can be rewritten as:

K_H = γ K₂ (K₁ K₂)∗_γ + (1 − γ) K₂ = γ K₂ ((1 − γ)/γ) (γ K₁ K₂)† + (1 − γ) K₂   (35)
    = (1 − γ)[K₂ (γ K₁ K₂)† + K₂] = (1 − γ)[K₂ (γ K₁ K₂)∗]

Since rational transductions are closed under rational operations, K_H also defines a rational transduction. Since K₁ and K₂ are PDS kernels, by Theorem 6, K_H defines a PDS kernel.

The transducer of Figure 5 illustrates the convolution kernels for strings proposed by Haussler. They correspond to special cases of rational kernels whose mechanism is clarified by the figure: the kernel corresponds to an insertion with weight (1 − γ), modeled by K₂, followed by any number of sequences of substitutions, modeled by K₁, and insertions, modeled by K₂, with weight γ. Clearly, there are many other ways of defining kernels based on weighted transducers, with more complex and perhaps more data-driven definitions.


[Figure 6 omitted: state diagram of the transducer T_{3,2}.]

Figure 6: Mismatch kernel K_{(k,m)} = T_{k,m} ◦ T_{k,m}⁻¹ (Leslie et al., 2003) with k = 3 and m = 2 and with Σ = {a, b}. The transducer T_{3,2} defined over the probability semiring is shown. All transition weights and final weights are equal to one. Note that states 3, 6, and 8 of the transducer are equivalent and thus can be merged, and similarly that states 2 and 5 can then be merged as well.

6.3 Other Kernels Used in Computational Biology

In this section we show the relationship between rational kernels and another class of kernelsused in computational biology.

A family of kernels, mismatch string kernels, was introduced by Leslie et al. (2003) for protein classification using SVMs. Let Σ be a finite alphabet, typically that of amino acids for protein sequences. For any two sequences z₁, z₂ ∈ Σ∗ of the same length (|z₁| = |z₂|), we denote by d(z₁, z₂) the total number of mismatching symbols between these sequences. For all m ∈ N, we define the bounded distance dₘ between two sequences of the same length by:

dₘ(z₁, z₂) = 1 if d(z₁, z₂) ≤ m, 0 otherwise   (36)

and for all k ∈ N, we denote by Fₖ(x) the set of all factors of x of length k:

Fₖ(x) = {z : x ∈ Σ∗zΣ∗, |z| = k}

For any k, m ∈ N with m ≤ k, a (k, m)-mismatch kernel K_{(k,m)} : Σ∗ × Σ∗ → R is the kernel defined over protein sequences x, y ∈ Σ∗ by:

K_{(k,m)}(x, y) = Σ_{z₁∈Fₖ(x), z₂∈Fₖ(y), z∈Σᵏ} dₘ(z₁, z) dₘ(z, z₂)   (37)

Proposition 16 For any k, m ∈ N with m ≤ k, the (k, m)-mismatch kernel K_{(k,m)} : Σ∗ × Σ∗ → R is a PDS rational kernel.

Proof  Let M, S, and D be the weighted transducers over the probability semiring defined by:

M = Σ_{a∈Σ} (a, a)   S = Σ_{a≠b} (a, b)   D = Σ_{a∈Σ} (a, ε)   (38)


M associates weight 1 to each pair of identical symbols of the alphabet Σ, S associates weight 1 to each pair of distinct or mismatching symbols, and D associates weight 1 to all pairs with second element ε.

For i, k ∈ N with 0 ≤ i ≤ k, define the shuffle of Sⁱ and M^{k−i}, denoted by Sⁱ ⧢ M^{k−i}, as the sum over all products made of factors S and M with exactly i factors S and k − i factors M. As a finite sum of products of S and M, Sⁱ ⧢ M^{k−i} is rational. Since weighted transducers are closed under rational operations, the following defines a weighted transducer over the probability semiring for any k, m ∈ N with m ≤ k: T_{k,m} = D∗ R D∗ with R = Σ_{i=0}^{m} Sⁱ ⧢ M^{k−i}. Consider two sequences z₁, z₂ such that |z₁| = |z₂| = k. By definition of M and S and the shuffle product, for any i with 0 ≤ i ≤ m:

[[Sⁱ ⧢ M^{k−i}]](z₁, z₂) = 1 if d(z₁, z₂) = i, 0 otherwise   (39)

Thus,

[[R]](z₁, z₂) = Σ_{i=0}^{m} [[Sⁱ ⧢ M^{k−i}]](z₁, z₂) = {1 if d(z₁, z₂) ≤ m, 0 otherwise} = dₘ(z₁, z₂)

By definition of the product of weighted transducers, for any x ∈ Σ∗ and z ∈ Σᵏ:

[[T_{k,m}]](x, z) = Σ_{x=uvw, z=u′v′w′} [[D∗]](u, u′) [[R]](v, v′) [[D∗]](w, w′)   (40)
                 = Σ_{v∈Fₖ(x), z=v′} [[R]](v, v′) = Σ_{v∈Fₖ(x)} dₘ(v, z)

It is clear from the definition of T_{k,m} that [[T_{k,m}]](x, z) = 0 for all x, z ∈ Σ∗ with |z| ≠ k. Thus, by definition of the composition of weighted transducers, for all x, y ∈ Σ∗:

[[T_{k,m} ◦ T_{k,m}⁻¹]](x, y) = Σ_{z₁∈Fₖ(x), z₂∈Fₖ(y), z∈Σ∗} dₘ(z₁, z) dₘ(z, z₂)   (41)
                             = Σ_{z₁∈Fₖ(x), z₂∈Fₖ(y), z∈Σᵏ} dₘ(z₁, z) dₘ(z, z₂) = K_{(k,m)}(x, y)

By Proposition 7, this proves that K_{(k,m)} is a PDS rational kernel.
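For small k and small alphabets, Equation 37 can also be evaluated by brute force, enumerating all of Σᵏ directly; the inputs below are illustrative.

```python
from itertools import product

def mismatches(z1, z2):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(z1, z2))

def factors(x, k):
    """F_k(x): the set of factors (substrings) of x of length k."""
    return {x[i:i + k] for i in range(len(x) - k + 1)}

def mismatch_kernel(x, y, k, m, alphabet):
    """K_(k,m)(x, y) computed directly from Eq. 37 (brute force over Sigma^k)."""
    dm = lambda z1, z2: 1 if mismatches(z1, z2) <= m else 0
    return sum(dm(z1, "".join(z)) * dm("".join(z), z2)
               for z in product(alphabet, repeat=k)
               for z1 in factors(x, k) for z2 in factors(y, k))

# With k = 2, m = 1 and Sigma = {a, b}: the factor "ab" is within one
# mismatch of "ab", "aa", and "bb", so K("ab", "ab") = 3.
assert mismatch_kernel("ab", "ab", 2, 1, "ab") == 3
# The kernel is symmetric:
assert (mismatch_kernel("abb", "bab", 2, 1, "ab")
        == mismatch_kernel("bab", "abb", 2, 1, "ab"))
```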

Figure 6 shows T_{3,2}, a simple weighted transducer over the probability semiring that can be used to compute the mismatch kernel K_{(3,2)} = T_{3,2} ◦ T_{3,2}⁻¹. Such transducers provide a compact representation of the kernel and are very efficient to use with the composition algorithm already described in (Cortes et al., 2003c). The transitions of these transducers can be defined implicitly and expanded on demand, as needed for the particular input strings or weighted automata. This substantially reduces the space needed for their representation: e.g., a single transition with labels x : y, x ≠ y, can be used to represent all transitions with similar labels (a : b, a, b ∈ Σ, with a ≠ b). Similarly, composition can also be performed on-the-fly. Furthermore, the transducer of Figure 6 can be made more compact, since it admits several states that are equivalent.


7. Applications and Experiments

Rational kernels can be used in a variety of applications ranging from computational biology to optical character recognition. We have applied them successfully to a number of speech processing tasks, including the identification from speech of traits, or voice signatures, such as emotion (Shafran et al., 2003). This section describes some of our most recent applications to spoken-dialog classification.

We first introduce a general family of PDS rational kernels relevant to spoken-dialog classification tasks, which we used in our experiments; we then discuss the spoken-dialog classification problem and report our experimental results.

7.1 A General Family of PDS Kernels: n-gram Kernels

A rational kernel can be viewed as a similarity measure between two sequences or weighted automata. One may, for example, consider two utterances to be similar when they share many common n-gram subsequences. The exact transcriptions of the utterances are not available, but we can use the word lattices output by the recognizer instead.

A word lattice is a weighted automaton over the log semiring that compactly represents the most likely transcriptions of a speech utterance. Each path of the automaton is labeled with a sequence of words, whose weight is obtained by adding the weights of the constituent transitions. The weight assigned by the lattice to a sequence of words can often be interpreted as the log-likelihood of that transcription based on the models used by the recognizer. More generally, the weights are used to rank possible transcriptions, the sequence with the lowest weight being the most favored transcription.

A word lattice A can be viewed as a probability distribution P_A over all strings s ∈ Σ∗. Modulo a normalization constant, the weight assigned by A to a string x is [[A]](x) = −log P_A(x). Denote by |s|ₓ the number of occurrences of a sequence x in the string s. The expected count or number of occurrences of an n-gram sequence x in s for the probability distribution P_A is:

c(A, x) = Σ_s P_A(s) |s|ₓ

Two lattices output by a speech recognizer can be viewed as similar when the sum of the products of the expected counts they assign to their common n-gram sequences is sufficiently high. Thus, we define an n-gram kernel kₙ for two lattices A₁ and A₂ by:

kₙ(A₁, A₂) = Σ_{|x|=n} c(A₁, x) c(A₂, x)   (42)

The kernel kₙ is a PDS rational kernel of type T ◦ T⁻¹, and it can be computed efficiently. Indeed, there exists a simple weighted transducer T that can be used to compute c(A₁, x) for all n-gram sequences x ∈ Σ∗. Figure 7 shows that transducer in the case of bigram sequences (n = 2) and for the alphabet Σ = {a, b}. The general definition of T is:

T = (Σ × {ε})∗ (Σ_{x∈Σ} {x} × {x})ⁿ (Σ × {ε})∗   (43)


[Figure 7 omitted: three-state transducer diagram.]

Figure 7: Weighted transducer T computing the expected counts of the bigram sequences of a word lattice, with Σ = {a, b}.

kₙ can be written in terms of the weighted transducer T as:

kₙ(A₁, A₂) = w[(A₁ ◦ T) ◦ (T⁻¹ ◦ A₂)]   (44)
           = w[A₁ ◦ (T ◦ T⁻¹) ◦ A₂]   (45)

which shows that it is a rational kernel whose associated weighted transducer is T ◦ T⁻¹. In view of Proposition 7, kₙ is a PDS rational kernel. Furthermore, the general composition algorithm and shortest-distance algorithm described in Section 4 can be used to compute kₙ efficiently. The size of the transducer T is O(n|Σ|), but in practice a lazy implementation can be used to simulate the presence of the transitions of T labeled with all the elements of Σ. This reduces the size of the machine used to O(n). Thus, since the complexity of composition is quadratic (Mohri et al., 1996, Pereira and Riley, 1997) and since the general shortest-distance algorithm just mentioned is linear for acyclic graphs such as the lattices output by speech recognizers (Mohri, 2002), the worst-case complexity of the algorithm is O(n² |A₁| |A₂|).

By Theorem 6, the sum of two kernels kₙ and kₘ is also a PDS rational kernel. We define an n-gram rational kernel Kₙ as the PDS rational kernel obtained by taking the sum of all kₘ with 1 ≤ m ≤ n:

Kₙ = Σ_{m=1}^{n} kₘ

Thus, the feature space associated with Kₙ is the set of all m-gram sequences with m ≤ n. n-gram kernels are used in our experiments in spoken-dialog classification.

7.2 Spoken-Dialog Classification

7.2.1 Definition

One of the key tasks of spoken-dialog systems is classification. It consists of assigning, out of a finite set, a specific category to each speech utterance based on the transcription of that utterance by a speech recognizer. The choice of possible categories depends on the dialog context considered. A category may correspond to the type of billing problem in the context of a dialog related to billing, or to the type of problem raised by the speaker in the context of a hot-line service. Categories are used to direct the dialog manager in formulating


Dataset      Number of classes   Training size   Testing size   Number of n-grams   ASR word accuracy
HMIHY 0300   64                  35551           5000           24177               72.5%
VoiceTone1   97                  29561           5537           22007               70.5%
VoiceTone2   82                  9093            5172           8689                68.8%

Table 2: Key characteristics of the three datasets used in the experiments. The fifth column displays the total number of unigrams, bigrams, and trigrams found in the one-best output of the ASR for the utterances of the training set, that is, the number of features used by BoosTexter or by the SVMs when applied to the one-best outputs.

a response to the speaker's utterance. Classification is typically based on features such as relevant key words or key sequences used by a machine learning algorithm.

The word error rate of conversational speech recognition systems is still too high in many tasks to rely only on the one-best output of the recognizer (the word accuracy in the deployed services we have experimented with is only about 70%, as we will see later). However, the word lattices output by speech recognition systems may contain the correct transcription in most cases. Thus, it is preferable to use the full word lattices for classification instead.

The application of classification algorithms to word lattices raises several issues. Even small word lattices may contain billions of paths, thus the algorithms cannot be generalized by simply applying them to each path of the lattice. Additionally, the paths are weighted, and these weights must be used to guide the classification task appropriately. The use of rational kernels solves both of these problems, since they define kernels between weighted automata and since they can be computed efficiently (Section 4).

7.2.2 Description of Tasks and Datasets

We carried out a series of experiments in several large-vocabulary spoken-dialog tasks using rational kernels, with a twofold objective: to improve classification accuracy in those tasks, and to evaluate the impact on classification accuracy of using a word lattice rather than the one-best output of the automatic speech recognition (ASR) system.

The first task we considered is that of a deployed customer-care application (HMIHY 0300). In this task, users interact with a spoken-dialog system via the telephone, speaking naturally, to ask about their bills, their calling plans, or other similar topics. Their responses to the open-ended prompts of the system are not constrained by the system; they may be any natural language sequence. The objective of spoken-dialog classification is to assign one or several categories or call-types, e.g., Billing Credit or Calling Plans, to the users’ speech utterances. The set of categories is finite and is limited to 64 classes. The calls are classified based on the user’s response to the first greeting prompt: “Hello, this is AT&T. How may I help you?”.

Table 2 indicates the size of the HMIHY 0300 datasets we used for training and testing. The training set is relatively large, with more than 35,000 utterances; it is an extension of the one we used in our previous classification experiments with HMIHY 0300 (Cortes et al., 2003c). In our experiments, we used the n-gram rational kernels described


Cortes, Haffner, and Mohri

[Plot: classification error rate (y-axis, 0–20%) versus rejection rate (x-axis, 0–40%) for BoosTexter/1-best, SVM/1-best, and SVM/Lattice.]

Figure 8: Classification error rate as a function of rejection rate in HMIHY 0300.

in the previous section with n = 3. Thus, the feature set we used was that of all n-grams with n ≤ 3. Table 2 indicates the total number of distinct features of this type found in the datasets. The word accuracy of the system based on the best hypothesis of the speech recognizer is 72.5%. This motivated our use of the word lattices, which may contain the correct transcription in most cases. The average number of transitions of a word lattice in this task was about 260.
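Concretely, the feature set of all n-grams with n ≤ 3 can be enumerated directly from the one-best training transcripts; a minimal sketch, with an illustrative toy corpus:

```python
def distinct_ngrams(corpus, n=3):
    """Collect all distinct unigrams, bigrams, ..., n-grams occurring
    in a corpus of tokenized transcriptions."""
    feats = set()
    for words in corpus:
        for k in range(1, n + 1):
            for i in range(len(words) - k + 1):
                feats.add(tuple(words[i:i + k]))
    return feats

# Toy one-best transcripts; the real corpora contain thousands.
corpus = [["pay", "my", "bill"], ["check", "my", "bill"]]
features = distinct_ngrams(corpus, n=3)
```

The figures in the fifth column of Table 2 are the sizes of such feature sets for each dataset’s training transcripts.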

Table 2 reports similar information for the two other datasets, VoiceTone1 and VoiceTone2. These are more recently deployed spoken-dialog systems in different areas; e.g., VoiceTone1 is a task where users interact with a system related to health-care, with a larger set of categories (97). The size of the VoiceTone1 datasets we used and the word accuracy of the recognizer (70.5%) make this task otherwise similar to HMIHY 0300. The datasets provided for VoiceTone2 are significantly smaller, with a higher word error rate. The word error rate is indicative of the difficulty of the classification task, since a higher error rate implies a noisier input. The average number of transitions of a word lattice was about 210 in VoiceTone1 and about 360 in VoiceTone2.

Each utterance of the dataset may be labeled with several classes. The evaluation is based on the following criterion: it is considered an error if the highest scoring class given by the classifier is none of these labels.
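This criterion can be stated precisely in a few lines; function names and the toy call-types below are illustrative, not from the original experiments:

```python
def top1_error_rate(predictions, label_sets):
    """predictions[i] is a list of (class, score) pairs for utterance i;
    label_sets[i] is the set of reference call-types for that utterance.
    An utterance counts as an error when the single highest-scoring
    class is not among its reference labels."""
    errors = sum(
        1
        for scores, labels in zip(predictions, label_sets)
        if max(scores, key=lambda cs: cs[1])[0] not in labels
    )
    return errors / len(predictions)

# Example: the second utterance is an error (top class not in its labels).
predictions = [[("Billing Credit", 0.9), ("Calling Plans", 0.1)],
               [("Calling Plans", 0.6), ("Billing Credit", 0.4)]]
label_sets = [{"Billing Credit"}, {"Billing Credit", "Other"}]
```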

7.2.3 Implementation and Results

We used the AT&T FSM Library (Mohri et al., 2000) and the GRM Library (Allauzen et al., 2004) for the implementation of the n-gram rational kernels Kn used. We used these kernels with SVMs, using a general learning library for large-margin classification (LLAMA), which offers an optimized multi-class recombination of binary SVMs (Haffner et al., 2003). Training took a few hours on a single processor of a 2.4GHz Intel Pentium Linux cluster with 2GB of memory and 512KB cache.

In our experiments, we used the trigram kernel K3 with a second-degree polynomial. Preliminary experiments showed that the top performance was reached for trigram kernels and that 4-gram kernels, K4, did not significantly improve the performance. We also found


Rational Kernels: Theory and Algorithms

[Two plots, (a) and (b): classification error rate versus rejection rate (0–40%) for BoosTexter/1-best, SVM/1-best, and SVM/Lattice.]

Figure 9: Classification error rate as a function of rejection rate in (a) VoiceTone1 and (b) VoiceTone2.

that the combination of a second-degree polynomial kernel with the trigram kernel significantly improves performance over a linear classifier, but that no further improvement could be obtained with a third-degree polynomial.
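The second-degree polynomial is applied on top of the trigram kernel. Assuming the standard form (K(x, y) + c)^d with d = 2 (the offset c = 1 is our assumption; the text does not specify it), the construction is a one-liner, shown here with a hypothetical stand-in base kernel:

```python
def polynomial(base_kernel, degree=2, c=1.0):
    """Polynomial kernel built on a base kernel:
    K(x, y) = (base(x, y) + c) ** degree.  If the base kernel is PDS
    and c >= 0, the result is PDS as well (closure properties)."""
    return lambda x, y: (base_kernel(x, y) + c) ** degree

# Stand-in base kernel: dot product of sparse count vectors (dicts).
def dot(u, v):
    return sum(w * v.get(g, 0.0) for g, w in u.items())

K = polynomial(dot, degree=2, c=1.0)
```

In the experiments the base kernel is the trigram kernel K3 rather than this plain dot product.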

We used the same kernels in the three datasets previously described and applied them both to the speech recognizer’s single best hypothesis (one-best results) and to the full word lattices output by the speech recognizer. For the sake of comparison, we also ran the BoosTexter algorithm (Schapire and Singer, 2000) on the same datasets by applying it to the one-best hypothesis. This served as a baseline for our experiments.

Figure 8 shows the result of our experiments in the HMIHY 0300 task. It gives the classification error rate as a function of the rejection rate (utterances for which the top score is lower than a given threshold are rejected) in HMIHY 0300 for: BoosTexter, SVM combined with our kernels applied to the one-best hypothesis, and SVM combined with kernels applied to the full lattices.
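Error/rejection curves of this kind can be computed from per-utterance results with a short routine; the scores and correctness flags below are illustrative toy data, not the experimental results:

```python
def error_vs_rejection(results, thresholds):
    """results: list of (top_score, is_correct) pairs, one per utterance.
    Utterances whose top score falls below a threshold are rejected;
    the error rate is computed over the accepted utterances only."""
    curve = []
    for t in thresholds:
        accepted = [ok for score, ok in results if score >= t]
        rejection = 1.0 - len(accepted) / len(results)
        error = (sum(1 for ok in accepted if not ok) / len(accepted)
                 if accepted else 0.0)
        curve.append((rejection, error))
    return curve

# Toy per-utterance results: (classifier's top score, correct or not).
results = [(0.9, True), (0.8, False), (0.4, True), (0.2, False)]
curve = error_vs_rejection(results, [0.0, 0.5])
```

Sweeping the threshold over the observed score range traces out the full curve.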

SVM with trigram kernels applied to the one-best hypothesis leads to better classification than BoosTexter everywhere in the range of 0–40% rejection rate. The accuracy is about 2–3% (absolute) better than that of BoosTexter in the range of interest for this task, which is roughly between 20% and 40% rejection rate. The results also show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of SVMs applied to the one-best alone, by about 1% (absolute).

Figure 9 shows the results of our experiments in the VoiceTone1 and VoiceTone2 tasks using the same techniques and comparisons. As observed previously, in many regards VoiceTone1 is similar to the HMIHY 0300 task, and our results for VoiceTone1 are comparable to those for HMIHY 0300. The results show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of BoosTexter, by more than 4% (absolute) at about 20% rejection rate. They also demonstrate more clearly the benefits of the use of word lattices for classification


in this task. This advantage is even more manifest for the VoiceTone2 task, for which the speech recognition accuracy is lower. VoiceTone2 is also a harder classification task, as can be seen by comparing the plots of Figure 9. The classification accuracy of SVMs with kernels applied to lattices is more than 6% (absolute) better than that of BoosTexter near 40% rejection rate, and about 3% better than SVMs applied to the one-best hypothesis.

Thus, our experiments in spoken-dialog classification in three distinct large-vocabulary tasks demonstrate that using rational kernels with SVMs consistently leads to very competitive classifiers. They also show that their application to the full word lattices instead of the single best hypothesis output by the recognizer systematically improves classification accuracy.

8. Conclusion

We presented rational kernels, a general framework based on weighted transducers, to extend kernel methods to the analysis of variable-length sequences or, more generally, weighted automata. The transducer representation provides a very compact representation benefiting from existing and well-studied optimizations. It further avoids the design of special-purpose algorithms for the computation of the kernels covered by the framework of rational kernels. A single general and efficient algorithm was presented to compute all rational kernels effectively. Thus, it is sufficient to implement that algorithm and let different instances of rational kernels be given by the weighted transducers that define them. A general framework is also likely to help better understand kernels over strings or automata and their relations.

We gave the proof of several characterization results and closure properties for PDS rational kernels. These results can be used to design a complex PDS rational kernel from simpler ones, from an arbitrary weighted transducer over an appropriate semiring, or from negative definite kernels.

We also gave a study of the relation between rational kernels and several kernels or similarity measures introduced by others. Rational kernels provide a unified framework for the design of computationally efficient kernels for strings or weighted automata. The framework includes in particular pair-HMM string kernels (Durbin et al., 1998; Watkins, 1999), Haussler’s convolution kernels for strings, the path kernels of Takimoto and Warmuth (2003), and other classes of string kernels introduced for computational biology. We also showed that the classical edit-distance does not define a negative definite kernel when the alphabet contains more than one symbol, a result that to our knowledge had never been stated or proved and that can guide the study of kernels for strings in computational biology and other similar applications.

Our experiments in several different large-vocabulary spoken-dialog tasks show that rational kernels can be combined with SVMs to form powerful classifiers and demonstrate the benefits of the use of kernels applied to weighted automata. There are many other rational kernels, such as complex gappy n-gram kernels, that could be explored and that perhaps could further improve classification accuracy in such experiments. We present elsewhere new rational kernels exploiting higher-order moments of the distribution of the counts of sequences, moment kernels, and report the results of our experiments on the same tasks, which demonstrate a consistent gain in classification accuracy (Cortes and Mohri,


2004). Rational kernels can be used in a similar way in many other natural language processing, speech processing, and bioinformatics tasks.

Acknowledgments

We thank Allen Van Gelder, David Haussler, Risi Kondor, Alex Smola, and Manfred Warmuth for discussions about this work. We are also grateful to our colleagues of the AT&T HMIHY and VoiceTone teams for making available the data we used in our experiments.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. A General Weighted Grammar Library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA 2004), Kingston, Ontario, Canada, July 2004. URL http://www.research.att.com/sw/tools/grm.

Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.

Jean Berstel. Transductions and Context-Free Languages. Teubner Studienbücher: Stuttgart, 1979.

Jean Berstel and Christophe Reutenauer. Rational Series and Their Languages. Springer-Verlag: Berlin-New York, 1988.

Bernhard E. Boser, Isabelle Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, volume 5, pages 144–152, Pittsburgh, 1992. ACM.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Lattice Kernels for Spoken Dialog Classification. In Proceedings of ICASSP’03, Hong Kong, 2003a.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Positive Definite Rational Kernels. In Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003), volume 2777 of Lecture Notes in Computer Science, pages 41–56, Washington, D.C., August 2003b. Springer, Heidelberg, Germany.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels. In Advances in Neural Information Processing Systems (NIPS 2002), volume 15, Vancouver, Canada, March 2003c. MIT Press.

Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Weighted Automata Kernels – General Framework and Algorithms. In Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech ’03), Special Session on Advanced Machine Learning Algorithms for Speech and Language Processing, Geneva, Switzerland, September 2003d.

Corinna Cortes and Mehryar Mohri. Distribution Kernels Based on Moments of Counts. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, July 2004.


Corinna Cortes and Vladimir N. Vapnik. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995.

Jack J. Dongarra, Jim R. Bunch, Cleve B. Moler, and G. W. Stewart. LINPACK User’s Guide. SIAM Publications, 1979.

Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.

Samuel Eilenberg. Automata, Languages and Machines, volume A-B. Academic Press, 1974.

Patrick Haffner, Gokhan Tur, and Jeremy Wright. Optimizing SVMs for Complex Call Classification. In Proceedings of ICASSP’03, 2003.

David Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.

Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1986.

Eugene L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart, and Winston, 1976.

Christina Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch String Kernels for SVM Protein Classification. In NIPS 2002, Vancouver, Canada, March 2003. MIT Press.

Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics - Doklady, 10:707–710, 1966.

Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS 2000, pages 563–569. MIT Press, 2001.

Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.

Mehryar Mohri. Edit-Distance of Weighted Automata: General Definitions and Algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language, Budapest, Hungary, 1996. John Wiley and Sons, Chichester.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17–32, January 2000. URL http://www.research.att.com/sw/tools/fsm.


Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press, Cambridge, Massachusetts, 1997.

Arto Salomaa and Matti Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag: New York, 1978.

Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

Bernhard Schölkopf. The Kernel Trick for Distances. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors, NIPS 2001, pages 301–307. MIT Press, 2001.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press: Cambridge, MA, 2002.

Izhak Shafran, Michael Riley, and Mehryar Mohri. Voice Signatures. In Proceedings of The 8th IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2003), St. Thomas, U.S. Virgin Islands, November 2003.

Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. The Journal of Machine Learning Research (JMLR), 4:773–818, 2003.

Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.

Chris Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, 1999.
