
18 Computing the Bayes Kernel Classifier

Pál Ruján

FB Physik and ICBM, Carl von Ossietzky Universität Oldenburg, Postfach 2503, D-26111 Oldenburg, Germany
[email protected]
http://www.neuro.uni-oldenburg.de/∼rujan

Mario Marchand

SITE, University of Ottawa, Ottawa, K1N-6N5 Ontario, Canada
[email protected]
http://www.site.uottawa.ca/∼marchand

We present below a simple ray-tracing algorithm for estimating the Bayes classifier for a given class of parameterized kernels.

18.1 Introduction

Support Vector Machines try to achieve good generalization by computing the maximum margin separating hyperplane in a high-dimensional feature space. This approach effectively combines two very good ideas.

The first idea is to map the space of input vectors into a very high-dimensional feature space in such a way that nonlinear decision functions on the input space can be constructed by using only separating hyperplanes on the feature space. By making use of kernels, we can implicitly perform such mappings without explicitly using high-dimensional separating vectors (Boser et al., 1992). Since it is very likely that the training examples will be linearly separable in the high-dimensional feature space, this method offers an elegant alternative to network growth algorithms as in (Rujan and Marchand, 1989; Marchand et al., 1990), which try to construct nonlinear decision surfaces by combining perceptrons.

The second idea is to construct the separating hyperplane on the feature space which has the largest possible margin. Indeed, it was shown by Vapnik and others that this may give good generalization even if the dimension of the feature space is infinite. We must keep in mind, however, that the optimal classifier is the Bayes decision rule. Hence, in an attempt to obtain classifiers with better generalization, we present here a method, based on ergodic billiards, for estimating the Bayes classifier on the high-dimensional feature space at the price of longer training times. We present empirical evidence that the resulting classifier can generalize better than the maximum margin solution.

18.2 A simple geometric problem

Here is the main message of this chapter in a two-dimensional nutshell. Consider a convex polygon P in 2D. Given a direction v, compute the line partitioning the polygon P into two parts of equal area A1 = A2 (see Fig. 18.1). Call this line the Bayes decision line for direction v.
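To make this 2D picture concrete, here is a small Python sketch (not from the original chapter) that computes the Bayes decision line for a given direction v by bisecting on the line offset; the helper names (polygon_area, clip_halfplane, bayes_line) and the counter-clockwise vertex convention are our own assumptions.

```python
import numpy as np

def polygon_area(pts):
    # Shoelace formula; pts is an (n, 2) array of vertices in CCW order.
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def clip_halfplane(pts, n, c):
    # Keep the part of the convex polygon where n . x <= c (one Sutherland-Hodgman pass).
    out, m = [], len(pts)
    for i in range(m):
        p, q = pts[i], pts[(i + 1) % m]
        dp, dq = np.dot(n, p) - c, np.dot(n, q) - c
        if dp <= 0:
            out.append(p)
        if dp * dq < 0:                       # the edge crosses the cutting line
            out.append(p + dp / (dp - dq) * (q - p))
    return np.array(out)

def bayes_line(pts, v, tol=1e-9):
    # Line parallel to v (i.e., with normal n) splitting the polygon into equal areas.
    n = np.array([-v[1], v[0]], dtype=float)
    n /= np.linalg.norm(n)
    target = 0.5 * polygon_area(pts)
    lo, hi = float(np.min(pts @ n)), float(np.max(pts @ n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        part = clip_halfplane(pts, n, mid)
        area = polygon_area(part) if len(part) >= 3 else 0.0
        lo, hi = (mid, hi) if area < target else (lo, mid)
    return n, 0.5 * (lo + hi)                 # the Bayes line is {x : n . x = c}

# Example: an asymmetric convex quadrilateral and a direction v.
P = np.array([[0.0, 0.0], [4.0, 0.0], [5.0, 2.0], [1.0, 3.0]])
normal, offset = bayes_line(P, v=np.array([1.0, 0.3]))
```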


Figure 18.1 Partitioning a convex polyhedron in two equal volumes by a hyperplane with normal v.

Now let us draw such Bayes lines in all possible directions, as shown in Fig. 18.2. Contrary to our triangle-preconditioned expectations, these lines do not intersect in one point. Hence, no matter how we choose a point inside the polygon P, there will be directions along which the v-oriented line will not partition P into two equal areas. The Bayes point is defined here as the point for which the direction-averaged squared area-difference is minimal.


Figure 18.2 The set of Bayes-decision lines.

In general, it is rather difficult to compute the Bayes point. The convex polyhedra we will consider are defined through a set of inequalities; the vertices of the polyhedron are not known. Under such conditions one feasible approximation of the Bayes point is to compute the center of the largest inscribed circle or, better, the symmetry center of the largest-area inscribed ellipse. As shown below, the center of the largest inscribed circle corresponds to the maximal margin perceptron. For strongly elongated polygons, a definitely better approach is to consider the center of mass of the polygon, as illustrated in Fig. 18.3.

Figure 18.3 The largest inscribed circle and the center of mass (cross).

Jumping suddenly into N dimensions, the question is now how to sample effectively a high-dimensional polyhedron in order to compute its center of mass. One possibility is to use Monte Carlo sampling (Neal, 1996) or sampling with pseudo-random sequences (Press et al., 1992).
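As a rough illustration of the Monte Carlo option (our own toy sketch, not the method advocated in this chapter), one can estimate the center of mass of a polyhedron given by inequalities A x ≤ b with plain rejection sampling from a bounding box; this is only viable in low dimensions, which is precisely why the billiard alternative below is attractive.

```python
import numpy as np

def center_of_mass_mc(A, b, box_lo, box_hi, n_samples=100000, seed=0):
    # Estimate the center of mass of {x : A x <= b} by rejection sampling
    # uniformly from the axis-aligned box [box_lo, box_hi]^d.
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    accepted = []
    for _ in range(n_samples):
        x = rng.uniform(box_lo, box_hi, size=d)
        if np.all(A @ x <= b):
            accepted.append(x)
    return np.mean(accepted, axis=0)

# Example: the triangle x >= 0, y >= 0, x + y <= 1 (true center of mass (1/3, 1/3)).
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
print(center_of_mass_mc(A, b, box_lo=0.0, box_hi=1.0))
```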

As suggested by one of us (Rujan, 1997), another viable alternative is to use a ray-tracing method. The bounding polygon will be considered as a billiard table inside which a ball bounces elastically on the walls. With few exceptions, corresponding to fully integrable or rational-angle billiards, the so-defined Hamiltonian dynamics will be ergodic, implying that a typical trajectory will homogeneously cover the polygon, as illustrated in Fig. 18.4. It is known that the entropy of polygonal (and polyhedral) billiards vanishes (Zemlyakov and Katok, 1975). However, by excluding from the billiard a spherical region inside the polyhedron, one can make the dynamics hyperbolic (Bunimovich, 1979). Such a billiard is somewhat similar to the Sinai-Lorenz billiard, which has strong mixing properties (Bunimovich and Sinai, 1980; Berry, 1981). The absence of a general theorem, however, does not prevent us from using this algorithm. In the following, we will simply assume that a typical classification problem leads to ergodic dynamics and will sample the phase space accordingly. The presence of limit cycles, typical for fully integrable systems, or of KAM-tori in soft chaos, can be - in principle - detected numerically.

Figure 18.4 Trajectory (dashed line) after 1000 bounces.

18.3 The maximal margin perceptron

Consider a set of m N-dimensional data points {x1, . . . , xm} belonging to two classes labeled by {y1, . . . , ym}, yi = ±1. We are seeking the plane normal w and two thresholds b+1, b−1 such that

$$(\mathbf{w} \cdot \mathbf{x}_i) - b_{+1} > 0, \qquad y_i = +1 \tag{18.1}$$

$$(\mathbf{w} \cdot \mathbf{x}_i) - b_{-1} < 0, \qquad y_i = -1 \tag{18.2}$$

$$b_{+1} > b_{-1}$$


This corresponds to finding two parallel hyperplanes passing between the two convex hulls of the positive and negative examples, respectively, such that their distance (the gap or margin) is maximal:

$$G = \max_{\mathbf{w}} \frac{b_{+1} - b_{-1}}{\sqrt{(\mathbf{w} \cdot \mathbf{w})}} \tag{18.3}$$

This primal problem is the maximal margin or maximal 'dead zone' perceptron (Vapnik, 1979; Lampert, 1969). Eqs. (18.1)-(18.2) can be rewritten compactly as a set of linear inequalities,

$$y_i (\mathbf{x}_i, 1)^T (\mathbf{w}, -b) \geq \Delta \geq 0, \qquad i = 1, 2, \dots, m \tag{18.4}$$

where $b = \frac{b_{+1} + b_{-1}}{2}$ is the threshold of the maximal margin perceptron and $\Delta = \frac{b_{+1} - b_{-1}}{2}$ is the stability of the set of linear inequalities (18.4). Note that in this notation the normal vector $(\mathbf{w}, -b)$ is also $(N+1)$-dimensional. A two-dimensional illustration of these concepts is shown in Fig. 18.5.


Figure 18.5 The maximal margin perceptron (once again!).

As shown by Lampert (1969) and geometrically evident from Fig. 18.5, the minimal connector problem is dual to the maximal margin problem: find two convex combinations

$$\mathbf{X}_+ = \sum_{\{i:\, y_i = +1\}} \alpha_i^+ \mathbf{x}_i; \qquad \sum_{\{i:\, y_i = +1\}} \alpha_i^+ = 1; \qquad \alpha_i^+ \geq 0 \tag{18.5}$$

$$\mathbf{X}_- = \sum_{\{i:\, y_i = -1\}} \alpha_i^- \mathbf{x}_i; \qquad \sum_{\{i:\, y_i = -1\}} \alpha_i^- = 1; \qquad \alpha_i^- \geq 0 \tag{18.6}$$

such that

$$L^2 = \min_{\{\alpha_i^+, \alpha_i^-\}} \|\mathbf{X}_+ - \mathbf{X}_-\|^2 \tag{18.7}$$

The corresponding weight vector is given by w_mm = X_+ − X_-. Eq. (18.7), together with the convexity constraints (18.5)-(18.6), defines a quadratic programming problem. In the mathematical programming literature the vertices A, B, and C in Fig. 18.5 are called active constraints. Only the Lagrange multipliers α_i^+ and α_i^- corresponding to active vertices i are strictly positive; all others vanish. The active constraints satisfy the inequalities Eqs. (18.1)-(18.2) as equalities. Furthermore, the gap G, Eq. (18.3), and the minimal connector's length L, Eq. (18.7), are equal only at optimality. Vapnik calls the active constraints support vectors, since the expression for w_mm (Eqs. (18.5)-(18.6)) involves two convex combinations of the active constraints only.

A better geometric picture can be obtained by considering the linear conjugate vector space, so that Eq. (18.4) now describes hyperplanes whose normals are given by zi = yi(xi, 1). In this space all W = (w, −b) vectors satisfying the linear inequalities Eq. (18.4) lie within the convex cone shown in Fig. 18.6. The ray corresponds to the direction of the maximal margin perceptron. The point at which it intersects the unit sphere is at distance

$$d_i^{mm} = \left| \frac{(\mathbf{w}_{mm} \cdot \mathbf{x}_i)}{\sqrt{(\mathbf{x}_i \cdot \mathbf{x}_i)}} - b \right| = \Delta \tag{18.8}$$

from the active example xi. Hence, if all active examples have the same length, the maximal margin perceptron corresponds to the center of the largest sphere inscribed into the spherical polyhedron defined by the intersection of the unit sphere with the version space polyhedral cone.
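Since the minimal connector problem (18.5)-(18.7) is just a small quadratic program, it can be handed to a generic solver. The following sketch uses scipy's SLSQP routine purely for illustration; the function name and the choice of solver are ours, and no attempt is made to exploit the structure a dedicated SVM optimizer would use.

```python
import numpy as np
from scipy.optimize import minimize

def minimal_connector(X, y):
    # Dual problem (18.5)-(18.7): convex combinations X+ and X- of the positive
    # and negative examples with minimal squared distance; returns w_mm = X+ - X-.
    pos, neg = np.where(y == +1)[0], np.where(y == -1)[0]

    def objective(a):
        Xp = a[pos] @ X[pos]          # convex combination of positive examples
        Xn = a[neg] @ X[neg]          # convex combination of negative examples
        return np.sum((Xp - Xn) ** 2)

    constraints = [
        {"type": "eq", "fun": lambda a: np.sum(a[pos]) - 1.0},
        {"type": "eq", "fun": lambda a: np.sum(a[neg]) - 1.0},
    ]
    a0 = np.zeros(len(y))
    a0[pos], a0[neg] = 1.0 / len(pos), 1.0 / len(neg)
    res = minimize(objective, a0, method="SLSQP",
                   bounds=[(0.0, None)] * len(y), constraints=constraints)
    a = res.x
    return a[pos] @ X[pos] - a[neg] @ X[neg], a
```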

18.4 The Bayes perceptron

Given a set of data z = {z_i}_{i=1}^{m} and an unknown example x, the (optimal) Bayes decision rule for x is given by (see for example (Neal, 1996)):

$$f(\mathbf{x}) = \mathrm{sgn}\left( \int g(\mathbf{x}, \mathbf{W}) \, P(\mathbf{W}|\mathbf{z}) \, d\mathbf{W} \right) \tag{18.9}$$

where g is the perceptron output function, g(x, W) = sgn{(x · w) − b}, and P(W|z) is the posterior network distribution. As usual, the Bayes identity

$$P(\mathbf{W}|\mathbf{z}) = \frac{P(\mathbf{z}|\mathbf{W}) \, P(\mathbf{W})}{P(\mathbf{z})} \tag{18.10}$$

relates the posterior distribution to the prior P(W). It was shown by Watkin (1993) that for the uniform posterior, the Bayes classifier Eq. (18.9) can be represented by a single perceptron corresponding to the center of mass of the version space V:


Figure 18.6 The version space is a convex polyhedral cone. The upper left plane corresponds to the active vertex A, the upper right one to the vertex B, and the lower one to the vertex C in Fig. 18.5. The version space is restricted to the intersection between the unit sphere and the convex cone by the normalization constraint (Wi · Wi) ≡ 1, i = 1, . . . , m. The ray corresponds to the direction of the maximal margin perceptron.

$$\lim_{N \to \infty,\; m \to \infty,\; \frac{m}{N} = \mathrm{const}} f(\mathbf{x}) = g(\mathbf{x}, \mathbf{W}^*) \tag{18.11}$$

where

$$\mathbf{W}^* = \frac{1}{\mathrm{vol}\,(V)} \int_{\mathbf{W} \in V} \mathbf{W} \, dV \tag{18.12}$$

Hence, the center of mass of the intersection of the version space polyhedral cone with the unit sphere defines the Bayes perceptron parameters as long as the posterior is uniform. It is, however, not difficult to generalize the sampling algorithm presented below to nonuniform posteriors. As described in Chapter 1, the Bayes point is the center of mass of a polyhedral hypersurface whose local mass density is proportional to the posterior distribution (18.10). Under the ergodicity assumption the ray-tracing algorithm provides a collection of homogeneously distributed sampling points. The center of mass can then be estimated by a weighted average of these vectors, where the weights are proportional to the posterior probability density. Before describing the implementation of the billiard algorithm in more detail, we consider the generalization of these ideas to SVMs.


18.5 The Kernel-Billiard

As explained in the introduction to this book (Chapter 1), support vector machines are somewhat similar to the method of potential functions and to "boosting" architectures. From a physicist's point of view, however, the support vector machines (SVM) operate on a quantum probability amplitude (wave-packets) Φ with an L2 metric, while the method of potential functions and boosting operate on a classical probability density with an L1 metric. To see this, consider Fig. ?? in Chapter 1: the SVM architecture acts as an operator on the input feature vector Φ(x), while the method of potential functions propagates the input vector directly through the set of real functions Φ(xi); the dot-product layer is missing. The boosting architecture requires, in addition, the weights connected to the output unit to be convex coefficients. Hence, they can be interpreted as the probability that the "hypothesis" function Φi is correct.

In order to generalize the perceptron learning method described above to kernels, one must rewrite the algorithm in terms of dot products of the form qij = (xi · xj). If this is possible, one makes the substitution qij ← (Φ(xi) · Φ(xj)) = k(xi, xj), where k(x, y) is a Mercer kernel. The main trick is thus to substitute this positive definite kernel for all scalar products. The explicit form of Φi ≡ Φ(xi) is not needed.
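In code, the substitution amounts to precomputing the m × m matrix of kernel values and using it wherever a scalar product would appear; a minimal sketch (with our own helper names, using the Gaussian kernel of Eq. (18.28) as an example):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Mercer kernel of Eq. (18.28): k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel=gaussian_kernel):
    # q_ij = (x_i . x_j) is replaced by K_ij = k(x_i, x_j);
    # the feature map Phi is never formed explicitly.
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K
```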

18.5.1 The Flipper Algorithm

Algorithm 18.1: Flipper (Rujan, 1997)

1. Initialize the center of mass vector, counters, etc.; normalize the example vectors according to Eq. (18.19).

2. Find a feasible solution W inside the version space.

3. Generate a random unit direction vector V in version space.

4. For iteration nB < Nmax max-iterations (flight time τ < τmax max-time):

   - compute the flight times to all bounding planes (m² dot products ∼ kernel evaluations),

   - compute the plane index of the next bounce (corresponding to the shortest positive flight time),

   - compute the new position in version space W′ and store it in the center of mass vector,

   - compute the reflected direction vector V′.

5. Test conditions: go back to 3) if escaped to ∞, or exit at Nmax.

Therefore, the typical number of operations is O(m²Nmax) kernel evaluations. To find a feasible solution, use any perceptron learning algorithm or the light-trap method described below. Fig. 18.7 illustrates the main step of the algorithm. We compute the flight times for all planes and choose the smallest positive one. This will be the plane first hit by the billiard trajectory. We move the ball to the new position and reflect the direction of flight. The arc between the old and the new position is added, according to the rules derived below, to the center of mass estimate. Since the version space is a polyhedral cone, the trajectory will sometimes escape to infinity. In such cases we restart it from the current estimated center of mass. Note the analogy to the two-dimensional geometrical problem mentioned above.

Figure 18.7 Bouncing on a plane. The flight time from point W to the plane can be computed from the scalar products d⊥ = (W · X) and V⊥ = (V · X) as τ = −d⊥/V⊥. The collision takes place at W′ = W + τV and the new direction is V′ = V − 2V⊥X.
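The bounce of Fig. 18.7 is easy to express directly in terms of the plane normals; the following sketch is our own rendering of that single step for the linear case, where Z holds the unit normals z_i = y_i(x_i, 1) as rows (the kernel version, written in the dual variables, is sketched after Eqs. (18.25)-(18.26) below).

```python
import numpy as np

def bounce(W, V, Z):
    # One billiard step: Z holds unit plane normals as rows. Returns the new
    # position, the reflected direction, the flight time, and the plane index;
    # an infinite flight time signals escape (restart from the center of mass).
    d_perp = Z @ W                       # d_i = (W . z_i)
    v_perp = Z @ V                       # V_i = (V . z_i)
    with np.errstate(divide="ignore", invalid="ignore"):
        tau = -d_perp / v_perp           # flight times to every plane
    tau[~np.isfinite(tau) | (tau <= 0)] = np.inf
    j = int(np.argmin(tau))
    if not np.isfinite(tau[j]):
        return W, V, np.inf, -1          # trajectory escapes to infinity
    W_new = W + tau[j] * V               # move to the collision point
    V_new = V - 2.0 * v_perp[j] * Z[j]   # elastic reflection on plane j
    return W_new, V_new, float(tau[j]), j
```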

How many bounces NB do we need before the estimated center of mass vector will be within an ε distance from the true center of mass with probability 1 − η? If the generated series of vectors W is identically and independently generated from the posterior distribution, then, as shown in the Appendix, it is sufficient to make

$$N_B = \frac{m(b-a)^2}{2\epsilon^2} \ln\left(\frac{2m}{\eta}\right) \tag{18.13}$$

bounces, where m is the number of training vectors and, for simplicity, we have assumed that each component of W lies in the interval [a, b]. However, to this number we must add the number of bounces needed for the generated series of vectors W to become identically and independently distributed.
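Just to get a feel for the size of this bound, the following throwaway snippet evaluates Eq. (18.13) for some arbitrary illustrative values (m = 100 examples, components in [−1, 1], ε = 0.1, η = 0.05); the numbers are ours, not from the chapter.

```python
import numpy as np

def n_bounces(m, a, b, eps, eta):
    # Eq. (18.13): samples needed so the center-of-mass estimate is within eps
    # of the truth with probability at least 1 - eta (i.i.d. samples assumed).
    return int(np.ceil(m * (b - a) ** 2 / (2.0 * eps ** 2) * np.log(2.0 * m / eta)))

print(n_bounces(m=100, a=-1.0, b=1.0, eps=0.1, eta=0.05))   # roughly 1.7e5
```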

Let us describe a given trajectory by the sequence of hyperplane indices hit by the billiard ball. Consider two trajectories started from neighboring points in slightly different directions. It seems reasonable to assume that the two trajectories will become fully uncorrelated once the two balls start to bounce on different planes. Hence, let us define the correlation length Λ as the average number of bounces after which two trajectories originally close in phase space bounce for the first time on different planes. The efficiency of the flipper algorithm depends on how Λ scales with m. In comparison, note that the best known algorithm for estimating the center of mass of a convex polyhedron uses the largest-volume inscribed ellipsoid and, apart from logarithmic terms, scales with O(m^{3.5}) (Khachiyan and Todd, 1993). By introducing slight changes in the flipper algorithm we can enhance the mixing properties of the billiard and thus optimize Λ. A simple solution is, for instance, to introduce a reflecting sphere around the (estimated) center of mass, lying completely inside the polyhedron. Another possibility is to use a random, nonlinear scattering-angle function simulating a dispersing boundary. Such a dynamics would dissipate energy but lead to better mixing properties. This interesting topic will be addressed elsewhere.

When generalizing the billiard algorithm to kernel methods, we first have to show that the center of mass of the now very high-dimensional space lies in the span defined by the example set. Let Φi ≡ Φ(xi) denote the image in feature space of the training example xi. The version space V is defined to be the following set of weight vectors:

$$V = \{\mathbf{w} : (\mathbf{w} \cdot \Phi_i) > 0 \ \text{for} \ i = 1, \dots, m\} \tag{18.14}$$

where m denotes the number of training examples. Any vector w lying in the version space can be written as w = w‖ + w⊥, where w‖ lies in the space spanned by {Φi}, i = 1, . . . , m:

$$\exists\, \alpha_1, \dots, \alpha_m : \quad \mathbf{w}_\parallel = \sum_{i=1}^{m} \alpha_i \Phi_i \tag{18.15}$$

and w⊥ lies in the orthogonal complement of that space (with respect to V):

$$(\mathbf{w}_\perp \cdot \Phi_i) = 0 \quad \forall\, i = 1, \dots, m \tag{18.16}$$

Hence, (w‖ + w⊥) ∈ V if and only if (w‖ − w⊥) ∈ V. For each w the w⊥ components cancel, and the center of mass w∗,

$$\mathbf{w}^* = \frac{1}{\mathrm{vol}\,(V)} \int_{V} \mathbf{w} \, d\mathbf{w} \tag{18.17}$$

lies in the space spanned by {Φi}, i = 1, . . . , m.

This simple result implies that instead of choosing a general direction in feature space we can restrict ourselves to a vector lying in the subspace spanned by the example vectors:

$$\mathbf{V} = \left( \sum_{i=1}^{m} \beta_i \Phi_i,\; v_0 \right) \tag{18.18}$$

subject to the normalization condition

$$(\mathbf{V} \cdot \mathbf{V}) = \sum_{i=1}^{m} \sum_{j=1}^{m} \beta_i \beta_j \, k(\mathbf{x}_i, \mathbf{x}_j) + v_0^2 = 1 \tag{18.19}$$

The formulation of the flipper algorithm ensures that all subsequent direction vectors will also lie in this linear span. As usual, we also expand the weight vector as

$$\mathbf{W} = \left( \sum_{i=1}^{m} \alpha_i \Phi_i,\; -b \right) \tag{18.20}$$

The normal vectors of the version space polyhedron boundaries are given in this description by

$$\mathbf{Z}_p = \frac{1}{\sqrt{k(\mathbf{x}_p, \mathbf{x}_p) + 1}}\, (y_p \Phi_p,\; y_p) \tag{18.21}$$

where p ∈ P denotes the subset of the extended example vectors (hyperplanes) which bound from inside the zero-error feature space. Support vectors as defined by the maximal margin algorithm (MMSV) are related to those training examples which touch the largest inscribed hypersphere. Support vectors as defined in a Bayesian context correspond to hyperplanes bounding the zero-error feature space (BSV). Obviously, MMSV ⊆ BSV ⊆ X.

In general, the vectors Φi, i = 1, . . . , m, are neither orthogonal nor linearly independent. Therefore, the expansions (18.18)-(18.20) are not unique. In addition, some of the example hyperplanes might (and will!) lie outside the version space polyhedral cone and therefore cannot contribute to the center of mass we are seeking. Strictly speaking, the expansions in Eqs. (18.18)-(18.21) should be only in terms of the p ∈ P support vectors. This set is, however, unknown at the beginning of the algorithm. This ambiguity, together with the problem of a drifting trajectory (discussed below in more detail), is specific to the kernel method.

As explained in Fig. 18.7, if we bounce on the j-th example hyperplane, we compute

$$V_j^\perp = \sum_{i=1}^{m} y_j \beta_i \, k(\mathbf{x}_i, \mathbf{x}_j) + v_0 y_j \tag{18.22}$$

and, similarly,

$$W_j^\perp = \sum_{i=1}^{m} y_j \alpha_i \, k(\mathbf{x}_i, \mathbf{x}_j) - b y_j \tag{18.23}$$

Once the example index j∗ corresponding to the smallest positive flight time τmin = τj∗,

$$j^* = \arg\min_{j,\; \tau_j > 0} \tau_j, \qquad \tau_j = -\frac{W_j^\perp}{V_j^\perp} \tag{18.24}$$

has been calculated, the update rules for V and W are

$$\beta_i \leftarrow \beta_i - 2 y_{j^*} V^\perp_{j^*} \delta_{i,j^*}, \qquad v_0 \leftarrow v_0 - 2 y_{j^*} V^\perp_{j^*} \tag{18.25}$$

and

$$\alpha_i \leftarrow \alpha_i + \tau_{min} \beta_i, \qquad b \leftarrow b + \tau_{min} v_0 \tag{18.26}$$
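Putting Eqs. (18.22)-(18.26) together, one bounce in the kernel (dual) representation can be sketched as follows; the function and variable names are ours, K is the precomputed kernel matrix, and the normalization of the plane normals is omitted exactly as in the update rules above.

```python
import numpy as np

def kernel_bounce(alpha, b, beta, v0, K, y):
    # One flipper bounce in the dual variables: (alpha, b) parametrize W,
    # (beta, v0) parametrize the direction V, K[i, j] = k(x_i, x_j).
    W_perp = y * (K @ alpha) - b * y          # Eq. (18.23) for every plane j
    V_perp = y * (K @ beta) + v0 * y          # Eq. (18.22) for every plane j
    with np.errstate(divide="ignore", invalid="ignore"):
        tau = -W_perp / V_perp                # flight times, Eq. (18.24)
    tau[~np.isfinite(tau) | (tau <= 0)] = np.inf
    j = int(np.argmin(tau))
    if not np.isfinite(tau[j]):
        return alpha, b, beta, v0, None       # escape: restart from the center of mass
    alpha = alpha + tau[j] * beta             # Eq. (18.26): move the ball
    b = b + tau[j] * v0
    beta = beta.copy()
    beta[j] -= 2.0 * y[j] * V_perp[j]         # Eq. (18.25): reflect the direction
    v0 = v0 - 2.0 * y[j] * V_perp[j]
    return alpha, b, beta, v0, j
```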


In order to update the center of mass vector we define the sum of two unit vectors a and b as follows (see Fig. 18.8). Each vector has an associated weight, denoted by ρa and ρb, respectively. The sum of the two vectors is c, with weight ρc = ρa + ρb. If we choose the vector a as the unit vector in direction x, then, according to Fig. 18.8, the coordinates of the three vectors are a = (1, 0), b = (cos φ, sin φ), and c = (cos φ1, sin φ1), respectively. Note that cos φ = (a · b). We now require that the angle of the resultant vector c equals φ1 = (ρb/ρc) φ, as when adding parallel forces.


Figure 18.8 Adding two vectors lying on the unit hypersphere. The weights are denoted by ρ, and c = a + b implies ρc = ρa + ρb.

This leads to the following addition rule:

$$\mathbf{c} = \cos\!\left(\frac{\phi\, \rho_b}{\rho_a + \rho_b}\right) \mathbf{a} + \frac{\sin\!\left(\frac{\phi\, \rho_b}{\rho_a + \rho_b}\right)}{\sin(\phi)} \left[\mathbf{b} - \cos(\phi)\, \mathbf{a}\right] \tag{18.27}$$

where cos φ = (a · b) ← k(a, b). As explained in (Rujan, 1997), we can add recursively, using Eq. (18.27), the middle points of the arcs between the two bouncing points W and W′, with a weight given by the length of the arc, which is proportional to the angle φ. If the new direction points towards the free space, τmin = ∞, we restart the billiard at the actual center of mass in a randomly chosen direction. If the number of bounces and the dimension are large enough, which is almost always the case, then it is enough to add the bouncing points with the weight determined by the posterior probability density.
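A direct transcription of the addition rule (18.27), in explicit coordinates (in the kernel setting a and b would be represented by their dual coefficients and the dot product by k(a, b)); the function name is ours.

```python
import numpy as np

def add_on_sphere(a, rho_a, b, rho_b):
    # Weighted addition of two unit vectors on the hypersphere, Eq. (18.27);
    # returns the resultant unit vector c and its weight rho_a + rho_b.
    cos_phi = float(np.clip(np.dot(a, b), -1.0, 1.0))
    phi = np.arccos(cos_phi)
    if phi < 1e-12:                           # (nearly) identical vectors
        return a.copy(), rho_a + rho_b
    phi1 = phi * rho_b / (rho_a + rho_b)      # angle of the resultant, as for parallel forces
    c = np.cos(phi1) * a + (np.sin(phi1) / np.sin(phi)) * (b - cos_phi * a)
    return c, rho_a + rho_b
```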


Now consider the case when the number of BSVs is less than m. This means that in the example set there are some hyperplanes which cannot contribute to the center of mass. The simplest such example occurs when the ball bounces for a long time between two given planes while moving (drifting) along with a constant speed, while the other examples are inactive. According to the update rules Eqs. (18.25)-(18.26), all examples contribute to both w and the threshold b! However, when using the center of mass for classification, the contributions from inactive examples cancel out.

18.5.2 The light-trap algorithm

We can extend the main idea behind the billiard dynamics in several ways. One variant is the light-trap algorithm, which can be used to generate a feasible solution. Instead of the fully reflecting mirror in Fig. 18.7, consider a half-reflecting mirror, schematically shown in Fig. 18.9.


Figure 18.9 Bouncing on a semi-transparent plane. If the "light" comes from above (V⊥ < 0), the trajectory is elastically reflected. If, however, the light shines from below (V⊥ > 0), the trajectory is allowed to pass through the mirror.

Therefore, if we start the trajectory at a point making few errors in version space, any time we encounter an unsatisfied linear inequality we will "pass" through it, reducing the number of errors by one. Eventually, the trajectory will be trapped in the version space with zero errors. This algorithm is particularly simple to


implement for the Gaussian kernel

$$k(\mathbf{x}, \mathbf{y}) = e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}} \tag{18.28}$$

where the starting point corresponds to the limit σ → 0. In this limit all Φi vectors are pairwise orthogonal, and X−, X+ are given by the centers of mass of the negative and positive example vectors in feature space: αi = 1/m− for negative and αi = 1/m+ for positive examples, respectively, where m± denotes the number of positive (negative) training examples.

This is nothing other than a Bayesian decision for fully uncorrelated patterns, the only useful information being the occurrence frequency of the two classes. This solution is a good starting point, making relatively few errors also for finite σ values. This idea can be easily implemented and provides, in this context, a new perceptron-learning method, thus making the billiard method self-contained.
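A sketch of this starting point (our own code, with the signs of the classes carried by the labels elsewhere in the billiard bookkeeping): in the σ → 0 limit of the Gaussian kernel the initial dual coefficients are simply the inverse class counts.

```python
import numpy as np

def initial_alphas(y):
    # sigma -> 0 limit of the Gaussian kernel (18.28): the feature vectors are
    # pairwise orthogonal and the starting weight vector is X+ - X-, i.e.
    # alpha_i = 1/m+ for positive and alpha_i = 1/m- for negative examples.
    m_plus = int(np.sum(y == +1))
    m_minus = int(np.sum(y == -1))
    return np.where(y == +1, 1.0 / m_plus, 1.0 / m_minus)
```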

18.5.3 The soft billiard

Another natural extension is to "soften" the billiard by allowing for one or more errors, as illustrated in Fig. 18.10.

Figure 18.10 Projecting a trajectory beyond the zero-error version space.

The "onion algorithm" sorts the positive flight times in increasing order. While the shortest flight time corresponds to the boundary of the zero-error feature space polyhedron, we can now sample the position this ray would have on the second (one-error version space), third (two-error version space), etc., bounce. Therefore, while keeping the trajectory inside the zero-error cone, we simultaneously generate, at a minimal additional cost, estimates of the zero-, one-, two-error, etc., version spaces. By keeping track of the original labels of the planes the trajectory is bouncing upon, we produce a set of 'center-of-mass' vectors corresponding to a given number (n−, n+) of negative and positive errors, respectively. After convergence, we can try different ways of weighting these results without having to rerun the sampling procedure.
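A minimal sketch of this bookkeeping (names and structure are ours): from the current point, the sorted positive flight times give, in one pass, sample points for the zero-, one-, two-error, ... version spaces.

```python
import numpy as np

def onion_samples(W, V, tau, n_shells=3):
    # tau holds the positive flight times to all constraint planes from W along V
    # (computed as in the flipper step). The smallest gives the bounce on the
    # zero-error polyhedron; the 2nd, 3rd, ... crossings sample the one-error,
    # two-error, ... version spaces.
    tau = np.asarray(tau, dtype=float)
    finite = np.sort(tau[np.isfinite(tau)])        # increasing flight times
    return [W + t * V for t in finite[:n_shells]]  # crossing points, shell by shell
```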

This method provides a powerful tool for searching for good 'mass centers' in addition to changing the kernel parameter σ. It also has the same drawback, namely that it requires careful testing according to the 'k out of m'-estimate protocol.

18.6 Numerical Tests

We have not yet performed exhaustive tests on the SVM variant of these algorithms. While in the process of editing this manuscript, we received an article (Herbrich et al., 1999) in which numerical results obtained on a large set of data strongly support our conclusions. For illustration purposes we present below results obtained for normal perceptrons, where analytic results are also available (Opper and Haussler, 1991). The training examples have been generated at random and classified according to a randomly chosen but fixed "teacher" perceptron. Fig. 18.11 shows the theoretical Bayes learning curve. The experimental results were obtained using the billiard algorithm and represent an average over 10 different runs.

Figure 18.11 The learning curve (generalization probability) as a function of α = M/D, where M is the number of randomly generated examples and D = 100 is the input dimension. The continuous curve is the Bayes result of Opper and Haussler; the points with error bars represent billiard results.


Figure 18.12 Comparison between average generalization values obtained with the maximal margin perceptron and the flipper algorithm, respectively. Same parameter values as in Fig. 18.11.

Fig. 18.12 shows a comparison between the average generalization error of the maximal margin perceptron (lower curve) and the billiard results. Not shown is the fact that in each single run the maximal margin perceptron was worse than the billiard result. We present below the results obtained with the Gaussian kernel Eq. (18.28) for two real-life data sets. In Fig. 18.13 are the sonar data (Gorman and Sejnowski, 1988), split into two sets according to the aperture angle. Note that different versions of this split are in circulation. Our split can be found on P.R.'s www-home page (http://www.neuro.uni-oldenburg/∼rujan).

These results show that using the onion algorithm one can obtain very good, stable results for a rather wide range of σ values.

We also tested our algorithms on the Wisconsin breast cancer data collected by W. H. Wolberg (Breast Cancer Database, University of Wisconsin Hospitals, Madison). After removing all incomplete examples we are left with 673 cases, each with 9 attributes and 2 classes. Using the leave-10-out cross-validation method, for σ = 1.75 we obtain 25 errors for the maximal margin SVM and 23 errors for the billiard algorithm (out of 670 tests).


Figure 18.13 The aspect-angle dependent sonar data split: comparison between the generalization error obtained with the maximal margin SVM, the flipper, and the onion algorithms, respectively, as a function of the kernel parameter σ. Note that different data splits are available in the literature.

18.7 Conclusions

Although we have not yet tested the billiard algorithm for SVMs in great detail, we believe that this algorithm and its variants provide good estimates of the Bayesian kernel classifiers. Another interesting observation is that the best results were obtained just before the billiard dynamics "closed". In general, for small values of the kernel parameter σ, the trajectories are rather short and the ball often escapes to ∞. As σ increases, the trajectory length increases as well. A phase-transition-like change happens for even larger values of σ: the length of the trajectory seems to diverge and the billiard is closed.

Further theoretical work is necessary for correctly determining the typical correlation length Λ in Eq. (18.13). If it turns out that Λ does not depend on the dimension (number of training examples), then the estimate (18.13) would suggest that the billiard method is superior to any known algorithm for solving convex programming problems.

In the present implementation the billiard-based algorithms are slow in comparison to the QP algorithms. However, since they can be run in parallel on different processors without having to exchange a lot of data, they are well suited for massively parallel processing.


err   +0        +1        +2         +3         +4         +5
-0    8 (3,5)   9 (1,8)   13 (1,12)  18 (1,17)  20 (0,20)  21 (0,21)
-1   10 (5,5)   7 (2,5)    8 (1,7)    8 (1,7)    9 (1,8)   11 (1,10)
-2   10 (5,5)   8 (3,5)    8 (2,6)    6 (1,5)   10 (1,9)   16 (1,15)
-3   10 (5,5)   9 (4,5)   10 (3,7)    6 (1,5)    7 (1,6)    9 (1,8)
-4   12 (7,5)  10 (5,5)   10 (4,6)    9 (3,6)    8 (1,7)    8 (1,7)
-5   10 (8,2)  10 (5,5)    9 (4,5)    9 (4,5)    8 (3,5)    7 (1,6)

Table 18.1 Typical error table delivered by the soft billiard (onion) algorithm. The matrix indices represent the number of errors allowed when computing the weight vector for the − and the + class, respectively. The matrix elements contain the total number of errors made on the test set and their split into − and + class errors, respectively. The results shown are for the aspect-angle dependent sonar data. A Gaussian kernel with σ = 4.4 has been used; the total number of test examples was 104.

Acknowledgments

This work was performed during P.R.'s visit to SITE, University of Ottawa, in September 1998. During the editing of this manuscript we were informed of similar work by R. Herbrich, T. Graepel, and C. Campbell, arriving at similar conclusions (Herbrich et al., 1999). We thank them for sending us a reprint of their work. The kernel implementation of the billiard algorithm as described above can be found at http://www.neuro.uni-oldenburg/∼rujan.


18.8 Appendix

In this section, we first present a sampling lemma (which directly follows from Hoeffding's inequality) and then discuss its implications for the number of samples needed to find a good estimate of the center of mass of the version space.

Lemma 18.1

Let v be an n-dimensional vector (v1, v2, . . . , vn) where each component vi is confined to the real interval [a, b]. Let v1, v2, . . . , vk be k independent vectors that are identically distributed according to some unknown probability distribution P. Let μ be the true mean of v (i.e., μ = ∫ v dP(v)) and let μ̂ be the empirical estimate of μ obtained from k samples (i.e., μ̂ = (1/k) Σ_{j=1}^{k} v_j). Let ||μ̂ − μ|| denote the Euclidean distance between μ̂ and μ. Then, with probability at least 1 − η, ||μ̂ − μ|| < ε whenever

$$k > \frac{n(b-a)^2}{2\epsilon^2} \ln\left(\frac{2n}{\eta}\right) \tag{18.29}$$

Proof To have ||μ̂ − μ|| < ε, it is sufficient to have |μ̂i − μi| < ε/√n simultaneously for each component i = 1, . . . , n. Let Ai be the event for which |μ̂i − μi| < ε/√n and let Āi be the complement of that event, so that Pr(A1 ∩ A2 ∩ · · · ∩ An) = 1 − Pr(Ā1 ∪ Ā2 ∪ · · · ∪ Ān). Hence we have Pr(A1 ∩ A2 ∩ · · · ∩ An) > 1 − η if and only if Pr(Ā1 ∪ Ā2 ∪ · · · ∪ Ān) < η. However, from the well-known union bound, we have Pr(Ā1 ∪ Ā2 ∪ · · · ∪ Ān) ≤ Σ_{i=1}^{n} Pr(Āi). Hence, to have Pr(A1 ∩ A2 ∩ · · · ∩ An) > 1 − η, it is sufficient to have Pr(Āi) < η/n for i = 1, . . . , n. Consequently, in order to have ||μ̂ − μ|| < ε with probability at least 1 − η, it is sufficient to have |μ̂i − μi| ≥ ε/√n with probability at most η/n for each component i. Now, for any component i, Hoeffding's inequality (Hoeffding, 1963) states that:

$$\Pr\left\{ |\hat{\mu}_i - \mu_i| \geq \alpha \right\} \leq 2 e^{-2k\alpha^2/(b-a)^2} \tag{18.30}$$

The lemma then follows by choosing α = ε/√n and by imposing that η/n be larger than the right-hand side of Hoeffding's inequality.

To apply this lemma to the problem of estimating the center of mass of the version space, we could just substitute for the vector v the separating weight vector (which would include an extra component for the threshold). However, we immediately run into a difficulty when the separating vector lies in an infinite-dimensional feature space. In that case, we just apply the lemma to the m-dimensional vector (α1, α2, . . . , αm), dual to the separating weight vector. Hence, because we must now replace n by the number m of training examples, the number k of samples needed becomes:

$$k > \frac{m(b-a)^2}{2\epsilon^2} \ln\left(\frac{2m}{\eta}\right) \tag{18.31}$$

Regardless of whether we are sampling directly the weight vectors or the duals, the bound obtained from this lemma for estimating the center of mass applies only when we are sampling according to the true Bayesian posterior distribution.


References

M. V. Berry. Quantizing a classically ergodic system: Sinai's billiard and the KKR method. Annals of Physics, 131:163–216, 1981.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.

L. A. Bunimovich. On the ergodic properties of nowhere dispersing billiards. Commun. Math. Phys., 65:295–312, 1979.

L. A. Bunimovich and Ya. G. Sinai. Markov partitions for dispersed billiards. Commun. Math. Phys., 78:247–280, 1980.

R. P. Gorman and T. J. Sejnowski. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 1988.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines: Estimating the Bayes point in kernel space. IJCAI 99, 1999.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

L. G. Khachiyan and M. J. Todd. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Mathematical Programming, 61:137–159, 1993.

P. R. Lampert. Designing pattern categories with extremal paradigm information. In M. S. Watanabe, editor, Methodologies of Pattern Recognition. Academic Press, N.Y., 1969.

M. Marchand, M. Golea, and P. Rujan. Convergence theorem for sequential learning in two-layer perceptrons. Europhysics Letters, 11:487–492, 1990.

R. Neal. Bayesian Learning in Neural Networks. Springer Verlag, 1996.

M. Opper and D. Haussler. Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Physical Review Letters, 66:2677, 1991.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, Cambridge, 1992.

P. Rujan. Playing billiard in version space. Neural Computation, 9:99–122, 1997.

P. Rujan and M. Marchand. Learning by minimizing resources in neural networks. Complex Systems, 3:229–242, 1989.

V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).

T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21:871, 1993.

A. N. Zemlyakov and A. B. Katok. The topological transitivity of billiards in polygons. Mat. Zametki, 18:291–301, 1975.