
Kernel Machines

Kernel Machines
Theory And Practice

Marco Del Vecchio
[email protected]

Warwick Manufacturing Group
University of Warwick

26/07/2017

Kernel Machines

Outline I

1 Introduction

2 Kernel Methods

3 Support Vector Machines For Classification

4 Support Vector Machines For Regression

Kernel Machines

Introduction

Outline I

1 Introduction
  Training Data
  Loss Functions
  Generalisation And Overfitting

2 Kernel Methods
  Kernel Methods: An Introduction
  Dual Representations
  Constructing Kernels

3 Support Vector Machines For Classification
  Support Vector Machines For Classification: An Overview
  Hard Margin SVMs For Classification
  Soft Margin SVMs For Classification
  A Note on Computations
  A Note on Probabilistic Output

Kernel Machines

Introduction

Outline II

A Note on Multiclass Problems

4 Support Vector Machines For Regression
  Support Vector Machines For Regression: An Overview
  Common Kernel Choices

Kernel Machines

Introduction

Training Data

Training Data

Definition (Training Data)

Let the training data be denoted as

{(x(n), y(n)) ∈ RD × R | n = 1, . . . , N},

in the case where the response variable is a scalar, and

{(x(n),y(n)) ∈ RD × RK | n = 1, . . . , N}

when it is a multidimensional vector, where N denotes the total number of training examples.

Kernel Machines

Introduction

Loss Functions

Loss Functions

Definition (Loss Function)

A loss function L(X, y, w) is a single, overall measure of the loss incurred in taking any of the available decisions or actions. In particular, in this context, we define a loss function to be a mapping that quantifies how unhappy we would be if we used w to make a prediction on X when the correct output is y.

Kernel Machines

Introduction

Generalisation And Overfitting

Generalisation And Overfitting I

Definition (Overfitting)

A model is said to overfit the data if it does not generalise to out-of-sample cases although it fits the training data very well. More specifically, a model which explains the random error or noise in the data instead of the underlying relationship is said to be overfitting.

Kernel Machines

Introduction

Generalisation And Overfitting

Generalisation And Overfitting II

Ideally, we would like to choose the model which performs best (i.e. minimises the loss) on new, unseen data. That is, we would like a model that generalises well beyond the data used during training. However, by the very nature of the problem, unseen data is not available.

Kernel Machines

Introduction

Generalisation And Overfitting

Generalisation And Overfitting III

One solution to this is to train and validate the model on two different datasets.

Definition (K-Fold Cross Validation)

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are then averaged to produce a single measure of performance.
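A minimal sketch of the procedure above using numpy only; the names fit_fn and loss_fn are placeholders for whatever learner and loss are being validated, not part of the slides.

```python
import numpy as np

def k_fold_cv(X, y, fit_fn, loss_fn, k=5, seed=0):
    """Estimate out-of-sample loss by k-fold cross-validation.

    fit_fn(X_train, y_train) -> model and loss_fn(model, X_val, y_val) -> float
    are placeholders for the learner being validated.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))           # random partition of the sample
    folds = np.array_split(indices, k)          # k (roughly) equal-sized subsamples
    losses = []
    for i in range(k):
        val_idx = folds[i]                                    # held-out fold
        train_idx = np.concatenate(folds[:i] + folds[i + 1:]) # remaining k - 1 folds
        model = fit_fn(X[train_idx], y[train_idx])
        losses.append(loss_fn(model, X[val_idx], y[val_idx]))
    return float(np.mean(losses))               # single averaged performance measure
```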

Kernel Machines

Kernel Methods

Outline I

1 Introduction
  Training Data
  Loss Functions
  Generalisation And Overfitting

2 Kernel Methods
  Kernel Methods: An Introduction
  Dual Representations
  Constructing Kernels

3 Support Vector Machines For Classification
  Support Vector Machines For Classification: An Overview
  Hard Margin SVMs For Classification
  Soft Margin SVMs For Classification
  A Note on Computations
  A Note on Probabilistic Output

Kernel Machines

Kernel Methods

Outline II

A Note on Multiclass Problems

4 Support Vector Machines For Regression
  Support Vector Machines For Regression: An Overview
  Common Kernel Choices

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction I

Many linear parametric models can be re-cast into an equivalent “dual representation” in which the predictions are based on linear combinations of a kernel function evaluated at the training data points. As we shall see, for models which are based on a fixed non-linear feature space mapping φ(x), the kernel function is given by

k(x(n),x(m)) = φ(x(n))Tφ(x(m))

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction II

The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964); however, it was neglected for many years until Boser et al. (1992) popularised it, giving rise to the technique of large margin classifiers, which led to the support vector machines introduced by Vapnik (1995).

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction III

The concept of a kernel formulated as an inner product in a feature space allows us to build interesting extensions of many algorithms by making use of the kernel trick, also known as kernel substitution. In this case, the kernel can be written in terms of a feature map φ : X → V which satisfies

k(x(n),x(m)) = 〈φ(x(n)),φ(x(m))〉V

for x(n) and x(m) ∈ X , where 〈·, ·〉V is a proper inner product.

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction IV

Viewing kernel functions in this way gives rise to a powerful result

Result

An implicitly defined function φ exists whenever the space X can be equipped with a suitable measure ensuring the function k satisfies Mercer’s condition. Hence, an explicit representation for φ is not necessary, as long as V is an inner product space.

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction V

The general idea is that, if we have an algorithm formulated in such a way that the input vector x enters only in the form of scalar products, then we can replace that scalar product with a kernel function.

Kernel Machines

Kernel Methods

Kernel Methods: An Introduction

An Introduction VI

Question: What if the algorithm to which we want to apply the kernel trick does not possess the above-stated requirement?

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations I

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally.

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations II

Definition ((Lagrangian) Dual Problem)

Given a Lagrangian L(x, {λs}_{s=1}^{M}), where x is the vector of primal variables and {λs}_{s=1}^{M} are the non-negative Lagrange multipliers, the dual problem can be derived by:

1 Solving for some primal variable values that minimize the Lagrangian;

2 Writing the solution in terms of the primal variables as functions of the Lagrange multipliers, called dual variables;

3 Formulating the dual problem: maximize the objective function with respect to the dual variables under the derived constraints on the dual variables (including non-negativity).

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations III

Example (Dual Representation of Ridge Regression with BasisFunctions)

Consider the linear regression model h(x; w) = wTφ(x) whose parameters are determined by minimizing a regularized sum-of-squares error function given by

L(X, y, w, λ) = (1/2) ∑_{n=1}^{N} (wTφ(x(n)) − y(n))² + (λ/2) wTw

where λ ≥ 0 gives the regularisation strength, and φ : RD+1 → RM+1.

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations IV

Example (Dual Representation of Ridge Regression with BasisFunctions (Cont.))

Our objective is then to solve

arg min_w L(X, y, w, λ),   with λ ≥ 0 fixed.

Taking derivatives w.r.t. w, we find that at the optimum w satisfies

w = −(1/λ) ∑_{n=1}^{N} (wTφ(x(n)) − y(n)) φ(x(n)) = ∑_{n=1}^{N} anφ(x(n)) = ΦTa

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations V

Example (Dual Representation of Ridge Regression with BasisFunctions (Cont.))

where Φ is an N × (M + 1) matrix whose nth row is given by φ(x(n))T. Hence the vector a = (a1, . . . , aN)T is defined by

an = −(1/λ) (wTφ(x(n)) − y(n)).

Now, we can arrive at the dual formulation of this optimisation problem by writing the loss function in terms of a, substituting for w with the expression derived previously.

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations VI

Example (Dual Representation of Ridge Regression with BasisFunctions (Cont.))

Doing so gives rise to

L(a) = (1/2) aTΦΦTΦΦTa − aTΦΦTy + (1/2) yTy + (λ/2) aTΦΦTa

where y = (y1, . . . , yN)T. Now, define the Gram matrix K as K = ΦΦT, which is an N × N symmetric matrix with elements

Knm = φ(x(n))Tφ(x(m)) = k(x(n),x(m))

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations VII

Example (Dual Representation of Ridge Regression with BasisFunctions (Cont.))

Now, writing the regularised sum-of-squares error in terms of K yields

L(a) = (1/2) aTKKTa − aTKy + (1/2) yTy + (λ/2) aTKa

Thus we see that the dual formulation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel function k(x(n), x(m)).

Kernel Machines

Kernel Methods

Dual Representations

Dual Representations VIII

Example (Dual Representation of Ridge Regression with BasisFunctions (Cont.))

To conclude, setting the partial derivative of L(a) w.r.t. a to zero yields the solution

a = (K + λIN)⁻¹y.

If we then substitute this back into the linear regression model, we obtain

h(x; w) = wTφ(x) = aTΦφ(x) = k(x)T(K + λIN)⁻¹y

where k(x) = (k1(x), . . . , kN(x))T with elements kn(x) = k(x(n), x).
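The closed-form dual solution above translates directly into code. The sketch below uses numpy only and a Gaussian kernel chosen purely for illustration: it computes a = (K + λIN)⁻¹y on the training set and predicts via k(x)Ta; the function names and toy data are not part of the slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=1.0):
    K = rbf_kernel(X, X, gamma)                        # Gram matrix, K_nm = k(x(n), x(m))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)  # a = (K + lambda I_N)^{-1} y

def kernel_ridge_predict(X_train, a, X_new, gamma=1.0):
    k = rbf_kernel(X_new, X_train, gamma)              # k_n(x) = k(x(n), x)
    return k @ a                                       # h(x) = k(x)^T (K + lambda I_N)^{-1} y

# toy usage: fit a noisy sine curve
X = np.linspace(0, 2 * np.pi, 50)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).standard_normal(50)
a = kernel_ridge_fit(X, y, lam=0.1, gamma=2.0)
y_hat = kernel_ridge_predict(X, a, X, gamma=2.0)
```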

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels I

In order to exploit the kernel trick, we need to be able to construct valid kernel functions. Generally, there are two approaches:

1 Choose a feature space mapping φ(x) and then use this to find the corresponding kernel function.

2 Construct the kernel function directly.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels II

Example (From Feature Map to Kernel)

Suppose that we want to find the kernel which corresponds to the feature mapping φ(x) = (1, x1, x2, x1x2)T. The kernel corresponding to this feature mapping is

k(x, x′) = φ(x)Tφ(x′)
= (1, x1, x2, x1x2)(1, x′1, x′2, x′1x′2)T
= 1 + x1x′1 + x2x′2 + x1x2x′1x′2.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels III

Example (From Kernel to Feature Map)

Let x ∈ R2 and suppose that we want to know which feature map is associated with the following kernel, and whether it is indeed a valid one:

k(x, x′) = (xTx′)² = (x1x′1 + x2x′2)²
= x1²x′1² + 2x1x′1x2x′2 + x2²x′2²
= (x1², √2 x1x2, x2²)(x′1², √2 x′1x′2, x′2²)T
= φ(x)Tφ(x′).

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels IV

Example (From Kernel to Feature Map (Cont.))

We see that the feature mapping takes the form

φ(x) = (x1², √2 x1x2, x2²)T,

and therefore comprises all possible second-order terms, with a specific weighting between them. Furthermore, we can see that the kernel can be written as an inner product in a space defined by φ(x).
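A quick numerical check of the correspondence derived above, as a sketch in numpy: evaluating the explicit second-order feature map and taking the inner product should agree with computing (xTx′)² directly. The particular numbers below are arbitrary.

```python
import numpy as np

def phi(x):
    # explicit feature map for k(x, x') = (x^T x')^2 with x in R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([0.3, -1.2])
x_prime = np.array([2.0, 0.7])

k_direct = (x @ x_prime) ** 2          # kernel evaluated without the feature map
k_feature = phi(x) @ phi(x_prime)      # inner product in the feature space

assert np.isclose(k_direct, k_feature)  # both routes give the same value
```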

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels V

More generally, however, we need a simple way to test whether a function constitutes a valid kernel without having to construct the function φ(x) explicitly.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels VI

Result

A necessary and sufficient condition for a function k(x(n), x(m)) to be a valid kernel (Shawe-Taylor and Cristianini, 2004) is that the Gram matrix K, whose elements are given by k(x(n), x(m)), should be positive semi-definite for all possible choices of the set {x(n)}.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels VII

Therefore, we can use a kernel k(·, ·) without having to do any calculation in the underlying feature space!

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels VIII

Result (Gaussian Kernel)

Consider the Gaussian kernel, which belongs to the class of Radial Basis Function (RBF) kernels,

k(x, x′) = exp(−γ||x − x′||²);

then the underlying feature space is infinite-dimensional.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels IX

Proof.

Take the case where x ∈ R and γ = 1. Then

k(x, x′) = exp(−(x − x′)²)
= exp(−x²) exp(−x′²) exp(2xx′)
= exp(−x²) exp(−x′²) ∑_{k=0}^{∞} (2^k x^k x′^k) / k!
= ∑_{k=0}^{∞} [ exp(−x²) 2^{k/2} x^k / √(k!) ] [ exp(−x′²) 2^{k/2} x′^k / √(k!) ],

which is an inner product between feature vectors with infinitely many components.
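The series in the proof can be checked numerically: truncating the expansion after a modest number of terms already reproduces exp(−(x − x′)²) to high accuracy. A sketch for the one-dimensional, γ = 1 case, with arbitrary test points.

```python
import numpy as np

def phi_truncated(x, n_terms=30):
    """First n_terms components of the infinite feature vector
    phi_k(x) = exp(-x**2) * 2**(k/2) * x**k / sqrt(k!), k = 0, 1, 2, ..."""
    k = np.arange(n_terms)
    log_fact = np.cumsum(np.log(np.maximum(k, 1)))   # log k!  (log 0! = 0)
    return np.exp(-x ** 2) * np.exp(0.5 * k * np.log(2.0) - 0.5 * log_fact) * x ** k

x, x_prime = 0.8, -0.3
k_exact = np.exp(-(x - x_prime) ** 2)                 # Gaussian kernel with gamma = 1
k_series = phi_truncated(x) @ phi_truncated(x_prime)  # truncated inner product
print(k_exact, k_series)                              # agree to many decimal places
```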

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels X

One powerful technique for constructing new kernels is to build them out of simpler kernels used as building blocks. This can be done using the following properties. Let c > 0 be a constant, f(·) any function, q(·) a polynomial with non-negative coefficients, φ : RD → RM, k3(·, ·) a valid kernel in RM, and A a positive semidefinite matrix.

Kernel Machines

Kernel Methods

Constructing Kernels

Constructing Kernels XI

Given valid kernels k1(x, x′) and k2(x, x′), the following are also valid kernels:

k(x, x′) = c k1(x, x′)

k(x, x′) = f(x) k1(x, x′) f(x′)

k(x, x′) = q(k1(x, x′))

k(x, x′) = exp(k1(x, x′))

k(x, x′) = k1(x, x′) + k2(x, x′)

k(x, x′) = k1(x, x′) k2(x, x′)

k(x, x′) = k3(φ(x), φ(x′))

k(x, x′) = xTAx′
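A small illustration of these closure rules, as a sketch only: compose two base kernels by sum and by (elementwise) product, then check on a random sample that the resulting Gram matrices are still positive semi-definite, as the validity result above requires. The data and γ value are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))

def linear_kernel(A, B):
    return A @ B.T

def rbf_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# new kernels built from valid ones: sum and product of the Gram matrices
K_sum = linear_kernel(X, X) + rbf_kernel(X, X)
K_prod = linear_kernel(X, X) * rbf_kernel(X, X)

for K in (K_sum, K_prod):
    eigvals = np.linalg.eigvalsh(K)      # symmetric matrix, so use eigvalsh
    assert eigvals.min() > -1e-10        # positive semi-definite up to round-off
```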

Kernel Machines

Support Vector Machines For Classification

Outline I

1 Introduction
  Training Data
  Loss Functions
  Generalisation And Overfitting

2 Kernel Methods
  Kernel Methods: An Introduction
  Dual Representations
  Constructing Kernels

3 Support Vector Machines For Classification
  Support Vector Machines For Classification: An Overview
  Hard Margin SVMs For Classification
  Soft Margin SVMs For Classification
  A Note on Computations
  A Note on Probabilistic Output

Kernel Machines

Support Vector Machines For Classification

Outline II

A Note on Multiclass Problems

4 Support Vector Machines For Regression
  Support Vector Machines For Regression: An Overview
  Common Kernel Choices

Kernel Machines

Support Vector Machines For Classification

Support Vector Machines For Classification: An Overview

An Overview I

Support Vector Machines (SVMs) belong to a class of non-probabilistic classification models called maximum margin classifiers.

Definition (Margin)

We define the margin as the smallest distance between the decision boundary and any of the samples.

Kernel Machines

Support Vector Machines For Classification

Support Vector Machines For Classification: An Overview

An Overview II

An important property of support vector machines is that the determination of the model parameters corresponds to a convex optimization problem, and so any local solution is also a global optimum.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification I

Consider the class of linear models for binary classification problems specified by

h(x) = wTφ(x) + b

where φ : RD → RM denotes a fixed feature-space transformation, and b ∈ R denotes the bias term.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification II

Consider a data set given by

{(x(n), y(n)) ∈ RD × {−1,+1} | n = 1, . . . , N},

and assume that it is linearly separable in the feature space specified by φ : RD → RM.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification III

By definition of linear separability, there exists at least one choice of the parameters w and b such that

h(x(n)) = wTφ(x(n)) + b > 0 if y(n) = +1,
h(x(n)) = wTφ(x(n)) + b < 0 if y(n) = −1,

so that

h(x(n))y(n) > 0

for all training data points.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification IV

In support vector machines the decision boundary is chosen to be the one for which the margin is maximized.

The maximum margin solution can be motivated using computational learning theory, also known as statistical learning theory.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification V

The perpendicular distance of a point x from the hyperplane defined by h(x) = 0, where h(x) = wTφ(x) + b, is given by

|h(x)| / ||w||₂.

Furthermore, we are only interested in solutions for which all data points are correctly classified, so that h(x(n))y(n) > 0.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification VI

Thus the distance of an observation x(n) from the decision surface under perfect classification is given by

|h(x(n))y(n)| / ||w||₂ = |(wTφ(x(n)) + b)y(n)| / ||w||₂

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification VII

The margin is the perpendicular distance from the decision boundary to the closest point x(n). Thus, the maximum margin solution is found by solving

arg max_{w,b} { (1/||w||₂) min_n [ (wTφ(x(n)) + b)y(n) ] }.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification VIII

A direct solution to this optimisation problem would be very complex, and so we impose that

(wTφ(x(n)) + b)y(n) = 1

for the point that is closest to the surface, so that all data points satisfy the constraints

(wTφ(x(n)) + b)y(n) ≥ 1, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification IX

So, the optimisation problem that needs to be solved is given by

arg max_{w,b} 1/||w||₂   s.t. (wTφ(x(n)) + b)y(n) ≥ 1, ∀ n = 1, 2, . . . , N,

which is equivalent to

arg min_{w,b} (1/2)||w||₂²   s.t. (wTφ(x(n)) + b)y(n) ≥ 1, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification X

The above optimisation problem is an example of a quadratic programming problem, in which we are minimizing a quadratic function subject to a set of linear inequality constraints.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XI

In order to solve this constrained optimization problem, we introduce Lagrange multipliers {αn ≥ 0}_{n=1}^{N}, giving the Lagrange function

L(w, α, b) = (1/2)||w||₂² − ∑_{n=1}^{N} αn[(wTφ(x(n)) + b)y(n) − 1]

where α = (α1, α2, . . . , αN)T. Note the minus sign in front of the Lagrange multipliers: we are minimising w.r.t. w and b, but maximising with respect to α, since we want the classifications to be correct, that is, we want

(wTφ(x(n)) + b)y(n) ≥ 1.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XII

Setting the derivatives of L(w, α, b) with respect to w and b equal to zero, we obtain the following two conditions:

∂L(w, α, b)/∂w = 0 =⇒ w = ∑_{n=1}^{N} αnφ(x(n))y(n)

∂L(w, α, b)/∂b = 0 =⇒ 0 = ∑_{n=1}^{N} αny(n).

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XIII

Using the conditions above to eliminate w and b from L(w, α, b) leads to the dual formulation of the optimisation problem:

arg max_α ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αnαmφ(x(n))Tφ(x(m))y(n)y(m)

s.t. αn ≥ 0, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} αny(n) = 0.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XIV

We note that the data enters the function to optimise only in the form of an inner product in the feature space specified by φ : RD → RM. Hence, we can apply the kernel trick by replacing φ(x(n))Tφ(x(m)) with

k(x(n), x(m)).

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XV

In order to classify new data points using the trained model, we evaluate the sign of h(x) = wTφ(x) + b, which can be expressed in terms of the kernel function and the Lagrange multipliers by substituting for

w = ∑_{n=1}^{N} αnφ(x(n))y(n),

giving

h(x) = ∑_{n=1}^{N} αnk(x, x(n))y(n) + b.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XVI

Result (Hard Margin SVM)

Consider a Hard Margin SVM where the decision boundary is modelled as

h(x) = wTφ(x) + b,

where φ : RD → RM denotes a fixed feature-space transformation, and b ∈ R denotes the bias term. Let the training data be of the form

{(x(n), y(n)) ∈ RD × {−1,+1} | n = 1, . . . , N}.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XVII

Result (Hard Margin SVM (Cont.))

Assume that the data is linearly separable in the feature space specified by φ, and impose that

(wTφ(x(n)) + b)y(n) ≥ 1, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XVIII

Result (Hard Margin SVM (Cont.))

Then, the set of parameters w and b which achieve perfect classification can be found by solving the dual problem given by

arg max_α ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αnαmk(x(n), x(m))y(n)y(m)

s.t. αn ≥ 0, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} αny(n) = 0,          (3.1)

where k(x(n),x(m)) is a valid kernel.

Kernel Machines

Support Vector Machines For Classification

Hard Margin SVMs For Classification

Hard Margin SVMs For Classification XIX

Result (Hard Margin SVM (Cont.))

Finally, we can make predictions by computing

h(x) = ∑_{n=1}^{N} αnk(x, x(n))y(n) + b.
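A hedged sketch of this result in practice: scikit-learn's SVC solves the (soft margin) dual, so a very large C approximates the hard margin case on separable data. Its fitted dual_coef_, support_ and intercept_ attributes (documented scikit-learn API) store αn y(n) for the support vectors, the support vector indices, and b, so the prediction formula above can be reproduced by hand. The toy data below is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)                     # linearly separable toy data

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, y)   # huge C ~ hard margin

# h(x) = sum_n alpha_n y(n) k(x, x(n)) + b, summing over support vectors only
x_new = np.array([[0.5, 1.0]])
K = rbf_kernel(x_new, X[clf.support_], gamma=gamma)
h_manual = (K * clf.dual_coef_).sum() + clf.intercept_[0]

assert np.isclose(h_manual, clf.decision_function(x_new)[0])
print(np.sign(h_manual))    # predicted class label
```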

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification I

So far, we have assumed that the training data points are linearly separable in the feature space given by φ. In practice, however, the training data might not be linearly separable in either the input space or the feature space.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification II

Question: How can we deal with data that is not linearly separable in either the input space or the feature space?

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification III

Answer: Modify the support vector machine so as to allow some of the training points to be misclassified.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification IV

To do so we introduce slack variables ξn ≥ 0, which allow misclassification errors to occur and hence allow non-linearly-separable datasets to be handled. They are defined as

ξn = 0 if x(n) is on or inside the correct margin boundary,
ξn = |y(n) − h(x(n))| if x(n) is not on or inside the correct margin boundary.

Note: ξn can be greater than 0 even though the nth observation is on the correct side of the decision boundary, if it lies inside the margin.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification V

The exact classification constraints are then replaced with

h(x(n))y(n) ≥ 1− ξn, n = 1, 2, . . . , N.

in which the slack variables are constrained to satisfy ξn ≥ 0.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification VI

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore solve

arg min_{w,b} (1/2)||w||₂² + C ∑_{n=1}^{N} ξn

s.t. (wTφ(x(n)) + b)y(n) ≥ 1 − ξn, ξn ≥ 0, ∀ n = 1, 2, . . . , N,

where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification VII

The corresponding Lagrangian is given by

L(w, α, b) = (1/2)||w||₂² + C ∑_{n=1}^{N} ξn − ∑_{n=1}^{N} αn[(wTφ(x(n)) + b)y(n) − 1 + ξn] − ∑_{n=1}^{N} µnξn,

where {µn ≥ 0}_{n=1}^{N} are the Lagrange multipliers associated with {ξn ≥ 0}_{n=1}^{N}.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification VIII

By setting to zero the partial derivatives w.r.t. w, {ξn}_{n=1}^{N}, and b, we get the following three conditions:

∂L(w, α, b)/∂w = 0 =⇒ w = ∑_{n=1}^{N} αnφ(x(n))y(n)

∂L(w, α, b)/∂b = 0 =⇒ 0 = ∑_{n=1}^{N} αny(n)

∂L(w, α, b)/∂ξn = 0 =⇒ αn = C − µn, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification IX

Using these conditions to eliminate w and b from L(w, α, b) leads to the dual formulation of the optimisation problem:

arg max_α ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αnαmφ(x(n))Tφ(x(m))y(n)y(m)

s.t. 0 ≤ αn ≤ C, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} αny(n) = 0.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification X

In order to classify new data points using the trained model, we evaluate the sign of h(x) = wTφ(x) + b, which can be expressed in terms of the kernel function and the Lagrange multipliers by substituting for

w = ∑_{n=1}^{N} αnφ(x(n))y(n),

giving, as for the hard margin solution,

h(x) = ∑_{n=1}^{N} αnk(x, x(n))y(n) + b.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification XI

Result (Soft Margin SVM)

Consider a Soft Margin SVM where the decision boundary is modelled as

h(x) = wTφ(x) + b,

where φ : RD → RM denotes a fixed feature-space transformation, and b ∈ R denotes the bias term. Let the training data be of the form

{(x(n), y(n)) ∈ RD × {−1,+1} | n = 1, . . . , N},

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification XII

Result (Soft Margin SVM (Cont.))

Impose that

(wTφ(x(n)) + b)y(n) ≥ 1− ξn, ∀ n = 1, 2, . . . , N.

where ξn ≥ 0.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification XIII

Result (Soft Margin SVM (Cont.))

Then, the set of parameters w and b can be found by solving the dual problem given by

arg max_α ∑_{n=1}^{N} αn − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} αnαmk(x(n), x(m))y(n)y(m)

s.t. 0 ≤ αn ≤ C, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} αny(n) = 0,          (3.2)

where k(x(n),x(m)) is a valid kernel.

Kernel Machines

Support Vector Machines For Classification

Soft Margin SVMs For Classification

Soft Margin SVMs For Classification XIV

Result (Soft Margin SVM (Cont.))

Finally, we can make predictions by computing

h(x) = ∑_{n=1}^{N} αnk(x, x(n))y(n) + b.
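In practice the soft margin dual above is what standard SVM libraries solve; in scikit-learn the trade-off parameter C is passed directly to SVC. A brief sketch on overlapping (non-separable) toy data, shown only to indicate where C enters; the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)              # overlapping classes: not separable

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=0.5, C=C).fit(X, y)
    # small C -> weak slack penalty, wide margin, many support vectors;
    # large C -> heavy slack penalty, narrower margin, fewer support vectors
    print(C, len(clf.support_), clf.score(X, y))
```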

Kernel Machines

Support Vector Machines For Classification

A Note on Computations

A Note on Computations I

We first note that the objective function given by (3.1) or (3.2) is quadratic, and so any local optimum will also be a global optimum, since the constraints define a convex region as a consequence of being linear.

Kernel Machines

Support Vector Machines For Classification

A Note on Computations

A Note on Computations II

Direct solution of the quadratic programming problem using traditional techniques is often infeasible due to the demanding computation and memory requirements, so a wide range of more practical approaches have been proposed:

Protected conjugate gradients (Burges, 1998).

Decomposition methods (Osuna et al., 1996).

Sequential minimal optimization (Platt, 1999).

Kernel Machines

Support Vector Machines For Classification

A Note on Probabilistic Output

Support Vector Machines: A Note on Probabilistic Output I

As we have seen, support vector machines do not provide probabilistic outputs but instead make hard classification decisions for new input vectors. However, it is possible to construct conditional probabilities of the form

P(y = 1|x) = 1 − P(y = −1|x)

by fitting a logistic sigmoid to the outputs of a previously trained SVM (Platt, 2000).

Kernel Machines

Support Vector Machines For Classification

A Note on Probabilistic Output

Support Vector Machines: A Note on Probabilistic Output II

Specifically, the conditional probability is assumed to be of the form

P(y = 1|x; w) = σ(w0 + w1h(x))

where the parameters w = (w0, w1) ∈ R² are found by minimising the negative log-likelihood of the training data:

− log P(y|w, X) = − ∑_{n=1}^{N} [ y(n) log P(y(n) = 1|w, x(n)) + (1 − y(n)) log(1 − P(y(n) = 1|w, x(n))) ].

Kernel Machines

Support Vector Machines For Classification

A Note on Probabilistic Output

Support Vector Machines: A Note on Probabilistic Output III

Note: it is important to fit the logistic model using data that has not been used to train the underlying SVM, in order to avoid severe overfitting.
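A minimal sketch of this calibration step, assuming a scikit-learn SVC as the underlying classifier: the SVM is trained on one split, and a one-dimensional logistic regression σ(w0 + w1 h(x)) is fitted to its decision values on a separate calibration split, as the note above recommends. (scikit-learn's probability=True option performs a similar Platt-style fit internally with cross-validation.) The data is synthetic and illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.0, (100, 2)), rng.normal(1, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# keep the calibration data separate from the SVM's training data
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X_train, y_train)

# fit sigma(w0 + w1 * h(x)) to the held-out decision values
h_cal = svm.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(h_cal, y_cal)

# P(y = 1 | x) for a new point
x_new = np.array([[0.2, -0.4]])
p = platt.predict_proba(svm.decision_function(x_new).reshape(-1, 1))[0, 1]
print(p)
```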

Kernel Machines

Support Vector Machines For Classification

A Note on Multiclass Problems

Support Vector Machines: A Note on Multiclass Problems I

The support vector machine is fundamentally a two-class classifier. In practice, however, we often have to tackle problems involving K > 2 classes. Various methods have therefore been proposed for combining multiple two-class SVMs in order to build a multiclass classifier.

Kernel Machines

Support Vector Machines For Classification

A Note on Multiclass Problems

Support Vector Machines: A Note on Multiclass Problems II

one-versus-the-rest: train K separate SVMs, in which the kth model hk(x) is trained using the data from class Ck as the positive examples and the data from the remaining K − 1 classes as the negative examples.

one-versus-one: train K(K − 1)/2 different two-class SVMs on all possible pairs of classes, and then classify test points according to which class receives the highest number of votes (a brief sketch of both schemes follows).
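Both strategies are available as generic wrappers in scikit-learn; the sketch below uses the built-in iris data (K = 3) purely to show how the two schemes are invoked. (SVC itself already applies a one-versus-one scheme internally for multiclass input.)

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # K = 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=1.0)).fit(X, y)  # K binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma=0.5, C=1.0)).fit(X, y)   # K(K-1)/2 SVMs

print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```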

Kernel Machines

Support Vector Machines For Regression

Outline I

1 Introduction
  Training Data
  Loss Functions
  Generalisation And Overfitting

2 Kernel Methods
  Kernel Methods: An Introduction
  Dual Representations
  Constructing Kernels

3 Support Vector Machines For Classification
  Support Vector Machines For Classification: An Overview
  Hard Margin SVMs For Classification
  Soft Margin SVMs For Classification
  A Note on Computations
  A Note on Probabilistic Output

Kernel Machines

Support Vector Machines For Regression

Outline II

A Note on Multiclass Problems

4 Support Vector Machines For Regression
  Support Vector Machines For Regression: An Overview
  Common Kernel Choices

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Support Vector Machines For Regression: An Overview

We now extend support vector machines to regression problems while preserving the property of sparseness. In regularised linear regression, the objective is to minimise the regularised error function given by

(1/2) ∑_{n=1}^{N} (h(x(n)) − y(n))² + (λ/2)||w||₂²

where λ ≥ 0 determines the regularisation strength. Support vector regression replaces the sum-of-squares error function with an ε-insensitive error function in order to recover sparse solutions.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation I

Let us take a special case of an ε-insensitive error function having a linear cost associated with errors outside the insensitive region:

Eε(h(x; w) − y) = 0 if |h(x; w) − y| < ε, and |h(x; w) − y| − ε otherwise.

This error function gives zero error if the absolute difference between the predicted value h(x; w) and the target y is less than ε, where ε > 0.
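The error function is straightforward to write down; a small sketch comparing it with the squared error it replaces, for a grid of residuals chosen arbitrarily.

```python
import numpy as np

def eps_insensitive(residual, eps=0.5):
    # zero inside the tube |h(x; w) - y| < eps, linear outside it
    return np.maximum(0.0, np.abs(residual) - eps)

r = np.linspace(-2, 2, 9)          # residuals h(x; w) - y
print(eps_insensitive(r))          # flat at 0 inside the tube, |r| - eps outside
print(0.5 * r ** 2)                # quadratic loss, for comparison
```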

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation II

Figure: Plot of an ε-insensitive error function (red), and the quadratic error function (green) for comparison, with w ∈ R.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation III

Given the model family expressed as

h(x) = wTφ(x) + b

where φ : RD → RM denotes a fixed feature-space transformation, we therefore want to find the weights w ∈ RM and b which minimize the regularized error function given by

C ∑_{n=1}^{N} Eε(h(x(n)) − y(n)) + (1/2)||w||₂²

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation IV

As before, we re-express the optimization problem by introducing slack variables. However, unlike before, we need to introduce two kinds of slack variables. For each observation x(n) we need two slack variables ξn ≥ 0 and ξ̂n ≥ 0, where

ξn > 0 corresponds to a point for which y(n) > h(x(n)) + ε,

ξ̂n > 0 corresponds to a point for which y(n) < h(x(n)) − ε.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation V

Figure: Illustration of SVM regression, showing the regression curve together with the ε-insensitive bound, with examples of the slack variables.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation VI

The condition for a target point to lie inside the ε-tube is that

h(x(n)) − ε ≤ y(n) ≤ h(x(n)) + ε.

Introducing the slack variables allows points to lie outside the tube provided the slack variables are non-zero, and the corresponding conditions are

y(n) ≤ h(x(n)) + ε + ξn,

y(n) ≥ h(x(n)) − ε − ξ̂n.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation VII

The error function for support vector regression can then be written as

C ∑_{n=1}^{N} (ξn + ξ̂n) + (1/2)||w||₂².

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation VIII

From this, we can write the optimisation problem as

arg min_{w,b} C ∑_{n=1}^{N} (ξn + ξ̂n) + (1/2)||w||₂²

s.t. ξn ≥ 0,

ξ̂n ≥ 0,

y(n) ≤ h(x(n)) + ε + ξn,

y(n) ≥ h(x(n)) − ε − ξ̂n, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation IX

In order to solve this constrained optimization problem, we introduce Lagrange multipliers {αn ≥ 0}_{n=1}^{N}, {α̂n ≥ 0}_{n=1}^{N}, {µn ≥ 0}_{n=1}^{N}, and {µ̂n ≥ 0}_{n=1}^{N}, giving the Lagrange function

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation X

L = C ∑_{n=1}^{N} (ξn + ξ̂n) + (1/2)||w||₂²
− ∑_{n=1}^{N} (µnξn + µ̂nξ̂n)
− ∑_{n=1}^{N} αn(ε + ξn + h(x(n)) − y(n))
− ∑_{n=1}^{N} α̂n(ε + ξ̂n − h(x(n)) + y(n)).

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XI

By setting to zero the partial derivatives w.r.t. w, {ξn}_{n=1}^{N}, {ξ̂n}_{n=1}^{N}, and b, we get the following four conditions:

∂L/∂w = 0 =⇒ w = ∑_{n=1}^{N} (αn − α̂n)φ(x(n))

∂L/∂b = 0 =⇒ 0 = ∑_{n=1}^{N} (αn − α̂n)

∂L/∂ξn = 0 =⇒ αn = C − µn, ∀ n = 1, 2, . . . , N

∂L/∂ξ̂n = 0 =⇒ α̂n = C − µ̂n, ∀ n = 1, 2, . . . , N.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XII

Using these conditions to eliminate w, b, and the slack variables from L leads to the dual formulation of the optimisation problem:

arg max_{α,α̂} − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} (αn − α̂n)(αm − α̂m)k(x(n), x(m)) − ε ∑_{n=1}^{N} (αn + α̂n) + ∑_{n=1}^{N} (αn − α̂n)y(n)

s.t. 0 ≤ αn ≤ C, 0 ≤ α̂n ≤ C, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} (αn − α̂n) = 0,

where we have introduced the kernel k(x(n), x(m)) = φ(x(n))Tφ(x(m)).

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XIII

In order to make predictions for new data points using the trained model, we evaluate h(x) = wTφ(x) + b, which can be expressed in terms of the kernel function and the Lagrange multipliers by substituting for

w = ∑_{n=1}^{N} (αn − α̂n)φ(x(n)),

giving

h(x) = ∑_{n=1}^{N} (αn − α̂n)k(x, x(n)) + b.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XIV

Result (Support Vector Regression)

Consider a support vector regression model specified as

h(x) = wTφ(x) + b,

where φ : RD → RM denotes a fixed feature-space transformation, and b ∈ R denotes the bias term. Let the training data be of the form

{(x(n), y(n)) ∈ RD × R | n = 1, . . . , N}.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XV

Result (Support Vector Regression (Cont.))

Impose that

y(n) ≤ h(x(n)) + ε + ξn,

y(n) ≥ h(x(n)) − ε − ξ̂n,

where {ξn ≥ 0}_{n=1}^{N}, {ξ̂n ≥ 0}_{n=1}^{N}, and ε > 0.

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XVI

Result (Support Vector Regression (Cont.))

Then, the set of parameters w and b can be found by solving the dual problem given by

arg max_{α,α̂} − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} (αn − α̂n)(αm − α̂m)k(x(n), x(m)) − ε ∑_{n=1}^{N} (αn + α̂n) + ∑_{n=1}^{N} (αn − α̂n)y(n)

s.t. 0 ≤ αn ≤ C, 0 ≤ α̂n ≤ C, ∀ n = 1, 2, . . . , N,

∑_{n=1}^{N} (αn − α̂n) = 0.          (4.1)

Kernel Machines

Support Vector Machines For Regression

Support Vector Machines For Regression: An Overview

Derivation XVII

Result (Support Vector Regression (Cont.))

where k(x(n), x(m)) is a valid kernel. Finally, we can make predictions by computing

h(x) = ∑_{n=1}^{N} (αn − α̂n)k(x, x(n)) + b.
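In practice this dual is solved by standard libraries; scikit-learn's SVR exposes ε and C directly. A brief sketch on synthetic data, with all values chosen purely for illustration: SVR's documented dual_coef_ attribute stores the differences αn − α̂n for the support vectors, so the prediction formula above can be reproduced by hand.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 2 * np.pi, 80))[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)

gamma, eps, C = 1.0, 0.1, 10.0
reg = SVR(kernel="rbf", gamma=gamma, epsilon=eps, C=C).fit(X, y)

# h(x) = sum_n (alpha_n - alpha_hat_n) k(x, x(n)) + b, over support vectors only
x_new = np.array([[1.5]])
K = rbf_kernel(x_new, X[reg.support_], gamma=gamma)
h_manual = (K * reg.dual_coef_).sum() + reg.intercept_[0]

assert np.isclose(h_manual, reg.predict(x_new)[0])
```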

Kernel Machines

Support Vector Machines For Regression

Common Kernel Choices

Some of the most common choices of kernels are

Linear kernel: k(x,x′) = xTx′

Radial Basis Function (RBF) kernel: k(x,x′) = exp(−γ||x − x′||₂²)

Polynomial kernel: k(x,x′) = (c + xTx′)^β

Sigmoid kernel: k(x,x′) = σ(c+ xTx′)
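The four kernels listed above written out as plain functions; a sketch in which c, β, γ and the use of the logistic function σ for the sigmoid kernel follow the slide's notation, with arbitrary default values.

```python
import numpy as np

def linear(x, xp):
    return x @ xp

def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def polynomial(x, xp, c=1.0, beta=3):
    return (c + x @ xp) ** beta

def sigmoid(x, xp, c=0.0):
    return 1.0 / (1.0 + np.exp(-(c + x @ xp)))   # sigma(c + x^T x')

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, xp), rbf(x, xp), polynomial(x, xp), sigmoid(x, xp))
```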

Kernel Machines

Appendix

For Further Reading

For Further Reading I

Shawe-Taylor, J. and N. Cristianini (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.

Aizerman, M. A., E. M. Braverman, and L. I. Rozonoer (1964). The probability problem of pattern recognition learning and the method of potential functions. Automation and Remote Control, 25:1175–1190.

Boser, B. E., I. M. Guyon, and V. N. Vapnik (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), 144–152.

Kernel Machines

Appendix

For Further Reading

For Further Reading II

Cortes, C. and V. N. Vapnik (1995). Support-vector networks. Machine Learning, 20:273–297.