Sparse Kernel Methods 1
Sparse Kernel Methods for Classification and Regression
October 17, 2007
Kyungchul Park
SKKU
General Model of Learning
Learning Model (Vapnik, 2000)
- Generator (G): generates random vectors x, drawn independently from a fixed but unknown distribution F(x).
- Supervisor (S): returns an output value y according to a conditional distribution F(y|x), also fixed but unknown.
- Learning Machine (LM): capable of implementing a set of functions.

The Learning Problem: choose the function that best approximates the supervisor's response, based on a training set of N i.i.d. observations drawn from the distribution F(x, y) = F(x)F(y|x).
[Diagram: G generates x and passes it to S and the LM; S returns y, and the LM returns its approximation of y.]

Set of functions: $f(x, w),\ w \in W$

Training set: $(x_1, y_1), \ldots, (x_N, y_N)$
Risk Minimization
To best approximate the supervisor's response, find the function that minimizes the risk functional

$R(w) = \int L(y, f(x, w))\, dF(x, y)$

where L is a loss function. Note that F(x, y) is fixed but unknown, and the only information available is contained in the training set. How, then, can the risk be estimated?
Classification and Regression
Classification Problem
- Supervisor's output $y \in \{0, 1\}$.
- The loss function:

$L(y, f(x, w)) = \begin{cases} 0 & \text{if } y = f(x, w) \\ 1 & \text{if } y \neq f(x, w) \end{cases}$

Regression Problem
- Supervisor's output y: real value.
- The loss function:

$L(y, f(x, w)) = (y - f(x, w))^2$
Empirical Risk Minimization Framework
Empirical Risk
- F(x, y) is unknown, so estimate R(w) from the training data:

$R_{emp}(w) = \frac{1}{N} \sum_{n=1}^{N} L(y_n, f(x_n, w))$

Empirical Risk Minimization (ERM) Framework
- Find a function that minimizes the empirical risk; this is the fundamental assumption in inductive learning.
- For the classification problem, this leads to finding a function with the minimum training error.
- For the regression problem, it leads to the least-squares error method.
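As a concrete sketch (mine, not from the slides; assuming NumPy is available), the empirical risk is just the average loss over the training set, for either of the two loss functions above:

```python
import numpy as np

def empirical_risk(y, y_pred, loss):
    """R_emp = (1/N) * sum_n L(y_n, f(x_n, w))."""
    return np.mean([loss(t, p) for t, p in zip(y, y_pred)])

zero_one = lambda t, p: 0.0 if t == p else 1.0   # classification loss
squared  = lambda t, p: (t - p) ** 2             # regression loss

y_true = np.array([1, 0, 1, 1])
y_hat  = np.array([1, 1, 1, 0])
print(empirical_risk(y_true, y_hat, zero_one))   # 2 errors out of 4 -> 0.5
```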
Over-Fitting Problem
Over-Fitting
- Small training error (empirical risk), but large generalization error!
- Consider the problem of polynomial curve fitting:
  - Polynomials of sufficiently high degree can perfectly fit a given finite set of training data.
  - However, when applied to new (unseen) data, the prediction quality can be very poor.

Why Over-Fitting?
- Many possible causes: insufficient data, noise, etc. (a source of continuing debate).
- However, we know that over-fitting is closely related to model complexity (the expressive power of the learning machine).
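The polynomial curve-fitting case can be sketched in a few lines (my own illustration, assuming NumPy; the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy samples of a smooth function

train_err = {}
for M in (3, 9):
    coef = np.polyfit(x, y, M)                            # degree-M least-squares fit
    train_err[M] = np.mean((np.polyval(coef, x) - y) ** 2)

print(train_err)   # the degree-9 fit drives training error to (nearly) zero
```

The degree-9 polynomial can interpolate the 10 noisy points almost exactly, so its training error is far below the degree-3 fit's; yet its predictions between and beyond the points are typically much worse, which is exactly the over-fitting effect described above.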
Over-Fitting Problem: Illustration (Bishop, 2006)

[Figure: polynomial fits for several degrees M. M: degree of polynomial; green: true function; red: least-squares estimate.]
How to Avoid Over-Fitting Problem
General Idea
- Penalize models with high complexity (Occam's Razor).

Regularization
- Add a regularization term to the risk functional, e.g., ridge regression:

$E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y_n - f(x_n, w)\}^2 + \frac{\lambda}{2} \|w\|^2$

SRM (Structural Risk Minimization) Principle
- Due to Vapnik (1998). With probability at least $1 - \eta$,

$R(w) \le R_{emp}(w) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}$

- h: capacity (VC dimension) of the set of functions.
Bayesian Methods
- Incorporate prior knowledge on the form of the functions via a prior distribution F(w).
- Final result: the predictive distribution F(y|D), where D is the training set, is obtained by marginalizing over w.

Remark.
1) The Bayesian framework gives probabilistic generative models.
2) There is a strong connection with regularization theory.
3) Kernels can be generated from generative models.
Motivation: Linear Regression
Primal Problem: use ridge regression.

$f(x, w) = w^T x = \sum_{i=1}^{D} w_i x_i$, where $w, x \in R^D$

$\min_w E(w) = \min_w \left\{ \frac{1}{2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 + \frac{\lambda}{2} w^T w \right\}$

Solution
- Setting the gradient to zero gives the normal equations:

$(X^T X + \lambda I) w = X^T y$

$w = (X^T X + \lambda I)^{-1} X^T y$

$f(x, w) = w^T x = y^T X (X^T X + \lambda I)^{-1} x$
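The closed-form primal solution is a one-liner in NumPy (a sketch of mine; the data, noise level, and $\lambda$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 50, 3, 0.1
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, N)

# Primal ridge solution: w = (X^T X + lam*I)^(-1) X^T y  (inverts a D x D matrix)
w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w)   # close to w_true for small noise and small lam
```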
Dual Problem
- Starting from $(X^T X + \lambda I) w = X^T y$, write

$w = -\frac{1}{\lambda} X^T (Xw - y) = X^T a$, where $a_n = -\frac{1}{\lambda} (w^T x_n - y_n)$

- Substituting $w = X^T a$ into E(w) gives

$E(a) = \frac{1}{2} a^T X X^T X X^T a - a^T X X^T y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T X X^T a$

$= \frac{1}{2} a^T K K a - a^T K y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T K a$

where $K = X X^T$ and $K_{nm} = x_n^T x_m = k(x_n, x_m)$.

- Minimizing over a gives

$a = (K + \lambda I)^{-1} y$

$f(x, w) = w^T x = a^T X x = k(x)^T (K + \lambda I)^{-1} y$

where $k(x)$ is the vector with elements $k_n(x) = x_n^T x$.
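A quick numerical check (my sketch, assuming NumPy) confirms that the primal and dual routes produce identical predictions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 40, 3, 0.5
X = rng.normal(size=(N, D))
y = X @ np.array([0.3, -1.0, 2.0]) + rng.normal(0, 0.1, N)

# Primal: invert a D x D matrix
w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual: invert an N x N Gram matrix K = X X^T
K = X @ X.T
a = np.linalg.solve(K + lam * np.eye(N), y)

x_new = rng.normal(size=D)
f_primal = w @ x_new
f_dual = (X @ x_new) @ a      # k(x)^T a, with k_n(x) = x_n^T x
print(f_primal, f_dual)       # identical up to numerical error
```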
Discussion
- In the primal formulation, we must invert a D x D matrix; in the dual formulation, an N x N matrix.
- The dual representation shows that the predicted value is a linear combination of the observed values, with weights given by the function k.

So Why Dual?
- Note that the solution of the dual problem is determined entirely by K.
- K is called the Gram matrix and is defined by the function k, called the kernel function.
- The major observation here is that we can solve the regression problem knowing only the Gram matrix K, or alternatively the kernel function k.
- We can generalize to other forms of functions if we define a new kernel function!
Beyond Linear Relations
Extension to Nonlinear Functions
- Apply a feature space transform

$\phi: x \to \phi(x)$

and define the set of functions

$f(x, w) = w^T \phi(x)$

where $\phi(x)$ is a vector of basis functions, for example polynomials of degree D.
- By using a feature space transform, we can extend the linear relation to nonlinear relations.
- These models are still linear models, since the function is linear in the unknowns (w).
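To make the "nonlinear in x, linear in w" point concrete, here is a small sketch of mine (assuming NumPy): ordinary least squares on polynomial basis functions recovers a quadratic exactly.

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = (1, x, x^2, ..., x^degree) for scalar inputs x."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 2        # quadratic ground truth, no noise

Phi = poly_features(x, 2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear in w despite nonlinear features
print(w)   # recovers the coefficients (1.0, -2.0, 0.5)
```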
Problems with the Feature Space Transform
- Difficulty in finding an appropriate transform.
- Curse of dimensionality: the number of parameters rapidly increases.

So, Kernel Functions!
- Note that in the dual formulation, the only necessary information is the kernel function.
- A kernel function is defined as an inner product of two feature vectors.
- If we can find an appropriate kernel function, we can solve the problem without explicitly computing the feature space transform.
- Some kernel functions have the effect of working in an infinite-dimensional feature space.
Kernel Functions
A kernel is a function k that for all $x, z \in X$ satisfies

$k(x, z) = \phi(x)^T \phi(z)$

where $\phi$ is a mapping from X to a feature space F:

$\phi: x \to \phi(x) \in F$

Example: for $x, z \in R^2$, take $k(x, z) = (x^T z)^2$ with

$\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$

$\phi(x)^T \phi(z) = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2 = k(x, z)$
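This identity is easy to verify numerically (my sketch, assuming NumPy): the kernel value and the explicit feature-space inner product coincide.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """The same kernel computed directly, without the feature map."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), k(x, z))   # both equal (1*3 + 2*(-1))^2 = 1
```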
Characterization of Kernel Functions
How to Find a Kernel Function?
- First define a feature space transform, then define the kernel as an inner product in that space.
- Alternatively, there is a direct method to characterize kernels:

Characterization of Kernels (Shawe-Taylor and Cristianini, 2004)
- A function $k: X \times X \to R$, which is either continuous or has a finite domain, can be decomposed as $k(x, z) = \phi(x)^T \phi(z)$ if and only if it is a finitely positive semi-definite function; that is, for any choice of a finite set $\{x_1, \ldots, x_N\} \subseteq X$, the matrix $K = (k(x_n, x_m))$ is positive semi-definite.
- For the proof, see the reference (it uses the RKHS, Reproducing Kernel Hilbert Space, construction).
- An alternative characterization is given by Mercer's theorem.
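The positive semi-definiteness condition can be probed numerically on any finite point set (a sketch of mine, assuming NumPy; the negative example is my own choice, not from the slides):

```python
import numpy as np

def gram(kernel, xs):
    """Gram matrix K_nm = k(x_n, x_m) for a finite set of points."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

poly = lambda x, z: (x @ z) ** 2                     # a valid kernel
not_kernel = lambda x, z: -np.linalg.norm(x - z)     # symmetric, but not PSD in general

xs = [np.array(p) for p in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]]

eig_ok  = np.linalg.eigvalsh(gram(poly, xs))         # all eigenvalues >= 0
eig_bad = np.linalg.eigvalsh(gram(not_kernel, xs))   # has a negative eigenvalue
print(eig_ok.min(), eig_bad.min())
```

Of course, passing this check on one finite set does not prove a function is a kernel; the characterization requires it for every finite set.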
Examples of Kernel Functions
Example

$k(x, z) = (x^T z + c)^M$

$k(x, z) = \exp(-\|x - z\|^2 / s^2)$

$k(x, z) = p(x)\, p(z)$

$k(A, B) = 2^{|A \cap B|}$

- 1st: polynomial kernel; 2nd: Gaussian kernel.
- 3rd: kernel derived from a generative model, where p(x) is a probability.
- 4th: kernel defined on the power set of a given set S.
- There are many known techniques for constructing new kernels from existing kernels; see the reference.
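Two of these are easy to write down directly (my sketch, assuming NumPy; the bandwidth parameter is arbitrary):

```python
import numpy as np

def gaussian_kernel(x, z, s=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    x, z = np.asarray(x), np.asarray(z)
    return np.exp(-np.sum((x - z) ** 2) / s ** 2)

def set_kernel(A, B):
    """Set kernel k(A, B) = 2^{|A & B|}: counts the subsets common to A and B."""
    return 2 ** len(set(A) & set(B))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))   # 1.0 at zero distance
print(set_kernel({1, 2, 3}, {2, 3, 4}))          # |{2,3}| = 2, so 2^2 = 4
```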
Kernel in Practice
In practical applications, you can choose a kernel that reflects the similarity between two objects. Note that

$\|\phi(x) - \phi(z)\|^2 = \phi(x)^T \phi(x) + \phi(z)^T \phi(z) - 2\, \phi(x)^T \phi(z) = k(x, x) + k(z, z) - 2\, k(x, z)$

Hence, if appropriately normalized, the kernel represents the similarity between two objects in some feature space.

Remark.
1) Kernel trick: develop a learning algorithm based on inner products, then replace the inner product with a kernel (e.g., regression, classification, etc.).
2) Generalized distance: we can generalize the notion of a kernel to the case where it represents dissimilarity in some feature space (conditionally positive semi-definite kernels). Then we can use such kernels in learning algorithms based on distances between objects (e.g., clustering, nearest neighbor, etc.).
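The distance identity above means feature-space distances never require the explicit map; a small sketch of mine (assuming NumPy), reusing the degree-2 polynomial kernel:

```python
import numpy as np

def kernel_distance(k, x, z):
    """Squared feature-space distance from kernel values only:
    ||phi(x) - phi(z)||^2 = k(x,x) + k(z,z) - 2 k(x,z)."""
    return k(x, x) + k(z, z) - 2 * k(x, z)

k_poly = lambda x, z: (x @ z) ** 2                                   # degree-2 polynomial kernel
phi = lambda x: np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])   # its explicit feature map

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
d_kernel  = kernel_distance(k_poly, x, z)
d_feature = np.sum((phi(x) - phi(z)) ** 2)
print(d_kernel, d_feature)   # equal: the distance needs no explicit phi
```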
Support Vector Machines
Two-Class Classification Problem
- Given a training set $\{(x_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{-1, +1\}$, find a function

$f(x) = w^T \phi(x) + b$

that satisfies $f(x_n) > 0$ for all points having $y_n = +1$ and $f(x_n) < 0$ for all points having $y_n = -1$.
- Equivalently, $y_n f(x_n) > 0$ for all n.
Support Vector Machines: Linearly Separable Case
Linearly Separable Case
- The case in which we can find such a function f(x).
- Here the training points are separated by a hyperplane (the separating hyperplane) f(x) = 0 in the feature space.
- There can be infinitely many such functions.

Margin
- The margin is the distance between the hyperplane and the closest point.
Maximum Margin Classifiers
- Find the hyperplane with the maximum margin.
- Why maximum margin?
  - Recall SRM: the maximum margin hyperplane corresponds to the case with the smallest capacity (Vapnik, 2000).
  - So it is the solution when we choose the SRM framework.
Formulation: Quadratic Programming

$\min\ \frac{1}{2} \|w\|^2$

subject to $y_n (w^T \phi(x_n) + b) \ge 1$ for all $n = 1, \ldots, N$.

1) The parameters are normalized so that the closest points satisfy $y_n f(x_n) = 1$.
2) The margin then equals $1 / \|w\|$.
Dual Formulation

$\max\ \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n,m} a_n a_m y_n y_m k(x_n, x_m)$

subject to

$a_n \ge 0$ for all $n = 1, \ldots, N$

$\sum_{n=1}^{N} a_n y_n = 0$

1) Obtained by applying Lagrangian duality.
2) $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$
3) The hyperplane found is

$f(x) = \sum_{n=1}^{N} a_n y_n k(x, x_n) + b$
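The dual QP is usually handed to a dedicated solver; purely as an illustration (my own sketch, not a method from the slides), a minimal SMO-style coordinate ascent solves it on a tiny linearly separable set, using a large C to approximate the hard-margin case:

```python
import numpy as np

def smo_svm(X, y, C=1e3, tol=1e-5, max_passes=20):
    """Minimal SMO-style solver for the SVM dual with a linear kernel.
    Optimizes pairs (a_i, a_j), keeping sum_n a_n y_n = 0 at every step.
    Returns dual variables a and bias b. Illustration only."""
    N = len(y)
    K = X @ X.T
    a, b = np.zeros(N), 0.0
    passes, sweeps = 0, 0
    while passes < max_passes and sweeps < 1000:
        sweeps += 1
        changed = 0
        for i in range(N):
            Ei = (a * y) @ K[:, i] + b - y[i]
            if (y[i] * Ei < -tol and a[i] < C) or (y[i] * Ei > tol and a[i] > 0):
                j = (i + 1) % N                      # deterministic partner choice
                Ej = (a * y) @ K[:, j] + b - y[j]
                ai, aj = a[i], a[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                a[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
                if abs(a[j] - aj) < 1e-9:
                    continue
                a[i] += y[i] * y[j] * (aj - a[j])    # preserves sum_n a_n y_n = 0
                b1 = b - Ei - y[i]*(a[i]-ai)*K[i, i] - y[j]*(a[j]-aj)*K[i, j]
                b2 = b - Ej - y[i]*(a[i]-ai)*K[i, j] - y[j]*(a[j]-aj)*K[j, j]
                b = b1 if 0 < a[i] < C else b2 if 0 < a[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return a, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
a, b = smo_svm(X, y)
pred = np.sign((a * y) @ (X @ X.T) + b)   # f(x_n) = sum_m a_m y_m k(x_n, x_m) + b
print(pred, a)                            # correct labels; most a_n near zero (sparsity)
```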
Discussion
- KKT conditions:

$a_n \ge 0, \quad y_n f(x_n) - 1 \ge 0, \quad a_n \{ y_n f(x_n) - 1 \} = 0$

- So $a_n > 0$ only if $y_n f(x_n) = 1$.
  - Such vectors are called support vectors.
  - Note the maximum margin hyperplane depends only on the support vectors: sparsity.
- Note that to solve the dual problem, we only need the kernel function k, so we need not consider the feature space transform explicitly.
- The form of the maximum margin hyperplane shows that the prediction is given by a combination of observations (with weights given by kernels), specifically the support vectors.
Example (Bishop, 2006)
Gaussian kernels are used here.
Support Vector Machines: Overlapping Classes
Overlapping Classes
- Introduce slack variables.
- The results are almost the same except for additional constraints. For details, see the reference.
SVM for Regression
$\epsilon$-insensitive Error Function

$E_\epsilon(f(x) - y) = \begin{cases} 0 & \text{if } |f(x) - y| < \epsilon \\ |f(x) - y| - \epsilon & \text{otherwise} \end{cases}$
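The loss is trivial to implement (my sketch, assuming NumPy), and makes the "tube" behavior explicit: residuals smaller than $\epsilon$ cost nothing.

```python
import numpy as np

def eps_insensitive(residual, eps):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps:
    errors inside the epsilon-tube are ignored entirely."""
    return np.maximum(np.abs(residual) - eps, 0.0)

r = np.array([-1.0, -0.3, 0.0, 0.2, 0.8])
print(eps_insensitive(r, eps=0.5))   # [0.5, 0.0, 0.0, 0.0, 0.3]
```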
Formulation

$\min\ C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|w\|^2$

subject to, for all $n = 1, \ldots, N$:

$y_n \le w^T \phi(x_n) + b + \epsilon + \xi_n$

$y_n \ge w^T \phi(x_n) + b - \epsilon - \hat{\xi}_n$

$\xi_n, \hat{\xi}_n \ge 0$
Solution
- Similar to SVM for classification: use the Lagrangian dual. The solution is given by

$f(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(x, x_n) + b$

- By considering the KKT conditions, we can show that the dual variables are positive only if the corresponding point is either on the boundary of or outside the $\epsilon$-tube. So sparsity results.
Summary
Classification and Regression Based on Kernels
- Dual formulation: extension to arbitrary kernels.
- Sparsity: support vectors.

Some Limitations of SVM
- Choice of kernel.
- Solution algorithm: efficiency of solving large-scale QP problems.
- Multi-class problems.
Related Topics
Relevance Vector Machines
- Use prior knowledge on the distribution of the functions (parameters):

$p(y \mid x, w, \beta) = N(y \mid w^T \phi(x), \beta^{-1})$

$p(w \mid \alpha) = \prod_{i=1}^{M} N(w_i \mid 0, \alpha_i^{-1})$

- Posterior of w (Φ is the design matrix with rows $\phi(x_n)^T$):

$p(w \mid y, X, \alpha, \beta) = N(w \mid m, \Sigma), \quad m = \beta\, \Sigma \Phi^T y, \quad \Sigma = (\mathrm{diag}(\alpha_i) + \beta\, \Phi^T \Phi)^{-1}$

- Marginal likelihood:

$p(y \mid X, \alpha, \beta) = \int p(y \mid X, w, \beta)\, p(w \mid \alpha)\, dw$

- Choose $\alpha, \beta$ that maximize the marginal likelihood. Then, using them, find the predictive distribution of y given a new value x by using the posterior of w.
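The posterior formulas above are a few lines of NumPy (my sketch; here the hyperparameters $\alpha, \beta$ are simply fixed rather than optimized by maximizing the marginal likelihood):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, beta = 30, 3, 100.0                 # beta = noise precision (fixed here)
Phi = rng.normal(size=(N, M))             # design matrix of basis function values
w_true = np.array([2.0, 0.0, -1.0])
y = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta), N)

alpha = np.ones(M)                        # one precision hyperparameter per weight

# Posterior over w: Sigma = (diag(alpha) + beta*Phi^T Phi)^(-1), m = beta*Sigma*Phi^T y
Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ y
print(m)   # posterior mean, shrunk toward 0 by the prior
```

In the full RVM, maximizing the marginal likelihood drives many $\alpha_i \to \infty$, pruning the corresponding weights; that is the source of its sparsity.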
Gaussian Processes
- A Gaussian process is a collection of random variables $\{f(x)\}_{x \in X}$ such that, for any finite set of points $\{x_1, \ldots, x_N\}$, the values $\{f(x_n)\}_{n=1}^{N}$ jointly have a Gaussian distribution.
- Usually, due to lack of prior knowledge, the mean is taken to be 0; the covariance is defined by a kernel function k.
- The regression problem, given a set of observations $\{(x_n, y_n)\}_{n=1}^{N}$, reduces to finding the conditional distribution of y at new inputs.
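For zero-mean GP regression with observation noise, the conditional (posterior) mean at new inputs has a closed form; a minimal sketch of mine (assuming NumPy; kernel, lengthscale, and noise level are arbitrary choices):

```python
import numpy as np

def gp_predict(X, y, X_new, kernel, noise=1e-2):
    """Zero-mean GP regression: the posterior mean at X_new is
    k_*^T (K + noise*I)^(-1) y, with covariances given by the kernel."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    K_star = np.array([[kernel(a, b) for b in X] for a in X_new])
    return K_star @ np.linalg.solve(K + noise * np.eye(len(X)), y)

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 0.5)   # squared-exponential kernel

X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
X_new = np.array([[0.5]])
print(gp_predict(X, y, X_new, rbf))   # close to sin(pi) = 0
```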
References
General introductory material for machine learning
[1] Pattern Recognition and Machine Learning by C. M. Bishop, Springer, 2006. Very well written book with an emphasis on Bayesian methods.

Fundamentals of statistical learning theory and kernel methods
[2] Statistical Learning Theory by V. Vapnik, John Wiley and Sons, 1998.
[3] The Nature of Statistical Learning Theory, 2nd Ed. by V. Vapnik, Springer, 2000.
Both books deal with essentially the same topic, but in [3] mathematical details are kept to a minimum, while [2] gives all the details. The origin of SVM.

Kernel engineering
[4] Kernel Methods for Pattern Analysis by J. Shawe-Taylor and N. Cristianini, Cambridge University Press, 2004. Deals with various kernel methods, with applications to problems with texts, sequences, trees, etc.

Gaussian processes
[5] Gaussian Processes for Machine Learning by C. Rasmussen and C. Williams, MIT Press, 2006. Presents an up-to-date survey of Gaussian processes and related topics.