Sparse Kernel Methods 1
Sparse Kernel Methods for Classification and Regression
October 17, 2007
Kyungchul Park
SKKU
General Model of Learning
Learning Model (Vapnik, 2000)
- Generator (G): generates random vectors x, drawn independently from a fixed but unknown distribution F(x).
- Supervisor (S): returns an output value y according to a conditional distribution F(y|x), also fixed but unknown.
- Learning Machine (LM): capable of implementing a set of functions.

The Learning Problem: choose the function that best approximates the supervisor's response, based on a training set of N i.i.d. observations drawn from the distribution F(x, y) = F(x)F(y|x).
[Diagram: G generates x and passes it to S and the LM; S returns y, and the LM returns its approximation of y.]

Set of functions: $f(x, w),\ w \in W$

Training set: $(x_1, y_1), \ldots, (x_N, y_N)$
Risk Minimization
To best approximate the supervisor's response, find the function that minimizes the risk functional

$R(w) = \int L(y, f(x, w))\, dF(x, y)$

where L is a loss function. Note that F(x, y) is fixed but unknown, and the only information available is contained in the training set. How, then, can the risk be estimated?
Classification and Regression
Classification Problem
- Supervisor's output $y \in \{0, 1\}$.
- The loss function:

$L(y, f(x, w)) = \begin{cases} 0 & \text{if } y = f(x, w) \\ 1 & \text{if } y \neq f(x, w) \end{cases}$

Regression Problem
- Supervisor's output y: real value.
- The loss function:

$L(y, f(x, w)) = (y - f(x, w))^2$
Empirical Risk Minimization Framework
Empirical Risk
- F(x, y) is unknown, so estimate R(w) from the training data:

$R_{emp}(w) = \frac{1}{N} \sum_{n=1}^{N} L(y_n, f(x_n, w))$

Empirical Risk Minimization (ERM) Framework
- Find a function that minimizes the empirical risk; this is the fundamental assumption in inductive learning.
- For the classification problem, this leads to finding a function with the minimum training error.
- For the regression problem, it leads to the least-squares error method.
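As a concrete sketch (mine, not from the slides; assuming NumPy is available), the empirical risk is just the average loss over the training set, for either of the two loss functions above:

```python
import numpy as np

def empirical_risk(y, y_pred, loss):
    """R_emp = (1/N) * sum_n L(y_n, f(x_n, w))."""
    return np.mean([loss(t, p) for t, p in zip(y, y_pred)])

zero_one = lambda t, p: 0.0 if t == p else 1.0   # classification loss
squared  = lambda t, p: (t - p) ** 2             # regression loss

y_true = np.array([1, 0, 1, 1])
y_hat  = np.array([1, 1, 1, 0])
print(empirical_risk(y_true, y_hat, zero_one))   # 2 errors out of 4 -> 0.5
```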
Over-Fitting Problem
Over-Fitting
- Small training error (empirical risk), but large generalization error!
- Consider the problem of polynomial curve fitting:
  - Polynomials of sufficiently high degree can perfectly fit a given finite set of training data.
  - However, when applied to new (unseen) data, the prediction quality can be very poor.

Why Over-Fitting?
- Many possible causes: insufficient data, noise, etc. (a source of continuing debate).
- However, we know that over-fitting is closely related to model complexity (the expressive power of the learning machine).
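The polynomial curve-fitting case can be sketched in a few lines (my own illustration, assuming NumPy; the data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy samples of a smooth function

train_err = {}
for M in (3, 9):
    coef = np.polyfit(x, y, M)                            # degree-M least-squares fit
    train_err[M] = np.mean((np.polyval(coef, x) - y) ** 2)

print(train_err)   # the degree-9 fit drives training error to (nearly) zero
```

The degree-9 polynomial can interpolate the 10 noisy points almost exactly, so its training error is far below the degree-3 fit's; yet its predictions between and beyond the points are typically much worse, which is exactly the over-fitting effect described above.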
Over-Fitting Problem: Illustration (Bishop, 2006)

[Figure: polynomial fits for several degrees M. M: degree of polynomial; green: true function; red: least-squares estimate.]
How to Avoid Over-Fitting Problem
General Idea
- Penalize models with high complexity (Occam's Razor).

Regularization
- Add a regularization term to the risk functional, e.g., ridge regression:

$E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y_n - f(x_n, w)\}^2 + \frac{\lambda}{2} \|w\|^2$

SRM (Structural Risk Minimization) Principle
- Due to Vapnik (1998). With probability at least $1 - \eta$,

$R(w) \le R_{emp}(w) + \sqrt{\frac{h(\log(2N/h) + 1) - \log(\eta/4)}{N}}$

- h: capacity (VC dimension) of the set of functions.
Bayesian Methods
- Incorporate prior knowledge on the form of the functions via a prior distribution F(w).
- Final result: the predictive distribution F(y|D), where D is the training set, is obtained by marginalizing over w.

Remark.
1) The Bayesian framework gives probabilistic generative models.
2) There is a strong connection with regularization theory.
3) Kernels can be generated from generative models.
Motivation: Linear Regression
Primal Problem: use ridge regression.

$f(x, w) = w^T x = \sum_{i=1}^{D} w_i x_i$, where $w, x \in R^D$

$\min_w E(w) = \min_w \left\{ \frac{1}{2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 + \frac{\lambda}{2} w^T w \right\}$

Solution
- Setting the gradient to zero gives the normal equations:

$(X^T X + \lambda I) w = X^T y$

$w = (X^T X + \lambda I)^{-1} X^T y$

$f(x, w) = w^T x = y^T X (X^T X + \lambda I)^{-1} x$
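The closed-form primal solution is a one-liner in NumPy (a sketch of mine; the data, noise level, and $\lambda$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 50, 3, 0.1
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, N)

# Primal ridge solution: w = (X^T X + lam*I)^(-1) X^T y  (inverts a D x D matrix)
w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w)   # close to w_true for small noise and small lam
```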
Dual Problem
- Starting from $(X^T X + \lambda I) w = X^T y$, write

$w = -\frac{1}{\lambda} X^T (Xw - y) = X^T a$, where $a_n = -\frac{1}{\lambda} (w^T x_n - y_n)$

- Substituting $w = X^T a$ into E(w) gives

$E(a) = \frac{1}{2} a^T X X^T X X^T a - a^T X X^T y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T X X^T a$

$= \frac{1}{2} a^T K K a - a^T K y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T K a$

where $K = X X^T$ and $K_{nm} = x_n^T x_m = k(x_n, x_m)$.

- Minimizing over a gives

$a = (K + \lambda I)^{-1} y$

$f(x, w) = w^T x = a^T X x = k(x)^T (K + \lambda I)^{-1} y$

where $k(x)$ is the vector with elements $k_n(x) = x_n^T x$.
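A quick numerical check (my sketch, assuming NumPy) confirms that the primal and dual routes produce identical predictions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 40, 3, 0.5
X = rng.normal(size=(N, D))
y = X @ np.array([0.3, -1.0, 2.0]) + rng.normal(0, 0.1, N)

# Primal: invert a D x D matrix
w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual: invert an N x N Gram matrix K = X X^T
K = X @ X.T
a = np.linalg.solve(K + lam * np.eye(N), y)

x_new = rng.normal(size=D)
f_primal = w @ x_new
f_dual = (X @ x_new) @ a      # k(x)^T a, with k_n(x) = x_n^T x
print(f_primal, f_dual)       # identical up to numerical error
```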
Discussion
- In the primal formulation, we must invert a D x D matrix; in the dual formulation, an N x N matrix.
- The dual representation shows that the predicted value is a linear combination of the observed values, with weights given by the function k.

So Why Dual?
- Note that the solution of the dual problem is determined entirely by K.
- K is called the Gram matrix and is defined by the function k, called the kernel function.
- The major observation here is that we can solve the regression problem knowing only the Gram matrix K, or alternatively the kernel function k.
- We can generalize to other forms of functions if we define a new kernel function!
Beyond Linear Relations
Extension to Nonlinear Functions
- Apply a feature space transform

$\phi: x \to \phi(x)$

and define the set of functions

$f(x, w) = w^T \phi(x)$

where $\phi(x)$ is a vector of basis functions, for example polynomials of degree D.
- By using a feature space transform, we can extend the linear relation to nonlinear relations.
- These models are still linear models, since the function is linear in the unknowns (w).
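To make the "nonlinear in x, linear in w" point concrete, here is a small sketch of mine (assuming NumPy): ordinary least squares on polynomial basis functions recovers a quadratic exactly.

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = (1, x, x^2, ..., x^degree) for scalar inputs x."""
    return np.vstack([x ** d for d in range(degree + 1)]).T

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 2        # quadratic ground truth, no noise

Phi = poly_features(x, 2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear in w despite nonlinear features
print(w)   # recovers the coefficients (1.0, -2.0, 0.5)
```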
Problems with the Feature Space Transform
- Difficulty in finding an appropriate transform.
- Curse of dimensionality: the number of parameters rapidly increases.

So, Kernel Functions!
- Note that in the dual formulation, the only necessary information is the kernel function.
- A kernel function is defined as an inner product of two feature vectors.
- If we can find an appropriate kernel function, we can solve the problem without explicitly computing the feature space transform.
- Some kernel functions have the effect of working in an infinite-dimensional feature space.
Kernel Functions
A kernel is a function k that for all $x, z \in X$ satisfies

$k(x, z) = \phi(x)^T \phi(z)$

where $\phi$ is a mapping from X to a feature space F:

$\phi: x \to \phi(x) \in F$

Example: for $x, z \in R^2$, take $k(x, z) = (x^T z)^2$ with

$\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$

$\phi(x)^T \phi(z) = (x_1 z_1 + x_2 z_2)^2 = (x^T z)^2 = k(x, z)$
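This identity is easy to verify numerically (my sketch, assuming NumPy): the kernel value and the explicit feature-space inner product coincide.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """The same kernel computed directly, without the feature map."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), k(x, z))   # both equal (1*3 + 2*(-1))^2 = 1
```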
Characterization of Kernel Functions
How to Find a Kernel Function?
- First define a feature space transform, then define the kernel as an inner product in that space.
- Alternatively, there is a direct method to characterize kernels:

Characterization of Kernels (Shawe-Taylor and Cristianini, 2004)
- A function $k: X \times X \to R$, which is either continuous or has a finite domain, can be decomposed as $k(x, z) = \phi(x)^T \phi(z)$ if and only if it is a finitely positive semi-definite function; that is, for any choice of a finite set $\{x_1, \ldots, x_N\} \subseteq X$, the matrix $K = (k(x_n, x_m))$ is positive semi-definite.
- For the proof, see the reference (it uses the RKHS, Reproducing Kernel Hilbert Space, construction).
- An alternative characterization is given by Mercer's theorem.
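The positive semi-definiteness condition can be probed numerically on any finite point set (a sketch of mine, assuming NumPy; the negative example is my own choice, not from the slides):

```python
import numpy as np

def gram(kernel, xs):
    """Gram matrix K_nm = k(x_n, x_m) for a finite set of points."""
    return np.array([[kernel(a, b) for b in xs] for a in xs])

poly = lambda x, z: (x @ z) ** 2                     # a valid kernel
not_kernel = lambda x, z: -np.linalg.norm(x - z)     # symmetric, but not PSD in general

xs = [np.array(p) for p in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]]

eig_ok  = np.linalg.eigvalsh(gram(poly, xs))         # all eigenvalues >= 0
eig_bad = np.linalg.eigvalsh(gram(not_kernel, xs))   # has a negative eigenvalue
print(eig_ok.min(), eig_bad.min())
```

Of course, passing this check on one finite set does not prove a function is a kernel; the characterization requires it for every finite set.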
Examples of Kernel Functions
Example

$k(x, z) = (x^T z + c)^M$

$k(x, z) = \exp(-\|x - z\|^2 / s^2)$

$k(x, z) = p(x)\, p(z)$

$k(A, B) = 2^{|A \cap B|}$

- 1st: polynomial kernel; 2nd: Gaussian kernel.
- 3rd: kernel derived from a generative model, where p(x) is a probability.
- 4th: kernel defined on the power set of a given set S.
- There are many known techniques for constructing new kernels from existing kernels; see the reference.
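Two of these are easy to write down directly (my sketch, assuming NumPy; the bandwidth parameter is arbitrary):

```python
import numpy as np

def gaussian_kernel(x, z, s=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    x, z = np.asarray(x), np.asarray(z)
    return np.exp(-np.sum((x - z) ** 2) / s ** 2)

def set_kernel(A, B):
    """Set kernel k(A, B) = 2^{|A & B|}: counts the subsets common to A and B."""
    return 2 ** len(set(A) & set(B))

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))   # 1.0 at zero distance
print(set_kernel({1, 2, 3}, {2, 3, 4}))          # |{2,3}| = 2, so 2^2 = 4
```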
Kernel in Practice
In practical applications, you can choose a kernel that reflects the similarity between two objects. Note that

$\|\phi(x) - \phi(z)\|^2 = \phi(x)^T \phi(x) + \phi(z)^T \phi(z) - 2\, \phi(x)^T \phi(z) = k(x, x) + k(z, z) - 2\, k(x, z)$

Hence, if appropriately normalized, the kernel represents the similarity between two objects in some feature space.

Remark.
1) Kernel trick: develop a learning algorithm based on inner products, then replace the inner product with a kernel (e.g., regression, classification, etc.).
2) Generalized distance: we can generalize the notion of a kernel to the case where it represents dissimilarity in some feature space (conditionally positive semi-definite kernels). Then we can use such kernels in learning algorithms based on distances between objects (e.g., clustering, nearest neighbor, etc.).
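The distance identity above means feature-space distances never require the explicit map; a small sketch of mine (assuming NumPy), reusing the degree-2 polynomial kernel:

```python
import numpy as np

def kernel_distance(k, x, z):
    """Squared feature-space distance from kernel values only:
    ||phi(x) - phi(z)||^2 = k(x,x) + k(z,z) - 2 k(x,z)."""
    return k(x, x) + k(z, z) - 2 * k(x, z)

k_poly = lambda x, z: (x @ z) ** 2                                   # degree-2 polynomial kernel
phi = lambda x: np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])   # its explicit feature map

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
d_kernel  = kernel_distance(k_poly, x, z)
d_feature = np.sum((phi(x) - phi(z)) ** 2)
print(d_kernel, d_feature)   # equal: the distance needs no explicit phi
```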
Support Vector Machines
Two-Class Classification Problem
- Given a training set $\{(x_n, y_n)\}_{n=1}^{N}$, where $y_n \in \{-1, +1\}$, find a function

$f(x) = w^T \phi(x) + b$

that satisfies $f(x_n) > 0$ for all points having $y_n = +1$ and $f(x_n) < 0$ for all points having $y_n = -1$.
- Equivalently, $y_n f(x_n) > 0$ for all n.
Support Vector Machines: Linearly Separable Case
Linearly Separable Case
- The case in which we can find such a function f(x).
- Here the training points are separated by a hyperplane (the separating hyperplane) f(x) = 0 in the feature space.
- There can be infinitely many such functions.

Margin
- The margin is the distance between the hyperplane and the closest point.
Maximum Margin Classifiers
- Find the hyperplane with the maximum margin.
- Why maximum margin?
  - Recall SRM: the maximum margin hyperplane corresponds to the case with the smallest capacity (Vapnik, 2000).
  - So it is the solution when we choose the SRM framework.
Formulation: Quadratic Programming

$\min\ \frac{1}{2} \|w\|^2$

subject to $y_n (w^T \phi(x_n) + b) \ge 1$ for all $n = 1, \ldots, N$.

1) The parameters are normalized so that the closest points satisfy $y_n f(x_n) = 1$.
2) The margin then equals $1 / \|w\|$.
Dual Formulation

$\max\ \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n,m} a_n a_m y_n y_m k(x_n, x_m)$

subject to

$a_n \ge 0$ for all $n = 1, \ldots, N$

$\sum_{n=1}^{N} a_n y_n = 0$

1) Obtained by applying Lagrangian duality.
2) $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$
3) The hyperplane found is

$f(x) = \sum_{n=1}^{N} a_n y_n k(x, x_n) + b$
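The dual QP is usually handed to a dedicated solver; purely as an illustration (my own sketch, not a method from the slides), a minimal SMO-style coordinate ascent solves it on a tiny linearly separable set, using a large C to approximate the hard-margin case:

```python
import numpy as np

def smo_svm(X, y, C=1e3, tol=1e-5, max_passes=20):
    """Minimal SMO-style solver for the SVM dual with a linear kernel.
    Optimizes pairs (a_i, a_j), keeping sum_n a_n y_n = 0 at every step.
    Returns dual variables a and bias b. Illustration only."""
    N = len(y)
    K = X @ X.T
    a, b = np.zeros(N), 0.0
    passes, sweeps = 0, 0
    while passes < max_passes and sweeps < 1000:
        sweeps += 1
        changed = 0
        for i in range(N):
            Ei = (a * y) @ K[:, i] + b - y[i]
            if (y[i] * Ei < -tol and a[i] < C) or (y[i] * Ei > tol and a[i] > 0):
                j = (i + 1) % N                      # deterministic partner choice
                Ej = (a * y) @ K[:, j] + b - y[j]
                ai, aj = a[i], a[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                a[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
                if abs(a[j] - aj) < 1e-9:
                    continue
                a[i] += y[i] * y[j] * (aj - a[j])    # preserves sum_n a_n y_n = 0
                b1 = b - Ei - y[i]*(a[i]-ai)*K[i, i] - y[j]*(a[j]-aj)*K[i, j]
                b2 = b - Ej - y[i]*(a[i]-ai)*K[i, j] - y[j]*(a[j]-aj)*K[j, j]
                b = b1 if 0 < a[i] < C else b2 if 0 < a[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return a, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
a, b = smo_svm(X, y)
pred = np.sign((a * y) @ (X @ X.T) + b)   # f(x_n) = sum_m a_m y_m k(x_n, x_m) + b
print(pred, a)                            # correct labels; most a_n near zero (sparsity)
```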
Discussion
- KKT conditions:

$a_n \ge 0, \quad y_n f(x_n) - 1 \ge 0, \quad a_n \{ y_n f(x_n) - 1 \} = 0$

- So $a_n > 0$ only if $y_n f(x_n) = 1$.
  - Such vectors are called support vectors.
  - Note the maximum margin hyperplane depends only on the support vectors: sparsity.
- Note that to solve the dual problem, we only need the kernel function k, so we need not consider the feature space transform explicitly.
- The form of the maximum margin hyperplane shows that the prediction is given by a combination of observations (with weights given by kernels), specifically the support vectors.
Example (Bishop, 2006)
Gaussian kernels are used here.
Support Vector Machines: Overlapping Classes
Overlapping Classes
- Introduce slack variables.
- The results are almost the same except for additional constraints. For details, see the reference.
SVM for Regression
$\epsilon$-insensitive Error Function

$E_\epsilon(f(x) - y) = \begin{cases} 0 & \text{if } |f(x) - y| < \epsilon \\ |f(x) - y| - \epsilon & \text{otherwise} \end{cases}$
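The loss is trivial to implement (my sketch, assuming NumPy), and makes the "tube" behavior explicit: residuals smaller than $\epsilon$ cost nothing.

```python
import numpy as np

def eps_insensitive(residual, eps):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps:
    errors inside the epsilon-tube are ignored entirely."""
    return np.maximum(np.abs(residual) - eps, 0.0)

r = np.array([-1.0, -0.3, 0.0, 0.2, 0.8])
print(eps_insensitive(r, eps=0.5))   # [0.5, 0.0, 0.0, 0.0, 0.3]
```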
Formulation

$\min\ C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|w\|^2$

subject to, for all $n = 1, \ldots, N$:

$y_n \le w^T \phi(x_n) + b + \epsilon + \xi_n$

$y_n \ge w^T \phi(x_n) + b - \epsilon - \hat{\xi}_n$

$\xi_n, \hat{\xi}_n \ge 0$
Solution
- Similar to SVM for classification: use the Lagrangian dual. The solution is given by

$f(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n)\, k(x, x_n) + b$

- By considering the KKT conditions, we can show that the dual variables are positive only if the corresponding point is either on the boundary of or outside the $\epsilon$-tube. So sparsity results.
Summary
Classification and Regression Based on Kernels
- Dual formulation: extension to arbitrary kernels.
- Sparsity: support vectors.

Some Limitations of SVM
- Choice of kernel.
- Solution algorithm: efficiency of solving large-scale QP problems.
- Multi-class problems.
Related Topics
Relevance Vector Machines
- Use prior knowledge on the distribution of the functions (parameters):

$p(y \mid x, w, \beta) = N(y \mid w^T \phi(x), \beta^{-1})$

$p(w \mid \alpha) = \prod_{i=1}^{M} N(w_i \mid 0, \alpha_i^{-1})$

- Posterior of w (Φ is the design matrix with rows $\phi(x_n)^T$):

$p(w \mid y, X, \alpha, \beta) = N(w \mid m, \Sigma), \quad m = \beta\, \Sigma \Phi^T y, \quad \Sigma = (\mathrm{diag}(\alpha_i) + \beta\, \Phi^T \Phi)^{-1}$

- Marginal likelihood:

$p(y \mid X, \alpha, \beta) = \int p(y \mid X, w, \beta)\, p(w \mid \alpha)\, dw$

- Choose $\alpha, \beta$ that maximize the marginal likelihood. Then, using them, find the predictive distribution of y given a new value x by using the posterior of w.
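The posterior formulas above are a few lines of NumPy (my sketch; here the hyperparameters $\alpha, \beta$ are simply fixed rather than optimized by maximizing the marginal likelihood):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, beta = 30, 3, 100.0                 # beta = noise precision (fixed here)
Phi = rng.normal(size=(N, M))             # design matrix of basis function values
w_true = np.array([2.0, 0.0, -1.0])
y = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta), N)

alpha = np.ones(M)                        # one precision hyperparameter per weight

# Posterior over w: Sigma = (diag(alpha) + beta*Phi^T Phi)^(-1), m = beta*Sigma*Phi^T y
Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ y
print(m)   # posterior mean, shrunk toward 0 by the prior
```

In the full RVM, maximizing the marginal likelihood drives many $\alpha_i \to \infty$, pruning the corresponding weights; that is the source of its sparsity.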
Gaussian Processes
- A Gaussian process is a collection of random variables $\{f(x)\}_{x \in X}$ such that, for any finite set of points $\{x_1, \ldots, x_N\}$, the values $\{f(x_n)\}_{n=1}^{N}$ jointly have a Gaussian distribution.
- Usually, due to lack of prior knowledge, the mean is taken to be 0; the covariance is defined by a kernel function k.
- The regression problem, given a set of observations $\{(x_n, y_n)\}_{n=1}^{N}$, reduces to finding the conditional distribution of y at new inputs.
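For zero-mean GP regression with observation noise, the conditional (posterior) mean at new inputs has a closed form; a minimal sketch of mine (assuming NumPy; kernel, lengthscale, and noise level are arbitrary choices):

```python
import numpy as np

def gp_predict(X, y, X_new, kernel, noise=1e-2):
    """Zero-mean GP regression: the posterior mean at X_new is
    k_*^T (K + noise*I)^(-1) y, with covariances given by the kernel."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    K_star = np.array([[kernel(a, b) for b in X] for a in X_new])
    return K_star @ np.linalg.solve(K + noise * np.eye(len(X)), y)

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 0.5)   # squared-exponential kernel

X = np.linspace(0, 1, 8).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
X_new = np.array([[0.5]])
print(gp_predict(X, y, X_new, rbf))   # close to sin(pi) = 0
```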
References
General introductory material for machine learning
[1] Pattern Recognition and Machine Learning by C. M. Bishop, Springer, 2006. Very well written book with an emphasis on Bayesian methods.

Fundamentals of statistical learning theory and kernel methods
[2] Statistical Learning Theory by V. Vapnik, John Wiley and Sons, 1998.
[3] The Nature of Statistical Learning Theory, 2nd Ed. by V. Vapnik, Springer, 2000.
Both books deal with essentially the same topic, but in [3] mathematical details are kept to a minimum, while [2] gives all the details. The origin of SVM.

Kernel engineering
[4] Kernel Methods for Pattern Analysis by J. Shawe-Taylor and N. Cristianini, Cambridge University Press, 2004. Deals with various kernel methods, with applications to problems with texts, sequences, trees, etc.

Gaussian processes
[5] Gaussian Processes for Machine Learning by C. Rasmussen and C. Williams, MIT Press, 2006. Presents an up-to-date survey of Gaussian processes and related topics.