Efficient Model Selection for Support Vector Machines
Shibdas Bandyopadhyay
Outline
• Brief Introduction to SVM
• Cross-Validation
• Methods for Parameter Tuning
• Grid Search
• Genetic Algorithm
• Auto-tuning for Classification
• Results
• Conclusion
• Pattern Search for Regression
Support Vector Machines
• Classification
- Given a set (x1, y1), (x2, y2), …, (xm, ym) ∈ X × Y, where X = set of input vectors and Y = set of classes, we are to predict the class y for an unseen x ∈ X
• Regression
- Given a set (x1, y1), (x2, y2), …, (xm, ym) ∈ X × Y, where X = set of input vectors and Y = set of values, we are to predict the value y for an unseen x ∈ X

[Figures: a separating hyperplane with normal w between classes A+ and A−; an ε-tube fitted around the regression data]
Support Vector Machines
• Kernels
- A kernel maps linearly non-separable data into a higher-dimensional feature space, where the data may become linearly separable
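To make this concrete, here is a minimal NumPy sketch (added here, not part of the original deck) of the RBF kernel that the later slides tune; the helper name and toy data are illustrative.

```python
# Minimal sketch of an RBF kernel matrix; gamma is the kernel width
# parameter that the model-selection methods below will tune.
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X1[i] - X2[j]||^2)."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.random.randn(5, 2)
K = rbf_kernel(X, X, gamma=0.5)   # 5x5 Gram matrix, ones on the diagonal
```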
Support Vector Classification
• Soft Margin Classifier
Optimization Problem
minimize
$$\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i$$
subject to
$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, m$$
where C (> 0) is the trade-off between margin maximization and training error minimization
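As a hedged illustration of the role of C (scikit-learn is assumed here; the slides do not name an implementation), the following sketch trains soft-margin classifiers with different trade-off values on synthetic data.

```python
# Larger C penalizes training errors more heavily; smaller C favors a
# wider margin. The dataset and values below are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=9, random_state=0)
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", C=C, gamma=0.1).fit(X, y)
    print(C, clf.n_support_.sum())   # typically fewer support vectors as C grows
```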
10-fold Cross-Validation

[Figure: the training set split into 10 folds; in each run, 9 folds are used for training and the remaining fold for testing]
Cross-Validation
• Widely regarded as the best method to measure the generalization error (test error)
• Training set is divided into p folds
• Training runs are done using all possible combinations of (p – 1) training folds
• Testing is done on the remaining fold for each run
• We are to find the parameter values for which average cross-validation error is minimum
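A sketch of how the average cross-validation error for one parameter setting might be computed; scikit-learn and its built-in breast-cancer data are stand-ins here, not the 100-realization benchmark used in the slides.

```python
# 10-fold cross-validation error for a single (C, gamma) setting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(kernel="rbf", C=10, gamma=0.01), X, y, cv=10)
cv_error = 1.0 - scores.mean()   # average 10-fold cross-validation error
print(f"10-fold CV error: {cv_error:.4f}")
```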
Model Parameter Selection
• Consider RBF kernel and SVM Classification (Soft Margin Case)
• The RBF kernel is given by
$$K(x_1, x_2) = e^{-\gamma\|x_1 - x_2\|^2}, \qquad \gamma > 0$$
• Two parameters: C (the soft-margin trade-off) and γ (of the kernel)
• Benchmark dataset – Breast Cancer (100 realizations)
• Changing the parameters changes the test error, for example:
  for C = 10, γ = 1: mean error 33.6364, std 4.9616
  for C = 1, γ = 10: mean error 28.53, std 4.3150
• Parameters should be chosen such that the test error is minimal
Approach

[Flowchart: raw data and a parameter range feed the parameter-selection module, which repeatedly passes candidate (C, γ) values to an SVM classifier and receives the misclassification error back; the optimal C and γ are then used by a final SVM classifier to produce the final results]
Methods for parameter tuning
• Grid Search
• Genetic Algorithm
• Auto-tuning for Classification
Grid Search
[Figure: two-dimensional parameter space with 1 ≤ C ≤ 5000 and 1 ≤ γ ≤ 1000; the best coarse grid point (C, γ) = (2857, 571.5) defines the refined ranges 2142.8 ≤ C ≤ 3571.4 and 428.5 ≤ γ ≤ 714.3]
Grid Search
• A simple technique resembling exhaustive search
• Take exponentially increasing values in a particular range
• Find the set with minimum Cross-validation error
• Adjust the new range in the neighborhood of that chosen set
• Repeat the process until a satisfactory value for cross-validation error is obtained
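The following coarse-to-fine sketch implements the procedure just described (exponentially spaced values, then a refined grid around the best point); the library, ranges, and helper name are illustrative choices, not the authors' code.

```python
# Coarse-to-fine grid search over (C, gamma) using 10-fold CV error.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def cv_error(C, gamma, cv=10):
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=cv).mean()

# Coarse pass: exponentially increasing values in a wide range.
Cs, gammas = 2.0 ** np.arange(-5, 16, 2), 2.0 ** np.arange(-15, 4, 2)
best = min((cv_error(C, g), C, g) for C in Cs for g in gammas)

# Fine pass: refine the grid in the neighborhood of the best coarse point.
err, C0, g0 = best
Cs = C0 * 2.0 ** np.linspace(-1, 1, 5)
gammas = g0 * 2.0 ** np.linspace(-1, 1, 5)
best = min((cv_error(C, g), C, g) for C in Cs for g in gammas)
print("best (error, C, gamma):", best)
```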
Genetic Algorithm
• Genetic Algorithm is a subclass of “Evolutionary Computing”
• It is based on Darwin’s theory of evolution
• Widely accepted for parameter search and optimization
• Has a high probability of finding the global optimum
Genetic Algorithm - Steps
• Selection – “Survival of the fittest”: choose the set of parameter values for which the objective function is optimal
• Cross-over – combine the chosen values
• Mutation – modify the combined values to produce the next generation
Genetic Algorithm – Selection
• Set a criterion for choosing the parents which will cross over
• For example, two individuals (binary strings) are selected, with 1's preferred over 0's:

[Figure: a population of binary strings; strings containing more 1's, e.g. 11110 and 10011, are selected as parents]
Genetic Algorithm – Cross-over
• Combine the chosen parents to produce the offspring
• For example, two parents represented as binary strings performing cross-over:

[Figure: two parent binary strings exchange bit segments at a crossover point to produce two offspring]
Genetic Algorithm - Mutation
• The structure of the produced offspring is changed
• This prevents the algorithm from being trapped in a local minimum
• For example, the produced offspring is mutated (one bit position is flipped):

[Figure: one bit of the offspring is flipped, e.g. 01100 → 00100]
Genetic Algorithm - Coding
• Parameters are to be coded into strings before applying GA
• Real-coded GA operates on real numbers
• Simulates the cross-over and mutation through various operators
• Simulated Binary Cross-over and polynomial mutation operators are used
Auto-tuning
• Consider a bound for the expected generalization error
• Try to minimize it by varying the parameters
• Apply well-known minimization procedures to make this “automatic”
Generalization Error Estimates
• Validation Error
  - Keep a part of the training data for validation
  - Find the error while performing tests on the validation set
  - Try to minimize the error on that set
• Leave-One-Out Error
  - Keep one element of the training data set for testing
  - Do training on the remaining elements
  - Test the element which was previously removed
  - Do this for all training data elements
  - Provides an unbiased estimate of the expected generalization error
Leave-One-Out Bounds
• Span Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\alpha_p^0 S_p^2 - 1\right)$$
where $S_p$ is the distance between the point $x_p$ and the set $\Lambda_p = \left\{\sum_{i \ne p,\ \alpha_i^0 > 0} \lambda_i x_i \;:\; \sum_{i \ne p} \lambda_i = 1,\ \alpha_i^0 + y_i y_p \lambda_i \alpha_p^0 \ge 0\right\}$
• Radius-margin Bound
$$T = \frac{1}{l}\,\frac{R^2}{M^2}$$
where R is the radius of the smallest sphere enclosing all data points and M is the margin obtained from the SVM optimization solution
Why Radius-margin Bound?
• It can be thought of as an upper bound on the span-bound, which is an accurate estimate of the test error
• Minimization of the span-bound is more difficult to implement and to control (more local minima)
• The margin can be obtained from the solution of the SVM optimization problem
• The radius can be calculated by solving a quadratic optimization problem
• Soft-margin SVM can easily be incorporated by modifying the kernel of the hard-margin version, so that C is treated as just another parameter of the kernel function
Auto-tuning - Steps
• M = 1 / ‖w‖, where ‖w‖ can be obtained by solving the problem:
maximize
$$W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j k(x_i, x_j)$$
subject to
$$\sum_{i=1}^{m}\alpha_i y_i = 0, \qquad \alpha_i \ge 0$$
• R is obtained by solving the quadratic optimization problem:
maximize
$$R^2 = \sum_{i=1}^{l}\beta_i K(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i\beta_j K(x_i, x_j)$$
subject to
$$\sum_{i=1}^{l}\beta_i = 1, \qquad \beta_i \ge 0$$
Auto-tuning - Steps

Let θ = the set of parameters. The steps are as follows:
1. Initialize θ to some value
2. Using SVM, find the maximum of W:
$$\alpha^0(\theta) = \arg\max_{\alpha} W(\alpha, \theta)$$
3. Update θ by a minimization method such that T is minimized
4. Go to step 2, or stop when the minimum of T is achieved
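A sketch of evaluating the radius-margin objective T = R²/(l·M²) for one θ = (γ), assuming scikit-learn and SciPy (the slides name no implementation); a very large C stands in for the hard-margin case, and `radius_margin_bound` is a hypothetical helper name.

```python
# Evaluate the radius-margin bound R^2 * ||w||^2 / l for an RBF SVM,
# since M = 1/||w|| implies R^2 / M^2 = R^2 * ||w||^2.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def radius_margin_bound(X, y, gamma):
    K = rbf_kernel(X, X, gamma=gamma)
    l = len(y)

    # ||w||^2 = sum_ij (y_i a_i)(y_j a_j) K_ij from the dual solution.
    clf = SVC(kernel="precomputed", C=1e6).fit(K, y)   # large C ~ hard margin
    sv = clf.support_
    yd = clf.dual_coef_[0]                             # y_i * alpha_i on SVs
    w2 = yd @ K[np.ix_(sv, sv)] @ yd

    # R^2: maximize sum_i b_i K_ii - sum_ij b_i b_j K_ij over the simplex.
    def neg_r2(b):
        return -(b @ np.diag(K) - b @ K @ b)
    res = minimize(neg_r2, np.full(l, 1.0 / l), method="SLSQP",
                   bounds=[(0, 1)] * l,
                   constraints={"type": "eq", "fun": lambda b: b.sum() - 1})
    return -res.fun * w2 / l
```

In practice this value would be minimized over θ with a gradient or direct-search method, as in step 3 above.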
Results
• Methods are tested on five benchmark datasets
• Mean error, minimum error among 100 realizations, maximum error among 100 realizations, and standard deviation are reported
• Breast-Cancer Dataset
• Thyroid Dataset
• Titanic Dataset
• Heart Dataset
• Diabetes Dataset
Classification Results – Breast Cancer Dataset
• Number of train patterns: 200
• Number of test patterns: 77
• Input dimension: 9
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 26.04 | – | – | 4.74 |
| Grid Search | 27.22 | 14.58 | 36.36 | 4.75 |
| Auto-tuning | 27.47 | 16.88 | 36.36 | 3.97 |
| Genetic Algorithm | 25.40 | 15.58 | 33.77 | 4.39 |
Classification Results – Thyroid Dataset
• Number of train patterns: 140
• Number of test patterns: 75
• Input dimension: 5
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 4.80 | – | – | 2.19 |
| Grid Search | 4.32 | 0 | 8.00 | 1.74 |
| Auto-tuning | 4.56 | 0 | 9.333 | 2.02 |
| Genetic Algorithm | 4.44 | 0 | 10.667 | 2.43 |
Classification Results – Titanic Dataset
• Number of train patterns: 150
• Number of test patterns: 2051
• Input dimension: 3
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 22.42 | – | – | 1.02 |
| Grid Search | 23.08 | 21.55 | 33.21 | 1.18 |
| Auto-tuning | 23.01 | 20.87 | 33.21 | 1.33 |
| Genetic Algorithm | 22.66 | 21.69 | 33.21 | 1.11 |
Classification Results – Heart Dataset
• Number of train patterns: 170
• Number of test patterns: 100
• Input dimension: 13
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 15.95 | – | – | 3.26 |
| Grid Search | 15.49 | 8.00 | 23.00 | 3.29 |
| Auto-tuning | 15.65 | 8.00 | 23.00 | 3.21 |
| Genetic Algorithm | 15.87 | 10.00 | 25.00 | 3.27 |
Classification Results – Diabetes Dataset
• Number of train patterns: 468
• Number of test patterns: 300
• Input dimension: 8
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 23.53 | – | – | 1.73 |
| Grid Search | 23.14 | 19.33 | 26.67 | 1.17 |
| Auto-tuning | 23.68 | 19.33 | 27.33 | 1.68 |
| Genetic Algorithm | 23.69 | 19.00 | 28.33 | 1.71 |
Conclusion
• Grid Search is the best technique when the number of parameters is low, as it does an exhaustive search of the parameter space
• Auto-tuning requires far fewer training runs in all cases
• Genetic Algorithm is quite steady and gives near-optimal solutions
• Future work would be to test these techniques for regression
• Analysis of pattern search method for regression
Support Vector Regression
• Regression Estimate
$$f(x) = \sum_{i=1}^{l}\left(\alpha_i^* - \alpha_i\right) K(x_i, x) + b$$
• Optimization Problem
maximize
$$W(\alpha, \alpha^*) = -\epsilon\sum_{i=1}^{m}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{m}\left(\alpha_i^* - \alpha_i\right)y_i - \frac{1}{2}\sum_{i,j=1}^{m}\left(\alpha_i^* - \alpha_i\right)\left(\alpha_j^* - \alpha_j\right)k(x_i, x_j)$$
subject to
$$0 \le \alpha_i, \alpha_i^* \le C, \qquad \sum_{i=1}^{m}\left(\alpha_i - \alpha_i^*\right) = 0$$
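A minimal usage sketch of ε-SVR (scikit-learn assumed, as elsewhere in these inserts); epsilon is the width of the tube from the earlier figure and C the usual trade-off parameter in the dual above.

```python
# Fit an epsilon-SVR to noisy sine data and predict an unseen point.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
print(reg.predict([[0.5]]))    # approximately sin(0.5)
```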
Pattern Search
• Simple and efficient optimization technique
• No derivatives, only direct function evaluations are needed
• It rarely gets trapped in a bad local minimum
• Converges rapidly to an optimum
Pattern Search
• Patterns determine which points on the parameter space are searched
• The pattern is usually specified by a matrix. We have considered the matrix
$$P = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}$$
which corresponds to the pattern of trial points (x+d, y), (x−d, y), (x, y+d), (x, y−d) around the current center (x, y), obtained from
$$x_k^i = x_k + d_k\, c_k^i$$
where $c_k^i$ is the i-th column of the pattern matrix
Pattern Search - Algorithm
Cross-validation error is the function to be minimized
1. Fix a pattern matrix P_k; set s_k = 0
2. Given Δ_k and ε, randomly pick an initial pattern center q_k
3. Compute the function value f(q_k) and set min ← f(q_k)
4. If Δ_k < ε then stop.
   For i = 1 … (p − 1), where p is the number of columns in P_k:
   compute q_k^i = q_k + Δ_k c_k^i; if f(q_k^i) < min then set q_{k+1} = q_k^i, min ← f(q_k^i), and Δ_{k+1} = Δ_k; otherwise set Δ_{k+1} = Δ_k / 2
5. Go to step 2
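A direct implementation sketch of this loop, minimizing a black-box function over two parameters such as (log C, log γ); the compass pattern and the step-halving rule follow the algorithm above, while the names, defaults, and the toy objective are illustrative (in practice f would be the cross-validation error).

```python
# Compass pattern search: probe four neighbors, move on improvement,
# halve the step when no neighbor improves, stop when the step is tiny.
import numpy as np

def pattern_search(f, q, d=1.0, eps=1e-3, max_iter=200):
    # Columns of the pattern matrix: the four compass directions.
    P = np.array([[1, 0, -1, 0], [0, 1, 0, -1]], dtype=float)
    fmin = f(q)
    for _ in range(max_iter):
        if d < eps:                      # step size small enough: stop
            break
        improved = False
        for c in P.T:                    # trial points q + d * c
            trial = q + d * c
            ft = f(trial)
            if ft < fmin:                # move the pattern center
                q, fmin, improved = trial, ft, True
        if not improved:                 # no better point: shrink the step
            d /= 2.0
    return q, fmin

# Example: minimize a smooth surrogate of a CV-error surface.
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
print(pattern_search(f, np.array([0.0, 0.0])))
```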
Genetic Algorithm - Implementation
• Simulated Binary Cross-over
- u_i is chosen randomly between 0 and 1
- β_i follows the distribution
$$p(\beta_i) = \begin{cases} 0.5\,(n_c + 1)\,\beta_i^{\,n_c}, & \text{if } \beta_i \le 1 \\ 0.5\,(n_c + 1)\,\dfrac{1}{\beta_i^{\,n_c + 2}}, & \text{otherwise} \end{cases}$$
- find $\beta_{qi}$ such that the cumulative probability density up to $\beta_{qi}$ is $u_i$:
$$\int_0^{\beta_{qi}} p(\beta)\, d\beta = u_i$$
Genetic Algorithm – Implementation (Cont…)
- Generate the offspring $x_i^{(1,t+1)}$ and $x_i^{(2,t+1)}$ from the parents $x_i^{(1,t)}$ and $x_i^{(2,t)}$:
$$x_i^{(1,t+1)} = 0.5\left[(1 + \beta_{qi})\,x_i^{(1,t)} + (1 - \beta_{qi})\,x_i^{(2,t)}\right]$$
$$x_i^{(2,t+1)} = 0.5\left[(1 - \beta_{qi})\,x_i^{(1,t)} + (1 + \beta_{qi})\,x_i^{(2,t)}\right]$$
• Polynomial Mutation
- A random number $r_i$ is selected between 0 and 1
- $\delta_i$ is found such that the cumulative probability of the polynomial distribution up to $\delta_i$ is $r_i$. The polynomial distribution can be written as:
$$P(\delta) = 0.5\,(\eta_m + 1)\,(1 - |\delta|)^{\eta_m}$$
- Mutated offspring are obtained using the following rule:
$$y_i^{(1,t+1)} = x_i^{(1,t+1)} + \left(x_i^{(U)} - x_i^{(L)}\right)\delta_i$$
where $x_i^{(U)}$ and $x_i^{(L)}$ are respectively the upper and lower bounds on $x_i$
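The two operators above have closed-form inverses of their cumulative distributions, which a sketch can exploit directly; the function names below are hypothetical, and n_c and eta_m are the usual distribution indices.

```python
# Simulated binary crossover (SBX) and polynomial mutation for a
# real-coded GA, by inverting the cumulative distributions analytically.
import numpy as np

rng = np.random.default_rng(0)

def sbx(x1, x2, n_c=2):
    u = rng.random(len(x1))
    # CDF of p(beta): 0.5*beta^(n_c+1) for beta <= 1, 1 - 0.5/beta^(n_c+1) above.
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1 / (n_c + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (n_c + 1)))
    c1 = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
    c2 = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    return c1, c2

def polynomial_mutation(x, lo, hi, eta_m=20):
    r = rng.random(len(x))
    # Invert the CDF of P(delta) = 0.5*(eta_m+1)*(1-|delta|)^eta_m on [-1, 1].
    delta = np.where(r < 0.5,
                     (2 * r) ** (1 / (eta_m + 1)) - 1,
                     1 - (2 * (1 - r)) ** (1 / (eta_m + 1)))
    return np.clip(x + (hi - lo) * delta, lo, hi)

# Example: cross two (C, gamma) parents coded as real vectors, then mutate.
p1, p2 = np.array([10.0, 0.1]), np.array([100.0, 1.0])
c1, c2 = sbx(p1, p2)
child = polynomial_mutation(c1, lo=np.array([1e-2, 1e-4]), hi=np.array([1e4, 1e1]))
```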
LOO Bounds
• Jaakkola-Haussler Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\alpha_p^0\, K(x_p, x_p) - 1\right)$$
where $\alpha_p^0$ are the α's obtained from the solution of the SVM optimization problem in the case of testing with the p-th training example, θ is the step function ($\theta(x) = 1$ when $x > 0$ and $\theta(x) = 0$ otherwise), and l is the number of elements in the training set.
• Opper-Winther Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\frac{\alpha_p^0}{\left(K_{SV}^{-1}\right)_{pp}} - 1\right)$$
where $K_{SV}$ is the matrix of dot products of the support vectors.
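A sketch of evaluating the Jaakkola-Haussler bound from a single fitted SVM (scikit-learn assumed, with a large C approximating the hard-margin case; the helper name is hypothetical).

```python
# Jaakkola-Haussler LOO bound: (1/l) * sum_p theta(alpha_p * K(x_p, x_p) - 1).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def jaakkola_haussler_bound(X, y, gamma):
    K = rbf_kernel(X, X, gamma=gamma)
    clf = SVC(kernel="precomputed", C=1e6).fit(K, y)   # large C ~ hard margin
    alpha = np.zeros(len(y))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])    # alpha_p^0
    # Mean of the step function: the fraction of points the bound flags.
    return np.mean(alpha * np.diag(K) - 1 > 0)
```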
Support Vector Classification
• Finds the optimal hyper-plane which separates the two classes in feature space
• Decision Function
$$f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{m} y_i\,\alpha_i\, k(x, x_i) + b\right)$$
• Quadratic Optimization Problem
minimize
$$\tau(w) = \frac{1}{2}\,\|w\|^2, \qquad w \in H,\ b \in \mathbb{R}$$
subject to
$$y_i(\langle w, x_i \rangle + b) \ge 1 \quad \text{for all } i = 1, \ldots, m$$