Efficient Model Selection for Support Vector Machines
Shibdas Bandyopadhyay
Outline
• Brief Introduction to SVM
• Cross-Validation
• Methods for Parameter Tuning
• Grid Search
• Genetic Algorithm
• Auto-tuning for Classification
• Results
• Conclusion
• Pattern Search for Regression
Support Vector Machines
• Classification
- Given a set (x1, y1), (x2, y2), …, (xm, ym) ∈ X × Y, where X = set of input vectors and Y = set of classes, we are to predict the class y for an unseen x ∈ X
• Regression
- Given a set (x1, y1), (x2, y2), …, (xm, ym) ∈ X × Y, where X = set of input vectors and Y = set of values, we are to predict the value y for an unseen x ∈ X

[Figures: a separating hyperplane with normal w between classes A+ and A−; an ε-tube fitted around the regression data]
Support Vector Machines
• Kernels
- A kernel maps linearly non-separable data into a higher-dimensional feature space, where the data may become linearly separable
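To make this concrete, here is a minimal NumPy sketch (added here, not part of the original deck) of the RBF kernel that the later slides tune; the helper name and toy data are illustrative.

```python
# Minimal sketch of an RBF kernel matrix; gamma is the kernel width
# parameter that the model-selection methods below will tune.
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||X1[i] - X2[j]||^2)."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

X = np.random.randn(5, 2)
K = rbf_kernel(X, X, gamma=0.5)   # 5x5 Gram matrix, ones on the diagonal
```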
Support Vector Classification
• Soft Margin Classifier
Optimization Problem
minimize
$$\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i$$
subject to
$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, m$$
where C (> 0) is the trade-off between margin maximization and training error minimization
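As a hedged illustration of the role of C (scikit-learn is assumed here; the slides do not name an implementation), the following sketch trains soft-margin classifiers with different trade-off values on synthetic data.

```python
# Larger C penalizes training errors more heavily; smaller C favors a
# wider margin. The dataset and values below are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=9, random_state=0)
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", C=C, gamma=0.1).fit(X, y)
    print(C, clf.n_support_.sum())   # typically fewer support vectors as C grows
```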
10-fold Cross-Validation

[Figure: the training set split into 10 folds; in each run, 9 folds are used for training and the remaining fold for testing]
Cross-Validation
• Widely regarded as the best method to measure the generalization error (test error)
• Training set is divided into p folds
• Training runs are done using all possible combinations of (p – 1) training folds
• Testing is done on the remaining fold for each run
• We are to find the parameter values for which average cross-validation error is minimum
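A sketch of how the average cross-validation error for one parameter setting might be computed; scikit-learn and its built-in breast-cancer data are stand-ins here, not the 100-realization benchmark used in the slides.

```python
# 10-fold cross-validation error for a single (C, gamma) setting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(SVC(kernel="rbf", C=10, gamma=0.01), X, y, cv=10)
cv_error = 1.0 - scores.mean()   # average 10-fold cross-validation error
print(f"10-fold CV error: {cv_error:.4f}")
```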
Model Parameter Selection
• Consider RBF kernel and SVM Classification (Soft Margin Case)
• The RBF kernel is given by
$$K(x_1, x_2) = e^{-\gamma\|x_1 - x_2\|^2}, \qquad \gamma > 0$$
• Two parameters: C (the soft-margin trade-off) and γ (of the kernel)
• Benchmark dataset – Breast Cancer (100 realizations)
• Changing the parameters changes the test error, for example:
  for C = 10, γ = 1: mean error 33.6364, std 4.9616
  for C = 1, γ = 10: mean error 28.53, std 4.3150
• Parameters should be chosen such that the test error is minimal
Approach

[Flowchart: raw data and a parameter range feed the parameter-selection module, which repeatedly passes candidate (C, γ) values to an SVM classifier and receives the misclassification error back; the optimal C and γ are then used by a final SVM classifier to produce the final results]
Methods for parameter tuning
• Grid Search
• Genetic Algorithm
• Auto-tuning for Classification
Grid Search
[Figure: two-dimensional parameter space with 1 ≤ C ≤ 5000 and 1 ≤ γ ≤ 1000; the best coarse grid point (C, γ) = (2857, 571.5) defines the refined ranges 2142.8 ≤ C ≤ 3571.4 and 428.5 ≤ γ ≤ 714.3]
Grid Search
• A simple technique resembling exhaustive search
• Take exponentially increasing values in a particular range
• Find the set with minimum Cross-validation error
• Adjust the new range in the neighborhood of that chosen set
• Repeat the process until a satisfactory value for cross-validation error is obtained
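The following coarse-to-fine sketch implements the procedure just described (exponentially spaced values, then a refined grid around the best point); the library, ranges, and helper name are illustrative choices, not the authors' code.

```python
# Coarse-to-fine grid search over (C, gamma) using 10-fold CV error.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def cv_error(C, gamma, cv=10):
    return 1.0 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=cv).mean()

# Coarse pass: exponentially increasing values in a wide range.
Cs, gammas = 2.0 ** np.arange(-5, 16, 2), 2.0 ** np.arange(-15, 4, 2)
best = min((cv_error(C, g), C, g) for C in Cs for g in gammas)

# Fine pass: refine the grid in the neighborhood of the best coarse point.
err, C0, g0 = best
Cs = C0 * 2.0 ** np.linspace(-1, 1, 5)
gammas = g0 * 2.0 ** np.linspace(-1, 1, 5)
best = min((cv_error(C, g), C, g) for C in Cs for g in gammas)
print("best (error, C, gamma):", best)
```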
Genetic Algorithm
• Genetic Algorithm is a subclass of “Evolutionary Computing”
• It is based on Darwin’s theory of evolution
• Widely accepted for parameter search and optimization
• Has a high probability of finding the global optimum
Genetic Algorithm - Steps
• Selection – “Survival of the fittest”: choose the set of parameter values for which the objective function is optimal
• Cross-over – combine the chosen values
• Mutation – modify the combined values to produce the next generation
Genetic Algorithm – Selection
• Set a criterion for choosing the parents which will cross over
• For example, two individuals (binary strings) are selected, with 1's preferred over 0's:

[Figure: a population of binary strings; strings containing more 1's, e.g. 11110 and 10011, are selected as parents]
Genetic Algorithm – Cross-over
• Combine the chosen parents to produce the offspring
• For example, two parents represented as binary strings performing cross-over:

[Figure: two parent binary strings exchange bit segments at a crossover point to produce two offspring]
Genetic Algorithm - Mutation
• The structure of the produced offspring is changed
• This prevents the algorithm from being trapped in a local minimum
• For example, the produced offspring is mutated (one bit position is flipped):

[Figure: one bit of the offspring is flipped, e.g. 01100 → 00100]
Genetic Algorithm - Coding
• Parameters are to be coded into strings before applying GA
• Real-coded GA operates on real numbers
• Simulates the cross-over and mutation through various operators
• Simulated Binary Cross-over and polynomial mutation operators are used
Auto-tuning
• Consider a bound for the expected generalization error
• Try to minimize it by varying the parameters
• Apply well-known minimization procedures to make this “automatic”
Generalization Error Estimates
• Validation Error
  - Keep a part of the training data for validation
  - Find the error while performing tests on the validation set
  - Try to minimize the error on that set
• Leave-One-Out Error
  - Keep one element of the training data set for testing
  - Do training on the remaining elements
  - Test the element which was previously removed
  - Do this for all training data elements
  - Provides an unbiased estimate of the expected generalization error
Leave-One-Out Bounds
• Span Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\alpha_p^0 S_p^2 - 1\right)$$
where $S_p$ is the distance between the point $x_p$ and the set $\Lambda_p = \left\{\sum_{i \ne p,\ \alpha_i^0 > 0} \lambda_i x_i \;:\; \sum_{i \ne p} \lambda_i = 1,\ \alpha_i^0 + y_i y_p \lambda_i \alpha_p^0 \ge 0\right\}$
• Radius-margin Bound
$$T = \frac{1}{l}\,\frac{R^2}{M^2}$$
where R is the radius of the smallest sphere enclosing all data points and M is the margin obtained from the SVM optimization solution
Why Radius-margin Bound?
• It can be thought of as an upper bound on the span-bound, which is an accurate estimate of the test error
• Minimization of the span-bound is more difficult to implement and to control (more local minima)
• The margin can be obtained from the solution of the SVM optimization problem
• The radius can be calculated by solving a quadratic optimization problem
• Soft-margin SVM can easily be incorporated by modifying the kernel of the hard-margin version, so that C is treated as just another parameter of the kernel function
Auto-tuning - Steps
• M = 1 / ‖w‖, where ‖w‖ can be obtained by solving the problem:
maximize
$$W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j k(x_i, x_j)$$
subject to
$$\sum_{i=1}^{m}\alpha_i y_i = 0, \qquad \alpha_i \ge 0$$
• R is obtained by solving the quadratic optimization problem:
maximize
$$R^2 = \sum_{i=1}^{l}\beta_i K(x_i, x_i) - \sum_{i,j=1}^{l}\beta_i\beta_j K(x_i, x_j)$$
subject to
$$\sum_{i=1}^{l}\beta_i = 1, \qquad \beta_i \ge 0$$
Auto-tuning - Steps

Let θ = the set of parameters. The steps are as follows:
1. Initialize θ to some value
2. Using SVM, find the maximum of W:
$$\alpha^0(\theta) = \arg\max_{\alpha} W(\alpha, \theta)$$
3. Update θ by a minimization method such that T is minimized
4. Go to step 2, or stop when the minimum of T is achieved
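A sketch of evaluating the radius-margin objective T = R²/(l·M²) for one θ = (γ), assuming scikit-learn and SciPy (the slides name no implementation); a very large C stands in for the hard-margin case, and `radius_margin_bound` is a hypothetical helper name.

```python
# Evaluate the radius-margin bound R^2 * ||w||^2 / l for an RBF SVM,
# since M = 1/||w|| implies R^2 / M^2 = R^2 * ||w||^2.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def radius_margin_bound(X, y, gamma):
    K = rbf_kernel(X, X, gamma=gamma)
    l = len(y)

    # ||w||^2 = sum_ij (y_i a_i)(y_j a_j) K_ij from the dual solution.
    clf = SVC(kernel="precomputed", C=1e6).fit(K, y)   # large C ~ hard margin
    sv = clf.support_
    yd = clf.dual_coef_[0]                             # y_i * alpha_i on SVs
    w2 = yd @ K[np.ix_(sv, sv)] @ yd

    # R^2: maximize sum_i b_i K_ii - sum_ij b_i b_j K_ij over the simplex.
    def neg_r2(b):
        return -(b @ np.diag(K) - b @ K @ b)
    res = minimize(neg_r2, np.full(l, 1.0 / l), method="SLSQP",
                   bounds=[(0, 1)] * l,
                   constraints={"type": "eq", "fun": lambda b: b.sum() - 1})
    return -res.fun * w2 / l
```

In practice this value would be minimized over θ with a gradient or direct-search method, as in step 3 above.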
Results
• Methods are tested on five benchmark datasets
• Mean error, minimum error among 100 realizations, maximum error among 100 realizations, and standard deviation are reported
• Breast-Cancer Dataset
• Thyroid Dataset
• Titanic Dataset
• Heart Dataset
• Diabetes Dataset
Classification Results – Breast Cancer Dataset
• Number of train patterns: 200
• Number of test patterns: 77
• Input dimension: 9
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 26.04 | – | – | 4.74 |
| Grid Search | 27.22 | 14.58 | 36.36 | 4.75 |
| Auto-tuning | 27.47 | 16.88 | 36.36 | 3.97 |
| Genetic Algorithm | 25.40 | 15.58 | 33.77 | 4.39 |
Classification Results – Thyroid Dataset
• Number of train patterns: 140
• Number of test patterns: 75
• Input dimension: 5
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 4.80 | – | – | 2.19 |
| Grid Search | 4.32 | 0 | 8.00 | 1.74 |
| Auto-tuning | 4.56 | 0 | 9.333 | 2.02 |
| Genetic Algorithm | 4.44 | 0 | 10.667 | 2.43 |
Classification Results – Titanic Dataset
• Number of train patterns: 150
• Number of test patterns: 2051
• Input dimension: 3
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 22.42 | – | – | 1.02 |
| Grid Search | 23.08 | 21.55 | 33.21 | 1.18 |
| Auto-tuning | 23.01 | 20.87 | 33.21 | 1.33 |
| Genetic Algorithm | 22.66 | 21.69 | 33.21 | 1.11 |
Classification Results – Heart Dataset
• Number of train patterns: 170
• Number of test patterns: 100
• Input dimension: 13
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 15.95 | – | – | 3.26 |
| Grid Search | 15.49 | 8.00 | 23.00 | 3.29 |
| Auto-tuning | 15.65 | 8.00 | 23.00 | 3.21 |
| Genetic Algorithm | 15.87 | 10.00 | 25.00 | 3.27 |
Classification Results – Diabetes Dataset
• Number of train patterns: 468
• Number of test patterns: 300
• Input dimension: 8
• Output dimension: 1

| Methods | Mean Error | Min. Error | Max. Error | Standard Deviation |
|---|---|---|---|---|
| Benchmark | 23.53 | – | – | 1.73 |
| Grid Search | 23.14 | 19.33 | 26.67 | 1.17 |
| Auto-tuning | 23.68 | 19.33 | 27.33 | 1.68 |
| Genetic Algorithm | 23.69 | 19.00 | 28.33 | 1.71 |
Conclusion
• Grid Search is the best technique when the number of parameters is low, as it does an exhaustive search of the parameter space
• Auto-tuning requires far fewer training runs in all cases
• Genetic Algorithm is quite steady and gives near-optimal solutions
• Future work would be to test these techniques for regression
• Analysis of pattern search method for regression
Support Vector Regression
• Regression Estimate
$$f(x) = \sum_{i=1}^{l}\left(\alpha_i^* - \alpha_i\right) K(x_i, x) + b$$
• Optimization Problem
maximize
$$W(\alpha, \alpha^*) = -\epsilon\sum_{i=1}^{m}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{m}\left(\alpha_i^* - \alpha_i\right)y_i - \frac{1}{2}\sum_{i,j=1}^{m}\left(\alpha_i^* - \alpha_i\right)\left(\alpha_j^* - \alpha_j\right)k(x_i, x_j)$$
subject to
$$0 \le \alpha_i, \alpha_i^* \le C, \qquad \sum_{i=1}^{m}\left(\alpha_i - \alpha_i^*\right) = 0$$
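A minimal usage sketch of ε-SVR (scikit-learn assumed, as elsewhere in these inserts); epsilon is the width of the tube from the earlier figure and C the usual trade-off parameter in the dual above.

```python
# Fit an epsilon-SVR to noisy sine data and predict an unseen point.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
print(reg.predict([[0.5]]))    # approximately sin(0.5)
```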
Pattern Search
• Simple and efficient optimization technique
• No derivatives, only direct function evaluations are needed
• It rarely gets trapped in a bad local minimum
• Converges rapidly to an optimum
Pattern Search
• Patterns determine which points on the parameter space are searched
• The pattern is usually specified by a matrix. We have considered the matrix
$$P = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}$$
which corresponds to the pattern of trial points (x+d, y), (x−d, y), (x, y+d), (x, y−d) around the current center (x, y), obtained from
$$x_k^i = x_k + d_k\, c_k^i$$
where $c_k^i$ is the i-th column of the pattern matrix
Pattern Search - Algorithm
Cross-validation error is the function to be minimized
1. Fix a pattern matrix P_k; set s_k = 0
2. Given Δ_k and ε, randomly pick an initial pattern center q_k
3. Compute the function value f(q_k) and set min ← f(q_k)
4. If Δ_k < ε then stop.
   For i = 1 … (p − 1), where p is the number of columns in P_k:
   compute q_k^i = q_k + Δ_k c_k^i; if f(q_k^i) < min then set q_{k+1} = q_k^i, min ← f(q_k^i), and Δ_{k+1} = Δ_k; otherwise set Δ_{k+1} = Δ_k / 2
5. Go to step 2
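A direct implementation sketch of this loop, minimizing a black-box function over two parameters such as (log C, log γ); the compass pattern and the step-halving rule follow the algorithm above, while the names, defaults, and the toy objective are illustrative (in practice f would be the cross-validation error).

```python
# Compass pattern search: probe four neighbors, move on improvement,
# halve the step when no neighbor improves, stop when the step is tiny.
import numpy as np

def pattern_search(f, q, d=1.0, eps=1e-3, max_iter=200):
    # Columns of the pattern matrix: the four compass directions.
    P = np.array([[1, 0, -1, 0], [0, 1, 0, -1]], dtype=float)
    fmin = f(q)
    for _ in range(max_iter):
        if d < eps:                      # step size small enough: stop
            break
        improved = False
        for c in P.T:                    # trial points q + d * c
            trial = q + d * c
            ft = f(trial)
            if ft < fmin:                # move the pattern center
                q, fmin, improved = trial, ft, True
        if not improved:                 # no better point: shrink the step
            d /= 2.0
    return q, fmin

# Example: minimize a smooth surrogate of a CV-error surface.
f = lambda p: (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2
print(pattern_search(f, np.array([0.0, 0.0])))
```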
Genetic Algorithm - Implementation
• Simulated Binary Cross-over
- u_i is chosen randomly between 0 and 1
- β_i follows the distribution
$$p(\beta_i) = \begin{cases} 0.5\,(n_c + 1)\,\beta_i^{\,n_c}, & \text{if } \beta_i \le 1 \\ 0.5\,(n_c + 1)\,\dfrac{1}{\beta_i^{\,n_c + 2}}, & \text{otherwise} \end{cases}$$
- find $\beta_{qi}$ such that the cumulative probability density up to $\beta_{qi}$ is $u_i$:
$$\int_0^{\beta_{qi}} p(\beta)\, d\beta = u_i$$
Genetic Algorithm – Implementation (Cont…)
- Generate the offspring $x_i^{(1,t+1)}$ and $x_i^{(2,t+1)}$ from the parents $x_i^{(1,t)}$ and $x_i^{(2,t)}$:
$$x_i^{(1,t+1)} = 0.5\left[(1 + \beta_{qi})\,x_i^{(1,t)} + (1 - \beta_{qi})\,x_i^{(2,t)}\right]$$
$$x_i^{(2,t+1)} = 0.5\left[(1 - \beta_{qi})\,x_i^{(1,t)} + (1 + \beta_{qi})\,x_i^{(2,t)}\right]$$
• Polynomial Mutation
- A random number $r_i$ is selected between 0 and 1
- $\delta_i$ is found such that the cumulative probability of the polynomial distribution up to $\delta_i$ is $r_i$. The polynomial distribution can be written as:
$$P(\delta) = 0.5\,(\eta_m + 1)\,(1 - |\delta|)^{\eta_m}$$
- Mutated offspring are obtained using the following rule:
$$y_i^{(1,t+1)} = x_i^{(1,t+1)} + \left(x_i^{(U)} - x_i^{(L)}\right)\delta_i$$
where $x_i^{(U)}$ and $x_i^{(L)}$ are respectively the upper and lower bounds on $x_i$
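The two operators above have closed-form inverses of their cumulative distributions, which a sketch can exploit directly; the function names below are hypothetical, and n_c and eta_m are the usual distribution indices.

```python
# Simulated binary crossover (SBX) and polynomial mutation for a
# real-coded GA, by inverting the cumulative distributions analytically.
import numpy as np

rng = np.random.default_rng(0)

def sbx(x1, x2, n_c=2):
    u = rng.random(len(x1))
    # CDF of p(beta): 0.5*beta^(n_c+1) for beta <= 1, 1 - 0.5/beta^(n_c+1) above.
    beta = np.where(u <= 0.5,
                    (2 * u) ** (1 / (n_c + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (n_c + 1)))
    c1 = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
    c2 = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    return c1, c2

def polynomial_mutation(x, lo, hi, eta_m=20):
    r = rng.random(len(x))
    # Invert the CDF of P(delta) = 0.5*(eta_m+1)*(1-|delta|)^eta_m on [-1, 1].
    delta = np.where(r < 0.5,
                     (2 * r) ** (1 / (eta_m + 1)) - 1,
                     1 - (2 * (1 - r)) ** (1 / (eta_m + 1)))
    return np.clip(x + (hi - lo) * delta, lo, hi)

# Example: cross two (C, gamma) parents coded as real vectors, then mutate.
p1, p2 = np.array([10.0, 0.1]), np.array([100.0, 1.0])
c1, c2 = sbx(p1, p2)
child = polynomial_mutation(c1, lo=np.array([1e-2, 1e-4]), hi=np.array([1e4, 1e1]))
```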
LOO Bounds
• Jaakkola-Haussler Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\alpha_p^0\, K(x_p, x_p) - 1\right)$$
where $\alpha_p^0$ are the α's obtained from the solution of the SVM optimization problem in the case of testing with the p-th training example, θ is the step function ($\theta(x) = 1$ when $x > 0$ and $\theta(x) = 0$ otherwise), and l is the number of elements in the training set.
• Opper-Winther Bound
$$T = \frac{1}{l}\sum_{p=1}^{l}\theta\!\left(\frac{\alpha_p^0}{\left(K_{SV}^{-1}\right)_{pp}} - 1\right)$$
where $K_{SV}$ is the matrix of dot products of the support vectors.
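A sketch of evaluating the Jaakkola-Haussler bound from a single fitted SVM (scikit-learn assumed, with a large C approximating the hard-margin case; the helper name is hypothetical).

```python
# Jaakkola-Haussler LOO bound: (1/l) * sum_p theta(alpha_p * K(x_p, x_p) - 1).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def jaakkola_haussler_bound(X, y, gamma):
    K = rbf_kernel(X, X, gamma=gamma)
    clf = SVC(kernel="precomputed", C=1e6).fit(K, y)   # large C ~ hard margin
    alpha = np.zeros(len(y))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])    # alpha_p^0
    # Mean of the step function: the fraction of points the bound flags.
    return np.mean(alpha * np.diag(K) - 1 > 0)
```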
Support Vector Classification
• Finds the optimal hyper-plane which separates the two classes in feature space
• Decision Function
$$f(x) = \mathrm{sgn}\!\left(\sum_{i=1}^{m} y_i\,\alpha_i\, k(x, x_i) + b\right)$$
• Quadratic Optimization Problem
minimize
$$\tau(w) = \frac{1}{2}\,\|w\|^2, \qquad w \in H,\ b \in \mathbb{R}$$
subject to
$$y_i(\langle w, x_i \rangle + b) \ge 1 \quad \text{for all } i = 1, \ldots, m$$