Upload
kesia
View
32
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Introduction to Predictive Learning . LECTURE SET 7 Support Vector Machines. Electrical and Computer Engineering. 1. OUTLINE. Objectives explain motivation for SVM describe basic SVM for classification & regression compare SVM vs. statistical & NN methods - PowerPoint PPT Presentation
Citation preview
11
Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 7
Support Vector Machines
2
OUTLINE• Objectives
explain motivation for SVMdescribe basic SVM for classification & regression compare SVM vs. statistical & NN methods
• Motivation for margin-based loss• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression• Summary and Discussion
3
MOTIVATION for SVM• Recall ‘conventional’ methods:
- model complexity ~ dimensionality- nonlinear methods multiple local minima- hard to control complexity
• ‘Good’ learning method:(a) tractable optimization formulation(b) tractable complexity control(1-2 parameters)(c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods• SVM solution approach
4
SVM APPROACH• Linear approximation in Z-space using
special adaptive loss function• Complexity independent of dimensionality
x g x
z wz ˆ y
555
Motivation for Nonlinear Methods1. Nonlinear learning algorithm proposed using
‘reasonable’ heuristic arguments.reasonable ~ statistical or biological
2. Empirical validation + improvement3. Statistical explanation (why it really works)Examples: statistical, neural network methods.In contrast, SVM methods have been originally
proposed under VC theoretic framework.
6
OUTLINE• Objectives• Motivation for margin-based loss
Loss functions for regressionLoss functions for classificationPhilosophical interpretation
• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression• Summary and Discussion
7
Main Idea• Model complexity controlled by a special
loss function used for fitting training data• Such empirical loss functions may be
different from the loss functions used in a learning problem setting
• Such loss functions are adaptive, i.e. can adapt their complexity to particular data set
• Different loss functions for different learning problems (classification, regression etc)
• Model complexity(VC-dim.) is controlled independently of the number of features
8
Robust Loss Function for Regression• Squared loss ~ motivated by
large sample settingsparametric assumptions Gaussian noise
• For practical settings better to use linear loss
9
Epsilon-insensitive Loss for Regression
0,|),(|max)),(,( xx fyfyL
• Can also control model complexity (~ falsifiability)
10
Empirical Comparison• Univariate regression• Squared, linear and SVM loss (with ):
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-1
-0.5
0
0.5
1
1.5
2
x
y
xy ]1,0[x )36.0,0(~ N
0.6
• Red ~ target function, Dotted ~ estimate using squared loss, Dashed ~ linear loss, Dashed-dotted ~ SVM loss
11
Empirical Comparison (cont’d)• Univariate regression• Squared, linear and SVM loss (with )• Test error (MSE) estimated for 5 independent
realizations of training data (4 training samples)
xy ]1,0[x )36.0,0(~ N
0.6
Squared loss
Least modulus loss
SVM loss with epsilon=0.6
1 0.024 0.134 0.067
2 0.128 0.075 0.063
3 0.920 0.274 0.041
4 0.035 0.053 0.032
5 0.111 0.027 0.005
Mean 0.244 0.113 0.042
St. Dev. 0.381 0.099 0.025
12
Loss Functions for Classification
• Decision rule • Quantity is analogous to residuals in regression• Common loss functions: 0/1 loss and linear loss• Properties of a good loss function?
~ continuous + convex + robust
),()( xx fsignD ),( xyf
13
Motivation for margin-based loss (1)
• Given: Linearly separable data How to construct linear decision boundary?
(a) Many linear decision boundaries (that have no errors)
14
Motivation for margin-based loss (2)
• Given: Linearly separable data Which linear decision boundary is better ?
The model with larger margin is more robust for future data
15
Largest-margin solution• All solutions explain the data well (zero error) All solutions ~ the same linear parameterization Larger margin ~ more confidence (larger falsifiability)
2M
16
Margin-based loss for classification
1
2
1y
1y
( , ( , )) max | ( , ) |, 0L y f yf x x( , )i i iy f x
SVM loss or hinge lossMinimization of slack variables
17
Margin-based loss for classification: margin size is adapted to training data
M argin
C lass +1 C lass -1
0),,(max)),(,( xx yffyL
18
Motivation: philosophical• Classical view: good model
explains the data + low complexity Occam’s razor (complexity ~ # parameters)• VC theory: good model
explains the data + low VC-dimension VC-falsifiability (small VC-dim ~ large falsifiability),
i.e. the goal is to find a model that:can explain training data / cannot explain other data
The idea: falsifiability ~ empirical loss function
19
Adaptive Loss Functions• Both goals (explanation + falsifiability) can
encoded into empirical loss function where- (large) portion of the data has zero loss- the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled adaptive loss fct
• For classification, the degree of falsifiability is ~ margin size (see below)
20
Margin-based loss for classification
0),,(max)),(,( xx yffyL 2Margin
21
Classification: non-separable data
),(_
xyfvariablesslack
0),,(max)),(,( xx yffyL
22
Margin based complexity control• Large degree of falsifiability is achieved by
- large margin (for classification)- small epsilon (for regression)
• For linear classifiers:larger margin smaller VC-dimension
~1 2 3 ... 1 2 3 ...h h h
23
-margin hyperplanes • Solutions provided by minimization of SVM loss can be
indexed by the value of margin SRM structure: for VC-dim.
• If data samples belong to a sphere of radius R, then the VC dimension bounded by
• For large margin hyperplanes, VC-dimension controlled independent of dimensionality d.
1),/min( 22 dRh
1 2 3 ... 1 2 3 ...h h h
24
SVM Model Complexity
• Two ways to control model complexity-via model parameterization use fixed loss function:
-via adaptive loss function: use fixed (linear) parameterization
• ~ Two types of SRM structures• Margin-based loss can be motivated by
Popper’s falsifiability
)),(,( xfyL)),(,( xfyL
),( xf
bf )(),( xwx
25
Margin-based loss: summary• Classification:
falsifiability controlled by margin
• Regression:
falsifiability controlled by
• Single class learning:
falsifiability controlled by radius r
NOTE: the same interpretation/ motivation for margin-based loss for different types of learning problems.
0),,(max)),(,( xx yffyL
0,|),(|max)),(,( xx fyfyL
0,|)|max)),(( rfLr axx
26
OUTLINE• Objectives• Motivation for margin-based loss• Linear SVM Classifiers
- Primal formulation (linearly separable case)- Dual optimization formulation- Soft-margin SVM formulation
• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression• Summary and Discussion
27
Optimal Separating Hyperplane
Distance btwn hyperplane and sample Margin Shaded points are SVsw/1
wx /)'(f
28
Optimization Formulation• Given training data • Find parameters of linear hyperplane
that minimizeunder constraints
• Quadratic optimization with linear constraintstractable for moderate dimensions d
• For large dimensions use dual formulation:- scales better with n (rather than d)- uses only dot products
bf xwx 25.0 ww
1 by ii xw
ni ,...,1
x i ,yi b,w
29
From Optimization Theory:• For a given convex minimization problem
with convex inequality constraints there exists an equivalent dual unconstrained maximization formulation with nonnegative Lagrange multipliers
• Karush-Kuhn-Tucker (KKT) conditions:Lagrange coefficients only for samples that satisfy the original constraint
with equality ~ SV’s have positive Lagrange coefficients
1 by ii xw
0* i
0i
30
Convex Hull Interpretation of Dual
Find convex hulls for each class. The closest points to an optimal hyperplane are support vectors
31
Dual Optimization Formulation• Given training data • Find parameters of an opt. hyperplane
as a solution to maximization problem
under constraints
• Note: data samples with nonzero are SV’s• Formulation requires only inner products
ni ,...,1
x i ,yi ** ,bi
n
iiii byD
1
** xxx
max21
1,1
n
jijijiji
n
ii yy xxL
,01
n
iiiy i 0
*i
'xx
32
Support Vectors• SV’s ~ training samples with non-zero loss• SV’s are samples that falsify the model • The model depends only on SVs SV’s ~ robust characterization of the data
WSJ Feb 27, 2004:About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Han. This means that the election is left in the hands of one-fifth of the voters.
33
Support Vectors• SVM test error bound:
small # SV’s ~ good generalization• Can be explained using LOO cross-validation
• SVM generalization can be related to data compression
#_ support _ vectors_
nE
E Test error
34
Soft-Margin SVM formulation
min21 2
1
wn
iiC
iii by 1xw
Minimize:
under constraints
bf xwx)(
f x( ) = +1
f x( ) = -1
f x( ) = 0
x1 = 1 - f x1( )
x1
x 2
x 3
x2 =1 - f x2( )
x3 = 1+ f x3( )
35
SVM Dual Formulation• Given training data • Find parameters of an opt. hyperplane
as a solution to maximization problem
under constraints
• Note: data samples with nonzero are SVs• Formulation requires only inner products
ni ,...,1
x i ,yi ** ,bi
n
iiii byD
1
** xxx
max21
1,1
n
jijijiji
n
ii yy xxL
,01
n
iiiy Ci 0
*i
'xx
36
OUTLINE• Objectives• Motivation for margin-based loss• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression• Summary and Discussion
37
Nonlinear Decision Boundary• Fixed (linear) parameterization is too rigid• Nonlinear curved margin may yield larger margin
(falsifiability) and lower error
38
Nonlinear Mapping via KernelsNonlinear f(x,w) + margin-based loss = SVM• Nonlinear mapping to feature z space• Linear in z-space ~ nonlinear in x-space• But ~ symmetric fct Compute dot product via kernel analytically
x g x
z wz ˆ y
1
( ) ( ) ,m
i j k i k j i jk
g g
z z x x x x
39
Example of Kernel Function• 2D input space • Mapping to z space (2-d order polynomial)
• Can show by direct substitution that for two input vectors
Their dot product is calculated analytically
x x1 ,x2
11 z 12 2xz 23 2xz 214 2 xxz 215 xz 2
26 xz
21,uuu 21,vvv
2( ) ( ) , 1G G u v u v u v
40
SVM Formulation (with kernels)• Replacing leads to: • Find parameters of an optimal
hyperplane as a solution to maximization problem
under constraints • Given: the training data
an inner product kernelregularization parameter C
** ,bi
n
iiii bKyD
1
** ,xxx
max,21
1,1
n
jijijiji
n
ii Kyy xxL
,01
n
iiiy nCi /0
xxzz ,K
x i ,yi xx ,K
ni ,...,1
41
Examples of KernelsKernel is a symmetric function satisfying general
(Mercer’s) conditions. Examples of kernels for different mappings xz• Polynomials of degree m
• RBF kernel(width parameter)
• Neural Networksfor given parameters
Automatic selection of the number of hidden units (SV’s)
xx ,K
mK 1', xxxx
2
2'exp,
xx
xxK
avK )'(tanh, xxxxav,
42
More on Kernels• The kernel matrix has all info (data + kernel)
K(1,1) K(1,2)…….K(1,n)K(2,1) K(2,2)…….K(2,n)………………………….K(n,1) K(n,2)…….K(n,n)
• Kernel defines a distance in some feature space (aka kernel-induced feature space)
• Kernel parameter controls nonlinearity• Kernels can incorporate a priori knowledge• Kernels can be defined over complex
structures (trees, sequences, sets, etc.)
43
New insights provided by SVM
• Why linear classifiers can generalize?(1) Margin is large (relative to R)(2) % of SV’s is small(3) ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection) i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
44
OUTLINE• Objectives• Motivation for margin-based loss• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples
- Model Selection- Histogram of Projections- SVM Extensions and Modifications
• SVM for Regression• Summary and Discussion
45
SVM Model Selection• The quality of SVM classifiers depends on
proper tuning of model parameters:- kernel type (poly, RBF, etc)- kernel complexity parameter- regularization parameter C
• Note: VC-dimension depends on both C and kernel parameters
• These parameters are usually selected via x-validation, by searching over wide range of parameter values (on the log-scale)
46
SVM Example 1: Ripley’s data• Ripley’s data set:
- 250 training samples- SVM using RBF kernel- model selection via 10-fold cross-validation
• Cross-validation error table:
optimal C = 1,000, gamma = 1Note: may be multiple optimal parameter values
2( , ) exp u v u v
gamma C= 0.1 C= 1 C= 10 C= 100 C= 1000 C= 10000=2-3 98.4% 23.6% 18.8% 20.4% 18.4% 14.4%=2-2 51.6% 22% 20% 20% 16% 14%=2-1 33.2% 19.6% 18.8% 15.6% 13.6% 14.8%=20 28% 18% 16.4% 14% 12.8% 15.6%=21 20.8% 16.4% 14% 12.8% 16% 17.2%=22 19.2% 14.4% 13.6% 15.6% 15.6% 16%=23 15.6% 14% 15.6% 16.4% 18.4% 18.4%
47
Optimal SVM Model• RBF SVM with optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using1,000 test samples)
-1.5 -1 -0.5 0 0.5 1-0.2
0
0.2
0.4
0.6
0.8
1
1.2
x1
x2
48
SVM Example 2: Noisy Hyperbolas• Noisy Hyperbolas data set:
- 100 training samples (50 per class)- 100 validation samples (used for parameter tuning)RBF SVM model: Poly SVM model (5-th degree):
• Which model is ‘better’?• Model interpretation?
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
49
SVM Example 3: handwritten digits• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
- 1,000 training samples (500 per class)- 1,000 validation samples (used for parameter tuning)- 1,866 test samples
• Each sample is a real-valued vector of size 28*28=784:
• RBF SVM: optimal parameters C=1,
28 pixels
28 pixels
82
50
How to visualize high-dim SVM model?• Histogram of projections for linear SVM:
- project training data onto normal vector w (of SVM model)- show univariate histogram of projected training samples
• On the histogram: ‘0’~ decision boundary, -1/+1 ~ margins• Similar histograms can be obtained for nonlinear SVM
51
Histogram of Projections for Digits Data• Projections of training/test data onto normal direction of RBF SVM decision boundary:
Training data Test data
-1.5 -1 -0.5 0 0.5 1 1.50
50
100
150
200
250
300
350
400
450
500
-1.5 -1 -0.5 0 0.5 1 1.50
20
40
60
80
100
120
140
160
180
52
Practical Issues for SVM Classifiers
• Pre-processing all inputs pre-scaled to the range [0,1] or [-1,+1]
• Model Selection (parameter tuning) • SVM Extensions
- multi-class problems - unbalanced data sets- unequal misclassification costs
53
SVM for multi-class problems
• Digit recognition ~ ten-class problem:- estimate 10 binary classifiers
(one digit vs the rest)• For prediction:
a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected
54
Unbalanced Settings and Unequal Costs
• Unbalanced settings: - different number of positive and negative samples- different prior probabilities for training /test data
• Different Misclassification Costs- two types of errors, FP and FN- Cost (false_positive) vs. Cost(false_negative)- Loss function:- these ‘costs’ need to be specified a priori, based on application requirements minPCPC fnfp
55
SVM Modifications• The Problem:
- How to modify standard SVM formulation? • Unbalanced Data + Unequal Costs
where
• In practice, need to specify
minCCclassi
iclassi
i
2
21 w
S
S
posfalseCostC
negfalseCostC
CCC ,
56
Example: SVM with unequal costs• Ripley’s data set (as before) where
- negative samples ~ ‘triangles’- given misclassification costs
Note: boundary shifted away from positive samples/ 3 :1C C
-1.5 -1 -0.5 0 0.5 1-0.2
0
0.2
0.4
0.6
0.8
1
1.2
x1
x2
57
SVM Applications
• Handwritten digit recognition • Face detection in unrestricted
images • Text/ document classification• Image classification and retrieval• …….
58
Handwritten Digit Recognition (mid-90’s)
• Data set: postal images (zip-code), segmented, cropped;
~ 7K training samples, and 2K test samples
• Data encoding: 28x28 grey scale pixel image
• Original motivation: Compare SVM with custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach 10 SVM classifiers (one per each digit)
59
Digit Recognition Results• Summary
- prediction accuracy better than custom NN’s- accuracy does not depend on the kernel type- 100 – 400 support vectors per class (digit)
• More details Type of kernel No. of Support Vectors Error% Polynomial 274 4.0RBF 291 4.1Neural Network 254 4.2
• ~ 80-90% of SV’s coincide (for different kernels)• Reduced-set SVM (Burges, 1996) ~ 15 per class
60
Document Classification (Joachims, 1998)
• The Problem: Classification of text documents in large data bases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection) – relies on a good indexing scheme. This is time-consuming and costly
• Predictive Learning Approach (SVM): construct a classifier using all possible features (words)
• Document/ Text Representation: individual words = input features (possibly weighted)
• SVM performance:– Very promising (~ 90% accuracy vs 80% by other classifiers)– Most problems are linearly separable use linear SVM
61
Image Data Mining (Chapelle et al, 1999)• Example image data:• Classification of images in
data bases, for image indexing etc
• DATA SETCorel photo images:2670 samples divided into 7 classes: airplanes, birds, fish, vehicles etc.
Training data: 1375 images; Test data: 1375 images (50%)
MAIN ISSUE: invariant representation/ data encoding
62
OUTLINE• Objectives• Motivation for margin-based loss• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression
- SV Regression formulation- Dual optimization formulation- Model selection- Example: Boston Housing
• Summary and Discussion
63
General SVM Modeling Approach1 For linear model
minimize SVM functional
using SVM loss suitable for the learning problem at hand
2 Transform (1) to dual optimization formulation (using only dot products)
3 Use kernels to obtain nonlinear version of (2).
Note: this approach is used for all learning problems. However, tunable parameters of margin-based loss are different for various types of learning problems
21( , , ) ( , , )2SVM n emp nR b C R b w Z w w Z
( , )f b x w x
64
SVM Regression• For linear model
minimize SVM functional
where empirical loss (for regression) is given by
• Two distinct ways to control model complexity:- by the value of C (with fixed epsilon)- by the value of epsilon (with fixed large C)
• SVM regression tunes both epsilon and C for optimal performance
21( , , ) ( , , )2SVM n emp nR b C R b w Z w w Z
( , )f b x w x
0,|),f(|max)),f(,( xx yyL
65
Linear SVM regression
0,|),(|max)),(,( xx fyfyL
bf xwx ),(
min),(21),,( 2 nempnSVM RCbR ZwZw
n
iiinemp fyL
nR
1
)),(,(1),( xZ
For linear parameterization
SVM regression functional:
where
66
Direct Optimization Formulation
x i ,yi
x
y
*2
1
Minimize
n
iiiC
1
*)()(21 ww
niybby
ii
iii
iii
,...,1,0,)(
)(
*
*
xwxw
Under constraints
Given training data
ni ,...,1
67
Dual Formulation for SVM Regressionx i ,yi
And the values of
Under constraints
Given training data ni ,...,1
C,
Find coefficients which maximize
)())(()()(),(,
ji
n
jijjiiii
n
iiii
n
iii y xx
111 21 L
n
ii
n
ii
11
Ci 0Ci 0
niii ,...,1,, **
Yields the following solution*** )()()( bf i
SViii
xxx
68
Example: RBF regression
RBF estimate (dashed line) usingSVM model uses only 5 SV’s (out of the 40 points)
2000,16.0 C
2
21
exp0.20
mj
jj
x cf x w
,w
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.2
0
0.2
0.4
0.6
0.8
1
x
y
69
Example: decomposition of RBF model
Weighted sum of 5 RBF kernel fcts gives the SVM model
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-4
-3
-2
-1
0
1
2
3
4
x
y
70
SVM Model Selection: General• Setting/ tuning of SVM hyper-parameters
- usually performed by experts- more recently, by non-expert practitioners
• Issues for SVM model selection (1) parameters controlling the ‘margin’ size(2) kernel type and kernel complexity
• Strategies for model selection- exhaustive search in the parameter space(via resampling)- efficient search using VC analytic bounds- rule-of-thumb analytic strategies (for a particular type of learning problem)
71
Model Selection: continued• Parameters controlling margin size
- for classification, parameter C- for regression, the value of epsilon- for single-class learning, the radius
• Complexity control ~ the fraction of SV’s( -SVM)- for classification, replace C with - for regression, specify the fraction of points allowed to lie outside -insensitive zone
• For very sparse data (d/n>>1) use linear SVM
]1,0[
72
Parameter Selection for SVM Regression• Selection of parameter C
Recall the SVM solution where and with bounded kernels (RBF)
• Selection of
in general, (noise level)But this does not reflect dependency on sample sizeFor linear regression: suggesting
• The final prescription
Ci *0 niCi ,...,1,0 *
minmax yyC
~
nxy
22
/
n
nnln3
*** )()()( bf iSVi
ii
xxx
73
Effect of SVM parameters on test error• Training data
univariate Sinc(x) function
with additive Gaussian noise (sigma=0.2)(a) small sample size 50 (b) large sample size 200
02
46
810
00.2
0.40.6
0
0.05
0.1
0.15
0.2
C/n
Prediction Risk
epsilon0
24
68
10
00.2
0.40.6
0
0.05
0.1
0.15
0.2
C/n
Prediction Risk
epsilon 02
46
810
00.2
0.40.6
0
0.05
0.1
0.15
0.2
C/n
Prediction Risk
epsilon
]10,10[)sin()( xx
xxt
74
SVM vs Regularization• System imitation SVM• System identification regularization
But their risk functionals ‘look similar’
Recent claims: SVM = special case of regularization
• These claims neglect the role of margin loss
2
1 21)),(,(),( wxw
n
iiiSVM fyLCbR
2
1
2)),((),( wxw
n
iiireg fybR
75
Comparison for Classification• Linear SVM vs Penalized LDA – comparison is fair• Data Sets: small (20 samples per class) large (100
samples per class)
76
Comparison results: classification• Small sample size:
Linear SVM yields 0.5% - 1.1% error ratePenalized LDA yields 2.8% - 3% error
• Large sample size:Linear SVM yields 0.4% - 1.1% error ratePenalized LDA yields 1.1% - 2.2% error
• Conclusion: margin based complexity control is better than regularization
77
Comparison for regression• Linear SVM vs linear ridge regressionNote: Linear SVM has 2 parameters
• Sparse Data Set: 30 noisy samples, using target function
corrupted with gaussian noise with• Complexity Control:
- for RR vary regularization parameter- for SVM ~ epsilon and C parameters
54321 0002)( xxxxxt x 5]1,0[x2.0
78
Control for ridge regression• Coefficient shrinkage for ridge regression
-8 -6 -4 -2 0 2 4 6 8-0.5
0
0.5
1
1.5
2
2.5
Log(Lambda)
Coe
ffici
ents
w1
w2 w2 w2
w3
w4
w5
79
Complexity control for SVM• Coefficient shrinkage for SVM:(a)Vary C (epsilon=0) (b)Vary epsilon(C=large)
0 0.5 1 1.5 2-0.5
0
0.5
1
1.5
2
2.5
epsilon
Coe
ffici
ents
w1
w3
w4
w5
w2
-8 -6 -4 -2 0 2 4 6 8-0.5
0
0.5
1
1.5
2
2.5
Log(n/C)
Coe
ffici
ents
w1
w2
w3
w4
w5
80
Comparison: ridge regression vs SVM • Sparse setting: n=10, noise• Ridge regression: chosen by cross-validation• SV Regression:
C selected by cross-validation • Ave Risk (100 realizations): 0.44 (RR) vs 0.37 (SVM)
2.0
2.0
Ridge SVM 10
-2
10-1
100
101
Ris
k (M
SE
)
81
OUTLINE• Objectives• Motivation for margin-based loss• Linear SVM Classifiers• Nonlinear SVM Classifiers• Practical Issues and Examples• SVM for Regression• Summary and Discussion
82
Summary• Direct approach different formulations• Margin-based loss: robust, controls
complexity (falsifiability)• SRM: new type of structure• Nonlinear feature selection (~ SV’s):
incorporated into model estimation• Appropriate applications
- high-dimensional data- content-based /content-dependent