Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Topic 3 • Maximum Margin
• Empirical Risk Minimization
• VC Dimension & Structural Risk Minimization
• Support Vector Machines
• Reduced Working Sets
• Sequential Minimal Optimization (SMO)
• Ellipsoidal Kernel Machines
• Maximum Relative Margin
• Other SVM Extensions…
Generative vs. Discriminative
• Generative approach to classification: get p(x,y), then p(y|x), then classify via y = f(x) = argmax_y p(y|x)
• For example, learn Gaussians by fitting each class with maximum likelihood:
$$\{x_1^+,\ldots,x_N^+\} \to p(x \mid y=+1) \qquad \{x_1^-,\ldots,x_M^-\} \to p(x \mid y=-1)$$
and use them as a classifier via the posterior:
$$p(y=+1 \mid x) = \frac{p(x \mid y=+1)\,p(y=+1)}{\sum_y p(x \mid y)\,p(y)}$$
• Vapnik: it is inefficient to learn a probability of everything if we only need a binary classifier y = f(x)
• Also, maximum likelihood tries to capture everything about the data, not just the classification decision
• So try just learning the best possible classifier discriminatively.
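As a concrete illustration of the generative route (data and parameters below are made up for the example), a minimal numpy sketch that fits one Gaussian per class by maximum likelihood and classifies through the posterior:

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance for one class (tiny ridge for stability)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def log_gaussian(x, mu, cov):
    """Log density of a multivariate Gaussian at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def predict(x, params_pos, params_neg, prior_pos=0.5):
    """y = argmax_y p(x|y) p(y): the Bayes classifier from the posterior."""
    lp = log_gaussian(x, *params_pos) + np.log(prior_pos)
    ln = log_gaussian(x, *params_neg) + np.log(1 - prior_pos)
    return +1 if lp > ln else -1

rng = np.random.default_rng(0)
Xp = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(100, 2))   # class y = +1
Xn = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))   # class y = -1
pp, pn = fit_gaussian(Xp), fit_gaussian(Xn)
print(predict(np.array([2.5, 0.0]), pp, pn))   # a point near the +1 mean
```

Note how much machinery (means, covariances, priors) is estimated just to draw one boundary — exactly Vapnik's objection.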
Large Margins • 1995: SVMs & VC Theory popularize large margin learning
• Other margin methods: Boosting, Max Margin Markov Nets, Max Margin Matrix Factorization, … • Other margin theories: Boosting, PAC-Bayes, Rademacher • Are large margins right? Can SDPs help? • Let’s re-visit SRM, VC, & SVMs…
Empirical vs. Structural Risk
• Example: want a linear classifier to separate two classes:
$$f(x;\theta) = \mathrm{sign}(w^T x + b) \;\text{ where }\; \theta = \{w, b\}$$
• Choose a loss function:
$$L(x, y, \theta) = \mathrm{step}\!\left(-y\,f(x;\theta)\right)$$
• Empirical Risk Minimization fits only to the training data:
$$R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_i, f(x_i;\theta)\right) \in [0,1]$$
• The empirical risk approximates the true risk (expected error):
$$R(\theta) = E_P\{L(x,y,\theta)\} = \int_{X\times Y} P(x,y)\,L(x,y,\theta)\,dx\,dy \in [0,1]$$
• But we don't know the true P(x,y), and in general
$$\arg\min_\theta R_{emp}(\theta) \neq \arg\min_\theta R(\theta)$$
• Instead, SVMs perform Structural Risk Minimization
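A quick numpy sketch of the empirical risk of a linear classifier under the 0/1 loss (toy data invented for the example):

```python
import numpy as np

def f(X, w, b):
    # f(x; theta) = sign(w^T x + b)
    return np.sign(X @ w + b)

def empirical_risk(X, y, w, b):
    # R_emp(theta) = (1/N) sum of 0/1 losses; always lies in [0, 1]
    return np.mean(f(X, w, b) != y)

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(empirical_risk(X, y, w=np.array([1.0, 0.0]), b=0.0))   # this w separates all four
```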
Empirical vs. Structural Risk
• ERM is inconsistent: minimizing $R_{emp}(\theta)$ is not guaranteed to minimize $R(\theta)$; typically $R(\hat\theta) \geq R_{emp}(\hat\theta)$, so we may do better on training than on test!
• SRM: add a prior or regularizer to $R_{emp}(\theta)$
• Define a capacity or confidence term $C(\theta)$ which favors simpler $\theta$:
$$J(\theta) = R_{emp}(\theta) + C(\theta)$$
• If $R(\theta) \leq J(\theta)$, we have a bound: $J$ is a guaranteed risk
• SRM: minimize $J(\theta)$; the guaranteed future error rate is $\leq \min_\theta J(\theta)$
Structural Risk Minimization & VC
• How to bound the risk? Learning theory…
• Theorem (Vapnik): with probability 1 − η the following holds:
$$R(\theta) \leq J(\theta) = R_{emp}(\theta) + \Phi\!\left(\frac{h}{N}, \frac{\ln\eta}{N}\right)$$
where N = number of data points and h = Vapnik-Chervonenkis (VC) dimension; more precisely:
$$R(\theta) \leq J(\theta) = R_{emp}(\theta) + \frac{2h\log\frac{2eN}{h} + 2\log\frac{4}{\eta}}{N}\left(1 + \sqrt{1 + \frac{N\,R_{emp}(\theta)}{h\log\frac{2eN}{h} + \log\frac{4}{\eta}}}\,\right)$$
• The VC dimension of linear classifiers in d dimensions is h = d + 1
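The bound is easy to evaluate numerically; a small sketch of the guarantee in the form reconstructed above (so treat the exact constants as an assumption):

```python
import numpy as np

def guaranteed_risk(r_emp, h, N, eta=0.05):
    """J(theta) >= R(theta) with probability 1 - eta, in the slide's form."""
    cap = (2 * h * np.log(2 * np.e * N / h) + 2 * np.log(4 / eta)) / N
    inner = 1 + np.sqrt(1 + N * r_emp / (h * np.log(2 * np.e * N / h) + np.log(4 / eta)))
    return r_emp + cap * inner

# the bound tightens as N grows and loosens as the VC dimension h grows
print(guaranteed_risk(0.05, h=10, N=1000))
print(guaranteed_risk(0.05, h=10, N=100000))
print(guaranteed_risk(0.05, h=100, N=1000))
```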
VC of Gap-Tolerant Classifiers
• Arbitrary linear classifiers are too flexible as a function class
• Can improve the estimate of the VC dimension if we restrict them
• Constrain linear classifiers to data living inside a sphere
• Gap-tolerant classifiers: a linear classifier whose activity is constrained to a sphere & outside a margin; only count errors in the shaded region, elsewhere L(x,y,θ) = 0
• If M is small relative to D, can still shatter 3 points
(Figure: sphere of diameter D with a margin band of width M; M = margin, D = diameter, d = dimensionality.)
VC of Gap-Tolerant Classifiers
• But, as M grows relative to D, can only shatter 2 points! (Figure: with a large margin, can shatter 2 points but can't shatter 3.)
• For hyperplanes, as M grows vs. D, we shatter fewer points!
• The VC dimension h goes down if the gap-tolerant classifier has a larger margin; the general formula is:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{D^2}{M^2}\right),\, d\right\} + 1$$
• Before, we just had h = d + 1. Now we have a smaller h
• If the data can be anywhere, D is infinite and we are back to h = d + 1
• Typically real data is bounded (by a sphere), so D is fixed
• Maximizing M reduces h, which improves the guaranteed risk J(θ)
• There is no way to modify the true risk R directly with θ; we can only minimize the bound J(θ)
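The h formula above is easy to compute; a tiny sketch:

```python
import math

def vc_gap_tolerant(D, M, d):
    """h <= min(ceil(D^2 / M^2), d) + 1 for a gap-tolerant linear classifier."""
    return min(math.ceil(D**2 / M**2), d) + 1

# with a tiny margin the sphere constraint does nothing: h = d + 1;
# growing the margin M relative to the diameter D shrinks h
print(vc_gap_tolerant(D=2.0, M=0.1, d=3))   # ceil(400) vs d=3 -> h = 4
print(vc_gap_tolerant(D=2.0, M=1.5, d=3))   # ceil(4/2.25) = 2  -> h = 3
```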
Support Vector Machines
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk J(θ)
• Assume first that the 2-class data is linearly separable:
$$\text{have } \{(x_1, y_1), \ldots, (x_N, y_N)\} \;\text{ where }\; x_i \in \mathbb{R}^D \text{ and } y_i \in \{-1, 1\}$$
• Decision boundary or hyperplane given by $f(x;\theta) = \mathrm{sign}(w^T x + b)$, i.e. $w^T x + b = 0$
• Note: can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• Want the unique widest one ⇒ max margin.
Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are:
$$H_+ : w^T x + b = +1 \qquad H_- : w^T x + b = -1$$
$$w^T x_i + b \geq +1 \;\;\forall\, y_i = +1 \qquad w^T x_i + b \leq -1 \;\;\forall\, y_i = -1$$
• Or more simply: $y_i(w^T x_i + b) - 1 \geq 0$
• The margin of the SVM is: $m = d_+ + d_-$
• Distance to origin: $H : q = \frac{|b|}{\|w\|}$, $\; H_+ : q_+ = \frac{|b - 1|}{\|w\|}$, $\; H_- : q_- = \frac{|-1 - b|}{\|w\|}$
• Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and margin $m = \frac{2}{\|w\|}$
• Want to max the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem:
$$\min \tfrac{1}{2}\|w\|^2 \;\text{ subject to }\; y_i(w^T x_i + b) - 1 \geq 0$$
• This is the primal quadratic program
• Dual, nonseparable & nonlinear cases are straightforward.
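A minimal illustrative solver for the (slackened) primal: subgradient descent on the equivalent hinge-loss objective stands in for a QP solver (toy data invented for the example):

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    by subgradient descent; an iterative stand-in for the primal QP."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violators
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_primal_subgradient(X, y)
print(np.sign(X @ w + b))   # should match y on this separable toy set
```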
Support Vector Machines
• Dual, kernelized, slackened SVM QP:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
• The margin is
$$M = \frac{2}{\sqrt{w^T w}} \quad \text{where } w = \sum_t \alpha_t y_t k(x_t, \cdot)$$
• Find the bounding sphere on the data to get the radius R using a QP:
$$R^2 = \max_\beta \sum_i \beta_i k(x_i, x_i) - \sum_{i,j}\beta_i\beta_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \beta_i = 1 \;\&\; \beta_i \geq 0$$
• The VC dimension is then:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{4R^2}{M^2}\right),\, d\right\} + 1$$
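The radius QP can be sketched with a simple Frank-Wolfe loop over the simplex (linear kernel, toy data; a stand-in for a dedicated QP solver):

```python
import numpy as np

def bounding_sphere_radius_sq(X, iters=2000):
    """Maximize sum_i beta_i k(x_i,x_i) - beta^T K beta over the simplex
    (linear kernel) with Frank-Wolfe; returns R^2 of the bounding sphere."""
    K = X @ X.T
    n = len(X)
    beta = np.ones(n) / n
    for t in range(iters):
        grad = np.diag(K) - 2.0 * K @ beta      # gradient of the concave objective
        j = int(np.argmax(grad))                # best simplex vertex (linear maximizer)
        step = 2.0 / (t + 3.0)                  # classic Frank-Wolfe step schedule
        beta *= (1.0 - step)
        beta[j] += step                         # convex combination stays feasible
    return float(beta @ np.diag(K) - beta @ K @ beta)

# corners of a square of side 2 centered at the origin:
# the bounding sphere is centered at 0 with R^2 = 2
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(bounding_sphere_radius_sq(X))
```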
Support Vector Machines
• Can thus design all sorts of strange kernels on non-vector objects, as long as all computations involve the kernel function and not the explicit mapping, and solve the SVM in dual space, for example using quadratic programming (how big is this quadratic program? It has N variables and an N×N Gram matrix)
Reduced Working Set Algorithms • Since QP needs NxN Gram matrix, this is impossible to store or use for N=50,000 training points • Idea (Vapnik): only support vectors with non-zero Lagrange multipliers need to be used and stored • So, split data into chunks that fit into memory and solve QP only over their Lagrange multipliers, freezing others • Definitions of the points and note KKT conditions:
Non-critical points are safely outside the margin: $\lambda_i = 0 \iff f(x_i)\,y_i > 1$
Support vectors are on the margin: $\lambda_i \in (0, C) \iff f(x_i)\,y_i = 1$
Violators are inside the margin or on the wrong side: $\lambda_i = C \iff f(x_i)\,y_i < 1$
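The three KKT cases can be encoded directly; a small sketch (the category names are my own labels, and `tol` absorbs numerical slop):

```python
def kkt_status(alpha, yf, C, tol=1e-3):
    """Map a point's multiplier alpha and functional margin yf = y_i f(x_i)
    to its KKT category from the slide."""
    if alpha <= tol and yf > 1 - tol:
        return "non-critical"        # lambda_i = 0      <=>  f(x_i) y_i > 1
    if tol < alpha < C - tol and abs(yf - 1) <= tol:
        return "support-vector"      # lambda_i in (0,C) <=>  f(x_i) y_i = 1
    if alpha >= C - tol and yf < 1 + tol:
        return "violator"            # lambda_i = C      <=>  f(x_i) y_i < 1
    return "kkt-violation"           # multiplier inconsistent with its margin

print(kkt_status(0.0, 1.7, C=1.0))   # safely outside the margin
print(kkt_status(0.4, 1.0, C=1.0))   # on the margin
print(kkt_status(1.0, 0.2, C=1.0))   # inside the margin / wrong side
```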
Reduced Working Set Algorithms
• For example, Osuna splits the training set into two sets: B (working set) and N (inactive set)
1) Initialize with a random choice of points in B
2) Solve the SVM on B only
3) If there is a violator j in N where $f(x_j)\,y_j < 1$, swap it with a non-critical point i in B where $\lambda_i = 0$
4) Go to 2
• Shown to converge in a finite number of steps, since we only add in new points which do not satisfy KKT, and the dual maximization objective has to increase.
Sequential Minimal Optimization
• What is the smallest working set we could consider?
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) \;\text{ s.t. }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
Sequential Minimal Optimization
• What is the smallest working set we could consider?
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) \;\text{ s.t. }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
• Can update just 2 Lagrange multipliers at a time (Platt), since we have the sum-to-zero constraint $\sum_i \alpha_i y_i = 0$
• Like a constrained axis-parallel optimization (as in EM)
• But, since the SVM is a convex program, updating a subset of variables while the others are fixed will also converge globally
Sequential Minimal Optimization
• SMO is fast since the update for 2 alphas is just a quadratic equation
• Rewrite the SVM dual problem as a function of just 2 alphas:
$$L_D \propto \alpha_i + \alpha_j - \frac{1}{2}\left(K_{ii}\alpha_i^2 + 2K_{ij}\alpha_i\alpha_j + K_{jj}\alpha_j^2\right) - h_i\alpha_i - h_j\alpha_j$$
$$\text{subject to: } y_i\alpha_i + y_j\alpha_j + \sum_{t\neq i,j} y_t\alpha_t = 0 \;\text{ and }\; \alpha_i, \alpha_j \in [0, C]$$
• Can now write a simple update rule, noting the constraints (also, avoid the whole Gram matrix: only compute the needed K's). With $S = y_i y_j$:
$$L = \max\!\left(0,\; \alpha_j + S\alpha_i - \tfrac{1}{2}(S+1)C\right) \qquad H = \min\!\left(C,\; \alpha_j + S\alpha_i - \tfrac{1}{2}(S-1)C\right)$$
$$\alpha_j^{NEW} = \alpha_j + \frac{y_j(E_1 - E_2)}{k(x_i,x_i) + k(x_j,x_j) - 2k(x_i,x_j)} \;\text{ clipped inside } [L, H]$$
$$\alpha_i^{NEW} = \alpha_i + S\left(\alpha_j - \alpha_j^{NEW}\right)$$
$$E_1 = \sum_t \alpha_t y_t k(x_i, x_t) + b - y_i \qquad E_2 = \sum_t \alpha_t y_t k(x_j, x_t) + b - y_j$$
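The update rules above can be assembled into a minimal trainer. This sketch follows the *simplified* SMO heuristic (random second index rather than Platt's full second-choice heuristic), uses a linear kernel, and toy data invented for the example:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, sweeps=300, seed=0):
    """Simplified SMO: find a KKT-violating i, pick a random j != i, solve the
    2-variable subproblem in closed form, clip to [L, H], update threshold b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                                   # Gram matrix (linear kernel)
    alpha = np.zeros(n)
    b = 0.0
    quiet = 0
    for _ in range(sweeps):
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            if not ((y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0)):
                continue                          # KKT satisfied at i
            j = int(rng.integers(n - 1))
            j = j if j < i else j + 1             # any j != i
            Ej = (alpha * y) @ K[:, j] + b - y[j]
            ai, aj = alpha[i], alpha[j]
            if y[i] == y[j]:                      # the [L, H] box from the slide
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            else:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if L >= H or eta <= 0:
                continue
            alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)   # closed form + clip
            if abs(alpha[j] - aj) < 1e-8:
                continue
            alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])           # S (alpha_j - alpha_j^new)
            # threshold update from the KKT conditions
            b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
            b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
            b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else 0.5 * (b1 + b2))
            changed += 1
        quiet = quiet + 1 if changed == 0 else 0
        if quiet >= 5:                            # several clean sweeps => done
            break
    w = (alpha * y) @ X                           # recover primal w (linear kernel only)
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = smo_train(X, y)
print(np.sign(X @ w + b))   # should recover the labels on this separable toy set
```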
VC of Spherical Gap-Tolerance
• Led to large margin SVMs…
• The margin is
$$M = \frac{2}{\sqrt{w^T w}} \quad \text{where } w = \sum_t \alpha_t y_t k(x_t, \cdot)$$
• Find the bounding sphere on the data to get the radius using a QP:
$$R^2 = \max_\beta \sum_i \beta_i k(x_i, x_i) - \sum_{i,j}\beta_i\beta_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \beta_i = 1 \;\&\; \beta_i \geq 0$$
• The VC dimension is then:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{4R^2}{M^2}\right),\, d\right\} + 1$$
VC of Ellipsoidal Gap-Tolerance
• Consider a 2d ellipse:
$$\varepsilon = \left\{(x, y) : \frac{x^2}{a^2} + \frac{y^2}{b^2} \leq 1\right\}$$
• The largest-margin configuration for 3 points is a centered equilateral triangle, with one vertex at (0, b) and the other two at:
$$\left(\frac{\sqrt{12}\,a^2 b}{b^2 + 3a^2},\; \frac{(b^2 - 3a^2)\,b}{b^2 + 3a^2}\right) \quad\text{and}\quad \left(-\frac{\sqrt{12}\,a^2 b}{b^2 + 3a^2},\; \frac{(b^2 - 3a^2)\,b}{b^2 + 3a^2}\right)$$
VC of Ellipsoidal Gap-Tolerance
• For this configuration, the ellipse's max margin is
$$\frac{6a^2 b}{b^2 + 3a^2}$$
• The bounding sphere rounds b up to a; its max margin is
$$\frac{3}{2}a$$
• So, if M lies in the range
$$\frac{6a^2 b}{b^2 + 3a^2} < M < \frac{3}{2}a$$
the sphere says VC = 3, but the ellipse says VC = 2
• Ellipsoids always give lower VC than spheres
Minimum Volume Ellipsoid
• Extend the QP from a bounding sphere to a bounding ellipsoid
$$\varepsilon = \left\{x : (x - \mu)^T \Sigma^{-1} (x - \mu) \leq 1\right\}$$
with volume related to $|\Sigma|$
(Figure: ellipsoid with axes $2\lambda_1$, $2\lambda_2$ inside a sphere of diameter 2R.)
• Change variables: $A = \Sigma^{-\frac{1}{2}}$ and $b = \Sigma^{-\frac{1}{2}}\mu$
• Get an SDP:
$$\min_{A,b} -\ln|A| \;\text{ s.t. }\; (Ax_i - b)^T(Ax_i - b) \leq 1 \;\forall i \;\;\&\;\; A \succeq 0$$
• Slacken the SDP for outliers:
$$\min_{A,b,\tau} -\ln|A| + E\sum_i \tau_i \;\text{ s.t. }\; (Ax_i - b)^T(Ax_i - b) \leq 1 + \tau_i \;\;\&\;\; A \succeq 0$$
• Enforce the quadratic SDP constraints via a Schur complement:
$$\begin{bmatrix} I & Ax_i - b \\ (Ax_i - b)^T & 1 + \tau_i \end{bmatrix} \succeq 0$$
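As a lightweight stand-in for the determinant-maximization SDP, Khachiyan's classic first-order algorithm computes the same minimum-volume enclosing ellipsoid; a numpy sketch (not the slides' solver):

```python
import numpy as np

def min_volume_ellipsoid(X, tol=1e-3, max_iter=10000):
    """Khachiyan's barycentric-coordinate ascent for the minimum-volume enclosing
    ellipsoid {x : (x - mu)^T Sigma^{-1} (x - mu) <= 1}."""
    n, d = X.shape
    Q = np.vstack([X.T, np.ones(n)])                 # lift points to homogeneous coords
    u = np.ones(n) / n                               # weights on the simplex
    for _ in range(max_iter):
        V = (Q * u) @ Q.T
        M = np.einsum('ji,jk,ki->i', Q, np.linalg.inv(V), Q)   # leverage of each point
        j = int(np.argmax(M))                        # most "outlying" point
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        if step < tol:
            break
        u *= (1.0 - step)
        u[j] += step
    mu = u @ X
    Sigma = d * ((X.T * u) @ X - np.outer(mu, mu))   # so (x-mu)^T Sigma^{-1} (x-mu) <= ~1
    return mu, Sigma

# four corners of a square: the minimum ellipsoid is the circle through them
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
mu, Sigma = min_volume_ellipsoid(X)
print(mu, Sigma)   # center ~0, Sigma ~2*I
```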
Minimum Volume Ellipsoid SDP
• LP < QP < QCQP < SDP < Convex Programming
• Solvers: Matlab (LP, QP), Mosek (QCQP), YALMIP (SDP)
• LP: $\min_x b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
• QP: $\min_x \frac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
• QCQP: $\min_x \frac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$, $x^T x \leq \eta$
• SDP: $\min_K \mathrm{tr}(B^T K)$ s.t. $\mathrm{tr}(C_i^T K) \geq \alpha_i \;\forall i$, $K \succeq 0$
• SDP det: $\min_K -\log|K|$ s.t. $\mathrm{tr}(C_i^T K) \geq \alpha_i \;\forall i$, $K \succeq 0$
• We use the SDP determinant maximization in YALMIP
Ellipsoidal Machines • How does shape affect classification? • Consider two linear classifiers of margin M • Inside same bounding ellipsoid:
• Which is better (has lower VC)?
Ellipsoidal Machines
• Affine transform the ellipsoid space to a sphere:
$$\hat{x}_i = \Sigma^{-\frac{1}{2}}(x_i - \mu)$$
then solve a standard SVM in the transformed space.
• Or, implicitly solve the SVM with a new margin metric:
$$\min_{w,\xi} \tfrac{1}{2}w^T\Sigma w + C\sum_i \xi_i \;\text{ subject to }\; y_i(w^T x_i + b) \geq 1 - \xi_i \;\text{ and }\; \xi_i \geq 0$$
• The linear boundary tilts; margins alone are not enough!
• Linear SVMs are not affine invariant (just rotation & translation)
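The whitening map can be sketched directly; here the sample covariance stands in for the bounding ellipsoid's (Σ, μ), and the data cloud is invented for the example:

```python
import numpy as np

def whiten(X, mu, Sigma):
    """x_hat = Sigma^(-1/2) (x - mu): the affine map that sends the ellipsoid to
    a sphere; a standard SVM can then be trained on x_hat."""
    vals, vecs = np.linalg.eigh(Sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T    # symmetric Sigma^(-1/2)
    return (X - mu) @ inv_sqrt

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])   # elongated cloud
Xh = whiten(X, X.mean(axis=0), np.cov(X, rowvar=False))
print(np.round(np.cov(Xh, rowvar=False), 3))   # identity: the cloud is now spherical
```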
Experiments: SVM vs EVM
Setup 1: UCI data, ten folds per dataset. Split into:
80% train for (w,b) and (Σ,μ)
10% cross-validate over C & E
10% test accuracy
Experiments: SVM vs EVM
Setup 2: UCI data, ten folds per dataset. Train (Σ,μ) for various values of E on all x's. Split into:
80% train for (w,b)
10% cross-validate over C & E
10% test accuracy
Maximum Relative Margin
• Details in Shivaswamy and Jebara, NIPS 2008
• Red is maximum margin, green is maximum relative margin
• Top: a two-dimensional classification problem
• Bottom: projection of the data onto the solution $w^T x + b$
• The SVM solution changes as the axes get scaled, and has a large spread
Maximum Relative Margin
• Fast trick to solve the same problem as on the previous slides: bound the spread of the SVM!
• Recall the original SVM primal problem (with slack):
$$\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \;\text{ subject to }\; y_i(w^T x_i + b) \geq 1 - \xi_i$$
• Add the following constraints:
$$-B \leq w^T x_i + b \leq B$$
• This bounds the spread; call it the Relative Margin Machine (RMM)
• The above is still a QP and scales to 100k examples
• Can also be kernelized, solved in the dual, etc.
• Unlike the previous SDP, which only runs on ~1k examples
• RMM is as fast as SVM but has much higher accuracy…
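On a 1-D toy problem the RMM QP is small enough to brute-force, which makes the effect of the spread constraint visible (a grid search stands in for a real QP solver; data invented for the example):

```python
import numpy as np

# minimize 0.5 w^2 + C * sum_i xi_i  subject to  y_i (w x_i + b) >= 1 - xi_i
# and the RMM spread constraint |w x_i + b| <= B
X = np.array([1.0, 2.0, -1.0, -6.0])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve_rmm(B, C=1.0):
    ws = np.linspace(-3.0, 3.0, 601)
    bs = np.linspace(-3.0, 3.0, 601)
    W, Bo = np.meshgrid(ws, bs, indexing="ij")
    F = W[..., None] * X + Bo[..., None]               # w x_i + b for every grid cell
    obj = 0.5 * W**2 + C * np.maximum(0.0, 1.0 - y * F).sum(axis=-1)
    obj = np.where(np.all(np.abs(F) <= B, axis=-1), obj, np.inf)   # spread feasibility
    i, j = np.unravel_index(np.argmin(obj), obj.shape)
    return ws[i], bs[j]

w_svm, b_svm = solve_rmm(B=np.inf)    # plain SVM: no spread constraint
w_rmm, b_rmm = solve_rmm(B=4.0)       # RMM: outputs clamped to [-4, 4]
print(w_svm, b_svm)                                  # ~ (1.0, 0.0); spread reaches 6
print(w_rmm, np.abs(w_rmm * X + b_rmm).max())        # smaller w, spread <= 4
```

The spread constraint shrinks and shifts the solution exactly as the slides describe, without changing the QP's structure.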
• RMM vs. SVM on digit classification (two-class 0,…,9)
Maximum Relative Margin
Maximum Relative Margin
• RMM vs. SVM on digit classification (two-class 0,…,9)
• Cross-validate to obtain the best B and C for SVM and RMM
• Compare also to Kernel Linear Discriminant Analysis
• Try different polynomial kernels and RBF
• RMM has consistently lower error for kernel classification