Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Topic 3 • Maximum Margin
• Empirical Risk Minimization
• VC Dimension & Structural Risk Minimization
• Support Vector Machines
• Reduced Working Sets
• Sequential Minimal Optimization (SMO)
• Ellipsoidal Kernel Machines
• Maximum Relative Margin
• Other SVM Extensions…
Generative vs. Discriminative
• Generative approach to classification: get p(x,y), then p(y|x), then classify via y = f(x) = argmax_y p(y|x)
• For example, learn Gaussians by fitting each class with maximum likelihood:
$$\{x_1^+,\ldots,x_N^+\} \to p(x \mid y=+1) \qquad \{x_1^-,\ldots,x_M^-\} \to p(x \mid y=-1)$$
and use them as a classifier via the posterior:
$$p(y=+1 \mid x) = \frac{p(x \mid y=+1)\,p(y=+1)}{\sum_y p(x \mid y)\,p(y)}$$
• Vapnik: it is inefficient to learn a probability of everything if we only need a binary classifier y = f(x)
• Also, maximum likelihood tries to capture everything about the data, not just the classification decision
• So try just learning the best possible classifier discriminatively.
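As a concrete illustration of the generative route (data and parameters below are made up for the example), a minimal numpy sketch that fits one Gaussian per class by maximum likelihood and classifies through the posterior:

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance for one class (tiny ridge for stability)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def log_gaussian(x, mu, cov):
    """Log density of a multivariate Gaussian at x."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

def predict(x, params_pos, params_neg, prior_pos=0.5):
    """y = argmax_y p(x|y) p(y): the Bayes classifier from the posterior."""
    lp = log_gaussian(x, *params_pos) + np.log(prior_pos)
    ln = log_gaussian(x, *params_neg) + np.log(1 - prior_pos)
    return +1 if lp > ln else -1

rng = np.random.default_rng(0)
Xp = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(100, 2))   # class y = +1
Xn = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))   # class y = -1
pp, pn = fit_gaussian(Xp), fit_gaussian(Xn)
print(predict(np.array([2.5, 0.0]), pp, pn))   # a point near the +1 mean
```

Note how much machinery (means, covariances, priors) is estimated just to draw one boundary — exactly Vapnik's objection.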
Large Margins • 1995: SVMs & VC Theory popularize large margin learning
• Other margin methods: Boosting, Max Margin Markov Nets, Max Margin Matrix Factorization, … • Other margin theories: Boosting, PAC-Bayes, Rademacher • Are large margins right? Can SDPs help? • Let’s re-visit SRM, VC, & SVMs…
Empirical vs. Structural Risk
• Example: want a linear classifier to separate two classes:
$$f(x;\theta) = \mathrm{sign}(w^T x + b) \;\text{ where }\; \theta = \{w, b\}$$
• Choose a loss function:
$$L(x, y, \theta) = \mathrm{step}\!\left(-y\,f(x;\theta)\right)$$
• Empirical Risk Minimization fits only to the training data:
$$R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_i, f(x_i;\theta)\right) \in [0,1]$$
• The empirical risk approximates the true risk (expected error):
$$R(\theta) = E_P\{L(x,y,\theta)\} = \int_{X\times Y} P(x,y)\,L(x,y,\theta)\,dx\,dy \in [0,1]$$
• But we don't know the true P(x,y), and in general
$$\arg\min_\theta R_{emp}(\theta) \neq \arg\min_\theta R(\theta)$$
• Instead, SVMs perform Structural Risk Minimization
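A quick numpy sketch of the empirical risk of a linear classifier under the 0/1 loss (toy data invented for the example):

```python
import numpy as np

def f(X, w, b):
    # f(x; theta) = sign(w^T x + b)
    return np.sign(X @ w + b)

def empirical_risk(X, y, w, b):
    # R_emp(theta) = (1/N) sum of 0/1 losses; always lies in [0, 1]
    return np.mean(f(X, w, b) != y)

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(empirical_risk(X, y, w=np.array([1.0, 0.0]), b=0.0))   # this w separates all four
```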
Empirical vs. Structural Risk
• ERM is inconsistent: minimizing $R_{emp}(\theta)$ is not guaranteed to minimize $R(\theta)$; typically $R(\hat\theta) \geq R_{emp}(\hat\theta)$, so we may do better on training than on test!
• SRM: add a prior or regularizer to $R_{emp}(\theta)$
• Define a capacity or confidence term $C(\theta)$ which favors simpler $\theta$:
$$J(\theta) = R_{emp}(\theta) + C(\theta)$$
• If $R(\theta) \leq J(\theta)$, we have a bound: $J$ is a guaranteed risk
• SRM: minimize $J(\theta)$; the guaranteed future error rate is $\leq \min_\theta J(\theta)$
Structural Risk Minimization & VC
• How to bound the risk? Learning theory…
• Theorem (Vapnik): with probability 1 − η the following holds:
$$R(\theta) \leq J(\theta) = R_{emp}(\theta) + \Phi\!\left(\frac{h}{N}, \frac{\ln\eta}{N}\right)$$
where N = number of data points and h = Vapnik-Chervonenkis (VC) dimension; more precisely:
$$R(\theta) \leq J(\theta) = R_{emp}(\theta) + \frac{2h\log\frac{2eN}{h} + 2\log\frac{4}{\eta}}{N}\left(1 + \sqrt{1 + \frac{N\,R_{emp}(\theta)}{h\log\frac{2eN}{h} + \log\frac{4}{\eta}}}\,\right)$$
• The VC dimension of linear classifiers in d dimensions is h = d + 1
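The bound is easy to evaluate numerically; a small sketch of the guarantee in the form reconstructed above (so treat the exact constants as an assumption):

```python
import numpy as np

def guaranteed_risk(r_emp, h, N, eta=0.05):
    """J(theta) >= R(theta) with probability 1 - eta, in the slide's form."""
    cap = (2 * h * np.log(2 * np.e * N / h) + 2 * np.log(4 / eta)) / N
    inner = 1 + np.sqrt(1 + N * r_emp / (h * np.log(2 * np.e * N / h) + np.log(4 / eta)))
    return r_emp + cap * inner

# the bound tightens as N grows and loosens as the VC dimension h grows
print(guaranteed_risk(0.05, h=10, N=1000))
print(guaranteed_risk(0.05, h=10, N=100000))
print(guaranteed_risk(0.05, h=100, N=1000))
```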
VC of Gap-Tolerant Classifiers
• Arbitrary linear classifiers are too flexible as a function class
• Can improve the estimate of the VC dimension if we restrict them
• Constrain linear classifiers to data living inside a sphere
• Gap-tolerant classifiers: a linear classifier whose activity is constrained to a sphere & outside a margin; only count errors in the shaded region, elsewhere L(x,y,θ) = 0
• If M is small relative to D, can still shatter 3 points
(Figure: sphere of diameter D with a margin band of width M; M = margin, D = diameter, d = dimensionality.)
VC of Gap-Tolerant Classifiers
• But, as M grows relative to D, can only shatter 2 points! (Figure: with a large margin, can shatter 2 points but can't shatter 3.)
• For hyperplanes, as M grows vs. D, we shatter fewer points!
• The VC dimension h goes down if the gap-tolerant classifier has a larger margin; the general formula is:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{D^2}{M^2}\right),\, d\right\} + 1$$
• Before, we just had h = d + 1. Now we have a smaller h
• If the data can be anywhere, D is infinite and we are back to h = d + 1
• Typically real data is bounded (by a sphere), so D is fixed
• Maximizing M reduces h, which improves the guaranteed risk J(θ)
• There is no way to modify the true risk R directly with θ; we can only minimize the bound J(θ)
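The h formula above is easy to compute; a tiny sketch:

```python
import math

def vc_gap_tolerant(D, M, d):
    """h <= min(ceil(D^2 / M^2), d) + 1 for a gap-tolerant linear classifier."""
    return min(math.ceil(D**2 / M**2), d) + 1

# with a tiny margin the sphere constraint does nothing: h = d + 1;
# growing the margin M relative to the diameter D shrinks h
print(vc_gap_tolerant(D=2.0, M=0.1, d=3))   # ceil(400) vs d=3 -> h = 4
print(vc_gap_tolerant(D=2.0, M=1.5, d=3))   # ceil(4/2.25) = 2  -> h = 3
```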
Support Vector Machines
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk J(θ)
• Assume first that the 2-class data is linearly separable:
$$\text{have } \{(x_1, y_1), \ldots, (x_N, y_N)\} \;\text{ where }\; x_i \in \mathbb{R}^D \text{ and } y_i \in \{-1, 1\}$$
• Decision boundary or hyperplane given by $f(x;\theta) = \mathrm{sign}(w^T x + b)$, i.e. $w^T x + b = 0$
• Note: can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• Want the unique widest one ⇒ max margin.
Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are:
$$H_+ : w^T x + b = +1 \qquad H_- : w^T x + b = -1$$
$$w^T x_i + b \geq +1 \;\;\forall\, y_i = +1 \qquad w^T x_i + b \leq -1 \;\;\forall\, y_i = -1$$
• Or more simply: $y_i(w^T x_i + b) - 1 \geq 0$
• The margin of the SVM is: $m = d_+ + d_-$
• Distance to origin: $H : q = \frac{|b|}{\|w\|}$, $\; H_+ : q_+ = \frac{|b - 1|}{\|w\|}$, $\; H_- : q_- = \frac{|-1 - b|}{\|w\|}$
• Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and margin $m = \frac{2}{\|w\|}$
• Want to max the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem:
$$\min \tfrac{1}{2}\|w\|^2 \;\text{ subject to }\; y_i(w^T x_i + b) - 1 \geq 0$$
• This is the primal quadratic program
• Dual, nonseparable & nonlinear cases are straightforward.
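A minimal illustrative solver for the (slackened) primal: subgradient descent on the equivalent hinge-loss objective stands in for a QP solver (toy data invented for the example):

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
    by subgradient descent; an iterative stand-in for the primal QP."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # margin violators
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_primal_subgradient(X, y)
print(np.sign(X @ w + b))   # should match y on this separable toy set
```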
Support Vector Machines
• Dual, kernelized, slackened SVM QP:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
• The margin is
$$M = \frac{2}{\sqrt{w^T w}} \quad \text{where } w = \sum_t \alpha_t y_t k(x_t, \cdot)$$
• Find the bounding sphere on the data to get the radius R using a QP:
$$R^2 = \max_\beta \sum_i \beta_i k(x_i, x_i) - \sum_{i,j}\beta_i\beta_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \beta_i = 1 \;\&\; \beta_i \geq 0$$
• The VC dimension is then:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{4R^2}{M^2}\right),\, d\right\} + 1$$
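The radius QP can be sketched with a simple Frank-Wolfe loop over the simplex (linear kernel, toy data; a stand-in for a dedicated QP solver):

```python
import numpy as np

def bounding_sphere_radius_sq(X, iters=2000):
    """Maximize sum_i beta_i k(x_i,x_i) - beta^T K beta over the simplex
    (linear kernel) with Frank-Wolfe; returns R^2 of the bounding sphere."""
    K = X @ X.T
    n = len(X)
    beta = np.ones(n) / n
    for t in range(iters):
        grad = np.diag(K) - 2.0 * K @ beta      # gradient of the concave objective
        j = int(np.argmax(grad))                # best simplex vertex (linear maximizer)
        step = 2.0 / (t + 3.0)                  # classic Frank-Wolfe step schedule
        beta *= (1.0 - step)
        beta[j] += step                         # convex combination stays feasible
    return float(beta @ np.diag(K) - beta @ K @ beta)

# corners of a square of side 2 centered at the origin:
# the bounding sphere is centered at 0 with R^2 = 2
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(bounding_sphere_radius_sq(X))
```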
Support Vector Machines
• Can thus design all sorts of strange kernels on non-vector objects, as long as all computations involve the kernel function and not the explicit mapping, and solve the SVM in dual space, for example using quadratic programming (how big is this quadratic program? It has N variables and an N×N Gram matrix)
Reduced Working Set Algorithms • Since QP needs NxN Gram matrix, this is impossible to store or use for N=50,000 training points • Idea (Vapnik): only support vectors with non-zero Lagrange multipliers need to be used and stored • So, split data into chunks that fit into memory and solve QP only over their Lagrange multipliers, freezing others • Definitions of the points and note KKT conditions:
Non-critical points are safely outside the margin: $\lambda_i = 0 \iff f(x_i)\,y_i > 1$
Support vectors are on the margin: $\lambda_i \in (0, C) \iff f(x_i)\,y_i = 1$
Violators are inside the margin or on the wrong side: $\lambda_i = C \iff f(x_i)\,y_i < 1$
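The three KKT cases can be encoded directly; a small sketch (the category names are my own labels, and `tol` absorbs numerical slop):

```python
def kkt_status(alpha, yf, C, tol=1e-3):
    """Map a point's multiplier alpha and functional margin yf = y_i f(x_i)
    to its KKT category from the slide."""
    if alpha <= tol and yf > 1 - tol:
        return "non-critical"        # lambda_i = 0      <=>  f(x_i) y_i > 1
    if tol < alpha < C - tol and abs(yf - 1) <= tol:
        return "support-vector"      # lambda_i in (0,C) <=>  f(x_i) y_i = 1
    if alpha >= C - tol and yf < 1 + tol:
        return "violator"            # lambda_i = C      <=>  f(x_i) y_i < 1
    return "kkt-violation"           # multiplier inconsistent with its margin

print(kkt_status(0.0, 1.7, C=1.0))   # safely outside the margin
print(kkt_status(0.4, 1.0, C=1.0))   # on the margin
print(kkt_status(1.0, 0.2, C=1.0))   # inside the margin / wrong side
```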
Reduced Working Set Algorithms
• For example, Osuna splits the training set into two sets: B (working set) and N (inactive set)
1) Initialize with a random choice of points in B
2) Solve the SVM on B only
3) If there is a violator j in N where $f(x_j)\,y_j < 1$, swap it with a non-critical point i in B where $\lambda_i = 0$
4) Go to 2
• Shown to converge in a finite number of steps, since we only add in new points which do not satisfy KKT, and the dual maximization objective has to increase.
Sequential Minimal Optimization
• What is the smallest working set we could consider?
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) \;\text{ s.t. }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
Sequential Minimal Optimization
• What is the smallest working set we could consider?
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) \;\text{ s.t. }\; \sum_i \alpha_i y_i = 0 \;\&\; \alpha_i \in [0, C]$$
• Can update just 2 Lagrange multipliers at a time (Platt), since we have the sum-to-zero constraint $\sum_i \alpha_i y_i = 0$
• Like a constrained axis-parallel optimization (as in EM)
• But, since the SVM is a convex program, updating a subset of variables while the others are fixed will also converge globally
Sequential Minimal Optimization
• SMO is fast since the update for 2 alphas is just a quadratic equation
• Rewrite the SVM dual problem as a function of just 2 alphas:
$$L_D \propto \alpha_i + \alpha_j - \frac{1}{2}\left(K_{ii}\alpha_i^2 + 2K_{ij}\alpha_i\alpha_j + K_{jj}\alpha_j^2\right) - h_i\alpha_i - h_j\alpha_j$$
$$\text{subject to: } y_i\alpha_i + y_j\alpha_j + \sum_{t\neq i,j} y_t\alpha_t = 0 \;\text{ and }\; \alpha_i, \alpha_j \in [0, C]$$
• Can now write a simple update rule, noting the constraints (also, avoid the whole Gram matrix: only compute the needed K's). With $S = y_i y_j$:
$$L = \max\!\left(0,\; \alpha_j + S\alpha_i - \tfrac{1}{2}(S+1)C\right) \qquad H = \min\!\left(C,\; \alpha_j + S\alpha_i - \tfrac{1}{2}(S-1)C\right)$$
$$\alpha_j^{NEW} = \alpha_j + \frac{y_j(E_1 - E_2)}{k(x_i,x_i) + k(x_j,x_j) - 2k(x_i,x_j)} \;\text{ clipped inside } [L, H]$$
$$\alpha_i^{NEW} = \alpha_i + S\left(\alpha_j - \alpha_j^{NEW}\right)$$
$$E_1 = \sum_t \alpha_t y_t k(x_i, x_t) + b - y_i \qquad E_2 = \sum_t \alpha_t y_t k(x_j, x_t) + b - y_j$$
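The update rules above can be assembled into a minimal trainer. This sketch follows the *simplified* SMO heuristic (random second index rather than Platt's full second-choice heuristic), uses a linear kernel, and toy data invented for the example:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, sweeps=300, seed=0):
    """Simplified SMO: find a KKT-violating i, pick a random j != i, solve the
    2-variable subproblem in closed form, clip to [L, H], update threshold b."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                                   # Gram matrix (linear kernel)
    alpha = np.zeros(n)
    b = 0.0
    quiet = 0
    for _ in range(sweeps):
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            if not ((y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0)):
                continue                          # KKT satisfied at i
            j = int(rng.integers(n - 1))
            j = j if j < i else j + 1             # any j != i
            Ej = (alpha * y) @ K[:, j] + b - y[j]
            ai, aj = alpha[i], alpha[j]
            if y[i] == y[j]:                      # the [L, H] box from the slide
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            else:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
            if L >= H or eta <= 0:
                continue
            alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)   # closed form + clip
            if abs(alpha[j] - aj) < 1e-8:
                continue
            alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])           # S (alpha_j - alpha_j^new)
            # threshold update from the KKT conditions
            b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
            b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
            b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else 0.5 * (b1 + b2))
            changed += 1
        quiet = quiet + 1 if changed == 0 else 0
        if quiet >= 5:                            # several clean sweeps => done
            break
    w = (alpha * y) @ X                           # recover primal w (linear kernel only)
    return w, b

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = smo_train(X, y)
print(np.sign(X @ w + b))   # should recover the labels on this separable toy set
```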
VC of Spherical Gap-Tolerance
• Led to large margin SVMs…
• The margin is
$$M = \frac{2}{\sqrt{w^T w}} \quad \text{where } w = \sum_t \alpha_t y_t k(x_t, \cdot)$$
• Find the bounding sphere on the data to get the radius using a QP:
$$R^2 = \max_\beta \sum_i \beta_i k(x_i, x_i) - \sum_{i,j}\beta_i\beta_j k(x_i, x_j) \;\text{ subject to }\; \sum_i \beta_i = 1 \;\&\; \beta_i \geq 0$$
• The VC dimension is then:
$$h \leq \min\left\{\mathrm{ceil}\!\left(\frac{4R^2}{M^2}\right),\, d\right\} + 1$$
VC of Ellipsoidal Gap-Tolerance
• Consider a 2d ellipse:
$$\varepsilon = \left\{(x, y) : \frac{x^2}{a^2} + \frac{y^2}{b^2} \leq 1\right\}$$
• The largest-margin configuration for 3 points is a centered equilateral triangle, with one vertex at (0, b) and the other two at:
$$\left(\frac{\sqrt{12}\,a^2 b}{b^2 + 3a^2},\; \frac{(b^2 - 3a^2)\,b}{b^2 + 3a^2}\right) \quad\text{and}\quad \left(-\frac{\sqrt{12}\,a^2 b}{b^2 + 3a^2},\; \frac{(b^2 - 3a^2)\,b}{b^2 + 3a^2}\right)$$
VC of Ellipsoidal Gap-Tolerance
• For this configuration, the ellipse's max margin is
$$\frac{6a^2 b}{b^2 + 3a^2}$$
• The bounding sphere rounds b up to a; its max margin is
$$\frac{3}{2}a$$
• So, if M lies in the range
$$\frac{6a^2 b}{b^2 + 3a^2} < M < \frac{3}{2}a$$
the sphere says VC = 3, but the ellipse says VC = 2
• Ellipsoids always give lower VC than spheres
Minimum Volume Ellipsoid
• Extend the QP from a bounding sphere to a bounding ellipsoid
$$\varepsilon = \left\{x : (x - \mu)^T \Sigma^{-1} (x - \mu) \leq 1\right\}$$
with volume related to $|\Sigma|$
(Figure: ellipsoid with axes $2\lambda_1$, $2\lambda_2$ inside a sphere of diameter 2R.)
• Change variables: $A = \Sigma^{-\frac{1}{2}}$ and $b = \Sigma^{-\frac{1}{2}}\mu$
• Get an SDP:
$$\min_{A,b} -\ln|A| \;\text{ s.t. }\; (Ax_i - b)^T(Ax_i - b) \leq 1 \;\forall i \;\;\&\;\; A \succeq 0$$
• Slacken the SDP for outliers:
$$\min_{A,b,\tau} -\ln|A| + E\sum_i \tau_i \;\text{ s.t. }\; (Ax_i - b)^T(Ax_i - b) \leq 1 + \tau_i \;\;\&\;\; A \succeq 0$$
• Enforce the quadratic SDP constraints via a Schur complement:
$$\begin{bmatrix} I & Ax_i - b \\ (Ax_i - b)^T & 1 + \tau_i \end{bmatrix} \succeq 0$$
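As a lightweight stand-in for the determinant-maximization SDP, Khachiyan's classic first-order algorithm computes the same minimum-volume enclosing ellipsoid; a numpy sketch (not the slides' solver):

```python
import numpy as np

def min_volume_ellipsoid(X, tol=1e-3, max_iter=10000):
    """Khachiyan's barycentric-coordinate ascent for the minimum-volume enclosing
    ellipsoid {x : (x - mu)^T Sigma^{-1} (x - mu) <= 1}."""
    n, d = X.shape
    Q = np.vstack([X.T, np.ones(n)])                 # lift points to homogeneous coords
    u = np.ones(n) / n                               # weights on the simplex
    for _ in range(max_iter):
        V = (Q * u) @ Q.T
        M = np.einsum('ji,jk,ki->i', Q, np.linalg.inv(V), Q)   # leverage of each point
        j = int(np.argmax(M))                        # most "outlying" point
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        if step < tol:
            break
        u *= (1.0 - step)
        u[j] += step
    mu = u @ X
    Sigma = d * ((X.T * u) @ X - np.outer(mu, mu))   # so (x-mu)^T Sigma^{-1} (x-mu) <= ~1
    return mu, Sigma

# four corners of a square: the minimum ellipsoid is the circle through them
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
mu, Sigma = min_volume_ellipsoid(X)
print(mu, Sigma)   # center ~0, Sigma ~2*I
```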
Minimum Volume Ellipsoid SDP
• LP < QP < QCQP < SDP < Convex Programming
• Solvers: Matlab (LP, QP), Mosek (QCQP), YALMIP (SDP)
• LP: $\min_x b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
• QP: $\min_x \frac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
• QCQP: $\min_x \frac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$, $x^T x \leq \eta$
• SDP: $\min_K \mathrm{tr}(B^T K)$ s.t. $\mathrm{tr}(C_i^T K) \geq \alpha_i \;\forall i$, $K \succeq 0$
• SDP det: $\min_K -\log|K|$ s.t. $\mathrm{tr}(C_i^T K) \geq \alpha_i \;\forall i$, $K \succeq 0$
• We use the SDP determinant maximization in YALMIP
Ellipsoidal Machines • How does shape affect classification? • Consider two linear classifiers of margin M • Inside same bounding ellipsoid:
• Which is better (has lower VC)?
Ellipsoidal Machines
• Affine transform the ellipsoid space to a sphere:
$$\hat{x}_i = \Sigma^{-\frac{1}{2}}(x_i - \mu)$$
then solve a standard SVM in the transformed space.
• Or, implicitly solve the SVM with a new margin metric:
$$\min_{w,\xi} \tfrac{1}{2}w^T\Sigma w + C\sum_i \xi_i \;\text{ subject to }\; y_i(w^T x_i + b) \geq 1 - \xi_i \;\text{ and }\; \xi_i \geq 0$$
• The linear boundary tilts; margins alone are not enough!
• Linear SVMs are not affine invariant (just rotation & translation)
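The whitening map can be sketched directly; here the sample covariance stands in for the bounding ellipsoid's (Σ, μ), and the data cloud is invented for the example:

```python
import numpy as np

def whiten(X, mu, Sigma):
    """x_hat = Sigma^(-1/2) (x - mu): the affine map that sends the ellipsoid to
    a sphere; a standard SVM can then be trained on x_hat."""
    vals, vecs = np.linalg.eigh(Sigma)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T    # symmetric Sigma^(-1/2)
    return (X - mu) @ inv_sqrt

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])   # elongated cloud
Xh = whiten(X, X.mean(axis=0), np.cov(X, rowvar=False))
print(np.round(np.cov(Xh, rowvar=False), 3))   # identity: the cloud is now spherical
```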
Experiments: SVM vs EVM
Setup 1: UCI data, ten folds per dataset. Split into:
80% train for (w,b) and (Σ,μ)
10% cross-validate over C & E
10% test accuracy
Experiments: SVM vs EVM
Setup 2: UCI data, ten folds per dataset. Train (Σ,μ) for various values of E on all x's. Split into:
80% train for (w,b)
10% cross-validate over C & E
10% test accuracy
Maximum Relative Margin
• Details in Shivaswamy and Jebara, NIPS 2008
• Red is maximum margin, green is maximum relative margin
• Top: a two-dimensional classification problem
• Bottom: projection of the data onto the solution $w^T x + b$
• The SVM solution changes as the axes get scaled, and has a large spread
Maximum Relative Margin
• Fast trick to solve the same problem as on the previous slides: bound the spread of the SVM!
• Recall the original SVM primal problem (with slack):
$$\min_{w,b,\xi} \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \;\text{ subject to }\; y_i(w^T x_i + b) \geq 1 - \xi_i$$
• Add the following constraints:
$$-B \leq w^T x_i + b \leq B$$
• This bounds the spread; call it the Relative Margin Machine (RMM)
• The above is still a QP and scales to 100k examples
• Can also be kernelized, solved in the dual, etc.
• Unlike the previous SDP, which only runs on ~1k examples
• RMM is as fast as SVM but has much higher accuracy…
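On a 1-D toy problem the RMM QP is small enough to brute-force, which makes the effect of the spread constraint visible (a grid search stands in for a real QP solver; data invented for the example):

```python
import numpy as np

# minimize 0.5 w^2 + C * sum_i xi_i  subject to  y_i (w x_i + b) >= 1 - xi_i
# and the RMM spread constraint |w x_i + b| <= B
X = np.array([1.0, 2.0, -1.0, -6.0])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve_rmm(B, C=1.0):
    ws = np.linspace(-3.0, 3.0, 601)
    bs = np.linspace(-3.0, 3.0, 601)
    W, Bo = np.meshgrid(ws, bs, indexing="ij")
    F = W[..., None] * X + Bo[..., None]               # w x_i + b for every grid cell
    obj = 0.5 * W**2 + C * np.maximum(0.0, 1.0 - y * F).sum(axis=-1)
    obj = np.where(np.all(np.abs(F) <= B, axis=-1), obj, np.inf)   # spread feasibility
    i, j = np.unravel_index(np.argmin(obj), obj.shape)
    return ws[i], bs[j]

w_svm, b_svm = solve_rmm(B=np.inf)    # plain SVM: no spread constraint
w_rmm, b_rmm = solve_rmm(B=4.0)       # RMM: outputs clamped to [-4, 4]
print(w_svm, b_svm)                                  # ~ (1.0, 0.0); spread reaches 6
print(w_rmm, np.abs(w_rmm * X + b_rmm).max())        # smaller w, spread <= 4
```

The spread constraint shrinks and shifts the solution exactly as the slides describe, without changing the QP's structure.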
• RMM vs. SVM on digit classification (two-class 0,…,9)
Maximum Relative Margin
Maximum Relative Margin
• RMM vs. SVM on digit classification (two-class 0,…,9)
• Cross-validate to obtain the best B and C for SVM and RMM
• Compare also to Kernel Linear Discriminant Analysis
• Try different polynomial kernels and RBF
• RMM has consistently lower error for kernel classification