
Lecture 3: Support Vector Machines

Tuo Zhao

Schools of ISYE and CSE, Georgia Tech

CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis

Support Vector Machines

Numerous books, papers, and tutorials... Once upon a time, SVM was considered the ultimate recipe for classification and regression.

SVMs have faded away, but they remain an important chapter in any elementary machine learning textbook.


Margins: Intuition

Consider a standard linear classification problem:

Feature: X ∈ R^d

Response: Y ∈ {−1, 1}

Linear classifier: Y = sign(X^T θ)

Our confidence is represented by |X^T θ|.

Logistic regression: P(Y = 1) = 1 / (1 + exp(−X^T θ)).
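As a quick illustration of these definitions, here is a minimal NumPy sketch (the parameter vector θ and the feature matrix X are made up for illustration) that computes the linear prediction sign(X^T θ), the confidence |X^T θ|, and the logistic probability P(Y = 1):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 5
    theta = rng.normal(size=d)              # hypothetical parameter vector
    X = rng.normal(size=(n, d))             # n feature vectors, one per row

    scores = X @ theta                      # x_i^T theta for each sample
    y_hat = np.sign(scores)                 # linear classifier: sign(x^T theta)
    confidence = np.abs(scores)             # confidence |x^T theta|
    p_pos = 1.0 / (1.0 + np.exp(-scores))   # logistic regression: P(Y = 1)

    for s, yh, c, p in zip(scores, y_hat, confidence, p_pos):
        print(f"score={s:+.3f}  prediction={yh:+.0f}  confidence={c:.3f}  P(Y=1)={p:.3f}")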


We have three testing samples: A, B, and C. For which one should we have the most confidence in our classification?

It would be nice if, given a training set, we could find θ so that θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.

For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation θ^T x = 0, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B, and C.

[Figure: a separating hyperplane with three labeled points A, B, and C at different distances from the boundary.]

Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in-between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it'd be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.


Geometric Margin

Functional margin: γ_i = y_i (x_i^T θ).

Scale invariance: ‖θ‖_2 = 1.

Minimum functional margin: γ_min(θ) = min_i y_i (x_i^T θ).

Maximize the minimum margin:

θ̂ = argmax_θ γ_min(θ) subject to ‖θ‖_2 = 1.
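A minimal sketch of these quantities, assuming NumPy and a toy separable dataset (the data and the candidate θ below are made up for illustration): it computes γ_i = y_i x_i^T θ and the minimum functional margin, and shows why fixing the scale of θ matters, since rescaling θ rescales every margin.

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-2.0, -2.0]])  # toy features
    y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}
    theta = np.array([1.0, 1.0])                                        # a candidate direction

    def min_functional_margin(theta, X, y):
        # gamma_i = y_i * (x_i^T theta); the minimum over i is gamma_min(theta)
        return np.min(y * (X @ theta))

    theta_unit = theta / np.linalg.norm(theta)       # enforce ||theta||_2 = 1
    print(min_functional_margin(theta, X, y))        # scale-dependent value
    print(min_functional_margin(2 * theta, X, y))    # doubles: margins are not scale invariant
    print(min_functional_margin(theta_unit, X, y))   # margin of the normalized direction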


Convexification

Maximize the minimum margin:

θ̂ = argmax_{θ, γ} γ subject to ‖θ‖_2 = 1, y_i (x_i^T θ) ≥ γ, i = 1, ..., n.

The constraint ‖θ‖_2 = 1 is nonconvex.

Rescale the margin:

θ̂ = argmax_{θ, γ} γ / ‖θ‖_2 subject to y_i (x_i^T θ) ≥ γ, i = 1, ..., n.


The scale of γ does not matter: enforce γ = 1.

Equivalent to minimizing ‖θ‖_2:

θ̂ = argmin_θ (1/2)‖θ‖_2^2 subject to 1 − y_i (x_i^T θ) ≤ 0, i = 1, ..., n.

Quadratic programming with linear constraints.

Efficient optimization solvers.
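Because this is a small quadratic program with linear constraints, an off-the-shelf convex solver handles it directly. A hedged sketch using CVXPY (an assumption made here purely for convenience; the lecture itself names CPLEX and Gurobi below, and any QP solver would do) on a made-up linearly separable dataset:

    import numpy as np
    import cvxpy as cp

    # Toy linearly separable data (made up for illustration).
    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    n, d = X.shape

    theta = cp.Variable(d)
    objective = cp.Minimize(0.5 * cp.sum_squares(theta))   # (1/2) ||theta||_2^2
    constraints = [cp.multiply(y, X @ theta) >= 1]         # 1 - y_i x_i^T theta <= 0
    problem = cp.Problem(objective, constraints)
    problem.solve()

    print("theta_hat =", theta.value)
    print("functional margins:", y * (X @ theta.value))    # all should be >= 1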


Optimization Software

Existing Convex Programming Solvers:

CPLEX: IBM ILOG CPLEX Optimization Studio.

Gurobi: Zonghao Gu, Edward Rothberg and Robert Bixby.

Programming Language: C++, Java, Python, MATLAB, R

Modeling Language: AIMMS, AMPL, GAMS, MPL

Free Academic Licenses.



Lagrangian Duality

Lagrangian Multiplier Method

Dual variables: λ_i ≥ 0, i = 1, ..., n.

L(θ, λ) = (1/2)‖θ‖_2^2 + Σ_{i=1}^n λ_i (1 − y_i (x_i^T θ)).

Min-max problem:

min_θ max_{λ≥0} L(θ, λ).

Strong duality:

min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ).


Penalty of Constraint Violation

Min-max problem:

min_θ max_{λ≥0} (1/2)‖θ‖_2^2 + Σ_{i=1}^n λ_i (1 − y_i (x_i^T θ)).

Given 1 − y_i (x_i^T θ) ≤ 0:  max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^T θ)) = 0.

Given 1 − y_i (x_i^T θ) > 0:  max_{λ_i ≥ 0} λ_i (1 − y_i (x_i^T θ)) = +∞.


Convex Concave Saddle Point Problem (figure)

Karush-Kuhn-Tucker Condition

KKT condition:

∂L(θ, λ)/∂θ |_{θ=θ̂} = 0,   ∂L(θ, λ)/∂λ |_{λ=λ̂} = 0.

Complementary slackness:

λ̂_i = 0 ⇒ 1 − y_i x_i^T θ̂ ≤ 0,
λ̂_i > 0 ⇒ 1 − y_i x_i^T θ̂ = 0.

λ̂_i > 0: support vector.


Dual Problem

Solve the minimization problem:

∂L(θ, λ)/∂θ = 0 ⇒ θ = Σ_{i=1}^n λ_i y_i x_i.

Maximization problem:

max_{λ≥0} L(Σ_{i=1}^n λ_i y_i x_i, λ)
  = max_{λ≥0} (1/2) ‖Σ_{i=1}^n λ_i y_i x_i‖_2^2 + Σ_{i=1}^n λ_i (1 − y_i x_i^T Σ_{k=1}^n λ_k y_k x_k).


Simplification

‖Σ_{i=1}^n λ_i y_i x_i‖_2^2 = (Σ_{i=1}^n λ_i y_i x_i^T)(Σ_{k=1}^n λ_k y_k x_k) = Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^T x_k,

Σ_{i=1}^n λ_i (1 − y_i x_i^T Σ_{k=1}^n λ_k y_k x_k) = Σ_{i=1}^n λ_i − Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^T x_k.

Dual Problem

After simplification, we have

λ̂ = argmax_{λ≥0} Σ_{i=1}^n λ_i − (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k x_i^T x_k.

We define Q by Q_ik = y_i y_k x_i^T x_k.

Quadratic minimization:

λ̂ = argmin_λ (1/2) λ^T Q λ − λ^T 1 subject to λ_i ≥ 0, i = 1, ..., n.
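The matrix Q is just the elementwise product of the label outer product and the Gram matrix of the data. A small NumPy sketch (toy data assumed for illustration) that builds Q and evaluates the dual objective:

    import numpy as np

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])  # toy features
    y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}

    gram = X @ X.T                    # Gram matrix: x_i^T x_k
    Q = np.outer(y, y) * gram         # Q_ik = y_i y_k x_i^T x_k

    def dual_objective(lam, Q):
        # (1/2) lambda^T Q lambda - lambda^T 1, to be minimized over lambda >= 0
        return 0.5 * lam @ Q @ lam - lam.sum()

    print(Q)
    print(dual_objective(np.ones(len(y)), Q))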

Support Vector Machines

Let S denote the set of support vectors. Given x*, we predict

y* = sign(θ^T x*) = sign(Σ_{i=1}^n λ_i y_i x_i^T x*) = sign(Σ_{i∈S} λ_i y_i x_i^T x*).

…corresponding to constraints that hold with equality, g_i(w) = 0. Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.

The points with the smallest margins are exactly the ones closest to the decision boundary; here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary. Thus, only three of the α_i's, namely the ones corresponding to these three training examples, will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.

Let's move on. Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we'll try to write our algorithm in terms of only the inner product ⟨x^(i), x^(j)⟩ (think of this as (x^(i))^T x^(j)) between points in the input feature space. The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.

When we construct the Lagrangian for our optimization problem we have:

L(w, b, α) = (1/2)‖w‖^2 − Σ_{i=1}^m α_i [y^(i)(w^T x^(i) + b) − 1].   (8)

Note that there are only "α_i" but no "β_i" Lagrange multipliers, since the problem has only inequality constraints.

Let's find the dual form of the problem. To do so, we need to first minimize L(w, b, α) with respect to w and b (for fixed α), to get θ_D, which…
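Once a dual solution is available, prediction only touches the support vectors. A hedged sketch of this prediction rule (it assumes a dual vector lam has already been computed, for instance by the QP above; the dual vector and the tolerance 1e-8 used to identify support vectors are illustrative choices):

    import numpy as np

    def svm_predict(x_new, X, y, lam, tol=1e-8):
        """Predict sign(sum_{i in S} lam_i y_i x_i^T x_new) using only support vectors."""
        support = lam > tol                       # indices with lam_i > 0
        scores = X[support] @ x_new               # x_i^T x_new for i in S
        return np.sign(np.sum(lam[support] * y[support] * scores))

    # Example usage with made-up data and a made-up dual vector:
    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    lam = np.array([0.25, 0.0, 0.25, 0.0])        # hypothetical dual solution
    print(svm_predict(np.array([1.0, 1.0]), X, y, lam))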


SVM with Soft Margin

Linearly Separable and Non-separable

Regularization and the non-separable case

The derivation of the SVM as presented so far assumed that the data is linearly separable. While mapping data to a high-dimensional feature space via φ does generally increase the likelihood that the data is separable, we can't guarantee that it always will be so. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we'd want to do, since that might be susceptible to outliers. For instance, the left figure below shows an optimal margin classifier, and when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.

To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using ℓ1 regularization) as follows:

min_{γ, w, b} (1/2)‖w‖^2 + C Σ_{i=1}^m ξ_i
s.t. y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, i = 1, ..., m,
     ξ_i ≥ 0, i = 1, ..., m.

Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξ_i (with ξ_i > 0), we would pay a cost of the objective function being increased by C ξ_i. The parameter C controls the relative weighting between the twin goals of making ‖w‖^2 small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1.

Soft Margin

Slack variables: Given ξ_i ≥ 0, i = 1, ..., n,

θ̂ = argmin_θ (1/2)‖θ‖_2^2 + C Σ_{i=1}^n ξ_i
subject to 1 − y_i (x_i^T θ) − ξ_i ≤ 0, ξ_i ≥ 0, i = 1, ..., n,

where C > 0 is a regularization parameter.

Lagrangian form: Given λ ≥ 0 and β ≥ 0,

L(θ, ξ, λ, β) = (1/2)‖θ‖_2^2 + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n λ_i (1 − y_i (x_i^T θ) − ξ_i) − Σ_{i=1}^n β_i ξ_i.


Dual Formulation

Quadratic minimization:

λ̂ = argmin_λ (1/2) λ^T Q λ − λ^T 1 subject to C ≥ λ_i ≥ 0, i = 1, ..., n.

KKT condition ⇒ complementary slackness:

λ̂_i = 0 ⇒ y_i x_i^T θ̂ ≥ 1,
C ≥ λ̂_i > 0 ⇒ y_i x_i^T θ̂ ≤ 1.

λ̂_i > 0: support vector.


Soft Margin (figure)

Hinge Loss Function (figure)

Unconstrained Formulation

SVM with soft margin is equivalent to

θ̂ = argmin_{θ∈R^d} C Σ_{i=1}^n max{1 − y_i x_i^T θ, 0} + (1/2)‖θ‖_2^2.

Logistic loss: f(z) = log(1 + exp(−z))

Squared hinge loss: f(z) = max{1 − z, 0}^2

Huberized hinge loss:

f(z) = 1/2 − z          if z ≤ 0,
       (1/2)(z − 1)^2   if 0 < z ≤ 1,
       0                otherwise.

0-1 loss: f(z) = 1(z ≤ 0)
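For concreteness, here is a minimal NumPy sketch implementing each surrogate loss above as a function of the margin z = y x^T θ (a sketch only; the Huberized hinge follows the piecewise definition above, and the 0-1 loss uses the convention 1(z ≤ 0)):

    import numpy as np

    def hinge(z):         return np.maximum(1.0 - z, 0.0)
    def squared_hinge(z): return np.maximum(1.0 - z, 0.0) ** 2
    def logistic(z):      return np.log1p(np.exp(-z))
    def zero_one(z):      return (z <= 0).astype(float)

    def huberized_hinge(z):
        z = np.asarray(z, dtype=float)
        out = np.zeros_like(z)
        out[z <= 0] = 0.5 - z[z <= 0]             # 1/2 - z for z <= 0
        mid = (z > 0) & (z <= 1)
        out[mid] = 0.5 * (z[mid] - 1.0) ** 2      # (1/2)(z - 1)^2 for 0 < z <= 1
        return out                                # 0 otherwise

    z = np.linspace(-2, 2, 9)
    for name, f in [("hinge", hinge), ("sq-hinge", squared_hinge),
                    ("logistic", logistic), ("huberized", huberized_hinge),
                    ("0-1", zero_one)]:
        print(f"{name:10s}", np.round(f(z), 3))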

Huberized Hinge Loss (figure)

Nonsmooth (Stochastic) Optimization

Given a convex function L(θ), the subdifferential of L(θ) at θ is a set of vectors, denoted ∂L(θ), such that for any θ, θ' ∈ R^d,

L(θ') ≥ L(θ) + ζ^T (θ' − θ) for all ζ ∈ ∂L(θ).

f(z) = |z|: ∂f(z) = [−1, +1] at z = 0.

f(z) = max{1 − z, 0}: ∂f(z) = [−1, 0] at z = 1.

We consider a nonsmooth optimization problem

min_θ L(θ), where L(θ) = (1/n) Σ_{i=1}^n ℓ_i(θ).

Subgradient algorithm: At the (t+1)-th iteration, we take

θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂L(θ^(t)).

Stochastic variant: At the (t+1)-th iteration, we randomly select i from {1, ..., n} and take

θ^(t+1) = θ^(t) − η_t ζ^(t), where ζ^(t) ∈ ∂ℓ_i(θ^(t)).
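A minimal stochastic subgradient sketch applied to the soft-margin objective, written in the averaged form (1/n) Σ_i max{1 − y_i x_i^T θ, 0} + (µ/2)‖θ‖_2^2 so that each ℓ_i is strongly convex (this parameterization, the step size η_t = 1/(µ(t+1)), and the toy data are assumptions for illustration, not the lecture's exact setup):

    import numpy as np

    def stochastic_subgradient_svm(X, y, mu=0.1, n_iters=5000, seed=0):
        """Minimize (1/n) sum_i max(1 - y_i x_i^T theta, 0) + (mu/2)||theta||^2
        by sampling one example per iteration and stepping along a subgradient."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = np.zeros(d)
        for t in range(n_iters):
            i = rng.integers(n)
            eta = 1.0 / (mu * (t + 1))            # decreasing step size ~ 1/(t mu)
            margin = y[i] * (X[i] @ theta)
            g = mu * theta                        # subgradient of the regularizer
            if margin < 1.0:                      # hinge is active: add -y_i x_i
                g = g - y[i] * X[i]
            theta = theta - eta * g
        return theta

    # Usage on a toy dataset:
    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    theta = stochastic_subgradient_svm(X, y)
    print("theta =", theta, " margins =", y * (X @ theta))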


Stochastic Subgradient Algorithms

How many iterations do we need?

Take a sequence of decreasing step sizes η_t ≍ 1/(tµ), and assume ‖ζ‖_2 ≤ M for any ζ ∈ ∂L(θ). Given a pre-specified error ε, we need

T = O(M^2 / (µ^2 ε))

iterations such that

E‖θ̄^(T) − θ̂‖_2^2 ≤ ε, where θ̄^(T) = (2 / ((T+1)(T+2))) Σ_{t=0}^T (t+1) θ^(t).

Slower than gradient descent for the squared hinge loss SVM.


SVM with Kernel

Nonlinear SVM

Feature mapping:

φ(x) = (φ_1(x), ..., φ_m(x))^T.

SVM with feature mapping:

θ̂ = argmin_{θ∈R^m} C Σ_{i=1}^n max{1 − y_i φ(x_i)^T θ, 0} + (1/2)‖θ‖_2^2.

Dual problem:

λ̂ = argmin_{C1 ≥ λ ≥ 0} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k φ(x_i)^T φ(x_k) − λ^T 1.
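To make the feature-mapping view concrete, here is a small sketch that builds an explicit degree-2 polynomial feature map and checks numerically that φ(x)^T φ(z) equals the kernel (1 + x^T z)^2 used below (the particular map with √2 scalings is one standard choice, assumed here for illustration, as are the test points):

    import numpy as np

    def phi(x):
        """Explicit feature map for the degree-2 polynomial kernel (1 + x^T z)^2, d = 2."""
        x1, x2 = x
        s = np.sqrt(2.0)
        return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

    def poly_kernel(x, z):
        return (1.0 + x @ z) ** 2

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 0.5])
    print(phi(x) @ phi(z))        # inner product in feature space
    print(poly_kernel(x, z))      # same value, computed without forming phi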


Feature Mapping (figure)

Kernel Trick

Kernel function:

K(x_i, x_j) = 〈φ(x_i), φ(x_j)〉.

Dual problem:

λ̂ = argmin_{C1 ≥ λ ≥ 0} (1/2) Σ_{i=1}^n Σ_{k=1}^n λ_i λ_k y_i y_k K(x_i, x_k) − λ^T 1.

Gaussian/RBF kernel (infinite-dimensional mapping):

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖_2^2).

Polynomial kernel (finite-dimensional mapping):

K(x_i, x_j) = (1 + x_i^T x_j)^r.
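A hedged sketch of the kernel trick in practice: it forms the RBF Gram matrix directly with NumPy and, assuming scikit-learn is available, fits its SVC classifier (which solves a dual of this form) with the same kernel; the toy data, γ = 0.5, and C = 1.0 are illustrative choices.

    import numpy as np
    from sklearn.svm import SVC   # assumes scikit-learn is installed

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
    y = np.array([1, 1, -1, -1])
    gamma = 0.5

    # RBF Gram matrix: K_ij = exp(-gamma * ||x_i - x_j||_2^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    print(np.round(K, 3))

    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print("support vector indices:", clf.support_)
    print("predictions:", clf.predict(X))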


More on Kernels

Reproducing Kernel Hilbert Space

Representer Theorem

Mercer Theorem: K ⪰ 0, where K_ij = K(x_i, x_j).

Nonparametric SVM:

f̂ = argmin_{f∈H} C Σ_{i=1}^n max{1 − y_i f(x_i), 0} + (1/2)‖f‖_H^2.


Randomized Coordinate Minimization

Solve the dual problem:

λ̂ = argmin_λ (1/2) λ^T Q λ − λ^T 1 subject to C ≥ λ_i ≥ 0, i = 1, ..., n.

Minimize over λ_j with λ_1, ..., λ_{j−1}, λ_{j+1}, ..., λ_n fixed.

At the (t+1)-th iteration, we select i ∈ {1, ..., n} and take

λ_i^(t+1) = argmin_{λ_i} (1/2) Q_ii λ_i^2 + Σ_{j≠i} Q_ij λ_i λ_j^(t) − λ_i subject to C ≥ λ_i ≥ 0,

and λ_j^(t+1) = λ_j^(t) for all j ≠ i.


Computational Cost per Iteration

An analytical updating formula:

λ_i^(t+1) = argmin_{C ≥ λ_i ≥ 0} (1/2) Q_ii λ_i^2 + λ_i Σ_{j≠i} Q_ij λ_j^(t) − λ_i
          = C            if λ̃^(t+1) ≥ C,
            0            if λ̃^(t+1) ≤ 0,
            λ̃^(t+1)      otherwise,

where λ̃^(t+1) = Q_ii^{−1} (1 − Σ_{j≠i} Q_ij λ_j^(t)) is the unconstrained coordinate-wise minimizer.

What is the computational cost per iteration? O(n)

Better than gradient descent? O(n^2)
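A minimal randomized coordinate-minimization sketch for the box-constrained dual above (a sketch under the clipped update just derived, with λ̃ = (1 − Σ_{j≠i} Q_ij λ_j)/Q_ii; the toy data, C = 1.0, and iteration count are illustrative choices):

    import numpy as np

    def dual_coordinate_descent(Q, C=1.0, n_iters=2000, seed=0):
        """Minimize (1/2) lam^T Q lam - lam^T 1 subject to 0 <= lam_i <= C,
        updating one randomly chosen coordinate per iteration."""
        rng = np.random.default_rng(seed)
        n = Q.shape[0]
        lam = np.zeros(n)
        for _ in range(n_iters):
            i = rng.integers(n)
            # Unconstrained minimizer in coordinate i, then clip to [0, C]; O(n) work.
            rest = Q[i] @ lam - Q[i, i] * lam[i]      # sum_{j != i} Q_ij lam_j
            lam[i] = np.clip((1.0 - rest) / Q[i, i], 0.0, C)
        return lam

    X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    Q = np.outer(y, y) * (X @ X.T)
    lam = dual_coordinate_descent(Q)
    theta = (lam * y) @ X                             # recover theta = sum_i lam_i y_i x_i
    print("lam =", np.round(lam, 4), " margins =", np.round(y * (X @ theta), 3))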


Geometric Illustration (figure: optimal solution)

Coordinate Strong Smoothness

There exists a constant L_j such that for any λ and λ' with λ_i = λ'_i for i ≠ j, we have

F(λ') − F(λ) − ∇_j F(λ)(λ'_j − λ_j) ≤ (L_j / 2)(λ'_j − λ_j)^2.

There exists a constant µ such that for any λ and λ', we have

F(λ') − F(λ) − ∇F(λ)^T (λ' − λ) ≥ (µ/2)‖λ' − λ‖_2^2.

Coordinate vs. global strong smoothness:

L / (nµ) ≤ (max_j L_j) / µ ≤ L / µ.


Iteration Complexity

How many iterations do we need?

Given a pre-specified error ε, we need

T = O((n L_max / µ) log(1/ε))

iterations such that

E‖λ^(T) − λ̂‖_2^2 ≤ ε.

Faster than the gradient descent algorithm? Yes!


Randomized vs. Cyclic vs. Shuffled

Cyclic order: 1, 2, ..., d, 1, 2, ..., d, ...

Shuffled and random shuffled order

Randomized order

[Figure (panels (a) after 10^2 iterations and (b) after 10^4 iterations): relative error (f(x_k) − f*) / (f(x_0) − f*) vs. iterations for five methods: C-CD, cyclic CGD with small stepsize 1/λ_max(A), randomly permuted CD, randomized CD, and GD, minimizing f(x) = x^T A x with n = 100 and A = A_c, c = 0.8.]

Worst case vs. average case

Page 85: Lecture 3: Support Vector Machines - ISyE Home | ISyEtzhao80/Lectures/Lecture_3.pdf · Lecture 3: Support Vector Machines Tuo Zhao Schools of ISYE and CSE, Georgia Tech. CS7641/ISYE/CSE

CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis

Randomized v.s. Cyclic v.s. Shuffled

Cyclic Order: 1, 2, ..., d, 1, 2, ..., d, ...

Shuffled and Random Shuffled Order

Randomized Order

(a) 102 iterations (b) 104 iterations

Figure 1: Relative errorf(xk)�f⇤f(x0)�f⇤ v.s. iterations, for 5 methods C-CD, cyclic CGD with small stepsize 1/�max(A), randomly permuted

CD, randomized CD and GD. Minimize f(x) = xT Ax, n = 100, A = Ac with c = 0.8.

Next, we discuss numerical experiments for randomly generated A; for simplicity, we will normalize the

diagonal entries of A to be 1. Since di↵erent random distributions of A will lead to di↵erent comparison

results of various algorithms, we test many distributions and try to understand for which distributions C-CD

performs well/poorly. To guarantee that A is positive semidefinite, we generate a random matrix U and let

A = UT U . We generate the entries of U i.i.d. from a certain random distribution (which implies that the

columns of U are independent), such as N (0, 1) (standard Gaussian distribution), Unif[0, 1] (uniform [0, 1]

distribution), log-normal distribution, etc. It turns out for most distributions C-CD is slower than R-CD

and RP-CD, but for standard Gaussian distribution C-CD is better than R-CD and RP-CD.

Inspired by the numerical experiments for the example (24), we suspect that the performance of C-CD

depends on how large the o↵-diagonal entries of A are (with fixed diagonal entries). When Uij ⇠ N (0, 1),

UT U has small o↵-diagonal entries since E(Aij) = 0, i 6= j. When Uij ⇠ Unif[0, 1], UT U has large o↵-

diagonal entries since E((UT U)ij) = n/4. The di↵erence of these two scenarios lie in the fact that Unif[0, 1]

has non-zero mean while N (0, 1) has zero mean. It is then interesting to compare the case of drawing entries

from Unif[0, 1] with the case of drawing entries from Unif[�0.5, 0.5]. We will do some sort of A/B testing for

each distribution: compare the zero-mean case with the non-zero mean case. To quantify the “o↵-diagonals

over diagonals ratio”, we define

�i =

Pj 6=i |Aij |Aii

, i = 1, . . . , n; �avg =1

n

X

i

�i.

We also define

⌧ =L

Lmin=

�max(A)

mini Aii.

As we have normalized Aii to be 1, we can simply write

�i =X

j 6=i

|Aij |, ⌧ = �max(A) = L.

Obviously L = �max(A) 1 + maxi �i. In many examples we find L to be close to 1 + �avg especially when

both of them are large.

24

Worst Case v.s. Average Case

Tuo Zhao — Lecture 3: Support Vector Machines 42/47


Bias Variance Tradeoff


Given a regression model:

\[
Y = f^*(X) + \varepsilon, \quad \text{where } \mathbb{E}\,\varepsilon = 0 \text{ and } \mathbb{E}\,\varepsilon^2 = \sigma^2.
\]

Bias Variance Decomposition:

Here \(f\) denotes the fitted model (so it is random through the training noise \(\varepsilon\)), while \(f^*\) is the true regression function:
\[
\begin{aligned}
\mathbb{E}_{\varepsilon, X}\,(Y - f(X))^2
&= \mathbb{E}_{\varepsilon, X}\big(Y - f^*(X) + f^*(X) - \mathbb{E}_{\varepsilon|X} f(X) + \mathbb{E}_{\varepsilon|X} f(X) - f(X)\big)^2 \\
&= \sigma^2 + \underbrace{\mathbb{E}_X\big(f^*(X) - \mathbb{E}_{\varepsilon|X} f(X)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_X\big(f(X) - \mathbb{E}_{\varepsilon|X} f(X)\big)^2}_{\text{Variance}}.
\end{aligned}
\]

Bias ↑ + Variance ↓ or Bias ↓ + Variance ↑.
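A small Monte Carlo sketch (an illustration, not from the slides) can make these two terms concrete: it repeatedly refits polynomial models of a fixed degree on fresh training noise and estimates the Bias² and Variance terms above by averaging over repetitions. The true function \(f^*(x) = \sin(2\pi x)\), the noise level \(\sigma = 0.3\), and the degrees tried are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function
sigma, n_train, n_reps = 0.3, 30, 500
x_train = np.linspace(0.0, 1.0, n_train)   # fixed design for simplicity
x_test = np.linspace(0.0, 1.0, 200)

for degree in [1, 3, 5, 9]:
    preds = np.empty((n_reps, len(x_test)))
    for r in range(n_reps):
        # Fresh training noise on each repetition: y = f*(x) + eps.
        y = f_star(x_train) + sigma * rng.standard_normal(n_train)
        coef = np.polyfit(x_train, y, degree)
        preds[r] = np.polyval(coef, x_test)
    mean_pred = preds.mean(axis=0)                        # approximates E_{eps|X} f(X)
    bias2 = np.mean((f_star(x_test) - mean_pred) ** 2)    # Bias^2 term
    variance = np.mean(preds.var(axis=0))                 # Variance term
    print(f"degree {degree:2d}:  Bias^2 = {bias2:.4f},  Variance = {variance:.4f},  "
          f"sum = {bias2 + variance:.4f}")
```

With these settings, low-degree fits would be expected to show larger bias and smaller variance, and high-degree fits the reverse, which is exactly the tradeoff summarized above.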





Overfitting



Optimal Model Complexity

Simultaneously control bias and variance!
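One simple way to pursue this goal is to compare models of increasing complexity on held-out data and keep the one with the smallest validation error. The sketch below (not from the slides; the polynomial family, split sizes, and noise level are illustrative assumptions) does this for the same kind of synthetic example used above.

```python
import numpy as np

rng = np.random.default_rng(1)
f_star = lambda x: np.sin(2 * np.pi * x)             # assumed true function
x = rng.uniform(0.0, 1.0, 60)
y = f_star(x) + 0.3 * rng.standard_normal(x.size)    # noisy observations

# Hold out a validation set and pick the degree with the smallest validation error.
train, val = np.arange(40), np.arange(40, 60)
val_mse = {}
for degree in range(1, 13):
    coef = np.polyfit(x[train], y[train], degree)
    val_mse[degree] = np.mean((y[val] - np.polyval(coef, x[val])) ** 2)

best = min(val_mse, key=val_mse.get)
print({d: round(e, 3) for d, e in val_mse.items()})
print("selected degree:", best)
```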

Next Lecture: Model Selection and Regularization
