Introduction to Machine Learning (67577), Lecture 6

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Convexity, Optimization, Surrogates, SGD

Outline

1 Convexity

2 Convex Optimization
    Ellipsoid
    Gradient Descent

3 Convex Learning Problems

4 Surrogate Loss Functions

5 Learning Using Stochastic Gradient Descent

Definition (Convex Set)

A set C in a vector space is convex if for any two vectors u, v in C, the line segment between u and v is contained in C. That is, for any α ∈ [0, 1] we have that the convex combination αu + (1 − α)v is in C.

(Figure: an example of a non-convex set and of a convex set.)

Definition (Convex function)

Let C be a convex set. A function f : C → R is convex if for every u, v ∈ C and α ∈ [0, 1],

f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v) .

(Figure: the chord between (u, f(u)) and (v, f(v)) lies above the graph of f; at the point αu + (1 − α)v, the value αf(u) + (1 − α)f(v) sits above f(αu + (1 − α)v).)
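As a small added illustration (not part of the original slides), the defining inequality can be tested numerically for a candidate function; the sketch below assumes the squared norm as a convex example and a sum of sines as a non-convex one.

```python
import numpy as np

def is_convex_on_samples(f, dim=3, trials=1000, seed=0):
    """Numerically test f(a*u + (1-a)*v) <= a*f(u) + (1-a)*f(v) on random samples.

    A failed check proves non-convexity; passing is only evidence of convexity.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        if f(a * u + (1 - a) * v) > a * f(u) + (1 - a) * f(v) + 1e-12:
            return False
    return True

print(is_convex_on_samples(lambda x: np.dot(x, x)))      # squared norm: True
print(is_convex_on_samples(lambda x: np.sin(x).sum()))   # sum of sines: typically False
```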

Epigraph

A function f is convex if and only if its epigraph is a convex set:

epigraph(f) = {(x, β) : f(x) ≤ β} .

(Figure: the epigraph of f, i.e. the region above the graph of f(x).)

Property I: local minima are global

If f is convex then every local minimum of f is also a global minimum.

Let B(u, r) = {v : ‖v − u‖ ≤ r}

f(u) is a local minimum of f at u if ∃ r > 0 s.t. ∀ v ∈ B(u, r) we have f(v) ≥ f(u)

It follows that for any v (not necessarily in B), there is a small enough α > 0 such that u + α(v − u) ∈ B(u, r) and therefore

f(u) ≤ f(u + α(v − u)) .

If f is convex, we also have that

f(u + α(v − u)) = f(αv + (1 − α)u) ≤ (1 − α)f(u) + αf(v) .

Combining, we obtain that f(u) ≤ f(v).

This holds for every v, hence f(u) is also a global minimum of f.
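To spell out the combining step (an added worked line, using the same notation as the slide), the two displayed inequalities give

```latex
f(\mathbf{u}) \le f(\alpha \mathbf{v} + (1-\alpha)\mathbf{u})
            \le (1-\alpha) f(\mathbf{u}) + \alpha f(\mathbf{v})
\quad\Longrightarrow\quad
\alpha f(\mathbf{u}) \le \alpha f(\mathbf{v})
\quad\Longrightarrow\quad
f(\mathbf{u}) \le f(\mathbf{v}),
```

where the last implication divides by α > 0.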


Property II: tangents lie below f

If f is convex and differentiable, then

∀u, f(u) ≥ f(w) + 〈∇f(w), u − w〉

(recall, ∇f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wd) is the gradient of f at w)

(Figure: the tangent f(w) + 〈u − w, ∇f(w)〉 at w lies below the graph of f.)

Sub-gradients

v is a sub-gradient of f at w if ∀u, f(u) ≥ f(w) + 〈v, u − w〉

The differential set, ∂f(w), is the set of sub-gradients of f at w

Lemma: f is convex iff for every w, ∂f(w) ≠ ∅

(Figure: a sub-gradient v at w defines a line f(w) + 〈u − w, v〉 lying below the graph of f.)

Property II: tangents lie below f

f is "locally flat" around w (i.e. 0 is a sub-gradient) iff w is a global minimizer

(Figure: a convex function whose tangent at the minimizer w is horizontal.)

Lipschitzness

Definition (Lipschitzness)

A function f : C → R is ρ-Lipschitz if for every w1, w2 ∈ C we have that |f(w1) − f(w2)| ≤ ρ ‖w1 − w2‖.

Lemma

If f is convex then f is ρ-Lipschitz iff the norm of all sub-gradients of f is at most ρ


Convex optimization

Approximately solve:

argmin_{w ∈ C} f(w)

where C is a convex set and f is a convex function.

Special cases:

Feasibility problem: f is a constant function

Unconstrained minimization: C = Rd

Can reduce one to another:

Adding the function I_C(w) to the objective eliminates the constraint

Adding the constraint f(w) ≤ f∗ + ε eliminates the objective


The Ellipsoid Algorithm

Consider a feasibility problem: find w ∈ C

Assumptions:

B(w∗, r) ⊆ C ⊂ B(0, R)

Separation oracle: Given w, the oracle tells if it's in C or not. If w ∉ C then the oracle finds v s.t. for every w′ ∈ C we have 〈w, v〉 < 〈w′, v〉

(Figure: a point w outside C, separated from the convex set C by a hyperplane.)

The Ellipsoid Algorithm

We implicitly maintain an ellipsoid: E_t = E(A_t^{1/2}, w_t)

Start with w_1 = 0, A_1 = I

For t = 1, 2, . . .

  Call the oracle with w_t
  If w_t ∈ C, break and return w_t
  Otherwise, let v_t be the vector defining a separating hyperplane
  Update:

    w_{t+1} = w_t + (1/(d+1)) · A_t v_t / √(v_tᵀ A_t v_t)

    A_{t+1} = (d²/(d² − 1)) · ( A_t − (2/(d+1)) · A_t v_t v_tᵀ A_t / (v_tᵀ A_t v_t) )

Theorem

The Ellipsoid converges after at most 2d(2d + 2) log(R/r) iterations.
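A compact NumPy sketch of these updates (an added illustration, not from the slides): `oracle` is a hypothetical user-supplied separation oracle returning None when w ∈ C and otherwise a vector v with 〈w, v〉 < 〈w′, v〉 for all w′ ∈ C; the toy set at the bottom is likewise an assumption for demonstration.

```python
import numpy as np

def ellipsoid(oracle, d, max_iter=1000):
    """Feasibility version of the ellipsoid method.

    oracle(w) -> None if w is in C, otherwise a separating vector v
    with <w, v> < <w', v> for every w' in C.
    """
    w = np.zeros(d)
    A = np.eye(d)                      # E_t = E(A^{1/2}, w)
    for _ in range(max_iter):
        v = oracle(w)
        if v is None:                  # w is feasible
            return w
        Av = A @ v
        denom = np.sqrt(v @ Av)
        w = w + Av / ((d + 1) * denom)                                   # center update
        A = (d**2 / (d**2 - 1)) * (A - (2.0 / (d + 1)) * np.outer(Av, Av) / (v @ Av))
    return None                        # no feasible point found within max_iter

# Toy usage: C = {w : <w, a> >= 1} for a fixed direction a (a hypothetical set).
a = np.array([1.0, 2.0])
w_feasible = ellipsoid(lambda w: None if w @ a >= 1 else a, d=2)
```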


Implementing the separation oracle using sub-gradients

Suppose C = ∩_{i=1}^n {w : fi(w) ≤ 0} where each fi is a convex function.

Given w, we can check if fi(w) ≤ 0 for every i

If fi(w) > 0 for some i, consider v ∈ ∂fi(w); then, for every w′ ∈ C,

0 ≥ fi(w′) ≥ fi(w) + 〈w′ − w, v〉 > 〈w′ − w, v〉

So, the oracle can return −v
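A minimal sketch of this oracle (an added illustration; `constraints` is a hypothetical list of (f_i, subgradient-of-f_i) pairs supplied by the user). It plugs directly into the `ellipsoid` sketch above.

```python
def separation_oracle(w, constraints):
    """constraints: list of (f, subgrad) pairs, each f convex, subgrad(w) in the set ∂f(w).

    Returns None if w satisfies all constraints (w in C); otherwise a vector u
    such that <w, u> < <w', u> for every w' in C.
    """
    for f, subgrad in constraints:
        if f(w) > 0:
            v = subgrad(w)
            # For every w' in C: 0 >= f(w') >= f(w) + <w' - w, v> > <w' - w, v>,
            # i.e. <w', v> < <w, v>, so -v separates w from C.
            return -v
    return None
```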

The Ellipsoid Algorithm for unconstrained minimization

Consider min_w f(w) and let w∗ be a minimizer

Let C = {w : f(w) − f(w∗) − ε ≤ 0}

We can apply the Ellipsoid algorithm while letting vt ∈ ∂f(wt)

Analysis:

Let r be s.t. B(w∗, r) ⊆ C

For example, if f is ρ-Lipschitz we can take r = ε/ρ

Let R = ‖w∗‖ + r

Then, after 2d(2d + 2) log(R/r) iterations, wt must be in C

For f being ρ-Lipschitz, we obtain the iteration bound

2d(2d + 2) log( ρ‖w∗‖/ε + 1 )


Gradient Descent

Start with an initial w(1) (usually, the zero vector)

At iteration t, update

w(t+1) = w(t) − η∇f(w(t)) ,

where η > 0 is a parameter

Intuition:

By Taylor's approximation, if w is close to w(t) then f(w) ≈ f(w(t)) + 〈w − w(t), ∇f(w(t))〉

Hence, we want to minimize the approximation while staying close to w(t):

w(t+1) = argmin_w  (1/2) ‖w − w(t)‖² + η ( f(w(t)) + 〈w − w(t), ∇f(w(t))〉 ) .
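Setting the gradient of this regularized approximation to zero recovers the update rule (an added worked step, same notation as above):

```latex
\nabla_{\mathbf{w}}\Big[ \tfrac{1}{2}\|\mathbf{w}-\mathbf{w}^{(t)}\|^2
  + \eta\big( f(\mathbf{w}^{(t)}) + \langle \mathbf{w}-\mathbf{w}^{(t)},\, \nabla f(\mathbf{w}^{(t)})\rangle \big)\Big]
  = (\mathbf{w}-\mathbf{w}^{(t)}) + \eta\,\nabla f(\mathbf{w}^{(t)}) = 0
\quad\Longrightarrow\quad
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\nabla f(\mathbf{w}^{(t)}).
```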


Gradient Descent

Initialize w(1) = 0

Update w(t+1) = w(t) − η∇f(w(t))

Output w̄ = (1/T) Σ_{t=1}^T w(t)
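A minimal NumPy sketch of this scheme with the averaged output (an added illustration, not from the slides; `grad` is assumed to be a user-supplied gradient, or sub-gradient, oracle for f, and the toy objective in the usage line is an assumption):

```python
import numpy as np

def gradient_descent(grad, dim, eta, T):
    """Run T steps of (sub-)gradient descent from w = 0 and return the averaged iterate."""
    w = np.zeros(dim)
    avg = np.zeros(dim)
    for _ in range(T):
        avg += w / T                   # accumulate w_bar = (1/T) * sum_t w^(t)
        w = w - eta * grad(w)          # w^(t+1) = w^(t) - eta * grad f(w^(t))
    return avg

# Toy usage: minimize f(w) = ||w - 1||^2, whose gradient is 2(w - 1).
w_bar = gradient_descent(lambda w: 2 * (w - 1.0), dim=3, eta=0.1, T=500)
```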


Sub-Gradient Descent

Replace gradients with sub-gradients:

w(t+1) = w(t) − ηvt ,

where vt ∈ ∂f(w(t))


Analyzing sub-gradient descent

Lemma

Σ_{t=1}^T ( f(w(t)) − f(w∗) ) ≤ Σ_{t=1}^T 〈w(t) − w∗, vt〉
                              = ( ‖w(1) − w∗‖² − ‖w(T+1) − w∗‖² ) / (2η) + (η/2) Σ_{t=1}^T ‖vt‖² .

Proof:

The inequality is by the definition of sub-gradients

The equality follows from the definition of the update using algebraic manipulations
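The "algebraic manipulations" amount to the following identity (an added worked step): since w(t+1) = w(t) − ηvt, expanding ‖w(t+1) − w∗‖² gives

```latex
\langle \mathbf{w}^{(t)}-\mathbf{w}^{\star}, \mathbf{v}_t\rangle
 = \frac{\|\mathbf{w}^{(t)}-\mathbf{w}^{\star}\|^2 - \|\mathbf{w}^{(t+1)}-\mathbf{w}^{\star}\|^2}{2\eta}
   + \frac{\eta}{2}\,\|\mathbf{v}_t\|^2 ,
```

and summing over t = 1, . . . , T telescopes the squared-distance terms.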

Analyzing sub-gradient descent for Lipschitz functions

Since f is convex and ρ-Lipschitz, ‖vt‖ ≤ ρ for every t

Therefore,

(1/T) Σ_{t=1}^T ( f(w(t)) − f(w∗) ) ≤ ‖w∗‖² / (2ηT) + ηρ² / 2

For every w∗, if T ≥ ‖w∗‖²ρ²/ε² and η = √( ‖w∗‖² / (ρ²T) ), then the right-hand side is at most ε

By convexity, f(w̄) ≤ (1/T) Σ_{t=1}^T f(w(t)), hence f(w̄) − f(w∗) ≤ ε

Corollary: Sub-gradient descent needs ‖w∗‖²ρ²/ε² iterations to converge
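An added worked step showing why this step size suffices: it balances the two terms of the bound,

```latex
\frac{\|\mathbf{w}^{\star}\|^2}{2\eta T} + \frac{\eta \rho^2}{2}
 \;=\; \frac{\|\mathbf{w}^{\star}\|\rho}{2\sqrt{T}} + \frac{\|\mathbf{w}^{\star}\|\rho}{2\sqrt{T}}
 \;=\; \frac{\|\mathbf{w}^{\star}\|\rho}{\sqrt{T}}
 \;\le\; \epsilon
 \qquad\text{for } \eta=\sqrt{\tfrac{\|\mathbf{w}^{\star}\|^2}{\rho^2 T}},\;
 T \ge \tfrac{\|\mathbf{w}^{\star}\|^2\rho^2}{\epsilon^2}.
```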


Example: Finding a Separating Hyperplane

Given (x1, y1), . . . , (xm, ym), we would like to find a separating w:

∀i, yi〈w, xi〉 > 0 .

Notation:

Denote by w∗ a separating hyperplane of unit norm and let γ = mini yi〈w∗, xi〉

W.l.o.g. assume ‖xi‖ = 1 for every i.


Separating Hyperplane using the Ellipsoid

We can take the initial ball to be the unit ball

The separation oracle looks for i s.t. yi〈w(t),xi〉 ≤ 0

If there’s no such i, we’re done. Otherwise, the oracle returns yixi

The algorithm stops after at most 2d(2d+ 2) log(1/γ) iterations


Separating Hyperplane using Sub-gradient Descent

Consider the problem:

min_w f(w) where f(w) = max_i −yi〈w, xi〉

Observe:

f is convex

A sub-gradient of f at w is −yixi for some i ∈ argmax_i −yi〈w, xi〉

f is 1-Lipschitz

f(w∗) = −γ

Therefore, after t > 1/γ² iterations, we have f(w(t)) < f(w∗) + γ = 0

So, w(t) is a separating hyperplane

The resulting algorithm is closely related to the Batch Perceptron


The Batch Perceptron

Initialize w(1) = 0

While there exists i s.t. yi〈w(t), xi〉 ≤ 0, update

w(t+1) = w(t) + yixi

Exercise: why did we eliminate η?
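A short NumPy sketch of the Batch Perceptron as stated (an added illustration; `X` is an m × d array of instances and `y` a vector of ±1 labels, assumed linearly separable so that the loop terminates, and the toy data at the bottom are hypothetical):

```python
import numpy as np

def batch_perceptron(X, y, max_iter=10000):
    """Find w with y_i * <w, x_i> > 0 for all i, assuming the data are separable."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ w)
        bad = np.flatnonzero(margins <= 0)     # indices violating y_i <w, x_i> > 0
        if bad.size == 0:
            return w                           # w separates the data
        i = bad[0]
        w = w + y[i] * X[i]                    # w^(t+1) = w^(t) + y_i x_i
    raise RuntimeError("no separator found within max_iter updates")

# Toy usage with a separable sample (hypothetical data):
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
```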


Ellipsoid vs. Sub-gradient

For f convex and ρ-Lipschitz:

                         iterations              cost of iteration
Ellipsoid                d² log(ρ‖w∗‖/ε)         d² + "gradient oracle"
Sub-gradient descent     ‖w∗‖²ρ²/ε²              d + "gradient oracle"

For separating hyperplane:

                         iterations              cost of iteration
Ellipsoid                d² log(1/γ)             d² + dm
Sub-gradient descent     1/γ²                    dm


Convex Learning Problems

Definition (Convex Learning Problem)

A learning problem, (H, Z, ℓ), is called convex if the hypothesis class H is a convex set and for all z ∈ Z, the loss function, ℓ(·, z), is a convex function (where, for any z, ℓ(·, z) denotes the function f : H → R defined by f(w) = ℓ(w, z)).

The ERM_H problem w.r.t. a convex learning problem is a convex optimization problem: min_{w ∈ H} (1/m) Σ_{i=1}^m ℓ(w, zi)

Example – least squares: H = Rd, Z = Rd × R, ℓ(w, (x, y)) = (〈w, x〉 − y)²
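As an added illustration of this ERM problem for the least-squares example, the convex objective (1/m) Σ_i (〈w, xi〉 − yi)² can be minimized with the gradient-descent sketch above, or, since H = Rd is unconstrained, in closed form via the normal equations; the data below are hypothetical.

```python
import numpy as np

def erm_least_squares(X, y):
    """Minimize (1/m) * sum_i (<w, x_i> - y_i)^2 over w in R^d (unconstrained H)."""
    # The objective is convex, so the first-order condition X^T X w = X^T y
    # characterizes a global minimizer; lstsq solves it in a numerically stable way.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical toy data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
w_hat = erm_least_squares(X, y)
```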


Learnability of convex learning problems

Claim: Not all convex learning problems over Rd are learnable

The intuitive reason is numerical stability

But, with two additional mild conditions, we obtain learnability:

H is bounded

The loss function (or its gradient) is Lipschitz

Convex-Lipschitz-bounded learning problem

Definition (Convex-Lipschitz-Bounded Learning Problem)

A learning problem, (H, Z, ℓ), is called Convex-Lipschitz-Bounded, with parameters ρ, B, if the following holds:

The hypothesis class H is a convex set and for all w ∈ H we have ‖w‖ ≤ B.

For all z ∈ Z, the loss function, ℓ(·, z), is a convex and ρ-Lipschitz function.

Example:

H = {w ∈ Rd : ‖w‖ ≤ B}

X = {x ∈ Rd : ‖x‖ ≤ ρ}, Y = R,

ℓ(w, (x, y)) = |〈w, x〉 − y|


Convex-Smooth-bounded learning problem

A function f is β-smooth if it is differentiable and its gradient is β-Lipschitz.

Definition (Convex-Smooth-Bounded Learning Problem)

A learning problem, (H, Z, ℓ), is called Convex-Smooth-Bounded, with parameters β, B, if the following holds:

The hypothesis class H is a convex set and for all w ∈ H we have ‖w‖ ≤ B.

For all z ∈ Z, the loss function, ℓ(·, z), is a convex, non-negative, and β-smooth function.

Example:

H = {w ∈ Rd : ‖w‖ ≤ B}

X = {x ∈ Rd : ‖x‖ ≤ β/2}, Y = R,

ℓ(w, (x, y)) = (〈w, x〉 − y)²


Learnability

We will later show that all Convex-Lipschitz-Bounded and Convex-Smooth-Bounded learning problems are learnable, with sample complexity that depends only on ε, δ, B, and ρ or β.


Surrogate Loss Functions

In many natural cases, the loss function is not convex

For example, the 0−1 loss for halfspaces:

ℓ0−1(w, (x, y)) = 1[y ≠ sign(〈w, x〉)] = 1[y〈w, x〉 ≤ 0] .

Non-convex loss functions usually yield intractable learning problems

Popular approach: circumvent the hardness by upper bounding the non-convex loss function using a convex surrogate loss function

Hinge-loss

ℓhinge(w, (x, y)) := max{0, 1 − y〈w, x〉} .

(Figure: the hinge loss ℓhinge and the 0−1 loss ℓ0−1 plotted as functions of y〈w, x〉.)
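A small added illustration comparing the two losses and checking the surrogate (upper-bound) property pointwise on random inputs (the sampled data are hypothetical):

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0-1 loss: 1 if y <w, x> <= 0, else 0."""
    return float(y * np.dot(w, x) <= 0)

def hinge_loss(w, x, y):
    """Hinge loss: max(0, 1 - y <w, x>), a convex upper bound on the 0-1 loss."""
    return max(0.0, 1.0 - y * np.dot(w, x))

# Sanity check on random points: the hinge loss dominates the 0-1 loss everywhere.
rng = np.random.default_rng(0)
for _ in range(1000):
    w, x = rng.normal(size=3), rng.normal(size=3)
    y = rng.choice([-1, 1])
    assert hinge_loss(w, x, y) >= zero_one_loss(w, x, y)
```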

Error Decomposition Revisited

Suppose we have a learner for the hinge-loss that guarantees:

L_D^hinge(A(S)) ≤ min_{w ∈ H} L_D^hinge(w) + ε .

Using the surrogate property,

L_D^0−1(A(S)) ≤ min_{w ∈ H} L_D^hinge(w) + ε .

We can further rewrite the upper bound as:

L_D^0−1(A(S)) ≤ min_{w ∈ H} L_D^0−1(w) + ( min_{w ∈ H} L_D^hinge(w) − min_{w ∈ H} L_D^0−1(w) ) + ε
             = ε_approximation + ε_optimization + ε_estimation

The optimization error is a result of our inability to minimize the training loss with respect to the original loss.


Learning Using Stochastic Gradient Descent

Consider a convex-Lipschitz-bounded learning problem.

Recall: our goal is to (probably approximately) solve:

min_{w ∈ H} L_D(w) where L_D(w) = E_{z∼D}[ℓ(w, z)]

So far, learning was based on the empirical risk, L_S(w)

We now consider directly minimizing L_D(w)


Stochastic Gradient Descent

min_{w ∈ H} L_D(w) where L_D(w) = E_{z∼D}[ℓ(w, z)]

Recall the gradient descent method, in which we initialize w(1) = 0 and update w(t+1) = w(t) − η∇L_D(w)

Observe: ∇L_D(w) = E_{z∼D}[∇ℓ(w, z)]

We can't calculate ∇L_D(w) because we don't know D

But we can estimate it by ∇ℓ(w, z) for z ∼ D

If we take a step in the direction v = ∇ℓ(w, z) then in expectation we're moving in the right direction

In other words, v is an unbiased estimate of the gradient

We'll show that this is good enough


Page 96: Introduction to Machine Learning (67577) Lecture 6shais/Lectures2014/lecture6.pdf · Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 6 / 47. Property I: local minima are global

Stochastic Gradient Descent

initialize: w^{(1)} = 0

for t = 1, 2, …, T:
    choose z_t ∼ D
    let v_t ∈ ∂ℓ(w^{(t)}, z_t)
    update w^{(t+1)} = w^{(t)} − η v_t

output w̄ = (1/T) ∑_{t=1}^T w^{(t)}

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 42 / 47
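As a concrete illustration of the procedure above, here is a minimal Python sketch (not part of the lecture; the absolute loss, the synthetic stream of examples, and the rough choices of B and ρ are all illustrative assumptions). It uses ℓ(w, (x, y)) = |〈w, x〉 − y|, which is convex and Lipschitz in w, with subgradient sign(〈w, x〉 − y)·x.

import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 50_000
w_star = rng.normal(size=d)

def draw_example():
    """One fresh example z_t = (x_t, y_t) from a synthetic distribution D."""
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return x, y

def subgradient(w, x, y):
    """A subgradient of the absolute loss l(w, (x, y)) = |<w, x> - y| at w."""
    return np.sign(x @ w - y) * x

B = 2.0 * np.linalg.norm(w_star)        # assumed bound on the norm of the comparator
rho = 3.0 * np.sqrt(d)                  # rough Lipschitz bound (||x|| concentrates near sqrt(d))
eta = np.sqrt(B**2 / (rho**2 * T))      # step size used in the analysis

w = np.zeros(d)                         # w^{(1)} = 0
w_sum = np.zeros(d)
for _ in range(T):
    w_sum += w                          # accumulate w^{(t)} before the update
    x, y = draw_example()
    v = subgradient(w, x, y)            # v_t in the subdifferential of l(w^{(t)}, z_t)
    w = w - eta * v                     # w^{(t+1)} = w^{(t)} - eta * v_t

w_bar = w_sum / T                       # averaged output of SGD
print(np.linalg.norm(w_bar - w_star))   # typically small on this easy synthetic problem

Note that each example is drawn fresh from the distribution, so every update uses an unbiased estimate of ∇L_D, exactly as in the algorithm above.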


Analyzing SGD for convex-Lipschitz-bounded

By algebraic manipulations, for any sequence v_1, …, v_T and any w⋆,

∑_{t=1}^T 〈w^{(t)} − w⋆, v_t〉 = (‖w^{(1)} − w⋆‖² − ‖w^{(T+1)} − w⋆‖²) / (2η) + (η/2) ∑_{t=1}^T ‖v_t‖²

Assuming that ‖v_t‖ ≤ ρ for all t and that ‖w⋆‖ ≤ B, we obtain

∑_{t=1}^T 〈w^{(t)} − w⋆, v_t〉 ≤ B²/(2η) + η ρ² T / 2

In particular, for η = √(B² / (ρ² T)) we get

∑_{t=1}^T 〈w^{(t)} − w⋆, v_t〉 ≤ B ρ √T .

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 43 / 47
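The displayed equality is a telescoping identity that follows directly from the update rule, and the bound then follows by plugging in the assumed norms and step size. The short Python check below (my own sanity check, not from the lecture; the random vectors, dimensions and step sizes are arbitrary) verifies the identity numerically and confirms the B ρ √T bound for the tuned step size.

import numpy as np

rng = np.random.default_rng(2)
d, T, eta = 4, 1000, 0.05

# Arbitrary bounded vectors playing the role of the subgradients v_t, and a comparator w_star.
V = rng.uniform(-1.0, 1.0, size=(T, d))
rho = np.max(np.linalg.norm(V, axis=1))
w_star = rng.uniform(-1.0, 1.0, size=d)
B = np.linalg.norm(w_star)

def run_updates(step):
    """Iterates w^{(1)}, ..., w^{(T+1)} of w^{(t+1)} = w^{(t)} - step * v_t, starting from 0."""
    W = np.zeros((T + 1, d))
    for t in range(T):
        W[t + 1] = W[t] - step * V[t]
    return W

# Check the algebraic identity (it holds for any step size).
W = run_updates(eta)
lhs = sum(np.dot(W[t] - w_star, V[t]) for t in range(T))
rhs = ((np.linalg.norm(W[0] - w_star) ** 2 - np.linalg.norm(W[T] - w_star) ** 2) / (2 * eta)
       + (eta / 2) * sum(np.linalg.norm(V[t]) ** 2 for t in range(T)))
print(abs(lhs - rhs))                               # ~0 up to floating-point error

# With the tuned step size, the sum is at most B * rho * sqrt(T).
eta_star = np.sqrt(B**2 / (rho**2 * T))
W = run_updates(eta_star)
lhs_star = sum(np.dot(W[t] - w_star, V[t]) for t in range(T))
print(lhs_star <= B * rho * np.sqrt(T) + 1e-9)      # True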


Analyzing SGD for convex-Lipschitz-bounded

Taking expectation of both sides w.r.t. the randomness of choosing z_1, …, z_T we obtain:

E_{z_1,…,z_T} [ ∑_{t=1}^T 〈w^{(t)} − w⋆, v_t〉 ] ≤ B ρ √T .

The law of total expectation: for every two random variables α, β, and a function g, E_α[g(α)] = E_β E_α[g(α) | β]. Therefore

E_{z_1,…,z_T}[〈w^{(t)} − w⋆, v_t〉] = E_{z_1,…,z_{t−1}} E_{z_1,…,z_T}[〈w^{(t)} − w⋆, v_t〉 | z_1, …, z_{t−1}] .

Once we know β = (z_1, …, z_{t−1}), the value of w^{(t)} is not random; hence

E_{z_1,…,z_T}[〈w^{(t)} − w⋆, v_t〉 | z_1, …, z_{t−1}] = 〈w^{(t)} − w⋆ , E_{z_t}[∇ℓ(w^{(t)}, z_t)]〉
                                                   = 〈w^{(t)} − w⋆ , ∇L_D(w^{(t)})〉

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 44 / 47


Analyzing SGD for convex-Lipschitz-bounded

We got:

E_{z_1,…,z_T} [ ∑_{t=1}^T 〈w^{(t)} − w⋆ , ∇L_D(w^{(t)})〉 ] ≤ B ρ √T

By convexity of L_D (the first-order inequality gives L_D(w^{(t)}) − L_D(w⋆) ≤ 〈w^{(t)} − w⋆, ∇L_D(w^{(t)})〉), this means

E_{z_1,…,z_T} [ ∑_{t=1}^T (L_D(w^{(t)}) − L_D(w⋆)) ] ≤ B ρ √T

Dividing by T and using convexity again (Jensen's inequality applied to the average of the iterates),

E_{z_1,…,z_T} [ L_D( (1/T) ∑_{t=1}^T w^{(t)} ) ] ≤ L_D(w⋆) + B ρ / √T

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 45 / 47
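The final step ("using convexity again") is Jensen's inequality: for a convex L_D, the loss of the averaged iterate w̄ is at most the average of the losses of the iterates. A tiny Python check (my own illustration; the quadratic stand-in for L_D and the random iterates are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(3)

def L(w):
    """A convex function standing in for L_D (here a simple quadratic)."""
    return float(np.sum((w - 1.0) ** 2))

T, d = 100, 3
iterates = rng.normal(size=(T, d))        # a stand-in trajectory w^{(1)}, ..., w^{(T)}

w_bar = iterates.mean(axis=0)
avg_of_losses = np.mean([L(w) for w in iterates])

# Jensen's inequality for convex L: L(average of iterates) <= average of the losses.
print(L(w_bar) <= avg_of_losses + 1e-12)  # True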


Learning convex-Lipschitz-bounded problems using SGD

Corollary

Consider a convex-Lipschitz-bounded learning problem with parameters ρ, B. Then, for every ε > 0, if we run the SGD method for minimizing L_D(w) with a number of iterations (i.e., number of examples)

T ≥ B² ρ² / ε²

and with η = √(B² / (ρ² T)), then the output of SGD satisfies:

E[L_D(w̄)] ≤ min_{w∈H} L_D(w) + ε .

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 46 / 47
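As a worked instance of the corollary (my own numbers, not from the lecture): with B = 1, ρ = 1 and ε = 0.01, the bound asks for T ≥ B²ρ²/ε² = 10,000 examples and step size η = √(B²/(ρ²T)) = 0.01. A short Python sketch of this bookkeeping, where sgd_parameters is a hypothetical helper name:

import math

def sgd_parameters(B: float, rho: float, eps: float):
    """Iterations and step size suggested by the corollary's bound (illustrative helper)."""
    T = math.ceil((B * rho / eps) ** 2)      # T >= B^2 * rho^2 / eps^2
    eta = math.sqrt(B**2 / (rho**2 * T))     # eta = sqrt(B^2 / (rho^2 * T))
    return T, eta

print(sgd_parameters(B=1.0, rho=1.0, eps=0.01))   # (10000, 0.01)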


Summary

Convex optimization

Convex learning problems

Learning using SGD

Shai Shalev-Shwartz (Hebrew U) IML Lecture 6 Convexity 47 / 47