
Introduction to Machine Learning (67577), Lecture 7

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

Solving Convex Problems using SGD and RLM


Outline

1 Reminder: Convex learning problems

2 Learning Using Stochastic Gradient Descent

3 Learning Using Regularized Loss Minimization

4 Dimension vs. Norm bounds
  Example application: Text categorization


Convex-Lipschitz-bounded learning problem

Definition (Convex-Lipschitz-Bounded Learning Problem)

A learning problem, (H, Z, ℓ), is called Convex-Lipschitz-Bounded with parameters ρ, B if the following holds:

The hypothesis class H is a convex set and for all w ∈ H we have ‖w‖ ≤ B.

For all z ∈ Z, the loss function ℓ(·, z) is a convex and ρ-Lipschitz function.

Example:

H = {w ∈ R^d : ‖w‖ ≤ B}, X = {x ∈ R^d : ‖x‖ ≤ ρ}, Y = R,

ℓ(w, (x, y)) = |〈w, x〉 − y|

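To make the example concrete, here is a small numerical sketch (not from the lecture; the data and helper names are illustrative) checking that the absolute loss ℓ(w, (x, y)) = |〈w, x〉 − y| is ρ-Lipschitz in w whenever ‖x‖ ≤ ρ, using the subgradient sign(〈w, x〉 − y)·x:

```python
import numpy as np

def abs_loss(w, x, y):
    """Absolute loss |<w, x> - y| from the example above."""
    return abs(np.dot(w, x) - y)

def abs_loss_subgrad(w, x, y):
    """A subgradient of w -> |<w, x> - y|, namely sign(<w, x> - y) * x."""
    return np.sign(np.dot(w, x) - y) * x

rng = np.random.default_rng(0)
d, B, rho = 5, 1.0, 2.0
x = rng.normal(size=d); x *= rho / np.linalg.norm(x)      # ||x|| = rho
y = 0.3
w1 = rng.normal(size=d); w1 *= B / np.linalg.norm(w1)     # ||w1|| <= B
w2 = rng.normal(size=d); w2 *= B / np.linalg.norm(w2)     # ||w2|| <= B

# rho-Lipschitzness: |loss(w1) - loss(w2)| <= rho * ||w1 - w2||
lhs = abs(abs_loss(w1, x, y) - abs_loss(w2, x, y))
rhs = rho * np.linalg.norm(w1 - w2)
print(lhs <= rhs)   # True; the subgradient norm ||x|| never exceeds rho
```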


Convex-Smooth-bounded learning problem

Definition (Convex-Smooth-Bounded Learning Problem)

A learning problem, (H, Z, ℓ), is called Convex-Smooth-Bounded with parameters β, B if the following holds:

The hypothesis class H is a convex set and for all w ∈ H we have ‖w‖ ≤ B.

For all z ∈ Z, the loss function ℓ(·, z) is a convex, non-negative, and β-smooth function.

Example:

H = {w ∈ R^d : ‖w‖ ≤ B}, X = {x ∈ R^d : ‖x‖ ≤ β/2}, Y = R,

ℓ(w, (x, y)) = (〈w, x〉 − y)²


Learning Using Stochastic Gradient Descent

Consider a learning problem.

Recall: our goal is to (probably approximately) solve:

$$\min_{w \in H} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}\left[\ell(w, z)\right]$$

So far, learning was based on the empirical risk, LS(w)

We now consider directly minimizing LD(w)


Stochastic Gradient Descent

$$\min_{w \in H} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}\left[\ell(w, z)\right]$$

Recall the gradient descent method, in which we initialize w^(1) = 0 and update w^(t+1) = w^(t) − η∇L_D(w^(t))

Observe: ∇L_D(w) = E_{z∼D}[∇ℓ(w, z)]

We can't calculate ∇L_D(w) because we don't know D

But we can estimate it by ∇ℓ(w, z) for z ∼ D

If we take a step based on the direction v = ∇ℓ(w, z), then in expectation we're moving in the right direction

In other words, v is an unbiased estimate of the gradient

We’ll show that this is good enough


Stochastic Gradient Descent

initialize: w^(1) = 0
for t = 1, 2, . . . , T
    choose z_t ∼ D
    let v_t ∈ ∂ℓ(w^(t), z_t)
    update w^(t+1) = w^(t) − η v_t
output w̄ = (1/T) ∑_{t=1}^{T} w^(t)

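Below is a minimal runnable Python sketch of this procedure (not part of the lecture): the sgd function follows the pseudocode above, while the surrounding instance, the absolute loss with data drawn from a made-up distribution D, is purely illustrative.

```python
import numpy as np

def sgd(sample_z, subgrad, eta, T, d):
    """SGD as in the pseudocode: w^(1) = 0, w^(t+1) = w^(t) - eta * v_t,
    output the average iterate w_bar = (1/T) * sum_t w^(t)."""
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        z = sample_z()        # z_t ~ D
        v = subgrad(w, z)     # v_t in the subdifferential of l(., z_t) at w^(t)
        w_sum += w            # accumulate w^(t) before updating
        w -= eta * v
    return w_sum / T

# Illustrative instance: absolute loss l(w, (x, y)) = |<w, x> - y|, rho-Lipschitz for ||x|| <= rho.
rng = np.random.default_rng(0)
d, B, rho = 10, 1.0, 1.0
w_star = rng.normal(size=d); w_star *= B / np.linalg.norm(w_star)

def sample_z():
    x = rng.normal(size=d); x *= rho / np.linalg.norm(x)
    return x, np.dot(w_star, x) + 0.01 * rng.normal()

def abs_loss_subgrad(w, z):
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

T = 10_000
eta = np.sqrt(B**2 / (rho**2 * T))      # the step size used in the analysis that follows
w_bar = sgd(sample_z, abs_loss_subgrad, eta, T, d)
print(np.linalg.norm(w_bar - w_star))   # expected to be small for large T
```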


Analyzing SGD for convex-Lipschitz-bounded

By algebraic manipulations, for any sequence v_1, . . . , v_T and any w⋆,

$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \;=\; \frac{\|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2}{2\eta} \;+\; \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2$$

Assuming that ‖v_t‖ ≤ ρ for all t and that ‖w⋆‖ ≤ B, we obtain

$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \;\le\; \frac{B^2}{2\eta} + \frac{\eta \rho^2 T}{2}$$

In particular, for $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$ we get

$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \;\le\; B \rho \sqrt{T} .$$

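For completeness, the “algebraic manipulations” behind the first identity amount to a single expansion of the update rule w^(t+1) = w^(t) − η v_t (this standard step is spelled out here rather than on the slide):

$$\|w^{(t+1)} - w^\star\|^2 \;=\; \|w^{(t)} - w^\star\|^2 - 2\eta\,\langle w^{(t)} - w^\star, v_t\rangle + \eta^2\|v_t\|^2 ,$$

so that

$$\langle w^{(t)} - w^\star, v_t\rangle \;=\; \frac{\|w^{(t)} - w^\star\|^2 - \|w^{(t+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2}\|v_t\|^2 ,$$

and summing over t = 1, . . . , T telescopes the first term (recall w^(1) = 0, so ‖w^(1) − w⋆‖² = ‖w⋆‖² ≤ B²).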


Analyzing SGD for convex-Lipschitz-bounded

Taking expectation of both sides w.r.t. the randomness of choosing z_1, . . . , z_T we obtain:

$$\mathbb{E}_{z_1,\dots,z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle\right] \;\le\; B \rho \sqrt{T} .$$

The law of total expectation: for every two random variables α, β and a function g, E_α[g(α)] = E_β E_α[g(α) | β]. Therefore

$$\mathbb{E}_{z_1,\dots,z_T}\left[\langle w^{(t)} - w^\star, v_t \rangle\right] \;=\; \mathbb{E}_{z_1,\dots,z_{t-1}}\, \mathbb{E}_{z_1,\dots,z_T}\left[\langle w^{(t)} - w^\star, v_t \rangle \,\middle|\, z_1, \dots, z_{t-1}\right] .$$

Once we know z_1, . . . , z_{t−1}, the value of w^(t) is not random, hence

$$\mathbb{E}_{z_1,\dots,z_T}\left[\langle w^{(t)} - w^\star, v_t \rangle \,\middle|\, z_1, \dots, z_{t-1}\right] \;=\; \left\langle w^{(t)} - w^\star,\, \mathbb{E}_{z_t}\left[\nabla \ell(w^{(t)}, z_t)\right]\right\rangle \;=\; \left\langle w^{(t)} - w^\star,\, \nabla L_D(w^{(t)})\right\rangle$$


Analyzing SGD for convex-Lipschitz-bounded

We got:

$$\mathbb{E}_{z_1,\dots,z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star,\, \nabla L_D(w^{(t)}) \rangle\right] \;\le\; B \rho \sqrt{T}$$

By convexity of L_D we have L_D(w^(t)) − L_D(w⋆) ≤ 〈w^(t) − w⋆, ∇L_D(w^(t))〉, so this means

$$\mathbb{E}_{z_1,\dots,z_T}\left[\sum_{t=1}^{T} \left(L_D(w^{(t)}) - L_D(w^\star)\right)\right] \;\le\; B \rho \sqrt{T}$$

Dividing by T and using convexity again (Jensen's inequality applied to the average iterate),

$$\mathbb{E}_{z_1,\dots,z_T}\left[L_D\!\left(\frac{1}{T}\sum_{t=1}^{T} w^{(t)}\right)\right] \;\le\; L_D(w^\star) + \frac{B \rho}{\sqrt{T}}$$


Learning convex-Lipschitz-bounded problems using SGD

Corollary

Consider a convex-Lipschitz-bounded learning problem with parameters ρ, B. Then, for every ε > 0, if we run the SGD method for minimizing L_D(w) with a number of iterations (i.e., number of examples)

$$T \;\ge\; \frac{B^2 \rho^2}{\varepsilon^2}$$

and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies:

$$\mathbb{E}\left[L_D(\bar{w})\right] \;\le\; \min_{w \in H} L_D(w) + \varepsilon .$$

Remark: Can obtain a high-probability bound using “boosting the confidence” (Lecture 4)

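Read as a recipe, the corollary converts a target accuracy ε into a sample/iteration budget. A tiny illustrative helper (hypothetical, not from the lecture) for plugging into the SGD sketch above:

```python
import math

def sgd_budget(B, rho, eps):
    """Iterations/examples and step size suggested by the corollary:
    T >= B^2 * rho^2 / eps^2  and  eta = sqrt(B^2 / (rho^2 * T))."""
    T = math.ceil((B * rho / eps) ** 2)
    eta = math.sqrt(B**2 / (rho**2 * T))
    return T, eta

T, eta = sgd_budget(B=1.0, rho=1.0, eps=0.01)
print(T, eta)   # 10000 examples, eta = 0.01
# w_bar = sgd(sample_z, abs_loss_subgrad, eta, T, d)   # as in the earlier sketch
```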


Convex-smooth-bounded problems

Similar result holds for smooth problems:

Corollary

Consider a convex-smooth-bounded learning problem with parameters β, B. Assume in addition that ℓ(0, z) ≤ 1 for all z ∈ Z. For every ε > 0, set $\eta = \frac{1}{\beta(1 + 3/\varepsilon)}$. Then, running SGD with $T \ge 12 B^2 \beta / \varepsilon^2$ yields

$$\mathbb{E}\left[L_D(\bar{w})\right] \;\le\; \min_{w \in H} L_D(w) + \varepsilon .$$


Regularized Loss Minimization (RLM)

Given a regularization function R : R^d → R, the RLM rule is:

$$A(S) = \operatorname*{argmin}_{w} \left( L_S(w) + R(w) \right) .$$

We will focus on Tikhonov regularization:

$$A(S) = \operatorname*{argmin}_{w} \left( L_S(w) + \lambda \|w\|^2 \right) .$$

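For the squared loss, the Tikhonov-regularized RLM rule has a closed form (ridge regression); for general convex losses it is solved numerically. The sketch below assumes that squared-loss instance, with L_S taken as the average loss over the sample; the synthetic data and names are illustrative, not from the lecture:

```python
import numpy as np

def rlm_ridge(X, y, lam):
    """argmin_w (1/m) * sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2.
    Closed form: w = (X^T X / m + lam * I)^{-1} (X^T y / m)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(0)
m, d = 50, 20
X = rng.normal(size=(m, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=m)

for lam in (0.001, 0.01, 0.1, 1.0):
    w = rlm_ridge(X, y, lam)
    print(lam, round(float(np.linalg.norm(w)), 3))   # larger lam => shorter w (stronger bias)
```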


Why regularize?

Similar to MDL: specify a “prior belief” in hypotheses. We bias ourselves toward “short” vectors.

Stabilizer: we'll show that Tikhonov regularization makes the learner stable w.r.t. small perturbations of the training set, which in turn leads to generalization


Stability

Informally: an algorithm A is stable if a small change of its input S will lead to a small change of its output hypothesis

Need to specify what is “small change of input” and what is “small change of output”


Stability

Replace one sample: given S = (z_1, . . . , z_m) and an additional example z′, let S^(i) = (z_1, . . . , z_{i−1}, z′, z_{i+1}, . . . , z_m)

Definition (on-average-replace-one-stable)

Let ε : N → R be a monotonically decreasing function. We say that a learning algorithm A is on-average-replace-one-stable with rate ε(m) if for every distribution D

$$\mathbb{E}_{(S, z') \sim D^{m+1},\, i \sim U(m)}\left[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \right] \;\le\; \varepsilon(m) .$$


Stable rules do not overfit

Theorem

If A is on-average-replace-one-stable with rate ε(m) then

$$\mathbb{E}_{S \sim D^m}\left[ L_D(A(S)) - L_S(A(S)) \right] \;\le\; \varepsilon(m) .$$

Proof.

Since S and z′ are both drawn i.i.d. from D, we have that for every i,

$$\mathbb{E}_{S}\left[ L_D(A(S)) \right] = \mathbb{E}_{S, z'}\left[ \ell(A(S), z') \right] = \mathbb{E}_{S, z'}\left[ \ell(A(S^{(i)}), z_i) \right] .$$

On the other hand, we can write

$$\mathbb{E}_{S}\left[ L_S(A(S)) \right] = \mathbb{E}_{S, i}\left[ \ell(A(S), z_i) \right] .$$

The proof follows from the definition of stability.

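Spelling out that last sentence (a one-line computation, included here for completeness): subtracting the two displayed identities and averaging over i ∼ U(m) gives

$$\mathbb{E}_{S}\big[ L_D(A(S)) - L_S(A(S)) \big] \;=\; \mathbb{E}_{(S,z') \sim D^{m+1},\, i \sim U(m)}\big[ \ell(A(S^{(i)}), z_i) - \ell(A(S), z_i) \big] \;\le\; \varepsilon(m) ,$$

where the inequality is exactly the definition of on-average-replace-one-stability.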


Tikhonov Regularization as Stabilizer

Theorem

Assume that the loss function is convex and ρ-Lipschitz. Then, the RLM rule with the regularizer λ‖w‖² is on-average-replace-one-stable with rate 2ρ²/(λm). It follows that

$$\mathbb{E}_{S \sim D^m}\left[ L_D(A(S)) - L_S(A(S)) \right] \;\le\; \frac{2 \rho^2}{\lambda m} .$$

Similarly, for a convex, β-smooth, and non-negative loss the rate is 48βC/(λm), where C is an upper bound on max_z ℓ(0, z).

The proof relies on the notion of strong convexity and can be found in the book.


The Fitting-Stability Tradeoff

Observe:

$$\mathbb{E}_{S}\left[ L_D(A(S)) \right] = \mathbb{E}_{S}\left[ L_S(A(S)) \right] + \mathbb{E}_{S}\left[ L_D(A(S)) - L_S(A(S)) \right] .$$

The first term is how well A fits the training set

The 2nd term is the overfitting, and is bounded by the stability of A

λ controls the tradeoff between the two terms


The Fitting-Stability Tradeoff

Let A be the RLM rule

We saw (for convex-Lipschitz losses)

$$\mathbb{E}_{S}\left[ L_D(A(S)) - L_S(A(S)) \right] \;\le\; \frac{2 \rho^2}{\lambda m}$$

Fix some arbitrary vector w∗, then (the first inequality since λ‖A(S)‖² ≥ 0, the second since A(S) minimizes the regularized objective):

$$L_S(A(S)) \;\le\; L_S(A(S)) + \lambda \|A(S)\|^2 \;\le\; L_S(w^*) + \lambda \|w^*\|^2 .$$

Taking expectation of both sides with respect to S and noting that E_S[L_S(w∗)] = L_D(w∗), we obtain that

$$\mathbb{E}_{S}\left[ L_S(A(S)) \right] \;\le\; L_D(w^*) + \lambda \|w^*\|^2 .$$

Therefore:

$$\mathbb{E}_{S}\left[ L_D(A(S)) \right] \;\le\; L_D(w^*) + \lambda \|w^*\|^2 + \frac{2 \rho^2}{\lambda m}$$

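Optimizing the right-hand side over λ is a short calculation not done on the slide: the two λ-dependent terms balance at

$$\lambda \;=\; \sqrt{\frac{2\rho^2}{m\,\|w^*\|^2}} \;,$$

which, assuming ‖w∗‖ ≤ B, gives

$$\mathbb{E}_{S}\big[L_D(A(S))\big] \;\le\; L_D(w^*) + \|w^*\|\,\rho\,\sqrt{\frac{8}{m}} \;\le\; L_D(w^*) + B\rho\sqrt{\frac{8}{m}} \;,$$

matching (up to constants) the SGD bound with T = m examples. This choice of λ depends on the unknown ‖w∗‖, which is one motivation for choosing λ by validation, as discussed below.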


The Regularization Path

The RLM rule as a function of λ is w(λ) = argmin_w L_S(w) + λ‖w‖²

Can be seen as a Pareto objective: minimize both L_S(w) and ‖w‖²

[Figure: the set of feasible w in the (‖w‖², L_S(w)) plane, with the Pareto-optimal frontier traced by the regularization path from w(0) (no regularization) to w(∞) (w shrunk to 0).]


How to choose λ?

Bound minimization: choose λ according to the bound on L_D(w); usually far from optimal, as the bound is worst-case

Validation: calculate several Pareto-optimal points on the regularization path (by varying λ) and use a validation set to choose the best one

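A minimal sketch of the validation approach for the squared-loss / ridge instance used earlier (the helper names, the synthetic data, and the λ grid are all illustrative assumptions):

```python
import numpy as np

def ridge(X, y, lam):
    """Tikhonov RLM for the squared loss (same closed form as in the earlier sketch)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def choose_lambda(X_tr, y_tr, X_va, y_va, lambdas):
    """Fit RLM on the training set for each lambda on the grid and return the
    (validation loss, lambda) pair with the smallest held-out loss."""
    scores = [(float(np.mean((X_va @ ridge(X_tr, y_tr, lam) - y_va) ** 2)), lam)
              for lam in lambdas]
    return min(scores)

rng = np.random.default_rng(1)
m, d = 200, 20
X = rng.normal(size=(m, d)); w_star = rng.normal(size=d)
y = X @ w_star + 0.5 * rng.normal(size=m)

val_loss, lam = choose_lambda(X[:150], y[:150], X[150:], y[150:],
                              lambdas=np.logspace(-4, 1, 10))
print(lam, val_loss)   # lambda with the best validation loss on this split
```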


Dimension vs. Norm Bounds

Previously in the course, when we learnt d parameters the sample complexity grew with d

Here, we learn d parameters but the sample complexity depends on the norm ‖w⋆‖ and on the Lipschitzness/smoothness, rather than on d

Which approach is better depends on the properties of the distribution


Example: document categorization

“Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming.”

About sport?


Bag-of-words representation

“Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming.”

[Figure: the document mapped to a binary bag-of-words vector, e.g. 1 for “swimming” and “world”, 0 for words that do not appear such as “elephant”.]


Document categorization

Let X = {x ∈ {0, 1}^d : ‖x‖² ≤ R², x_d = 1}

Think of x ∈ X as a text document represented as a bag of words:

  At most R² − 1 words in each document
  d − 1 is the size of the dictionary
  Last coordinate is the bias

Let Y = {±1} (e.g., the document is about sport or not)

Linear classifiers x ↦ sign(〈w, x〉)

Intuitively: w_i is large (positive) for words indicative of sport, while w_i is small (negative) for words indicative of non-sport

Hinge loss: ℓ(w, (x, y)) = [1 − y〈w, x〉]+

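To connect this setup back to the SGD sketch above, here is a small illustrative sketch (hypothetical vocabulary and helper names, not from the lecture) that builds the binary bag-of-words vector with a bias coordinate and computes a subgradient of the hinge loss; this subgradient is what v_t would be in the SGD update, and its norm is at most ‖x‖ ≤ R:

```python
import numpy as np

def bag_of_words(doc, vocab):
    """Binary bag-of-words with a bias: x_i = 1 iff word i appears; last coordinate x_d = 1."""
    words = set(doc.lower().split())
    x = np.zeros(len(vocab) + 1)
    for i, word in enumerate(vocab):
        x[i] = 1.0 if word in words else 0.0
    x[-1] = 1.0
    return x

def hinge_subgrad(w, x, y):
    """A subgradient of w -> [1 - y<w, x>]_+ : -y*x when 1 - y<w, x> > 0, else 0."""
    return -y * x if 1.0 - y * np.dot(w, x) > 0 else np.zeros_like(x)

vocab = ["swimming", "world", "medals", "elephant"]   # toy dictionary of d - 1 words
x = bag_of_words("Michael Phelps ticked all the boxes in competitive swimming", vocab)
w = np.zeros(len(vocab) + 1)
print(x)                          # [1. 0. 0. 0. 1.] -- "swimming" appears, plus the bias
print(hinge_subgrad(w, x, y=+1))  # at w = 0, 1 - y<w, x> = 1 > 0, so the subgradient is -x
```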


Dimension vs. Norm

VC dimension is d, but d can be extremely large (number of words in English)

The loss function is convex and R-Lipschitz

If the number of relevant words is small and their weights are not too large, then there is a w⋆ with small norm and small L_D(w⋆)

Then we can learn it with sample complexity that depends on R²‖w⋆‖², and does not depend on d at all!

But there are of course opposite cases, in which d is much smaller than R²‖w⋆‖²


Summary

Learning convex learning problems using SGD

Learning convex learning problems using RLM

The regularization path

Dimension vs. Norm
