
Lecture 1: Supervised Learning

Tuo Zhao

Schools of ISYE and CSE, Georgia Tech


ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine Learning

(Supervised) Regression Analysis

Example: living areas and prices of 47 houses:

CS229 Lecture notes

Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price (1000$s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...

We can plot this data:

[Scatter plot "housing prices": price (in $1000) versus living area (square feet).]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?


Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $f^* : \mathbb{R}^d \to \mathbb{R}$,

$$y_i = f^*(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$

where the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^2 = \sigma^2 < \infty$.

Simple linear function: $f^*(x_i) = x_i^\top \theta^*$.

Why is it called supervised learning?


Why Supervised?


Play on Words?

Two unknown functions $f_0^*, f_1^* : \mathbb{R}^d \to \mathbb{R}$?

$$y_i = \mathbb{1}(z_i = 1)\cdot f_1^*(x_i) + \mathbb{1}(z_i = 0)\cdot f_0^*(x_i) + \varepsilon_i,$$

where $i = 1, \ldots, n$, and the $z_i$'s are i.i.d. with

$$\mathbb{P}(z_i = 1) = \delta \quad \text{and} \quad \mathbb{P}(z_i = 0) = 1 - \delta \quad \text{for } \delta \in (0, 1).$$

The $z_i$'s are latent variables. Supervised? Unsupervised?


Linear Regression


Linear Regression

Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,

$$y_i = x_i^\top \theta^* + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$

where the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^2 = \sigma^2 < \infty$.

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^\top\theta)^2.$$

Least Absolute Deviation Regression:

$$\hat{\theta}^{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^n |y_i - x_i^\top\theta|.$$


Robust Regression


Linear Regression — Matrix Notation

$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n\times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,

$$y = X\theta^* + \varepsilon,$$

where $\mathbb{E}\varepsilon = 0$ and $\mathbb{E}\varepsilon\varepsilon^\top = \sigma^2 I_n$.

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\left\|y - X\theta\right\|_2^2.$$

Least Absolute Deviation Regression:

$$\hat{\theta}^{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\left\|y - X\theta\right\|_1.$$


Least Square Regression — Analytical Solution

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\left\|y - X\theta\right\|_2^2}_{\mathcal{L}(\theta)}.$$

First-order optimality condition:

$$\nabla\mathcal{L}(\theta) = \frac{1}{n}X^\top(X\theta - y) = 0 \;\Rightarrow\; X^\top X\theta = X^\top y.$$

Analytical solution and unbiasedness (assuming $X^\top X$ is invertible):

$$\hat{\theta} = (X^\top X)^{-1}X^\top y = (X^\top X)^{-1}X^\top(X\theta^* + \varepsilon) = \theta^* + (X^\top X)^{-1}X^\top\varepsilon \;\Rightarrow\; \mathbb{E}_\varepsilon[\hat{\theta}] = \theta^*.$$
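The algebra above translates directly into a few lines of NumPy. A minimal sketch (the simulated data and variable names are my own, not from the slides); `np.linalg.solve` on the normal equations is used rather than forming $(X^\top X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 5, 0.5

# Simulate y = X theta* + eps with i.i.d. noise.
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + sigma * rng.normal(size=n)

# Solve the normal equations X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```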


Least Square Regression — Convexity

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\left\|y - X\theta\right\|_2^2}_{\mathcal{L}(\theta)}.$$

Second-order optimality condition:

$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X \succeq 0.$$

Convexity: for any $\theta$ and $\theta'$,

$$\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta).$$


Convex vs. Nonconvex Optimization

Stationary solutions: $\nabla\mathcal{L}(\theta) = 0$.

[Figure: a convex objective (left), where every stationary point is a global optimum, versus a nonconvex objective (right) with global optima, local optima, a local maximum, and saddle points.]

Convex: easy but restrictive. Nonconvex: difficult but flexible.

We may get stuck at a local optimum or saddle point for nonconvex optimization.


Maximum Likelihood Estimation

$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n\times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,

$$y = X\theta^* + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2 I_n)$.

Likelihood function:

$$L(\theta) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}(y - X\theta)^\top(y - X\theta)\right).$$

Maximum log-likelihood estimation:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\log L(\theta) = \arg\max_{\theta}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2\right].$$


Maximum Likelihood Estimation

Maximum log-likelihood estimation:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2\right].$$

Since $\sigma^2$ is a constant (even if unknown), it does not affect the maximizer:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\, -\frac{1}{2n}\left\|y - X\theta\right\|_2^2 = \arg\min_{\theta}\frac{1}{2n}\left\|y - X\theta\right\|_2^2.$$

Probabilistic Interpretation:

Simple and illustrative.

Restrictive and potentially misleading.

Remember t-test? What if the model is wrong?


Computational Cost of OLS

The number of basic operations, e.g., addition, subtraction, multiplication, and division.

Matrix multiplication $X^\top X$: $O(nd^2)$

Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$

Matrix-vector multiplication $X^\top y$: $O(nd)$

Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$

Overall computational cost: $O(nd^2)$, given $n \gg d$.


Scalability and Efficiency of OLS

Simple closed-form solution.

Overall computational cost: $O(nd^2)$.

Massive data: both $n$ and $d$ are large.

Not very efficient or scalable.

Better ways to improve the computation?


Optimization for Linear Regression


Vanilla Gradient Descent

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\mathcal{L}(\theta^{(k)}).$$

$\eta_k > 0$: the step size parameter (fixed or chosen by line search).

Stop when the gradient is small: $\left\|\nabla\mathcal{L}(\theta^{(K)})\right\|_2 \le \delta$.

[Figure: gradient descent on $f(\theta)$; from the current iterate $\theta^{(k)}$ we move along $-\nabla f(\theta^{(k)})$ toward the minimizer $\hat{\theta}$, where $\nabla f(\hat{\theta}) = 0$.]
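A minimal sketch of the update above for the least-squares loss (my own code, not from the slides; the fixed step size $1/L$ uses $L = \Lambda_{\max}(X^\top X/n)$, and the stopping rule is the small-gradient criterion):

```python
import numpy as np

def gradient_descent_ols(X, y, num_iters=500, tol=1e-8):
    """Vanilla gradient descent on L(theta) = ||y - X theta||^2 / (2n)."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n        # gradient of the loss
        if np.linalg.norm(grad) <= tol:         # stop when the gradient is small
            break
        theta = theta - grad / L                # fixed step size 1/L
    return theta

# Example usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=200)
print(gradient_descent_ols(X, y))
```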


Computational Cost of VGD

Gradient: $\nabla\mathcal{L}(\theta^{(k)}) = \frac{1}{n}X^\top(X\theta^{(k)} - y)$

Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$

Vector subtraction $X\theta^{(k)} - y$: $O(n)$

Matrix-vector multiplication $X^\top(X\theta^{(k)} - y)$: $O(nd)$

Overall computational cost per iteration: $O(nd)$

Better than $O(nd^2)$, but how many iterations?


Rate of Convergence

What are Good Algorithms?

Asymptotic convergence: $\theta^{(k)} \to \hat{\theta}$ as $k \to \infty$?

Nonasymptotic rate of convergence: the optimization error after $k$ iterations.

Example: gap in objective value (sublinear convergence),

$$f(\theta^{(k)}) - f(\hat{\theta}) = O\!\left(L/k^2\right) \quad \text{vs.} \quad O\!\left(L/k\right),$$

where $L$ is some constant depending on the problem.

Example: gap in parameter (linear convergence),

$$\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 = O\!\left((1 - 1/\kappa)^k\right) \quad \text{vs.} \quad O\!\left((1 - 1/\sqrt{\kappa})^k\right),$$

where $\kappa$ is some constant depending on the problem.


Iteration Complexity of Gradient Descent

Iteration Complexity

We need at most

$$K = O\!\left(\kappa\log\left(\frac{1}{\varepsilon}\right)\right)$$

iterations such that

$$\left\|\theta^{(K)} - \hat{\theta}\right\|_2^2 \le \varepsilon,$$

where $\kappa$ is some constant depending on the problem.

What is κ? It is related to smoothness and convexity.


Strong Convexity

There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\left\|\theta' - \theta\right\|_2^2.$$

[Figure: $\mathcal{L}(\theta')$ lies above the quadratic lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\|\theta' - \theta\|_2^2$, which in turn lies above the linear lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]


Strong Smoothness

There exists a constant $L$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') \le \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\left\|\theta' - \theta\right\|_2^2.$$

[Figure: $\mathcal{L}(\theta')$ lies below the quadratic upper bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\|\theta' - \theta\|_2^2$ and above the linear approximation $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]


Condition Number κ = L/µ

[Contour plots of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ (left) and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$ (right).]
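As an illustration (mine, not from the slides), running gradient descent with step size $1/L$ on these two quadratics shows how $\kappa$ controls the iteration count: the left problem has $\kappa = 1.8/0.2 = 9$ and needs on the order of a hundred iterations, while the right problem has $\kappa = 1$ and converges in one step.

```python
import numpy as np

def gd_iterations(hess_diag, tol=1e-10, max_iter=10000):
    """Run GD with step size 1/L on f(theta) = 0.5 * theta^T diag(hess_diag) theta."""
    L = hess_diag.max()                    # smoothness constant
    theta = np.array([1.0, 1.0])
    for k in range(max_iter):
        grad = hess_diag * theta           # gradient of the diagonal quadratic
        if np.linalg.norm(grad) <= tol:
            return k
        theta = theta - grad / L
    return max_iter

# f(theta) = 0.9 th1^2 + 0.1 th2^2  ->  Hessian diag(1.8, 0.2), kappa = 9
# f(theta) = 0.5 th1^2 + 0.5 th2^2  ->  Hessian diag(1.0, 1.0), kappa = 1
print(gd_iterations(np.array([1.8, 0.2])))   # roughly two hundred iterations
print(gd_iterations(np.array([1.0, 1.0])))   # a single step suffices
```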


Vector Field Representation

[Gradient vector fields of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ (left) and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$ (right).]


Understanding Regularity Conditions

Mean value theorem: there exists $z \in [0, 1]$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) = \frac{1}{2}(\theta' - \theta)^\top\nabla^2\mathcal{L}(\tilde{\theta})(\theta' - \theta),$$

where $\tilde{\theta}$ is a convex combination: $\tilde{\theta} = z\theta + (1 - z)\theta'$.

Hessian matrix for OLS:

$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X.$$

Control the remainder:

$$\underbrace{\Lambda_{\min}\!\left(\tfrac{1}{n}X^\top X\right)}_{\mu} \le \frac{(\theta' - \theta)^\top\nabla^2\mathcal{L}(\tilde{\theta})(\theta' - \theta)}{\left\|\theta' - \theta\right\|_2^2} \le \underbrace{\Lambda_{\max}\!\left(\tfrac{1}{n}X^\top X\right)}_{L}.$$


Understanding Gradient Descent Algorithms

Iteratively Minimize Quadratic Approximation

At the $(k+1)$-th iteration, we consider

$$Q(\theta;\theta^{(k)}) = \mathcal{L}(\theta^{(k)}) + \nabla\mathcal{L}(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{L}{2}\left\|\theta - \theta^{(k)}\right\|_2^2.$$

We have

$$Q(\theta;\theta^{(k)}) \ge \mathcal{L}(\theta) \quad \text{and} \quad Q(\theta^{(k)};\theta^{(k)}) = \mathcal{L}(\theta^{(k)}).$$

We take

$$\theta^{(k+1)} = \arg\min_{\theta} Q(\theta;\theta^{(k)}) = \theta^{(k)} - \frac{1}{L}\nabla\mathcal{L}(\theta^{(k)}).$$


Backtracking Line Search

The worst case: a fixed step size $\eta_k = 1/L$.

At the $(k+1)$-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\mathcal{L}(\theta^{(k)}) \quad \text{if } Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)}).$$

Otherwise, we shrink the step size and take

$$\eta_k = (1 - \delta)^m\eta_{k-1},$$

where $\delta \in (0, 1)$ and $m$ is the smallest positive integer such that $Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)})$.
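A sketch of this rule for the least-squares loss (my own code, with shrinkage factor $1 - \delta = 0.75$ as in the figure on the next slide). The acceptance test uses the common variant of $Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)})$ with quadratic coefficient $1/(2\eta_k)$, which simplifies to a sufficient-decrease check:

```python
import numpy as np

def loss(theta, X, y):
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ theta) ** 2) / n

def grad(theta, X, y):
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def gd_backtracking(X, y, eta0=1.0, shrink=0.75, num_iters=200):
    """Gradient descent with backtracking: shrink eta until the quadratic model
    with coefficient 1/(2*eta) upper-bounds the loss at the candidate point."""
    theta = np.zeros(X.shape[1])
    eta = eta0
    for _ in range(num_iters):
        g = grad(theta, X, y)
        # Try the previous step size first, then shrink as needed.
        while loss(theta - eta * g, X, y) > loss(theta, X, y) - 0.5 * eta * g @ g:
            eta *= shrink
        theta = theta - eta * g
    return theta
```

The inner `while` loop always terminates, because any $\eta \le 1/L$ satisfies the test.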


Backtracking Line Search

[Figure: backtracking from $\theta^{(k)}$; candidate step sizes $\eta_k = \eta_{k-1}$, $\eta_k = 0.75\,\eta_{k-1}$, and $\eta_k = 0.75^2\,\eta_{k-1}$.]


Tradeoff Statistics and Computation


High Precision or Low Precision

Can we tolerate a large ε?

From a learning perspective, our interest is $\theta^*$, not $\hat{\theta}$.

Error decomposition:

$$\left\|\theta^{(K)} - \theta^*\right\|_2 \le \underbrace{\left\|\theta^{(K)} - \hat{\theta}\right\|_2}_{\text{Opt. Error}} + \underbrace{\left\|\hat{\theta} - \theta^*\right\|_2}_{\text{Stat. Error}}.$$

High precision expects something like $\left\|\theta^{(K)} - \hat{\theta}\right\|_2 \approx 10^{-10}$.

Does it make any difference?


High Precision or Low Precision

The statistical error is unavoidable!

[Figure (from the high-dimensional noisy Lasso literature): the optimization error $\log(\|\beta^t - \hat{\beta}\|_2)$ and the statistical error $\log(\|\beta^t - \beta^*\|_2)$ versus the iteration number $t$, for 10 different starting points, plotted on a logarithmic scale. The optimization error decreases geometrically until it reaches the statistical error floor.]


Tradeoff Statistical and Optimization Errors

The statistical error is unavoidable!

The statistical error of the optimal solution:

$$\mathbb{E}\left\|\hat{\theta} - \theta^*\right\|_2^2 = \mathbb{E}\left\|(X^\top X)^{-1}X^\top\varepsilon\right\|_2^2 = \sigma^2\,\mathrm{tr}\!\left[(X^\top X)^{-1}\right] = O\!\left(\frac{\sigma^2 d}{n}\right).$$

We only need

$$\left\|\theta^{(K)} - \hat{\theta}\right\|_2 \lesssim \left\|\hat{\theta} - \theta^*\right\|_2.$$

Given $K = O\!\left(\kappa\log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have

$$\mathbb{E}\left\|\theta^{(K)} - \theta^*\right\|_2^2 = O\!\left(\frac{\sigma^2 d}{n}\right).$$


Agnostic Learning

All models are wrong, but some are useful!

Data generating process: $(X, Y) \sim \mathcal{D}$.

The oracle model: $f_{\mathrm{oracle}}(X) = X^\top\theta_{\mathrm{oracle}}$, where

$$\theta_{\mathrm{oracle}} = \arg\min_{\theta}\mathbb{E}_{\mathcal{D}}(Y - X^\top\theta)^2.$$

The estimated model: $\hat{f}(X) = X^\top\hat{\theta}$, where

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{2n}\left\|y - X\theta\right\|_2^2 \quad \text{and} \quad (x_1, y_1), \ldots, (x_n, y_n) \sim \mathcal{D}.$$

At the $K$-th iteration: $f^{(K)}(X) = X^\top\theta^{(K)}$.


Agnostic Learning (See more details in CS-7545)

All models are wrong, but some are useful!

[Figure: the class of all linear models contains the "oracle" model and the "estimated" model, while the "true" model lies outside the class.]


Agnostic Learning

"Approximation" error: $\mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2$

"Estimation" error: $\mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2$

Optimization error: $\mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2$

Decomposition of the statistical error:

$$\mathbb{E}_{\mathcal{D}}(Y - f^{(K)}(X))^2 \le \mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2.$$

How should we choose ε?


Scalable Computation of Linear Regression


Stochastic Approximation

What if n is too large?

Empirical risk minimization: $\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n\ell_i(\theta)$.

For least square regression:

$$\ell_i(\theta) = \frac{1}{2}(y_i - x_i^\top\theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|M_i|}\sum_{j\in M_i}(y_j - x_j^\top\theta)^2.$$

Randomly sample $i$ from $\{1, \ldots, n\}$ with equal probability; then

$$\mathbb{E}_i\nabla\ell_i(\theta) = \nabla\mathcal{L}(\theta) \quad \text{and} \quad \mathbb{E}\left\|\nabla\ell_i(\theta) - \nabla\mathcal{L}(\theta)\right\|_2^2 \le M^2.$$

Stochastic gradient (SG): replace $\nabla\mathcal{L}(\theta)$ with $\nabla\ell_i(\theta)$,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\ell_{i_k}(\theta^{(k)}).$$
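A minimal sketch of the stochastic gradient update for least squares (my own code and hyperparameters, not from the slides; the decreasing step size and the averaged iterate anticipate the convergence discussion on the next slides):

```python
import numpy as np

def sgd_ols(X, y, num_iters=10000, eta0=0.1, seed=0):
    """Stochastic gradient for the least-squares loss: at each step, sample one
    index i uniformly and move along the negative gradient of ell_i(theta)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    avg = np.zeros(d)                         # running average of the iterates
    for k in range(num_iters):
        i = rng.integers(n)                   # uniform index => unbiased gradient
        g = (X[i] @ theta - y[i]) * X[i]      # gradient of ell_i(theta)
        eta = eta0 / (k + 1)                  # decreasing step size (see below)
        theta = theta - eta * g
        avg += (theta - avg) / (k + 1)        # averaged iterate, returned below
    return avg
```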


Why Stochastic Gradient?

Perturbed Descent Directions


Convergence of Stochastic Gradient Algorithms

How many iterations do we need?

A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.

Given a pre-specified error $\varepsilon$, we need

$$K = O\!\left(\frac{M^2 + L^2}{\mu^2\varepsilon}\right)$$

iterations such that

$$\mathbb{E}\left\|\bar{\theta}^{(K)} - \hat{\theta}\right\|_2^2 \le \varepsilon, \quad \text{where } \bar{\theta}^{(K)} = \frac{1}{K}\sum_{k=1}^{K}\theta^{(k)}.$$

When $\mu^2\varepsilon n \gg M^2 + L^2$, i.e., $n$ is very large,

$$O\!\left(\frac{d(M^2 + L^2)}{\mu^2\varepsilon}\right) \quad \text{vs.} \quad O(\kappa n d) \quad \text{vs.} \quad O(nd^2),$$

i.e., stochastic gradient vs. gradient descent vs. the closed-form solution.


Why decreasing step size?

Control Variance + Sufficient Descent ⇒ Convergence

Intuition:

$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta\nabla\mathcal{L}(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta\left(\nabla\mathcal{L}(\theta^{(k)}) - \nabla\ell_i(\theta^{(k)})\right)}_{\text{Error}}$$

Not summable (sufficient exploration):

$$\sum_{k=1}^{\infty}\eta_k = \infty.$$

Square summable (diminishing variance):

$$\sum_{k=1}^{\infty}\eta_k^2 < \infty.$$


Minibatch Variance Reduction

Mini-batch SGD: if $|M_i| \uparrow$, then $M^2 \downarrow$.

Larger $|M_i|$ means more computational cost per iteration.

Smaller $M^2$ means fewer iterations.
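A quick numerical check (my own, not from the slides) that averaging a minibatch of size $|M_i|$ shrinks the gradient variance roughly by a factor of $|M_i|$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

full_grad = X.T @ (X @ theta - y) / n         # gradient of the full loss

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)
    return X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size

for b in [1, 10, 100]:
    errs = [np.sum((minibatch_grad(b) - full_grad) ** 2) for _ in range(2000)]
    print(b, np.mean(errs))                   # roughly decreases like 1/b
```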


Variance Reduction by Control Variates

Stochastic Variance Reduced Gradient Algorithm (SVRG):

At the $k$-th epoch, set

$$\tilde{\theta} = \theta^{[k]}, \quad \theta^{(0)} = \theta^{[k]}.$$

At the $t$-th iteration of the $k$-th epoch,

$$\theta^{(t+1)} = \theta^{(t)} - \eta\left(\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta})\right).$$

After $m$ iterations of the $k$-th epoch,

$$\theta^{[k+1]} = \theta^{(m)}.$$
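A minimal sketch of the SVRG updates above for the least-squares loss (my own code, not from the slides; the fixed step size defaults to roughly $1/L_{\max}$ with $L_{\max} = \max_i\|x_i\|_2^2$, and the epoch length defaults to $m = n$):

```python
import numpy as np

def svrg_ols(X, y, num_epochs=20, m=None, eta=None, seed=0):
    """SVRG for L(theta) = (1/n) sum_i 0.5 * (y_i - x_i^T theta)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or n                                      # inner-loop length per epoch
    eta = eta or 1.0 / np.max(np.sum(X ** 2, axis=1))  # ~ 1 / L_max
    theta_snap = np.zeros(d)                        # snapshot (theta tilde)
    for _ in range(num_epochs):
        full_grad = X.T @ (X @ theta_snap - y) / n  # grad of L at the snapshot
        theta = theta_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            g_i = (X[i] @ theta - y[i]) * X[i]            # grad of ell_i at theta
            g_i_snap = (X[i] @ theta_snap - y[i]) * X[i]  # grad of ell_i at snapshot
            theta = theta - eta * (g_i - g_i_snap + full_grad)
        theta_snap = theta                          # theta^{[k+1]} = theta^{(m)}
    return theta_snap
```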


Strong Smoothness and Convexity

Regularity Conditions

(Strong Smoothness) There exist constants $L_i$'s such that for any $\theta$ and $\theta'$, we have

$$\ell_i(\theta') - \ell_i(\theta) - \nabla\ell_i(\theta)^\top(\theta' - \theta) \le \frac{L_i}{2}\left\|\theta' - \theta\right\|_2^2.$$

(Strong Convexity) There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) \ge \frac{\mu}{2}\left\|\theta' - \theta\right\|_2^2.$$

Condition Number

$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \ge \kappa = \frac{L}{\mu}.$$


Why does SVRG work?

The strong smoothness implies

$$\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2 \le L_{\max}\left\|\theta^{(k)} - \tilde{\theta}\right\|_2.$$

Bias correction:

$$\mathbb{E}\left[\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right] = \nabla\mathcal{L}(\theta^{(k)}) - \nabla\mathcal{L}(\tilde{\theta}).$$

Variance reduction: as $\theta^{(k)} \to \hat{\theta}$ and $\tilde{\theta} \to \hat{\theta}$,

$$\mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta}) - \nabla\mathcal{L}(\theta^{(k)})\right\|_2^2 \le \mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2^2 \to 0.$$


Convergence of SVRG

How many iterations do we need?

Fixed step size parameter: $\eta_k \asymp \frac{1}{L_{\max}}$.

Given a pre-specified error $\varepsilon$ and $m \asymp \kappa_{\max}$, we need

$$K = O\!\left(\log\left(\frac{1}{\varepsilon}\right)\right)$$

epochs such that

$$\mathbb{E}\left\|\theta^{[K]} - \hat{\theta}\right\|_2^2 \le \varepsilon.$$

Total number of operations:

$$O(nd + d\kappa_{\max}) \quad \text{vs.} \quad O\!\left(\frac{dM^2}{\mu^2\varepsilon} + \frac{d\kappa^2}{\varepsilon}\right) \quad \text{vs.} \quad O(nd\kappa).$$


Comparison of GD, SGD and SVRG


Summary

The empirical performance depends heavily on the implementation.

Cyclic or shuffled order (not stochastic) is often used in practice.

"Too many tuning parameters" means "my algorithm might only work in theory".

Theoretical bounds can be very loose. The constant may matter a lot in practice.

Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.'s, if they know how to code efficient algorithms.


Classification Analysis


Classification vs. Regression


Logistic Regression

Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,

$$y_i \sim \mathrm{Bernoulli}\!\left(h(x_i^\top\theta^*)\right) \quad \text{for } i = 1, \ldots, n,$$

where $h: (-\infty, \infty) \to [0, 1]$.

Logistic/sigmoid function:

$$h(z) = \frac{1}{1 + \exp(-z)}.$$

Remark: $h(0) = 0.5$, $h(-\infty) = 0$, and $h(\infty) = 1$.


Logistic/Sigmoid Function


Logistic Regression

Maximum Likelihood Estimation

$$\begin{aligned}
\hat{\theta} &= \arg\max_{\theta} L(\theta)\\
&= \arg\max_{\theta}\log\prod_{i=1}^n\left(h(x_i^\top\theta)\right)^{y_i}\left(1 - h(x_i^\top\theta)\right)^{1 - y_i}\\
&= \arg\max_{\theta}\sum_{i=1}^n\left[y_i\log h(x_i^\top\theta) + (1 - y_i)\log\left(1 - h(x_i^\top\theta)\right)\right]\\
&= \arg\max_{\theta}\sum_{i=1}^n\left[y_i\cdot x_i^\top\theta - \log\left(1 + \exp(x_i^\top\theta)\right)\right]\\
&= \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\left[\log\left(1 + \exp(x_i^\top\theta)\right) - y_i\cdot x_i^\top\theta\right].
\end{aligned}$$
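A sketch of the resulting objective and its gradient for labels $y_i \in \{0, 1\}$ (my own code, not from the slides; `np.logaddexp(0, z)` computes $\log(1 + e^z)$ in a numerically stable way):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """(1/n) sum_i [ log(1 + exp(x_i^T theta)) - y_i * x_i^T theta ]."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def neg_log_likelihood_grad(theta, X, y):
    """Gradient: (1/n) X^T (h(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
```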


Optimization for Logistic Regression

Convex problem? Let $F(\theta) = -L(\theta)$. Then

$$\nabla^2 F(\theta) = \frac{1}{n}\sum_{i=1}^n h(x_i^\top\theta)\left(1 - h(x_i^\top\theta)\right)x_i x_i^\top \succeq 0.$$

No closed-form solution.

Gradient descent and stochastic gradient algorithms are applicable.


Prediction for Logistic Regression

Prediction: given $x^*$, we predict $\hat{y}^* = 1$ if

$$\mathbb{P}(y^* = 1) = \frac{1}{1 + \exp(-\hat{\theta}^\top x^*)} \ge 0.5.$$

Why linear classification?

$$\mathbb{P}(y^* = 1) \ge 0.5 \;\Leftrightarrow\; \hat{\theta}^\top x^* \ge 0 \;\Leftrightarrow\; \hat{y}^* = \mathrm{sign}(\hat{\theta}^\top x^*).$$


Logistic Loss

Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,

$$\mathbb{P}(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top\theta^*)} \quad \text{for } i = 1, \ldots, n.$$

An alternative formulation:

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\log\left(1 + \exp(-y_i x_i^\top\theta)\right).$$

We can also use the 0-1 loss:

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\mathbb{1}\left(\mathrm{sign}(x_i^\top\theta) \ne y_i\right).$$
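A small sketch (mine, not from the slides) of the two objectives for labels in $\{-1, +1\}$; the logistic loss is the smooth convex surrogate, while the 0-1 loss is piecewise constant:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) sum_i log(1 + exp(-y_i * x_i^T theta)), with y_i in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def zero_one_loss(theta, X, y):
    """(1/n) sum_i 1{ sign(x_i^T theta) != y_i }."""
    return np.mean(np.sign(X @ theta) != y)
```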


Loss Functions for Classification


Newton’s Method

At the $k$-th iteration, we take

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\left[\nabla^2 F(\theta^{(k)})\right]^{-1}\nabla F(\theta^{(k)}),$$

where $\eta_k > 0$ is a step size parameter.

The second-order Taylor approximation:

$$\theta^{(k+0.5)} = \arg\min_{\theta}\; F(\theta^{(k)}) + \nabla F(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{1}{2}(\theta - \theta^{(k)})^\top\nabla^2 F(\theta^{(k)})(\theta - \theta^{(k)}).$$

Backtracking line search:

$$\theta^{(k+1)} = \theta^{(k)} + \eta_k\left(\theta^{(k+0.5)} - \theta^{(k)}\right).$$
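A sketch of the iteration above applied to the logistic regression objective $F(\theta)$ with labels in $\{0, 1\}$ (my own code, not from the slides; the Hessian follows the formula from the earlier slide, and a fixed $\eta_k = 1$ replaces the line search for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=20, eta=1.0):
    """Newton's method for F(theta) = (1/n) sum_i [log(1+exp(x_i^T theta)) - y_i x_i^T theta]."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        p = sigmoid(X @ theta)                          # h(x_i^T theta)
        grad = X.T @ (p - y) / n                        # gradient of F
        H = (X * (p * (1 - p))[:, None]).T @ X / n      # Hessian of F
        theta = theta - eta * np.linalg.solve(H, grad)  # (damped) Newton step
    return theta
```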


Newton’s Method


Newton’s Method

Sublinear + quadratic convergence:

Given $\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 \le R \ll 1$, we have

$$\left\|\theta^{(k+1)} - \hat{\theta}\right\|_2^2 \le (1 - \delta)\left\|\theta^{(k)} - \hat{\theta}\right\|_2^4.$$

Given $\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 \ge R$, we have

$$\left\|\theta^{(k+1)} - \hat{\theta}\right\|_2^2 = O(1/k).$$

Iteration complexity (some parameters hidden for simplicity):

$$O\!\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\varepsilon}\right)\right).$$


Newton’s Method

Advantages:

More efficient for highly accurate solutions.

Avoids extensive calculation of log or exp functions (Taylor expansions combined with a table).

Fewer line search steps (due to quadratic convergence).

Often more efficient than gradient descent.


Newton’s Method


Newton’s Method

Disadvantages:

Computing inverse Hessian matrices is expensive!

Storing inverse Hessian matrices is expensive!

Subsampled Newton: replace the exact Hessian with an estimate $H(\theta^{(k)})$ built from a subsample,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\left[H(\theta^{(k)})\right]^{-1}\nabla F(\theta^{(k)}).$$

Quasi-Newton methods (DFP, BFGS, Broyden, SR1) use differences of gradient vectors to approximate Hessian matrices.
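In practice these updates are rarely hand-coded. As a usage sketch (assuming SciPy is available, with my own simulated data), `scipy.optimize.minimize` with `method="BFGS"` builds the Hessian approximation from gradient differences; the limited-memory `"L-BFGS-B"` variant avoids storing it explicitly:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

def F(theta):      # negative average log-likelihood
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def gradF(theta):  # its gradient
    return X.T @ (1.0 / (1.0 + np.exp(-X @ theta)) - y) / n

res = minimize(F, np.zeros(d), jac=gradF, method="BFGS")
print(res.x)
```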
