
Lecture 1: Supervised Learning

Tuo Zhao

Schools of ISYE and CSE, Georgia Tech


ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine Learning

(Supervised) Regression Analysis

Example: living areas and prices of 47 houses:

CS229 Lecture notes

Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price (1000$s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...

We can plot this data:

[Scatter plot "housing prices": price (in $1000) versus living area (square feet).]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?


Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $f^* : \mathbb{R}^d \to \mathbb{R}$,

$$y_i = f^*(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$

where the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^2 = \sigma^2 < \infty$.

Simple linear function: $f^*(x_i) = x_i^\top \theta^*$.

Why is it called supervised learning?


Why Supervised?


Play on Words?

Two unknown functions $f_0^*, f_1^* : \mathbb{R}^d \to \mathbb{R}$?

$$y_i = \mathbb{1}(z_i = 1)\cdot f_1^*(x_i) + \mathbb{1}(z_i = 0)\cdot f_0^*(x_i) + \varepsilon_i,$$

where $i = 1, \ldots, n$, and the $z_i$'s are i.i.d. with

$$\mathbb{P}(z_i = 1) = \delta \quad \text{and} \quad \mathbb{P}(z_i = 0) = 1 - \delta \quad \text{for } \delta \in (0, 1).$$

The $z_i$'s are latent variables. Supervised? Unsupervised?


Linear Regression


Linear Regression

Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \mathbb{R}$, and $\theta^* \in \mathbb{R}^d$,

$$y_i = x_i^\top \theta^* + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$

where the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^2 = \sigma^2 < \infty$.

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\sum_{i=1}^n (y_i - x_i^\top\theta)^2.$$

Least Absolute Deviation Regression:

$$\hat{\theta}^{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^n |y_i - x_i^\top\theta|.$$


Robust Regression


Linear Regression — Matrix Notation

$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n\times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,

$$y = X\theta^* + \varepsilon,$$

where $\mathbb{E}\varepsilon = 0$ and $\mathbb{E}\varepsilon\varepsilon^\top = \sigma^2 I_n$.

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \frac{1}{2n}\left\|y - X\theta\right\|_2^2.$$

Least Absolute Deviation Regression:

$$\hat{\theta}^{\mathrm{LAD}} = \arg\min_{\theta} \frac{1}{n}\left\|y - X\theta\right\|_1.$$


Least Square Regression — Analytical Solution

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\left\|y - X\theta\right\|_2^2}_{\mathcal{L}(\theta)}.$$

First-order optimality condition:

$$\nabla\mathcal{L}(\theta) = \frac{1}{n}X^\top(X\theta - y) = 0 \;\Rightarrow\; X^\top X\theta = X^\top y.$$

Analytical solution and unbiasedness (assuming $X^\top X$ is invertible):

$$\hat{\theta} = (X^\top X)^{-1}X^\top y = (X^\top X)^{-1}X^\top(X\theta^* + \varepsilon) = \theta^* + (X^\top X)^{-1}X^\top\varepsilon \;\Rightarrow\; \mathbb{E}_\varepsilon[\hat{\theta}] = \theta^*.$$
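The algebra above translates directly into a few lines of NumPy. A minimal sketch (the simulated data and variable names are my own, not from the slides); `np.linalg.solve` on the normal equations is used rather than forming $(X^\top X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 1000, 5, 0.5

# Simulate y = X theta* + eps with i.i.d. noise.
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + sigma * rng.normal(size=n)

# Solve the normal equations X^T X theta = X^T y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```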


Least Square Regression — Convexity

Ordinary Least Square Regression:

$$\hat{\theta}^{\mathrm{OLS}} = \arg\min_{\theta} \underbrace{\frac{1}{2n}\left\|y - X\theta\right\|_2^2}_{\mathcal{L}(\theta)}.$$

Second-order optimality condition:

$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X \succeq 0.$$

Convexity: for any $\theta$ and $\theta'$,

$$\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta).$$


Convex vs. Nonconvex Optimization

Stationary solutions: $\nabla\mathcal{L}(\theta) = 0$.

[Figure: a convex objective (left), where every stationary point is a global optimum, versus a nonconvex objective (right) with global optima, local optima, a local maximum, and saddle points.]

Convex: easy but restrictive. Nonconvex: difficult but flexible.

We may get stuck at a local optimum or saddle point for nonconvex optimization.


Maximum Likelihood Estimation

$X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n\times d}$, $y = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$,

$$y = X\theta^* + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^2 I_n)$.

Likelihood function:

$$L(\theta) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}(y - X\theta)^\top(y - X\theta)\right).$$

Maximum log-likelihood estimation:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\log L(\theta) = \arg\max_{\theta}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2\right].$$


Maximum Likelihood Estimation

Maximum log-likelihood estimation:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left\|y - X\theta\right\|_2^2\right].$$

Since $\sigma^2$ is a constant (even if unknown), it does not affect the maximizer:

$$\hat{\theta}^{\mathrm{MLE}} = \arg\max_{\theta}\, -\frac{1}{2n}\left\|y - X\theta\right\|_2^2 = \arg\min_{\theta}\frac{1}{2n}\left\|y - X\theta\right\|_2^2.$$

Probabilistic Interpretation:

Simple and illustrative.

Restrictive and potentially misleading.

Remember t-test? What if the model is wrong?


Computational Cost of OLS

The number of basic operations, e.g., addition, subtraction, multiplication, and division.

Matrix multiplication $X^\top X$: $O(nd^2)$

Matrix inverse $(X^\top X)^{-1}$: $O(d^3)$

Matrix-vector multiplication $X^\top y$: $O(nd)$

Matrix-vector multiplication $[(X^\top X)^{-1}][X^\top y]$: $O(d^2)$

Overall computational cost: $O(nd^2)$, given $n \gg d$.


Scalability and Efficiency of OLS

Simple closed-form solution.

Overall computational cost: $O(nd^2)$.

Massive data: both $n$ and $d$ are large.

Not very efficient or scalable.

Better ways to improve the computation?


Optimization for Linear Regression


Vanilla Gradient Descent

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\mathcal{L}(\theta^{(k)}).$$

$\eta_k > 0$: the step size parameter (fixed or chosen by line search).

Stop when the gradient is small: $\left\|\nabla\mathcal{L}(\theta^{(K)})\right\|_2 \le \delta$.

[Figure: gradient descent on $f(\theta)$; from the current iterate $\theta^{(k)}$ we move along $-\nabla f(\theta^{(k)})$ toward the minimizer $\hat{\theta}$, where $\nabla f(\hat{\theta}) = 0$.]
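A minimal sketch of the update above for the least-squares loss (my own code, not from the slides; the fixed step size $1/L$ uses $L = \Lambda_{\max}(X^\top X/n)$, and the stopping rule is the small-gradient criterion):

```python
import numpy as np

def gradient_descent_ols(X, y, num_iters=500, tol=1e-8):
    """Vanilla gradient descent on L(theta) = ||y - X theta||^2 / (2n)."""
    n, d = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant
    theta = np.zeros(d)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n        # gradient of the loss
        if np.linalg.norm(grad) <= tol:         # stop when the gradient is small
            break
        theta = theta - grad / L                # fixed step size 1/L
    return theta

# Example usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=200)
print(gradient_descent_ols(X, y))
```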


Computational Cost of VGD

Gradient: $\nabla\mathcal{L}(\theta^{(k)}) = \frac{1}{n}X^\top(X\theta^{(k)} - y)$

Matrix-vector multiplication $X\theta^{(k)}$: $O(nd)$

Vector subtraction $X\theta^{(k)} - y$: $O(n)$

Matrix-vector multiplication $X^\top(X\theta^{(k)} - y)$: $O(nd)$

Overall computational cost per iteration: $O(nd)$

Better than $O(nd^2)$, but how many iterations?


Rate of Convergence

What are Good Algorithms?

Asymptotic convergence: $\theta^{(k)} \to \hat{\theta}$ as $k \to \infty$?

Nonasymptotic rate of convergence: the optimization error after $k$ iterations.

Example: gap in objective value (sublinear convergence),

$$f(\theta^{(k)}) - f(\hat{\theta}) = O\!\left(L/k^2\right) \quad \text{vs.} \quad O\!\left(L/k\right),$$

where $L$ is some constant depending on the problem.

Example: gap in parameter (linear convergence),

$$\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 = O\!\left((1 - 1/\kappa)^k\right) \quad \text{vs.} \quad O\!\left((1 - 1/\sqrt{\kappa})^k\right),$$

where $\kappa$ is some constant depending on the problem.


Iteration Complexity of Gradient Descent

Iteration Complexity

We need at most

$$K = O\!\left(\kappa\log\left(\frac{1}{\varepsilon}\right)\right)$$

iterations such that

$$\left\|\theta^{(K)} - \hat{\theta}\right\|_2^2 \le \varepsilon,$$

where $\kappa$ is some constant depending on the problem.

What is κ? It is related to smoothness and convexity.


Strong Convexity

There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') \ge \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\left\|\theta' - \theta\right\|_2^2.$$

[Figure: $\mathcal{L}(\theta')$ lies above the quadratic lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{\mu}{2}\|\theta' - \theta\|_2^2$, which in turn lies above the linear lower bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]


Strong Smoothness

There exists a constant $L$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') \le \mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\left\|\theta' - \theta\right\|_2^2.$$

[Figure: $\mathcal{L}(\theta')$ lies below the quadratic upper bound $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) + \frac{L}{2}\|\theta' - \theta\|_2^2$ and above the linear approximation $\mathcal{L}(\theta) + \nabla\mathcal{L}(\theta)^\top(\theta' - \theta)$.]


Condition Number κ = L/µ

[Contour plots of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ (left) and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$ (right).]
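As an illustration (mine, not from the slides), running gradient descent with step size $1/L$ on these two quadratics shows how $\kappa$ controls the iteration count: the left problem has $\kappa = 1.8/0.2 = 9$ and needs on the order of a hundred iterations, while the right problem has $\kappa = 1$ and converges in one step.

```python
import numpy as np

def gd_iterations(hess_diag, tol=1e-10, max_iter=10000):
    """Run GD with step size 1/L on f(theta) = 0.5 * theta^T diag(hess_diag) theta."""
    L = hess_diag.max()                    # smoothness constant
    theta = np.array([1.0, 1.0])
    for k in range(max_iter):
        grad = hess_diag * theta           # gradient of the diagonal quadratic
        if np.linalg.norm(grad) <= tol:
            return k
        theta = theta - grad / L
    return max_iter

# f(theta) = 0.9 th1^2 + 0.1 th2^2  ->  Hessian diag(1.8, 0.2), kappa = 9
# f(theta) = 0.5 th1^2 + 0.5 th2^2  ->  Hessian diag(1.0, 1.0), kappa = 1
print(gd_iterations(np.array([1.8, 0.2])))   # roughly two hundred iterations
print(gd_iterations(np.array([1.0, 1.0])))   # a single step suffices
```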


Vector Field Representation

[Gradient vector fields of $f(\theta) = 0.9\,\theta_1^2 + 0.1\,\theta_2^2$ (left) and $f(\theta) = 0.5\,\theta_1^2 + 0.5\,\theta_2^2$ (right).]


Understanding Regularity Conditions

Mean value theorem: there exists $z \in [0, 1]$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) = \frac{1}{2}(\theta' - \theta)^\top\nabla^2\mathcal{L}(\tilde{\theta})(\theta' - \theta),$$

where $\tilde{\theta}$ is a convex combination: $\tilde{\theta} = z\theta + (1 - z)\theta'$.

Hessian matrix for OLS:

$$\nabla^2\mathcal{L}(\theta) = \frac{1}{n}X^\top X.$$

Control the remainder:

$$\underbrace{\Lambda_{\min}\!\left(\tfrac{1}{n}X^\top X\right)}_{\mu} \le \frac{(\theta' - \theta)^\top\nabla^2\mathcal{L}(\tilde{\theta})(\theta' - \theta)}{\left\|\theta' - \theta\right\|_2^2} \le \underbrace{\Lambda_{\max}\!\left(\tfrac{1}{n}X^\top X\right)}_{L}.$$


Understanding Gradient Descent Algorithms

Iteratively Minimize Quadratic Approximation

At the $(k+1)$-th iteration, we consider

$$Q(\theta;\theta^{(k)}) = \mathcal{L}(\theta^{(k)}) + \nabla\mathcal{L}(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{L}{2}\left\|\theta - \theta^{(k)}\right\|_2^2.$$

We have

$$Q(\theta;\theta^{(k)}) \ge \mathcal{L}(\theta) \quad \text{and} \quad Q(\theta^{(k)};\theta^{(k)}) = \mathcal{L}(\theta^{(k)}).$$

We take

$$\theta^{(k+1)} = \arg\min_{\theta} Q(\theta;\theta^{(k)}) = \theta^{(k)} - \frac{1}{L}\nabla\mathcal{L}(\theta^{(k)}).$$


Backtracking Line Search

The worst case: a fixed step size $\eta_k = 1/L$.

At the $(k+1)$-th iteration, we first try $\eta_k = \eta_{k-1}$, i.e.,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\mathcal{L}(\theta^{(k)}) \quad \text{if } Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)}).$$

Otherwise, we shrink the step size and take

$$\eta_k = (1 - \delta)^m\eta_{k-1},$$

where $\delta \in (0, 1)$ and $m$ is the smallest positive integer such that $Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)})$.
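A sketch of this rule for the least-squares loss (my own code, with shrinkage factor $1 - \delta = 0.75$ as in the figure on the next slide). The acceptance test uses the common variant of $Q(\theta^{(k+1)};\theta^{(k)}) \ge \mathcal{L}(\theta^{(k+1)})$ with quadratic coefficient $1/(2\eta_k)$, which simplifies to a sufficient-decrease check:

```python
import numpy as np

def loss(theta, X, y):
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ theta) ** 2) / n

def grad(theta, X, y):
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def gd_backtracking(X, y, eta0=1.0, shrink=0.75, num_iters=200):
    """Gradient descent with backtracking: shrink eta until the quadratic model
    with coefficient 1/(2*eta) upper-bounds the loss at the candidate point."""
    theta = np.zeros(X.shape[1])
    eta = eta0
    for _ in range(num_iters):
        g = grad(theta, X, y)
        # Try the previous step size first, then shrink as needed.
        while loss(theta - eta * g, X, y) > loss(theta, X, y) - 0.5 * eta * g @ g:
            eta *= shrink
        theta = theta - eta * g
    return theta
```

The inner `while` loop always terminates, because any $\eta \le 1/L$ satisfies the test.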


Backtracking Line Search

[Figure: backtracking from $\theta^{(k)}$; candidate step sizes $\eta_k = \eta_{k-1}$, $\eta_k = 0.75\,\eta_{k-1}$, and $\eta_k = 0.75^2\,\eta_{k-1}$.]


Tradeoff Statistics and Computation


High Precision or Low Precision

Can we tolerate a large ε?

From a learning perspective, our interest is $\theta^*$, not $\hat{\theta}$.

Error decomposition:

$$\left\|\theta^{(K)} - \theta^*\right\|_2 \le \underbrace{\left\|\theta^{(K)} - \hat{\theta}\right\|_2}_{\text{Opt. Error}} + \underbrace{\left\|\hat{\theta} - \theta^*\right\|_2}_{\text{Stat. Error}}.$$

High precision expects something like $\left\|\theta^{(K)} - \hat{\theta}\right\|_2 \approx 10^{-10}$.

Does it make any difference?


High Precision or Low Precision

The statistical error is unavoidable!

[Figure (from the high-dimensional noisy Lasso literature): the optimization error $\log(\|\beta^t - \hat{\beta}\|_2)$ and the statistical error $\log(\|\beta^t - \beta^*\|_2)$ versus the iteration number $t$, for 10 different starting points, plotted on a logarithmic scale. The optimization error decreases geometrically until it reaches the statistical error floor.]


Tradeoff Statistical and Optimization Errors

The statistical error is unavoidable!

The statistical error of the optimal solution:

$$\mathbb{E}\left\|\hat{\theta} - \theta^*\right\|_2^2 = \mathbb{E}\left\|(X^\top X)^{-1}X^\top\varepsilon\right\|_2^2 = \sigma^2\,\mathrm{tr}\!\left[(X^\top X)^{-1}\right] = O\!\left(\frac{\sigma^2 d}{n}\right).$$

We only need

$$\left\|\theta^{(K)} - \hat{\theta}\right\|_2 \lesssim \left\|\hat{\theta} - \theta^*\right\|_2.$$

Given $K = O\!\left(\kappa\log\left(\frac{n}{\sigma^2 d}\right)\right)$, we have

$$\mathbb{E}\left\|\theta^{(K)} - \theta^*\right\|_2^2 = O\!\left(\frac{\sigma^2 d}{n}\right).$$


Agnostic Learning

All models are wrong, but some are useful!

Data generating process: $(X, Y) \sim \mathcal{D}$.

The oracle model: $f_{\mathrm{oracle}}(X) = X^\top\theta_{\mathrm{oracle}}$, where

$$\theta_{\mathrm{oracle}} = \arg\min_{\theta}\mathbb{E}_{\mathcal{D}}(Y - X^\top\theta)^2.$$

The estimated model: $\hat{f}(X) = X^\top\hat{\theta}$, where

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{2n}\left\|y - X\theta\right\|_2^2 \quad \text{and} \quad (x_1, y_1), \ldots, (x_n, y_n) \sim \mathcal{D}.$$

At the $K$-th iteration: $f^{(K)}(X) = X^\top\theta^{(K)}$.


Agnostic Learning (See more details in CS-7545)

All models are wrong, but some are useful!

[Figure: the class of all linear models contains the "oracle" model and the "estimated" model, while the "true" model lies outside the class.]


Agnostic Learning

"Approximation" error: $\mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2$

"Estimation" error: $\mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2$

Optimization error: $\mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2$

Decomposition of the statistical error:

$$\mathbb{E}_{\mathcal{D}}(Y - f^{(K)}(X))^2 \le \mathbb{E}_{\mathcal{D}}(Y - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(\hat{f}(X) - f_{\mathrm{oracle}}(X))^2 + \mathbb{E}_{\mathcal{D}}(f^{(K)}(X) - \hat{f}(X))^2.$$

How should we choose ε?


Scalable Computation of Linear Regression


Stochastic Approximation

What if n is too large?

Empirical risk minimization: $\mathcal{L}(\theta) = \frac{1}{n}\sum_{i=1}^n\ell_i(\theta)$.

For least square regression:

$$\ell_i(\theta) = \frac{1}{2}(y_i - x_i^\top\theta)^2 \quad \text{or} \quad \ell_i(\theta) = \frac{1}{2|M_i|}\sum_{j\in M_i}(y_j - x_j^\top\theta)^2.$$

Randomly sample $i$ from $\{1, \ldots, n\}$ with equal probability; then

$$\mathbb{E}_i\nabla\ell_i(\theta) = \nabla\mathcal{L}(\theta) \quad \text{and} \quad \mathbb{E}\left\|\nabla\ell_i(\theta) - \nabla\mathcal{L}(\theta)\right\|_2^2 \le M^2.$$

Stochastic gradient (SG): replace $\nabla\mathcal{L}(\theta)$ with $\nabla\ell_i(\theta)$,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\nabla\ell_{i_k}(\theta^{(k)}).$$
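A minimal sketch of the stochastic gradient update for least squares (my own code and hyperparameters, not from the slides; the decreasing step size and the averaged iterate anticipate the convergence discussion on the next slides):

```python
import numpy as np

def sgd_ols(X, y, num_iters=10000, eta0=0.1, seed=0):
    """Stochastic gradient for the least-squares loss: at each step, sample one
    index i uniformly and move along the negative gradient of ell_i(theta)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    avg = np.zeros(d)                         # running average of the iterates
    for k in range(num_iters):
        i = rng.integers(n)                   # uniform index => unbiased gradient
        g = (X[i] @ theta - y[i]) * X[i]      # gradient of ell_i(theta)
        eta = eta0 / (k + 1)                  # decreasing step size (see below)
        theta = theta - eta * g
        avg += (theta - avg) / (k + 1)        # averaged iterate, returned below
    return avg
```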


Why Stochastic Gradient?

Perturbed Descent Directions


Convergence of Stochastic Gradient Algorithms

How many iterations do we need?

A sequence of decreasing step size parameters: $\eta_k \asymp \frac{1}{k\mu}$.

Given a pre-specified error $\varepsilon$, we need

$$K = O\!\left(\frac{M^2 + L^2}{\mu^2\varepsilon}\right)$$

iterations such that

$$\mathbb{E}\left\|\bar{\theta}^{(K)} - \hat{\theta}\right\|_2^2 \le \varepsilon, \quad \text{where } \bar{\theta}^{(K)} = \frac{1}{K}\sum_{k=1}^{K}\theta^{(k)}.$$

When $\mu^2\varepsilon n \gg M^2 + L^2$, i.e., $n$ is very large,

$$O\!\left(\frac{d(M^2 + L^2)}{\mu^2\varepsilon}\right) \quad \text{vs.} \quad O(\kappa n d) \quad \text{vs.} \quad O(nd^2),$$

i.e., stochastic gradient vs. gradient descent vs. the closed-form solution.


Why decreasing step size?

Control Variance + Sufficient Descent ⇒ Convergence

Intuition:

$$\theta^{(k+1)} = \theta^{(k)} - \underbrace{\eta\nabla\mathcal{L}(\theta^{(k)})}_{\text{Descent}} + \underbrace{\eta\left(\nabla\mathcal{L}(\theta^{(k)}) - \nabla\ell_i(\theta^{(k)})\right)}_{\text{Error}}$$

Not summable (sufficient exploration):

$$\sum_{k=1}^{\infty}\eta_k = \infty.$$

Square summable (diminishing variance):

$$\sum_{k=1}^{\infty}\eta_k^2 < \infty.$$


Minibatch Variance Reduction

Mini-batch SGD: if $|M_i| \uparrow$, then $M^2 \downarrow$.

Larger $|M_i|$ means more computational cost per iteration.

Smaller $M^2$ means fewer iterations.
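A quick numerical check (my own, not from the slides) that averaging a minibatch of size $|M_i|$ shrinks the gradient variance roughly by a factor of $|M_i|$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

full_grad = X.T @ (X @ theta - y) / n         # gradient of the full loss

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)
    return X[idx].T @ (X[idx] @ theta - y[idx]) / batch_size

for b in [1, 10, 100]:
    errs = [np.sum((minibatch_grad(b) - full_grad) ** 2) for _ in range(2000)]
    print(b, np.mean(errs))                   # roughly decreases like 1/b
```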


Variance Reduction by Control Variates

Stochastic Variance Reduced Gradient Algorithm (SVRG):

At the $k$-th epoch, set

$$\tilde{\theta} = \theta^{[k]}, \quad \theta^{(0)} = \theta^{[k]}.$$

At the $t$-th iteration of the $k$-th epoch,

$$\theta^{(t+1)} = \theta^{(t)} - \eta\left(\nabla\ell_i(\theta^{(t)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta})\right).$$

After $m$ iterations of the $k$-th epoch,

$$\theta^{[k+1]} = \theta^{(m)}.$$
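A minimal sketch of the SVRG updates above for the least-squares loss (my own code, not from the slides; the fixed step size defaults to roughly $1/L_{\max}$ with $L_{\max} = \max_i\|x_i\|_2^2$, and the epoch length defaults to $m = n$):

```python
import numpy as np

def svrg_ols(X, y, num_epochs=20, m=None, eta=None, seed=0):
    """SVRG for L(theta) = (1/n) sum_i 0.5 * (y_i - x_i^T theta)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or n                                      # inner-loop length per epoch
    eta = eta or 1.0 / np.max(np.sum(X ** 2, axis=1))  # ~ 1 / L_max
    theta_snap = np.zeros(d)                        # snapshot (theta tilde)
    for _ in range(num_epochs):
        full_grad = X.T @ (X @ theta_snap - y) / n  # grad of L at the snapshot
        theta = theta_snap.copy()
        for _ in range(m):
            i = rng.integers(n)
            g_i = (X[i] @ theta - y[i]) * X[i]            # grad of ell_i at theta
            g_i_snap = (X[i] @ theta_snap - y[i]) * X[i]  # grad of ell_i at snapshot
            theta = theta - eta * (g_i - g_i_snap + full_grad)
        theta_snap = theta                          # theta^{[k+1]} = theta^{(m)}
    return theta_snap
```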


Strong Smoothness and Convexity

Regularity Conditions

(Strong Smoothness) There exist constants $L_i$'s such that for any $\theta$ and $\theta'$, we have

$$\ell_i(\theta') - \ell_i(\theta) - \nabla\ell_i(\theta)^\top(\theta' - \theta) \le \frac{L_i}{2}\left\|\theta' - \theta\right\|_2^2.$$

(Strong Convexity) There exists a constant $\mu$ such that for any $\theta$ and $\theta'$, we have

$$\mathcal{L}(\theta') - \mathcal{L}(\theta) - \nabla\mathcal{L}(\theta)^\top(\theta' - \theta) \ge \frac{\mu}{2}\left\|\theta' - \theta\right\|_2^2.$$

Condition Number

$$\kappa_{\max} = \frac{\max_i L_i}{\mu} \ge \kappa = \frac{L}{\mu}.$$


Why does SVRG work?

The strong smoothness implies

$$\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2 \le L_{\max}\left\|\theta^{(k)} - \tilde{\theta}\right\|_2.$$

Bias correction:

$$\mathbb{E}\left[\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right] = \nabla\mathcal{L}(\theta^{(k)}) - \nabla\mathcal{L}(\tilde{\theta}).$$

Variance reduction: as $\theta^{(k)} \to \hat{\theta}$ and $\tilde{\theta} \to \hat{\theta}$,

$$\mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta}) + \nabla\mathcal{L}(\tilde{\theta}) - \nabla\mathcal{L}(\theta^{(k)})\right\|_2^2 \le \mathbb{E}\left\|\nabla\ell_i(\theta^{(k)}) - \nabla\ell_i(\tilde{\theta})\right\|_2^2 \to 0.$$


Convergence of SVRG

How many iterations do we need?

Fixed step size parameter: $\eta_k \asymp \frac{1}{L_{\max}}$.

Given a pre-specified error $\varepsilon$ and $m \asymp \kappa_{\max}$, we need

$$K = O\!\left(\log\left(\frac{1}{\varepsilon}\right)\right)$$

epochs such that

$$\mathbb{E}\left\|\theta^{[K]} - \hat{\theta}\right\|_2^2 \le \varepsilon.$$

Total number of operations:

$$O(nd + d\kappa_{\max}) \quad \text{vs.} \quad O\!\left(\frac{dM^2}{\mu^2\varepsilon} + \frac{d\kappa^2}{\varepsilon}\right) \quad \text{vs.} \quad O(nd\kappa).$$


Comparison of GD, SGD and SVRG


Summary

The empirical performance depends heavily on the implementation.

Cyclic or shuffled order (not stochastic) is often used in practice.

"Too many tuning parameters" means "my algorithm might only work in theory".

Theoretical bounds can be very loose. The constant may matter a lot in practice.

Good software engineers with B.S./M.S. degrees can earn much more than Ph.D.'s, if they know how to code efficient algorithms.


Classification Analysis


Classification vs. Regression


Logistic Regression

Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $\theta^* \in \mathbb{R}^d$,

$$y_i \sim \mathrm{Bernoulli}\!\left(h(x_i^\top\theta^*)\right) \quad \text{for } i = 1, \ldots, n,$$

where $h: (-\infty, \infty) \to [0, 1]$.

Logistic/sigmoid function:

$$h(z) = \frac{1}{1 + \exp(-z)}.$$

Remark: $h(0) = 0.5$, $h(-\infty) = 0$, and $h(\infty) = 1$.


Logistic/Sigmoid Function


Logistic Regression

Maximum Likelihood Estimation

$$\begin{aligned}
\hat{\theta} &= \arg\max_{\theta} L(\theta)\\
&= \arg\max_{\theta}\log\prod_{i=1}^n\left(h(x_i^\top\theta)\right)^{y_i}\left(1 - h(x_i^\top\theta)\right)^{1 - y_i}\\
&= \arg\max_{\theta}\sum_{i=1}^n\left[y_i\log h(x_i^\top\theta) + (1 - y_i)\log\left(1 - h(x_i^\top\theta)\right)\right]\\
&= \arg\max_{\theta}\sum_{i=1}^n\left[y_i\cdot x_i^\top\theta - \log\left(1 + \exp(x_i^\top\theta)\right)\right]\\
&= \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\left[\log\left(1 + \exp(x_i^\top\theta)\right) - y_i\cdot x_i^\top\theta\right].
\end{aligned}$$
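A sketch of the resulting objective and its gradient for labels $y_i \in \{0, 1\}$ (my own code, not from the slides; `np.logaddexp(0, z)` computes $\log(1 + e^z)$ in a numerically stable way):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """(1/n) sum_i [ log(1 + exp(x_i^T theta)) - y_i * x_i^T theta ]."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def neg_log_likelihood_grad(theta, X, y):
    """Gradient: (1/n) X^T (h(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
```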


Optimization for Logistic Regression

Convex problem? Let $F(\theta) = -L(\theta)$. Then

$$\nabla^2 F(\theta) = \frac{1}{n}\sum_{i=1}^n h(x_i^\top\theta)\left(1 - h(x_i^\top\theta)\right)x_i x_i^\top \succeq 0.$$

No closed-form solution.

Gradient descent and stochastic gradient algorithms are applicable.


Prediction for Logistic Regression

Prediction: given $x^*$, we predict $\hat{y}^* = 1$ if

$$\mathbb{P}(y^* = 1) = \frac{1}{1 + \exp(-\hat{\theta}^\top x^*)} \ge 0.5.$$

Why linear classification?

$$\mathbb{P}(y^* = 1) \ge 0.5 \;\Leftrightarrow\; \hat{\theta}^\top x^* \ge 0 \;\Leftrightarrow\; \hat{y}^* = \mathrm{sign}(\hat{\theta}^\top x^*).$$


Logistic Loss

Given $x_1, \ldots, x_n \in \mathbb{R}^d$, $y_1, \ldots, y_n \in \{-1, 1\}$, and $\theta^* \in \mathbb{R}^d$,

$$\mathbb{P}(y_i = 1) = \frac{1}{1 + \exp(-x_i^\top\theta^*)} \quad \text{for } i = 1, \ldots, n.$$

An alternative formulation:

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\log\left(1 + \exp(-y_i x_i^\top\theta)\right).$$

We can also use the 0-1 loss:

$$\hat{\theta} = \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^n\mathbb{1}\left(\mathrm{sign}(x_i^\top\theta) \ne y_i\right).$$
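A small sketch (mine, not from the slides) of the two objectives for labels in $\{-1, +1\}$; the logistic loss is the smooth convex surrogate, while the 0-1 loss is piecewise constant:

```python
import numpy as np

def logistic_loss(theta, X, y):
    """(1/n) sum_i log(1 + exp(-y_i * x_i^T theta)), with y_i in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def zero_one_loss(theta, X, y):
    """(1/n) sum_i 1{ sign(x_i^T theta) != y_i }."""
    return np.mean(np.sign(X @ theta) != y)
```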


Loss Functions for Classification


Newton’s Method

At the $k$-th iteration, we take

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\left[\nabla^2 F(\theta^{(k)})\right]^{-1}\nabla F(\theta^{(k)}),$$

where $\eta_k > 0$ is a step size parameter.

The second-order Taylor approximation:

$$\theta^{(k+0.5)} = \arg\min_{\theta}\; F(\theta^{(k)}) + \nabla F(\theta^{(k)})^\top(\theta - \theta^{(k)}) + \frac{1}{2}(\theta - \theta^{(k)})^\top\nabla^2 F(\theta^{(k)})(\theta - \theta^{(k)}).$$

Backtracking line search:

$$\theta^{(k+1)} = \theta^{(k)} + \eta_k\left(\theta^{(k+0.5)} - \theta^{(k)}\right).$$
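A sketch of the iteration above applied to the logistic regression objective $F(\theta)$ with labels in $\{0, 1\}$ (my own code, not from the slides; the Hessian follows the formula from the earlier slide, and a fixed $\eta_k = 1$ replaces the line search for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=20, eta=1.0):
    """Newton's method for F(theta) = (1/n) sum_i [log(1+exp(x_i^T theta)) - y_i x_i^T theta]."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_iters):
        p = sigmoid(X @ theta)                          # h(x_i^T theta)
        grad = X.T @ (p - y) / n                        # gradient of F
        H = (X * (p * (1 - p))[:, None]).T @ X / n      # Hessian of F
        theta = theta - eta * np.linalg.solve(H, grad)  # (damped) Newton step
    return theta
```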


Newton’s Method


Newton’s Method

Sublinear + quadratic convergence:

Given $\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 \le R \ll 1$, we have

$$\left\|\theta^{(k+1)} - \hat{\theta}\right\|_2^2 \le (1 - \delta)\left\|\theta^{(k)} - \hat{\theta}\right\|_2^4.$$

Given $\left\|\theta^{(k)} - \hat{\theta}\right\|_2^2 \ge R$, we have

$$\left\|\theta^{(k+1)} - \hat{\theta}\right\|_2^2 = O(1/k).$$

Iteration complexity (some parameters hidden for simplicity):

$$O\!\left(\log\left(\frac{1}{R}\right) + \log\log\left(\frac{1}{\varepsilon}\right)\right).$$


Newton’s Method

Advantages:

More efficient for highly accurate solutions.

Avoids extensive calculation of log or exp functions (Taylor expansions combined with a table).

Fewer line search steps (due to quadratic convergence).

Often more efficient than gradient descent.


Newton’s Method


Newton’s Method

Disadvantages:

Computing inverse Hessian matrices is expensive!

Storing inverse Hessian matrices is expensive!

Subsampled Newton: replace the exact Hessian with an estimate $H(\theta^{(k)})$ built from a subsample,

$$\theta^{(k+1)} = \theta^{(k)} - \eta_k\left[H(\theta^{(k)})\right]^{-1}\nabla F(\theta^{(k)}).$$

Quasi-Newton methods (DFP, BFGS, Broyden, SR1) use differences of gradient vectors to approximate Hessian matrices.
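In practice these updates are rarely hand-coded. As a usage sketch (assuming SciPy is available, with my own simulated data), `scipy.optimize.minimize` with `method="BFGS"` builds the Hessian approximation from gradient differences; the limited-memory `"L-BFGS-B"` variant avoids storing it explicitly:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_star))).astype(float)

def F(theta):      # negative average log-likelihood
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def gradF(theta):  # its gradient
    return X.T @ (1.0 / (1.0 + np.exp(-X @ theta)) - y) / n

res = minimize(F, np.zeros(d), jac=gradF, method="BFGS")
print(res.x)
```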
