Linear Regression
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781, Jan 23, 2014



Page 1:

Linear Regression

Aarti Singh & Barnabas Poczos

Machine Learning 10-701/15-781, Jan 23, 2014

Page 2:

So far …

• Learning distributions
  – Maximum Likelihood Estimation (MLE)
  – Maximum A Posteriori (MAP)

• Learning classifiers
  – Naïve Bayes

Page 3:

Discrete to Continuous Labels

Classification: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell).

Regression: Stock Market Prediction, e.g. X = Feb 01, Y = ? (a continuous value to predict).

Page 4:

Regression Tasks

Weather Prediction: X = time (e.g. 7 pm), Y = temperature.

Estimating Contamination: X = new location, Y = sensor reading.

Page 5:

Supervised Learning

Goal: learn a predictor f that performs well as measured by a loss function (performance measure).

Classification (e.g. X = Document, Y = Topic among Sports / Science / News): the loss is the probability of error, $P(f(X) \neq Y)$.

Regression (e.g. X = Feb 01, Y = ?): the loss is the mean squared error, $\mathbb{E}[(f(X) - Y)^2]$.

Page 6:

Regression Algorithms

A learning algorithm maps training data to a predictor. Examples of regression algorithms:

Linear Regression
Regularized Linear Regression – Ridge Regression, Lasso
Polynomial Regression
Kernel Regression
Regression Trees, Splines, Wavelet estimators, …

Page 7:

Replace Expectation with Empirical Mean

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_f \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$

Law of Large Numbers: as n → ∞, the empirical mean converges to the expectation, so the empirical loss is a good proxy for the expected loss.

Page 8:

Restrict Class of Predictors

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$, over a restricted class of predictors $\mathcal{F}$.

Why? Overfitting! Without the restriction, the empirical loss is minimized (driven to zero) by any function that passes through every training point $(X_i, Y_i)$.

Page 9:

Restrict Class of Predictors

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$

The class of predictors $\mathcal{F}$ can be, for example:
- the class of linear functions
- the class of polynomial functions
- a class of nonlinear functions

Page 10:

Linear Regression

$\mathcal{F}$ = class of linear functions.

Univariate case: $f(X) = \beta_1 + \beta_2 X$, where $\beta_1$ is the intercept and $\beta_2$ is the slope.

Multivariate case: $f(X) = X\beta$, where $X = (1, X^{(1)}, \dots, X^{(p-1)})$ includes a constant feature 1 to absorb the intercept, and $\beta = (\beta_1, \dots, \beta_p)^\top$.

Least Squares Estimator

Page 11:

Least Squares Estimator

$\widehat{\beta} = \arg\min_\beta \; \frac{1}{n}\sum_{i=1}^{n} (Y_i - X_i\beta)^2$, with predictor $f(X_i) = X_i\beta$.

Page 12:

Least Squares Estimator (continued)

In matrix form, with $A$ the $n \times p$ matrix whose rows are the $X_i$, and $Y$ the $n \times 1$ vector of labels: $J(\beta) = \frac{1}{n}\|Y - A\beta\|_2^2$. Setting the gradient $\nabla_\beta J(\beta) = \frac{2}{n}A^\top(A\beta - Y)$ to zero yields the normal equations.

Page 13:

Normal Equations

$A^\top A\,\beta = A^\top Y$   (dimensions: $p \times p$, $p \times 1$, $p \times 1$)

If $A^\top A$ is invertible, $\widehat{\beta} = (A^\top A)^{-1} A^\top Y$.

When is $A^\top A$ invertible? Recall: full-rank matrices are invertible. What is the rank of $A^\top A$? (It equals the rank of A, which is at most min(n, p), so at least p samples are needed.)

What if $A^\top A$ is not invertible? Regularization (later).
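For concreteness, a minimal NumPy sketch of the least squares estimator (the function name and synthetic data are our own, illustrative choices); np.linalg.lstsq is used instead of forming the inverse explicitly, which is the numerically preferred route and also handles a rank-deficient A:

import numpy as np

def least_squares(A, Y):
    """Least squares estimator: solve the normal equations A^T A beta = A^T Y.

    A : (n, p) design matrix (first column all ones to absorb the intercept)
    Y : (n,) vector of labels
    """
    # Solves min ||A beta - Y||^2 without forming (A^T A)^{-1} explicitly.
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return beta

# Toy usage: fit Y ≈ beta_1 + beta_2 * x on synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
Y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)
A = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix
print(least_squares(A, Y))                  # approximately [2, 3]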

Page 14:

Gradient Descent

Even when $A^\top A$ is invertible, computing $(A^\top A)^{-1}$ might be computationally expensive if A is huge.

Treat it as an optimization problem. Observation: $J(\beta)$ is convex in β (in one dimension, $J(\beta_1)$ is a convex curve; in two, $J(\beta_1, \beta_2)$ is a convex bowl).

How to find the minimizer?

Page 15:

Gradient Descent

Even when $A^\top A$ is invertible, the closed form might be computationally expensive if A is huge.

Since J(β) is convex, move along the negative of the gradient:

Initialize: $\beta^{(0)}$
Update: $\beta^{(t+1)} = \beta^{(t)} - \alpha\,\nabla J(\beta^{(t)})$, where α is the step size; the update is 0 if $\nabla J(\beta^{(t)}) = 0$, i.e. at the minimizer.
Stop: when some criterion is met, e.g. a fixed number of iterations, or $\|\nabla J(\beta^{(t)})\| < \varepsilon$.
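As an illustration, a minimal NumPy sketch of this loop for $J(\beta) = \frac{1}{n}\|Y - A\beta\|_2^2$ (the fixed step size, tolerance, and iteration cap are illustrative choices, not from the slides):

import numpy as np

def gradient_descent(A, Y, alpha=0.1, eps=1e-6, max_iters=10000):
    """Minimize J(beta) = (1/n) ||Y - A beta||^2 by gradient descent."""
    n, p = A.shape
    beta = np.zeros(p)                                # Initialize
    for _ in range(max_iters):
        grad = (2.0 / n) * A.T @ (A @ beta - Y)       # gradient of J at beta
        if np.linalg.norm(grad) < eps:                # Stop: gradient (nearly) zero
            break
        beta = beta - alpha * grad                    # Update: step along -gradient
    return beta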

Page 16:

Effect of Step Size α

Large α => fast convergence but larger residual error; oscillations are also possible.

Small α => slow convergence but small residual error.

Page 17:

Least Squares and MLE

Intuition: signal plus (zero-mean) noise model, $Y_i = X_i\beta^* + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d., so $\mathbb{E}[Y \mid X] = X\beta^*$.

Maximizing the log likelihood shows: the Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!
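The equivalence follows from the Gaussian log likelihood; a short reconstruction of the standard argument (σ² is the noise variance):

$\log p(Y_1, \dots, Y_n \mid X, \beta) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}\right) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - X_i\beta)^2$

Maximizing over β ignores the constant and the positive factor $1/(2\sigma^2)$, leaving exactly the least squares objective, so $\widehat{\beta}_{MLE} = \widehat{\beta}_{LS}$.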

Page 18:

Regularized Least Squares and MAP

What if $A^\top A$ is not invertible?

MAP estimation: maximize log likelihood + log prior.

I) Gaussian prior centered at 0. This gives Ridge Regression:

$\widehat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$
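A minimal NumPy sketch of this closed form (the function name and the value of lam are illustrative):

import numpy as np

def ridge(A, Y, lam=1.0):
    """Ridge regression: beta = (A^T A + lam I)^{-1} A^T Y.

    Adding lam * I makes the matrix invertible even when A^T A is not."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)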

Page 19:

Regularized Least Squares and MAP (continued)

I) Gaussian prior centered at 0 (Ridge Regression): the prior belief that β is Gaussian with zero mean biases the solution toward "small" β.

Page 20:

Regularized Least Squares and MAP (continued)

II) Laplace prior centered at 0. This gives the Lasso: the prior belief that β is Laplace with zero mean biases the solution toward "small" β.

Page 21:

Ridge Regression vs Lasso

Ridge Regression: $\widehat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - X_i\beta)^2 + \lambda \|\beta\|_2^2$

Lasso: $\widehat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - X_i\beta)^2 + \lambda \|\beta\|_1$

Lasso (l1 penalty) results in sparse solutions – vectors with more zero coordinates. Good for high-dimensional problems – you don't have to store all the coordinates!

Ideally one would use an l0 penalty (the number of nonzero coordinates), but the optimization then becomes non-convex; the l1 norm is its convex surrogate.

(Figure: in the (β1, β2) plane, the level sets of J(β) touch the constant-l1-norm ball at a corner, where a coordinate is exactly zero, but touch the constant-l2-norm ball at a generic, non-sparse point.)
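Unlike ridge, the lasso has no closed form because the l1 norm is not differentiable at zero. One standard solver, not covered in these slides, is proximal gradient descent (ISTA); a minimal sketch with illustrative parameter choices:

import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrinks each coordinate toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, Y, lam=0.1, n_iters=1000):
    """Lasso via proximal gradient descent (ISTA).

    Minimizes ||Y - A beta||^2 + lam * ||beta||_1."""
    # Step size 1/L, where L is the Lipschitz constant of the gradient of the
    # smooth part ||Y - A beta||^2; here L = 2 * (largest singular value of A)^2.
    L = 2.0 * np.linalg.norm(A, ord=2) ** 2
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ beta - Y)          # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

The soft-thresholding step is what sets small coordinates exactly to zero, producing the sparsity described above.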

Page 22:

Beyond Linear Regression

Polynomial regression
Regression with nonlinear features

Later …
Kernel regression
Local/weighted regression

Page 23:

Polynomial Regression

Univariate (1-dim) case: $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_m X^m$ (degree m), where the coefficients $\beta_0, \dots, \beta_m$ are learned by least squares on the features $(1, X, X^2, \dots, X^m)$.

Multivariate (p-dim) case:

$f(X) = \beta_0 + \sum_{i=1}^{p} \beta_i X^{(i)} + \sum_{i=1}^{p}\sum_{j=1}^{p} \beta_{ij} X^{(i)} X^{(j)} + \sum_{i=1}^{p}\sum_{j=1}^{p}\sum_{k=1}^{p} \beta_{ijk} X^{(i)} X^{(j)} X^{(k)} + \dots$ (terms up to degree m)

Page 24:

Polynomial Regression (continued)

(Figure: four panels showing polynomial fits of order k = 1, 2, 3, 7 to the same data on [0, 1]; the k = 7 fit oscillates wildly, with values swinging far outside the range of the data.)

Polynomial of order k, equivalently of degree up to k-1.

What is the right order? Recall overfitting! More later …
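A sketch of how such a comparison might be reproduced with the least squares machinery above (the data-generating function and the orders tried are illustrative):

import numpy as np

def poly_design(x, k):
    """Design matrix with columns 1, x, x^2, ..., x^(k-1): a polynomial of order k."""
    return np.vander(x, N=k, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # toy data

for k in (1, 2, 3, 7):
    beta, *_ = np.linalg.lstsq(poly_design(x, k), y, rcond=None)
    resid = y - poly_design(x, k) @ beta
    # Training error keeps decreasing with k, even once the fit overfits.
    print(k, np.mean(resid ** 2))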

Page 25:

Regression with Nonlinear Features

In general, use any nonlinear features, e.g. $e^X$, $\log X$, $1/X$, $\sin(X)$, …

$f(X) = \sum_j \beta_j\,\phi_j(X)$, where the $\phi_j(X)$ are the nonlinear features and each $\beta_j$ is the weight of a feature, again learned by least squares.
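The same least squares machinery applies once the features are stacked into a design matrix; a brief sketch with one hand-picked, illustrative feature map:

import numpy as np

def nonlinear_design(x):
    """Stack hand-picked nonlinear features of x as columns (one choice of many)."""
    return np.column_stack([np.ones_like(x), np.exp(x), np.log(x), np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=100)            # keep x > 0 so log(x) is defined
y = 1.0 + 0.5 * np.sin(x) + 0.05 * rng.standard_normal(100)

beta, *_ = np.linalg.lstsq(nonlinear_design(x), y, rcond=None)
print(beta)    # weight of each feature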

Page 26:

What You Should Know

Linear Regression
  Least Squares Estimator
  Normal Equations
  Gradient Descent
  Probabilistic interpretation (connection to MLE)

Regularized Linear Regression (connection to MAP)
  Ridge Regression, Lasso

Polynomial Regression, Regression with nonlinear features