Linear Regression
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781, Jan 23, 2014



Page 1:

Linear Regression

Aarti Singh & Barnabas Poczos

Machine Learning 10-701/15-781, Jan 23, 2014

Page 2:

So far …

• Learning distributions
  – Maximum Likelihood Estimation (MLE)
  – Maximum A Posteriori (MAP)

• Learning classifiers
  – Naïve Bayes

Page 3:

Discrete to Continuous Labels

Classification: X = Document, Y = Topic (Sports / Science / News); X = Cell Image, Y = Diagnosis (Anemic cell / Healthy cell).

Regression: Stock Market Prediction, e.g. X = Feb 01, Y = ? (a continuous value to predict).

Page 4:

Regression Tasks

Weather Prediction: X = time (e.g. 7 pm), Y = temperature.

Estimating Contamination: X = new location, Y = sensor reading.

Page 5:

Supervised Learning

Goal: learn a predictor f that performs well as measured by a loss function (performance measure).

Classification (e.g. X = Document, Y = Topic among Sports / Science / News): the loss is the probability of error, $P(f(X) \neq Y)$.

Regression (e.g. X = Feb 01, Y = ?): the loss is the mean squared error, $\mathbb{E}[(f(X) - Y)^2]$.

Page 6:

Regression Algorithms

A learning algorithm maps training data to a predictor. Examples of regression algorithms:

Linear Regression
Regularized Linear Regression – Ridge Regression, Lasso
Polynomial Regression
Kernel Regression
Regression Trees, Splines, Wavelet estimators, …

Page 7:

Replace Expectation with Empirical Mean

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_f \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$

Law of Large Numbers: as n → ∞, the empirical mean converges to the expectation, so the empirical loss is a good proxy for the expected loss.

Page 8:

Restrict Class of Predictors

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$, over a restricted class of predictors $\mathcal{F}$.

Why? Overfitting! Without the restriction, the empirical loss is minimized (driven to zero) by any function that passes through every training point $(X_i, Y_i)$.

Page 9:

Restrict Class of Predictors

Optimal predictor: $f^* = \arg\min_f \; \mathbb{E}[(f(X) - Y)^2]$

Empirical minimizer: $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} (f(X_i) - Y_i)^2$

The class of predictors $\mathcal{F}$ can be, for example:
- the class of linear functions
- the class of polynomial functions
- a class of nonlinear functions

Page 10:

Linear Regression

$\mathcal{F}$ = class of linear functions.

Univariate case: $f(X) = \beta_1 + \beta_2 X$, where $\beta_1$ is the intercept and $\beta_2$ is the slope.

Multivariate case: $f(X) = X\beta$, where $X = (1, X^{(1)}, \dots, X^{(p-1)})$ includes a constant feature 1 to absorb the intercept, and $\beta = (\beta_1, \dots, \beta_p)^\top$.

Least Squares Estimator

Page 11:

Least Squares Estimator

$\widehat{\beta} = \arg\min_\beta \; \frac{1}{n}\sum_{i=1}^{n} (Y_i - X_i\beta)^2$, with predictor $f(X_i) = X_i\beta$.

Page 12:

Least Squares Estimator (continued)

In matrix form, with $A$ the $n \times p$ matrix whose rows are the $X_i$, and $Y$ the $n \times 1$ vector of labels: $J(\beta) = \frac{1}{n}\|Y - A\beta\|_2^2$. Setting the gradient $\nabla_\beta J(\beta) = \frac{2}{n}A^\top(A\beta - Y)$ to zero yields the normal equations.

Page 13:

Normal Equations

$A^\top A\,\beta = A^\top Y$   (dimensions: $p \times p$, $p \times 1$, $p \times 1$)

If $A^\top A$ is invertible, $\widehat{\beta} = (A^\top A)^{-1} A^\top Y$.

When is $A^\top A$ invertible? Recall: full-rank matrices are invertible. What is the rank of $A^\top A$? (It equals the rank of A, which is at most min(n, p), so at least p samples are needed.)

What if $A^\top A$ is not invertible? Regularization (later).
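For concreteness, a minimal NumPy sketch of the least squares estimator (the function name and synthetic data are our own, illustrative choices); np.linalg.lstsq is used instead of forming the inverse explicitly, which is the numerically preferred route and also handles a rank-deficient A:

import numpy as np

def least_squares(A, Y):
    """Least squares estimator: solve the normal equations A^T A beta = A^T Y.

    A : (n, p) design matrix (first column all ones to absorb the intercept)
    Y : (n,) vector of labels
    """
    # Solves min ||A beta - Y||^2 without forming (A^T A)^{-1} explicitly.
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return beta

# Toy usage: fit Y ≈ beta_1 + beta_2 * x on synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
Y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)
A = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix
print(least_squares(A, Y))                  # approximately [2, 3]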

Page 14:

Gradient Descent

Even when $A^\top A$ is invertible, computing $(A^\top A)^{-1}$ might be computationally expensive if A is huge.

Treat it as an optimization problem. Observation: $J(\beta)$ is convex in β (in one dimension, $J(\beta_1)$ is a convex curve; in two, $J(\beta_1, \beta_2)$ is a convex bowl).

How to find the minimizer?

Page 15:

Gradient Descent

Even when $A^\top A$ is invertible, the closed form might be computationally expensive if A is huge.

Since J(β) is convex, move along the negative of the gradient:

Initialize: $\beta^{(0)}$
Update: $\beta^{(t+1)} = \beta^{(t)} - \alpha\,\nabla J(\beta^{(t)})$, where α is the step size; the update is 0 if $\nabla J(\beta^{(t)}) = 0$, i.e. at the minimizer.
Stop: when some criterion is met, e.g. a fixed number of iterations, or $\|\nabla J(\beta^{(t)})\| < \varepsilon$.
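As an illustration, a minimal NumPy sketch of this loop for $J(\beta) = \frac{1}{n}\|Y - A\beta\|_2^2$ (the fixed step size, tolerance, and iteration cap are illustrative choices, not from the slides):

import numpy as np

def gradient_descent(A, Y, alpha=0.1, eps=1e-6, max_iters=10000):
    """Minimize J(beta) = (1/n) ||Y - A beta||^2 by gradient descent."""
    n, p = A.shape
    beta = np.zeros(p)                                # Initialize
    for _ in range(max_iters):
        grad = (2.0 / n) * A.T @ (A @ beta - Y)       # gradient of J at beta
        if np.linalg.norm(grad) < eps:                # Stop: gradient (nearly) zero
            break
        beta = beta - alpha * grad                    # Update: step along -gradient
    return beta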

Page 16:

Effect of Step Size α

Large α => fast convergence but larger residual error; oscillations are also possible.

Small α => slow convergence but small residual error.

Page 17:

Least Squares and MLE

Intuition: signal plus (zero-mean) noise model, $Y_i = X_i\beta^* + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d., so $\mathbb{E}[Y \mid X] = X\beta^*$.

Maximizing the log likelihood shows: the Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!
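The equivalence follows from the Gaussian log likelihood; a short reconstruction of the standard argument (σ² is the noise variance):

$\log p(Y_1, \dots, Y_n \mid X, \beta) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}\right) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - X_i\beta)^2$

Maximizing over β ignores the constant and the positive factor $1/(2\sigma^2)$, leaving exactly the least squares objective, so $\widehat{\beta}_{MLE} = \widehat{\beta}_{LS}$.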

Page 18:

Regularized Least Squares and MAP

What if $A^\top A$ is not invertible?

MAP estimation: maximize log likelihood + log prior.

I) Gaussian prior centered at 0. This gives Ridge Regression:

$\widehat{\beta}_{MAP} = (A^\top A + \lambda I)^{-1} A^\top Y$
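A minimal NumPy sketch of this closed form (the function name and the value of lam are illustrative):

import numpy as np

def ridge(A, Y, lam=1.0):
    """Ridge regression: beta = (A^T A + lam I)^{-1} A^T Y.

    Adding lam * I makes the matrix invertible even when A^T A is not."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)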

Page 19:

Regularized Least Squares and MAP (continued)

I) Gaussian prior centered at 0 (Ridge Regression): the prior belief that β is Gaussian with zero mean biases the solution toward "small" β.

Page 20:

Regularized Least Squares and MAP (continued)

II) Laplace prior centered at 0. This gives the Lasso: the prior belief that β is Laplace with zero mean biases the solution toward "small" β.

Page 21:

Ridge Regression vs Lasso

Ridge Regression: $\widehat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - X_i\beta)^2 + \lambda \|\beta\|_2^2$

Lasso: $\widehat{\beta} = \arg\min_\beta \sum_{i=1}^{n} (Y_i - X_i\beta)^2 + \lambda \|\beta\|_1$

Lasso (l1 penalty) results in sparse solutions – vectors with more zero coordinates. Good for high-dimensional problems – you don't have to store all the coordinates!

Ideally one would use an l0 penalty (the number of nonzero coordinates), but the optimization then becomes non-convex; the l1 norm is its convex surrogate.

(Figure: in the (β1, β2) plane, the level sets of J(β) touch the constant-l1-norm ball at a corner, where a coordinate is exactly zero, but touch the constant-l2-norm ball at a generic, non-sparse point.)
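Unlike ridge, the lasso has no closed form because the l1 norm is not differentiable at zero. One standard solver, not covered in these slides, is proximal gradient descent (ISTA); a minimal sketch with illustrative parameter choices:

import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrinks each coordinate toward 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(A, Y, lam=0.1, n_iters=1000):
    """Lasso via proximal gradient descent (ISTA).

    Minimizes ||Y - A beta||^2 + lam * ||beta||_1."""
    # Step size 1/L, where L is the Lipschitz constant of the gradient of the
    # smooth part ||Y - A beta||^2; here L = 2 * (largest singular value of A)^2.
    L = 2.0 * np.linalg.norm(A, ord=2) ** 2
    beta = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ beta - Y)          # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

The soft-thresholding step is what sets small coordinates exactly to zero, producing the sparsity described above.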

Page 22:

Beyond Linear Regression

Polynomial regression
Regression with nonlinear features

Later …
Kernel regression
Local/weighted regression

Page 23:

Polynomial Regression

Univariate (1-dim) case: $f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_m X^m$ (degree m), where the coefficients $\beta_0, \dots, \beta_m$ are learned by least squares on the features $(1, X, X^2, \dots, X^m)$.

Multivariate (p-dim) case:

$f(X) = \beta_0 + \sum_{i=1}^{p} \beta_i X^{(i)} + \sum_{i=1}^{p}\sum_{j=1}^{p} \beta_{ij} X^{(i)} X^{(j)} + \sum_{i=1}^{p}\sum_{j=1}^{p}\sum_{k=1}^{p} \beta_{ijk} X^{(i)} X^{(j)} X^{(k)} + \dots$ (terms up to degree m)

Page 24:

Polynomial Regression (continued)

(Figure: four panels showing polynomial fits of order k = 1, 2, 3, 7 to the same data on [0, 1]; the k = 7 fit oscillates wildly, with values swinging far outside the range of the data.)

Polynomial of order k, equivalently of degree up to k-1.

What is the right order? Recall overfitting! More later …
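A sketch of how such a comparison might be reproduced with the least squares machinery above (the data-generating function and the orders tried are illustrative):

import numpy as np

def poly_design(x, k):
    """Design matrix with columns 1, x, x^2, ..., x^(k-1): a polynomial of order k."""
    return np.vander(x, N=k, increasing=True)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # toy data

for k in (1, 2, 3, 7):
    beta, *_ = np.linalg.lstsq(poly_design(x, k), y, rcond=None)
    resid = y - poly_design(x, k) @ beta
    # Training error keeps decreasing with k, even once the fit overfits.
    print(k, np.mean(resid ** 2))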

Page 25:

Regression with Nonlinear Features

In general, use any nonlinear features, e.g. $e^X$, $\log X$, $1/X$, $\sin(X)$, …

$f(X) = \sum_j \beta_j\,\phi_j(X)$, where the $\phi_j(X)$ are the nonlinear features and each $\beta_j$ is the weight of a feature, again learned by least squares.
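The same least squares machinery applies once the features are stacked into a design matrix; a brief sketch with one hand-picked, illustrative feature map:

import numpy as np

def nonlinear_design(x):
    """Stack hand-picked nonlinear features of x as columns (one choice of many)."""
    return np.column_stack([np.ones_like(x), np.exp(x), np.log(x), np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=100)            # keep x > 0 so log(x) is defined
y = 1.0 + 0.5 * np.sin(x) + 0.05 * rng.standard_normal(100)

beta, *_ = np.linalg.lstsq(nonlinear_design(x), y, rcond=None)
print(beta)    # weight of each feature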

Page 26:

What You Should Know

Linear Regression
  Least Squares Estimator
  Normal Equations
  Gradient Descent
  Probabilistic interpretation (connection to MLE)

Regularized Linear Regression (connection to MAP)
  Ridge Regression, Lasso

Polynomial Regression, Regression with nonlinear features