Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.


Regression

• Given:
  – Data $X = \{x^{(1)}, \dots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
  – Corresponding labels $y = \{y^{(1)}, \dots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

Prostate Cancer Dataset

• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert

Linear Regression

• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$$
  (assume $x_0 = 1$)
• Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich

Least Squares Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_\theta J(\theta)$
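As a concrete illustration of this cost function, here is a minimal NumPy sketch (not from the slides; the function name `cost` is an assumption, and the toy data are chosen so the second call reproduces the $J([0, 0.5]) \approx 0.58$ value worked out on the intuition slides below). It assumes a leading column of ones has already been added for $x_0 = 1$:

```python
import numpy as np

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = X.shape[0]
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i at once
    return float(residuals @ residuals) / (2 * n)

# Toy data: three points (1, 1), (2, 2), (3, 3); the first column of X is x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))   # perfect fit -> 0.0
print(cost(np.array([0.0, 0.5]), X, y))   # ~0.58, matching the worked example below
```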

Intuition Behind Cost Function

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

[Figure: for a fixed $\theta$, the left panel plots the hypothesis $h_\theta(x)$ against the training data (a function of $x$); the right panel plots the resulting cost (a function of the parameter $\theta_1$).]

• Example: $J([0, 0.5]) = \frac{1}{2 \times 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$
• Example: $J([0, 0]) \approx 2.333$
• $J(\theta)$ is convex

Based on example by Andrew Ng

Intuition Behind Cost Function

[Figure sequence (slides by Andrew Ng): the hypothesis $h_\theta(x)$ for several fixed values of $\theta$ (each a function of $x$), shown alongside $J(\theta_0, \theta_1)$ (a function of the parameters).]

Basic Search Procedure

• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure by Andrew Ng: surface plot of $J(\theta_0, \theta_1)$.]

Since the least squares objective function is convex, we don't need to worry about local minima.

Gradient Descent

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \dots d$$
  – $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: the cost curve $J(\theta)$ from the earlier 1D example.]

Gradient Descent

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \dots d$$

For linear regression:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$


Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{simultaneous update for } j = 0 \dots d$$
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$

L2 norm: $\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
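The update rule and convergence test above translate directly into a short NumPy sketch. This is an illustrative implementation only (the name `gradient_descent` and the default values for `alpha`, `eps`, and `max_iters` are assumptions, not from the slides); it computes the predictions once per iteration so all components of $\theta$ are updated simultaneously:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression.

    X is n x (d+1) with a leading column of ones (x_0 = 1); y has length n.
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        preds = X @ theta                    # h_theta(x^(i)) computed once per iteration
        grad = (X.T @ (preds - y)) / n       # (1/n) * sum_i (h - y) x_j^(i), all j at once
        theta_new = theta - alpha * grad     # simultaneous update for j = 0..d
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

# Usage on the toy data from the cost-function sketch above:
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(X, y))   # approaches [0, 1]
```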

Gradient Descent

[Figure sequence (slides by Andrew Ng): successive gradient descent iterations on a housing-price example, starting from the hypothesis $h(x) = -900 - 0.1x$. Each frame shows the current fit (for fixed $\theta$, a function of $x$) alongside $J(\theta_0, \theta_1)$ (a function of the parameters).]

Choosing α

• α too small: slow convergence
• α too large: increasing value of $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration:
• The value should decrease at each iteration
• If it doesn't, adjust α
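One hedged way to apply this advice (illustrative only; the function name, iteration count, and printing behavior below are arbitrary choices, not from the slides) is to track $J(\theta)$ inside the descent loop and flag any iteration where it increases:

```python
import numpy as np

def gradient_descent_verbose(X, y, alpha=0.05, iters=200):
    """Gradient descent that reports J(theta) so a bad alpha is easy to spot."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    prev_J = np.inf
    for t in range(iters):
        residuals = X @ theta - y
        J = (residuals @ residuals) / (2 * n)
        if J > prev_J:
            print(f"iter {t}: J increased ({prev_J:.4f} -> {J:.4f}); try a smaller alpha")
        prev_J = J
        theta -= alpha * (X.T @ residuals) / n
    return theta
```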

ExtendingLinearRegressiontoMoreComplexModels

•  TheinputsXforlinearregressioncanbe:–  OriginalquanItaIveinputs–  TransformaIonofquanItaIveinputs

•  e.g.log,exp,squareroot,square,etc.–  PolynomialtransformaIon

•  example:y=�0+�1�x+�2�x2+�3�x3–  Basisexpansions–  Dummycodingofcategoricalinputs–  InteracIonsbetweenvariables

•  example:x3=x1�x2

Thisallowsuseoflinearregressiontechniquestofitnon-lineardatasets.
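To make the idea concrete, here is a small illustrative sketch (the helper name `expand_features` and the particular transformations chosen are assumptions) that builds polynomial and interaction features so an ordinary linear fit can capture a non-linear relationship:

```python
import numpy as np

def expand_features(x1, x2):
    """Stack transformed inputs: [1, x1, x2, x1^2, x1^3, x1*x2]."""
    return np.column_stack([
        np.ones_like(x1),  # bias term x_0 = 1
        x1,
        x2,
        x1 ** 2,           # polynomial transformation
        x1 ** 3,
        x1 * x2,           # interaction between variables
    ])

x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
X = expand_features(x1, x2)
print(X.shape)  # (4, 6): the model is still linear in the new features
```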

Linear Basis Function Models

• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
  where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Polynomial basis functions:
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions:
  – These are local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)
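As a hedged illustration of a local basis, the sketch below uses the standard Gaussian form $\phi_j(x) = \exp\!\big(-(x - \mu_j)^2 / (2s^2)\big)$ from Bishop's PRML; the specific centers and width are arbitrary choices, not values from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Design matrix [phi_0(x)=1, phi_1(x), ..., phi_m(x)] for 1-D inputs x.

    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) is local: only inputs near mu_j
    produce a noticeably nonzero value.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)       # column of inputs
    mu = np.asarray(centers, dtype=float).reshape(1, -1)
    phi = np.exp(-((x - mu) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])   # phi_0(x) = 1 for the bias

X = gaussian_basis([0.0, 0.5, 1.0], centers=[0.0, 0.5, 1.0], s=0.2)
print(X.round(3))
```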

Linear Basis Function Models

• Sigmoidal basis functions:
$$\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}$$
  – These are also local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)

Example of Fitting a Polynomial Curve with a Linear Model

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$

Linear Basis Function Models

• Basic linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Generalized linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton

Linear Algebra Concepts

• A vector in $\mathbb{R}^d$ is an ordered set of $d$ real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "$[1, 6, 3, 4]$" is a column vector: $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$
  – as opposed to a row vector: $\begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Transpose: reflect the vector/matrix across the diagonal:
$$\begin{pmatrix} a \\ b \end{pmatrix}^{\!T} = \begin{pmatrix} a & b \end{pmatrix}, \qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\!T} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$$
  – Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon...)
• Vector norms:
  – The $L_p$ norm of $v = (v_1, \dots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$
• The length of a vector $v$ is $L_2(v)$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector dot product:
$$u \cdot v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 & v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Note: the dot product of $u$ with itself is $\text{length}(u)^2 = \|u\|_2^2$
• Matrix product:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$
$$AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector products:
  – Dot product:
$$u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Outer product:
$$u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$$

Based on slides by Joseph Bradley
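These operations map directly onto NumPy. The following is a small illustrative check (variable names are arbitrary) of the dot product, outer product, transpose identity, and norms described above:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(u @ v)                              # dot product: u1*v1 + u2*v2 = 11
print(u @ u, np.linalg.norm(u) ** 2)      # dot of u with itself equals length(u)^2
print(np.outer(u, v))                     # outer product u v^T, a 2x2 matrix
print(A.T)                                # transpose
print(np.allclose(A @ u, u @ A.T))        # (Ax)^T = x^T A^T
print(np.linalg.norm(v, ord=1),           # L1 norm
      np.linalg.norm(v),                  # L2 norm (length)
      np.linalg.norm(v, ord=np.inf))      # L_infinity norm
```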

Vectorization

• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
$$h(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}, \qquad x^T = \begin{bmatrix} 1 & x_1 & \dots & x_d \end{bmatrix}$$
• Can write the model in vectorized form as $h(x) = \theta^T x$

Vectorization

• Consider our model for n instances:
$$h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}, \qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$

Vectorization

• For the linear regression cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
= \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2
= \frac{1}{2n} (X\theta - y)^T (X\theta - y)$$
  where $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, $(X\theta - y)^T \in \mathbb{R}^{1 \times n}$, and $(X\theta - y) \in \mathbb{R}^{n \times 1}$
• Let:
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
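To illustrate the benefit the slide mentions, here is a hedged sketch (function names and the random test data are illustrative assumptions) comparing a per-example loop against the matrix form $\frac{1}{2n}(X\theta - y)^T(X\theta - y)$; both compute the same $J(\theta)$, but the vectorized version is far more compact and uses optimized matrix routines:

```python
import numpy as np

def cost_loop(theta, X, y):
    """J(theta) computed one training example at a time."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        total += (theta @ X[i] - y[i]) ** 2
    return total / (2 * n)

def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)."""
    n = X.shape[0]
    r = X @ theta - y
    return (r @ r) / (2 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
theta = rng.normal(size=6)
y = X @ theta + rng.normal(scale=0.1, size=1000)
print(np.isclose(cost_loop(theta, X, y), cost_vectorized(theta, X, y)))  # True
```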

Closed Form Solution

• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} (X\theta - y)^T (X\theta - y)
\;\propto\; \theta^T X^T X \theta - y^T X \theta - \theta^T X^T y + y^T y
\;\propto\; \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$$
  (each term is $1 \times 1$, so $y^T X \theta = \theta^T X^T y$ and the two cross terms combine)

Take the derivative, set it equal to 0, then solve for $\theta$:
$$\frac{\partial}{\partial \theta} \left( \theta^T X^T X \theta - 2\theta^T X^T y + y^T y \right) = 0$$
$$(X^T X)\theta - X^T y = 0$$
$$(X^T X)\theta = X^T y$$
$$\theta = (X^T X)^{-1} X^T y$$

Closed Form Solution

• Can obtain $\theta$ by simply plugging $X$ and $y$ into
$$\theta = (X^T X)^{-1} X^T y, \qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix}, \qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^T X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
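A minimal sketch of this closed-form fit (the function name `fit_normal_equation` is an assumption), using the pseudo-inverse as the slide suggests so it also tolerates a singular $X^T X$:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y, using the pseudo-inverse for robustness."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Toy data: the closed form recovers theta = [0, 1] (up to rounding) in one step.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_normal_equation(X, y))
```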

Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if the number of features d is large
  – Computing $(X^T X)^{-1}$ is roughly $O(d^3)$

Improving Learning: Feature Scaling

• Idea: Ensure that features have similar scales
• Makes gradient descent converge much faster

[Figure: contours of $J$ over $(\theta_1, \theta_2)$ before feature scaling and after feature scaling.]

Feature Standardization

• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature j:
$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \dots d \text{ (not } x_0\text{!)}$$
• $s_j$ is the standard deviation of feature j
• Could also use the range of feature j ($\max_j - \min_j$) for $s_j$
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
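A small illustrative sketch of this transformation (the helper names are assumptions): it fits $\mu_j$ and $s_j$ on the training set, reuses them at prediction time, and leaves the bias column $x_0$ untouched:

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std on the training data (skip the bias column 0)."""
    mu = X_train[:, 1:].mean(axis=0)
    s = X_train[:, 1:].std(axis=0)
    return mu, s

def apply_standardizer(X, mu, s):
    """Replace x_j with (x_j - mu_j) / s_j for j = 1..d; x_0 stays 1."""
    X_scaled = X.copy()
    X_scaled[:, 1:] = (X[:, 1:] - mu) / s
    return X_scaled

X_train = np.array([[1.0, 100.0, 0.5], [1.0, 200.0, 1.5], [1.0, 300.0, 2.5]])
mu, s = fit_standardizer(X_train)
print(apply_standardizer(X_train, mu, s))   # columns 1..d now have mean 0, unit variance
```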

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ... but fails to generalize to new examples

[Figure: three fits of Price vs. Size — underfitting (high bias), correct fit, and overfitting (high variance).]

Based on example by Andrew Ng

Regularization

• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)

Regularization

• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – $\lambda$ is the regularization parameter ($\lambda \geq 0$)
  – No regularization on $\theta_0$!

Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$$
  – This is the (squared) magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$$
• L2 regularization pulls coefficients toward 0

Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The penalty drives $\theta_1, \dots, \theta_d$ toward 0, so the hypothesis becomes essentially the constant $\theta_0$

[Figure: Price vs. Size with the fitted coefficients $\theta_1, \dots, \theta_d \approx 0$, giving a nearly flat fit.]

Based on example by Andrew Ng

Regularized Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update (from $\frac{\partial}{\partial \theta_0} J(\theta)$ and $\frac{\partial}{\partial \theta_j} J(\theta)$):
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
  – The $-\alpha \lambda \theta_j$ term is the regularization

Regularized Linear Regression

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$

• We can rewrite the gradient step as:
$$\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
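A hedged extension of the earlier gradient descent sketch to the regularized updates above (the function name and default values are assumptions); note that the bias term $\theta_0$ is deliberately excluded from the shrinkage:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, alpha=0.05, iters=5000):
    """Gradient descent for regularized (ridge) linear regression.

    theta_j <- theta_j * (1 - alpha*lam) - alpha * (1/n) * sum_i (h - y) * x_j^(i),
    with no shrinkage applied to theta_0.
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / n       # unregularized gradient, all j at once
        shrink = np.full(d1, 1.0 - alpha * lam)
        shrink[0] = 1.0                          # theta_0 is not regularized
        theta = shrink * theta - alpha * grad
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_gradient_descent(X, y))   # slope is pulled slightly toward 0 relative to [0, 1]
```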

Regularized Linear Regression

• To incorporate regularization into the closed form solution:
$$\theta = \left( X^T X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^T y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
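Finally, a minimal sketch of this regularized closed form (the function name is an assumption); the identity-like matrix has a zero in its top-left corner so that $\theta_0$ is not penalized:

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam=0.1):
    """theta = (X^T X + lam * diag(0, 1, ..., 1))^{-1} X^T y."""
    d1 = X.shape[1]
    reg = np.eye(d1)
    reg[0, 0] = 0.0                      # do not regularize the bias theta_0
    return np.linalg.solve(X.T @ X + lam * reg, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_ridge_closed_form(X, y))       # slope is shrunk slightly toward 0 relative to [0, 1]
```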