Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.


Regression

• Given:
  – Data $X = \{x^{(1)}, \dots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
  – Corresponding labels $y = \{y^{(1)}, \dots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]

Prostate Cancer Dataset

• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert

Linear Regression

• Hypothesis:
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$$
  (assume $x_0 = 1$)
• Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich

Least Squares Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
• Fit by solving $\min_\theta J(\theta)$
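As a concrete illustration of this cost function, here is a minimal NumPy sketch (not from the slides; the function name `cost` is an assumption, and the toy data are chosen so the second call reproduces the $J([0, 0.5]) \approx 0.58$ value worked out on the intuition slides below). It assumes a leading column of ones has already been added for $x_0 = 1$:

```python
import numpy as np

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    n = X.shape[0]
    residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for all i at once
    return float(residuals @ residuals) / (2 * n)

# Toy data: three points (1, 1), (2, 2), (3, 3); the first column of X is x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost(np.array([0.0, 1.0]), X, y))   # perfect fit -> 0.0
print(cost(np.array([0.0, 0.5]), X, y))   # ~0.58, matching the worked example below
```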

Intuition Behind Cost Function

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$

[Figure: for a fixed $\theta$, the left panel plots the hypothesis $h_\theta(x)$ against the training data (a function of $x$); the right panel plots the resulting cost (a function of the parameter $\theta_1$).]

• Example: $J([0, 0.5]) = \frac{1}{2 \times 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$
• Example: $J([0, 0]) \approx 2.333$
• $J(\theta)$ is convex

Based on example by Andrew Ng

Intuition Behind Cost Function

[Figure sequence (slides by Andrew Ng): the hypothesis $h_\theta(x)$ for several fixed values of $\theta$ (each a function of $x$), shown alongside $J(\theta_0, \theta_1)$ (a function of the parameters).]

Basic Search Procedure

• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure by Andrew Ng: surface plot of $J(\theta_0, \theta_1)$.]

Since the least squares objective function is convex, we don't need to worry about local minima.

Gradient Descent

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \dots d$$
  – $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: the cost curve $J(\theta)$ from the earlier 1D example.]

Gradient Descent

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \dots d$$

For linear regression:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 \\
&= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)} \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
$$


Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{simultaneous update for } j = 0 \dots d$$
• To achieve the simultaneous update:
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop
• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$

L2 norm: $\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_{|v|}^2}$
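The update rule and convergence test above translate directly into a short NumPy sketch. This is an illustrative implementation only (the name `gradient_descent` and the default values for `alpha`, `eps`, and `max_iters` are assumptions, not from the slides); it computes the predictions once per iteration so all components of $\theta$ are updated simultaneously:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression.

    X is n x (d+1) with a leading column of ones (x_0 = 1); y has length n.
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(max_iters):
        preds = X @ theta                    # h_theta(x^(i)) computed once per iteration
        grad = (X.T @ (preds - y)) / n       # (1/n) * sum_i (h - y) x_j^(i), all j at once
        theta_new = theta - alpha * grad     # simultaneous update for j = 0..d
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

# Usage on the toy data from the cost-function sketch above:
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(gradient_descent(X, y))   # approaches [0, 1]
```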

Gradient Descent

[Figure sequence (slides by Andrew Ng): successive gradient descent iterations on a housing-price example, starting from the hypothesis $h(x) = -900 - 0.1x$. Each frame shows the current fit (for fixed $\theta$, a function of $x$) alongside $J(\theta_0, \theta_1)$ (a function of the parameters).]

Choosing α

• α too small: slow convergence
• α too large: increasing value of $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration:
• The value should decrease at each iteration
• If it doesn't, adjust α
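One hedged way to apply this advice (illustrative only; the function name, iteration count, and printing behavior below are arbitrary choices, not from the slides) is to track $J(\theta)$ inside the descent loop and flag any iteration where it increases:

```python
import numpy as np

def gradient_descent_verbose(X, y, alpha=0.05, iters=200):
    """Gradient descent that reports J(theta) so a bad alpha is easy to spot."""
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    prev_J = np.inf
    for t in range(iters):
        residuals = X @ theta - y
        J = (residuals @ residuals) / (2 * n)
        if J > prev_J:
            print(f"iter {t}: J increased ({prev_J:.4f} -> {J:.4f}); try a smaller alpha")
        prev_J = J
        theta -= alpha * (X.T @ residuals) / n
    return theta
```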

ExtendingLinearRegressiontoMoreComplexModels

•  TheinputsXforlinearregressioncanbe:–  OriginalquanItaIveinputs–  TransformaIonofquanItaIveinputs

•  e.g.log,exp,squareroot,square,etc.–  PolynomialtransformaIon

•  example:y=�0+�1�x+�2�x2+�3�x3–  Basisexpansions–  Dummycodingofcategoricalinputs–  InteracIonsbetweenvariables

•  example:x3=x1�x2

Thisallowsuseoflinearregressiontechniquestofitnon-lineardatasets.
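To make the idea concrete, here is a small illustrative sketch (the helper name `expand_features` and the particular transformations chosen are assumptions) that builds polynomial and interaction features so an ordinary linear fit can capture a non-linear relationship:

```python
import numpy as np

def expand_features(x1, x2):
    """Stack transformed inputs: [1, x1, x2, x1^2, x1^3, x1*x2]."""
    return np.column_stack([
        np.ones_like(x1),  # bias term x_0 = 1
        x1,
        x2,
        x1 ** 2,           # polynomial transformation
        x1 ** 3,
        x1 * x2,           # interaction between variables
    ])

x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
X = expand_features(x1, x2)
print(X.shape)  # (4, 6): the model is still linear in the new features
```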

Linear Basis Function Models

• Generally,
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
  where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Polynomial basis functions:
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions:
  – These are local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)
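As a hedged illustration of a local basis, the sketch below uses the standard Gaussian form $\phi_j(x) = \exp\!\big(-(x - \mu_j)^2 / (2s^2)\big)$ from Bishop's PRML; the specific centers and width are arbitrary choices, not values from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Design matrix [phi_0(x)=1, phi_1(x), ..., phi_m(x)] for 1-D inputs x.

    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) is local: only inputs near mu_j
    produce a noticeably nonzero value.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)       # column of inputs
    mu = np.asarray(centers, dtype=float).reshape(1, -1)
    phi = np.exp(-((x - mu) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])   # phi_0(x) = 1 for the bias

X = gaussian_basis([0.0, 0.5, 1.0], centers=[0.0, 0.5, 1.0], s=0.2)
print(X.round(3))
```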

Linear Basis Function Models

• Sigmoidal basis functions:
$$\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}$$
  – These are also local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)

Example of Fitting a Polynomial Curve with a Linear Model

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$

Linear Basis Function Models

• Basic linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Generalized linear model:
$$h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton

Linear Algebra Concepts

• A vector in $\mathbb{R}^d$ is an ordered set of $d$ real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "$[1, 6, 3, 4]$" is a column vector: $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$
  – as opposed to a row vector: $\begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Transpose: reflect the vector/matrix across the diagonal:
$$\begin{pmatrix} a \\ b \end{pmatrix}^{\!T} = \begin{pmatrix} a & b \end{pmatrix}, \qquad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\!T} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$$
  – Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon...)
• Vector norms:
  – The $L_p$ norm of $v = (v_1, \dots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$
• The length of a vector $v$ is $L_2(v)$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector dot product:
$$u \cdot v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \cdot \begin{pmatrix} v_1 & v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Note: the dot product of $u$ with itself is $\text{length}(u)^2 = \|u\|_2^2$
• Matrix product:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$$
$$AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector products:
  – Dot product:
$$u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$$
  – Outer product:
$$u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$$

Based on slides by Joseph Bradley
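These operations map directly onto NumPy. The following is a small illustrative check (variable names are arbitrary) of the dot product, outer product, transpose identity, and norms described above:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(u @ v)                              # dot product: u1*v1 + u2*v2 = 11
print(u @ u, np.linalg.norm(u) ** 2)      # dot of u with itself equals length(u)^2
print(np.outer(u, v))                     # outer product u v^T, a 2x2 matrix
print(A.T)                                # transpose
print(np.allclose(A @ u, u @ A.T))        # (Ax)^T = x^T A^T
print(np.linalg.norm(v, ord=1),           # L1 norm
      np.linalg.norm(v),                  # L2 norm (length)
      np.linalg.norm(v, ord=np.inf))      # L_infinity norm
```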

Vectorization

• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model:
$$h(x) = \sum_{j=0}^{d} \theta_j x_j$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}, \qquad x^T = \begin{bmatrix} 1 & x_1 & \dots & x_d \end{bmatrix}$$
• Can write the model in vectorized form as $h(x) = \theta^T x$

Vectorization

• Consider our model for n instances:
$$h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}, \qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$

Vectorization

• For the linear regression cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2
= \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2
= \frac{1}{2n} (X\theta - y)^T (X\theta - y)$$
  where $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, $(X\theta - y)^T \in \mathbb{R}^{1 \times n}$, and $(X\theta - y) \in \mathbb{R}^{n \times 1}$
• Let:
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
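To illustrate the benefit the slide mentions, here is a hedged sketch (function names and the random test data are illustrative assumptions) comparing a per-example loop against the matrix form $\frac{1}{2n}(X\theta - y)^T(X\theta - y)$; both compute the same $J(\theta)$, but the vectorized version is far more compact and uses optimized matrix routines:

```python
import numpy as np

def cost_loop(theta, X, y):
    """J(theta) computed one training example at a time."""
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        total += (theta @ X[i] - y[i]) ** 2
    return total / (2 * n)

def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)."""
    n = X.shape[0]
    r = X @ theta - y
    return (r @ r) / (2 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
theta = rng.normal(size=6)
y = X @ theta + rng.normal(scale=0.1, size=1000)
print(np.isclose(cost_loop(theta, X, y), cost_vectorized(theta, X, y)))  # True
```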

Closed Form Solution

• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} (X\theta - y)^T (X\theta - y)
\;\propto\; \theta^T X^T X \theta - y^T X \theta - \theta^T X^T y + y^T y
\;\propto\; \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$$
  (each term is $1 \times 1$, so $y^T X \theta = \theta^T X^T y$ and the two cross terms combine)

Take the derivative, set it equal to 0, then solve for $\theta$:
$$\frac{\partial}{\partial \theta} \left( \theta^T X^T X \theta - 2\theta^T X^T y + y^T y \right) = 0$$
$$(X^T X)\theta - X^T y = 0$$
$$(X^T X)\theta = X^T y$$
$$\theta = (X^T X)^{-1} X^T y$$

Closed Form Solution

• Can obtain $\theta$ by simply plugging $X$ and $y$ into
$$\theta = (X^T X)^{-1} X^T y, \qquad
X = \begin{bmatrix}
1 & x_1^{(1)} & \dots & x_d^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(i)} & \dots & x_d^{(i)} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_1^{(n)} & \dots & x_d^{(n)}
\end{bmatrix}, \qquad
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^T X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
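A minimal sketch of this closed-form fit (the function name `fit_normal_equation` is an assumption), using the pseudo-inverse as the slide suggests so it also tolerates a singular $X^T X$:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y, using the pseudo-inverse for robustness."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Toy data: the closed form recovers theta = [0, 1] (up to rounding) in one step.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_normal_equation(X, y))
```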

Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if the number of features d is large
  – Computing $(X^T X)^{-1}$ is roughly $O(d^3)$

Improving Learning: Feature Scaling

• Idea: Ensure that features have similar scales
• Makes gradient descent converge much faster

[Figure: contours of $J$ over $(\theta_1, \theta_2)$ before feature scaling and after feature scaling.]

Feature Standardization

• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature j:
$$\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$$
  – Replace each value with:
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \dots d \text{ (not } x_0\text{!)}$$
• $s_j$ is the standard deviation of feature j
• Could also use the range of feature j ($\max_j - \min_j$) for $s_j$
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
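A small illustrative sketch of this transformation (the helper names are assumptions): it fits $\mu_j$ and $s_j$ on the training set, reuses them at prediction time, and leaves the bias column $x_0$ untouched:

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std on the training data (skip the bias column 0)."""
    mu = X_train[:, 1:].mean(axis=0)
    s = X_train[:, 1:].std(axis=0)
    return mu, s

def apply_standardizer(X, mu, s):
    """Replace x_j with (x_j - mu_j) / s_j for j = 1..d; x_0 stays 1."""
    X_scaled = X.copy()
    X_scaled[:, 1:] = (X[:, 1:] - mu) / s
    return X_scaled

X_train = np.array([[1.0, 100.0, 0.5], [1.0, 200.0, 1.5], [1.0, 300.0, 2.5]])
mu, s = fit_standardizer(X_train)
print(apply_standardizer(X_train, mu, s))   # columns 1..d now have mean 0, unit variance
```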

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ... but fails to generalize to new examples

[Figure: three fits of Price vs. Size — underfitting (high bias), correct fit, and overfitting (high variance).]

Based on example by Andrew Ng

Regularization

• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)

Regularization

• Linear regression objective function:
$$J(\theta) = \underbrace{\frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2}_{\text{model fit to data}} + \underbrace{\frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2}_{\text{regularization}}$$
  – $\lambda$ is the regularization parameter ($\lambda \geq 0$)
  – No regularization on $\theta_0$!

Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• Note that
$$\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$$
  – This is the (squared) magnitude of the feature coefficient vector!
• We can also think of this as:
$$\sum_{j=1}^{d} (\theta_j - 0)^2 = \|\theta_{1:d} - \vec{0}\|_2^2$$
• L2 regularization pulls coefficients toward 0

Understanding Regularization

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The penalty drives $\theta_1, \dots, \theta_d$ toward 0, so the hypothesis becomes essentially the constant $\theta_0$

[Figure: Price vs. Size with the fitted coefficients $\theta_1, \dots, \theta_d \approx 0$, giving a nearly flat fit.]

Based on example by Andrew Ng

Regularized Linear Regression

• Cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update (from $\frac{\partial}{\partial \theta_0} J(\theta)$ and $\frac{\partial}{\partial \theta_j} J(\theta)$):
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
  – The $-\alpha \lambda \theta_j$ term is the regularization

Regularized Linear Regression

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$

• We can rewrite the gradient step as:
$$\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
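A hedged extension of the earlier gradient descent sketch to the regularized updates above (the function name and default values are assumptions); note that the bias term $\theta_0$ is deliberately excluded from the shrinkage:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, alpha=0.05, iters=5000):
    """Gradient descent for regularized (ridge) linear regression.

    theta_j <- theta_j * (1 - alpha*lam) - alpha * (1/n) * sum_i (h - y) * x_j^(i),
    with no shrinkage applied to theta_0.
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / n       # unregularized gradient, all j at once
        shrink = np.full(d1, 1.0 - alpha * lam)
        shrink[0] = 1.0                          # theta_0 is not regularized
        theta = shrink * theta - alpha * grad
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_gradient_descent(X, y))   # slope is pulled slightly toward 0 relative to [0, 1]
```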

Regularized Linear Regression

• To incorporate regularization into the closed form solution:
$$\theta = \left( X^T X + \lambda \begin{bmatrix}
0 & 0 & 0 & \dots & 0 \\
0 & 1 & 0 & \dots & 0 \\
0 & 0 & 1 & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix} \right)^{-1} X^T y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above always exists
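Finally, a minimal sketch of this regularized closed form (the function name is an assumption); the identity-like matrix has a zero in its top-left corner so that $\theta_0$ is not penalized:

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam=0.1):
    """theta = (X^T X + lam * diag(0, 1, ..., 1))^{-1} X^T y."""
    d1 = X.shape[1]
    reg = np.eye(d1)
    reg[0, 0] = 0.0                      # do not regularize the bias theta_0
    return np.linalg.solve(X.T @ X + lam * reg, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(fit_ridge_closed_form(X, y))       # slope is shrunk slightly toward 0 relative to [0, 1]
```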