Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Regression
Given:
– Data $X = \left\{ x^{(1)}, \ldots, x^{(n)} \right\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \left\{ y^{(1)}, \ldots, y^{(n)} \right\}$ where $y^{(i)} \in \mathbb{R}$
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear regression and quadratic regression fits]
Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)
Based on slide by Jeff Howbert
Linear Regression
• Hypothesis: $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$   (assume $x_0 = 1$)
• Fit model by minimizing the sum of squared errors
[Figures: 1-D and 2-D least squares fits; figures are courtesy of Greg Shakhnarovich]
Least Squares Linear Regression
• Cost Function: $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$
• Fit by solving $\min_\theta J(\theta)$
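A minimal NumPy sketch of this cost function; the array shapes and variable names are illustrative choices, not from the slides:

import numpy as np

def cost(theta, X, y):
    # Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2.
    # X is an (n, d+1) array whose first column is all ones (x_0 = 1),
    # theta is a (d+1,) vector, and y is an (n,) vector of labels.
    n = len(y)
    residuals = X @ theta - y            # h_theta(x^(i)) - y^(i) for every instance
    return (residuals @ residuals) / (2 * n)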
Intuition Behind Cost Function
For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$ and
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$
[Figure: for fixed $\theta$, $h_\theta(x)$ is a function of x (left); $J(\theta_1)$ is a function of the parameter (right). The training points are (1, 1), (2, 2), and (3, 3).]
For example, with $\theta = [0, 0.5]$:
$J([0, 0.5]) = \frac{1}{2 \times 3}\left[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\right] \approx 0.58$
Similarly, $J([0, 0]) \approx 2.333$.
$J(\theta)$ is convex.
Based on example by Andrew Ng
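These values can be reproduced directly; a small sketch assuming the three training points (1, 1), (2, 2), (3, 3) read off the plot:

import numpy as np

# Toy dataset implied by the worked computation above
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta0, theta1):
    # Cost for the 1-D model h_theta(x) = theta0 + theta1 * x
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * len(y))

print(J(0.0, 0.5))   # approx 0.583
print(J(0.0, 0.0))   # approx 2.333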
Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$
[Figure by Andrew Ng: successive steps of the search descending the surface $J(\theta_0, \theta_1)$]
Since the least squares objective function is convex, we don't need to worry about local minima.
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)
  $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$
[Figure: gradient descent steps of size $\alpha$ down the curve $J(\theta)$]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)
For Linear Regression:
$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$
$\quad = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$
$\quad = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$
$\quad = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$
$\quad = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence:
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for $j = 0 \ldots d$)
• To achieve the simultaneous update: at the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$, then use this stored value in the update step loop
• Assume convergence when $\left\| \theta_{\text{new}} - \theta_{\text{old}} \right\|_2 < \epsilon$, where the L2 norm is $\|v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
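A minimal sketch of this procedure, precomputing the predictions each iteration for the simultaneous update and stopping on the L2 convergence test; alpha, eps, max_iters, and the variable names are illustrative choices:

import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    # Batch gradient descent for linear regression.
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) targets.
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        predictions = X @ theta                       # h_theta(x^(i)), computed once per iteration
        gradient = X.T @ (predictions - y) / n        # (1/n) * sum_i (h - y) * x_j, for all j at once
        theta_new = theta - alpha * gradient          # simultaneous update of every theta_j
        if np.linalg.norm(theta_new - theta) < eps:   # ||theta_new - theta_old||_2 < epsilon
            return theta_new
        theta = theta_new
    return theta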
Gradient Descent
[Figure: for fixed $\theta$, the hypothesis $h(x) = -900 - 0.1x$ plotted as a function of x (left), and the corresponding point on the contour plot of $J(\theta_0, \theta_1)$ as a function of the parameters (right)]
Slide by Andrew Ng
Choosing α
• α too small: slow convergence
• α too large: increasing value for $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge
To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
Extending Linear Regression to More Complex Models
• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs (e.g., log, exp, square root, square, etc.)
  – Polynomial transformations (example: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$)
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables (example: $x_3 = x_1 \cdot x_2$)
This allows use of linear regression techniques to fit non-linear datasets.
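As an illustration of such transformations, a brief sketch that builds a polynomial-plus-interaction design matrix; the raw inputs and chosen columns are made up for illustration:

import numpy as np

# Suppose each row of X_raw holds two quantitative inputs x1, x2.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 4.0]])
x1, x2 = X_raw[:, 0], X_raw[:, 1]

# Expanded design matrix: bias, x1, x1^2, x1^3, and an interaction term x1*x2.
# Ordinary linear regression on these columns fits a non-linear function of the raw inputs.
X = np.column_stack([np.ones(len(X_raw)), x1, x1**2, x1**3, x1 * x2])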
Linear Basis Function Models
• Generally, $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$, where $\phi_j(x)$ is a basis function
• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$
Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Polynomial basis functions: $\phi_j(x) = x^j$
  – These are global; a small change in x affects all basis functions
• Gaussian basis functions: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
  – These are local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).
Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  – These are also local; a small change in x only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).
Based on slide by Christopher Bishop (PRML)
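A sketch of constructing a design matrix from Gaussian basis functions as defined above; the centers and width are arbitrary illustrative values:

import numpy as np

def gaussian_basis(x, centers, s):
    # Returns Phi with Phi[i, j+1] = exp(-(x_i - mu_j)^2 / (2 s^2)),
    # plus a leading column of ones for phi_0(x) = 1.
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), phi])

x = np.linspace(0.0, 1.0, 50)            # 1-D inputs
centers = np.linspace(0.0, 1.0, 9)       # mu_j: evenly spaced centers
Phi = gaussian_basis(x, centers, s=0.1)  # shape (50, 10); fit theta on Phi as usual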
Linear Basis Function Models
• Basic Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$
• Generalized Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$
• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick (more on that when we cover support vector machines)
  – Therefore, there is no point in cluttering the math with basis functions
Based on slide by Geoff Hinton
Linear Algebra Concepts
• A vector in $\mathbb{R}^d$ is an ordered set of d real numbers
  – e.g., v = [1, 6, 3, 4] is in $\mathbb{R}^4$
  – "[1, 6, 3, 4]" is a column vector: $\begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}$, as opposed to a row vector: $\begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}$
• An m-by-n matrix is an object with m rows and n columns, where each entry is a real number:
  $A = \begin{pmatrix} a_{11} & \ldots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \ldots & a_{mn} \end{pmatrix}$
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Transpose: reflect the vector/matrix across the diagonal:
  $\begin{pmatrix} a \\ b \end{pmatrix}^{T} = \begin{pmatrix} a & b \end{pmatrix}$, $\quad \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{T} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$
  – Note: $(Ax)^T = x^T A^T$ (we'll define multiplication soon…)
• Vector norms:
  – The $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$
• The length of a vector v is $L_2(v)$
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector dot product: $u \cdot v = (u_1\; u_2) \cdot (v_1\; v_2) = u_1 v_1 + u_2 v_2$
  – Note: dot product of u with itself = length(u)$^2$ = $\|u\|_2^2$
• Matrix product:
  $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$
  $AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$
Based on slides by Joseph Bradley
Linear Algebra Concepts
• Vector products:
  – Dot product: $u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^T = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$
Based on slides by Joseph Bradley
Vectorization
• Benefits of vectorization:
  – More compact equations
  – Faster code (using optimized matrix libraries)
• Consider our model: $h(x) = \sum_{j=0}^{d} \theta_j x_j$
• Let $\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{pmatrix}$ and $x^{T} = \begin{pmatrix} 1 & x_1 & \ldots & x_d \end{pmatrix}$
• Can write the model in vectorized form as $h(x) = \theta^{T} x$
Vectorization
• Consider our model for n instances: $h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$
• Let $\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{pmatrix} \in \mathbb{R}^{(d+1) \times 1}$ and $X = \begin{pmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{pmatrix} \in \mathbb{R}^{n \times (d+1)}$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$
Vectorization
• For the linear regression cost function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^T (X\theta - y)$
  where $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and $y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix} \in \mathbb{R}^{n \times 1}$
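A sketch checking that the vectorized form matches the summation form on random data; the sizes and names are illustrative:

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # leading column of ones
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

# Vectorized form: J(theta) = 1/(2n) (X theta - y)^T (X theta - y)
r = X @ theta - y
J_vec = (r @ r) / (2 * n)

# Summation form: 1/(2n) * sum_i (theta^T x^(i) - y^(i))^2  -- same value, much slower
J_loop = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

print(np.isclose(J_vec, J_loop))   # True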
Closed Form Solution
• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
  $J(\theta) = \frac{1}{2n} (X\theta - y)^T (X\theta - y)$
  $\quad \propto \theta^T X^T X \theta - y^T X \theta - \theta^T X^T y + y^T y$
  $\quad \propto \theta^T X^T X \theta - 2\theta^T X^T y + y^T y$   (each cross term is a $1 \times 1$ scalar, so the two are equal)
  Take the derivative, set it equal to 0, then solve for $\theta$:
  $\frac{\partial}{\partial \theta} \left( \theta^T X^T X \theta - 2\theta^T X^T y + y^T y \right) = 0$
  $(X^T X)\theta - X^T y = 0$
  $(X^T X)\theta = X^T y$
  $\theta = (X^T X)^{-1} X^T y$
Closed Form Solution
• Can obtain $\theta$ by simply plugging X and y into $\theta = (X^T X)^{-1} X^T y$, where
  $y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{pmatrix}$
• If $X^T X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In Python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that d ≤ n
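A sketch of the closed-form fit using the pseudo-inverse, as suggested above; the function name is an illustrative choice:

import numpy as np

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse so it
    # still works when X^T X is singular.
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) targets.
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Equivalently, np.linalg.pinv(X) @ y yields the same least squares solution.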
Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well even when the number of features is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if the number of features d is large
  – Computing $(X^T X)^{-1}$ is roughly $O(d^3)$
Improving Learning: Feature Scaling
• Idea: Ensure that features have similar scales
• Makes gradient descent converge much faster
[Figure: contour plots of $J(\theta_1, \theta_2)$ before feature scaling (elongated ellipses) and after feature scaling (roughly circular)]
Feature Standardization
• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature j: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value $x_j^{(i)}$ with $\frac{x_j^{(i)} - \mu_j}{s_j}$   for j = 1...d (not $x_0$!)
    • $s_j$ is the standard deviation of feature j
    • Could also use the range of feature j ($\max_j - \min_j$) for $s_j$
• Must apply the same transformation to instances for both training and prediction
• Outliers can cause problems
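A sketch of standardization that computes the statistics on the training set and reuses them at prediction time; the function and variable names are illustrative:

import numpy as np

def fit_standardizer(X_train):
    # Per-feature mean and standard deviation, computed on the training set only.
    mu = X_train.mean(axis=0)
    s = X_train.std(axis=0)
    return mu, s

def standardize(X, mu, s):
    # Replace each x_j with (x_j - mu_j) / s_j. Apply this to the raw features
    # BEFORE adding the x_0 = 1 column, so the bias column is never rescaled.
    return (X - mu) / s

# Usage: the SAME mu and s are applied to both training and test instances.
# mu, s = fit_standardizer(X_train_raw)
# X_train_std = standardize(X_train_raw, mu, s)
# X_test_std  = standardize(X_test_raw,  mu, s)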
Quality of Fit
Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples
[Figure: three fits of Price vs. Size, labeled underfitting (high bias), correct fit, and overfitting (high variance)]
Based on example by Andrew Ng
Regularization
• A method for automatically controlling the complexity of the learned hypothesis
• Idea: penalize large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
• Linear regression objective function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
  (the first term measures the model's fit to the data; the second term is the regularization)
  – $\lambda$ is the regularization parameter ($\lambda \geq 0$)
  – No regularization on $\theta_0$!
Understanding Regularization
• Note that $\sum_{j=1}^{d} \theta_j^2 = \left\| \theta_{1:d} \right\|_2^2$
  – This is the squared magnitude of the feature coefficient vector!
• We can also think of this as $\sum_{j=1}^{d} (\theta_j - 0)^2 = \left\| \theta_{1:d} - \vec{0} \right\|_2^2$
• L2 regularization pulls the coefficients toward 0
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
Understanding Regularization
• What happens if we set $\lambda$ to be huge (e.g., $10^{10}$)?
  – The penalty dominates, so $\theta_1, \ldots, \theta_d$ are all driven to approximately 0 and the fit reduces to the constant $h_\theta(x) \approx \theta_0$
[Figure: Price vs. Size with the fitted curve flattened to a horizontal line as the coefficients go to 0]
Based on example by Andrew Ng
$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
Regularized Linear Regression
• Cost Function:
  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$
• Fit by solving $\min_\theta J(\theta)$
• Gradient update:
  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$   (using $\frac{\partial}{\partial \theta_0} J(\theta)$)
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$   (the last term comes from the regularization)
Regularized Linear Regression
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$
• We can rewrite the gradient step as:
  $\theta_j \leftarrow \theta_j (1 - \alpha\lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
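A sketch of one such regularized gradient step in the shrinkage form above; note that $\theta_0$ is updated without the shrinkage factor. The function name is an illustrative choice:

import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    # One gradient descent step for L2-regularized linear regression.
    # X: (n, d+1) with leading column of ones; theta: (d+1,); y: (n,).
    n = len(y)
    grad = X.T @ (X @ theta - y) / n            # (1/n) * sum_i (h - y) * x_j, for all j
    new_theta = theta * (1 - alpha * lam) - alpha * grad
    new_theta[0] = theta[0] - alpha * grad[0]   # no regularization (no shrinkage) on theta_0
    return new_theta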
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  $\theta = \left( X^T X + \lambda \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \right)^{-1} X^T y$
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
  $\theta = \left( X^T X + \lambda \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{pmatrix} \right)^{-1} X^T y$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for λ > 0, the inverse in the equation above exists
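A sketch of this regularized closed-form solution; the zeroed top-left entry of the identity matrix leaves $\theta_0$ unpenalized, and the function name is an illustrative choice:

import numpy as np

def fit_ridge_closed_form(X, y, lam):
    # theta = (X^T X + lambda * D)^{-1} X^T y, where D is the identity matrix
    # with its (0, 0) entry zeroed so the bias theta_0 is not regularized.
    # X: (n, d+1) design matrix with a leading column of ones; y: (n,) targets.
    d_plus_1 = X.shape[1]
    D = np.eye(d_plus_1)
    D[0, 0] = 0.0                                     # no regularization on theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)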