©2018 Kevin Jamieson
Linear Regression
Machine Learning – CSE546 Kevin Jamieson University of Washington
Oct 2, 2018
The regression problem
[Figure: scatter plot of Sale Price vs. # square feet]
Given past sales data on zillow.com, predict: y = House sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
The regression problem
[Figure: scatter plot of Sale Price vs. # square feet, with best linear fit]
Given past sales data on zillow.com, predict: y = House sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
The regression problem in matrix notation
$$ y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \qquad X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} $$
$$ \widehat{w}_{LS} = \arg\min_w \sum_{i=1}^{n} \big(y_i - x_i^T w\big)^2 = \arg\min_w \, (y - Xw)^T (y - Xw) $$
The regression problem in matrix notation
$$ \widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = \arg\min_w \, (y - Xw)^T (y - Xw) $$
Setting the gradient to zero:
$$ \nabla_w \|y - Xw\|_2^2 = -2 X^T (y - Xw) = 0 \;\Rightarrow\; X^T y = X^T X w $$
If $(X^T X)^{-1}$ exists, then $\widehat{w}_{LS} = (X^T X)^{-1} X^T y$.
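As a quick sanity check, here is a minimal NumPy sketch (illustrative data and variable names, not from the slides) that solves the normal equations and compares against NumPy's built-in least-squares solver:

```python
import numpy as np

# A minimal sketch: solve min_w ||y - Xw||^2 two ways (illustrative data).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))           # rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: w = (X^T X)^{-1} X^T y (solve, rather than invert explicitly)
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Library solver for comparison
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal_eq, w_lstsq))  # True
```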
The regression problem in matrix notation
$$ \widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y $$
What about an offset?
$$ \widehat{w}_{LS}, \widehat{b}_{LS} = \arg\min_{w,b} \sum_{i=1}^{n} \big(y_i - (x_i^T w + b)\big)^2 = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2 $$
Dealing with an offset
$$ \widehat{w}_{LS}, \widehat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2 $$
Setting the gradients to zero:
$$ \nabla_w: \; -2 X^T (y - Xw - \mathbf{1}b) = 0 \;\Rightarrow\; X^T X w = X^T (y - \mathbf{1}b) $$
$$ \frac{\partial}{\partial b}: \; -2\, \mathbf{1}^T (y - Xw - \mathbf{1}b) = 0 \;\Rightarrow\; \mathbf{1}^T y = \mathbf{1}^T X w + n b \;\Rightarrow\; b = \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\mathbf{1}^T X w $$
using $\mathbf{1}^T \mathbf{1} = n$.
Dealing with an offset
$$ \widehat{w}_{LS}, \widehat{b}_{LS} = \arg\min_{w,b} \|y - (Xw + \mathbf{1}b)\|_2^2 $$
The normal equations are
$$ X^T X \,\widehat{w}_{LS} + \widehat{b}_{LS}\, X^T \mathbf{1} = X^T y $$
$$ \mathbf{1}^T X \,\widehat{w}_{LS} + \widehat{b}_{LS}\, \mathbf{1}^T \mathbf{1} = \mathbf{1}^T y $$
If $X^T \mathbf{1} = 0$ (i.e., if each feature is mean-zero) then
$$ \widehat{w}_{LS} = (X^T X)^{-1} X^T y \qquad \widehat{b}_{LS} = \frac{1}{n}\sum_{i=1}^{n} y_i $$
In practice, center the features: let $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and replace each $x_i$ by $x_i - \hat{\mu}$, so that $X^T \mathbf{1} = 0$. Given a new $x$, predict $\widehat{w}_{LS}^T (x - \hat{\mu}) + \widehat{b}_{LS}$.
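A minimal sketch of this recipe in NumPy (illustrative data and names, not from the slides): center the features, solve for $\widehat{w}$, set $\widehat{b}$ to the mean response, and predict on centered inputs.

```python
import numpy as np

# Sketch: least squares with an offset via feature centering (illustrative data).
rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(loc=5.0, size=(n, d))            # features are not mean-zero
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.1 * rng.normal(size=n)

mu = X.mean(axis=0)                              # \hat{mu}
Xc = X - mu                                      # centered features: Xc^T 1 = 0
w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)     # (Xc^T Xc)^{-1} Xc^T y
b_hat = y.mean()                                 # (1/n) sum_i y_i

x_new = rng.normal(loc=5.0, size=d)
y_pred = w_hat @ (x_new - mu) + b_hat            # predict on the centered input
```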
The regression problem in matrix notation
$$ \widehat{w}_{LS} = \arg\min_w \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y $$
But why least squares?
Consider $y_i = x_i^T w + \epsilon_i$ where $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Then
$$ P(y \mid x, w, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y - x^T w)^2}{2\sigma^2}\right) $$
Maximizing log-likelihood
Maximize:
$$ \log P(\mathcal{D} \mid w, \sigma) = \log\!\left[\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \prod_{i=1}^{n} e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}}\right] = n \log\frac{1}{\sqrt{2\pi}\,\sigma} \;-\; \frac{1}{2\sigma^2}\sum_{i=1}^{n}\big(y_i - x_i^T w\big)^2 $$
so maximizing over $w$ is the same as minimizing $\sum_{i=1}^{n} (y_i - x_i^T w)^2$.
MLE is LS under linear model
$$ \widehat{w}_{LS} = \arg\min_w \sum_{i=1}^{n} \big(y_i - x_i^T w\big)^2 $$
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$:
$$ \widehat{w}_{MLE} = \arg\max_w P(\mathcal{D} \mid w, \sigma) $$
$$ \widehat{w}_{LS} = \widehat{w}_{MLE} = (X^T X)^{-1} X^T y $$
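A small numeric check of this equivalence, sketched with SciPy (illustrative data and names, not from the slides): minimizing the Gaussian negative log-likelihood over $w$ lands on the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: the minimizer of the Gaussian negative log-likelihood in w
# coincides with the closed-form least-squares solution (illustrative data).
rng = np.random.default_rng(2)
n, d, sigma = 300, 4, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_lik(w):
    # -log P(D | w, sigma), dropping terms that do not depend on w
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_lik, x0=np.zeros(d)).x
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_mle, w_ls, atol=1e-4))  # True
```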
Analysis of error
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$, then
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (Xw + \epsilon) = w + (X^T X)^{-1} X^T \epsilon $$
so $\mathbb{E}[\widehat{w}_{MLE}] = w$ since $\mathbb{E}[\epsilon] = 0$, and
$$ \mathbb{E}\big[(\widehat{w}_{MLE} - w)(\widehat{w}_{MLE} - w)^T\big] = (X^T X)^{-1} X^T\, \mathbb{E}[\epsilon \epsilon^T]\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} $$
using $\mathbb{E}[\epsilon \epsilon^T] = \sigma^2 I$.
Analysis of error
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (Xw + \epsilon) = w + (X^T X)^{-1} X^T \epsilon $$
$$ \mathrm{Cov}(\widehat{w}_{MLE}) = \mathbb{E}\big[(\widehat{w} - \mathbb{E}[\widehat{w}])(\widehat{w} - \mathbb{E}[\widehat{w}])^T\big] = \sigma^2 (X^T X)^{-1} $$
$$ \widehat{w}_{MLE} \sim \mathcal{N}\big(w,\; \sigma^2 (X^T X)^{-1}\big) $$
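A Monte Carlo sketch of this sampling distribution (illustrative NumPy, fixed design $X$, names not from the slides): refit $\widehat{w}$ on many independent noise draws and compare the empirical covariance to $\sigma^2 (X^T X)^{-1}$.

```python
import numpy as np

# Sketch: empirical covariance of w_hat over repeated noise draws
# approaches sigma^2 (X^T X)^{-1} (fixed design, illustrative sizes).
rng = np.random.default_rng(3)
n, d, sigma, trials = 50, 3, 1.0, 20000
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
XtX_inv = np.linalg.inv(X.T @ X)

w_hats = np.empty((trials, d))
for t in range(trials):
    eps = sigma * rng.normal(size=n)
    w_hats[t] = XtX_inv @ X.T @ (X @ w + eps)    # (X^T X)^{-1} X^T Y

emp_cov = np.cov(w_hats, rowvar=False)
print(np.max(np.abs(emp_cov - sigma**2 * XtX_inv)))  # small
```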
The regression problem
[Figure: scatter plot of Sale Price vs. # square feet, with best linear fit]
Given past sales data on zillow.com, predict: y = House sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
The regression problem
[Figure: scatter plot of Sale Price vs. date of sale, with best linear fit]
Given past sales data on zillow.com, predict: y = House sale price from x = {# sq. ft., zip code, date of sale, etc.}
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Transformed data:
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Transformed data: $h : \mathbb{R}^d \to \mathbb{R}^p$ maps the original features to a rich, possibly high-dimensional space.
For $d > 1$, generate $\{u_j\}_{j=1}^p \subset \mathbb{R}^d$ and use, for example,
$$ h_j(x) = \frac{1}{1 + \exp(u_j^T x)}, \qquad h_j(x) = (u_j^T x)^2, \qquad h_j(x) = \cos(u_j^T x) $$
In $d = 1$, for example:
$$ h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} $$
Given a new $x$, predict $\widehat{w}^T h(x)$. Stacking the $h(x_i)^T$ as the rows of a matrix, the least-squares solution has the same closed form as before, with $h(x_i)$ in place of $x_i$.
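A sketch of this idea in NumPy (illustrative; the helper names are not from the slides): lift 1-d inputs with a polynomial feature map and solve least squares in the lifted space.

```python
import numpy as np

# Sketch: polynomial feature map h(x) = [x, x^2, ..., x^p]^T for 1-d inputs,
# then ordinary least squares in the lifted p-dimensional space (illustrative data).
rng = np.random.default_rng(4)
n, p = 100, 5
x = rng.uniform(-2, 2, size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)     # a nonlinear target

def h(x, p):
    """Map each scalar x to [x, x^2, ..., x^p]."""
    return np.vander(x, N=p + 1, increasing=True)[:, 1:]

H = h(x, p)                                      # n x p matrix with rows h(x_i)^T
w_hat = np.linalg.solve(H.T @ H, H.T @ y)        # (H^T H)^{-1} H^T y

x_new = np.array([0.3])
y_pred = h(x_new, p) @ w_hat                     # predict w_hat^T h(x_new)
```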
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Hypothesis: linear, $y_i \approx x_i^T w$
Loss: least squares, $\min_w \sum_{i=1}^n (y_i - x_i^T w)^2$
Transformed data:
$$ h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} $$
Hypothesis: linear, $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$
Loss: least squares, $\min_w \sum_{i=1}^n \big(y_i - h(x_i)^T w\big)^2$
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Transformed data:
$$ h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} $$
Hypothesis: linear, $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$
Loss: least squares, $\min_w \sum_{i=1}^n \big(y_i - h(x_i)^T w\big)^2$
[Figure: Sale Price vs. date of sale, best linear fit]
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Transformed data:
$$ h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} $$
Hypothesis: linear, $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$
Loss: least squares, $\min_w \sum_{i=1}^n \big(y_i - h(x_i)^T w\big)^2$
[Figure: Sale Price vs. date of sale, small p fit]
The regression problem
Training Data: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Transformed data:
$$ h(x) = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_p(x) \end{bmatrix} $$
Hypothesis: linear, $y_i \approx h(x_i)^T w$, $w \in \mathbb{R}^p$
Loss: least squares, $\min_w \sum_{i=1}^n \big(y_i - h(x_i)^T w\big)^2$
[Figure: Sale Price vs. date of sale, large p fit]
What’s going on here?
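One way to see what is going on, sketched below (illustrative NumPy, not from the slides): fit polynomial feature maps of increasing degree $p$ to the same noisy sample. Small $p$ underfits and large $p$ chases the noise, even though training error keeps falling.

```python
import numpy as np

# Sketch: training error falls as p grows, but large p overfits the noise
# (illustrative 1-d data; a fresh test set exposes the overfit).
rng = np.random.default_rng(5)
n = 30
x_train = rng.uniform(-2, 2, size=n)
x_test = rng.uniform(-2, 2, size=1000)
f = lambda x: np.sin(2 * x)
y_train = f(x_train) + 0.3 * rng.normal(size=n)
y_test = f(x_test) + 0.3 * rng.normal(size=1000)

def poly_features(x, p):
    return np.vander(x, N=p + 1, increasing=True)   # [1, x, ..., x^p]

for p in [1, 3, 20]:
    H = poly_features(x_train, p)
    w = np.linalg.lstsq(H, y_train, rcond=None)[0]
    train_err = np.mean((y_train - H @ w) ** 2)
    test_err = np.mean((y_test - poly_features(x_test, p) @ w) ** 2)
    print(p, round(train_err, 3), round(test_err, 3))
```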
Bias-Variance Tradeoff
Machine Learning – CSE546 Kevin Jamieson University of Washington
Oct 5, 2018
Statistical Learning
$P_{XY}(X = x, Y = y)$
Goal: Predict $Y$ given $X$.
Find the function $\eta$ that minimizes $\mathbb{E}_{XY}[(Y - \eta(X))^2]$.
$$ \mathbb{E}_{XY}[(Y - \eta(X))^2] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[(Y - \eta(x))^2 \mid X = x]\big] $$
For each fixed $x$, minimize over $c = \eta(x)$:
$$ \frac{\partial}{\partial c}\, \mathbb{E}_{Y|X}[(Y - c)^2 \mid X = x] = -2\big(\mathbb{E}_{Y|X}[Y \mid X = x] - c\big) = 0 \;\Rightarrow\; c = \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
Statistical Learning
$P_{XY}(X = x, Y = y)$
Goal: Predict $Y$ given $X$.
Find the function $\eta$ that minimizes
$$ \mathbb{E}_{XY}[(Y - \eta(X))^2] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[(Y - \eta(x))^2 \mid X = x]\big] $$
$$ \eta(x) = \arg\min_c \mathbb{E}_{Y|X}[(Y - c)^2 \mid X = x] = \mathbb{E}_{Y|X}[Y \mid X = x] $$
Under LS loss, the optimal predictor is $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$.
Statistical Learning
[Figure: samples $(x, y)$ from $P_{XY}(X = x, Y = y)$; objective $\mathbb{E}_{XY}[(Y - \eta(X))^2]$]
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$ with the conditionals $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$ highlighted at two points $x_0, x_1$; objective $\mathbb{E}_{XY}[(Y - \eta(X))^2]$]
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$ with conditionals $P_{XY}(Y = y \mid X = x_0)$ and $P_{XY}(Y = y \mid X = x_1)$]
Ideally, we want to find the minimizer of $\mathbb{E}_{XY}[(Y - \eta(X))^2]$:
$$ \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$
and are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute:
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 $$
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$
and are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute:
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 $$
We care about future predictions: $\mathbb{E}_{XY}[(Y - \widehat{f}(X))^2]$
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$
and are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute:
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 $$
Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\widehat{f}$.
Statistical Learning
[Figure: samples from $P_{XY}(X = x, Y = y)$]
Ideally, we want to find: $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$
But we only have samples: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ for $i = 1, \ldots, n$
and are restricted to a function class $\mathcal{F}$ (e.g., linear), so we compute:
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 $$
Each draw $D = \{(x_i, y_i)\}_{i=1}^n$ results in a different $\widehat{f}$; $\mathbb{E}_D[\widehat{f}(x)]$ denotes the average of these fits over draws of $D$.
Bias-Variance Tradeoff
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2, \qquad \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
$$ \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \widehat{f}_D(x))^2] \mid X = x\big] = \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \eta(x) + \eta(x) - \widehat{f}_D(x))^2] \mid X = x\big] $$
Expanding the square, the cross term vanishes: $Y$ is independent of the training set $D$, and $\mathbb{E}_{Y|X}[Y - \eta(x) \mid X = x] = 0$, so
$$ \mathbb{E}_{Y|X}\big[\mathbb{E}_D[2(Y - \eta(x))(\eta(x) - \widehat{f}_D(x))] \mid X = x\big] = 2\,\mathbb{E}_{Y|X}[Y - \eta(x) \mid X = x]\;\mathbb{E}_D[\eta(x) - \widehat{f}_D(x)] = 0 $$
Bias-Variance Tradeoff
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2, \qquad \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
$$ \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \widehat{f}_D(x))^2] \mid X = x\big] = \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \eta(x) + \eta(x) - \widehat{f}_D(x))^2] \mid X = x\big] $$
$$ = \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \eta(x))^2 + 2(Y - \eta(x))(\eta(x) - \widehat{f}_D(x)) + (\eta(x) - \widehat{f}_D(x))^2] \mid X = x\big] $$
$$ = \underbrace{\mathbb{E}_{Y|X}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{\mathbb{E}_D[(\eta(x) - \widehat{f}_D(x))^2]}_{\text{learning error}} $$
irreducible error: caused by stochastic label noise.
learning error: caused by using too "simple" a model or not enough data to learn the model accurately.
Bias-Variance Tradeoff
$$ \widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2, \qquad \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
$$ \mathbb{E}_D[(\eta(x) - \widehat{f}_D(x))^2] = \mathbb{E}_D\big[(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)] + \mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2\big] $$
Expanding the square, the cross term again vanishes: $\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)]$ does not depend on $D$, and $\mathbb{E}_D\big[\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x)\big] = 0$.
Bias-Variance Tradeoff
$$ \mathbb{E}_D[(\eta(x) - \widehat{f}_D(x))^2] = \mathbb{E}_D\big[(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)] + \mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2\big] $$
$$ = \mathbb{E}_D\big[(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)])^2 + 2(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)])(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x)) + (\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2\big] $$
$$ = \underbrace{(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{\mathbb{E}_D[(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2]}_{\text{variance}} $$
with $\widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2$ and $\eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x]$ as before.
Bias-Variance Tradeoff
Putting the pieces together:
$$ \mathbb{E}_{Y|X}\big[\mathbb{E}_D[(Y - \widehat{f}_D(x))^2] \mid X = x\big] = \underbrace{\mathbb{E}_{Y|X}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error}} + \underbrace{(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)])^2}_{\text{bias squared}} + \underbrace{\mathbb{E}_D[(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2]}_{\text{variance}} $$
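A Monte Carlo sketch of this decomposition (illustrative NumPy, a deliberately misspecified linear fit on a 1-d nonlinear target; not from the slides): at a fixed test point, the average prediction error over many training draws splits into irreducible error, bias squared, and variance.

```python
import numpy as np

# Sketch: estimate E[(Y - f_D(x))^2] at a fixed x over many training draws D,
# and compare with irreducible error + bias^2 + variance (illustrative setup).
rng = np.random.default_rng(6)
n, sigma, trials = 30, 0.3, 5000
eta = lambda x: np.sin(2 * x)                   # eta(x) = E[Y|X=x]
x0 = 1.0                                        # fixed test point

preds, errs = np.empty(trials), np.empty(trials)
for t in range(trials):
    x = rng.uniform(-2, 2, size=n)
    y = eta(x) + sigma * rng.normal(size=n)
    H = np.column_stack([np.ones(n), x])        # fit a (misspecified) line
    w = np.linalg.lstsq(H, y, rcond=None)[0]
    preds[t] = w[0] + w[1] * x0                 # f_D(x0)
    y0 = eta(x0) + sigma * rng.normal()         # fresh label at x0
    errs[t] = (y0 - preds[t]) ** 2

irreducible = sigma ** 2
bias_sq = (eta(x0) - preds.mean()) ** 2
variance = preds.var()
print(errs.mean(), irreducible + bias_sq + variance)   # approximately equal
```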
Example: Linear LS
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon $$
$$ \widehat{f}_D(x) = \widehat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x $$
$$ \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] = \mathbb{E}[x^T w + \epsilon \mid X = x] = x^T w $$
$$ \mathbb{E}_D[\widehat{f}_D(x)] = \mathbb{E}\big[w^T x + \epsilon^T X (X^T X)^{-1} x\big] = w^T x = \eta(x) $$
Example: Linear LS
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon, \qquad \widehat{f}_D(x) = \widehat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x, \qquad \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
$$ \mathbb{E}_{XY}\big[\mathbb{E}_D[(Y - \widehat{f}_D(x))^2] \mid X = x\big] = \underbrace{\mathbb{E}_{XY}[(Y - \eta(x))^2 \mid X = x]}_{\text{irreducible error} \,=\, \sigma^2} + \underbrace{(\eta(x) - \mathbb{E}_D[\widehat{f}_D(x)])^2}_{\text{bias squared} \,=\, 0} + \underbrace{\mathbb{E}_D[(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2]}_{\text{variance}} $$
Example: Linear LS
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon, \qquad \widehat{f}_D(x) = \widehat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x $$
variance:
$$ \mathbb{E}_D[(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2] = \mathbb{E}_D[x^T (X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} x] = x^T (X^T X)^{-1} X^T\, \mathbb{E}[\epsilon \epsilon^T]\, X (X^T X)^{-1} x = \sigma^2 x^T (X^T X)^{-1} x $$
Example: Linear LS
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon, \qquad \widehat{f}_D(x) = \widehat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x $$
variance:
$$ \mathbb{E}_D[(\mathbb{E}_D[\widehat{f}_D(x)] - \widehat{f}_D(x))^2] = \mathbb{E}_D[x^T (X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} x] = \sigma^2 x^T (X^T X)^{-1} x = \sigma^2\, \mathrm{Trace}\big((X^T X)^{-1} x x^T\big) $$
Since $X^T X = \sum_{i=1}^{n} x_i x_i^T$, for $n$ large $X^T X \approx n\Sigma$ where $\Sigma = \mathbb{E}[XX^T]$, $X \sim P_X$. Averaging the variance over a test point $X = x$ drawn from $P_X$:
$$ \mathbb{E}_X\big[\sigma^2\, \mathrm{Trace}((X^T X)^{-1} X X^T)\big] \approx \frac{\sigma^2}{n}\, \mathbb{E}_X\big[\mathrm{Trace}(\Sigma^{-1} X X^T)\big] = \frac{\sigma^2}{n}\, \mathrm{Trace}(\Sigma^{-1}\Sigma) = \frac{d\,\sigma^2}{n} $$
Example: Linear LS
If $y_i = x_i^T w + \epsilon_i$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, i.e., $Y = Xw + \epsilon$:
$$ \widehat{w}_{MLE} = (X^T X)^{-1} X^T Y = w + (X^T X)^{-1} X^T \epsilon, \qquad \widehat{f}_D(x) = \widehat{w}^T x = w^T x + \epsilon^T X (X^T X)^{-1} x, \qquad \eta(x) = \mathbb{E}_{Y|X}[Y \mid X = x] $$
Putting it together, averaged over a test point $X \sim P_X$:
$$ \mathbb{E}_{XY}\big[\mathbb{E}_D[(Y - \widehat{f}_D(x))^2]\big] = \underbrace{\sigma^2}_{\text{irreducible error}} + \underbrace{0}_{\text{bias squared}} + \underbrace{\;\approx \frac{d\,\sigma^2}{n}\;}_{\text{variance}} $$
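A closing sketch (illustrative NumPy, not from the slides): for well-specified linear data, the average excess test error of least squares over the irreducible error $\sigma^2$ scales roughly like $d\sigma^2/n$.

```python
import numpy as np

# Sketch: the average prediction variance of least squares is roughly d*sigma^2/n
# for a well-specified linear model (illustrative sizes).
rng = np.random.default_rng(7)
d, sigma, trials = 5, 1.0, 2000

for n in [50, 200, 800]:
    excess = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        w = rng.normal(size=d)
        y = X @ w + sigma * rng.normal(size=n)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        x_test = rng.normal(size=d)              # fresh test point, X ~ P_X
        excess += (x_test @ (w_hat - w)) ** 2    # (f_D(x) - eta(x))^2
    print(n, excess / trials, d * sigma**2 / n)  # the two columns roughly match
```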