Regression
To express the relationship between two or more variables by a mathematical formula.
x : predictor (independent) variable
y : response (dependent) variable
Identify how y varies as a function of x.
y is also considered as a random variable.
Real-World Example:
Footwear impressions are commonly observed at crime scenes.
While there are numerous forensic properties that can be obtained
from these impressions, one in particular is the shoe size. The
detectives would like to be able to estimate the height of the
impression maker from the shoe size.
The relationship between shoe sizes and heights
Shoe Size vs. Height
What is the predictor?
What is the response?
Can the height be accurately estimated from the shoe size?
If a shoe size is 11, what would you advise the police?
What if the size is 7 or 12.5?
General Regression Model
$$y(x) = m(x) + \varepsilon(x)$$
The systematic part m(x) is deterministic.
The error ε(x) is a random variable.
Measurement Error
Natural Variations
Additive
Example: Sin Function
$$y(x) = A \sin(x) + \varepsilon(x)$$
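As a rough illustration of the additive model, the sketch below draws noisy observations around a sine-shaped systematic part. The Gaussian noise, the amplitude `A = 2.0`, the noise level `sigma = 0.3`, and the helper name `simulate` are all assumptions made for this example; the slides only state that the error is a random variable.

```python
import numpy as np

def simulate(m, x, sigma, rng):
    """Return y(x) = m(x) + eps(x), with eps drawn independently from N(0, sigma^2).

    Gaussian noise is an assumption of this sketch, not something the slides specify.
    """
    eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return m(x) + eps

rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
x = np.linspace(0.0, 2.0 * np.pi, 200)
A = 2.0                                   # hypothetical amplitude
y = simulate(lambda t: A * np.sin(t), x, sigma=0.3, rng=rng)

# Subtracting the systematic part leaves only the random error term.
residual = y - A * np.sin(x)
print(residual.mean(), residual.std())
```

The residual mean should be close to zero and its spread close to `sigma`, which is what "deterministic part plus additive random error" means in practice.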
Standard Assumptions
A1
A2
A3
Back to Shoes
Simple Linear Regression
$$m(x) = \beta_0 + \beta_1 x$$
Model Parameters
Derivation
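Minimize the residual sum of squares:
$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Set the derivative with respect to β0 to zero:
$$\frac{\partial R}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Set the derivative with respect to β1 to zero:
$$\frac{\partial R}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$
Substitute β0 and solve for β1:
$$\sum_{i=1}^{n} x_i (y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i) = 0 \;\Rightarrow\; \beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$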
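The least-squares derivation for the simple linear model leads to closed-form expressions for the intercept and slope. A minimal sketch of those expressions in NumPy follows; the function name `fit_line` and the toy data are made up for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope for m(x) = beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: (sum x_i y_i - n x_bar y_bar) / (sum x_i^2 - n x_bar^2)
    beta1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    # Intercept: y_bar - beta1 * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Toy data lying exactly on the line y = 1 + 2x (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
b0, b1 = fit_line(x, y)
print(b0, b1)
```

On noisy data the result should agree with `np.polyfit(x, y, 1)`, which returns the slope first and the intercept second.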
Standard Deviations
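Error variance, with residuals ε_i = y_i - β0 - β1 x_i:
$$\sigma^2 = \frac{1}{n-2} \sum_{i=1}^{n} \varepsilon_i^2$$
Standard deviation of the intercept:
$$\sigma_{\beta_0} = \sigma \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right]^{1/2}$$
Standard deviation of the slope:
$$\sigma_{\beta_1} = \sigma \left[ \frac{1}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right]^{1/2}$$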
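The standard deviations of the fitted intercept and slope can be computed directly from the data once the line is fitted. The sketch below assumes the usual n - 2 divisor for the residual variance; the function name `line_standard_errors` and the five data points are invented for the example.

```python
import numpy as np

def line_standard_errors(x, y):
    """Return (sigma, se_beta0, se_beta1) for a simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum(x**2) - n * x_bar**2            # sum x_i^2 - n x_bar^2
    beta1 = (np.sum(x * y) - n * x_bar * y_bar) / sxx
    beta0 = y_bar - beta1 * x_bar
    resid = y - (beta0 + beta1 * x)
    sigma = np.sqrt(np.sum(resid**2) / (n - 2))  # two parameters were estimated
    se_beta0 = sigma * np.sqrt(1.0 / n + x_bar**2 / sxx)
    se_beta1 = sigma * np.sqrt(1.0 / sxx)
    return sigma, se_beta0, se_beta1

# Made-up data, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma, se0, se1 = line_standard_errors(x, y)
print(sigma, se0, se1)
```

These values match the square roots of the diagonal of sigma^2 (X^T X)^{-1} when X has a column of ones and a column of x values, which is a convenient cross-check.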
Polynomial Terms
Modeling the data as a line is not always adequate.
Polynomial Regression
$$m(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p = \sum_{k=0}^{p} \beta_k x^k$$
This is still a linear model!
m(x) is a linear combination of the coefficients β_k.
Danger of Overfitting
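To see why polynomial regression is still linear in the coefficients, and why a higher degree is tempting, here is a small sketch that builds the matrix of powers of x and solves an ordinary least-squares problem. The helper names, the synthetic data, and the chosen degrees are assumptions for illustration.

```python
import numpy as np

def fit_polynomial(x, y, p):
    """Least-squares coefficients beta_0..beta_p for m(x) = sum_k beta_k x^k."""
    X = np.vander(x, N=p + 1, increasing=True)    # columns: x^0, x^1, ..., x^p
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear least squares in beta
    return beta

def predict(beta, x):
    """Evaluate the fitted polynomial at x."""
    return np.vander(x, N=beta.size, increasing=True) @ beta

rng = np.random.default_rng(1)                    # fixed seed, for reproducibility only
x = np.linspace(0.0, 5.0, 30)
y = x**2 + rng.normal(0.0, 1.0, x.size)           # synthetic data: quadratic plus noise

# Training error for increasing degree p.
train_sse = {}
for p in (1, 2, 5, 8):
    beta = fit_polynomial(x, y, p)
    train_sse[p] = float(np.sum((y - predict(beta, x)) ** 2))
print(train_sse)
```

The training error cannot increase as p grows, because each lower-degree model is a special case of the higher-degree one. That is exactly why training error alone cannot detect overfitting.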
Matrix Representation
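$$y_i = \sum_{k=0}^{p} \beta_k x_i^k + \varepsilon_i$$
$$Y = X\beta + \varepsilon$$
$$R(\beta) = (Y - X\beta)^T (Y - X\beta) = Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta$$
$$\frac{\partial R}{\partial \beta} = 0 \;\Rightarrow\; X^T X \beta = X^T Y$$
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$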
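The matrix form of the least-squares solution can be checked numerically. The sketch below solves the normal equations with `np.linalg.solve` rather than forming an explicit inverse, which is a common numerical preference and not something the slides prescribe. The design matrix and the "true" coefficients are synthetic.

```python
import numpy as np

def normal_equations(X, Y):
    """Solve X^T X beta = X^T Y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

rng = np.random.default_rng(2)                      # fixed seed, for reproducibility only
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept plus two predictors
beta_true = np.array([1.0, -2.0, 0.5])              # hypothetical coefficients
Y = X @ beta_true + rng.normal(0.0, 0.1, n)

beta_hat = normal_equations(X, Y)

# At the minimizer the gradient -2 X^T (Y - X beta) vanishes.
gradient = -2.0 * X.T @ (Y - X @ beta_hat)
print(beta_hat, np.abs(gradient).max())
```

For poorly conditioned X, `np.linalg.lstsq` is usually the safer choice, since it avoids forming X^T X.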
Model Comparison
Total Sum of Squares:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Error Sum of Squares:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
R²
$$R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$
$$R^2_{adj} = 1 - \frac{SSE / (n - (p + 1))}{SST / (n - 1)}$$
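Both goodness-of-fit measures are a few lines of arithmetic once the fitted values are available. A minimal sketch follows, where p is the polynomial degree (so p + 1 coefficients are estimated); the function names and the four-point example are invented for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE / SST."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2 for a model with p + 1 fitted coefficients."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (sse / (n - (p + 1))) / (sst / (n - 1))

# Made-up observations and fitted values: SSE = 1, SST = 5.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 1.5, 3.5, 3.5])
print(r_squared(y, y_hat))              # 1 - 1/5
print(adjusted_r_squared(y, y_hat, 1))  # 1 - (1/2)/(5/3)
```

The adjusted value penalizes extra coefficients, so it can fall when a term is added that barely reduces SSE, whereas plain R² can only rise or stay the same.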
Example
[Figure: scatter plot of Y versus X for X from 0 to 5, with a linear and a quadratic fit]
Data: Y = X² + N(0,1)
Linear fit: Y = -3.6029 + 4.8802X, R² = 0.9131
Quadratic fit: Y = 0.7341 - 0.4303X + 1.0621X², R² = 0.9880
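An experiment in the spirit of this example can be rerun in a few lines. The sketch below generates its own data from Y = X² + N(0,1), so the fitted coefficients and R² values will differ from the ones reported on the slide. The sample size, the seed, and the helper name `r2` are assumptions.

```python
import numpy as np

def r2(y, y_hat):
    """R^2 = 1 - SSE / SST."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)               # fixed seed, for reproducibility only
x = np.linspace(0.0, 5.0, 50)                 # 50 points is an arbitrary choice
y = x**2 + rng.standard_normal(x.size)        # Y = X^2 + N(0, 1)

results = {}
for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)     # coefficients, highest power first
    results[degree] = r2(y, np.polyval(coeffs, x))
print(results)
```

The linear fit already explains most of the variance because X² is increasing on this range, but the quadratic fit should come out clearly higher, as it does on the slide.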
Summary
Regression is one of the oldest data mining techniques.
Probably the first thing that you want to try on a new data set.
No need to do programming! Matlab, Excel …
Quality of Regression
R²
Residual Plot
Cross Validation
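Of the three quality checks listed, cross validation is the one that most directly guards against overfitting. A minimal k-fold sketch for choosing a polynomial degree is shown below; the helper name `kfold_mse`, the choice k = 5, the candidate degrees, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def kfold_mse(x, y, degree, k, rng):
    """Average held-out mean squared error of a polynomial fit over k folds."""
    idx = rng.permutation(x.size)              # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)        # every index not in the held-out fold
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        pred = np.polyval(coeffs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(7)                 # fixed seed, for reproducibility only
x = np.linspace(0.0, 5.0, 60)
y = x**2 + rng.standard_normal(x.size)         # synthetic data: quadratic plus noise

cv_scores = {d: kfold_mse(x, y, d, k=5, rng=rng) for d in (1, 2, 3, 5)}
print(cv_scores)
```

Unlike the training error, the held-out error stops improving once the degree exceeds what the data support, so the degree with the smallest score is a reasonable choice.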
What you should learn after class:
Confidence Interval
Multiple Regression
Nonlinear Regression