View
217
Download
0
Embed Size (px)
Citation preview
Regression04: 1
An Introduction to
REGRESSION AND CORRELATION
Regression04: 2
How do we measure the association of 2 continuous, numeric scale variables?
Example:
Observations are available on
• a sample of 30 individuals
• systolic blood pressure (SBP)
• age
We are interested in
• the relationship between SBP and age
• for these patients (descriptive)
• and for the population which they represent (inferential).
Regression04: 3
individual SBP AGE individual SBP AGE(i) (Y) (x) (i) (Y) (x)1 144 39 16 130 482 220 47 17 135 453 138 45 18 114 174 145 47 19 116 205 162 65 20 124 196 142 46 21 136 367 170 67 22 142 508 124 42 23 120 399 158 67 24 120 21
10 154 56 25 160 4411 162 64 26 158 5312 150 56 27 144 6313 140 59 28 130 2914 110 34 29 125 2515 128 42 30 175 69
Data on 30 individuals:
Regression04: 4
Note:
We have 30 pairs of observations which we can denote as:
(x1,y1) = (39,144)
(x2,y2 ) = (47, 220)
…
(x30,y30) = (69, 175)
Where
• xi refers to age for the i th subject
• yi to SBP for the i th subject
Regression04: 5
• These data pairs may be considered as points in two dimensional space, so that we may plot them on a graph.
Scatter diagram of age and systolic blood pressure
120
140
160
180
200
220
240
20 30 40 50 60 70 80
AGE in years
0 0
SBP(mm Hg)
Note:
• age and SBP seem to be related:
• Younger subjects tend to have lower SBP
• older subjects higher SBP.
Regression04: 6
How can this relationship be measured?
y
x
No relationship between x and y. Spread is even in all directions.
y
x
Linear relationship:A line indicates the main direction of the spread of points.
y
x
Non-linear relationship between x and y.A curve best describes the relationship.
Regression04: 7
Math Review: Equation for a Line
y
0
0 1y x
x
o= y-intercept = value of y when x=0
1= slope = y / x
(change in y)/(change in x)
Regression04: 8
y
Slope > 0: positive slope (as x increases, y increases)
Slope = 0
Slope < 0: negative slope(as x increases, y decreases)
y x
x
1 = “slope” = y / x = (change in y) / (change in x)
Regression04: 9
Now, given a set of data, how can we get the line that best fits or best represents the data?
When it is appropriate to predict one variable (y) from another variable (x) -- there is some directionality in the relationship – then :
• Commonly use a technique know as Least Squares Regression to estimate
intercept: 0
slope: 1
• denote the estimates 0 and 1, respectively^ ^
(referred to as beta-nought-hat and beta-one-hat)
Regression04: 10
d i
di
•
•
0 1ˆ ˆy x
We are looking for that line which minimizes the vertical distances to the data points.
For each observed value xi, we have
• an observed yi, and the
• “predicted” value yi, on the line: yi = 0+ 1xi
The vertical distances are : di = (yi – yi).^
^ ^^ ^
Regression04: 11
•
•
•
• • • •
•
yi
yi
xi
(xi,yi)
(xi,yi)^
x
y
0 1ˆ ˆy x
That is, we have:
xi = observed x for ith subject
yi = observed y for ith subject
yi = predicted y for ith subject^
Regression04: 12
The squared distances are:
di2 = (yi – yi)2
– and the sum of squared deviations from the line
– (sound familiar?) is
di2 = (yi – yi)2
We want the line such that
is minimized.
^
2 2 20 1
1 1 1
ˆ( ) ( )n n n
i i i i ii i i
d y y y x
^
ˆiy
Regression04: 13
The unbiased estimates of 0 and 1 which are
• the least squares estimates and
• the minimum variance estimates
Are:
11
2
1
( )( )ˆ
( )
n
i ii
n
ii
x x y y
x x
0 1ˆ ˆy x
Use calculus in previous equations to solve
Regression04: 14
Example:
Using the data on 30 individuals where we measured
• AGE (x)
• SBP (y)
n = 30, y = 142.53, x = 45.13
We get:
1ˆ 0.97
0ˆ 98.7
Regression04: 15
Thus, the equation for this straight line is given by
ˆ 98.7 0.97y x
0
AGE
120
140
160
180
200
220
240
20 30 40 50 60 70 80
Regression04: 16
Now,
If yi = yi for all i, then SSE=0 perfect fit to line
As the fit gets worse, SSE gets larger
• SSE serves as measure of fit to line
2 2
1 1
ˆ( )n n
i i ii i
d y y SSE Sum of Squared Error
^
Regression04: 17
One of the assumptions for regression analysis is that of homoscedasticity:
• the variance of y is the same for any xthat is, the spread of values for y at each level of x remains ~constant
y
x
1 2 3
2 2 2 2| | |y x y x y x
Spread of y|x
Spread of y ignoring x
Regression04: 18
An estimate of 2 is given by:
2 2|
1
1 1ˆ( )
2 2
n
y x i ii
s y y SSEn n
Lose 2 df: for estimating 0 and 1
The standard error, sy|x
• is a measure of the spread of y
• around it’s predicted value y
• for each value of x.
^
Regression04: 19
In our example:
And the estimated standard error is:
That is, • for any given age x, • the standard error of SBP is estimated as
17.31 mmhg.
2| 299.77y xs
| 17.31y xs
Regression04: 20
ˆ 98.7 0.97y x
0
AGE
120
140
160
180
200
220
240
20 30 40 50 60 70 80
To address the question of association of x and y
• We want to know if the slope is zero:
• Ho: 1=0
• Ha: 10
Regression04: 21
Now, if we assume
• that for any fixed value of x
• y is normally distributed
Then we can show that:
In practice, since 2 is unknown
• Use sy|x2 in place of 2
• Use the t-distribution, with n-2 df
• For hypothesis testing and CI
2|
1 1 2ˆ ~ ,
( 1)y x
x
Ns n
Regression04: 22
With these assumptions, to test
• Ho: 1=0
• Ha: 10
Test statistic:
1 12
|
ˆ ˆ( )
1
on
y x
x
ts
s n
Regression04: 23
1 12
|
ˆ ˆ( ) 0.97 04.62
17.31
15.29 291
on
y x
x
ts
s n
In our example:
The achieved significance is then:
With p<.05,
Reject Ho and conclude that age (x) provides
significant information for predicting SBP (y).
28(2)Pr[ 4.62] 0p t
Regression04: 24
In Minitab, enter the data in 2 columns, for SBP and AGE, and select: Stat Regression Regression
Response is Y variable
Predictor is X variable
Regression04: 25
Regression Analysis: spb versus age
The regression equation is
spb = 98.7 + 0.971 age
Predictor Coef SE Coef T P
Constant 98.71 10.00 9.87 0.000
age 0.9709 0.2102 4.62 0.000
S = 17.31 R-Sq = 43.2% R-Sq(adj) = 41.2%
Analysis of Variance
Source DF SS MS F P
Regression 1 6394.0 6394.0 21.33 0.000
Error 28 8393.4 299.8
Total 29 14787.5
Regression04: 26
You’ll note that a significance test is also provided for 0:
H0: 0=0 vs. Ha: 00
T P
9.87 0.000
We are rarely interested in tests of 0.
• It is often outside of the range of the data (e.g., here the youngest age is ~20)
• In this case it can be interpreted as the predicted SBP at age=0 – not meaningful.
• It is inappropriate to interpret regression relationships outside the range of the observed data.
Regression04: 27
A better model might exist • (e.g, one with a curvilinear term) • but there is a linear component.
••
•
•••
•
••
•
•
•
••
•
•
•••
••
•
•
•
••• ••
•
•
•
•
••
•
•
••
•
•
•
•••
•
••
• ••
• •
•
•
y
••
•
•
•
•
•
••
•
•
•
•
•
•
••
•
••
••
•
•
•
•
•
•
•
• ••
•
•
Here, a curve would provide a better fit
•
•• •
••
Linear model fits better than y = y ^
Regression04: 28
note: if Ho:1 = 0 is not rejected it means either
•
••
•
••
••
•
•••
••
•
••
•
•••
• •
• •• • ••
• ••
•
• •
x provides little or no help in predicting y
The true relationship between x and y is not linear.
•••
• ••• •
•
•
• •
••
•••
••
• ••
•
•
•
•
•••
•
••
•
•
••
•• •
•
•
•
• ••
•
•••
•
or
Note: even when Ho: 1=0 is rejected, some other non-linear model may be better
Regression04: 29
Part 2: The Correlation Coefficient
• Provides a measure of how 2 random variables are associated, without assuming any direction to the association (i.e., no sense that x is predictive of y, just that they are related)
• Also a measure of the strength of the straight-line relationship between X and Y
• It can also be shown that:
1
2 22 2
1 1
( )( )ˆcov( , )
( )( )( ) ( )
n
i ii
n nx y
i ii i
x x y yx y
rs s
x x y y
1ˆx
y
srs
Regression04: 30
Characteristics of correlation coefficient r:
• -1 r 1
• -1 implies perfect negative correlation
• 0 implies no correlation
• 1 implies perfect positive correlation
• r is dimensionless – it is independent of units of x or y
• r always has same sign as slope
• r is the sample estimator of the population correlation
Regression04: 31
yi y
xi x
xi x
yi y
+
II
III IV
y
x
-
x
I+
+
+-
-
-
xi x
xi x
yi y
yi y
If we
• divide the data into 4 quadrants by lines at the means of x and y
• and for each point, examine the direction of the deviation from these means:
for (xi, yi)
examine sign (+/-) of:
(xi – x) and (yi – y) for
each quadrant …
Regression04: 32
yi y
xi x
xi x
yi y
+
II
III IV
y
x
-
x
I+
+
+-
-
-
xi x
xi x
yi y
yi y
Quadrant
I + + +II - + -III - - +IV + - -
xi x yi y xi x yi y
Covariance between x and y:
1
( )( )N
i x i yi
xy
x y
N
Regression04: 33
xyxy
x y
Correlation between x and y:
Now, if points look like:
Since most points are in
QI and QIII: xy> 0
> 0, 1 >0
Since most points are in
QII and QIV: xy< 0
< 0, 1 <0
Regression04: 34
Since points are in all 4
quadrants: xy= 0
= 0, 1 = 0
Regression04: 35
(a) (b)
Correlation, r , in (a) is greater than r in (b), since points are closer to line in (a)
This is true, even when the slopes are the same.
Regression04: 36
Testing Hypotheses on Correlation:
To test
• Ho: = 0 vs. Ha: 0
• Use:
• It is identical to testing for 1 = 0
12 2
|
ˆ2
11
ny x
x
r nt
srs n
^
Regression04: 37
In Minitab: Stat Basic Stats Correlation
Correlations: sbp, age
Pearson correlation of sbp and age = 0.658P-Value = 0.000
Regression04: 38
Note that the Regression Analysis results provide a value for r2 (see slide 25):
R-Sq = 43.2%
Use this to compute r = .432 = .657
We also have the significance test for zero correlation:
Ho: =0 vs. Ha: 0
Since it is identical to the test of zero slope:
T P4.62 0.000
Regression04: 39
• Regression and Correlation Analysis are closely related
• Correlation evaluates the strength of a linear association
• Does not impose any directionality on the relationship
• Regression evaluates strength of a linear relationship (slope of line)
• Direction is imposed( e.g., age SBP rather than the reverse)
• Significance test on slope, 1, is equivalent to
significance test on correlation r
^