39
Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 1 An Introduction to REGRESSION AND CORRELATION

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 1

An Introduction to

REGRESSION AND CORRELATION

Page 2: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 2

How do we measure the association of 2 continuous, numeric scale variables?

Example:

Observations are available on

• a sample of 30 individuals

• systolic blood pressure (SBP)

• age

We are interested in

• the relationship between SBP and age

• for these patients (descriptive)

• and for the population which they represent (inferential).

Page 3: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 3

individual SBP AGE individual SBP AGE(i) (Y) (x) (i) (Y) (x)1 144 39 16 130 482 220 47 17 135 453 138 45 18 114 174 145 47 19 116 205 162 65 20 124 196 142 46 21 136 367 170 67 22 142 508 124 42 23 120 399 158 67 24 120 21

10 154 56 25 160 4411 162 64 26 158 5312 150 56 27 144 6313 140 59 28 130 2914 110 34 29 125 2515 128 42 30 175 69

Data on 30 individuals:

Page 4: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 4

Note:

We have 30 pairs of observations which we can denote as:

(x1,y1) = (39,144)

(x2,y2 ) = (47, 220)

(x30,y30) = (69, 175)

Where

• xi refers to age for the i th subject

• yi to SBP for the i th subject

Page 5: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 5

• These data pairs may be considered as points in two dimensional space, so that we may plot them on a graph.

Scatter diagram of age and systolic blood pressure

120

140

160

180

200

220

240

20 30 40 50 60 70 80

AGE in years

0 0

SBP(mm Hg)

Note:

• age and SBP seem to be related:

• Younger subjects tend to have lower SBP

• older subjects higher SBP.

Page 6: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 6

How can this relationship be measured?

y

x

No relationship between x and y. Spread is even in all directions.

y

x

Linear relationship:A line indicates the main direction of the spread of points.

y

x

Non-linear relationship between x and y.A curve best describes the relationship.

Page 7: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 7

Math Review: Equation for a Line

y

0

0 1y x

x

o= y-intercept = value of y when x=0

1= slope = y / x

(change in y)/(change in x)

Page 8: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 8

y

Slope > 0: positive slope (as x increases, y increases)

Slope = 0

Slope < 0: negative slope(as x increases, y decreases)

y x

x

1 = “slope” = y / x = (change in y) / (change in x)

Page 9: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 9

Now, given a set of data, how can we get the line that best fits or best represents the data?

When it is appropriate to predict one variable (y) from another variable (x) -- there is some directionality in the relationship – then :

• Commonly use a technique know as Least Squares Regression to estimate

intercept: 0

slope: 1

• denote the estimates 0 and 1, respectively^ ^

(referred to as beta-nought-hat and beta-one-hat)

Page 10: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 10

d i

di

0 1ˆ ˆy x

We are looking for that line which minimizes the vertical distances to the data points.

For each observed value xi, we have

• an observed yi, and the

• “predicted” value yi, on the line: yi = 0+ 1xi

The vertical distances are : di = (yi – yi).^

^ ^^ ^

Page 11: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 11

• • • •

yi

yi

xi

(xi,yi)

(xi,yi)^

x

y

0 1ˆ ˆy x

That is, we have:

xi = observed x for ith subject

yi = observed y for ith subject

yi = predicted y for ith subject^

Page 12: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 12

The squared distances are:

di2 = (yi – yi)2

– and the sum of squared deviations from the line

– (sound familiar?) is

di2 = (yi – yi)2

We want the line such that

is minimized.

^

2 2 20 1

1 1 1

ˆ( ) ( )n n n

i i i i ii i i

d y y y x

^

ˆiy

Page 13: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 13

The unbiased estimates of 0 and 1 which are

• the least squares estimates and

• the minimum variance estimates

Are:

11

2

1

( )( )ˆ

( )

n

i ii

n

ii

x x y y

x x

0 1ˆ ˆy x

Use calculus in previous equations to solve

Page 14: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 14

Example:

Using the data on 30 individuals where we measured

• AGE (x)

• SBP (y)

n = 30, y = 142.53, x = 45.13

We get:

1ˆ 0.97

0ˆ 98.7

Page 15: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 15

Thus, the equation for this straight line is given by

ˆ 98.7 0.97y x

0

AGE

120

140

160

180

200

220

240

20 30 40 50 60 70 80

Page 16: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 16

Now,

If yi = yi for all i, then SSE=0 perfect fit to line

As the fit gets worse, SSE gets larger

• SSE serves as measure of fit to line

2 2

1 1

ˆ( )n n

i i ii i

d y y SSE Sum of Squared Error

^

Page 17: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 17

One of the assumptions for regression analysis is that of homoscedasticity:

• the variance of y is the same for any xthat is, the spread of values for y at each level of x remains ~constant

y

x

1 2 3

2 2 2 2| | |y x y x y x

Spread of y|x

Spread of y ignoring x

Page 18: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 18

An estimate of 2 is given by:

2 2|

1

1 1ˆ( )

2 2

n

y x i ii

s y y SSEn n

Lose 2 df: for estimating 0 and 1

The standard error, sy|x

• is a measure of the spread of y

• around it’s predicted value y

• for each value of x.

^

Page 19: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 19

In our example:

And the estimated standard error is:

That is, • for any given age x, • the standard error of SBP is estimated as

17.31 mmhg.

2| 299.77y xs

| 17.31y xs

Page 20: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 20

ˆ 98.7 0.97y x

0

AGE

120

140

160

180

200

220

240

20 30 40 50 60 70 80

To address the question of association of x and y

• We want to know if the slope is zero:

• Ho: 1=0

• Ha: 10

Page 21: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 21

Now, if we assume

• that for any fixed value of x

• y is normally distributed

Then we can show that:

In practice, since 2 is unknown

• Use sy|x2 in place of 2

• Use the t-distribution, with n-2 df

• For hypothesis testing and CI

2|

1 1 2ˆ ~ ,

( 1)y x

x

Ns n

Page 22: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 22

With these assumptions, to test

• Ho: 1=0

• Ha: 10

Test statistic:

1 12

|

ˆ ˆ( )

1

on

y x

x

ts

s n

Page 23: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 23

1 12

|

ˆ ˆ( ) 0.97 04.62

17.31

15.29 291

on

y x

x

ts

s n

In our example:

The achieved significance is then:

With p<.05,

Reject Ho and conclude that age (x) provides

significant information for predicting SBP (y).

28(2)Pr[ 4.62] 0p t

Page 24: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 24

In Minitab, enter the data in 2 columns, for SBP and AGE, and select: Stat Regression Regression

Response is Y variable

Predictor is X variable

Page 25: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 25

Regression Analysis: spb versus age

The regression equation is

spb = 98.7 + 0.971 age

Predictor Coef SE Coef T P

Constant 98.71 10.00 9.87 0.000

age 0.9709 0.2102 4.62 0.000

S = 17.31 R-Sq = 43.2% R-Sq(adj) = 41.2%

Analysis of Variance

Source DF SS MS F P

Regression 1 6394.0 6394.0 21.33 0.000

Error 28 8393.4 299.8

Total 29 14787.5

Page 26: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 26

You’ll note that a significance test is also provided for 0:

H0: 0=0 vs. Ha: 00

T P

9.87 0.000

We are rarely interested in tests of 0.

• It is often outside of the range of the data (e.g., here the youngest age is ~20)

• In this case it can be interpreted as the predicted SBP at age=0 – not meaningful.

• It is inappropriate to interpret regression relationships outside the range of the observed data.

Page 27: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 27

A better model might exist • (e.g, one with a curvilinear term) • but there is a linear component.

••

•••

••

••

•••

••

••• ••

••

••

•••

••

• ••

• •

y

••

••

••

••

••

• ••

Here, a curve would provide a better fit

•• •

••

Linear model fits better than y = y ^

Page 28: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 28

note: if Ho:1 = 0 is not rejected it means either

••

••

••

•••

••

••

•••

• •

• •• • ••

• ••

• •

x provides little or no help in predicting y

The true relationship between x and y is not linear.

•••

• ••• •

• •

••

•••

••

• ••

•••

••

••

•• •

• ••

•••

or

Note: even when Ho: 1=0 is rejected, some other non-linear model may be better

Page 29: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 29

Part 2: The Correlation Coefficient

• Provides a measure of how 2 random variables are associated, without assuming any direction to the association (i.e., no sense that x is predictive of y, just that they are related)

• Also a measure of the strength of the straight-line relationship between X and Y

• It can also be shown that:

1

2 22 2

1 1

( )( )ˆcov( , )

( )( )( ) ( )

n

i ii

n nx y

i ii i

x x y yx y

rs s

x x y y

1ˆx

y

srs

Page 30: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 30

Characteristics of correlation coefficient r:

• -1 r 1

• -1 implies perfect negative correlation

• 0 implies no correlation

• 1 implies perfect positive correlation

• r is dimensionless – it is independent of units of x or y

• r always has same sign as slope

• r is the sample estimator of the population correlation

Page 31: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 31

yi y

xi x

xi x

yi y

+

II

III IV

y

x

-

x

I+

+

+-

-

-

xi x

xi x

yi y

yi y

If we

• divide the data into 4 quadrants by lines at the means of x and y

• and for each point, examine the direction of the deviation from these means:

for (xi, yi)

examine sign (+/-) of:

(xi – x) and (yi – y) for

each quadrant …

Page 32: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 32

yi y

xi x

xi x

yi y

+

II

III IV

y

x

-

x

I+

+

+-

-

-

xi x

xi x

yi y

yi y

Quadrant

I + + +II - + -III - - +IV + - -

xi x yi y xi x yi y

Covariance between x and y:

1

( )( )N

i x i yi

xy

x y

N

Page 33: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 33

xyxy

x y

Correlation between x and y:

Now, if points look like:

Since most points are in

QI and QIII: xy> 0

> 0, 1 >0

Since most points are in

QII and QIV: xy< 0

< 0, 1 <0

Page 34: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 34

Since points are in all 4

quadrants: xy= 0

= 0, 1 = 0

Page 35: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 35

(a) (b)

Correlation, r , in (a) is greater than r in (b), since points are closer to line in (a)

This is true, even when the slopes are the same.

Page 36: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 36

Testing Hypotheses on Correlation:

To test

• Ho: = 0 vs. Ha: 0

• Use:

• It is identical to testing for 1 = 0

12 2

|

ˆ2

11

ny x

x

r nt

srs n

^

Page 37: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 37

In Minitab: Stat Basic Stats Correlation

Correlations: sbp, age

Pearson correlation of sbp and age = 0.658P-Value = 0.000

Page 38: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 38

Note that the Regression Analysis results provide a value for r2 (see slide 25):

R-Sq = 43.2%

Use this to compute r = .432 = .657

We also have the significance test for zero correlation:

Ho: =0 vs. Ha: 0

Since it is identical to the test of zero slope:

T P4.62 0.000

Page 39: Regression04: 1 An Introduction to REGRESSION AND CORRELATION

Regression04: 39

• Regression and Correlation Analysis are closely related

• Correlation evaluates the strength of a linear association

• Does not impose any directionality on the relationship

• Regression evaluates strength of a linear relationship (slope of line)

• Direction is imposed( e.g., age SBP rather than the reverse)

• Significance test on slope, 1, is equivalent to

significance test on correlation r

^