
STATISTICS 1060

Relationships between variables: Regression

The gentleman pictured above is Sir Francis Galton. Galton invented the statistical concept of correlation and the use of the regression line. Blame him, not me.


1. SIMPLE LINEAR REGRESSION

1.1 We can use a model of the association between variables to help analyze a sample of the population

You will have seen the use of statistical models already (e.g., the normal density curve). Here we return to the idea in the context of a model for the association between two variables. Specifically, a MODEL is an explicit mathematical equation specifying the relationship between the response variable (y) and the explanatory variable (x). As before, we recognize that the model is only an idealized view of reality, but we assume that it is a good enough approximation to the true relationship between x- and y-variables to be useful to us.

In statistics, the mathematical technique for modeling the relationships among variables is called REGRESSION ANALYSIS. A line that summarizes the relationship between the variables is called a REGRESSION LINE.

1.2 A straight line is a model of association

In LINEAR REGRESSION, our goal is to construct a straight-line model that summarizes the relationship between two quantitative variables (x and y). The result of this will be a REGRESSION LINE; i.e., a straight line (i.e., a constant rate of change) that describes how much a change in the explanatory variable (x) will yield a change (response) in the y-variable. The mathematical equation for the straight line is the LINEAR MODEL. Recall that a straight line relating y to x has the form:

response = y-intercept + (slope × predictor)

Using the more familiar algebraic notation, a = y-intercept and b = slope:

y = a + bx

Of course, real data do not lie exactly along the straight line given by the model; i.e., given a value of x we do NOT mean that the y-value must be exactly equal to a + bx (see Figure 1 below). For real data we expect some degree of "scatter" (Figure 1). So, what do we use the model for? In linear regression, the model is used as follows: given a value of x, the mean value (rather than an exact value) of y is equal to a + bx.
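To make the distinction between the line and the scatter concrete, here is a minimal Python sketch in the spirit of Figure 1. The intercept, slope, and error SD below are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1060)

# Hypothetical model parameters, chosen only for illustration
a, b = 2.0, 1.5        # y-intercept and slope
sigma = 1.0            # SD of the random scatter (epsilon)

x = rng.uniform(0, 10, size=50)                  # explanatory values
y_line = a + b * x                               # points exactly on the line (Figure 1, panel A)
y_obs = y_line + rng.normal(0, sigma, size=50)   # line plus random scatter (Figure 1, panel B)

# The model describes the MEAN of y at a given x, not each individual y-value:
print(np.mean(y_obs - y_line))   # close to zero, because the errors average out
```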


In practice, the regression line is used because we think that it provides a good enough approximation to the true relationship to be useful. The mean value of y at x is a prediction of the response of y to x, and the accuracy of this prediction depends on the scatter of the data around the line. To distinguish the y-values in our data from the model-based prediction of a y-value we use the special notation ŷ (said as "y-hat"). The smaller the scatter is, the more accurate the model prediction, ŷ, will be. In Figure 2, the model in panel B will make much better predictions about the value of y than the model in panel A. So, the utility of the model will depend on the scatter of the data around the line. Remember, models with a regression line having a very large amount of scatter are not likely to be very useful.

[Figure 1 image: panel A plots the line y = a + bx; panel B plots the same relationship with added error, y = a + bx + ε. Axes run from x = 0 to 10 and y = 0 to 16.]

Figure 1: Panel A shows a plot of a straight-line equation. Panel B shows the plot of the same relationship with the addition of random error (+ε). The error was simulated by sampling from a normal distribution.

[Figure 2 image: two scatterplots, panels A and B (x from 60 to 140), each with its fitted regression line shown in red.]

Figure 2: Panels A and B above show two different datasets with a linear regression line shown in red. The scatter around the regression line is greater in A as compared with B; i.e., for any value of x, the vertical scatter above and below the regression line is larger in A than in B. The consequence is that regression-derived predictions of y (ŷ) will be less accurate from the A dataset than from the B dataset.


1.3 To find the regression line we must FIT our model to the data

Because no line can pass exactly through all the points of a real sample of data, our task is to find the regression line that passes as close as possible to all the points in the vertical direction (Figure 3); in other words, the line with the smallest distances between the observed and predicted values of y (i.e., between yi and ŷ). Why the vertical direction? The vertical direction is parallel to the y-axis. As our goal is to predict the y-value, we are only concerned with variability along the y-axis; this is the direction of the scatter that we want to be as small as possible. There is a consequence to this: switching the labels (between x and y values) will produce a different regression line (unless all points lie directly on the line). The best linear model to predict y-values will not be the same as the best linear model to predict x-values! Figure 4 (below) provides an example of this.
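The asymmetry between predicting y from x and predicting x from y can be checked directly. The short Python sketch below uses made-up data (the particular numbers are arbitrary); swapping the roles of the two variables produces a genuinely different fitted line, not just the algebraic inverse of the first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data with moderate scatter, just to illustrate the point
x = rng.uniform(0, 10, size=40)
y = 3 + 2 * x + rng.normal(0, 4, size=40)

# Least-squares fit of y on x (predicting y)...
b1_yx, b0_yx = np.polyfit(x, y, deg=1)

# ...is NOT simply the inverse of the least-squares fit of x on y (predicting x)
b1_xy, b0_xy = np.polyfit(y, x, deg=1)

print("slope of y on x:", round(b1_yx, 3))
print("1 / slope of x on y:", round(1 / b1_xy, 3))  # differs unless all points fall exactly on a line
```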

[Figure 3 image: scatterplot (x from 0 to 10, y from 0 to 16) with the fitted line; vertical segments mark the residuals.]

Figure 3: The residual is the distance (in y-value units) between the observed value of y (yi) and the predicted value of y (ŷ) for a given value of x. A negative residual means the model overestimates y and a positive residual means the model underestimates y.

[Figure 4 image: left panel plots deaths from heart disease / 10,000 against wine consumption (liters of alcohol/person/year), with fitted line y = 260.6 − 23x; right panel plots wine consumption against deaths from heart disease / 10,000, with fitted line y = 8.9 − 0.03x.]

Figure 4: Switching the variables considered to be "response" and "explanatory" impacts the linear regression model for the wine consumption dataset we first saw in the previous set of notes. Note that the slope (change in y with a one-unit change in x) changes substantially.


1.4 Least squares regression is used to find the best-fit line

Notice that in the previous figures some of the vertical distances from the line were positive and some of the vertical distances were negative (Figure 3). In statistics these distances are called RESIDUALS (Figure 3); so, let's use that term from now on. To avoid difficulties that come from having a mix of positive and negative residuals (we can't simply add them up!) we square the residuals.

RESIDUAL = e = yi − ŷ

Sum of the squared residuals: SSE = Σ(yi − ŷ)²

For any line through the data we can compute the SSE. A line with a smaller SSE is a better fit to the data because it has smaller vertical distances along the y-axis than a line with a larger value. The line that minimizes this quantity is called the LEAST SQUARES REGRESSION LINE; it minimizes the vertical distances parallel to the y-axis. Finding the "best-fit line" requires finding those values for the y-intercept and slope of the linear equation that minimize the SSE. To indicate that we are using these quantities in our model we use b0 to denote the "best-fit y-intercept" and b1 to denote the "best-fit slope". So now we write our model as:

ŷ = b0 + b1x

Some mathematics leads us to the formulae:

b1 = r (sy / sx)

and

b0 = ȳ − b1 x̄

where
- r is the familiar correlation coefficient
- sy and sx are the SDs of the sample y-values and x-values
- ȳ and x̄ are the sample means for the y-values and x-values


That is it. This means that if you have, or can measure, five simple summary statistics (r, sy, sx, ȳ, and x̄) you can find the least squares regression model for predicting y-values from x-values.
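As a check on the formulae, here is a small Python sketch on an invented dataset: it computes b1 and b0 from the five summary statistics and compares them with a direct least-squares fit.

```python
import numpy as np

# Invented data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])

# The five summary statistics
r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)
xbar, ybar = x.mean(), y.mean()

# Least-squares slope and intercept from the formulae
b1 = r * sy / sx
b0 = ybar - b1 * xbar

# Same answer as a direct least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
print(b1, slope)        # essentially identical
print(b0, intercept)
```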

1.5 Some notes about interpreting the regression line

• The best-fit regression line will always pass through the mean of the x- and y-values (x̄ and ȳ).

• The SLOPE inherits its sign from the correlation coefficient (r), and the units of the slope (units of y per unit of x) come from the ratio of the SDs. This means that if you were to change the units of one of the variables, you would also change the slope.

• The formula for the SLOPE, b1 = r(sy/sx), means that a change of 1 SD in the x-value corresponds to a change of r SDs in the y-value. As the correlation becomes weaker, the predicted value, ŷ, will change less in response to a change in x.

• The intercept is simply the value of y when the x-value = 0. This raises the issue of extrapolation; it is best NOT to extrapolate results beyond the range of the sample data. For example, suppose you had a linear model for the relationship between insulin sensitivity (y-variable) and the fatty acid content of muscles (x-variable). Your model has a very good fit to the data, but suggests that when the concentration of a certain fatty acid is low the insulin sensitivity will be negative. As it is biologically impossible to have negative insulin sensitivity, extrapolation would lead you to some interesting fiction.

Problem 1 requires computation of the slope and intercept of the least squares regression line.

1.6 Report the variability in y that is explained by the linear model (r²)

In the regression setting, you should think about the variability in the y-values as having two sources:

1. Variability in y-values explained by the linear model: This is visualized as the change in y-values along the regression line in response to a change in the x-value. Because the data never fall along a straight line, this will never be 100% of the variability in the data.

2. Variability due to scatter above and below the regression line: The residuals measure this kind of variation. The regression line tells us nothing about how large this variability is.

We can think of the total variation in the data as a sum of these two sources (see Box 1 and Figure 5). One way to summarize how well our model fits the data is to measure the fraction of the total variation that is explained by the linear model. We call this r² ("r-squared", or the COEFFICIENT OF DETERMINATION). Hence the fraction of the variability left in the residuals is (1 − r²).


So, how is this to be interpreted in a specific case? Let's take a regression of height (y) on age (x) as an example. For this regression, the value of r² is reported to be 0.849. The interpretation is that 84.9% of the variation in height can be explained by the variability in ages, leaving about 15% of the variation unexplained. This unexplained variation can be thought of as the average variability in height that occurs if we were to hold the age fixed at a certain value.

[Figure 5 diagram: the breakdown of the sums of squares.]

SST = SSR + SSE
Total = Regression + Error

r² = SSR / SST
1 − r² = SSE / SST

Figure 5: Breakdown of Sums of Squares (SS). SST is a measure of the total variation in y-values. SSR is the variability predicted by the regression model. SSE is the residual variation (error). See Box 1 for additional details.

[Figure 6 image: two scatterplots (panels A and B) with their regression lines; r² = 0.57 for the left panel and r² = 0.94 for the right panel.]

Figure 6: The solid red regression line explains the variation in y due to x (i.e., how much a change in x will "pull" the y-value along with it); r² measures this variation as a fraction of the total variability in y (SSR/SST). The scatter of points above and below the regression line is NOT explained by the model; this variability can be thought of as what remains after you "pull on y" by changing x. This scatter is what is left over when you subtract the variability explained by the model from the total variability in y (SSE = SST − SSR).


Box 1: The COEFFICIENT OF DETERMINATION, r², is the fraction of variation in the value of y that is explained by the least-squares regression of y on x. Our interest is in the response variable (y), so we start with the total (T) variability in y-values. We measure this as the sum of the squared (SS) deviations from the mean:

SST = Σ(yi − ȳ)²

Next we predict a y-value (ŷ) for each x-value; these predicted values would comprise all the variability if the data fell along a straight line. The variability of the predicted values (ŷ) is the variability of the regression. We measure this as the sum of the squared (SS) deviations of the regression (R) model:

SSR = Σ(ŷi − ȳ)²

The fraction of variation in y-values explained by the linear regression is:

r² = SSR / SST = (variance in the predictions, ŷ) / (variance in the data, y)

r² can be obtained by squaring the correlation coefficient (r). Lastly, the total variability in the data (SST) can be thought of as a sum of the variability in y due to changes in x (SSR) and the residual scatter around the line (SSE):

SST = SSR + SSE

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
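A small Python sketch (again on invented data) can confirm the bookkeeping in Box 1: SST = SSR + SSE, and SSR/SST equals the squared correlation coefficient.

```python
import numpy as np

# Invented data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 5.8, 8.2, 9.7, 12.4, 13.1, 16.0])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)       # total variation in y
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)          # residual (error) variation

print(round(sst, 3), round(ssr + sse, 3))       # SST = SSR + SSE
print(round(ssr / sst, 3),                      # coefficient of determination...
      round(np.corrcoef(x, y)[0, 1] ** 2, 3))   # ...equals the squared correlation
```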


1.7 Linear modeling has several important ASSUMPTIONS

• x is a variable that logically, or chronologically, precedes y. Recall that designation of the y-variable implies that it is to be thought of as a response to the x-variable.

• x and y have a linear relationship. After all, that is the basis of our model!

• The "scatter" around the linear model is, at least approximately, normal. For any fixed value of x, the "vertical scatter" of the y-value is assumed to be a random variable having a mean and SD from a normal distribution.

• With regard to the vertical scatter, the variance is constant over all values of x. We need this to be true to find the best-fit line.

• Measurement of the x- and y-values of an individual must be independent of each other. The measurement of one value must not influence the measurement of the other value (e.g., your midterm and final grades in this course are not independent of each other).

• Individuals must be a random sample from a larger population. Individuals must not be sampled based on their values of either the x- or y-variable!

• Each value of x can be measured exactly. Recall that we only consider the "vertical scatter" (parallel with the y-axis) for a given x-value.

Problems 2 and 3 further illustrate linear regression.

1.8 Always plot and inspect your residuals

The residuals measure the scatter in y-values that is not explained by the model. Because the accuracy of a model's prediction depends on the scatter (more scatter means less accuracy), we can use the residuals as a measure of the strength of the correlation. The STANDARD DEVIATION OF THE RESIDUALS (se) is a summary of this scatter. Examining the residuals can also help to assess other aspects of the regression. The best way to do this is to construct a RESIDUAL PLOT: a scatterplot of the residuals against the explanatory (x) variable. Example residual plots are shown below in Figure 7. Because the mean of the residuals is always zero, plotting the horizontal line at zero helps to orient the plot. If all has gone well, there should be no pattern in the residuals.


A curved pattern in the residuals indicates that the relationship between x and y might not be linear. If the amount of scatter changes with x, then not all predictions of y will have the same precision.
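A residual plot is straightforward to produce. The sketch below uses matplotlib on invented data; the model and numbers are illustrative only, and a healthy plot should look like panel A of Figure 7.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Invented data, for illustration only
x = rng.uniform(0, 20, size=60)
y = 5 + 3 * x + rng.normal(0, 10, size=60)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Residual plot: residuals against the explanatory variable
plt.scatter(x, residuals)
plt.axhline(0, color="red")   # the mean of the residuals is always zero
plt.xlabel("X value")
plt.ylabel("Residual")
plt.show()                    # a healthy plot shows no pattern (compare Figure 7A)
```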

[Figure 7 image: four residual plots (A–D), with residuals from −100 to 200 plotted against x from 0 to 20.]

Figure 7: Examples of residual plots for four different datasets. (A) Horizontal, shapeless, and equally distributed about zero, indicating that a linear regression model is likely to be appropriate for these data. (B) Similar to plot A, but with an outlier in the upper right corner of the plot. (C) Variability decreases with increasing values of x, indicating that the equal-variance assumption does not hold for these data. (D) Curvature suggests a non-linear relationship between x and y for this dataset.

2. INFERENCE FOR REGRESSION

Now that we have the framework for least squares regression, we can undertake one last task: testing hypotheses and assessing the reliability of our predictions. There is some very good news here. At this point in the course you have covered all the principles needed to carry out inference for regression. All you need to learn is how to apply what you already know about (i) t-tests and (ii) confidence intervals. Since you have already covered t-tests and confidence intervals, these notes will jump straight into how these tools can be applied to regression, and then illustrate their application with an example that will be worked in class. If you feel you need more, Chapter 27 provides a much more detailed presentation of this material.


2.1 A linear model for the true parameters

The model: From the previous notes you are already aware of the basic linear model that we use in regression.

ŷ = b0 + b1x

Now we simply express this model in terms of true, but unknowable, parameter values for the mean response, slope, and intercept. As always, Greek letters are used to express this.

μy = β0 + β1x

This model only predicts the mean response given a value of x. As we have also seen before, this is not enough to explain the individual y-values because the individual observations will vary around the mean. To include this variation in our model we bring back the explicit error term (ε) first used in Figure 1.

y = β0 + β1x + ε

With this equation we can now account for each and every y-value with respect to the parameters β0 and β1. The familiar statistics b0, b1, and e serve as estimates of the parameters β0, β1, and ε.

2.2 The confidence interval for the slope is constructed by using a t-model

We are interested in the slope because it is a measure of the relationship between two variables, with a slope of zero (β1 = 0) indicating no linear association. So, this gives us a convenient way to express the null hypothesis (H0) for an association. As always, we must work with the sample statistic for the slope (b1), and it will vary from sample to sample. Since the sampling variability in b1 is centered on β1 we can use a t-model. The t-test takes the form of a difference between the statistic and its hypothesized value (b1 − β1) divided by the standard error. Hence, to test the null hypothesis β1 = 0, you use the following:


t = (b1 − 0) / SE(b1)

H0: β1 = 0    There is no relationship between x and y
HA: β1 ≠ 0    There is a relationship between x and y

The confidence interval for the slope also takes the basic form of an estimate ± the margin of error:

estimate ± t* × SE(estimate)

To build a confidence interval for β1 you use the product of the SE and a critical value (as usual):

b1 ± t*n−2 × SE(b1)

Of course, in both cases, you will need the appropriate SE:

SE(b1) = se / (sx √(n − 1))

That’s all there is. Problem 4 is used to illustrate inference for regression.
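Here is a minimal Python sketch of the slope test and confidence interval on an invented dataset. It relies on scipy for the t-distribution, and the degrees of freedom are n − 2 because two parameters (b0 and b1) were estimated from the data.

```python
import numpy as np
from scipy import stats

# Invented data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([2.5, 3.1, 5.9, 6.8, 9.4, 10.2, 13.0, 13.9, 16.1, 17.5])
n = len(x)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# SD of the residuals (n - 2 degrees of freedom: two parameters were estimated)
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of the slope: SE(b1) = se / (sx * sqrt(n - 1))
se_b1 = se / (x.std(ddof=1) * np.sqrt(n - 1))

# t-test of H0: beta1 = 0
t_stat = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"b1 = {b1:.3f}, t = {t_stat:.2f}, p = {p_value:.4g}, 95% CI = {ci}")
```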

3. SOME ADVICE ABOUT REGRESSION

§ Lurking and confounding variables can create an association where there is no causal link between the x and y variables. A high r² does not imply causality.

§ Outliers can have a very strong influence on the slope, intercept, and r². Such highly influential points do not always have large residuals.

§ Linearity is an assumption, and the real world is often non-linear. Inspect your scatterplots to make certain that a linear model is not obviously inappropriate for your data. r² can be misleading when data are non-linear.


§ Avoid extrapolating your predictions beyond the range of your observations. For example, predicting values of biological impossibility, like negative insulin sensitivity, just wastes everybody’s time.

§ Measurement independence: You must NEVER plot the change in a variable against the initial value of the variable. Attributing a significant association to such an analysis is termed the REGRESSION FALLACY. Let's use an example of how this can be very embarrassing if you get it wrong. First we simulate completely independent data from a normal distribution with a mean of 100 and an SD of 10. Let's pretend we don't know they are random, but that they come from measurements taken before and after a medical treatment. As expected, analysis of these data (Figure 8A below) reveals that there is absolutely no association between the two sets of measurements, and the best-fit regression line is essentially horizontal. Now let's pretend we wanted to know the change in the variable due to treatment, so we plotted the change in the variable against the values obtained before the treatment (Figure 8B below). In short, this is just a different (and inappropriate) analysis of exactly the same data. The results are shown below.
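A sketch of the simulation just described (the seed and sample size are arbitrary): two independent "before" and "after" samples show essentially no association, yet plotting (after − before) against "before" produces a strong, entirely artifactual negative slope.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two completely independent "measurements" on 100 individuals
before = rng.normal(100, 10, size=100)
after = rng.normal(100, 10, size=100)

# Appropriate analysis: after vs. before -- no real association
slope_ok = np.polyfit(before, after, deg=1)[0]
r2_ok = np.corrcoef(before, after)[0, 1] ** 2

# Inappropriate analysis: (after - before) vs. before -- the regression fallacy
change = after - before
slope_bad = np.polyfit(before, change, deg=1)[0]
r2_bad = np.corrcoef(before, change)[0, 1] ** 2

print(f"after ~ before:          slope ≈ {slope_ok:.2f}, r² ≈ {r2_ok:.3f}")   # both near zero
print(f"(after−before) ~ before: slope ≈ {slope_bad:.2f}, r² ≈ {r2_bad:.3f}")  # near −1 and 0.5
```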

Figure 8A: Simulated data representing no effect of a drug treatment on the measured variable. The fitted line for "after treatment" versus "before treatment" is y = −0.0483x + 104.75, with R² = 0.0021.

Figure 8B: An INAPPROPRIATE plot of the change in a measurement versus the initial value of the measurement. The fitted line for "change due to treatment" versus "initial value before treatment" is y = −1.0434x + 105.8, with R² = 0.5481.