
CHAPTER 5: LINEAR REGRESSION & CORRELATION


MOTIVATION

Regression and correlation are the primary tools for analysing relationships between interval-scaled variables. The simplest expressions of these methods are addressed in this chapter. Generalisations of these simple methods are used in every sphere of health and medical research, from predicting survival from cancer to constructing models for economic planning to measuring experimental effects in physiology and the basic health sciences. This chapter is perhaps more technical than those preceding it; regression and correlation are such elegant and powerful methods for assessing associations and making predictions that a little more mathematics does seem to be warranted. None of the formulae need to be memorised; at a first reading it is enough to get the gist of what is going on.


§ 5.1 INTRODUCTION

This chapter is concerned with the quantitative assessment of the relationships between two variables measured on an interval or ratio scale. For example, is there a relationship between obesity and blood pressure? How does the heart rate respond to the level of hypoxia (lack of oxygen) in a severe asthma attack? Does the level of expenditure on health services relate more to the rate of ageing in the population or to the number of doctors in practice? Regression models can help to answer questions such as these. We will discuss regression first, then correlation, and finally the links between them.

§ 5.1.1 Some Basic Terms, Ideas and Historical Notes

If you were to plot the relationship between subjects' heights in centimetres and heights in metres, the result would be a perfectly straight line through each point. In Fig 5.1, for a given height in centimetres, there can be just one height in metres, there is never any uncertainty in relating the two variables, and the situation is pretty uninteresting.


Fig 5.1 Relationship between height in metres and in centimetres

But consider the following scatterplot, Fig 5.2, which shows the number of visits per child over a 12 month period to a general practitioner versus the total income of the parents. Data were taken from a sample of 41 low-income families on the GP's patient register.

[Figure: number of visits per child (0–8) plotted against family income in $1000 (16–28); "jittering" has been used to separate overlapping points.]

Fig 5.2 Scatterplot

This scatterplot is one graphical way of presenting a bivariate distribution – each point or coordinate pair in the X-Y plane represents one family's joint measurement on each of two variables. In Fig 5.2, a statistical technique called "jittering" has been used to artificially separate points that would otherwise exactly overlap one another. Note also that the scale on the abscissa does not begin at 0: a break marker (//) alerts us to this so that we are not misled by the truncated scale. You can see that there are several possible visit scores for a given family income. We might in fact say that each family income level has a corresponding distribution of visits per child. Note that in a general sense, at least within this range of incomes, as the income rises there is a rise in the visits. (There are several plausible explanations – perhaps the poorest families can't afford to consult their GPs as often.) We might say that the two variables tend to be associated, but there is not the same certainty of association which we saw in the previous example with heights. An income of $20,000 doesn't allow us to specify with certainty that there will be 3 visits per child – it might be 2 or 5. This uncertainty in associating a particular value of the visit score with a value of the income might be said to reflect the error of estimating the visit score from the family income.

The work of the British scientist, Sir Francis Galton, allows us to quantify the degree of association exhibited by two variables. Galton used the term reversion and, later, regression, to describe the tendency of the children of short parents, or of tall parents, to revert, or regress, towards more average heights.
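Jittering itself is easy to reproduce. The sketch below is only an illustration, assuming NumPy and matplotlib are available; the income and visits arrays are made-up values standing in for the real data of Fig 5.2.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data standing in for Fig 5.2: several families share identical
# (income, visits) pairs, so unjittered points would overlap exactly.
income = np.array([18, 18, 20, 20, 20, 22, 24, 26])   # family income in $1000 (illustrative)
visits = np.array([2, 2, 3, 3, 3, 4, 5, 6])           # visits per child (illustrative)

rng = np.random.default_rng(0)
jitter = rng.uniform(-0.2, 0.2, size=len(income))      # small horizontal offsets only

plt.scatter(income + jitter, visits)
plt.xlabel("income in $1000")
plt.ylabel("number of visits per child")
plt.show()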


We now use the term regression to refer to a method of analysis which allows the prediction or estimation of the value of a dependent, or outcome, variable from the value of an independent, or predictor, variable. Conventionally, the dependent variable is termed "Y" and the independent variable "X", and their values are plotted on the usual X-Y graph. In our example above, the family income is the independent (predictor) variable and the visit score is the dependent (outcome) variable. We do not necessarily imply that the dependent variable is caused by the independent variable.

§ 5.1.2 How Regression is Used

Used in its strictest or "classical" sense, regression is applied to experimental situations in which the experimenter preselects certain values of the independent variable – these values are said to be fixed (you can think of these values as being "constants" for the purposes of the experiment). The experiment is then carried out (perhaps a number of times) at each chosen level of the independent variable and the response is measured.

Example 5.1

A pharmacologist investigates the heart rate (the response variable) of mice injected with certain predetermined dosages of a beta-blocker. (Beta-blockers are a class of drug commonly used to treat high blood pressure and certain types of coronary artery disease. They have what physiologists call a "negative chronotropic effect", which, in somewhat plainer English, means they tend to slow the heart rate.) Now a drug dosage is obviously a variable which can take values from 0 milligrams upwards, but for the purposes of the experiment, each of, say, 10 selected doses which are of interest to the researcher becomes a "constant". The doses are fixed at these 10 values, whereas the dependent variable, heart rate after injection, remains a random variable, in that for a given dose we expect a range of resultant heart rates due to random variation in the mouse population's response to the medication. The experiment will yield a regression relationship between heart rate and beta-blocker dosage, which, in its simplest form, may be represented as a straight line of best fit. The researcher may use this relationship at a later stage to predict heart rates for dosages that were not used in the original experiment. If the new dosage lies within the range of the original 10 experimental dosages, this is called prediction by interpolation. If the new dosage is outside the original range, the researcher is indulging in prediction by extrapolation (this can be fraught with danger).

Figure 5.3 shows that the predicted heart rate for a dose of 4.5 drug units is 226 beats/min. This heart rate response is said to be interpolated, as the dosage lies between two doses (4 and 5 units) for which response data were actually observed. However, if we wanted to predict heart rate response for, say, 11 dose units, we must do so by extrapolation, and, in the absence of observed response data to doses beyond 10 units, we have to assume that the linear relationship continues for such doses. This assumption may or may not be warranted, and, if it is not, the predicted response may be seriously in error. In Figure 5.3, the extrapolated predicted response is about 98 beats per minute.
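As a small numerical sketch of the same idea (in Python; the fitted line used here, expected heart rate = 314.8 – 19.5 × dose, is the one reported for the Fig 5.3 data later in §5.2.7, so the printed values agree only roughly with the figures read off the graph above):

# Interpolation versus extrapolation with a fitted line (illustrative only).
# The coefficients are those quoted for the Fig 5.3 data in section 5.2.7.
def predicted_heart_rate(dose):
    return 314.8 - 19.5 * dose

observed_dose_range = (1, 10)    # doses actually used in the experiment

for new_dose in (4.5, 11):
    lo, hi = observed_dose_range
    kind = "interpolation" if lo <= new_dose <= hi else "extrapolation (treat with caution)"
    print(new_dose, round(predicted_heart_rate(new_dose)), kind)
# roughly 227 bpm at 4.5 units (interpolated) and 100 bpm at 11 units (extrapolated)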

[Figure: heart rate (75–325 beats/min) plotted against dose (1–11 units), showing the line of best fit, an interpolated prediction at 4.5 units, and the extrapolated line and prediction at 11 units.]

Fig 5.3 Interpolated and extrapolated predictions

§ 5.2 SIMPLE LINEAR REGRESSION

§ 5.2.1 Introduction

Simple linear regression is the process of fitting a straight line to the data represented on the scatter diagram. The equation of the fitted line allows us to estimate the likely response for a given value of the predictor or independent variable. Figure 5.3 shows such a straight line fitted to the heart rate and beta-blocker dose data. This raises some questions.

Why fit a straight line? Indeed, before rushing into calculating a regression line, one should have some idea of the likely regression relationship. A simple consideration of the scatter diagram of the raw data, as shown in Fig 5.4, may indicate that a straight line is inappropriate – perhaps a parabola or other curve might be more suitable.



Fig 5.4 Curvilinear fit

Nevertheless, linear regressions are often good first approximations for describing relationships. More complicated models can readily be understood if the simple linear model is learned.

What do we mean by the 'likely' response? The observed responses of our subjects for any one value of the predictor variable are really just a sample of all the possible response values we might observe if we repeated the experiment an infinite number of times at that predictor variable value. In other words, the responses can be thought of as having a separate distribution for each chosen value of the independent variable. Most regression details will fall into place if this concept is grasped. So, a reasonable answer to the question of 'what is the likely response?' is to specify the mean of each of the notional response distributions.


Fig 5.5 Notional response (Y) distributions at each level of predictor variable (X)


Which straight line should be fitted? The basic aim of linear regression is to select a straight line that best fits the data. This boils down to choosing an intercept (the value of Y when X = 0) and a slope (the change in Y for a unit change in X), since the numerical values of these parameters uniquely specify a line. Fitting by "line of sight" is an unsatisfactory approach because no two people will see a scatterplot of data in quite the same way. What we need is an agreed criterion for what constitutes "best fit" that will produce unique, reproducible estimates of the intercept and slope for any given set of data. One criterion which has gained wide acceptance is the criterion of least squares, which leads to a method of estimation called the method of least squares. This was first published by Legendre, one of the great mathematicians around the time of the French Revolution, although the German genius Carl Friedrich Gauss had justified the method 10 years before (at the age of 17). For our purposes, it is important to understand the concept, but not get too bogged down in the algebra.

§ 5.2.2 Least Squares Estimation

We begin with a review of the basic algebra of the line. You may remember (?) from high school mathematics that the general equation of a straight line is:

Y = b0 + b1·X    E5.1

where Y and X are variables, and b0, the intercept (where the line hits the Y axis at X = 0), and b1, the slope, are constants. The situation is somewhat changed in fitting a regression line in the experimental setting:

• Although Y remains a (dependent, random) variable, values of X, the independent variable, are generally thought of as being constant for the purposes of any particular prediction.

• The intercept, b0, and the slope, b1, of the line can be thought of as sample statistics which are estimating the intercept and slope parameters in the population (denoted respectively β0 and β1). From our raw data, we will derive specific values for the intercept and slope, but we must bear in mind that our data arise from just one of the many possible samples we could take from the population – it should not surprise us to find that values of the slope and intercept would vary from sample to sample. In fact, if we collected data on many samples we could form a sampling distribution of the values of the slope and a sampling distribution of the values of the intercept. This will be important, as we shall soon see.

You can see from Figure 5.6 that, in a statistical world, even a line of “best fit” is not a perfect fit. The data are taken from a study of 20 young adults. The objective of the study was to see if a single skinfold thickness measurement (taken over the triceps muscle in the upper arm) could be used as a proxy measure for percent total body fat.


[Figure: body fat % (10–30) plotted against skinfold thickness in mm (15–35), showing the datum (22.1, 21.3), its fitted value of 17.4 on the line of best fit, and the error ei between them.]

Fig 5.6 Error (ei) for point (xi, yi) around the line of best fit

Consider, quite arbitrarily, the point labelled (22.1, 21.3) corresponding to the raw datum: skinfold = 22.1 mm, body fat = 21.3%. For this specific value of the independent variable, 22.1 mm, the regression gives a "fitted" value ŷi = 17.4% on the line, which differs from the observed value, yi = 21.3%, of the data set. The difference (vertical distance) between the observed value yi and the value predicted by the regression line, ŷi, can be thought of as representing the random error of observing the response at that particular value of the predictor, xi. We denote this error as ei, and, in this example, ei = 21.3 – 17.4 = 3.9%. More generally, we have:

ei = yi – ŷi    E5.2

But for the ith observation, we know from the line equation, E5.1, that the fitted, or expected, response (that is, the response predicted by the regression) is given by:

ŷi = b0 + b1·xi    E5.3

Substituting E5.3 into E5.2, we get:

ei = yi – [b0 + b1·xi]    E5.4

The rationale of least squares is to minimise the sum of the squares of these error terms – in graphical terms, we try to find values of the intercept and slope such that the sum of the squared vertical distances of each observed point from the line is least. In mathematical terms, we attempt to minimise (see E5.2):


Σ(ei²) = Σ[(yi – ŷi)²]    E5.5

where the summation is over all i = 1…n data points. It might be wise to read all that once more. The more mathematical among you can use the differential calculus to find the unique numerical values (for any given set of data) of b0 and b1 such that the error sum of squares is minimised. This is a little tedious for us here. Suffice it to say that if you apply this method to a set of data, you will find that:

slope: b1 = Σ[(xi – x̄)(yi – ȳ)] / Σ[(xi – x̄)²]    E5.6

intercept: b0 = ȳ – b1·x̄    E5.7

where:

• xi and yi are the individual raw data measurements;
• x̄ and ȳ are the sample means, respectively, of the independent and dependent variables calculated in the usual way from the raw data (see E1.2); and
• the summation in each case is over i = 1…n.

The expression for b1 in E5.6 is rarely used for calculations since it is tedious and error-prone. An equivalent expression, which looks horrible but is really much more convenient in calculations (particularly if your calculator does sums, sums of squares and sums of cross-products), is:

b1 = Sxy/Sxx    E5.8

where:

Sxy = Σ(xi·yi) – (Σxi)(Σyi)/n    E5.9

Sxx = Σ(xi²) – (Σxi)²/n    E5.10

Anyway, once we have calculated the slope and intercept, we can substitute the expression for the intercept, b0, given in E5.7 into the original (general) equation of the straight line E5.1; this shows up an interesting little property of the least squares line. If you do the algebra, you will find that the equation of the line becomes:

Y = ȳ + b1·(X – x̄)    E5.11

In other words, the regression line goes through the point whose coordinates are given by the means of the raw data of the dependent variable, Y, and the independent variable, X. This provides a good check on your calculations of b0 and b1. I repeat:


the point whose coordinates are given by the sample means lies exactly on the regression line. [Don't confuse this practical check on calculations with the theoretical notion that the regression line also goes through each mean of each distribution of the response variable, as we saw above in §5.2.1 and Fig 5.5.]

Example 5.2

A pharmacologist injects a different dose of beta-blocker into 5 mice from the same litter. He wishes to see if there is a linear relation between increasing drug doses and heart rate. The data from the experiment are as follows:

mouse              1    2    3    4    5
dose (units)       0   20   40   60   80
heart rate (bpm) 300  210  190   95   45

First notice that as the dosage increases the heart rate decreases. This is to be expected from the known pharmacological properties of the drug. We will therefore expect to find a line with a negative slope. Now let's sketch a scatterplot.

[Figure: heart rate (0–300 beats/min) plotted against propranolol dose (0–80 units) for the five mice.]

Fig 5.7 Scatterplot for mouse experiment data

We notice that a linear relationship is not unreasonable, although one could argue that the amount of data at our disposal is really too small to make a good judgement. The message from all this is to look at your raw data first and make sure it makes sense to use a linear regression before winding up your calculator. We will now calculate the intercept and slope of the line of best fit (the regression line) by the method of least squares. We calculate the slope first, so we need the intermediate quantities for E5.8. We have:


Σxi = 200    x̄ = 40    Σ(xi²) = 12000
Σyi = 840    ȳ = 168    Σ(yi²) = 181250
Σ(xiyi) = 21100    n = 5

So, Sxy = 21100 – (200 × 840)/5 = –12500 [see E5.9]
and Sxx = 12000 – 200²/5 = 4000 [see E5.10]

Therefore, the slope, b1 = –12500/4000 = –3.125 [see E5.8]
and the intercept, b0 = 168 – (–3.125 × 40) = 293 [see E5.7]

So the regression line has the equation:

expected heart rate = 293 – (3.125 × dosage)

We note the slope is negative, as our knowledge of pharmacology and physiology leads us to expect. In this example, though this is not always the case, it makes some sense to ask: what would be the mean (expected) heart rate of mice were they not to be given beta-blocker? The answer is the value of the intercept, 293 beats/minute, as this is the predicted response when dose = 0 is substituted into the regression equation. The slope may be interpreted as follows: for a unit increase in drug dose, the mean heart rate decreases by 3.125 beats/min. By the way, the calculated intercept and slope are also termed the regression coefficients. It is good practice to always draw the regression line superimposed on the scatterplot to give you a feel for how well it fits the data. We'll return to Example 5.2 shortly.

§ 5.2.3 Why Use Least Squares?

Certainly, there are many other possible ways of fitting a line to data. (For example, instead of minimising the sum of squares of vertical distances, we could work on perpendicular distances from the line.) However:

• Least squares is, in a mathematical sense, relatively easy and its rationale seems plausible enough.

• Remember that the slope and intercept are sample statistics and so will have sampling distributions. Under some reasonable assumptions (see §5.2.4), least squares delivers sample estimates of the true values of the population regression coefficients that are unbiased, and, out of all methods, delivers estimates which have the least possible variance. Confidence intervals placed around population intercepts and slopes will be narrowest if least squares has been used. Least squares estimates are the most precise estimates possible.

• The straight line delivered by least squares goes through the mean of each (theoretical) distribution of Y values. This should seem satisfying, especially when we go on to use the regression relationship to predict Y values.
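For readers who would rather let a computer do the arithmetic, here is a minimal sketch (plain Python, standard library only) that reproduces the Example 5.2 calculation using the computational forms E5.7 to E5.10. The variable names are purely illustrative.

# Least-squares slope and intercept for the Example 5.2 mouse data.
dose = [0, 20, 40, 60, 80]        # x: beta-blocker dose (units)
rate = [300, 210, 190, 95, 45]    # y: heart rate (beats/min)
n = len(dose)

sum_x, sum_y = sum(dose), sum(rate)
sum_xy = sum(x * y for x, y in zip(dose, rate))
sum_xx = sum(x * x for x in dose)

Sxy = sum_xy - sum_x * sum_y / n   # E5.9: -12500.0
Sxx = sum_xx - sum_x ** 2 / n      # E5.10: 4000.0

b1 = Sxy / Sxx                     # slope, E5.8: -3.125
b0 = sum_y / n - b1 * sum_x / n    # intercept, E5.7: 293.0

print(f"expected heart rate = {b0} + ({b1}) x dose")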

In summary so far: we have used the method of least squares to find the equation of a line which "best fits" our raw data. Before going on to see how useful this line is, we must understand the assumptions we have made.

§ 5.2.4 Assumptions of the Simple Linear Model

First, to claim a line is in some sense (eg least squares) a "best fit line" we are assuming that a linear model is appropriate. The arithmetic of least squares will blindly deliver up an equation for a straight line, regardless of the appropriateness of the assumed model. Always check your data by means of a scatterplot to see if a curved line might be better, or obtain some validation of a linear relationship from other published data. Good statistical practice is always more than just blindly applying formulae, no matter how easy it is with a computer or how mathematically satisfying are the derivations. Remember that a regression line, with intercept and slope estimated to 10 decimal places and exquisitely presented, is useless if the underlying model is inappropriate. It is certain that much published research utilises regression models which have been poorly validated.

The second assumption is that the theoretical distribution of error terms, that is, all of the (yi – ŷi), at each level of the predictor variable, X, has a mean of zero and the same variance (say, σ²) whatever level of the predictor variable we are considering. The latter is called the homogeneity of variances assumption. Fig 5.5 shows these features. The spread of "errors" at each value of X is symmetrically disposed around the regression line, and the spread is similar for each distribution.

The third assumption of regression is that the value of any one observed error (ei) has no effect on the size or sign of any other error term. We say that the errors are independent.

The fourth assumption is that the error terms are independent of the values of the independent variable.

The fifth assumption is important because it allows us to venture into the realm of statistical inference. You will remember from all our previous discussions of inference that we need theoretical probability distributions to provide models against which we compare our observed data. The starting point for inferential tests in the linear regression context is the assumption that the error terms have a Normal distribution. [The shorthand notation for the description of the population distribution of errors is: εi ~ N(0, σ²), which is read "the errors have a Normal distribution with mean 0 and common variance of sigma-squared".]

§ 5.2.5 Practical Use #1: Does a Relationship Exist?

The most important use of the calculated regression line is to see if a linear relationship exists between the dependent and independent variables. In other words, is it worthwhile using information you have on the independent variable to predict the response value? If it is, then you can go on to use your line to predict other responses to possible values of the independent variable. If it is not, try a different type of model.

Example 5.3

If we looked for a relationship between a student's IQ and the length of time it takes the student to fly in an aeroplane on a scheduled flight between Sydney and Perth, it is unlikely that one would be found.

[Figure: IQ (60–140) plotted against flight time in minutes (250–350); the fitted regression line is horizontal at Y = ȳ.]

Fig 5.8 No relationship

Now, pick a flight-time at random. Using this value of the predictor variable, what is the expected IQ of a student picked at random? In the absence of any meaningful relationship between the predictor and outcome variable, the best bet would be to nominate the mean IQ of all students in the sample – in other words we really haven’t bothered to use the flight-time information at all. The situation in Example 5.3 is supported by the least squares regression equation. Consider the regression equation in the form of E5.11:

Y = ȳ + b1·(X – x̄)


If the slope, b1, of the line is 0, there is no linear relationship, and if we are asked: “What value of Y are we likely to get for a given X value?”, we may as well just choose the mean of Y from our sample regardless of which X value is nominated. That is, E5.11 degenerates to the expression:

Y = ȳ

The predicted outcome has become a constant (with value ȳ) and hence is unrelated to any independent variable. The regression line is horizontal (see Fig 5.8). We will return to a somewhat less trivial example, and apply inferential methods to the regression relationship.

Example 5.2 continued

We found there was a negative relationship between heart rate and drug dosage – as the dose increased, the heart rate decreased. The value of the slope was: b1 = –3.125. (The units of the slope would be: beats per minute per unit drug dose.) Beta-blockers are cardiac depressants, so this negative relationship is expected. If the mice had been injected with a stimulant drug such as adrenaline, we would have expected a positive relationship with a slope greater than zero. Eventually we would like to use this relationship to predict other response values. For example, what is the predicted heart rate when the dosage is 55 units? But first we ought to see if our regression equation is worthwhile using. Time for a bit more theory!

The researcher's data are from just one of many possible samples. Consequently the values of the slope (and the less interesting intercept) might change with the next sample – the slope may even reflect that no real relationship exists. The sample slope, b1, is estimating an underlying value for the slope in the population. This population slope is denoted by the Greek letter beta with the appropriate subscript: β1 (nothing to do with the probability of a Type 2 error). We really want to know how often a population, wherein the slope is zero, might generate a sample with a slope as large as the one we observed. (This is familiar territory – hypothesis testing!) If this occurs often, then we ought not act as if our sample slope is reflecting a worthwhile linear relationship, and we would not bother using values of our independent variable to predict the dependent variable. Now we reap the benefit of specifying that the theoretical distributions of our errors in responses for each value of the independent variable are Normal. If we do this, it turns out that the distribution of sample slopes is also Normal. The mean of this sampling distribution is estimated by the observed sample slope, b1. The standard error of the sampling distribution of slopes is:

Standard error of the slope, se(b1) = √{ [Syy – (Sxy²/Sxx)] / [(n – 2)·Sxx] }    E5.12

where (repeating some expressions for convenience):


Syy = Σ(yi²) – (Σyi)²/n    E5.13

Sxy = Σ(xi·yi) – (Σxi)(Σyi)/n    E5.9

Sxx = Σ(xi²) – (Σxi)²/n    E5.10

Of course, to make things absolutely explicit, in E5.12, Sxy² = (Sxy)². E5.12 is a horrible expression, one you most certainly will never need to commit to memory. Other texts give algebraically equivalent versions. They are just as horrible.

Example 5.2 continued

If you bother to plug in all the appropriate values into E5.12, you'll find that the standard error of the sampling distribution of the slope is:

se(b1) = √{ [40130 – ((–12500)²/4000)] / [(5 – 2) × 4000] } = 0.298 (to 3 decimal places).

The Null hypothesis for a simple linear regression is that there is no linear relationship between the variables in the population. The Alternative, or Research, hypothesis is that there is a linear relationship in the population. That is:

Ho: β1 = 0 versus H1: β1 ≠ 0    ["≠" means "does not equal"]

The test statistic t (no surprise here!) is:

t = (sample slope – Null hypothesised slope) / (standard error of the slope)    E5.14

where t will follow a Student's t distribution on (n – 2) degrees of freedom. The t statistic gives the distance between the sample slope and the assumed (Null) population slope in units of the standard error. Why (n – 2) degrees of freedom? Although we start with n independent data points, we used these same data to estimate two parameters, the slope and the intercept. Therefore, 2 df are lost.

Example 5.2 continued

For the beta-blocker data, using E5.14:

t = (–3.125 – 0)/0.298 = –10.5


This t-statistic should be compared to the t distribution with (5 – 2) = 3 degrees of freedom. From tables, assuming we wish to do a two-tail test at the 5% level, the critical values of t cutting off a total probability of 0.05 in the tails (0.025 in each tail) are: t(3 df; α = 0.05) = ±3.182. Our value of the test statistic, –10.5, exceeds this in absolute value. That is, there are fewer than 5 chances in 100 that a sample such as this could have been drawn from a population with no linear relationship. Therefore we reject the Null hypothesis and claim that there is evidence for a linear relationship. In making this decision we run, at most, a 5% chance of being wrong (Type 1 error). In fact, from a computer, the true two-tailed P value associated with |t| = 10.5 is 0.002, so our chance of making a Type 1 error when we reject Ho is very low indeed.

It is also possible to construct confidence intervals around the population slope in much the same way we did for the mean in Chapter 3. This will give us an idea of the precision of our estimate of the underlying slope in the population, and also another method of deciding between Ho and H1. The lower and upper limits for, say, the 95% confidence interval around β1 are given by:

Lower limit = b1 – [tcrit × standard error(b1)]
Upper limit = b1 + [tcrit × standard error(b1)]    E5.15

In E5.15, the t values are the tabulated two-tailed critical values on the appropriate degrees of freedom, for a simple linear regression, df = n – 2, and the standard error of b1 is given by E5.12. Note once again that if the sample size, n, is increased, the standard error of the slope will decrease (see E5.12), leading to a narrower confidence interval and greater precision for the estimate of the population slope.

Example 5.2 continued

For the beta-blocker experiment, the limits around the population slope are:

–3.125 ± (3.182 × 0.298) = [–4.07 , –2.18]

We can see that the confidence interval does not include the Null hypothesised slope (zero) so we would reject Ho as before.
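If a computer is to hand, the same inference can be reproduced in a few lines. This is only a sketch, assuming Python with SciPy available for the t distribution; it simply re-implements E5.12, E5.14 and E5.15 for the Example 5.2 data.

from math import sqrt
from scipy.stats import t as t_dist

dose = [0, 20, 40, 60, 80]
rate = [300, 210, 190, 95, 45]
n = len(dose)

Sxx = sum(x * x for x in dose) - sum(dose) ** 2 / n                        # 4000
Syy = sum(y * y for y in rate) - sum(rate) ** 2 / n                        # 40130
Sxy = sum(x * y for x, y in zip(dose, rate)) - sum(dose) * sum(rate) / n   # -12500

b1 = Sxy / Sxx                                           # -3.125
se_b1 = sqrt((Syy - Sxy ** 2 / Sxx) / ((n - 2) * Sxx))   # E5.12: about 0.298
t_stat = (b1 - 0) / se_b1                                # E5.14: about -10.5
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)           # about 0.002

t_crit = t_dist.ppf(0.975, df=n - 2)                     # 3.182
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)          # E5.15: about (-4.07, -2.18)
print(f"se = {se_b1:.3f}, t = {t_stat:.2f}, P = {p_value:.3f}, 95% CI = {ci}")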


Example 5.2 continued

Whether the confidence interval is narrow enough to afford the required precision in estimating the underlying population slope is a matter for judgement on the part of the pharmacologist.

The interval tells us that, in the population, the heart rate probably decreases by between 2 and 4 beats per minute for each milligram increase in drug dose. If he thinks this interval is too wide – lacks precision – the researcher might narrow the interval by repeating the experiment with a larger sample size. At present, his best single estimate, the point estimate, for the underlying slope, is the sample slope – the heart rate decreases 3.125 beats per minute per milligram increase in beta-blocker dosage. In summary, a slope which does not differ statistically from zero, or a confidence interval which includes zero, points to the lack of a significant linear relationship.

§ 5.2.6 Practical Use #2: Prediction

Having decided that the regression relation is appropriate, it is a simple matter to use it for predictive purposes. Just plug the value of the independent variable into the regression equation. It is absolutely crucial to realise that, since the regression line goes through the mean of each theoretical outcome distribution, each point on the regression line represents an estimate of the population mean response at the corresponding value of the independent variable.

Example 5.2 continued

What is the expected (mean) response in heart rate of mice injected with 55 units of beta-blocker?

Solution: From the calculated regression equation:

mean heart rate = 293 – (3.125 x 55) ≈ 121 bpm

Of course, this is a mean response so it has (you guessed it) a sampling distribution that describes the relative frequency of mean responses to a 55 unit dose in repeated experiments. It turns out (handily enough) that this sampling distribution is Normal, but with a complicated standard error that we need not worry about just now. A computer program used the formula: mean response ± [tcrit x se(mean response)], where tcrit is the 5% two-tailed t value on 5–2=3 df, to calculate a 95% confidence interval for the population mean heart rate response to a dose of 55 units. The lower limit was 91 bpm and the upper limit was 152 bpm. We might remark that this interval is rather imprecise – not surprising given the sample size of only 5. Also, if another scientist proposed that the average response to a dose of 55 units was less than 80 bpm, how would we respond given the data of this experiment?
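The chapter does not derive the standard error of an estimated mean response, so the sketch below should be read as an illustration only: it uses the usual textbook formula se(ŷ0) = s·√(1/n + (x0 – x̄)²/Sxx), with s the residual standard deviation, and assumes SciPy for the t critical value. It reproduces the quoted interval of roughly 91 to 152 bpm.

from math import sqrt
from scipy.stats import t as t_dist

dose = [0, 20, 40, 60, 80]
rate = [300, 210, 190, 95, 45]
n, x0 = len(dose), 55

x_bar, y_bar = sum(dose) / n, sum(rate) / n
Sxx = sum((x - x_bar) ** 2 for x in dose)                          # 4000
Syy = sum((y - y_bar) ** 2 for y in rate)                          # 40130
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dose, rate))   # -12500

b1 = Sxy / Sxx
b0 = y_bar - b1 * x_bar
y_hat = b0 + b1 * x0                                   # about 121 bpm at 55 units

s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))             # residual standard deviation
se_mean = s * sqrt(1 / n + (x0 - x_bar) ** 2 / Sxx)    # assumed textbook formula
t_crit = t_dist.ppf(0.975, df=n - 2)
print(round(y_hat), round(y_hat - t_crit * se_mean), round(y_hat + t_crit * se_mean))
# prints roughly 121, 91, 152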


§ 5.2.7 A Final Caution and a Look at Influential Data

The caution is a repeat: always look at your data by means of a scatterplot before recklessly fitting a regression line. I have already mentioned at least 3 reasons why you should do so:

• the relation between dependent and independent variable may not be linear;
• the relationship may be linear only over a range of the data, and prediction by extrapolation may be seriously in error; and
• you get a visual check on, for example, the sign of the slope for any subsequently calculated regression.

Here is another reason: certain specific data points may unduly influence the values of the regression coefficients. Even just one or two data points, if sufficiently strategically located, can markedly change the nature of a regression. Here is an example using the data from Figure 5.3, which shows heart-rate response to different drug doses for 32 experimental mice. From a computer run, the equation of this regression line is (see line A in Figure 5.9):

expected heart rate = 314.8 – (19.5 × drug dose)

So that a unit increase in dose is associated with a decrease in heart rate of 19.5 beats/min. Now let us artificially add in a single datum at dose = 5 units, heart rate = 80 beats per minute. Note from Fig 5.9 that this response is far below the other responses for a drug dose of 5 units. With the new datum added, the regression equation (Fig 5.9 B) becomes:

expected heart rate = 309.7 – (19.4 x drug dose)

We see that, while the intercept has changed substantially, so that predicted outcomes will change, the slope has hardly been affected so our concept of the relationship between heart-rate and drug dose is virtually unaltered. Finally, let us now remove the recently added datum [5, 80], but add in 2 new observations: [10, 460] and [10, 490]. The regression equation (Fig 5.9 C) becomes:

expected heart rate = 275.8 – (8.9 x drug dose)

Now, not only will predicted outcomes change, but the slope itself has changed markedly in value – it is less than half the original value. Our concept of the relationship between the variables changes accordingly, but our changed concept is based on only two observations which are quite unusual given the majority of the data. This is an uncomfortable position to be in: we want to efficiently summarise the relationship by means of a regression line, but the validity of the line appears very fragile. A statistician would say that the regression relationship is not robust in the face of observations such as these.


[Figure: two panels of heart rate (0–500 bpm) against dose (0–10 units) showing fitted lines A, B and C; the added point at (5, 80) is labelled an outlier for heart rate but not very influential, while the two added points at dose 10 are labelled influential observations.]

Fig 5.9 Regression lines: (A) original data set; (B) with one non-influential outlier; (C) with two influential outliers

It is even possible for a few unusual data points to reverse the sign of a slope. Imagine, in Fig 5.9, that the new points were not [10, 460] and [10, 490] but [10, 800] and [10, 900]; the slope would become +2.4. The unusual observations yield a relationship that would seriously mislead the unwary. An observation that has an undue effect on the regression coefficients is called an influential observation. It is often true that such points are the result of data entry errors, which can be corrected, but sometimes they are valid data and the researcher must then decide how to deal with them. More advanced texts have discussions on (i) how to identify influential points (both graphically and statistically) and (ii) what to do about them. I repeat: draw a graph and think about the data before doing anything else. One day, you will either (i) be grateful that you did, or (ii) be embarrassed professionally that you did not.
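More advanced texts give formal influence diagnostics (leverage, Cook's distance, and so on). As a very rough screen, and purely as a sketch in plain Python, one can refit the least-squares line with each observation left out in turn and see how far the slope moves; the function names below are hypothetical, not from this chapter.

# Rough influence screen: refit the line leaving out one point at a time
# and report the change in slope relative to the full-data fit.
def fit_slope(xs, ys):
    n = len(xs)
    Sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    Sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return Sxy / Sxx

def slope_changes(xs, ys):
    full = fit_slope(xs, ys)
    changes = []
    for i in range(len(xs)):
        slope_without_i = fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        changes.append((i, slope_without_i - full))
    return changes

# Example usage with the Example 5.2 mouse data:
print(slope_changes([0, 20, 40, 60, 80], [300, 210, 190, 95, 45]))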


§ 5.3 CORRELATION

§ 5.3.1 Introduction

As we have seen, the regression model is used to predict or estimate a response from a pre-specified value of an independent variable. Correlation, a term also first used by Galton, is used to measure the strength of a linear relationship. The measure of correlation most often used is the value of Pearson's correlation coefficient, named for Karl Pearson. It is also sometimes called the product-moment correlation coefficient. The correlation coefficient will answer questions like:

• We know that, in Western society at least, systolic blood pressure rises with age; what is the strength of this relationship?
• How strong is the linear relationship between a nation's expenditure on health services and its citizens' life expectancy?
• If bone density is linearly related to average daily intake of calcium, what is the strength of the relationship?

In addition, a hypothesis test or a confidence interval around the correlation coefficient will tell us whether this measure of strength in the sample is a good reflection of the state of affairs in the population. You will have realised by now that this will require a knowledge of the sampling distribution of the correlation coefficient.

§ 5.3.2 Assumptions for Correlation

Correlation puts two variables on an equal footing. There is no notion of one variable being "dependent" on the other. Unlike the regression setting, both X and Y are considered to be random variables – the X values are not pre-specified constant values. For ease of developing inferential tests it is handy to specify that, for a given value of the X variable, the possible values of Y are Normally distributed. Conversely, for a given value of the Y variable, the possible X values are Normally distributed. These conditions of Normality for each variable are really consequences of the main assumption that the X-Y pairs are drawn at random from a Bivariate Normal Distribution (see Figure 5.10 below). Figure 5.10 shows 10,000 observations randomly drawn from a bivariate Normal distribution formed from two uncorrelated standard Normal distributions. You can see from the back-projection that a slice cut parallel to the Y-axis (resulting in a Y-distribution for a single X value) shows the familiar (univariate) Normal distribution. Similarly, a slice cut parallel to the X-axis will show a Normal X distribution for a single Y value.



Fig 5.10 A Bivariate Normal Distribution

The usual assumptions of equal variances must also be met – one common variance, σx², for the X-distributions and one common variance, σy², for the Y-distributions. We do not require that σx² = σy², although in Figure 5.10 this is true: since both X and Y distributions are (by construction) standard Normal, we know σx² = σy² = 1. It would not surprise you, therefore, that a bird's-eye view looking directly down on the mound would be a circle in the X-Y plane. Figure 5.11 shows another bivariate Normal distribution, this time formed from a standard Normal distribution and another Normal distribution with mean 0 (just for convenience) but variance of 7. The two distributions, just like those in Fig 5.10, were uncorrelated. Our little feathered friend would now see an ellipse, rather than a circle, reflecting the different variances of the component univariate Normal distributions, but the axes of the ellipse would be parallel to the original X and Y axes. This is a consequence of the absence of correlation between these particular univariate distributions. We shall soon see what happens when the variables forming the bivariate distribution are correlated, but even now you might guess that when correlation is present, the ellipse will begin to rotate such that its axes are no longer parallel to the X-Y coordinate axes (see Fig 5.12 below).


[Figure: three views of this bivariate Normal distribution – the projection of the Y distribution, the projection of the X distribution, and the bird's-eye view ("what the bird saw").]

Fig 5.11 Views of another Bivariate Normal Distribution

§ 5.3.3 A Few Basics

The value of the correlation coefficient which we calculate from sample data is designated r. The population correlation coefficient which our sample r is estimating is designated by the Greek letter rho (ρ), not to be confused with p for proportion, percentage or probability! Here is the formula for calculating a sample correlation coefficient:

r = Σ[(xi – x̄)(yi – ȳ)] / [(n – 1) · (std dev of x) · (std dev of y)]    E5.16

or, equivalently, and better for calculation:

r = Sxy / √(Sxx · Syy)    E5.17

where the formulae for the terms on the RHS of E5.17 are given by E5.9, E5.10 and E5.13. It is not too difficult to show that the correlation coefficient must take a value between –1 and +1. The former implies a perfect negative relationship, and the latter a perfect positive relationship. For example, if you calculated a correlation coefficient between people's height in metres and height in centimetres, you would


find r = +1. A correlation coefficient equal to zero implies no linear relationship, but does not exclude a non-linear (curvilinear) relationship. I tend to think of a correlation coefficient in the range 0 to 0.4 as showing a weak relationship, 0.4 to 0.7 as a moderate one, and 0.7 to 1.0 as strong correlation. (And, for example, a correlation coefficient of –0.4 to –0.7 would be designated a moderate negative relationship.) These are reasonable, but somewhat arbitrary, classifications and others might well use different cutoffs. In any case, the particular research context may determine how strong one considers an r value. Figure 5.12 shows 4 graphs, each graph depicting a scatterplot of 600 pairs of artificially constructed data. In fact, the two variables are standard Normal, so the mean of each is zero and the variance is 1. The correlation coefficient for each scatterplot is shown. There are two sets of axes: the solid lines are the axes of the ellipse that envelops the cloud of points; the dotted lines are axes parallel to the coordinate axes that divide the space into quadrants. Note that as the correlation gets larger the cloud of data takes on a different shape: it changes from a circular shape with axes parallel to the Y and X axes, to a more elliptical shape with axes at an angle to the coordinate axes. The cloud takes on a steeper slope and the variation in the Y direction at each level of X becomes less, and the variation in the X direction at each level of Y becomes less.

[Figure: four scatterplots of 600 standard Normal pairs with r = 0, r = 0.2, r = 0.55 and r = 0.9; dotted axes through the means divide each plot into quadrants so that positive and negative deviations from the means can be seen.]

Fig 5.12 Bivariate scatterplots of standard Normal variables with varying correlations
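If you would like to generate pictures like those in Fig 5.12 yourself, one standard construction (sketched below in plain Python, standard library only) builds pairs of standard Normal variables with a chosen population correlation rho and then computes the sample r using E5.17. The function names are illustrative.

import random
from math import sqrt

def simulate_pairs(rho, n, seed=1):
    # Y = rho*X + sqrt(1 - rho^2)*Z gives standard Normal X and Y with correlation rho.
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rng.gauss(0, 1)
        pairs.append((x, rho * x + sqrt(1 - rho ** 2) * z))
    return pairs

def sample_r(pairs):
    n = len(pairs)
    sx, sy = sum(x for x, _ in pairs), sum(y for _, y in pairs)
    Sxx = sum(x * x for x, _ in pairs) - sx ** 2 / n
    Syy = sum(y * y for _, y in pairs) - sy ** 2 / n
    Sxy = sum(x * y for x, y in pairs) - sx * sy / n
    return Sxy / sqrt(Sxx * Syy)       # E5.17

for rho in (0, 0.2, 0.55, 0.9):        # the four panels of Fig 5.12
    print(rho, round(sample_r(simulate_pairs(rho, 600)), 2))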

If you look at the scatterplot where r = 0 you see that for each point in the NE quadrant, the deviation from the mean of X and the deviation from the mean of Y will both be positive, so the product of these two deviations – the numerator of E5.16 is the sum over all points of these products – will also be positive. Points in the SW quadrant will have both deviations negative, so the products of deviations will also be positive. However, these positive products are almost exactly balanced by the negative deviation products for points in the NW and SE quadrants. The resultant sum of products leads to r = 0. On the other hand, by the time we get to the scatterplot for r = 0.9, almost all points are in the NE and SW quadrants, giving positive products for the numerator of E5.16, and there are very few counterbalancing points in the NW and SE quadrants giving negative products. The correlation is therefore large and positive. Similarly, you could easily imagine scatterplots showing different negative correlations.

It must be emphasised that even a strong correlation between two variables does not imply causation. (Causation is a complicated notion; those interested might like to investigate Bradford Hill's criteria for causation in science, and also read a bit on the history and philosophy of science.) While a strong correlation may support a claim of causality, it is but one piece of evidence. The real question revolves around whether the research was validly designed to demonstrate a causal connection. Association is a much safer word.

§ 5.3.4 Inferences on the Correlation Coefficient

It is most important to distinguish between the strength of a correlation coefficient and the statistical significance of a correlation coefficient. A strong correlation coefficient, say, r = 0.85, indicates we have found a strong positive relationship in our sample. One should also look at the confidence interval around rho, or the hypothesis test for whether the correlation in the population differs from zero. Even a sample r of 0.85, as above, may not be "significant" in that it may not provide enough evidence that the population correlation coefficient is different from zero. Conversely, a weak sample correlation coefficient, say, r = 0.12, may be highly significant and give us evidence that the correlation in the population differs from zero. The t statistic E5.18 may be used to test the opposing hypotheses:

Ho: ρ = 0 versus H1: ρ ≠ 0

t = r·√(n – 2) / √(1 – r²)    E5.18

where n is the number of observed data pairs, and r is the sample correlation coefficient E5.17. The t statistic of E5.18 follows a Student’s t distribution on (n – 2) degrees of freedom, so critical values cutting off desired probability levels can be found in appropriate tables. This test may only be used to test whether or not ρ is equal to zero; in any case this is the commonest hypothesis. To test whether or not ρ equals any non-zero value, say, 0.75, a more complicated test must be used. We will not discuss this here.


Example 5.2 continued

Let’s use the mouse data to calculate the correlation coefficient between heart rate and drug dosage. Looking back at §5.2.2, where we have already done some of the intermediate calculations, and making the appropriate substitutions in E5.17, we have: r = –12500/√(4000 x 40130) = –0.9866

This is a remarkably strong negative correlation – such a high correlation is most unlikely to be found in a biological or social research context. (However, sometimes a researcher will correlate two variables that are known to be measuring the same characteristic, just to see if using, say, the cheaper, easier-to-collect variable, will give the same information as the other variable.) For the purposes of the exercise we’ll press ahead and see if this strong sample correlation could nevertheless be consistent with no correlation in the underlying population. As usual we’ll do a two-tailed test at the 5% level. Using E5.18:

t = [–0.9866 × √(5 – 2)] / √[1 – (–0.9866)²] = –10.48

This exceeds (in absolute value) the tabulated two-tailed critical value t3df α=0.05 = 3.182. So we can reject Ho and claim there is a non-zero linear correlation in the population. The actual two-tailed P value associated with t = 10.48 is less than 0.01. That is, there is less than 1 chance in 100 that a sample correlation as extreme as –0.987 would arise from a population with no correlation. There is a reasonably standard way of presenting correlation analyses. A report typically states:

• the value of the sample correlation coefficient; • the number of paired observations; and • the probability value associated with the statistical hypothesis test: Ho: ρ = 0.

Example 5.2 continued

The researcher published these findings in a scientific journal: “The results of the correlation analysis were:

r = –0.987 n = 5 P < 0.01.” The first expression relates to the sample findings: there is a strong, negative, linear relationship. Next, the number of bivariate observations in the sample is given. Then there is a probability statement: this concerns the result of the hypothesis test on rho in the population - here, rho differs significantly from zero at the 0.01 level or lower. (Don’t confuse the P value given here with the notation for rho itself, ρ.) Note: Did you spot that the t statistic testing the significance of the correlation

coefficient is the same, except for rounding error, as that calculated for the test of the slope of the regression in Example 5.2 (§5.2.5)? The two tests give identical information. This is no accident, as we shall now explore further.
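The same arithmetic takes only a few lines of plain Python (standard library only). The final line anticipates the identity explored in §5.4: multiplying the regression slope by sx/sy reproduces r.

from math import sqrt

dose = [0, 20, 40, 60, 80]
rate = [300, 210, 190, 95, 45]
n = len(dose)

Sxx = sum(x * x for x in dose) - sum(dose) ** 2 / n                        # 4000
Syy = sum(y * y for y in rate) - sum(rate) ** 2 / n                        # 40130
Sxy = sum(x * y for x, y in zip(dose, rate)) - sum(dose) * sum(rate) / n   # -12500

r = Sxy / sqrt(Sxx * Syy)                     # E5.17: about -0.9866
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # E5.18: about -10.48

b1 = Sxy / Sxx                                       # regression slope, -3.125
sx, sy = sqrt(Sxx / (n - 1)), sqrt(Syy / (n - 1))    # 31.623 and 100.162
print(round(r, 4), round(t_stat, 2), round(b1 * sx / sy, 4))   # last value equals r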


§ 5.4 CORRELATION and REGRESSION

How does the correlation coefficient arise? Just as the Normal distribution is described by two parameters, the population mean, µ, and the variance, σ², the Bivariate Normal Distribution (BND) is described by five parameters. These are:

• The population means for X and Y, µx and µy.
• The population variances for X and Y, σx² and σy².
• A fifth parameter which, in conjunction with σx² and σy², is required to fully describe the shape of the BND mound (see Fig 5.10), that is, whether it is circular or elliptical in transverse section, and, if elliptical, what are the axes of the ellipse. This parameter turns out to be our old friend (?) rho, the population correlation coefficient.

The following discussion shows the simple mathematical equivalence of regression and correlation (but you should remember the different assumptions for the regression and correlation models).

[Figure: left panel – income ($1000s) plotted against time in designated units (years or months), with mean income 35 and mean time 20 marked; right panel – standardised income plotted against standardised time, where the line passes through the origin and its slope = r = tan(θ).]

Fig 5.13 Regression (left) and standardised regression (right)

The left graph shows the regression of certain workers' incomes on years since commencing employment. The regression line passes (as it must) through the point of the sample mean income ($35,000) and the sample mean years (20 years). The slope is $1000 per year – the workers in this sample increase their income by $1000 for each year employed. It seems a bit inefficient that, if we altered the units of measurement on the X (time) axis to reflect months, then the slope would change and so now we have two regressions (or models) with two slopes describing exactly the


same relationship simply because one or both variables are expressed in different units. This also shows why the strength of a linear relationship should not be inferred from the size of the slope coefficient – the size arbitrarily depends on the units of measurement chosen. The graph on the right shows what happens if we standardise the values of our Y (income) values and our X (years or months) values before calculating the regression coefficients. That is, we replace each original income value by: zincome = [original income – mean income]/standard deviation of incomes

and each original years value by:

zyears = [original years – mean years]/standard deviation of years

and use these transformed data in place of the original raw data. Remember that the mean of a standardised variable is always zero, so the new regression line of the standardised variables must pass through the origin. Of more interest is the slope of this standardised regression. It is identical to the correlation coefficient calculated on the original unstandardised variables. Indeed, the correlation coefficient is often defined as the slope of the standardised regression line. The formula relating the correlation coefficient to the slope of the regression line is:

r = b1·(sx/sy)    E5.19

where sx and sy are the standard deviations of the respective variables, calculated using E1.3 and E1.4. You could check this out using the simple data set of Example 5.2 which concerned our little rodent friends. The slope, b1 = –3.125, sx = 31.623 and sy = 100.162. Plugging these into the RHS of E5.19 gives a correlation coefficient of –0.987, identical to the r calculated in §5.3.4.

§ 5.5 ADVANCED USES of REGRESSION

Regression methods are also used in what is called model building. A researcher may believe that several different independent variables may influence the response variable. He or she will use more complicated versions of the techniques described in this chapter to see which of the independent variables is most important in predicting the response. Instead of looking at each [response variable, independent variable] pair separately, s/he will probably use multiple regression to look at the simultaneous effect of several predictor variables on a response variable. The researcher will claim to have constructed a working model to explain why the response varies in the population. The model may be put to the test using a different set of raw data – do the independent variables of the original model perform well in predicting the observed outcomes? If not, the model may need to be redefined with new, more appropriate predictors.


Note that we have discussed only the basics of linear regression in this chapter. Often, especially in epidemiological studies, data for the outcome variable are not measured on an interval or ratio scale. For example, the outcome may be dichotomous (dead/alive) or may be the time to an event. Special non-linear regression techniques must be used in these cases. The town of Framingham in Massachusetts has been the site of a 30 year study into the determinants of coronary artery disease. By measuring many physiological and behavioural factors of the town's inhabitants and seeing who develops the disease, epidemiologists have been able to model the risk of having a heart attack (the dependent variable) on several independent variables simultaneously, for example, smoking behaviour, blood pressure and serum cholesterol levels. Because of this research, these independent variables are now called risk factors for coronary artery disease.

§ 5.6 SUMMARY

Simple linear regression fits a straight line by the method of least squares to set levels of a predictor variable and the observed responses. Uses of regression are: estimation of regression coefficients, prediction of mean responses, and model building. Correlation provides a summary measure of the strength of the linear relationship between two variables.
