Basic Medical Statistics Course
Correlation and simple linear regression
S6
Patrycja [email protected]
December 3, 2014
Introduction
- So far we have looked at the association between:
  - Two categorical variables (chi-square test)
  - A numerical variable and a categorical variable (independent samples t-test and ANOVA)
- We will now look at the association between two numerical (continuous) variables, say x and y
Introduction
Example 1: Mortality from malignant melanoma of the skin versus latitude of residency among white males in the United States (van Belle et al, 2004)

 #    State         Latitude (degrees North)   Mortality rate (deaths per 10 million)¹
 1    Alabama       33.0                       219
 2    Arizona       34.5                       160
 3    Arkansas      35.0                       170
 4    California    37.5                       182
 5    Colorado      39.0                       149
 6    Connecticut   41.8                       159
 7    Delaware      39.0                       200
 ...
 48   Wisconsin     44.5                       110
 49   Wyoming       43.0                       134

How do we investigate the association between these two variables?

¹ Mortality rate for the period 1950–1959
Relationship between two numerical variables
If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and

1. Calculate Pearson’s product moment correlation coefficient between x and y
   - Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   - Finds the equation of the line that best describes the relationship between the variables seen in a scatter plot
Correlation
The sample Pearson’s product moment correlation coefficient, or correlation coefficient, between variables x and y is calculated as

r(x, y) = 1/(n − 1) · Σ_{i=1}^{n} ((xi − x̄)/sx) · ((yi − ȳ)/sy) = 1/(n − 1) · Σ_{i=1}^{n} z_xi · z_yi

where {(xi, yi) : i = 1, ..., n} is a random sample of n observations on x and y, x̄ and ȳ are the sample means of x and y respectively, sx and sy are the corresponding sample standard deviations, and z_xi and z_yi are the z-scores of x and y for the i-th observation.
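As a check on the formula, here is a minimal Python sketch (the helper name is ours; the five (latitude, mortality) pairs are the first rows of Example 1):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation via the z-score formula above."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 in the denominator)
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # r = 1/(n-1) * sum over i of z_xi * z_yi
    return sum(((xi - mean_x) / sx) * ((yi - mean_y) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

latitude = [33.0, 34.5, 35.0, 37.5, 39.0]   # first five states of Example 1
mortality = [219, 160, 170, 182, 149]
print(round(pearson_r(latitude, mortality), 3))
```

On these five states alone r is already clearly negative, in line with the full-sample result shown later.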
Correlation
Properties of r:

- r estimates the true population correlation coefficient (ρ)
- r takes on any value between −1 and 1, i.e. −1 ≤ r ≤ 1
- The magnitude of r indicates the strength of a linear relationship between x and y:
  - r = −1 or 1 means perfect linear association
  - r = 0 indicates no linear association (but the association can be e.g. non-linear)
  - The closer r is to −1 or 1, the stronger the linear association (e.g. r = −0.1 (weak association) vs r = 0.85 (strong association))
- The sign of r indicates the direction of association:
  - r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction
  - r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions
Correlation
Properties of r (cont):

- r(a · x + b, c · y + d) = r(x, y), where a > 0, c > 0, and b and d are constants
- r(x, y) = r(y, x)
- r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
- r² is called the coefficient of determination
  - r² is a number between 0 and 1
  - It represents the proportion of total variation in one variable that is explained by the other
  - For example, a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% is explained by other factors.
Correlation

[Figure: example scatter plots illustrating r = −1, r = 1, r = 0.8, r = −0.8, two patterns with r = 0, 0 < r < 1, and −1 < r < 0]
Don’t interpret r without looking at the scatter plot!
Correlation
Hypothesis test for the population correlation coefficient ρ:

H0: ρ = 0
H1: ρ ≠ 0

Under H0, the test statistic

T = r · √[(n − 2)/(1 − r²)]

follows a Student-t distribution with n − 2 degrees of freedom.

Note:
- This test assumes that the variables are normally distributed
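Plugging in the values from Example 1 (r = −.825, n = 49, shown on the following slides) gives a feel for the size of this statistic; a small sketch (the function name is ours):

```python
import math

def corr_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)); under H0 it is Student-t with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Example 1: r = -.825 across n = 49 states
t = corr_t_statistic(-0.825, 49)
print(round(t, 2))  # |T| is far beyond the two-sided t critical value for 47 df
```

An absolute T of about 10 on 47 degrees of freedom corresponds to the tiny p-value (.000) reported in the SPSS output.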
Correlation
Example 1 revisited: skin cancer mortality vs latitude
[Figure: scatter plot of Mortality (deaths per 10 million) versus Latitude (degrees North)]

What are the magnitude and sign of the correlation coefficient between latitude and skin cancer mortality?
Correlation
Example 1 revisited: skin cancer mortality vs latitude
SPSS output:

Correlations
                                   Mortality   Latitude
Mortality   Pearson Correlation    1           -.825**
            Sig. (2-tailed)                    .000
            N                      49          49
Latitude    Pearson Correlation    -.825**     1
            Sig. (2-tailed)        .000
            N                      49          49
**. Correlation is significant at the 0.01 level (2-tailed).

Here r = −.825, the p-value is .000 (i.e. p < .001), and n = 49.
Simple linear regression
- Pearson’s product moment correlation coefficient measures the strength and direction of the linear association between x and y
- But often we are also interested in predicting the value of one variable given the value of the other
- This requires finding an equation (or mathematical model) that describes or summarizes the relationship between the variables
- If a scatter plot of our data shows an approximately linear relationship between x and y, we can use simple linear regression to estimate the equation of this line
- Regression, unlike correlation, requires that we have
  - a dependent variable (or outcome or response variable), i.e. the variable being predicted (always on the vertical or y-axis)
  - an independent variable (or explanatory or predictor variable), i.e. the variable used for prediction (always on the horizontal or x-axis)
- Let’s assume that x and y are the independent variable and the dependent variable, respectively
Simple linear regression
Simple linear regression postulates that in the population

y = (α + β · x) + ε,

where:
- y is the dependent variable
- x is the independent variable
- α and β are parameters called population regression coefficients
- ε is a random error term
Simple linear regression
[Figure: scatter plot of y versus x, with the distribution of y shown at x1, x2, x3, x4, x5]

E(y|xi) is the mean value of y when x = xi
Simple linear regression
[Figure: the same plot with the line E(y|x) = α + β·x drawn through the means E(y|xi) at x1, ..., x5]

E(y|x) = α + β · x is the population regression function
Simple linear regression
[Figure: the line E(y|x) = α + β·x, with α marked as the y-intercept and β as the change in E(y|x) per unit increase in x]

- α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0
- β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x
- c · β is the mean change in y for a c-unit increase in the value of x
- α and β are estimated from the sample data using the method of least squares (usually)
Simple linear regression
[Figure: fitted line ŷ = a + b·x, with a data point (xi, yi) and its residual ei = yi − ŷi]

The least squares method chooses a and b (estimates for α and β) to minimize the sum of the squares of the residuals

Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} [yi − (a + b · xi)]²
Simple linear regression
The least squares estimates for α and β are:

b = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²

and

a = ȳ − b · x̄,

where x̄ and ȳ are the respective sample means of x and y.

Note that:

b = r(x, y) · sy/sx,

where r(x, y) is the sample product moment correlation between x and y, and sx and sy are the sample standard deviations of x and y.
Simple linear regression
Relationship between the slope b and the correlation coefficient r

- r ≠ b unless sx = sy
- r measures the strength of a linear association between x and y, while b measures the size of the change in the mean value of y due to a unit change in x
- r does not distinguish between x and y, while b does
- r is scale-free, while b is not

But:
- r and b have the same sign
- neither r nor b implies causation
- both r and b can be affected by outliers
- r = 0 if and only if b = 0, thus the test of β = 0 is equivalent to the test of ρ = 0 (i.e. no linear relationship)
Simple linear regression
Test of H0: β = 0 versus H1: β ≠ 0

1. t-test:
   - Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data
   - Under H0, T follows a Student-t distribution with n − 2 degrees of freedom
2. F-test:
   - Test statistic: F = (b / SE(b))² = T², where SE(b) and T are as above
   - Under H0, F follows an F distribution with 1 and n − 2 degrees of freedom
   - The t-test and the F-test lead to the same outcome

Note: The test of zero intercept α is of less interest, unless x = 0 is meaningful
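With the values from Example 2's SPSS output (b = 1.201, SE(b) = .093), the two statistics can be reproduced up to rounding; because the inputs are the printed rounded values, the results differ slightly from SPSS's own 12.917 and 166.859:

```python
b, se_b = 1.201, 0.093   # rounded values from the SPSS Coefficients table (Example 2)

t = b / se_b             # t-test statistic for H0: beta = 0
f = t ** 2               # F-test statistic: the same test, squared
print(round(t, 3), round(f, 3))
```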
Simple linear regression
Example 2: blood pressure (mmHg) versus body weight (kg) in 20 patients with hypertension (Daniel & Cross, 2013)

[Figure: scatter plot of BP (mmHg) versus Weight (kg)]
Simple linear regression
SPSS output:

Coefficients
Model              Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                   B         Std. Error          Beta
1   (Constant)     2.205     8.663                                           .255     .802
    Weight         1.201     .093                .950                        12.917   .000
a. Dependent Variable: BP

From the output above, the regression equation is BP = 2.20 + 1.20 · Weight
ANOVA
Model           Sum of Squares   df   Mean Square   F         Sig.
1   Regression  505.472           1   505.472       166.859   .000
    Residual     54.528          18     3.029
    Total       560.000          19
a. Dependent Variable: BP
b. Predictors: (Constant), Weight

The F statistic in this table is the F-test of H0: β = 0.
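The ANOVA table's entries are internally consistent, which we can verify from the printed (rounded) sums of squares:

```python
# Rounded values from the ANOVA table above
ss_reg, ss_res = 505.472, 54.528
df_reg, df_res = 1, 18

ss_total = ss_reg + ss_res                   # 560.000, the Total row
f = (ss_reg / df_reg) / (ss_res / df_res)    # F = MS_regression / MS_residual
r_squared = ss_reg / ss_total                # matches R Square in the Model Summary
print(round(f, 3), round(r_squared, 3))
```

The recomputed F agrees with the printed 166.859, and the ratio of sums of squares reproduces the R Square of .903 reported in the Model Summary.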
Simple linear regression
Standardized coefficients
- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- After standardization, the intercept will be equal to zero and the slope for x will be equal to the sample correlation coefficient
- Of greater concern in multiple linear regression (next lecture), where the predictors are expressed in different units
  - Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x’s, so they can be meaningfully compared
  - The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable in the prediction of y
- Standardized and unstandardized coefficients have the same sign and their significance tests are equivalent
Simple linear regression
Simple linear regression is only appropriate when the following assumptions are satisfied:

1. Independence: the observations are independent, i.e. there is only one pair of observations per subject
2. Linearity: the relationship between x and y is linear
3. Constant variance: the variance of y is constant for all values of x
4. Normality: y has a Normal distribution
Simple linear regression
Checking the linearity assumption:

1. Make a scatter plot of y versus x
   - If the assumption of linearity is met, the points in this plot should generally form a straight line
2. Plot the residuals against the explanatory variable x
   - If the assumption of linearity is met, we should see a random scatter of points around zero rather than any systematic pattern
[Figure: two plots of residuals e versus x — "Linearity": random scatter around 0; "Lack of linearity": a systematic pattern in the residuals]
Simple linear regression
Checking the constant variance assumption:
- Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷi = a + b · xi)
- If the assumption is met, we expect to observe a random scatter of points
- If the scatter of the residuals increases or decreases as ŷ increases, then this assumption is not satisfied
[Figure: two residual plots — "Constant variance": random scatter of constant spread around 0; "Non-constant variance": spread that changes with the fitted values]
Simple linear regression
Checking the normality assumption:

1. Draw a histogram of the residuals and “eyeball” the result
2. Make a normal probability plot (P–P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual
   - If the assumption of normality is met, the points in this plot should form a straight diagonal line
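The coordinates of such a P–P plot can be computed without any plotting library; a sketch (the residuals are made up, and the (i − 0.5)/n convention for the empirical cumulative probability is one common choice):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pp_plot_points(residuals):
    """(observed, expected) cumulative probabilities for a normal P-P plot."""
    n = len(residuals)
    m = sum(residuals) / n
    s = math.sqrt(sum((r - m) ** 2 for r in residuals) / (n - 1))
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        observed = (i - 0.5) / n              # empirical cumulative probability
        expected = normal_cdf((r - m) / s)    # normal cumulative probability
        points.append((observed, expected))
    return points

pts = pp_plot_points([-1.2, 0.3, 0.8, -0.5, 1.6, -0.9, 0.1, -0.2])
# Under normality the points lie close to the diagonal observed = expected
```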
Simple linear regression
Outliers

- An outlier is a data point that stands apart from the overall pattern seen in the scatter plot (i.e. an unusual or unexpected observation)
- It can be detected by looking at a scatter plot or residual plot
- We should always search for an explanation for any outliers
- Common sources of outliers include: human and measurement errors during data collection and entry, sampling error and chance
- Some outliers can be corrected or removed, but some cannot
- In general, outliers that cannot be corrected should not be removed
- Outliers may influence the estimates of the model parameters and thus the study conclusions
  - To determine this influence, fit the line with and without the questionable points and see what happens
Simple linear regression
Assessing goodness of fit

- The estimated regression line is the “best” one available (in the least-squares sense)
- Yet, it can still be a very poor fit to the observed data
[Figure: two scatter plots with fitted lines — "Good fit": points close to the line; "Bad fit": points widely scattered around the line]
Simple linear regression
To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:

1. Calculate the correlation coefficient between the predicted and observed values of y, R
   - A higher absolute value of R indicates a better fit (predicted and observed values of y are closer to each other)
2. Calculate R² (R Square in SPSS)
   - 0 ≤ R² ≤ 1
   - A higher value of R² indicates a better fit
   - R² = 1 indicates a perfect fit (i.e. ŷi = yi for each i)
   - R² = 0 indicates a very poor fit
Simple linear regression
Alternatively, R² can be calculated as

R² = Σ_{i=1}^{n} (ŷi − ȳ)² / Σ_{i=1}^{n} (yi − ȳ)² = (variation in y explained by x) / (total variation in y)

- We interpret R² as the proportion of the total variability in y that can be explained by the explanatory variable x
  - An R² of 1 means that x explains all variability in y
  - An R² of 0 indicates that x does not explain any variability in y
- R² is usually expressed as a percentage. For example, R² = 0.93 indicates that 93% of the total variation in y can be explained by x
- In SPSS, R² can be found in the Model Summary table or it can be calculated from the ANOVA table; both tables are produced when running linear regression
Simple linear regression
Example 2 revisited: blood pressure vs body weight
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .950a   .903       .897                1.74050
a. Predictors: (Constant), Weight
Simple linear regression
Prediction: interpolation versus extrapolation
[Figure: fitted line over the range of actual data, with several possible diverging patterns of additional data beyond that range]
Extrapolation beyond the range of the data is risky!!
Categorical explanatory variable
- So far we have assumed that the predictor variable is numerical
- But what if we want to study an association between y and a categorical x, e.g. between blood pressure and gender or between skin cancer mortality and race/ethnicity?
- Categorical variables can be incorporated into a regression model through one or more indicator or dummy variables that take on the values 0 and 1
- In general, to include a variable with p categories/levels, p − 1 dummy variables are required
Categorical explanatory variable
Example: variable with 4 categories, e.g. blood group (A, B, AB, 0)

Basic steps:
1. Create dummy variables for all categories

   xA  = 1 if blood group is A,  0 otherwise
   xB  = 1 if blood group is B,  0 otherwise
   xAB = 1 if blood group is AB, 0 otherwise
   x0  = 1 if blood group is 0,  0 otherwise
Categorical explanatory variable

In a dataset:

Subject ID   Blood group   xA    xB    xAB   x0
1            A             1     0     0     0
2            B             0     1     0     0
3            0             0     0     0     1
4            AB            0     0     1     0
5            B             0     1     0     0
6            A             1     0     0     0
7            0             0     0     0     1
8            B             0     1     0     0
9            AB            0     0     1     0
...          ...           ...   ...   ...   ...

2. Select one blood group as a reference category
   - Choose a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects
3. Include in the model all dummies except the one corresponding to the reference category
Categorical explanatory variable
Taking blood group 0 as the reference category, the model becomes

y = α + βA · xA + βB · xB + βAB · xAB + ε

and its estimated counterpart is

ŷ = a + bA · xA + bB · xB + bAB · xAB

- Estimation of the model parameters requires running a multiple linear regression (next lecture), unless the explanatory variable has only two categories (e.g. gender)
- Given that y represents IQ score, the estimated coefficients are interpreted as follows:
  - a is the mean IQ for subjects with blood group 0, i.e. the reference category
  - Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category)
Categorical explanatory variable
Specifically:
- bA is the difference between the mean IQ in subjects with blood group A and the mean IQ in subjects with blood group 0, i.e. bA = ŷ(xA = 1, xB = 0, xAB = 0) − a
- bB is the difference between the mean IQ in subjects with blood group B and the mean IQ in subjects with blood group 0, i.e. bB = ŷ(xA = 0, xB = 1, xAB = 0) − a
- bAB is the difference between the mean IQ in subjects with blood group AB and the mean IQ in subjects with blood group 0, i.e. bAB = ŷ(xA = 0, xB = 0, xAB = 1) − a

Note:
A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p − 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.
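Because the dummy-coded model is saturated over the p groups, least squares reproduces the group means exactly, so the coefficients can be read off from them directly; a sketch with hypothetical IQ data (the scores are invented):

```python
import statistics

# Hypothetical IQ scores by blood group; group "0" is the reference category
iq = {"A": [102, 98, 110], "B": [95, 101, 99, 97], "AB": [108, 104], "0": [100, 96, 98]}

means = {g: statistics.mean(v) for g, v in iq.items()}
a = means["0"]                                             # intercept: mean of the reference group
b = {g: means[g] - means["0"] for g in ("A", "B", "AB")}   # each b: mean difference vs reference
print(a, {g: round(v, 2) for g, v in b.items()})
```

A b of 0 (here for group B, whose invented mean equals the reference mean) means no mean difference from the reference category, which is exactly what the t-test or F-test on that coefficient examines.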
References
Gerald van Belle, Lloyd D. Fisher, Patrick J. Heagerty, Thomas Lumley. Biostatistics: A Methodology for the Health Sciences, 2nd edition. John Wiley & Sons, 2004.

Wayne W. Daniel, Chad L. Cross. Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition. John Wiley & Sons, 2013.