Basic Medical Statistics Course
Correlation and simple linear regression
S6
Patrycja [email protected]
December 3, 2014
Introduction
- So far we have looked at the association between:
  - Two categorical variables (chi-square test)
  - A numerical variable and a categorical variable (independent samples t-test and ANOVA)
- We will now look at the association between two numerical (continuous) variables, say x and y
Introduction
Example 1: Mortality from malignant melanoma of the skin versus latitude of residency among white males in the United States (van Belle et al, 2004)

 #    State         Latitude (degrees North)   Mortality rate (deaths per 10 million)¹
 1    Alabama       33.0                       219
 2    Arizona       34.5                       160
 3    Arkansas      35.0                       170
 4    California    37.5                       182
 5    Colorado      39.0                       149
 6    Connecticut   41.8                       159
 7    Delaware      39.0                       200
 ...
 48   Wisconsin     44.5                       110
 49   Wyoming       43.0                       134

How do we investigate the association between these two variables?

¹ Mortality rate for the period 1950–1959
Relationship between two numerical variables
If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and

1. Calculate Pearson’s product moment correlation coefficient between x and y
   - Measures how closely the data points on the scatter plot resemble a straight line
2. Perform a simple linear regression analysis
   - Finds the equation of the line that best describes the relationship between the variables seen in a scatter plot
Correlation
The sample Pearson’s product moment correlation coefficient, or correlation coefficient, between variables x and y is calculated as

r(x, y) = 1/(n − 1) · Σ_{i=1}^{n} ((xi − x̄)/sx) · ((yi − ȳ)/sy) = 1/(n − 1) · Σ_{i=1}^{n} z_xi · z_yi

where {(xi, yi) : i = 1, ..., n} is a random sample of n observations on x and y, x̄ and ȳ are the sample means of x and y respectively, sx and sy are the corresponding sample standard deviations, and z_xi and z_yi are the z-scores of x and y for the i-th observation.
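As a check on the formula, here is a minimal Python sketch (the helper name is ours; the five (latitude, mortality) pairs are the first rows of Example 1):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation via the z-score formula above."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (n - 1 in the denominator)
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # r = 1/(n-1) * sum over i of z_xi * z_yi
    return sum(((xi - mean_x) / sx) * ((yi - mean_y) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

latitude = [33.0, 34.5, 35.0, 37.5, 39.0]   # first five states of Example 1
mortality = [219, 160, 170, 182, 149]
print(round(pearson_r(latitude, mortality), 3))
```

On these five states alone r is already clearly negative, in line with the full-sample result shown later.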
Correlation
Properties of r:

- r estimates the true population correlation coefficient (ρ)
- r takes on any value between −1 and 1, i.e. −1 ≤ r ≤ 1
- The magnitude of r indicates the strength of a linear relationship between x and y:
  - r = −1 or 1 means perfect linear association
  - r = 0 indicates no linear association (but the association can be e.g. non-linear)
  - The closer r is to −1 or 1, the stronger the linear association (e.g. r = −0.1 (weak association) vs r = 0.85 (strong association))
- The sign of r indicates the direction of association:
  - r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction
  - r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions
Correlation
Properties of r (cont):

- r(a · x + b, c · y + d) = r(x, y), where a > 0, c > 0, and b and d are constants
- r(x, y) = r(y, x)
- r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other!
- r² is called the coefficient of determination
  - r² is a number between 0 and 1
  - It represents the proportion of total variation in one variable that is explained by the other
  - For example, a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% is explained by other factors.
Correlation

[Figure: example scatter plots illustrating r = −1, r = 1, r = 0.8, r = −0.8, two patterns with r = 0, 0 < r < 1, and −1 < r < 0]
Don’t interpret r without looking at the scatter plot!
Correlation
Hypothesis test for the population correlation coefficient ρ:

H0: ρ = 0
H1: ρ ≠ 0

Under H0, the test statistic

T = r · √[(n − 2)/(1 − r²)]

follows a Student-t distribution with n − 2 degrees of freedom.

Note:
- This test assumes that the variables are normally distributed
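Plugging in the values from Example 1 (r = −.825, n = 49, shown on the following slides) gives a feel for the size of this statistic; a small sketch (the function name is ours):

```python
import math

def corr_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)); under H0 it is Student-t with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Example 1: r = -.825 across n = 49 states
t = corr_t_statistic(-0.825, 49)
print(round(t, 2))  # |T| is far beyond the two-sided t critical value for 47 df
```

An absolute T of about 10 on 47 degrees of freedom corresponds to the tiny p-value (.000) reported in the SPSS output.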
Correlation
Example 1 revisited: skin cancer mortality vs latitude
[Figure: scatter plot of Mortality (deaths per 10 million) versus Latitude (degrees North)]

What are the magnitude and sign of the correlation coefficient between latitude and skin cancer mortality?
Correlation
Example 1 revisited: skin cancer mortality vs latitude
SPSS output:

Correlations
                                   Mortality   Latitude
Mortality   Pearson Correlation    1           -.825**
            Sig. (2-tailed)                    .000
            N                      49          49
Latitude    Pearson Correlation    -.825**     1
            Sig. (2-tailed)        .000
            N                      49          49
**. Correlation is significant at the 0.01 level (2-tailed).

Here r = −.825, the p-value is .000 (i.e. p < .001), and n = 49.
Simple linear regression
- Pearson’s product moment correlation coefficient measures the strength and direction of the linear association between x and y
- But often we are also interested in predicting the value of one variable given the value of the other
- This requires finding an equation (or mathematical model) that describes or summarizes the relationship between the variables
- If a scatter plot of our data shows an approximately linear relationship between x and y, we can use simple linear regression to estimate the equation of this line
- Regression, unlike correlation, requires that we have
  - a dependent variable (or outcome or response variable), i.e. the variable being predicted (always on the vertical or y-axis)
  - an independent variable (or explanatory or predictor variable), i.e. the variable used for prediction (always on the horizontal or x-axis)
- Let’s assume that x and y are the independent variable and the dependent variable, respectively
Simple linear regression
Simple linear regression postulates that in the population

y = (α + β · x) + ε,

where:
- y is the dependent variable
- x is the independent variable
- α and β are parameters called population regression coefficients
- ε is a random error term
Simple linear regression
[Figure: scatter plot of y versus x, with the distribution of y shown at x1, x2, x3, x4, x5]

E(y|xi) is the mean value of y when x = xi
Simple linear regression
[Figure: the same plot with the line E(y|x) = α + β·x drawn through the means E(y|xi) at x1, ..., x5]

E(y|x) = α + β · x is the population regression function
Simple linear regression
[Figure: the line E(y|x) = α + β·x, with α marked as the y-intercept and β as the change in E(y|x) per unit increase in x]

- α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0
- β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x
- c · β is the mean change in y for a c-unit increase in the value of x
- α and β are estimated from the sample data using the method of least squares (usually)
Simple linear regression
[Figure: fitted line ŷ = a + b·x, with a data point (xi, yi) and its residual ei = yi − ŷi]

The least squares method chooses a and b (estimates for α and β) to minimize the sum of the squares of the residuals

Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} [yi − (a + b · xi)]²
Simple linear regression
The least squares estimates for α and β are:

b = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²

and

a = ȳ − b · x̄,

where x̄ and ȳ are the respective sample means of x and y.

Note that:

b = r(x, y) · sy/sx,

where r(x, y) is the sample product moment correlation between x and y, and sx and sy are the sample standard deviations of x and y.
Simple linear regression
Relationship between the slope b and the correlation coefficient r

- r ≠ b unless sx = sy
- r measures the strength of a linear association between x and y, while b measures the size of the change in the mean value of y due to a unit change in x
- r does not distinguish between x and y, while b does
- r is scale-free, while b is not

But:
- r and b have the same sign
- neither r nor b implies causation
- both r and b can be affected by outliers
- r = 0 if and only if b = 0, thus the test of β = 0 is equivalent to the test of ρ = 0 (i.e. no linear relationship)
Simple linear regression
Test of H0: β = 0 versus H1: β ≠ 0

1. t-test:
   - Test statistic: T = b / SE(b), where SE(b) is the standard error of b calculated from the data
   - Under H0, T follows a Student-t distribution with n − 2 degrees of freedom
2. F-test:
   - Test statistic: F = (b / SE(b))² = T², where SE(b) and T are as above
   - Under H0, F follows an F distribution with 1 and n − 2 degrees of freedom
   - The t-test and the F-test lead to the same outcome

Note: The test of zero intercept α is of less interest, unless x = 0 is meaningful
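With the values from Example 2's SPSS output (b = 1.201, SE(b) = .093), the two statistics can be reproduced up to rounding; because the inputs are the printed rounded values, the results differ slightly from SPSS's own 12.917 and 166.859:

```python
b, se_b = 1.201, 0.093   # rounded values from the SPSS Coefficients table (Example 2)

t = b / se_b             # t-test statistic for H0: beta = 0
f = t ** 2               # F-test statistic: the same test, squared
print(round(t, 3), round(f, 3))
```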
Simple linear regression
Example 2: blood pressure (mmHg) versus body weight (kg) in 20 patients with hypertension (Daniel & Cross, 2013)

[Figure: scatter plot of BP (mmHg) versus Weight (kg)]
Simple linear regression
SPSS output:

Coefficients
Model              Unstandardized Coefficients   Standardized Coefficients   t        Sig.
                   B         Std. Error          Beta
1   (Constant)     2.205     8.663                                           .255     .802
    Weight         1.201     .093                .950                        12.917   .000
a. Dependent Variable: BP

From the output above, the regression equation is BP = 2.20 + 1.20 · Weight
ANOVA
Model           Sum of Squares   df   Mean Square   F         Sig.
1   Regression  505.472           1   505.472       166.859   .000
    Residual     54.528          18     3.029
    Total       560.000          19
a. Dependent Variable: BP
b. Predictors: (Constant), Weight

The F statistic in this table is the F-test of H0: β = 0.
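The ANOVA table's entries are internally consistent, which we can verify from the printed (rounded) sums of squares:

```python
# Rounded values from the ANOVA table above
ss_reg, ss_res = 505.472, 54.528
df_reg, df_res = 1, 18

ss_total = ss_reg + ss_res                   # 560.000, the Total row
f = (ss_reg / df_reg) / (ss_res / df_res)    # F = MS_regression / MS_residual
r_squared = ss_reg / ss_total                # matches R Square in the Model Summary
print(round(f, 3), round(r_squared, 3))
```

The recomputed F agrees with the printed 166.859, and the ratio of sums of squares reproduces the R Square of .903 reported in the Model Summary.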
Simple linear regression
Standardized coefficients
- Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression
- After standardization, the intercept will be equal to zero and the slope for x will be equal to the sample correlation coefficient
- Of greater concern in multiple linear regression (next lecture), where the predictors are expressed in different units
  - Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x’s, so they can be meaningfully compared
  - The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable in the prediction of y
- Standardized and unstandardized coefficients have the same sign and their significance tests are equivalent
Simple linear regression
Simple linear regression is only appropriate when the following assumptions are satisfied:

1. Independence: the observations are independent, i.e. there is only one pair of observations per subject
2. Linearity: the relationship between x and y is linear
3. Constant variance: the variance of y is constant for all values of x
4. Normality: y has a Normal distribution
Simple linear regression
Checking the linearity assumption:

1. Make a scatter plot of y versus x
   - If the assumption of linearity is met, the points in this plot should generally form a straight line
2. Plot the residuals against the explanatory variable x
   - If the assumption of linearity is met, we should see a random scatter of points around zero rather than any systematic pattern
[Figure: two plots of residuals e versus x — "Linearity": random scatter around 0; "Lack of linearity": a systematic pattern in the residuals]
Simple linear regression
Checking the constant variance assumption:
- Make a residual plot, i.e. plot the residuals against the fitted values of y (ŷi = a + b · xi)
- If the assumption is met, we expect to observe a random scatter of points
- If the scatter of the residuals increases or decreases as ŷ increases, then this assumption is not satisfied
[Figure: two residual plots — "Constant variance": random scatter of constant spread around 0; "Non-constant variance": spread that changes with the fitted values]
Simple linear regression
Checking the normality assumption:

1. Draw a histogram of the residuals and “eyeball” the result
2. Make a normal probability plot (P–P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual
   - If the assumption of normality is met, the points in this plot should form a straight diagonal line
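The coordinates of such a P–P plot can be computed without any plotting library; a sketch (the residuals are made up, and the (i − 0.5)/n convention for the empirical cumulative probability is one common choice):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def pp_plot_points(residuals):
    """(observed, expected) cumulative probabilities for a normal P-P plot."""
    n = len(residuals)
    m = sum(residuals) / n
    s = math.sqrt(sum((r - m) ** 2 for r in residuals) / (n - 1))
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        observed = (i - 0.5) / n              # empirical cumulative probability
        expected = normal_cdf((r - m) / s)    # normal cumulative probability
        points.append((observed, expected))
    return points

pts = pp_plot_points([-1.2, 0.3, 0.8, -0.5, 1.6, -0.9, 0.1, -0.2])
# Under normality the points lie close to the diagonal observed = expected
```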
Simple linear regression
Outliers

- An outlier is a data point that stands apart from the overall pattern seen in the scatter plot (i.e. an unusual or unexpected observation)
- It can be detected by looking at a scatter plot or residual plot
- We should always search for an explanation for any outliers
- Common sources of outliers include: human and measurement errors during data collection and entry, sampling error and chance
- Some outliers can be corrected or removed, but some cannot
- In general, outliers that cannot be corrected should not be removed
- Outliers may influence the estimates of the model parameters and thus the study conclusions
  - To determine this influence, fit the line with and without the questionable points and see what happens
Simple linear regression
Assessing goodness of fit

- The estimated regression line is the “best” one available (in the least-squares sense)
- Yet, it can still be a very poor fit to the observed data
[Figure: two scatter plots with fitted lines — "Good fit": points close to the line; "Bad fit": points widely scattered around the line]
Simple linear regression
To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can:

1. Calculate the correlation coefficient between the predicted and observed values of y, R
   - A higher absolute value of R indicates a better fit (predicted and observed values of y are closer to each other)
2. Calculate R² (R Square in SPSS)
   - 0 ≤ R² ≤ 1
   - A higher value of R² indicates a better fit
   - R² = 1 indicates a perfect fit (i.e. ŷi = yi for each i)
   - R² = 0 indicates a very poor fit
Simple linear regression
Alternatively, R² can be calculated as

R² = Σ_{i=1}^{n} (ŷi − ȳ)² / Σ_{i=1}^{n} (yi − ȳ)² = (variation in y explained by x) / (total variation in y)

- We interpret R² as the proportion of the total variability in y that can be explained by the explanatory variable x
  - An R² of 1 means that x explains all variability in y
  - An R² of 0 indicates that x does not explain any variability in y
- R² is usually expressed as a percentage. For example, R² = 0.93 indicates that 93% of the total variation in y can be explained by x
- In SPSS, R² can be found in the Model Summary table or it can be calculated from the ANOVA table; both tables are produced when running linear regression
Simple linear regression
Example 2 revisited: blood pressure vs body weight
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .950a   .903       .897                1.74050
a. Predictors: (Constant), Weight
Simple linear regression
Prediction: interpolation versus extrapolation
[Figure: fitted line over the range of actual data, with several possible diverging patterns of additional data beyond that range]
Extrapolation beyond the range of the data is risky!!
Categorical explanatory variable
- So far we have assumed that the predictor variable is numerical
- But what if we want to study an association between y and a categorical x, e.g. between blood pressure and gender or between skin cancer mortality and race/ethnicity?
- Categorical variables can be incorporated into a regression model through one or more indicator or dummy variables that take on the values 0 and 1
- In general, to include a variable with p categories/levels, p − 1 dummy variables are required
Categorical explanatory variable
Example: variable with 4 categories, e.g. blood group (A, B, AB, 0)

Basic steps:
1. Create dummy variables for all categories

   xA  = 1 if blood group is A,  0 otherwise
   xB  = 1 if blood group is B,  0 otherwise
   xAB = 1 if blood group is AB, 0 otherwise
   x0  = 1 if blood group is 0,  0 otherwise
Categorical explanatory variable

In a dataset:

Subject ID   Blood group   xA    xB    xAB   x0
1            A             1     0     0     0
2            B             0     1     0     0
3            0             0     0     0     1
4            AB            0     0     1     0
5            B             0     1     0     0
6            A             1     0     0     0
7            0             0     0     0     1
8            B             0     1     0     0
9            AB            0     0     1     0
...          ...           ...   ...   ...   ...

2. Select one blood group as a reference category
   - Choose a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects
3. Include in the model all dummies except the one corresponding to the reference category
Categorical explanatory variable
Taking blood group 0 as the reference category, the model becomes

y = α + βA · xA + βB · xB + βAB · xAB + ε

and its estimated counterpart is

ŷ = a + bA · xA + bB · xB + bAB · xAB

- Estimation of the model parameters requires running a multiple linear regression (next lecture), unless the explanatory variable has only two categories (e.g. gender)
- Given that y represents IQ score, the estimated coefficients are interpreted as follows:
  - a is the mean IQ for subjects with blood group 0, i.e. the reference category
  - Each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category)
Categorical explanatory variable
Specifically:
- bA is the difference between the mean IQ in subjects with blood group A and the mean IQ in subjects with blood group 0, i.e. bA = ŷ(xA = 1, xB = 0, xAB = 0) − a
- bB is the difference between the mean IQ in subjects with blood group B and the mean IQ in subjects with blood group 0, i.e. bB = ŷ(xA = 0, xB = 1, xAB = 0) − a
- bAB is the difference between the mean IQ in subjects with blood group AB and the mean IQ in subjects with blood group 0, i.e. bAB = ŷ(xA = 0, xB = 0, xAB = 1) − a

Note:
A test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p − 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.
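Because the dummy-coded model is saturated over the p groups, least squares reproduces the group means exactly, so the coefficients can be read off from them directly; a sketch with hypothetical IQ data (the scores are invented):

```python
import statistics

# Hypothetical IQ scores by blood group; group "0" is the reference category
iq = {"A": [102, 98, 110], "B": [95, 101, 99, 97], "AB": [108, 104], "0": [100, 96, 98]}

means = {g: statistics.mean(v) for g, v in iq.items()}
a = means["0"]                                             # intercept: mean of the reference group
b = {g: means[g] - means["0"] for g in ("A", "B", "AB")}   # each b: mean difference vs reference
print(a, {g: round(v, 2) for g, v in b.items()})
```

A b of 0 (here for group B, whose invented mean equals the reference mean) means no mean difference from the reference category, which is exactly what the t-test or F-test on that coefficient examines.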
References
Gerald van Belle, Lloyd D. Fisher, Patrick J. Heagerty, Thomas Lumley. Biostatistics: A Methodology for the Health Sciences, 2nd edition. John Wiley & Sons, 2004.

Wayne W. Daniel, Chad L. Cross. Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition. John Wiley & Sons, 2013.