Correlation and Simple Linear Regression

Correlation and Simple Linear Regression. Pearson’s Product Moment Correlation (sample correlation r estimates population correlation ) Measures the



Pearson’s Product Moment Correlation (sample correlation r estimates population correlation $\rho$)

• Measures the strength of linear association between two numeric variables X and Y.

• The correlation (r) is a unit-less quantity.

• -1 ≤ r ≤ 1

• If r < 0 then there is a negative association between X and Y, i.e. as X increases Y generally decreases.

• If r > 0 then there is a positive association between X and Y, i.e. as X increases Y generally increases.

Pearson’s Product Moment Correlation (sample correlation r estimates population correlation $\rho$)

• The closer r is to 0, the weaker the linear association between X and Y.

• Sample Correlation (r)

[Figure: adjective scale for the sample correlation coefficient (r).]

Pearson’s Product Moment Correlation (r)

Some examples of various positive and negative correlations.

Pearson’s Product Moment Correlation (sample correlation r estimates population correlation $\rho$)

• Never calculate a correlation coefficient without plotting the data!

• Correlation is NOT causation!

• Beware of influential points.

[Figure: two panels, r = .11 and r = .72.]


Pearson’s Product Moment Correlation (sample correlation r estimates population correlation $\rho$)

• Beware of outliers

[Figure: two panels, r = .86 and r = .66.]

Examples where no form of correlation coefficient is appropriate

Pearson’s Product Moment Correlation (r)

• The formula (and equivalencies)

$$
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
  = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{1}{n-1}\sum_{i=1}^{n}(z\text{-score } x_i)(z\text{-score } y_i)
$$
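The equivalence between the sums-of-squares form and the z-score form of r can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function names and the five data points are made up for illustration and do not come from the slides.

```python
import math

def pearson_r_sums(x, y):
    """r from the sums-of-squares form of the formula."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_zscores(x, y):
    """r as the sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    products = [((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)]
    return sum(products) / (n - 1)

# Made-up toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_sums = pearson_r_sums(x, y)
r_z = pearson_r_zscores(x, y)
print(round(r_sums, 4), round(r_z, 4))
```

Points with z-scores of the same sign contribute positive products and push r up; points with z-scores of opposite sign contribute negative products and pull r down.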

Pearson’s Product Moment Correlation (r)

$$
r = \frac{1}{n-1}\sum_{i=1}^{n}(z\text{-score } x_i)(z\text{-score } y_i)
$$

[Figure: scatter plot of y vs. x split into four regions: $z_x, z_y > 0$; $z_x, z_y < 0$; $z_x < 0$ & $z_y > 0$; $z_x > 0$ & $z_y < 0$.]


Nonlinear Relationships

Not all relationships are linear. In cases where there is clear evidence of a nonlinear relationship DO NOT use Pearson’s Product Moment Correlation (r) to summarize the strength of the relationship between Y and X.

Clearly artery pressure (Y) and change in flow velocity (X) are nonlinearly related.

Pulmonary Artery Pressure vs. Relative Flow Velocity Change

Testing for Significant Correlation

Testing Population Correlation ($\rho$)

$$
H_0: \rho = 0 \qquad H_a: \rho \neq 0 \ (\text{or } \rho > 0 \text{ or } \rho < 0)
$$

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t\text{-dist.}, \quad df = n-2
$$

Most software packages will conduct this test for any correlation you are interested in. When looking at multiple correlations, be sure to consider a Bonferroni correction.
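As a rough illustration of the test statistic, the sketch below computes t and its degrees of freedom from r and n, and the per-test significance level under a Bonferroni correction. The function names are hypothetical, and the inputs (r = .80, n = 40, five correlations) are assumed values chosen for illustration; a p-value would still come from a t-table or a statistics package.

```python
import math

def correlation_t_statistic(r, n):
    """Return (t, df) for testing H0: rho = 0, with t = r*sqrt(n-2)/sqrt(1-r^2)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level when num_tests correlations are tested."""
    return alpha / num_tests

# Assumed inputs for illustration: r = 0.80 from a sample of n = 40 pairs
t_stat, df = correlation_t_statistic(0.80, 40)
print(f"t = {t_stat:.2f} with df = {df}")

# Testing 5 correlations at an overall level of 0.05
print(bonferroni_alpha(0.05, 5))
```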

Other measures of correlation/association

• Spearman’s Rank Correlation ($r_s$): can be used with ordinal data where the levels are in some sense equidistant. Can also be used when relationships are nonlinear but monotonic. It involves ranking the x’s and y’s and finding Pearson’s correlation based on the ranks.

• Kendall’s Tau (a, b, & c): is used when X and Y are both ordinal and they need not be “equidistant”.

a: does not adjust for ties

b: adjusts for ties, X and Y must have the same number of levels

c: adjusts for ties, X and Y need not have the same number of levels

Data for Kendall’s Tau-a, b, or c could be summarized using a contingency table.
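The description of Spearman’s rank correlation (rank the x’s and y’s, then compute Pearson’s correlation on the ranks) can be sketched as follows. This is a minimal standard-library version under the assumption that ties receive the average of the ranks they occupy; the helper names and the data are made up for illustration.

```python
import math

def average_ranks(values):
    """Ranks 1..n; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's r_s: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up data: monotonic but nonlinear (y = x^3)
x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]

r = pearson_r(x, y)      # below 1 because the relationship is curved
rs = spearman_rs(x, y)   # equals 1 because the ranks agree exactly
print(round(r, 3), round(rs, 3))
```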

Other measures of correlation/association

• Biserial Correlation: correlation when X is continuous and Y is naturally dichotomous (e.g. male/female, smoker/non-smoker) or a created dichotomy (e.g. Age ≤ 18 / Age > 18).

• Polyserial Correlation: correlation when X is continuous and Y is ordinal.

• JMP/SPSS compute Spearman’s and Kendall’s tau; however, the others require specialized software.

Examples: NC Births Data

Age of Father vs. Age of Mother

The Pearson Product Moment Correlation (r = .7543, p < .0001) suggests a fairly strong correlation between father’s age and mother’s age.

The line represents the mean age of the father given the mother’s age.

Examples: NC Births Data

Birth weight (g) and Gestational Age (wks)

Is the relationship between the birth weight and gestational age linear?

Hard to say; consider adding a smoothing spline to the plot. The smoothing spline helps us see the relationship between the mean birth weight and gestational age.

The notation we use for the mean of Y given X is E(Y|X). Here we have E(Birth Weight|Gest. Age).

[Figure: birth weight vs. gestational age with a smooth estimate of E(Birth Weight|Gest. Age).]

Examples: NC Births Data

Birth weight (g) and Gestational Age (wks)

The Pearson Product Moment Correlation (r = .5515, p < .0001) suggests a moderate correlation between gestational age and birth weight. However, the smooth curve estimate of the mean birth weight suggests the relationship is not linear, so the Pearson correlation may not be appropriate.

We could consider using Spearman’s rank correlation instead (rs = .3056, p < .0001).

Examples: NC Births Data

Birth Weight vs. Smoking Status

The biserial correlation between smoking status and birth weight is $r_{bs}$ = -.1175 (p = .0002), suggesting a weak negative association between smoking and birth weight.

The biserial correlation is found by computing Pearson’s correlation between birth weight and smoking status, where smoking status is coded as 0 = no, 1 = yes and treated as a numeric quantity.

Medicare Survey Data – General Health at Baseline & Follow-up

Association between baseline & follow-up general health (revisited)

Kendall’s Tau can be used to measure the degree of association between two ordinal variables; here the association is significant (p < .0001).

Simple Linear Regression

• Regression refers to the estimation of the mean of a response (Y) given information about a single predictor X in the case of simple regression, or multiple X’s in the case of multiple regression.

• We denote this mean as E(Y|X) in the case of simple regression and $E(Y|X_1, X_2, \ldots, X_p)$ in the case of multiple regression with p potential predictors.

Simple Linear Regression

• In the case of simple linear regression we assume that the mean can be modeled using a function that is a linear function of unknown parameters, e.g.

$$
E(Y|X) = \beta_0 + \beta_1 X
$$

or

$$
E(Y|X) = \beta_0 + \beta_1 X + \beta_2 X^2
$$

or

$$
E(Y|X) = \beta_0 + \beta_1 \ln(X)
$$

These are all examples of simple linear regression models. When many people think of simple linear regression they think of the first mean function because it is the equation of a line. This model will be our primary focus.

Simple Linear Regression Model

The regression model:

$$
y = E(Y|X) + \text{error}
$$

i.e. observed y = mean + random scatter (trend + scatter).

The data model:

$$
y_i = E(Y|x_i) + e_i
$$

The fitted model:

$$
y_i = \hat{E}(Y|x_i) + \hat{e}_i = \hat{y}_i + \hat{e}_i
$$

where $\hat{y}_i$ is the estimated mean function and $\hat{e}_i$ is the residual (estimated random error).

[Figure: scatter of y about the mean function E(Y|X), with bands at E(Y|X) + SD(Y|X) and E(Y|X) - SD(Y|X).]

Assumptions

1. The assumed functional form for the mean function E(Y|X) is correct, e.g. we might assume a line for the mean function, $E(Y|X) = \beta_0 + \beta_1 X$.

2. The random scatter, i.e. the $e_i$’s (“random errors”). We assume that $e_i \sim N(0, \sigma^2)$ and independent. This means that for a given X, Y is normal with constant variation for all X.

3. Error standard deviation ($\sigma$)
– Std. dev. of the random process producing the “errors” $e_i$
– Governs the amount of scatter about the mean E(Y|X)
• big $\sigma$: lots of scatter; $\sigma = 0$: no scatter

Assumptions (cont’d)

$$
E(Y|X) = \beta_0 + \beta_1 X
$$

$$
Y\,|\,X = x_i \;\sim\; N(\beta_0 + \beta_1 x_i,\ \sigma^2)
$$

Note: Normality is required for inference, i.e. t-tests, F-tests, and CI’s for model parameters and predictions.

Gestational Age and Weight

In a study of premature infants, researchers looked at gestational age (weeks) and weight, and the following data were gathered:

Simple Linear Regression

Example 1: Gestational Age (weeks) and Weight of Premature Babies (g)

$$
\hat{E}(\text{Weight}|\text{Age}) = \hat\beta_0 + \hat\beta_1\,\text{Age} = -1404.36 + 86.06\,\text{Age}
$$

The regression equation is estimated from the data by minimizing the sum of the squared vertical distances from the data points to the fitted line.

These vertical distances are called residuals, $\hat{e}_i = y_i - \hat{y}_i$. Some of the residuals have been added to the plot.

The fitted values $(x_i, \hat{y}_i)$ are the points that lie on the line for each of the $x_i$ values; the observed data points are $(x_i, y_i)$. The fitted values represent the estimated mean for that value of x.

r = .80

Fitting a line by least squares

Which line?

[Figure: scatter plot with several candidate lines.]

Fitting a line by least squares

[Figure: (a) The data (b) Which line?]

• Choose the line with the smallest sum of squared prediction errors

• i.e. smallest Residual Sum of Squares, RSS

Fitting a line by least squares

The prediction error (residual) for the i-th data point $(x_i, y_i)$ is $y_i - \hat{y}_i$, where $\hat{y}_i$ is the point on the line at $x_i$.

Choose the line to minimize $\sum (y_i - \hat{y}_i)^2$, i.e. the line with the smallest sum of squared lengths of the “error” arrows. This is the least-squares line.

[Figure: data points at $x_1, x_2, \ldots, x_i, \ldots, x_n$ with vertical “error” arrows from each point to a candidate line.]

Method of Least Squares

To estimate the regression line we choose $\hat\beta_0$ and $\hat\beta_1$ to minimize

$$
\sum_{i=1}^{n}\bigl(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\bigr)^2 = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 = \sum_{i=1}^{n}\hat{e}_i^{\,2}
$$

This requires calculus, but the solutions are easy to express in terms of standard summary statistics:

$$
\hat\beta_1 = r\,\frac{s_y}{s_x}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}
$$

where r = correlation between X and Y, $s_x$ = std. deviation of X, and $s_y$ = std. deviation of Y.

Next we look at the estimated coefficients and their interpretation.
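The least-squares solutions expressed through r, the two standard deviations, and the two means can be tried out directly. The sketch below is a minimal standard-library illustration; the function names and the toy data are assumptions made for the example, not part of the slides.

```python
import math

def least_squares_line(x, y):
    """Intercept and slope from summary statistics: b1 = r*s_y/s_x, b0 = y_bar - b1*x_bar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_sum_of_squares(x, y, b0, b1):
    """RSS for the line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
rss_fit = residual_sum_of_squares(x, y, b0, b1)

# Any other line has a larger RSS; two nearby alternatives as a spot check
rss_steeper = residual_sum_of_squares(x, y, b0, b1 + 0.1)
rss_shifted = residual_sum_of_squares(x, y, b0 + 0.5, b1)
print(round(b0, 3), round(b1, 3), round(rss_fit, 3))
```

The spot check only compares against two alternative lines; it illustrates the minimizing property rather than proving it.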

The Regression Line

$$
\hat{y} = \hat{E}(Y|X) = \hat\beta_0 + \hat\beta_1 x
$$

i.e. y = mx + b

$\hat\beta_0$ = Intercept = $\hat{y}$-value at x = 0. Interpretable only if x = 0 is a value of particular interest.

$\hat\beta_1$ = Slope = change in $\hat{y}$ for every unit increase in x. Always interpretable!

[Figure: line crossing the y-axis at $\hat\beta_0$; a run of w units in x gives a rise of $\hat\beta_1 w$ units in $\hat{y}$.]

Interpretation of Coefficients and the Estimated Regression Equation

$$
\hat{E}(\text{Weight}|\text{Age}) = -1404.36 + 86.06\,\text{Age}
$$

The estimated parameters:

$$
\hat\beta_0 = -1404.36, \qquad \hat\beta_1 = 86.06
$$

What are the units on the intercept and slope parameters?

Intercept: grams. Slope: grams/week.

How do we interpret the estimated values?

$\hat\beta_0$ = y-intercept, the mean of Y when X = 0; usually of no interest unless 0 is a reasonable value for X. Here X = 0 is meaningless.

$\hat\beta_1$ = slope, the change in the mean of Y when X increases by 1. Here we estimate the mean weight increases 86 grams/week.

Interpretation of Coefficients and the Estimated Regression Equation

Estimating the mean weight as a function of the gestational age.

$$
\hat{E}(\text{Weight}|\text{Age}) = -1404.36 + 86.06\,\text{Age}
$$

Use the equation to estimate the mean weight of infants born with a gestational age of 30 weeks.

$$
\hat{E}(\text{Weight}|\text{Age}=30) = -1404.36 + 86.06(30) = 1177.44 \text{ g}
$$

Use the equation to estimate the mean weight of infants born with a gestational age of 42 weeks.

$$
\hat{E}(\text{Weight}|\text{Age}=42) = -1404.36 + 86.06(42) = 2210.16 \text{ g}
$$

This is beyond the range of the data, as none of the infants in these data were full term or longer.
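The two estimates are plain arithmetic on the fitted equation, which the short sketch below reproduces. The function name is hypothetical; the coefficients are the estimates quoted in this example.

```python
def estimated_mean_weight(age_weeks, b0=-1404.36, b1=86.06):
    """Estimated mean weight (g) at a given gestational age (weeks)."""
    return b0 + b1 * age_weeks

for age in (30, 42):
    print(f"Age {age} weeks: {estimated_mean_weight(age):.2f} g")
# Age 30 weeks: 1177.44 g
# Age 42 weeks: 2210.16 g
```

The function will return a number for any age, which is exactly why extrapolation needs care: nothing in the arithmetic warns that 42 weeks lies outside the observed data.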

F-test in Regression ANOVA Summary

Tests

$H_0$: The regression is NOT useful

i.e., $H_0$: $\beta_1 = 0$ (and $\beta_2 = 0$ and … $\beta_k = 0$ if we are performing multiple regression)

• Almost always significant (P-value almost always small)

• Very rare that an investigator’s intuition is so bad that none of her or his explanatory variables have any predictive value

Source       Sum of Sqs                      df       Mean SS   F-statistic          p-val.
Regression   $\sum(\hat{y}_i-\bar{y})^2$     k        MSReg     $F_0$ = MSReg / MSE  P(F ≥ $F_0$)
Residual     $\sum(y_i-\hat{y}_i)^2$         n-k-1    MSE
Total        $\sum(y_i-\bar{y})^2$           n-1

Note: $F_0$ = MSReg / MSE

F-test in Regression ANOVA Summary

$F_0$ = 67.49, p-value < .0001

Thus we conclude that the regression is useful and that gestational age helps explain the variation in the observed birth weights.
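To make the ANOVA quantities concrete, the sketch below computes the regression, residual, and total sums of squares and the F-statistic for a simple linear regression (k = 1). It is a minimal standard-library illustration on made-up toy data, not the birth weight data; the function name and dictionary keys are assumptions.

```python
def regression_anova(x, y):
    """Sums of squares, mean squares, and F0 for a one-predictor least-squares fit."""
    n, k = len(x), 1
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Least-squares slope and intercept (algebraically equal to r*s_y/s_x and y_bar - b1*x_bar)
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * a for a in x]

    ss_reg = sum((f - y_bar) ** 2 for f in y_hat)          # Regression SS, df = k
    ss_res = sum((b - f) ** 2 for b, f in zip(y, y_hat))   # Residual SS, df = n - k - 1
    ss_total = sum((b - y_bar) ** 2 for b in y)            # Total SS, df = n - 1

    ms_reg = ss_reg / k
    mse = ss_res / (n - k - 1)
    return {"ss_reg": ss_reg, "ss_res": ss_res, "ss_total": ss_total,
            "ms_reg": ms_reg, "mse": mse, "f0": ms_reg / mse}

# Made-up toy data, for illustration only
table = regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(f"{name}: {value:.3f}")
```

The p-value P(F ≥ F0) would be read from an F-distribution with k and n - k - 1 degrees of freedom, which this sketch leaves to a table or a statistics package.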

Summarizing the Fit

$R^2$ = proportion of variation explained by the regression of Y on X. Here this proportion is .6398 or 63.98%. This will be discussed in the next few slides.

Estimate of the residual or error standard deviation ($\sigma$):

$$
\hat\sigma = \sqrt{MS_{Error}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-k-1}}
$$

This is also called Root Mean Square Error (RMSE).

Towards “Percent Variation Explained”

[Figure: two panels of y vs. x. Actual y-observations: shows the variation in the y’s. Fitted or predicted values: shows the variation in the $\hat{y}$’s.]

Percent of Variation Explained

In a situation where we had a perfectly fitting model, we would get this much variation in the y’s, transmitted from the variation in the x’s.

Percent of Variation Explained

Our data has slightly more variation in the y’s than that.

Percent of Variation Explained

Variation in the $\hat{y}$’s: this amount of variation can be “explained” as transmitted from the variation in the x’s.

We see some additional variation in the y-values here. The excess (residual variation) is not explained by the model.

R-squared: Percent Variation Explained

($R^2$ is also called the “Coefficient of determination”)

• When expressed as a percentage, $R^2$ is “percent variation explained”.

• It is the percentage of variation in the y-values that the model can explain from the variation in the x-values.

$$
R^2 = \frac{\text{Variation in } \hat{y}\text{'s}}{\text{Variation in } y\text{'s}} = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Res. SS}}{\text{Total SS}}
$$
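The ratio defining R-squared can be computed from observed and fitted values, and in simple linear regression it equals the square of Pearson’s r. The sketch below is a minimal illustration: the function name and the toy observations are made up, the fitted values are those of the least-squares line 2.2 + 0.6x for x = 1, …, 5, and the final check only uses the two values reported in these slides (R² = .6398 and r = .80).

```python
import math

def r_squared(y, y_hat):
    """R^2 = 1 - Res.SS / Total.SS for observed y and fitted y_hat."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_res / ss_total

# Made-up toy observations and their least-squares fitted values
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
r2_toy = r_squared(y, y_hat)
print(round(r2_toy, 3))

# Consistency check on the reported values: sqrt(.6398) rounds to the reported r = .80
r_from_r2 = math.sqrt(0.6398)
print(round(r_from_r2, 2))
```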

Summarizing the Fit

R2 = 63.98% of the variation in the birth weights can be explained by the regression on the gestational age.

Inference for Model Parameters

$$
\hat\beta_0 = -1404.36 \text{ g}, \qquad \hat\beta_1 = 86.06 \text{ g/wk}
$$

Testing Parameters ($\beta_j$)

$$
H_0: \beta_j = 0 \qquad H_a: \beta_j \neq 0
$$

$$
t = \frac{\hat\beta_j}{SE(\hat\beta_j)} \sim t\text{-distribution with } df = n-k-1
$$

Confidence Interval for $\beta_j$

$$
\hat\beta_j \pm (t\text{-table})\,SE(\hat\beta_j), \quad t\text{-table value w/ } df = n-k-1
$$

These both apply for multiple regression as well.

Inference for Model Parameters

$$
\hat\beta_0 = -1404.36 \text{ g}, \qquad \hat\beta_1 = 86.06 \text{ g/wk}
$$

Testing Parameters ($\beta_j$)

$$
H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0
$$

$$
t = \frac{\hat\beta_1}{SE(\hat\beta_1)} \sim t\text{-distribution with } df = n-k-1
$$

$$
t = \frac{\hat\beta_1}{SE(\hat\beta_1)} = 8.22, \quad p\text{-value} < .0001
$$

We have strong evidence that the slope is not 0 and hence conclude that gestational age is a statistically significant predictor.

Inference for Model Parameters

$$
\hat\beta_0 = -1404.36 \text{ g}, \qquad \hat\beta_1 = 86.06 \text{ g/wk}
$$

Confidence Interval for $\beta_j$

$$
\hat\beta_j \pm (t\text{-table})\,SE(\hat\beta_j), \quad t\text{-table value w/ } df = n-k-1
$$

$$
86.06 \pm 2.024(10.48) = 86.06 \pm 21.21 = (64.85,\ 107.27)
$$

We estimate that infants in this population gained between 64.85 g and 107.27 g per week of gestation with 95% confidence.
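The interval arithmetic can be reproduced in a few lines. In this sketch the function name is hypothetical; the estimate (86.06), standard error (10.48), and t-table value (2.024) are the numbers used in this example.

```python
def slope_confidence_interval(estimate, std_error, t_table):
    """Interval of the form estimate +/- t_table * std_error."""
    margin = t_table * std_error
    return estimate - margin, estimate + margin

lower, upper = slope_confidence_interval(86.06, 10.48, 2.024)
print(f"({lower:.2f}, {upper:.2f})")
# (64.85, 107.27)
```

Because the interval excludes 0, it agrees with the t-test's conclusion that the slope differs from 0.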

Estimating E(Y|X) and Predicting Response Values for Individuals

• We can construct CI’s for the mean of Y for a given value of X or for an individual with a given value of X. For the latter case we refer to the interval as a prediction interval (PI).

• Both intervals use the same point estimate from the regression equation as the center of the interval.

Estimating E(Y|X) and Predicting Response Values for Individuals

Confidence Interval for the mean birth weight of infants in this population with a gestational age of 34 weeks.

Prediction Interval for the birth weight of an infant with a gestational age of 30 weeks.

Estimating E(Y|X) and Predicting Response Values for Individuals

Gestational Age = 30 weeks

We estimate the mean birth weight of infants born with a gestational age of 30 weeks is between 1113.78 g and 1241.19 g. (CI)

We estimate that 95% of all infants born with a gestational age of 30 weeks will have a birth weight between 795.89 g and 1559.08 g. (PI)

Gestational Age = 34 weeks

We estimate that the mean birth weight of infants born with a gestational age of 34 weeks is between 1435.76 g and 1607.67 g. (CI)

We estimate that 95% of all infants born with a gestational age of 34 weeks will have a birth weight between 1135.79 g and 1907.66 g. (PI)

Checking Assumptions

1. The assumed functional form for the mean function E(Y|X) is correct, e.g. we might assume a line for the mean function, $E(Y|X) = \beta_0 + \beta_1 X$.

• Plot of residuals vs. fitted values

• Lack of fit tests (when available)

2. The errors are independent, normally distributed, with constant variance, i.e. $e_i \sim N(0, \sigma^2)$.

• Plot of residuals vs. fitted values

• Normal quantile plot of the residuals

Checking Assumptions

Residuals vs. Fitted Values

[Plot: residuals (vertical axis, centered at 0) vs. fitted values $\hat{y}$ (horizontal axis).]

This is the ideal plot that suggests no violation of model assumptions. The variation is constant and there is no evidence of lack of fit.

Checking Assumptions

Systolic vs. Diastolic Blood Pressure – relationship looks linear with constant variation throughout.

Residuals vs. Fitted Values – residuals exhibit no discernible trend or pattern and the variation is clearly constant throughout. This residual plot looks ideal with the possible exception of a few mild outliers.

Checking Assumptions

FEV vs. Height (in.) – there is some evidence of curvature in the scatter plot.

Residuals vs. Fitted Values – suggests the mean function is not linear and the error variance is not constant.

Checking Assumptions

CO2 vs. Time – monthly carbon dioxide readings at Mauna Loa volcano in Hawaii. Again some curvature is evident.

Residuals vs. Fitted Values – suggests the mean function is not linear.

Checking Assumptions

Residuals vs. Fitted Values – a closer inspection suggests the mean function is not linear AND, more importantly, the residuals are NOT independent, as they are clearly related to one another. There is clearly some type of repeating seasonal pattern in these data that needs to be accounted for in the modeling process.

Page 56

Checking Assumptions

For the birth weight and gestational age regression example, the plot of the residuals vs. the fitted values shows no model inadequacies.

Page 57

Checking Assumptions

To check normality we simply construct a normal quantile plot of the residuals.

This is the normal quantile plot for the residuals from the birth weight and gestational age study. This plot suggests approx. normality for the error distribution.
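The coordinates of a normal quantile plot can be sketched as follows: the sorted residuals are paired with the standard normal quantiles at plotting positions (i - 0.5)/n. That plotting-position formula is one common convention and is an assumption here; statistical software may use a slightly different one. The residual values are made up for illustration.

```python
from statistics import NormalDist


def normal_quantile_points(resid):
    """Pair each sorted residual with its theoretical normal quantile."""
    n = len(resid)
    ordered = sorted(resid)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, for illustration only.
resid = [0.3, -1.2, 0.8, -0.1, 1.5, -0.6, 0.2]
points = normal_quantile_points(resid)
for q, e in points:
    print(f"normal quantile = {q:6.3f}   residual = {e:6.3f}")
```

If the errors are approximately normal, these points fall close to a straight line.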

Page 58

Identifying Outliers

To check for outliers determine the value of 2*RMSE, i.e. $2\hat{\sigma}$.

Any observations with residuals outside the bands at $\pm 2\hat{\sigma}$ around 0 are potential outliers and should be investigated further to determine whether or not they adversely affect the model.
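A minimal sketch of the 2*RMSE rule follows. It assumes RMSE is computed as the square root of SSE/(n - 2), the usual estimate of the error standard deviation in simple linear regression. The data are hypothetical, with one value deliberately pushed away from the line.

```python
import math


def flag_outliers(x, y):
    """Return (rmse, indices of points whose residual exceeds 2 * RMSE in size)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Assumption: RMSE = sqrt(SSE / (n - 2)).
    rmse = math.sqrt(sum(e * e for e in resid) / (n - 2))
    flagged = [i for i, e in enumerate(resid) if abs(e) > 2 * rmse]
    return rmse, flagged


# Hypothetical data: roughly y = 2x, except the fifth point (index 4).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.0, 8.1, 15.0, 12.1, 13.9, 16.0, 18.1, 19.9]
rmse, flagged = flag_outliers(x, y)
print(f"RMSE = {rmse:.3f}, band = +/-{2 * rmse:.3f}, flagged indices = {flagged}")
```

A flagged point is only a candidate: the next step is to investigate it, not to delete it automatically.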

Page 59

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

A study evaluated 2D ECHO for the assessment of left ventricular diastolic filling. Data were collected on half-filling fraction (1/2FF) for 27 patients as determined by both 2D ECHO and angiography using diagnostic cardiac catheterization.

Q: Do these data provide evidence that 1/2FF obtained from angiography and 2D ECHO are not equivalent?

Page 60

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

r = .8432 (p < .0001)

Regression Estimate: Angio = .077 + .857 × 2D ECHO

What should $\beta_0$ and $\beta_1$ be if the methods are equivalent?

Page 61

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

R-square = 71%

RMSE = .0835

Regression is useful, p < .0001

There is no evidence of lack of fit (p = .2963)

Intercept ($\beta_0$) is not significantly different from 0 (p = .2515)

Slope ($\beta_1$) is significantly different from 0 (p < .0001)

Page 62

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

If equivalent, the intercept ($\beta_0$) should be 0 and the slope ($\beta_1$) should be 1. We can test these using the t-test.

INTERCEPT

$$H_0: \beta_0 = 0 \qquad H_a: \beta_0 \neq 0$$

$$t = \frac{.077 - 0}{.066} = 1.17, \qquad P(|t| > 1.17) = .2515$$

SLOPE

$$H_0: \beta_1 = 1 \qquad H_a: \beta_1 \neq 1$$

$$t = \frac{.857 - 1}{.109} = -1.31, \qquad P(|t| > 1.31) = .2018$$
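The two t-tests can be reproduced approximately from the rounded estimates and standard errors on the slide. The sketch below assumes n - 2 = 25 degrees of freedom (27 patients, two estimated parameters) and obtains the two-sided p-value by integrating the t density numerically with Simpson's rule, so that only the standard library is needed. The helper names are illustrative, and because the inputs are rounded the results should land close to, not exactly on, the slide values.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def two_sided_p(t, df, steps=2000):
    """P(|T| > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * area


def t_test(estimate, hypothesized, std_error, df):
    """t statistic and two-sided p-value for H0: parameter = hypothesized."""
    t = (estimate - hypothesized) / std_error
    return t, two_sided_p(t, df)


df = 27 - 2  # assumption: n - 2 degrees of freedom for simple linear regression
t_int, p_int = t_test(0.077, 0.0, 0.066, df)      # H0: intercept = 0
t_slope, p_slope = t_test(0.857, 1.0, 0.109, df)  # H0: slope = 1
print(f"intercept: t = {t_int:.2f}, p = {p_int:.4f}")
print(f"slope:     t = {t_slope:.2f}, p = {p_slope:.4f}")
```

Note that the slope is tested against 1, not 0, which is why its t statistic differs from the one reported in the default regression output.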

Page 63

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

Checking assumptions

Residuals vs. Fitted Values – this plot suggests no model violations, although there is one mild outlier.

Normal quantile plot of residuals – normality looks good, again with the exception of the mild outlier.

Page 64

Example 2: Evaluating two-dimensional echocardiography (2D ECHO)

Because we fail to reject the null hypothesis for both parameters we have no evidence against equality of the two methods for determining the half-filling fraction (1/2FF).

What other methods could be used to answer the question of interest?

• Dependent samples t-test (paired t-test)
• Wilcoxon signed rank test
• Sign test (definitely the worst choice)
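For comparison, the dependent samples t-test from the list above works on the within-patient differences: the mean difference is divided by its standard error. The sketch below computes only the t statistic and its degrees of freedom. The four pairs of values are invented to keep the arithmetic easy to follow and are not the study's measurements.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """t statistic and degrees of freedom for H0: mean difference = 0."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1


# Invented paired readings (illustration only, not the 27 patients).
angio = [0.5, 0.6, 0.7, 0.8]
echo = [0.4, 0.6, 0.6, 0.7]
t_stat, df = paired_t(angio, echo)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Unlike the regression approach, this test only asks whether the two methods differ on average; it does not check whether the slope relating them is 1.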

Page 65

Summary

• Type of correlation measure depends on
- data types of X and Y
- nature of the relationship (linear?)

• Simple Linear Regression
- estimate the E(Y|X)
- E(Y|X) need not be a line
- be sure to check assumptions
- variety of inferential methods

• If assumptions are violated we need to change the model or transform X and/or Y.
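As a small illustration of the last point, the sketch below compares R-square for a straight-line fit before and after a log transformation of Y. The data are generated from an exact exponential curve, a hypothetical case chosen so that the effect is obvious; with real data the choice of transformation has to be guided by the residual plots.

```python
import math


def r_squared(x, y):
    """R-square for the least-squares line (squared Pearson correlation)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical curved relationship: y = 2 * exp(0.5 * x).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

r2_raw = r_squared(x, y)                          # line fitted to Y
r2_log = r_squared(x, [math.log(yi) for yi in y])  # line fitted to log(Y)
print(f"R-square, Y vs. X:      {r2_raw:.3f}")
print(f"R-square, log(Y) vs. X: {r2_log:.3f}")
```

After the transformation the relationship is a straight line, so the linear model assumptions can be rechecked on the log scale.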