+ Chapter 4: Correlation and Regression. Lecture PowerPoint Slides. Discovering Statistics, 2nd Edition, Daniel T. Larose.


Page 1: Chapter 4: Correlation and Regression

+ Chapter 4: Correlation and Regression

Lecture PowerPoint Slides

Discovering Statistics

2nd Edition Daniel T. Larose

Page 2: Chapter 4: Correlation and Regression

+ Chapter 4 Overview

4.1 Scatterplots and Correlation

4.2 Introduction to Regression

4.3 Further Topics in Regression


Page 3: Chapter 4: Correlation and Regression

+ The Big Picture

Where we are coming from and where we are headed…

Chapter 3 showed us methods for summarizing data using descriptive statistics, but only one variable at a time.

In Chapter 4, we learn how to analyze the relationship between two quantitative variables using scatterplots, correlation, and regression.

In Chapter 5, we will learn about probability, which we will need in order to perform statistical inference.


Page 4: Chapter 4: Correlation and Regression

+ 4.1: Scatterplots and Correlation

Objectives:

Construct and interpret scatterplots for two quantitative variables.

Calculate and interpret the correlation coefficient.

Determine whether a linear correlation exists between two variables.


Page 5: Chapter 4: Correlation and Regression

Scatterplots

Whenever you are examining the relationship between two quantitative variables, your best bet is to start with a scatterplot. A scatterplot is used to summarize the relationship between two quantitative variables that have been measured on the same element. A scatterplot is a graph of points (x, y), each of which represents one observation from the data set. One of the variables is measured along the horizontal axis and is called the x variable. The other variable is measured along the vertical axis and is called the y variable.

Lot | x = square footage (100s of sq ft) | y = sales price ($1000s)
Harding St | 75 | 155
Newton Ave | 125 | 210
Stacy Ct | 125 | 290
Eastern Ave | 175 | 360
Second St | 175 | 250
Sunnybrook Rd | 225 | 450
Ahlstrand Rd | 225 | 530
Eastern Ave | 275 | 635

Page 6: Chapter 4: Correlation and Regression

Scatterplots

The relationship between two quantitative variables can take many different forms. Four of the most common are:

Positive linear relationship: As x increases, y also tends to increase.

Negative linear relationship: As x increases, y tends to decrease.

No apparent relationship: As x increases, y tends to remain unchanged.

Nonlinear relationship: The x and y variables are related, but not in a way that can be approximated using a straight line.

Page 7: Chapter 4: Correlation and Regression

Correlation Coefficient

Scatterplots provide a visual description of the relationship between two quantitative variables. The correlation coefficient is a numerical measure for quantifying the linear relationship between two quantitative variables.

The correlation coefficient r measures the strength and direction of the linear relationship between two variables. The correlation coefficient r is

r = Σ(x − x̄)(y − ȳ) / [(n − 1) · sx · sy]

where sx is the sample standard deviation of the x data values, and sy is the sample standard deviation of the y data values.
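The defining formula can be checked numerically. A minimal Python sketch (not part of the slides); the data values are those used in the worked example on the following slides:

```python
import math

def correlation_coefficient(x, y):
    """r = sum((x - xbar)(y - ybar)) / ((n - 1) * sx * sy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Sample standard deviations of the x and y data values
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Data values from the worked example on the following slides
x = [27, 33, 39, 57, 45, 47, 71, 16, 37, 39]
y = [35, 44, 46, 68, 55, 63, 79, 29, 42, 45]
print(round(correlation_coefficient(x, y), 4))  # 0.9761
```

In practice the same quantity is available from libraries, e.g. `statistics.correlation` in Python 3.10+.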

Page 8: Chapter 4: Correlation and Regression

Calculating the Correlation Coefficient

x data: 27, 33, 39, 57, 45, 47, 71, 16, 37, 39
x̄ = Σx / n = 411/10 = 41.1

y data: 35, 44, 46, 68, 55, 63, 79, 29, 42, 45
ȳ = Σy / n = 506/10 = 50.6

Page 9: Chapter 4: Correlation and Regression

Calculating the Correlation Coefficient

sx = √(Σ(x − x̄)² / (n − 1)) = √(2116.9 / (10 − 1)) ≈ 15.33659386

sy = √(Σ(y − ȳ)² / (n − 1)) = √(2162.4 / (10 − 1)) ≈ 15.50053763

r = Σ(x − x̄)(y − ȳ) / [(n − 1) · sx · sy] = 2088.4 / (9 · 15.33659386 · 15.50053763) ≈ 0.9761

Page 10: Chapter 4: Correlation and Regression

Properties of r

• If most of the data values fall in Regions 1 and 3, r will tend to be positive.
• If most of the data values fall in Regions 2 and 4, r will tend to be negative.
• If the four regions share the data values more or less equally, then r will be near zero.

Page 11: Chapter 4: Correlation and Regression

Properties of r

1. The correlation coefficient r always satisfies −1 ≤ r ≤ 1.
2. When r = +1, a perfect positive relationship exists between x and y.
3. Values of r near +1 indicate a positive relationship between x and y.
   • The closer r gets to +1, the stronger the evidence for a positive relationship.
   • The variables are said to be positively associated.
   • As x increases, y tends to increase.
4. When r = −1, a perfect negative relationship exists between x and y.
5. Values of r near −1 indicate a negative relationship between x and y.
   • The closer r gets to −1, the stronger the evidence for a negative relationship.
   • The variables are said to be negatively associated.
   • As x increases, y tends to decrease.
6. Values of r near 0 indicate there is no linear relationship between x and y.
   • The closer r gets to 0, the weaker the evidence for a linear relationship.
   • The variables are not linearly associated.
   • A nonlinear relationship may exist between x and y.

Page 12: Chapter 4: Correlation and Regression


Properties of r

Page 13: Chapter 4: Correlation and Regression

Test for Linear Correlation

There is a simple comparison test that will tell us whether the correlation coefficient is strong enough to conclude that the variables are correlated.

Comparison Test for Linear Correlation
1. Find the absolute value |r| of the correlation coefficient.
2. Turn to the Table of Critical Values for the Correlation Coefficient and select the row corresponding to the sample size n.
3. Compare |r| to the critical value from the table.
   • If |r| > critical value, you can conclude x and y are linearly correlated.
      • If r > 0, they are positively correlated.
      • If r < 0, they are negatively correlated.
   • If |r| is not greater than the critical value, then x and y are not linearly correlated.
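The comparison test is easy to automate once r is known. A sketch, assuming the critical value has been looked up from a table; 0.632 is a commonly tabulated value for n = 10 at the 0.05 level, but the book's Table of Critical Values should be used:

```python
def linear_correlation_conclusion(r, critical_value):
    """Comparison test: compare |r| against the table's critical value."""
    if abs(r) > critical_value:
        return "positively correlated" if r > 0 else "negatively correlated"
    return "not linearly correlated"

# Worked example: r = 0.9761 with n = 10; assumed critical value 0.632
print(linear_correlation_conclusion(0.9761, 0.632))  # positively correlated
```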

Page 14: Chapter 4: Correlation and Regression

+ 4.2: Introduction to Regression

Objectives:

Calculate and interpret the slope and the y-intercept of the regression line.

Use the regression equation to make predictions, and interpret prediction error.

Understand the cautions of extreme values and extrapolation.

Page 15: Chapter 4: Correlation and Regression

The Regression Line

Section 4.1 introduced the correlation coefficient. In this section, we learn how to approximate the linear relationship between two numerical variables using the regression line and regression equation.

City | x = low temp (°F) | y = high temp (°F)
Minneapolis | 10 | 29
Boston | 20 | 37
Chicago | 20 | 43
Philadelphia | 30 | 41
Cincinnati | 30 | 49
Wash., DC | 40 | 50
Las Vegas | 40 | 58
Memphis | 50 | 64
Dallas | 50 | 70
Miami | 60 | 74

We write the equation of the regression line as

ŷ = b1x + b0

Page 16: Chapter 4: Correlation and Regression

The Regression Line

Equation of the Regression Line

The equation of the regression line that approximates the relationship between x and y is

ŷ = b1x + b0

where the regression coefficients are the slope, b1, and the y-intercept, b0. The equations for these coefficients are

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

b0 = ȳ − b1 · x̄

Note: The “hat” over the y (pronounced “y-hat”) indicates this is an estimate of y and not necessarily an actual value of y.
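These coefficient formulas translate directly into code. A minimal sketch (not from the slides), using the low/high temperature data from the previous slide's table:

```python
def regression_line(x, y):
    """Least-squares slope b1 and intercept b0 for y-hat = b1*x + b0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b1, b0

# Low/high temperature data for the ten cities
low  = [10, 20, 20, 30, 30, 40, 40, 50, 50, 60]
high = [29, 37, 43, 41, 49, 50, 58, 64, 70, 74]
b1, b0 = regression_line(low, high)
print(b1, b0)  # 0.9 20.0
```

This reproduces the slope and intercept interpreted on the next slide.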

Page 17: Chapter 4: Correlation and Regression

Interpreting Slope and y-Intercept

• In statistics, we interpret the slope of the regression line as the estimated change in y per unit increase in x.
• The y-intercept is interpreted as the estimated value of y when x equals 0.

b1 = 0.9. For each increase of 1°F in low temp, the estimated high temp increases by 0.9°F.

b0 = 20. When the low temp is 0°F, the estimated high temp is 20°F.

• The slope b1 and the correlation coefficient r always have the same sign.
• b1 is positive if and only if r is positive.
• b1 is negative if and only if r is negative.

Page 18: Chapter 4: Correlation and Regression

Predictions and Prediction Error

We can use the regression equation to make estimates or predictions. For any value of x, the predicted value of y lies on the regression line.

Example: Low Temp = 50°F

ŷ = 0.9x + 20 = 0.9(50) + 20 = 65

Note: The predicted high temp for a city with a low temp of 50°F is 65°F. Dallas had a low temp of 50°F and an actual high temp of 70°F.

Page 19: Chapter 4: Correlation and Regression


Predictions and Prediction Error

The prediction error, or residual, measures how far the predicted “y-hat” value is from the actual value of y observed in the data set. The prediction error may be positive or negative.

• Positive prediction error: The data value lies above the regression line, so the observed value is greater than the predicted value for the given value of x.

• Negative prediction error: The data value lies below the regression line, so the observed value is less than the predicted value for the given value of x.

• Prediction error equal to zero: The data value lies directly on the regression line, so the observed value of y is exactly equal to what is predicted for the given value of x.
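As a quick check of these sign conventions, the arithmetic for the Dallas example from the previous slide:

```python
# Dallas: low temp 50°F, actual high temp 70°F; line from the slide: y-hat = 0.9x + 20
low, actual_high = 50, 70
predicted = 0.9 * low + 20          # predicted high temp, 65.0
residual = actual_high - predicted  # 5.0: positive, so Dallas lies above the line
print(predicted, residual)
```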

Page 20: Chapter 4: Correlation and Regression


Cautions with Regression

The correlation coefficient and regression line are both sensitive to extreme values.

Extrapolation consists of using the regression equation to make estimates or predictions based on x-values that are outside the range of the x-values in the data set.

Page 21: Chapter 4: Correlation and Regression

+ 4.3: Further Topics in Regression Analysis

Objectives:

Calculate the sum of squares error (SSE), and use the standard error of the estimate s as a measure of a typical prediction error.

Describe how total variability, prediction error, and improvement are measured by the total sum of squares (SST), the sum of squares error (SSE), and the sum of squares regression (SSR).

Explain the meaning of the coefficient of determination r2 as a measure of the usefulness of the regression.


Page 22: Chapter 4: Correlation and Regression

Sum of Squares Error (SSE)

Consider the results for ten subjects who were given a set of short-term memory tasks. The memory score and time to memorize are given.

Subject | x = time to memorize (min) | y = memory score
1 | 1 | 9
2 | 1 | 10
3 | 2 | 11
4 | 3 | 12
5 | 3 | 13
6 | 4 | 14
7 | 5 | 19
8 | 6 | 17
9 | 7 | 21
10 | 8 | 24

SSE = Σ(y − ŷ)² = Σ(residual)² = Σ(prediction error)²

The least-squares criterion states that the regression line is the line for which the SSE is minimized.

Page 23: Chapter 4: Correlation and Regression

Standard Error of the Estimate s

The standard error of the estimate gives a measure of the typical residual. That is, s is a measure of the typical prediction error. If the typical prediction error is large, then the regression line may not be useful.

s = √(SSE / (n − 2))

Subject | x = time to memorize (min) | y = memory score | Predicted memory score | Residual | Residual²
1 | 1 | 9 | 9 | 0 | 0
2 | 1 | 10 | 9 | 1 | 1
3 | 2 | 11 | 11 | 0 | 0
4 | 3 | 12 | 13 | −1 | 1
5 | 3 | 13 | 13 | 0 | 0
6 | 4 | 14 | 15 | −1 | 1
7 | 5 | 19 | 17 | 2 | 4
8 | 6 | 17 | 19 | −2 | 4
9 | 7 | 21 | 21 | 0 | 0
10 | 8 | 24 | 23 | 1 | 1

SSE = 12

s = √(12/8) ≈ 1.2247
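The SSE and s computations can be reproduced in a short sketch. The fitted line ŷ = 2x + 7 is inferred from the table's predicted scores; the slides do not state it explicitly:

```python
import math

# Memory-task data from the slide's table
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
y_hat = [2 * xi + 7 for xi in x]  # inferred fitted line y-hat = 2x + 7

# SSE = sum of squared residuals; s = sqrt(SSE / (n - 2))
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
s = math.sqrt(sse / (len(x) - 2))
print(sse, round(s, 4))  # 12 1.2247
```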

Page 24: Chapter 4: Correlation and Regression


SST, SSR, and SSE

The coefficient of determination r2 depends on the values of two new statistics, SST and SSR.

The least-squares criterion guarantees that SSE is the smallest possible value for a data set. However, this does not guarantee that the regression is useful.

Suppose our best estimate for y was the average y value. Now suppose you found the difference between each y value and the average y and summed the squared differences. This would be the total sum of squares (SST):

SST = Σ(y − ȳ)²

Page 25: Chapter 4: Correlation and Regression


SST, SSR, and SSE

Since we have a regression equation, we can make predictions for y that are (hopefully) more accurate than the average y value. The amount of improvement is the difference between y-hat and the average y. This leads us to the sum of squares regression (SSR):

SSR = Σ(ŷ − ȳ)²

Relationship Among SST, SSR, and SSE

SST = SSR + SSE

Page 26: Chapter 4: Correlation and Regression


Coefficient of Determination r2

SSR represents the amount of variability in the response variable that is accounted for by the regression equation.

SSE represents the amount of variability in y that is left unexplained after accounting for the relationship between x and y.

Since we know that SST represents the sum of SSR and SSE, it makes sense to consider the ratio of SSR and SST, called the coefficient of determination r2.

Coefficient of Determination r2

The coefficient of determination r2 = SSR/SST measures the goodness of fit of the regression equation to the data. We interpret r2 as the proportion of the variability in y that is accounted for by the linear relationship between y and x. The values that r2 can take are 0 ≤ r2 ≤ 1.

Computational formulas:

SST = Σy² − (Σy)² / n

SSR = [Σxy − (Σx)(Σy)/n]² / [Σx² − (Σx)²/n]
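A sketch tying SST, SSR, SSE, and r² together for the memory-task data from the earlier slides (the fitted line ŷ = 2x + 7 is inferred from the table's predicted scores):

```python
# Memory-task data and inferred fitted line y-hat = 2x + 7
x = [1, 1, 2, 3, 3, 4, 5, 6, 7, 8]
y = [9, 10, 11, 12, 13, 14, 19, 17, 21, 24]
y_hat = [2 * xi + 7 for xi in x]
ybar = sum(y) / len(y)

sst = sum((yi - ybar) ** 2 for yi in y)              # total variability
ssr = sum((yh - ybar) ** 2 for yh in y_hat)          # improvement from regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variability
r_sq = ssr / sst                                     # coefficient of determination

print(sst, ssr, sse)   # 228.0 216.0 12
print(round(r_sq, 4))  # 0.9474
```

Note that SST = SSR + SSE holds exactly here (228 = 216 + 12), and about 94.7% of the variability in memory score is accounted for by the linear relationship with memorization time.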

Page 27: Chapter 4: Correlation and Regression

+ Chapter 4 Overview

4.1 Scatterplots and Correlation

4.2 Introduction to Regression

4.3 Further Topics in Regression
