Chapter 4 Describing the Relation Between Two Variables 4.3 Diagnostics on the Least-squares Regression Line

Chapter 4: Describing the Relation Between Two Variables

Section 4.3

Diagnostics on the Least-squares Regression Line

The coefficient of determination, $R^2$, measures the percentage of total variation in the response variable that is explained by the least-squares regression line.

The coefficient of determination is a number between 0 and 1, inclusive. That is, $0 \le R^2 \le 1$.

If $R^2 = 0$, the line has no explanatory value.

If $R^2 = 1$, the line explains 100% of the variation in the response variable.

The following data are based on a study of rock drilling. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, the depth at which drilling begins is the predictor variable, x, and the time (in minutes) to drill 5 feet is the response variable, y.

Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

Regression Analysis

The regression equation is
Time = 5.53 + 0.0116 Depth

Sample Statistics

         Mean     Standard Deviation
Depth    126.2    52.2
Time     6.99     0.781

Correlation Between Depth and Time: 0.773
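
The regression equation in this output can be recovered from the sample statistics alone, because the least-squares slope equals r times the ratio of the standard deviations and the line passes through the point of means. The following is a minimal Python sketch using only the values reported above; the variable names are illustrative and the depth unit is assumed to be feet.

```python
# Summary statistics reported in the regression output above
mean_depth, sd_depth = 126.2, 52.2   # predictor x: depth (assumed feet)
mean_time, sd_time = 6.99, 0.781     # response y: time (minutes)
r = 0.773                            # linear correlation coefficient

# Least-squares slope: r * (s_y / s_x)
slope = r * sd_time / sd_depth

# The least-squares line passes through (mean of x, mean of y)
intercept = mean_time - slope * mean_depth

print(f"Time = {intercept:.2f} + {slope:.4f} Depth")
```

Rounded to the precision of the output, this reproduces the slope 0.0116 and the intercept 5.53.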

Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”?

ANSWER:

The mean time to drill an additional 5 feet: 6.99 minutes.

Now suppose that we are asked to predict the time to drill an additional 5 feet when the current depth of the drill is 160 feet. What would be our best “guess”?

ANSWER:

Using the regression equation, the predicted time is 5.53 + 0.0116(160) ≈ 7.39 minutes. Our “guess” increased from 6.99 minutes to 7.39 minutes based on the knowledge that drill depth is positively associated with drill time.

The difference between the predicted drill time of 6.99 minutes and the predicted drill time of 7.39 minutes is due to the depth of the drill. In other words, the difference in our “guess” is explained by the depth of the drill.

The difference between the predicted value of 7.39 minutes and the observed drill time of 7.92 minutes is explained by factors other than drill depth.

The difference between the observed value of the response variable, $y$, and the mean value of the response variable, $\bar{y}$, is called the total deviation and is equal to $y - \bar{y}$.

The difference between the predicted value of the response variable, $\hat{y}$, and the mean value of the response variable is called the explained deviation and is equal to $\hat{y} - \bar{y}$.

The difference between the observed value of the response variable and the predicted value of the response variable is called the unexplained deviation and is equal to $y - \hat{y}$.

Total Deviation = Unexplained Deviation + Explained Deviation
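
The three deviations can be checked with the numbers from the drilling example at a depth of 160 feet (mean 6.99, predicted 7.39, observed 7.92 minutes). The following Python sketch only restates that arithmetic; the variable names are illustrative.

```python
mean_time = 6.99   # mean drill time (minutes)
predicted = 7.39   # predicted time at a depth of 160 feet
observed = 7.92    # observed time at a depth of 160 feet

total_dev = observed - mean_time        # observed minus mean
explained_dev = predicted - mean_time   # predicted minus mean
unexplained_dev = observed - predicted  # observed minus predicted (the residual)

print(f"total       = {total_dev:.2f}")
print(f"explained   = {explained_dev:.2f}")
print(f"unexplained = {unexplained_dev:.2f}")
```

Up to floating-point rounding, the total deviation (0.93) is the sum of the explained deviation (0.40) and the unexplained deviation (0.53).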

Total Variation = Unexplained Variation + Explained Variation

Dividing both sides by the Total Variation gives

$$1 = \frac{\text{Unexplained Variation}}{\text{Total Variation}} + \frac{\text{Explained Variation}}{\text{Total Variation}}$$

so that

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$$

To determine $R^2$ for the linear regression model, simply square the value of the linear correlation coefficient: $R^2 = r^2$.
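
The Python sketch below illustrates both the decomposition of the variation and the shortcut of squaring r. The slides do not list the drilling data, so it uses a small made-up data set, which is an assumption for illustration only. Each "variation" is taken to be the sum of the corresponding squared deviations, the usual definition.

```python
import math

# Hypothetical illustrative data (NOT the drilling data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares line and predicted values
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# Variation = sum of squared deviations
total = syy
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r = sxy / math.sqrt(sxx * syy)
r_squared = explained / total

print(f"total = {total:.2f}, explained = {explained:.2f}, unexplained = {unexplained:.2f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

For these made-up numbers the explained and unexplained variation add up to the total, and the ratio of explained to total variation matches the squared correlation coefficient.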

EXAMPLE Determining the Coefficient of Determination

Find and interpret the coefficient of determination for the drilling data.

Because the linear correlation coefficient, r, is 0.773, we have $R^2 = 0.773^2 \approx 0.5975 = 59.75\%$.

So, 59.75% of the variability in drilling time is explained by the least-squares regression line.

Draw a scatter diagram for each of these data sets. For each data set, the variance of y is 17.49.

Data Set A, $R^2$ = 100%

Data Set B, $R^2$ = 94.7%

Data Set C, $R^2$ = 9.4%

Residuals play an important role in determining the adequacy of the linear model. In fact, residuals can be used for the following purposes:

• To determine whether a linear model is appropriate to describe the relation between the predictor and response variables.

• To determine whether the variance of the residuals is constant.

• To check for outliers.

If a plot of the residuals against the predictor variable shows a discernible pattern, such as a curve, then the response and predictor variables may not be linearly related.

A chemist has a 1000-gram sample of a radioactive material. She records the amount of radioactive material remaining in the sample every day for a week and obtains the following data.

Day    Weight (in grams)
0      1000.0
1      897.1
2      802.5
3      719.8
4      651.1
5      583.4
6      521.7
7      468.3

Linear correlation coefficient: -0.994

Despite the strong linear correlation, a linear model is not appropriate for these data.
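
The Python sketch below shows one way to see this from the radioactive decay data given above: fit the least-squares line, then inspect the residuals in order of the predictor. The helper names are illustrative, and reading the signs of the residuals stands in for drawing the residual plot.

```python
import math

days = [0, 1, 2, 3, 4, 5, 6, 7]
weights = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]
n = len(days)

x_bar = sum(days) / n
y_bar = sum(weights) / n
sxx = sum((x - x_bar) ** 2 for x in days)
syy = sum((y - y_bar) ** 2 for y in weights)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, weights))

# Linear correlation coefficient and least-squares line
r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed - predicted
residuals = [y - (intercept + slope * x) for x, y in zip(days, weights)]

print(f"r = {r:.3f}")
for day, e in zip(days, residuals):
    print(f"day {day}: residual = {e:7.2f}")
```

The residuals are positive at both ends of the week and negative in the middle, a curved pattern that a high correlation coefficient alone does not reveal.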

If a plot of the residuals against the predictor variable shows the spread of the residuals increasing or decreasing as the predictor increases, then a strict requirement of the linear model is violated.

This requirement is called constant error variance. The statistical term for constant error variance is homoscedasticity.

A plot of residuals against the predictor variable may also reveal outliers. These values will be easy to identify because the residual will lie far from the rest of the plot.

We can also use a boxplot of residuals to identify outliers.
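
A boxplot flags a value as an outlier when it falls outside the fences, which are conventionally placed 1.5 interquartile ranges beyond the quartiles. The slides do not state this rule or list the residuals, so both the 1.5 × IQR convention and the residual values in the Python sketch below are assumptions for illustration.

```python
import statistics

# Hypothetical residuals, for illustration only
residuals = [-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.3, 0.4, 0.6, 3.1]

# Quartiles (quartile conventions differ slightly between textbooks and software)
q1, _median, q3 = statistics.quantiles(residuals, n=4)
iqr = q3 - q1

# Conventional boxplot fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [e for e in residuals if e < lower_fence or e > upper_fence]
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("outliers:", outliers)
```

Here only the residual 3.1 lies beyond the upper fence, so it is the one a boxplot would mark as an outlier.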

EXAMPLE Residual Analysis

Draw a residual plot of the drilling time data. Comment on the appropriateness of the linear least-squares regression model.

Boxplot of Residuals for the Drilling Data

An influential observation is one that has a disproportionate effect on the value of the slope and y-intercept in the least-squares regression equation.

[Figure: scatter diagram with points labeled Case 1 (outlier), Case 2, and Case 3 (influential)]

Influential observations typically occur when a point is an outlier relative to its x-value, that is, when its x-value lies far from the other x-values.
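
One way to judge influence is to fit the least-squares line with and without the suspect point and compare the slopes and intercepts. The Python sketch below does this for a small made-up data set (an assumption, since the drilling data are not listed in these slides); `least_squares` is an illustrative helper, not a library function.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Candidate point whose x-value lies far from the other x-values
extra_x, extra_y = 15, 2

slope_without, intercept_without = least_squares(x, y)
slope_with, intercept_with = least_squares(x + [extra_x], y + [extra_y])

print(f"without point: y = {intercept_without:.2f} + {slope_without:.2f} x")
print(f"with point:    y = {intercept_with:.2f} + {slope_with:.2f} x")
```

In this made-up case the single added point changes the sign of the slope, which is the kind of disproportionate effect that marks an observation as influential.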

EXAMPLE Influential Observations

Suppose an additional data point is added to the drilling data. At a depth of 300 feet, it took 12.49 minutes to drill 5 feet. Is this point influential?

[Figure: least-squares regression lines for the drilling data with and without the influential observation]

As with outliers, influential observations should be removed only if there is justification to do so. When an influential observation occurs in a data set and its removal is not warranted, there are two courses of action:

(1) Collect more data so that additional points near the influential observation are obtained, or

(2) Use techniques that reduce the influence of the influential observation (such as a transformation or a different method of estimation, e.g., minimizing absolute deviations).

Chapter Four: Describing the Relation Between Two Variables

Section 4.4

Nonlinear Regression: Transformations

EXAMPLE Using the Definition of a Logarithm

Rewrite each logarithmic expression as an equivalent expression involving an exponent. Rewrite each exponential expression as an equivalent logarithmic expression.

(a) $\log_3 15 = a$ (b) $4^5 = z$
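
By the definition of a logarithm, the statement in (a) is equivalent to $3^a = 15$, and the statement in (b) is equivalent to $\log_4 z = 5$. The short Python sketch below checks both equivalences numerically; it is a sanity check under floating-point arithmetic, not part of the original example.

```python
import math

# (a) log_3(15) = a  is equivalent to  3**a = 15
a = math.log(15, 3)
print(f"a = {a:.4f}, 3**a = {3 ** a:.4f}")

# (b) 4**5 = z  is equivalent to  log_4(z) = 5
z = 4 ** 5
print(f"z = {z}, log_4(z) = {math.log(z, 4):.4f}")
```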

In the following properties, M, N, and a are positive real numbers, with $a \neq 1$, and r is any real number.

$\log_a (MN) = \log_a M + \log_a N$

$\log_a M^r = r \log_a M$

EXAMPLE Simplifying Logarithms

Write the following logarithms as the sum of logarithms. Express exponents as factors.

(a) $\log_2 x^4$ (b) $\log_5 (a^4 b)$
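
Applying the two properties gives $\log_2 x^4 = 4 \log_2 x$ and $\log_5 (a^4 b) = 4 \log_5 a + \log_5 b$. The Python sketch below spot-checks these identities for arbitrarily chosen positive values; the specific numbers are assumptions and carry no special meaning.

```python
import math

# Arbitrary positive test values (assumed for illustration)
x, a, b = 3.0, 2.0, 7.0

# (a) log_2(x^4) versus 4 * log_2(x)
lhs_a = math.log(x ** 4, 2)
rhs_a = 4 * math.log(x, 2)

# (b) log_5(a^4 * b) versus 4 * log_5(a) + log_5(b)
lhs_b = math.log(a ** 4 * b, 5)
rhs_b = 4 * math.log(a, 5) + math.log(b, 5)

print(math.isclose(lhs_a, rhs_a), math.isclose(lhs_b, rhs_b))
```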

If a = 10 in the expression $y = \log_a x$, the resulting logarithm, $y = \log_{10} x$, is called the common logarithm. It is common practice to omit the base, a, when it is equal to 10 and write the common logarithm as $y = \log x$.

EXAMPLE Evaluating Exponential and Logarithmic Expressions

Evaluate the following expressions. Round your answers to three decimal places.

(a) $\log 23$ (b) $10^{2.6}$
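
Both expressions can be evaluated with any calculator that has a common logarithm key. As a sketch, the same evaluation in Python, rounded to three decimal places:

```python
import math

log_23 = round(math.log10(23), 3)   # (a) common logarithm of 23
power = round(10 ** 2.6, 3)         # (b) 10 raised to the power 2.6

print(log_23, power)
```

This gives approximately 1.362 for (a) and 398.107 for (b).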

$y = ab^x$   Exponential Model

$\log y = \log (ab^x)$   Take the common logarithm of both sides

$\log y = \log a + \log b^x$

$\log y = \log a + x \log b$

$Y = A + Bx$, where $Y = \log y$, $A = \log a$, and $B = \log b$, so that

$a = 10^A$ and $b = 10^B$

EXAMPLE 4 Finding the Curve of Best Fit to an Exponential Model

A chemist has a 1000-gram sample of a radioactive material. She records the amount of radioactive material remaining in the sample every day for a week and obtains the following data.

Day    Weight (in grams)
0      1000.0
1      897.1
2      802.5
3      719.8
4      651.1
5      583.4
6      521.7
7      468.3

(a) Draw a scatter diagram of the data treating the day, x, as the predictor variable.

(b) Determine Y = log y and draw a scatter diagram treating the day, x, as the predictor variable and Y = log y as the response variable. Comment on the shape of the scatter diagram.

(c) Find the least-squares regression line of the transformed data.

(d) Determine the exponential equation of best fit and graph it on the scatter diagram obtained in part (a).

(e) Use the exponential equation of best fit to predict the amount of radioactive material left after 8 days.
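
Parts (b) through (e) can be sketched in Python as follows: take the common logarithm of the weights, fit a least-squares line to the transformed data, convert the intercept and slope back with powers of 10, and predict. The variable names and the `predict` helper are illustrative, and the scatter diagrams are omitted.

```python
import math

days = [0, 1, 2, 3, 4, 5, 6, 7]
weights = [1000.0, 897.1, 802.5, 719.8, 651.1, 583.4, 521.7, 468.3]

# (b) Transform the response: Y = log y (common logarithm)
Y = [math.log10(w) for w in weights]

# (c) Least-squares line Y = A + B x for the transformed data
n = len(days)
x_bar = sum(days) / n
Y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in days)
sxY = sum((x - x_bar) * (Yi - Y_bar) for x, Yi in zip(days, Y))
B = sxY / sxx
A = Y_bar - B * x_bar

# (d) Back-transform: a = 10^A, b = 10^B, so y = a * b^x
a = 10 ** A
b = 10 ** B

def predict(x):
    """Predicted weight (grams) after x days under the fitted exponential model."""
    return a * b ** x

print(f"Y = {A:.4f} + ({B:.4f}) x")
print(f"y = {a:.1f} * ({b:.4f})^x")

# (e) Prediction for day 8
print(f"predicted weight after 8 days: {predict(8):.1f} grams")
```

The fitted base b is a little under 0.9, consistent with the sample losing roughly 10% of its weight each day.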

$y = ax^b$   Power Model

$\log y = \log (ax^b)$   Take the common logarithm of both sides

$\log y = \log a + \log x^b$

$\log y = \log a + b \log x$

$Y = A + bX$, where $Y = \log y$, $X = \log x$, and $A = \log a$, so that $a = 10^A$

EXAMPLE Finding the Curve of Best Fit to a Power Model

Cathy wishes to measure the relation between a light bulb’s intensity and the distance from some light source. She measures a 40-watt light bulb’s intensity 1 meter from the bulb and at 0.1-meter intervals up to 2 meters from the bulb and obtains the following data.

Distance (meters)    Intensity
1.0                  0.0972
1.1                  0.0804
1.2                  0.0674
1.3                  0.0572
1.4                  0.0495
1.5                  0.0433
1.6                  0.0384
1.7                  0.0339
1.8                  0.0294
1.9                  0.0268
2.0                  0.0224

(a) Draw a scatter diagram of the data treating the distance, x, as the predictor variable.

(b) Determine X = log x and Y = log y and draw a scatter diagram treating X = log x as the predictor variable and Y = log y as the response variable. Comment on the shape of the scatter diagram.

(c) Find the least-squares regression line of the transformed data.

(d) Determine the power equation of best fit and graph it on the scatter diagram obtained in part (a).

(e) Use the power equation of best fit to predict the intensity of the light if you stand 2.3 meters away from the bulb.
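
Parts (b) through (e) follow the same pattern as the exponential example, except that both variables are transformed. The Python sketch below uses the distance and intensity data given above; the variable names and the `predict` helper are illustrative, and the scatter diagrams are omitted.

```python
import math

distance = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
intensity = [0.0972, 0.0804, 0.0674, 0.0572, 0.0495, 0.0433,
             0.0384, 0.0339, 0.0294, 0.0268, 0.0224]

# (b) Transform both variables: X = log x, Y = log y
X = [math.log10(d) for d in distance]
Y = [math.log10(i) for i in intensity]

# (c) Least-squares line Y = A + b X for the transformed data
n = len(X)
X_bar = sum(X) / n
Y_bar = sum(Y) / n
sXX = sum((Xi - X_bar) ** 2 for Xi in X)
sXY = sum((Xi - X_bar) * (Yi - Y_bar) for Xi, Yi in zip(X, Y))
b = sXY / sXX
A = Y_bar - b * X_bar

# (d) Back-transform the intercept: a = 10^A, so y = a * x^b
a = 10 ** A

def predict(x):
    """Predicted intensity at distance x (meters) under the fitted power model."""
    return a * x ** b

print(f"Y = {A:.4f} + ({b:.4f}) X")
print(f"y = {a:.4f} * x^({b:.4f})")

# (e) Prediction at 2.3 meters
print(f"predicted intensity at 2.3 m: {predict(2.3):.4f}")
```

The fitted exponent comes out close to -2. Note that 2.3 meters lies outside the measured range of 1 to 2 meters, so the prediction in (e) is an extrapolation.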

Modeling is not only a science but also an art form. Selecting an appropriate model requires experience and skill in the field in which you are modeling. For example, knowledge of economics is imperative when trying to determine a model to predict unemployment. The main reason for this is that there are theories in the field that can help the modeler to select appropriate relations and variables.