93
Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc. Chapter Describing the Relation between Two Variables 4

Sfs4e ppt 04

Embed Size (px)

Citation preview

Page 1: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Chapter

Describing the Relation between Two Variables

4

Page 2: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Section

Scatter Diagrams and Correlation

4.1

Page 3: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Objectives

1. Draw and interpret scatter diagrams

2. Describe the properties of the linear correlation coefficient

3. Compute and interpret the linear correlation coefficient

4. Determine whether a linear relation exists between two variables

5. Explain the difference between correlation and causation

4-3

Page 4: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-4

Objective 1

• Draw and interpret scatter diagrams

Page 5: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-5

The response variable is the variable whose value can be explained by the value of the explanatory or predictor variable.

Page 6: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-6

A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The explanatory variable is plotted on the horizontal axis, and the response variable is plotted on the vertical axis.

Page 7: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-7

EXAMPLE Drawing and Interpreting a Scatter DiagramEXAMPLE Drawing and Interpreting a Scatter Diagram

The data shown to the right are based on a study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the explanatory variable, x, and time (in minutes) to drill five feet is the response variable, y. Draw a scatter diagram of the data.Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

Page 8: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-8

Page 9: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-9

Various Types of Relations in a Scatter Diagram

Page 10: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-10

Two variables that are linearly related are positively associated when above-average values of one variable are associated with above-average values of the other variable and below-average values of one variable are associated with below-average values of the other variable. That is, two variables are positively associated if, whenever the value of one variable increases, the value of the other variable also increases.

Page 11: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-11

Two variables that are linearly related are negatively associated when above-average values of one variable are associated with below-average values of the other variable. That is, two variables are negatively associated if, whenever the value of one variable increases, the value of the other variable decreases.

Page 12: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-12

Objective 2

• Describe the Properties of the Linear Correlation Coefficient

Page 13: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-13

The linear correlation coefficient or Pearson product moment correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The Greek letter ρ (rho) represents the population correlation coefficient, and r represents the sample correlation coefficient. We present only the formula for the sample correlation coefficient.

Page 14: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-14

Sample Linear Correlation Coefficient

where is the sample mean of the explanatory variablesx is the sample standard deviation of the explanatory variable is the sample mean of the response variablesy is the sample standard deviation of the response variablen is the number of individuals in the sample

x

y

r

xi xsx

yi ysy

n 1

Page 15: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-15

Properties of the Linear Correlation Coefficient

1.The linear correlation coefficient is always between –1 and 1, inclusive. That is, –1 ≤ r ≤ 1.

2.If r = + 1, then a perfect positive linear relation exists between the two variables.

3.If r = –1, then a perfect negative linear relation exists between the two variables.

4.The closer r is to +1, the stronger is the evidence of positive association between the two variables.

5.The closer r is to –1, the stronger is the evidence of negative association between the two variables.

Page 16: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-16

6. If r is close to 0, then little or no evidence exists of a linear relation between the two variables. So r close to 0 does not imply no relation, just no linear relation.

7. The linear correlation coefficient is a unitless measure of association. So the unit of measure for x and y plays no role in the interpretation of r.

8. The correlation coefficient is not resistant. Therefore, an observation that does not follow the overall pattern of the data could affect the value of the linear correlation coefficient.

Page 17: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-17

Page 18: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-18

Page 19: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-19

Objective 3

• Compute and Interpret the Linear Correlation Coefficient

Page 20: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-20

EXAMPLE Determining the Linear Correlation CoefficientEXAMPLE Determining the Linear Correlation Coefficient

Determine the linear correlation coefficient of the drilling data.

Page 21: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-21

xi x

sx

yi y

sy

xi x

sx

yi y

sy

xi x

sx

yi y

sy

x y

Page 22: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-22

r

xi xsx

yi y

sy

n 1

8.501037

12 10.773

Page 23: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-23

Page 24: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-24

Objective 4

• Determine Whether a Linear Relation Exists between Two Variables

Page 25: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-25

Testing for a Linear Relation

Step 1 Determine the absolute value of the correlation coefficient.

Step 2 Find the critical value in Table II from Appendix A for the given sample size.

Step 3 If the absolute value of the correlation coefficient is greater than the critical value, we say a linear relation exists between the two variables. Otherwise, no linear relation exists.

Page 26: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-26

EXAMPLE Does a Linear Relation Exist? EXAMPLE Does a Linear Relation Exist?

Determine whether a linear relation exists between time to drill five feet and depth at which drilling begins. Comment on the type of relation that appears to exist between time to drill five feet and depth at which drilling begins.

The correlation between drilling depth and time to drill is 0.773. The critical value for n = 12 observations is 0.576. Since 0.773 > 0.576, there is a positive linear relation between time to drill five feet and depth at which drilling begins.

Page 27: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-27

Objective 5

• Explain the Difference between Correlation and Causation

Page 28: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-28

According to data obtained from the Statistical Abstract of the United States, the correlation between the percentage of the female population with a bachelor’s degree and the percentage of births to unmarried mothers since 1990 is 0.940.

Does this mean that a higher percentage of females with bachelor’s degrees causes a higher percentage of births to unmarried mothers?

Page 29: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-29

Certainly not! The correlation exists only because both percentages have been increasing since 1990. It is this relation that causes the high correlation. In general, time series data (data collected over time) may have high correlations because each variable is moving in a specific direction over time (both going up or down over time; one increasing, while the other is decreasing over time).

When data are observational, we cannot claim a causal relation exists between two variables. We can only claim causality when the data are collected through a designed experiment.

Page 30: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-30

Another way that two variables can be related even though there is not a causal relation is through a lurking variable.

A lurking variable is related to both the explanatory and response variable.

For example, ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops? No! The lurking variable is temperature. As air temperatures rise, both ice cream sales and crime rates rise.

Page 31: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-31

EXAMPLE Lurking Variables in a Bone Mineral Density StudyEXAMPLE Lurking Variables in a Bone Mineral Density Study

Because colas tend to replace healthier beverages and colas contain caffeine and phosphoric acid, researchers Katherine L. Tucker and associates wanted to know whether cola consumption is associated with lower bone mineral density in women. The table lists the typical number of cans of cola consumed in a week and the femoral neck bone mineral density for a sample of 15 women. The data were collected through a prospective cohort study.

Page 32: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-32

EXAMPLE Lurking Variables in a Bone Mineral Density StudyEXAMPLE Lurking Variables in a Bone Mineral Density Study

The figure on the next slide shows the scatter diagram of the data. The correlation between number of colas per week and bone mineral density is –0.806.The critical value for correlation with n = 15 from Table II in Appendix A is 0.514. Because |–0.806| > 0.514, we conclude a negative linear relation exists between number of colas consumed and bone mineral density. Can the authors conclude that an increase in the number of colas consumed causes a decrease in bone mineral density? Identify some lurking variables in the study.

Page 33: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-33

EXAMPLE Lurking Variables in a Bone Mineral Density StudyEXAMPLE Lurking Variables in a Bone Mineral Density Study

Page 34: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-34

EXAMPLE Lurking Variables in a Bone Mineral Density StudyEXAMPLE Lurking Variables in a Bone Mineral Density Study

In prospective cohort studies, data are collected on a group of subjects through questionnaires and surveys over time. Therefore, the data are observational. So the researchers cannot claim that increased cola consumption causes a decrease in bone mineral density.Some lurking variables in the study that could confound the results are:• body mass index• height• smoking• alcohol consumption• calcium intake • physical activity

Page 35: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-35

EXAMPLE Lurking Variables in a Bone Mineral Density StudyEXAMPLE Lurking Variables in a Bone Mineral Density Study

The authors were careful to say that increased cola consumption is associated with lower bone mineral density because of potential lurking variables. They never stated that increased cola consumption causes lower bone mineral density.

Page 36: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Section

Least-squares Regression

4.2

Page 37: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-37

Page 38: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Objectives

1. Find the least-squares regression line and use the line to make predictions

2. Interpret the slope and the y-intercept of the least-squares regression line

3. Compute the sum of squared residuals

4-38

Page 39: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-39

Using the following sample data:

Using (2, 5.7) and (6, 1.9):

m 5.7 1.9

2 6 0.95

y y1 m x x1 y 5.7 0.95 x 2 y 5.7 0.95x 1.9

y 0.95x 7.6

EXAMPLE Finding an Equation that Describes Linearly Relate DataEXAMPLE Finding an Equation that Describes Linearly Relate Data

(a) Find a linear equation that relates x (the explanatory variable) and y (the response variable) by selecting two points and finding the equation of the line containing the points.

Page 40: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-40

(b) Graph the equation on the scatter diagram.

(c) Use the equation to predict y if x = 3.

y 0.95x 7.6

0.95(3) 7.6

4.75

Page 41: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-41

Objective 1

• Find the least-squares regression line and use the line to make predictions

Page 42: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-42

}(3, 5.2)residual = observed y – predicted y = 5.2 – 4.75 = 0.45

The difference between the observed value of y and the predicted value of y is the error, or residual. Using the line from the last example, and the predicted value at x = 3:

residual = observed y – predicted y = 5.2 – 4.75 = 0.45

Page 43: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-43

Least-Squares Regression Criterion

The least-squares regression line is the line that minimizes the sum of the squared errors (or residuals). This line minimizes the sum of the squared vertical distance between the observed values of y and those predicted by the line , (“y-hat”). We represent this as

“ minimize Σ residuals2 ”.

y

Page 44: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-44

The Least-Squares Regression Line

The equation of the least-squares regression line is given by

y b1x b0

where

where

is the slope of the least-squares regression line

b1 r sy

sx

b0 y b1xis the y-intercept of the least-squares regression line

Page 45: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-45

The Least-Squares Regression Line

Note: is the sample mean and sx is the sample standard deviation of the explanatory variable x ; is the sample mean and sy is the sample standard deviation of the response variable y.

y

x

Page 46: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-46

EXAMPLE Finding the Least-squares Regression LineEXAMPLE Finding the Least-squares Regression Line

Using the drilling data

(a)Find the least-squares regression line.(b)Predict the drilling time if drilling starts at 130 feet.(c)Is the observed drilling time at 130 feet above, or below, average.(d)Draw the least-squares regression line on the scatter diagram of the data.

Page 47: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-47

(a) We agree to round the estimates of the slope and intercept to four decimal places.

(b)

(c) The observed drilling time is 6.93 seconds. The predicted drilling time is 7.035 seconds. The drilling time of 6.93 seconds is below average.

y 0.0116x 5.5273

y 0.0116x 5.5273

0.0116(130) 5.5273

7.035

Page 48: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-48

(d)

Page 49: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-49

Objective 2

• Interpret the Slope and the y-intercept of the Least-Squares Regression Line

Page 50: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-50

Interpretation of Slope:The slope of the regression line is 0.0116. For each additional foot of depth we start drilling, the time to drill five feet increases by 0.0116 minutes, on average.

Page 51: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-51

Interpretation of the y-Intercept: The y-intercept of the regression line is 5.5273. To interpret the y-intercept, we must first ask two questions:1. Is 0 a reasonable value for the explanatory variable? 2. Do any observations near x = 0 exist in the data set?

A value of 0 is reasonable for the drilling data (this indicates that drilling begins at the surface of Earth. The smallest observation in the data set is x = 35 feet, which is reasonably close to 0. So, interpretation of the y-intercept is reasonable. The time to drill five feet when we begin drilling at the surface of Earth is 5.5273 minutes.

Page 52: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-52

If the least-squares regression line is used to make predictions based on values of the explanatory variable that are much larger or much smaller than the observed values, we say the researcher is working outside the scope of the model. Never use a least-squares regression line to make predictions outside the scope of the model because we can’t be sure the linear relation continues to exist.

Page 53: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-53

Objective 3

• Compute the Sum of Squared Residuals

Page 54: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-54

To illustrate the fact that the sum of squared residuals for a least-squares regression line is less than the sum of squared residuals for any other line, use the “regression by eye” applet.

Page 55: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Section

The Coefficient of Determination

4.3

Page 56: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-56

Objective 1

• Compute and Interpret the Coefficient of Determination

Page 57: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-57

The coefficient of determination, R2, measures the proportion of total variation in the response variable that is explained by the least-squares regression line.

The coefficient of determination is a number between 0 and 1, inclusive. That is, 0 < R2 < 1.

If R2 = 0 the line has no explanatory value

If R2 = 1 means the line explains 100% of the variation in the response variable.

Page 58: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-58

The data to the right are based on a study for drilling rock. The researchers wanted to determine whether the time it takes to dry drill a distance of 5 feet in rock increases with the depth at which the drilling begins. So, depth at which drilling begins is the predictor variable, x, and time (in minutes) to drill five feet is the response variable, y.Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.

Page 59: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-59

Page 60: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-60

Regression Analysis

The regression equation isTime = 5.53 + 0.0116 Depth

Sample Statistics

Mean Standard Deviation

Depth 126.2 52.2

Time 6.99 0.781

Correlation Between Depth and Time: 0.773

Page 61: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-61

Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”?

Page 62: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-62

Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”?

ANSWER:

The mean time to drill an additional 5 feet: 6.99 minutes

Page 63: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-63

Now suppose that we are asked to predict the time to drill an additional 5 feet if the current depth of the drill is 160 feet?

ANSWER:

Our “guess” increased from 6.99 minutes to 7.39 minutes based on the knowledge that drill depth is positively associated with drill time.

y 5.53 0.0116(160) 7.39

Page 64: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-64

Page 65: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-65

The difference between the observed value of the response variable and the mean value of the response variable is called the total deviation and is equal to

y y

Page 66: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-66

The difference between the predicted value of the response variable and the mean value of the response variable is called the explained deviation and is equal to

y y

Page 67: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-67

The difference between the observed value of the response variable and the predicted value of the response variable is called the unexplained deviation and is equal to

y y

Page 68: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-68

Total Deviation

y y y y y y

Unexplained Deviation

Explained Deviation

+=

Page 69: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-69

Total Deviation

y y 2 y y 2 y y 2

Unexplained Deviation

Explained Deviation

+=

Page 70: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-70

Total Variation = Unexplained Variation + Explained Variation

1 =Unexplained Variation Explained Variation

Unexplained VariationExplained Variation

Total Variation Total Variation

Total VariationTotal Variation

+

= 1 – R2 =

Page 71: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-71

To determine R2 for the linear regression model simply square the value of the linear correlation coefficient.To determine R2 for the linear regression model simply square the value of the linear correlation coefficient.

Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least-squares linear regression model

The method does not work in general.

y b1x b0

Page 72: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-72

EXAMPLE Determining the Coefficient of DeterminationEXAMPLE Determining the Coefficient of Determination

Find and interpret the coefficient of determination for the drilling data.

Because the linear correlation coefficient, r, is 0.773, we have that

R2 = 0.7732 = 0.5975 = 59.75%.

So, 59.75% of the variability in drilling time is explained by the least-squares regression line.

Page 73: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-73

Draw a scatter diagram for each of these data sets. For each data set, the variance of y is 17.49.

Page 74: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-74

Data Set A Data Set B Data Set C

Data Set A: 99.99% of the variability in y is explained by the least-squares regression line

Data Set B: 94.7% of the variability in y is explained by the least-squares regression line

Data Set C: 9.4% of the variability in y is explained by the least-squares regression line

Page 75: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Section

Contingency Tables and Association

4.4

Page 76: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.

Objectives

1. Compute the marginal distribution of a variable

2. Use the conditional distribution to identify association among categorical data

3. Explain Simpson’s Paradox

4-76

Page 77: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-77

A professor at a community college in New Mexico conducted a study to assess the effectiveness of delivering an introductory statistics course via traditional lecture-based method, online delivery (no classroom instruction), and hybrid instruction (online course with weekly meetings) methods, the grades students received in each of the courses were tallied.

The table is referred to as a contingency table, or two-way table, because it relates two categories of data. The row variable is grade, because each row in the table describes the grade received for each group. The column variable is delivery method. Each box inside the table is referred to as a cell.

Page 78: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-78

Objective 1

• Compute the Marginal Distribution of a Variable

Page 79: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-79

A marginal distribution of a variable is a frequency or relative frequency distribution of either the row or column variable in the contingency table.

Page 80: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-80

EXAMPLE Determining Frequency Marginal DistributionsEXAMPLE Determining Frequency Marginal Distributions

A professor at a community college in New Mexico conducted a study to assess the effectiveness of delivering an introductory statistics course via traditional lecture-based method, online delivery (no classroom instruction), and hybrid instruction (online course with weekly meetings) methods, the grades students received in each of the courses were tallied. Find the frequency marginal distributions for course grade and delivery method.

Page 81: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-81

EXAMPLE Determining Relative Frequency Marginal DistributionsEXAMPLE Determining Relative Frequency Marginal Distributions

Determine the relative frequency marginal distribution for course grade and delivery method.

Page 82: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-82

Objective 2

• Use the Conditional Distribution to Identify Association among Categorical Data

Page 83: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-83

A conditional distribution lists the relative frequency of each category of the response variable, given a specific value of the explanatory variable in the contingency table.

Page 84: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-84

EXAMPLE Determining a Conditional DistributionEXAMPLE Determining a Conditional Distribution

Construct a conditional distribution of course grade by method of delivery. Comment on any type of association that may exist between course grade and delivery method.

It appears that students in the hybrid course are more likely to pass (A, B, or C) than the other two methods.

Page 85: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-85

EXAMPLE Drawing a Bar Graph of a Conditional DistributionEXAMPLE Drawing a Bar Graph of a Conditional Distribution

Using the results of the previous example, draw a bar graph that represents the conditional distribution of method of delivery by grade earned.

Page 86: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-86

The following contingency table shows the survival status and demographics of passengers on the ill-fated Titanic.Draw a conditional bar graph of survival status by demographic characteristic.

Page 87: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-87

Objective 1

• Explain Simpson’s Paradox

Page 88: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-88

Insulin dependent (or Type 1) diabetes is a disease that results in the permanent destruction of insulin-producing beta cells of the pancreas. Type 1 diabetes is lethal unless treatment with insulin injections replaces the missing hormone. Individuals with insulin independent (or Type 2) diabetes can produce insulin internally. The data shown in the table below represent the survival status of 902 patients with diabetes by type over a 5-year period.

Type 1 Type 2 Total

Survived 253 326 579

Died 105 218 323

358 544 902

EXAMPLE Illustrating Simpson’s ParadoxEXAMPLE Illustrating Simpson’s Paradox

Page 89: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-89

EXAMPLE Illustrating Simpson’s ParadoxEXAMPLE Illustrating Simpson’s Paradox

From the table, the proportion of patients with Type 1 diabetes who died was 105/358 = 0.29; the proportion of patients with Type 2 diabetes who died was 218/544 = 0.40. Based on this, we might conclude that Type 2 diabetes is more lethal than Type 1 diabetes.

Type 1 Type 2 Total

Survived 253 326 579

Died 105 218 323

358 544 902

Page 90: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-90

However, Type 2 diabetes is usually contracted after the age of 40. If we account for the variable age and divide our patients into two groups (those 40 or younger and those over 40), we obtain the data in the table below.

Type 1 Type 2 Total

< 40 > 40 < 40 > 40

Survived 129 124 15 311 579

Died 1 104 0 218 323

130 228 15 529 902

Page 91: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-91

Of the diabetics 40 years of age or younger, the proportion of those with Type 1 diabetes who died is 1/130 = 0.008; the proportion of those with Type 2 diabetes who died is 0/15 = 0.

Type 1 Type 2 Total

< 40 > 40 < 40 > 40

Survived 129 124 15 311 579

Died 1 104 0 218 323

130 228 15 529 902

Page 92: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-92

Of the diabetics over 40 years of age, the proportion of those with Type 1 diabetes who died is 104/228 = 0.456; the proportion of those with Type 2 diabetes who died is 218/529 = 0.412.

The lurking variable age led us to believe that Type 2 diabetes is the more dangerous type of diabetes.

Type 1 Type 2 Total

< 40 > 40 < 40 > 40

Survived 129 124 15 311 579

Died 1 104 0 218 323

130 228 15 529 902

Page 93: Sfs4e ppt 04

Copyright © 2014, 2013, 2010 and 2007 Pearson Education, Inc.4-93

Simpson’s Paradox represents a situation in which an association between two variables inverts or goes away when a third variable is introduced to the analysis.