
Chapter 4: More on Two-Variable Data


Page 1: Chapter 4: More on Two-Variable Data

1

Chapter 4: More on Two-Variable Data

4.1 Transforming Relationships

4.2 Cautions

4.3 Relations in Categorical Data

[Figure: g dye/kg fiber (vertical axis) plotted against time in minutes (horizontal axis)]

Page 2: Chapter 4: More on Two-Variable Data

2

Example

Year    Cell Phone Users (thousands)
1990     5,283
1993    16,009
1994    24,134
1995    33,786
1996    44,043
1997    55,312
1998    69,209
1999    86,047

Page 3: Chapter 4: More on Two-Variable Data

3

Scatterplot for Cell Phone Example

Page 4: Chapter 4: More on Two-Variable Data

4

Residuals Plot

Page 5: Chapter 4: More on Two-Variable Data

5

What’s going on here?

• Do the data (y) increase by a constant amount each year?

– This would suggest a linear model.

• Or, do the data increase by a fixed percentage each year? That is, can you multiply the y-value by a fixed number to get the next year’s number, and then multiply that number by the fixed number to get the following year’s number?

– This would suggest an exponential model.
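
One way to check both questions is to compute year-to-year differences and ratios for the cell phone data. The sketch below uses the eight values from the example; it skips the 1990 to 1993 step, because that step spans three years and is not comparable to the single-year steps. The variable names are illustrative only.

```python
years = [1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999]
users = [5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047]  # thousands

# Use only consecutive years (1993 onward); 1990 -> 1993 is a three-year gap.
start = 1
differences = [users[i + 1] - users[i] for i in range(start, len(users) - 1)]
ratios = [users[i + 1] / users[i] for i in range(start, len(users) - 1)]

for i, (d, r) in enumerate(zip(differences, ratios), start=start):
    print(f"{years[i]} -> {years[i + 1]}: difference = {d:6d}, ratio = {r:.3f}")
```

The differences grow every year, which argues against a linear model. The ratios are not perfectly constant either, but they stay within a fairly narrow band, which is why an exponential model is worth trying.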

Page 6: Chapter 4: More on Two-Variable Data

6

Transformation of the Variables

• The next step is to apply a mathematical transformation that changes exponential growth into linear growth.

– The transformation that can help here is to take the logarithm of the y-variable, then re-plot and re-calculate the LSR.
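
A minimal sketch of this step, assuming a base-10 logarithm and using NumPy's `polyfit` for the least-squares line (the slides use a calculator; NumPy is a substitute here, not the author's tool):

```python
import numpy as np

years = np.array([1990, 1993, 1994, 1995, 1996, 1997, 1998, 1999], dtype=float)
users = np.array([5283, 16009, 24134, 33786, 44043, 55312, 69209, 86047], dtype=float)

# Transform the response variable, then fit a straight line to (year, log10(users)).
log_users = np.log10(users)
slope, intercept = np.polyfit(years, log_users, 1)

# Residuals on the log scale; a patternless residual plot supports the model.
residuals = log_users - (intercept + slope * years)

print(f"log10(y) = {intercept:.2f} + {slope:.5f} x")
print("residuals:", np.round(residuals, 3))
```

The fitted coefficients should land close to the model shown on the slides, which suggests the slides also used a base-10 logarithm.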

Page 7: Chapter 4: More on Two-Variable Data

7

New LSR, with Transformed y

Residuals Plot

Page 8: Chapter 4: More on Two-Variable Data

8

We are dealing with a transformed y-value!

• Model: $\log y = -263.20 + 0.13417x$

• In order to use the model for prediction, we must “undo” the logarithm transformation to return to the original units of measurement.

– How do we do this?

• Now use the new model to predict cell phone subscribers for 2000.
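
Raising 10 to the power of both sides undoes a base-10 logarithm, so the prediction is y = 10^(a + bx). The sketch below assumes the model reads log y = -263.20 + 0.13417x with a base-10 log; the function name is hypothetical.

```python
# Coefficients as read from the slide (assumption: base-10 logarithm,
# negative intercept): log10(y) = INTERCEPT + SLOPE * year
INTERCEPT = -263.20
SLOPE = 0.13417


def predict_users_thousands(year):
    """Undo the log transformation: y = 10 ** (INTERCEPT + SLOPE * year)."""
    return 10 ** (INTERCEPT + SLOPE * year)


prediction_2000 = predict_users_thousands(2000)
print(f"Predicted subscribers in 2000: about {prediction_2000:,.0f} thousand")
```

Because the coefficients sit in an exponent, rounding them shifts the prediction noticeably, so it is safer to carry full calculator precision when possible.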

Page 9: Chapter 4: More on Two-Variable Data

9

How do we predict for year 2000?

Page 10: Chapter 4: More on Two-Variable Data

10

Plotting our original data vs. our exponential model …

Page 11: Chapter 4: More on Two-Variable Data

11

Homework

• Problem 4.6, p. 212

• Problem 4.11, p. 213

• Reading: pp. 203-215

Page 12: Chapter 4: More on Two-Variable Data

12

Power Law Models

• General form of a power law model: $y = ax^p$

• Biologists have found that many characteristics of living things are described quite closely by power laws.

– For example, the rate at which animals use energy goes up as the ¾ power of their body weight (Kleiber’s Law).

Page 13: Chapter 4: More on Two-Variable Data

13

LSR and Power Law Models

• As we saw in the last section, exponential growth models become linear when we apply the logarithm transformation to the response variable y.

• Power law models become linear when we apply the logarithm transformation to both variables, x and y.

Page 14: Chapter 4: More on Two-Variable Data

14

Log Transformations for Power Law Models

$y = ax^p$

$\log y = \log(ax^p)$

$\log y = \log a + p \log x$

• Looking carefully at the last equation, the power (p) becomes the slope of the straight line that links log y to log x.

– We can estimate what power (p) the law involves by regressing log y on log x and using the slope of the regression line to estimate the power.
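
To see the slope-equals-power idea in action, the sketch below generates synthetic data from a known power law (the values a = 2 and p = 0.75 are made up for illustration, not taken from the slides) and recovers both constants by regressing log y on log x.

```python
import numpy as np

# Synthetic, noise-free data from y = a * x**p (hypothetical constants).
a_true, p_true = 2.0, 0.75
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 100.0, 500.0])
y = a_true * x ** p_true

# Regress log y on log x: the slope estimates p, the intercept estimates log a.
slope, intercept = np.polyfit(np.log10(x), np.log10(y), 1)
p_hat = slope
a_hat = 10 ** intercept

print(f"estimated power p = {p_hat:.4f}")
print(f"estimated constant a = {a_hat:.4f}")
```

With real data the points will not fall exactly on the line, so the slope is only an estimate of the power.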

Page 15: Chapter 4: More on Two-Variable Data

15

Problem 4.13, p. 219

Page 16: Chapter 4: More on Two-Variable Data

16

Problem 4.13, p. 219

Log of Both Variables

Page 17: Chapter 4: More on Two-Variable Data

17

Residuals Analysis (Transformed Data)

Page 18: Chapter 4: More on Two-Variable Data

18

Undoing the Transformation

• Let’s do the math to see what we need:

$\log y = 0.76172 + 0.218215 \log x$
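
A sketch of the algebra, assuming the model reads log y = 0.76172 + 0.218215 log x with base-10 logs and positive signs (the signs are not legible in the extracted slide text). Raising 10 to both sides gives y = 10^0.76172 · x^0.218215. The function names and the example input are hypothetical.

```python
import math

# Coefficients as read from the slide (assumed signs and base-10 logs).
INTERCEPT = 0.76172
SLOPE = 0.218215


def predict_via_log_model(x):
    """Evaluate the fitted line on the log scale, then undo the log of y."""
    return 10 ** (INTERCEPT + SLOPE * math.log10(x))


def predict_via_power_law(x):
    """Same prediction written as a power law: y = a * x**p with a = 10**INTERCEPT."""
    a = 10 ** INTERCEPT
    return a * x ** SLOPE


x_example = 65.0  # hypothetical input value, in whatever units x was measured
print(f"a = 10**{INTERCEPT} = {10 ** INTERCEPT:.3f}")
print(f"log-model form: {predict_via_log_model(x_example):.3f}")
print(f"power-law form: {predict_via_power_law(x_example):.3f}")
```

Both forms give the same number, which confirms that undoing the transformation turns the straight line back into a power law.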

Page 19: Chapter 4: More on Two-Variable Data

19

Predicting Lifespan for Humans

Page 20: Chapter 4: More on Two-Variable Data

20

• HW Problem:

– 4.14, p. 220

Page 21: Chapter 4: More on Two-Variable Data

21

Warm-Up Problem

• 4.25, pp. 224-225

• Create appropriate model

• Predict seed count for tree with seed weight of 1,000 mg.

Page 22: Chapter 4: More on Two-Variable Data

22

4.25: Log of both L1 and L2

[Calculator screens I through V, with captions including “Axes off to see trend” and “Y2 vs. original data”]

Page 23: Chapter 4: More on Two-Variable Data

23

4.2 Cautions about Correlation and Regression

• The correlation (r) and the LSR line are not resistant.

• As we have seen, extrapolation is often dangerous.

– Extrapolation means predicting beyond the range of x-values for which the model was developed.
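
"Not resistant" means a single unusual point can change r and the LSR line dramatically. The sketch below uses made-up data (not from the textbook) that lie close to a line, then adds one outlier.

```python
import numpy as np

# Hypothetical data lying close to the line y = 2x.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

r_clean = np.corrcoef(x, y)[0, 1]
slope_clean = np.polyfit(x, y, 1)[0]

# Add a single outlier far from the pattern of the other points.
x_out = np.append(x, 20.0)
y_out = np.append(y, 2.0)

r_outlier = np.corrcoef(x_out, y_out)[0, 1]
slope_outlier = np.polyfit(x_out, y_out, 1)[0]

print(f"without outlier: r = {r_clean:.3f}, slope = {slope_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}, slope = {slope_outlier:.3f}")
```

One added point is enough to wipe out an almost perfect correlation in this example.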

Page 24: Chapter 4: More on Two-Variable Data

24

The French Paradox

• The paradox refers to the fact that the French have long had low rates of heart disease (Japan is the only developed country with a lower rate), despite a diet relatively rich in saturated animal fats. The French propensity to drink wine the way some Americans guzzle soft drinks has been cited as a likely explanation of the paradox, since numerous studies have indicated that alcohol consumed in moderation helps to prevent atherosclerosis, or accumulation of fatty deposits in arteries, which is the underlying cause of most heart attacks.

+ from NY Times article

Page 25: Chapter 4: More on Two-Variable Data

25

Lurking Variables

• As we discussed in the example of wine consumption vs. incidence of heart disease, variables that were not measured in a study may influence how we interpret the relationship between the variables that were measured.

– These are called lurking variables.

• It is possible to show, for example, that there is a high correlation between shoe size and intelligence for a group of children varying in age from, say, 4 to 15.

– What is the lurking variable?

• To control for age, we can calculate the correlation between shoe size and IQ separately for each age.

– Age 4, 5, 6, …
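
A small simulation can illustrate the idea. Everything below is hypothetical: shoe size and a raw test score are each generated from age plus independent random noise, so age is the only link between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 simulated children at each age from 4 to 15 (hypothetical data).
ages = np.repeat(np.arange(4, 16), 200)
shoe_size = 0.5 * ages + rng.normal(0.0, 0.5, size=ages.size)
test_score = 5.0 * ages + rng.normal(0.0, 5.0, size=ages.size)

# Correlation across all ages pooled together.
overall_r = np.corrcoef(shoe_size, test_score)[0, 1]

# Correlation within each age group, which controls for age.
within_r = {
    int(age): np.corrcoef(shoe_size[ages == age], test_score[ages == age])[0, 1]
    for age in np.unique(ages)
}

print(f"overall r = {overall_r:.2f}")
for age, r in within_r.items():
    print(f"age {age:2d}: r = {r:+.2f}")
```

Pooled over all ages the correlation is strong, but within any single age it is close to zero, because age drives both variables.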

Page 26: Chapter 4: More on Two-Variable Data

26

Correlation Between Shoe Size and IQ? (Common Response)

[Diagram: Age influences both Shoe Size and IQ]

Page 27: Chapter 4: More on Two-Variable Data

27

See Figure 4.18, p. 227

Page 28: Chapter 4: More on Two-Variable Data

28

Lurking Variables That Change Over Time

• Many lurking variables change systematically over time.

• One useful method for detecting lurking variables is to plot both the response variable and the regression residuals against the time order of the observations (whenever the time order is available).

• See Example 4.12, p. 228

Page 29: Chapter 4: More on Two-Variable Data

29

Page 30: Chapter 4: More on Two-Variable Data

30

Using Averaged Data

• Be careful when applying the results of a study that uses averages to individuals.

• Problem 4.31, p. 231

Page 31: Chapter 4: More on Two-Variable Data

31

Causation

• Simply put, a strong correlation between two variables says nothing about one variable causing the other. One variable may in fact cause the other to change, but a correlation or LSR line cannot tell us that.

– More investigation is needed!

• A designed study with proper experimental controls should be used.

Page 32: Chapter 4: More on Two-Variable Data

32

Figure 4.22, p. 232

• Causation

• Common Response

• Confounding

Page 33: Chapter 4: More on Two-Variable Data

33

Confounding

• The effects of two variables on a response variable are said to be confounded when they cannot be distinguished from one another.

– Definition: Two or more variables that might have caused an effect were simultaneously present, so that we do not know to which to attribute the effect.

– See 1, Example 4.13 (p. 232), and explanation, p. 233, top of p. 234.

• Does this mean that we cannot ever suggest causation?

– Read the two paragraphs on p. 235 (establishing causation).

Page 34: Chapter 4: More on Two-Variable Data

34

Causation

• Example 4.14, p. 232

– Numbers 1 and 2 (p. 233)

Page 35: Chapter 4: More on Two-Variable Data

35

Common Response

• Example 4.15, p. 233

Page 36: Chapter 4: More on Two-Variable Data

36

Homework

• Reading through p. 240

Page 37: Chapter 4: More on Two-Variable Data

37

Problems

• Problems on p. 237:

– 4.33, 4.34, 4.35

• 4.73, p.257

Page 38: Chapter 4: More on Two-Variable Data

38

Problem 4.73, p. 257

A power law model might fit best, so take the log of L1 and L2. The plot below is of L3 and L4.

Page 39: Chapter 4: More on Two-Variable Data

39

4.73, cont.

The pendulum period is proportional to the square root of its length.
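
In power law terms, this statement predicts what the log-log regression should show. Writing the period as $T$, the length as $\ell$, and the constant of proportionality as $k$ (symbols chosen here for illustration, not taken from the slides):

```latex
T = k\,\ell^{1/2}
\quad\Longrightarrow\quad
\log T = \log k + \tfrac{1}{2}\log \ell
```

So a slope near 0.5 when regressing log period on log length is consistent with the square-root relationship.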

Page 40: Chapter 4: More on Two-Variable Data

40

4.3 Relations in Categorical Variables

• There are many relationships of interest to us that cannot be described by using correlation and LSR techniques.

– Recall that correlation and LSR require both variables to be quantitative.

• Often, we want to study the relationship between two variables that are inherently categorical.

Page 41: Chapter 4: More on Two-Variable Data

41

Two-Way Table (Ex. 4.19, p. 241)

Education              Age 25 to 34   Age 35 to 54   Age 55+     Total
Did not complete HS           4,474          9,155    14,224    27,853
Completed HS                 11,546         26,481    20,060    58,087
1-3 yrs college              10,700         22,618    11,127    44,445
4+ yrs college               11,066         23,183    10,596    44,845
Total                        37,786         81,435    56,008   175,230

Each count in the body of the table is a cell.

Page 42: Chapter 4: More on Two-Variable Data

42

Two-Way Table

• The row variable is level of education.

– In this study, is level of education the explanatory or response variable?

• The column variable is age.

– Explanatory or response?

• Marginal distributions:

– The distributions of education alone and age alone are called marginal distributions because their totals are in the margins: education at the right, and age at the bottom.

Page 43: Chapter 4: More on Two-Variable Data

43

Marginal Distributions

• It is often advantageous to display the marginal distribution in percents instead of raw numbers.

[Bar graph: Education Level in U.S. (adults age 25+). Horizontal axis: Years of Schooling. Vertical axis: Percent of Total, 0 to 50.]

Years of Schooling        Percent of Total
No high school degree     15.9
High school only          33.1
1-3 years of college      25.4
4+ years of college       25.6
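
The marginal percents for education can be reproduced from the two-way table in Example 4.19 by dividing each row total by the grand total. This is a sketch in plain Python; the dictionary layout is just one convenient choice.

```python
# Cell counts from the two-way table, in column order: 25 to 34, 35 to 54, 55+.
counts = {
    "Did not complete HS": [4474, 9155, 14224],
    "Completed HS": [11546, 26481, 20060],
    "1-3 yrs college": [10700, 22618, 11127],
    "4+ yrs college": [11066, 23183, 10596],
}

row_totals = {level: sum(cells) for level, cells in counts.items()}
grand_total = sum(row_totals.values())

# Marginal distribution of education, as percents of the grand total.
marginal_pct = {level: 100 * total / grand_total for level, total in row_totals.items()}

for level, pct in marginal_pct.items():
    print(f"{level:<20} {pct:5.1f}%")
```

Rounded to one decimal place, these match the heights of the bars in the graph.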

Page 44: Chapter 4: More on Two-Variable Data

44

Conditional Distributions

• The previous graph looked at the breakdown of education levels for the entire population. Many times, however, we are looking for breakdowns (i.e., distributions) for a certain group within the population.

– For example, of those people with 4+ years of college, look at the distribution across age groups.

– Let’s complete a bar graph for this comparison.

– This is a conditional distribution.
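
As a sketch of the calculation, the conditional distribution of age among people with 4+ years of college uses only that row of the two-way table: each cell is divided by the row total.

```python
# Row "4+ yrs college" from the two-way table in Example 4.19.
college_4plus = {"25 to 34": 11066, "35 to 54": 23183, "55+": 10596}

row_total = sum(college_4plus.values())

# Conditional distribution of age, given 4+ years of college.
conditional_pct = {age: 100 * n / row_total for age, n in college_4plus.items()}

for age, pct in conditional_pct.items():
    print(f"{age:<10} {pct:5.1f}%")
```

The three percents add to 100 because they describe one group (college graduates) split by age.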

Page 45: Chapter 4: More on Two-Variable Data

45

One Conditional Distribution for Example 4.19

[Bar graph: Breakdown by age group of people with 4+ years of college. Horizontal axis: Age Group. Vertical axis: Percent, 0 to 60.]

Age Group    Percent
25-34        24.7
35-54        51.7
55+          23.6

Page 46: Chapter 4: More on Two-Variable Data

46

Different Question

• What proportion of each age group received 4+ years of college education?
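
This question conditions on age instead of education, so each 4+ years cell is divided by its column total. The sketch below sums the cells to get the column totals; two of those sums differ slightly from the totals printed in the table, presumably because the published counts are rounded.

```python
age_groups = ["25 to 34", "35 to 54", "55+"]

# Cell counts from the two-way table in Example 4.19, in age_groups order.
counts = {
    "Did not complete HS": [4474, 9155, 14224],
    "Completed HS": [11546, 26481, 20060],
    "1-3 yrs college": [10700, 22618, 11127],
    "4+ yrs college": [11066, 23183, 10596],
}

# Column totals computed from the cells (not copied from the printed Total row).
column_totals = [sum(row[i] for row in counts.values()) for i in range(len(age_groups))]

# Percent of each age group with 4+ years of college.
share_4plus = {
    age: 100 * counts["4+ yrs college"][i] / column_totals[i]
    for i, age in enumerate(age_groups)
}

for age, pct in share_4plus.items():
    print(f"{age:<10} {pct:5.1f}% have 4+ years of college")
```

These percents do not add to 100, because each one describes a different age group.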

Page 47: Chapter 4: More on Two-Variable Data

47

• Read paragraph at the bottom of page 248.

Page 48: Chapter 4: More on Two-Variable Data

48

One set of conditional distributions: Figure 4.27, p. 248

Page 49: Chapter 4: More on Two-Variable Data

49

Problems

• 4.53, p. 245

• 4.59, p. 251

Page 50: Chapter 4: More on Two-Variable Data

50

Graph for Problem 4.59

[Bar graph: Breakdown of Planned Majors in Business School, by Gender. Horizontal axis: Business School Major. Vertical axis: Percent, 0 to 50.]

Major         Female   Male
Accounting    30.2     34.8
Admin         40.4     24.8
Economics      2.2      3.7
Finance       27.1     36.6

Page 51: Chapter 4: More on Two-Variable Data

51

Homework

• Read through the end of the chapter.

• Be sure you understand “Simpson’s Paradox.”

• Problem:

– 4.62, p. 253

Page 52: Chapter 4: More on Two-Variable Data

52

Simpson’s Paradox

• Problem 4.60, p. 251

• Statement of the Paradox:

– Simpson’s paradox refers to the reversal of the direction of a comparison or association when data from several groups are combined to form a single group.
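
A numeric illustration may help. The counts below are invented for this sketch (they are not from Problem 4.60): two hospitals, with patients split by condition on arrival.

```python
# Hypothetical counts: (deaths, patients) for each hospital and patient condition.
data = {
    "Hospital A": {"good": (6, 600), "poor": (57, 1500)},
    "Hospital B": {"good": (8, 600), "poor": (8, 200)},
}


def death_rate(deaths, patients):
    return deaths / patients


# Rates within each condition group.
by_group = {
    hospital: {cond: death_rate(*cell) for cond, cell in groups.items()}
    for hospital, groups in data.items()
}

# Rates after combining the two groups into one.
combined = {
    hospital: death_rate(
        sum(d for d, _ in groups.values()), sum(n for _, n in groups.values())
    )
    for hospital, groups in data.items()
}

for hospital in data:
    g = by_group[hospital]
    print(f"{hospital}: good {g['good']:.1%}, poor {g['poor']:.1%}, "
          f"combined {combined[hospital]:.1%}")
```

Hospital A has the lower death rate in both groups, yet the higher rate once the groups are combined, because it treats far more patients in poor condition. Here, patient condition is the lurking variable.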

Page 53: Chapter 4: More on Two-Variable Data

53

Practice/Review Problems

• Problem:

– 4.68, p. 254

– 4.72 (parts a-c), p. 257

Page 54: Chapter 4: More on Two-Variable Data

54

Relationship Between Type of College and Management Level

[Bar graph: percent at each management level, by type of college. Horizontal axis: Management Level. Vertical axis: Percent.]

Management Level   Public   Private
High                 4.2      7.3
Middle              54.4     53.1
Low                 41.4     39.8