Statistic for the day: Number of words in English that exist because of typographical errors or misreadings: 381

Source: OED

These slides were created by Tom Hettmansperger and in some cases modified by David Hunter

Assignment: Read Chapter 15, pp 254-264

Exercises pp 271-275: 1, 2, 3, 9, 11, 12, 17


Research question: Is ghost sighting related to age? Do young and old people differ in ghost sighting?

The skeptic responds by saying he doesn’t believe that there is any difference between the age groups.

We need to see the data to resolve the debate. Then we can consider assessing the risk.

Exercise 9, p219 of the text.

Expected counts are printed below observed

            yes       no    Total
young       212     1313     1525
          174.9   1350.1

old         465     3913     4378
          502.1   3875.9

Total       677     5226     5903

Chi-Sq = 7.870 + 1.020 + 2.742 + 0.355 = 11.987
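The expected counts and the chi-squared value in this output can be reproduced by hand. What follows is a minimal Python sketch, not the software procedure that printed the table. The function names `expected_counts` and `chi_square` are made up for illustration. The 3.84 cutoff is an assumption (the usual 5% critical value for a 2x2 table), since the slides do not state which cutoff decides the debate.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Ghost sighting counts from the slide: rows are young, old; columns are yes, no.
ghosts = [[212, 1313], [465, 3913]]

# Assumed cutoff: 3.84 is the usual 5% critical value for a 2x2 table.
CUTOFF = 3.84

statistic = chi_square(ghosts)
print([[round(e, 1) for e in row] for row in expected_counts(ghosts)])
print(round(statistic, 3), "advocate wins" if statistic > CUTOFF else "skeptic wins")
```

The same two functions work for any table of counts in these slides, as long as the input is raw counts rather than percentages.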

The research advocate wins and the skeptic loses. There is evidence in the data that there are differences in the population.

The percent of young who saw a ghost: 212/1525 = .139
Answer: 13.9%

The proportion of old who saw a ghost: 465/4378 = .106
Answer: .106

The risk of a young person seeing a ghost: Answer: 212/1525 or .139 or 13.9%

Odds ratio?

Odds

The odds of something happening are given by a ratio:

odds = (proportion of time it happens) / (proportion of time it doesn't happen)

For example, if you flip a fair coin, the odds of heads are 1 (or sometimes “1 to 1”).

An odds ratio is the ratio of two odds!

The odds that a young person saw a ghost: 212/1313 = .161

The odds that an older person saw a ghost: 465/3913 = .119

The odds ratio: Answer: .161/.119 = 1.35

Relative risk of young person seeing a ghost compared to older person:
Answer: .139/.106 = 1.31

We would say that the risk that a younger person sees a ghost is 1.31 times higher than the risk that an older person sees a ghost.

The increased risk that a young person sees a ghost over that of an older person:
Answer: (.139 - .106)/.106 = .31

Hence we would say that young people have a 31% higher risk of seeing a ghost than older people.
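The risk, odds, relative risk, increased risk and odds ratio above all come from the same four counts. The following Python sketch recomputes them. The helper names `risk` and `odds` are illustrative, not from the text. Because it works from unrounded values, the odds ratio comes out near 1.36, whereas the slides get 1.35 by dividing odds that were already rounded to three decimals.

```python
def risk(yes, no):
    """Proportion of the group for which the event happened."""
    return yes / (yes + no)


def odds(yes, no):
    """Proportion of time it happens divided by proportion of time it doesn't."""
    return yes / no


# Counts from the ghost sighting table: (saw a ghost, did not).
young = (212, 1313)
old = (465, 3913)

relative_risk = risk(*young) / risk(*old)
increased_risk = (risk(*young) - risk(*old)) / risk(*old)
odds_ratio = odds(*young) / odds(*old)

print(f"risk: young {risk(*young):.3f}, old {risk(*old):.3f}")
print(f"odds: young {odds(*young):.3f}, old {odds(*old):.3f}")
print(f"relative risk {relative_risk:.2f}, increased risk {increased_risk:.0%}")
print(f"odds ratio {odds_ratio:.2f}")
```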

Arby’s

Explanatory: Size
Response: Calories

          Low   High   Total
Small       5      2       7
Large       2      5       7
Total       7      7      14

Expected counts are printed below observed

            low     high    Total
1             5        2        7
           3.50     3.50

2             2        5        7
           3.50     3.50

Total         7        7       14

Chi-Sq = 0.643 + 0.643 + 0.643 + 0.643 = 2.571

So the skeptic wins.

Question: What happens if we had observed data 10 times bigger?

Expected counts are printed below observed

            low     high    Total
1            50       20       70
          35.00    35.00

2            20       50       70
          35.00    35.00

Total        70       70      140

Chi-Sq = 6.429 + 6.429 + 6.429 + 6.429 = 25.714

Now the research advocate wins.

The point: sample size

Statistical significance is related to the size of the sample. But that makes sense. More data, more information, more precise inference.

So statistical significance is related to two things:

1. The size of the difference between the percentages. Big differences are more likely to show statistical significance.

2. The size of the sample. Bigger samples are more likely to show statistical significance, irrespective of the size of the difference in percentages.
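The effect of sample size can be checked directly on the Arby's table. The sketch below uses a shortcut formula for 2x2 tables, n(ad - bc)^2 divided by the product of the four row and column totals. It is algebraically the same as summing (observed - expected)^2 / expected, but it is not the form shown on the slides, and the names `chi_square_2x2` and `scaled` are made up for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-squared for the 2x2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def scaled(cells, factor):
    """Multiply every count by the same factor; the percentages are unchanged."""
    return tuple(count * factor for count in cells)


# Arby's table from the slides: small (5 low, 2 high), large (2 low, 5 high).
arbys = (5, 2, 2, 5)

for factor in (1, 10, 100):
    print(factor, round(chi_square_2x2(*scaled(arbys, factor)), 3))
```

Each tenfold increase in the counts multiplies the chi-squared value by ten, even though the split between low and high calories is identical in every version of the table.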

Research question: Is ethnicity related to mortgage approval rates?

            approv   not approv    Total   % approv
Af. Am.       3117          979     4096        76%
White        71950        12997    84947        85%
Total        75067        13976    89043

Chi-Sq = 32.714 + 175.710 + 1.577 + 8.472 = 218.5

Research advocate wins big. (Exercise 19 p223 of the text.)

Notice that there were 89,043 applicants considered in the last example. The chi-squared value was 218.5.

Suppose there were 100 times fewer, say about 890.

Further, suppose the percentages of successful applicants were the same: 76% for African Americans and 85% for whites.

Who do you think will win the debate, the research advocate or the skeptic?

Why?

The skeptic will win with a chi-squared value of 2.18.
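The value 2.18 follows from a scaling property of the chi-squared statistic. As a sketch, write O for an observed count and E for its expected count (symbols that do not appear in the slides), and suppose every count is multiplied by the same factor k while the percentages stay fixed. Each expected count is a row total times a column total divided by the grand total, so it is also multiplied by k:

```latex
\chi^2_{\text{scaled}}
  = \sum \frac{(kO - kE)^2}{kE}
  = k \sum \frac{(O - E)^2}{E}
  = k\,\chi^2,
\qquad
\text{so with } k = \tfrac{1}{100}: \quad
\frac{32.714 + 175.710 + 1.577 + 8.472}{100} = \frac{218.473}{100} \approx 2.18 .
```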

Research question: Is there a relationship between whether you are sleep deprived and whether you typically smoke more than 0 packs per week?

Rows: Sleepdep   Columns: Smoke   (Counts with row percents below)

            No      Yes      All
No          96       23      119
         80.67    19.33   100.00

Yes         86       26      112
         76.79    23.21   100.00

All        182       49      231
         78.79    21.21   100.00

Chi-Square = 0.521

Skeptic wins big!! No evidence in the data to suggest a difference in the population.

Note that 23.2% of the people who feel sleep deprived smoke but only 19.3% of the people who do not feel sleep deprived smoke.

The skeptic wins and we conclude that the difference could easily have happened by chance.

There is no practical difference between the two percentages anyway. Just a 3.9% difference.

What happens if we have 100 times the sample sizes?

And suppose the percentages stay the same.

Consider non-sleep-deprived students who say they smoke more than 0 packs per week:

So instead of 23/119 = .193 or 19.3%

We have 2300/11900 = .193 or 19.3% (same percent)

Rows: Sleepdep   Columns: Smoke   (Observed and expected counts shown)

              No        Yes        All
No          9600       2300      11900
         9375.76    2524.24   11900.00

Yes         8600       2600      11200
         8824.24    2375.76   11200.00

All        18200       4900      23100
        18200.00    4900.00   23100.00

Chi-Sq = 5.363 + 19.921 + 5.698 + 21.166 = 52.148

And now the research advocate wins and the difference is statistically significant. But the difference of 3.9% is still not practically significant.

The point: practical significance

Even if the difference in percentages is uninteresting and of no practical interest, the difference may be statistically significant because we have a large sample.

Hence, in the interpretation of statistical significance, we must also address the issue of practical significance.

In other words, you must answer the skeptic’s second question: WHO CARES?
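One way to keep the two questions apart is to report the chi-squared value and the size of the difference side by side. The Python sketch below does this for the sleep deprivation table at its original size and at 100 times the size. The function names are hypothetical, and the chi-squared value is computed with a shortcut formula for 2x2 tables rather than the cell-by-cell sum shown in the printed output.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-squared for the 2x2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def percent_yes(no_count, yes_count):
    """Percent of a row that answered yes."""
    return 100 * yes_count / (no_count + yes_count)


def report(not_deprived, deprived):
    """Return (chi-squared, difference in percent who smoke) for two (no, yes) rows."""
    chi_sq = chi_square_2x2(*not_deprived, *deprived)
    difference = percent_yes(*deprived) - percent_yes(*not_deprived)
    return round(chi_sq, 3), round(difference, 1)


# Sleep deprivation table from the slides: each row is (non-smokers, smokers).
original = report((96, 23), (86, 26))
hundredfold = report((9600, 2300), (8600, 2600))
print(original)
print(hundredfold)
```

The first number in each pair answers the statistical question and changes with the sample size. The second number, about 3.9 percentage points in both cases, is what the "who cares?" question is about.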

Research question: Is the Salk polio vaccine effective?

Randomized experiment, double blinded. Carried out in 1954 on 400,000 children.

              Polio    No polio      Total
Control         142     199,858    200,000
Treatment        56     199,944    200,000

Control proportion = 142/200,000 = .00071 or .071%

Treatment proportion = 56/200,000 = .00028 or .028%

Difference: Control – Treatment = .00043 or .043%

Very small difference. But this was expected so they took large samples. But is the difference significant? Does the research advocate (Dr. Jonas Salk) win?

Expected counts are printed below observed

          polio        not      Total
C           142    199,858    200,000
             99    199,901

T            56    199,944    200,000
             99    199,901

Total       198    399,802    400,000

Chi-Sq = 18.677 + 0.009 + 18.677 + 0.009 = 37.372

The research advocate wins easily. We say that the effect of the vaccine is statistically significant. But is it practically significant?

Recall the difference in proportions: Control – Treatment = .00043

This represents the proportion of children saved from polio by the vaccine.

Population of US in 2000: 286,196,812.

Population of Children under age of 20: 82,997,075

Number of children saved from polio by the vaccine: 82,997,075 times .00043 = 35,688

That is certainly practically significant.
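The arithmetic behind this estimate can be laid out in a few lines. This is only a sketch of the slide's back-of-the-envelope calculation, which applies the trial's difference in proportions to the quoted population of children. The variable names are illustrative.

```python
# Counts from the 1954 trial as given on the slides.
control_polio, control_total = 142, 200_000
treatment_polio, treatment_total = 56, 200_000

control_rate = control_polio / control_total
treatment_rate = treatment_polio / treatment_total

# Both groups have the same size, so the difference in proportions
# can be computed from the difference in counts without rounding first.
difference = (control_polio - treatment_polio) / control_total

# Population figure quoted on the slide for children under 20.
children_under_20 = 82_997_075
children_saved = int(children_under_20 * difference)  # drop the fractional child

print(control_rate, treatment_rate, difference, children_saved)
```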

Goal:

Combine ideas from Chapter 4 on surveys and polls with ideas from Chapter 12 on testing for statistical significance in contingency tables.

Gallup Poll: Has drug abuse ever been a cause of trouble in your family?

             Yes     No
Aug 2000     22%    78%
Sept 1999    17%    83%

Research question: Is there a significant difference between 2000 and 1999? Is 22% - 17% = 5% a real difference?


First recall the margin of error

Suppose the polls were each based on 1200 people.

What is the margin of error for the percents in the table?

Margin of error = 1/(square root of 1200) = .03 or 3%.

             Yes          No
Aug 2000     22% ± 3%     78% ± 3%
Sept 1999    17% ± 3%     83% ± 3%

So now report:

             Yes            No
Aug 2000     19% to 25%     75% to 81%
Sept 1999    14% to 20%     80% to 86%
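The margin of error and the intervals in these tables can be recomputed as follows. This is a small sketch using the 1/sqrt(n) rule quoted on the slide, rounded to a whole percent as the slides do. The function names are hypothetical.

```python
from math import sqrt


def margin_of_error(sample_size):
    """Margin of error rule used on the slides: 1 / sqrt(n)."""
    return 1 / sqrt(sample_size)


def interval(percent, sample_size):
    """Percent plus or minus the margin of error, in whole percentage points."""
    margin = round(100 * margin_of_error(sample_size))
    return percent - margin, percent + margin


# Assumed sample size of 1200 per poll, as on the slides.
for label, yes_percent in (("Aug 2000", 22), ("Sept 1999", 17)):
    print(label, interval(yes_percent, 1200), interval(100 - yes_percent, 1200))
```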

Plan

1. First we will create a sample count table from the original Gallup Poll percentages.

2. Then we will use the 4 step statistical inference process to see if the differences are statistically significant.

3. If the research advocate wins, we will consider the differences in the Gallup Poll as reflecting real differences in the populations (1999 and 2000).

4. We will then compute the relative and increased risks associated with drug abuse troubles in families from 1999 and 2000. This would indicate how big the differences are.

To resolve the debate between the research advocate and the skeptic we need to conduct a chi-squared test.

Remember the skeptic says the 5% difference occurred by chance. There is really no difference in the populations.

But we cannot conduct a chi-squared test on a table of percents.

We need raw counts.

The Gallup Poll generally tells you what the sample sizes were for the survey. If they do not, then we will use 1200 since they usually use between 1000 and 1500.

             Yes           No            Count
Aug 2000     22% (.22)     78% (.78)     1200
Sept 1999    17% (.17)     83% (.83)     1200

             Yes                  No
Aug 2000     .22 x 1200 = 264     ?
Sept 1999    .17 x 1200 = 204     ?

             Yes      No    Total
Aug 2000     264     936     1200
Sept 1999    204     996     1200
Total        468    1932     2400
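The step from percentages to counts can be sketched in Python. The function name `counts_from_percents` is made up for illustration, and the sample size of 1200 per poll is the slides' assumption rather than a figure reported by the poll.

```python
def counts_from_percents(yes_percent, sample_size):
    """Turn a 'yes' percentage into (yes, no) counts for a given sample size."""
    yes = round(yes_percent / 100 * sample_size)
    return yes, sample_size - yes


# Assumed sample size of 1200 per poll, as on the slides.
aug_2000 = counts_from_percents(22, 1200)
sept_1999 = counts_from_percents(17, 1200)

column_totals = tuple(a + b for a, b in zip(aug_2000, sept_1999))
grand_total = sum(column_totals)

print(aug_2000, sept_1999, column_totals, grand_total)
```

Taking the "no" count as the remainder of the sample, rather than rounding it separately, keeps each row adding up to exactly 1200.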

Gallup Poll: Has drug abuse ever been a cause of trouble in your family?

Table of counts

            yes       no    Total
1           264      936     1200
         234.00   966.00

2           204      996     1200
         234.00   966.00

Total       468     1932     2400

Chi-Sq = 3.846 + 0.932 + 3.846 + 0.932 = 9.556

Obs = 264

Exp = (1200 x 468)/2400 = 234

(Obs - Exp)^2/Exp = (264 - 234)^2/234 = 3.846

Since chi-squared = 9.556, the research advocate wins.

There is evidence in the data that there are real differences between the populations.

That is, we have detected statistically (in the samples) that the increase (from 1999 to 2000) in people who say there have been drug abuse problems in their family is really in the populations.

Next look at the relative risk and the increased risk of having drug troubles in a family from 1999 to 2000. That is, consider the practical significance.

Conclusion

1. The relative risk of drug abuse troubles in a family (from 1999 to 2000) is:

.22/.17 = 1.29
So the risk of drug troubles is 1.29 times higher in 2000 than in 1999.

2. The increased risk of drug abuse troubles in a family (from 1999 to 2000) is:

(.22 - .17)/.17 = .29
So there is a 29% higher risk in 2000 for drug troubles in a family than in 1999.

             Yes     No
Aug 2000     22%    78%
Sept 1999    17%    83%

Summary

1. First we created a sample count table from the original Gallup Poll percentages.

2. Then we used the 4 step statistical inference process to see if the differences were statistically significant.

3. The research advocate won (chi-squared = 9.556). So we can now consider the differences in the Gallup Poll as reflecting real differences in the populations (1999 and 2000).

4. We finished by computing the relative and increased risks associated with drug abuse troubles in families from 1999 and 2000.