STEVE DOIG CRONKITE SCHOOL OF JOURNALISM Statistics for Science Journalists

STEVE DOIG, CRONKITE SCHOOL OF JOURNALISM

Statistics for Science Journalists

Journalists hate math

Definition of journalist: A do-gooder who hates math.

“Word person, not a numbers person.”

1936 JQ article noting habitual numerical errors in newspapers

Japanese 6th graders more accurate on math test than applicants to Columbia’s Graduate School of Journalism

20% of journalists got more than half wrong on 25-question “math competency test” (Maier)

18% of 5,100 stories examined by Phil Meyer had math errors

Bad examples abound

Paulos: 300% decrease in murders

Detroit Free Press (2006): Compared ACS to Census data to get false drop in median income

KC Star (2000): Priests dying of AIDS at 4 times the rate of all Americans

Delaware ZIP Code of infant death

NYT: 51% of women without spouses

Common problems

Numbers that don’t add up

Making the reader do the math

Failure to ask “Does this make sense?”

Over-precision

Ignoring sampling error margins

Implying that correlation equals causation

Dangers of journalistic innumeracy

Misleads math-challenged readers/viewers

Hurts credibility among math-capable readers/viewers

Leads to charges of bias, even when cause is ignorance

Makes reporters vulnerable to being used for the agendas of others

Common Research Methods

Randomized experiments: Measure deliberate manipulation of the environment

Observational studies: Measure the differences that occur naturally

Meta-analyses: Quantitative review of multiple studies

Case Study: Descriptive in-depth examination of one or a few individuals

Simple Measures...

...don’t exist!

Measurement Variability

Variable measurements include unpredictable errors or discrepancies that aren’t easily explained.

Natural variability is the result of the fact that individuals and other things are different.

Reasons for variable measures

Measurement error

Natural variability between individuals

Natural variability over time in a single individual

Some Pitfalls in Studies

Deliberate Bias?

If you found a wallet with $20, would you:

“Keep it?” (23% would keep it)

“Do the honest thing and return it?” (13% would keep it)

Unintentional Bias?

“Do you use drugs?”

“Are you religious?”

Desire to Please?

People routinely say they have voted when they actually haven’t, that they don’t smoke when they do, and that they aren’t prejudiced.

One study six months after an election:

96% of actual voters said they voted.

40% of non-voters said they voted.

Asking the uninformed?

Washington Post poll: “Some people say the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?”

24% said yes

19% said no

rest had no opinion

Asking the uninformed?

Later Washington Post poll: “President Clinton says the 1975 Public Affairs Act should be repealed. Do you agree or disagree that it should be repealed?”

36% of Democrats agreed

16% of Republicans agreed

rest had no opinion

Unnecessary Complexity?

“Do you support our soldiers in Iraq so that terrorists won’t strike the U.S. again?”

Question Order

“About how many times a month do you normally go out on a date?”

“How happy are you with life in general?”

Sampling

Margin of Error

95% of the time, a random sample’s characteristics will differ from the population’s by no more than about $1/\sqrt{N}$, where N = sample size.

Two Important Concepts about Error Margin

The larger the sample, the smaller the margin of sampling error.

The size of the population being surveyed doesn’t matter.*

*Unless the sample is a significant fraction of the population.
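
The footnote can be illustrated with the standard finite population correction, which the slides do not spell out; treating it as the intended adjustment is an assumption. A minimal sketch, with a made-up function name and made-up population sizes:

```python
import math


def margin_with_fpc(sample_size: int, population_size: int) -> float:
    """Rule-of-thumb 95% margin (1/sqrt(n)) scaled by the finite population correction."""
    base_margin = 1 / math.sqrt(sample_size)
    # The correction shrinks the margin as the sample approaches the whole population.
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return base_margin * correction


# Hypothetical populations: the same 400-person sample drawn from each.
for population in (1_000, 100_000, 10_000_000):
    margin = margin_with_fpc(400, population)
    print(f"Population {population:>10,}: +/- {margin * 100:.2f} percentage points")
```

In this sketch the margin barely moves between the two large populations and shrinks noticeably only when the sample of 400 is a large share of a population of 1,000.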

Sampling realities

Bigger sample means more cost (money and/or time)

Diminishing return on error margin improvement as sample increases:

N=100: +/- 10 percentage points

N=400: +/- 5 percentage points

N=900: +/- 3.3 percentage points

Sample needs only to be large enough to give a reasonable answer.

Sampling error affects subsamples, too.
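
The error margins on this slide follow from the 1/sqrt(N) rule of thumb. A minimal sketch that reproduces them; the function name and the 150-person subsample are made up for illustration:

```python
import math


def margin_of_error(sample_size: int) -> float:
    """Rule-of-thumb 95% margin of sampling error, as a proportion: 1/sqrt(N)."""
    return 1 / math.sqrt(sample_size)


# Sample sizes from the slide, plus a hypothetical 150-person subsample.
for n in (100, 400, 900, 150):
    points = margin_of_error(n) * 100
    print(f"N={n}: +/- {points:.1f} percentage points")
```

The last line shows why subsamples matter: a subgroup of 150 people inside a 900-person poll carries the margin of a 150-person poll, not the margin of the full sample.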

Describing data sets

Three Useful Features of a Set of Data

The Center

The Variability

The Shape

The Center

Mean (average): Total of the values, divided by the number of values

Median: The middle value of an ordered list of values

Mode: The most common value

Outliers: Atypical values far from the center

Yankees’ Baseball Salaries

Average: $7,404,762

Median: $2,500,000

Mode: $500,000 (also the minimum)

Outlier: $27.5 million (Alex Rodriguez)
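
A small sketch of how one outlier pulls the mean away from the median. The salary list below is invented for illustration (it is not the actual Yankees roster); only the $27.5 million outlier and the $500,000 minimum echo the slide.

```python
import statistics

# Hypothetical payroll: NOT the real roster, just a small list with one big outlier.
salaries = [500_000, 500_000, 500_000, 1_000_000, 2_500_000, 5_000_000, 27_500_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
mode_salary = statistics.mode(salaries)

# Drop the largest value to see how much the outlier moved the mean.
mean_without_outlier = statistics.mean(salaries[:-1])

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   ${mode_salary:,.0f}")
print(f"Mean without the outlier: ${mean_without_outlier:,.0f}")
```

The median and mode stay put whether or not the top salary is included, which is why the median is usually the safer "typical" figure for skewed data such as pay.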

The Variability

Some measures of variability:

Maximum and minimum: Largest and smallest values

Range: The distance between the largest and smallest values

Quartiles: The medians of each half of the ordered list of values

Standard deviation: Think of it as the average distance of all the values from the mean.
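
A minimal sketch of these measures on a made-up list of values. Quartiles are computed the way the slide describes them (medians of each half), and the population standard deviation is shown next to the literal average distance from the mean, since the two are close but not identical.

```python
import statistics

# Hypothetical data set, chosen so the arithmetic is easy to follow.
values = sorted([2, 4, 4, 4, 5, 5, 7, 9])

value_range = max(values) - min(values)

# Quartiles as medians of each half; for an odd count the middle value is left out.
half = len(values) // 2
first_quartile = statistics.median(values[:half])
third_quartile = statistics.median(values[-half:])

mean_value = statistics.mean(values)
std_dev = statistics.pstdev(values)  # population standard deviation
avg_distance = sum(abs(v - mean_value) for v in values) / len(values)

print(f"Min {min(values)}, max {max(values)}, range {value_range}")
print(f"Quartiles: {first_quartile} and {third_quartile}")
print(f"Standard deviation: {std_dev}, average distance from mean: {avg_distance}")
```

The standard deviation squares each distance before averaging, so it comes out somewhat larger than the plain average distance; "average distance" is a way to think about it, not the exact formula.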

What is “normal”?

Don’t consider the average to be “normal”

Variability is normal

Anything within about 3 standard deviations of the mean is “normal”

Bell-Shaped “Normal” Curve

Some Characteristics of a Normal Distribution

Symmetrical (not skewed)

One peak in the middle, at the mean

The wider the curve, the greater the standard deviation

Area under the curve is 1 (or 100%)

[Figure: bell curve with the mean labeled]

Percentiles

Your percentile for a particular measure (like height or IQ) is the percentage of the population that falls below you.

Compared to other American males:

My height (5’ 11”): 75th percentile

My weight (230 lbs.): 85th percentile

My age (66): 88th percentile

Therefore, I am older and heavier than I am tall.
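
The percentile definition above translates directly into a count. A minimal sketch, with an invented function name and an invented list of ten heights standing in for a population:

```python
def percentile_rank(value: float, population: list[float]) -> float:
    """Percentage of the population that falls strictly below the given value."""
    below = sum(1 for member in population if member < value)
    return 100 * below / len(population)


# Hypothetical heights in inches; real percentiles come from much larger reference data.
heights = [64, 66, 67, 68, 69, 70, 71, 72, 74, 76]

print(f"A height of 71 inches is at percentile {percentile_rank(71, heights):.0f}")
```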

Standardized Scores

A standardized score (also called the z-score) is simply the number of standard deviations a particular value is either above or below the mean.

The standardized score is:

Positive if above the mean

Negative if below the mean

Useful for defining data points as outliers.
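
A minimal sketch of the standardized score and the three-standard-deviation outlier test. The function names and the example scale (mean 100, standard deviation 15) are assumptions made for illustration.

```python
def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / std_dev


def is_outlier(value: float, mean: float, std_dev: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(z_score(value, mean, std_dev)) > threshold


# Hypothetical scale with mean 100 and standard deviation 15.
for score in (85, 130, 150):
    z = z_score(score, mean=100, std_dev=15)
    print(f"Score {score}: z = {z:+.2f}, outlier: {is_outlier(score, 100, 15)}")
```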

The Empirical Rule

For any normal curve, approximately:

68% of values within one StdDev of the mean

95% of values within two StdDevs of the mean

99.7% of values within three StdDevs of the mean
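
The three percentages can be checked against the normal curve itself. A minimal sketch using the error function from Python's `math` module; the function name `share_within` is made up for illustration.

```python
import math


def share_within(std_devs: float) -> float:
    """Share of a normal distribution lying within `std_devs` of the mean."""
    return math.erf(std_devs / math.sqrt(2))


for k in (1, 2, 3):
    print(f"Within {k} StdDev(s): {share_within(k) * 100:.1f}%")
```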

Outlier

A value that is more than three standard deviations above or below the mean.

Correlation

Strength of Relationship

Correlation (also called the correlation coefficient or Pearson’s r) is the measure of strength of the linear relationship between two variables.

Think of strength as how closely the data points come to falling on a line drawn through the data.

Features of Correlation

Correlation can range from +1 to -1

Positive correlation: As one variable increases, the other increases

Negative correlation: As one variable increases, the other decreases

Zero correlation means the best line through the data is horizontal

Correlation isn’t affected by the units of measurement
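
A minimal sketch of Pearson's r written out by hand, showing that converting units leaves it unchanged. The height and weight figures are invented for illustration.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Hypothetical people: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

# Same people, metric units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_imperial = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(f"r in inches/pounds: {r_imperial:.4f}")
print(f"r in cm/kg:         {r_metric:.4f}")
```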

Positive Correlations

[Figure: four scatterplots showing r = +.1, r = +.4, r = +.8 and r = +1]

Negative Correlations

[Figure: four scatterplots showing r = -.1, r = -.4, r = -.8 and r = -1]

Zero correlation

[Figure: two scatterplots, each with r = 0]

Number of Points Doesn’t Matter

[Figure: two scatterplots, each with r = .8]

Important!

Correlation does not imply causation.

Correlation of variables

When considering relationships between measurement variables, there are two kinds:

Explanatory (or independent) variable: The variable that attempts to explain or is purported to cause (at least partially) differences in the…

Response (or dependent or outcome) variable

Often, chronology is a guide to distinguishing them (examples: baldness and heart attacks, poverty and test scores)

Some reasons why two variables could be related

The explanatory variable is the direct cause of the response variable

Example: pollen counts and percent of population suffering allergies, intercourse and babies

Some reasons two variables could be related

The response variable is causing a change in the explanatory variable

Example: hotel occupancy and advertising spending, divorce and alcohol abuse

Some reasons two variables could be related

The explanatory variable is a contributing -- but not sole -- cause

Example: birth complications and violence, gun in home and homicide, hours studied and grade, diet and cancer

Some reasons two variables could be related

Both variables may result from a common cause

Example: SAT score and GPA, hot chocolate and tissues, storks and babies, fire losses and firefighters, WWII fighter opposition and bombing accuracy

Some reasons two variables could be related

Both variables are changing over time

Example: divorces and drug offenses, divorces and suicides

Some reasons two variables could be related

The association may be nothing more than coincidence

Example: clusters of disease, brain cancer from cell phones

So how can we confirm causation?

The only way to confirm is with a designed (randomized double-blind) experiment.

But non-statistical evidence of a possible connection may include:

A reasonable explanation of cause and effect.

A connection that happens under varying conditions.

Potential confounding variables ruled out.

Regression

Linear Regression

In addition to figuring the strength of the relationship, we can create a simple equation that describes the best-fit line (also called the “least-squares” line) through the data.

This equation will help us predict one variable, given the other.

Best-fit (“least-squares”) Line

Best-fit Line??? (much variance)

Best-fit Line! (least variance)

Remember 9th Grade Algebra?

x = horizontal axis y = vertical axis

Equation for a line:

y = slope * x + intercept

or as it often is stated:

y = mx + b
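
A minimal sketch of fitting the least-squares line y = mx + b by hand and using it to predict. The hours-studied and test-score numbers are invented, and the function name is made up for illustration.

```python
def least_squares_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (slope, intercept) of the best-fit line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The best-fit line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

m, b = least_squares_line(hours, scores)
predicted = m * 6 + b
print(f"y = {m:.1f}x + {b:.1f}")
print(f"Predicted score after 6 hours: {predicted:.1f}")
```

Predicting for 6 hours reaches slightly beyond the observed data; the further a prediction strays from the range of the sample, the less it should be trusted.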

Regression in data journalism

Public school test scores

Cheating in school test scores

Tenure of white vs. black coaches in NBA

Racial bias in picking jurors

Racial profiling in traffic stops

Confusion of the inverse

Confusion of the Inverse

Confusing these two:

Probability of actually having a condition, given a positive test for it

Probability of having a positive test, given actually having the condition

When the incidence of some disease or condition is very low, and the test for it is not perfect, there will be a high probability that a positive test result is a false positive.

Definitions

Base rate: The probability that someone has a disease or condition, without knowing any test results.

Test Sensitivity: Proportion of people who correctly test positive when they have the disease or condition (true positive)

Test Specificity: Proportion of people who correctly test negative when they don’t have the disease or condition (true negative)

Drug Tests

Consider this scenario:

Base rate: 1% of population to be tested uses dangerous drugs

You use a test that’s 99% accurate in both sensitivity and specificity

10,000 people are tested

Drug Tests

        | Test Positive | Test Negative | Total
Users   |               |               | 100
Not     |               |               | 9,900
Total   |               |               | 10,000

Drug Tests

        | Test Positive | Test Negative | Total
Users   | 99            | 1             | 100
Not     |               |               | 9,900
Total   |               |               | 10,000

Drug Tests

        | Test Positive | Test Negative | Total
Users   | 99            | 1             | 100
Not     |               | 9,801         | 9,900
Total   |               | 9,802         | 10,000

Drug Tests

        | Test Positive | Test Negative | Total
Users   | 99            | 1             | 100
Not     | ???           | 9,801         | 9,900
Total   |               | 9,802         | 10,000

Drug Tests

        | Test Positive | Test Negative | Total
Users   | 99            | 1             | 100
Not     | 99            | 9,801         | 9,900
Total   | 198           | 9,802         | 10,000

(50% of positives are FALSE!)

Confidence intervals and p-values

Confidence Intervals

Like the error margin around poll results

A confidence interval is a tradeoff between certainty and accuracy, like shooting at targets of different sizes

The bigger the sample, the smaller the confidence interval at the 95% level

When comparing results, if confidence intervals overlap, the difference may NOT be statistically significant (a rule of thumb: intervals that overlap only slightly can still differ significantly)
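
A minimal sketch of a 95% confidence interval for a poll percentage and a simple overlap check. It assumes the usual normal-approximation formula, which the slides do not state; the poll numbers and function names are invented.

```python
import math

Z_95 = 1.96  # standard normal multiplier for a 95% confidence level


def proportion_ci(share: float, sample_size: int) -> tuple[float, float]:
    """Approximate 95% confidence interval for a sample proportion."""
    half_width = Z_95 * math.sqrt(share * (1 - share) / sample_size)
    return share - half_width, share + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical poll of 400 people: candidate A at 52%, candidate B at 45%.
ci_a = proportion_ci(0.52, 400)
ci_b = proportion_ci(0.45, 400)
print(f"A: {ci_a[0]:.1%} to {ci_a[1]:.1%}")
print(f"B: {ci_b[0]:.1%} to {ci_b[1]:.1%}")
print(f"Intervals overlap: {intervals_overlap(ci_a, ci_b)}")
```

The overlap check is a quick screen for whether a lead might be within sampling error, not a formal significance test.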

P-values

A p-value is the probability of getting a result at least as extreme as the sample’s if chance alone were at work (that is, if there were no real effect)

A 95% confidence level (p < 0.05) is the most commonly used standard in social science research

Hard science, particularly medicine, often needs tighter confidence intervals and smaller p-values, like p<0.01

Studies are going to be wrong about 5% of the time (and you won’t know when)

On the other hand, they probably won’t be very wrong.

How to read a research study

Pay attention to the method: Observational, randomized double-blind experiment, meta-analysis, case study

Note the sample size

Don’t ignore the confidence intervals

Consider the p-value a rough warning of how easily chance alone could have produced the finding

Remember correlation doesn’t necessarily mean causation.

Consider the quality of the journal (peer reviewed?)

Who paid for the research?

Newsroom math bibliography

“Numbers in the Newsroom,” by Sarah Cohen, IRE

“News and Numbers,” by Victor Cohn and Lewis Cope

“Precision Journalism (4th edition),” by Phil Meyer

“Innumeracy,” by John Allen Paulos

“A Mathematician Reads the Newspaper,” by John Allen Paulos

“Damned Lies and Statistics,” by Joel Best

Questions?