Statistical Foundations: Hypothesis Testing (Psychology 790, Lecture #9, 9/19/2006)



Page 1: Statistical Foundations: Hypothesis Testing

Statistical Foundations: Hypothesis Testing

Psychology 790, Lecture #9, 9/19/2006

Page 2: Statistical Foundations: Hypothesis Testing

Today’s Class

• Hypothesis Testing.
  – General terms and philosophy.
  – Specific examples.

Page 3: Statistical Foundations: Hypothesis Testing

Hypothesis Testing

Page 4: Statistical Foundations: Hypothesis Testing

Rules of the NHST Game

• Recall our discussion about Null Hypothesis Significance Testing from the last lecture:

• This probability value is often called a p-value or p.
  – When p < .05, a result is said to be “statistically significant.”

• In short, when a result is statistically significant (p < .05), we conclude that the difference we observed was unlikely to be due to sampling error alone. We “reject the null hypothesis.”

• If the statistic is not statistically significant (p > .05), we conclude that sampling error is a plausible interpretation of the results. We “fail to reject the null hypothesis.”
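A minimal sketch (not from the lecture) of how this decision rule plays out for the kind of z-test of a sample mean mentioned later in the slides. The sample mean, N, null-hypothesis mean of 100, and population SD of 15 are made-up numbers for illustration only.

    import math
    from scipy import stats

    # Hypothetical numbers: a sample of 50 scores with mean 103, tested against
    # a null-hypothesis mean of 100 with a known population SD of 15.
    sample_mean, n = 103.0, 50
    mu0, sigma = 100.0, 15.0

    z = (sample_mean - mu0) / (sigma / math.sqrt(n))   # z-statistic
    p = 2 * (1 - stats.norm.cdf(abs(z)))               # two-tailed p-value

    if p < .05:
        print(f"z = {z:.2f}, p = {p:.3f}: statistically significant, reject the null")
    else:
        print(f"z = {z:.2f}, p = {p:.3f}: not significant, fail to reject the null")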

Page 5: Statistical Foundations: Hypothesis Testing

Hypothesis Testing Notes

• It is important to keep in mind that NHSTs were developed for the purpose of making yes/no decisions about the null hypothesis.
  – As a consequence, the null is either accepted or rejected on the basis of the p-value.

• For logical reasons, some people are uneasy “accepting the null hypothesis” when p > .05, and prefer to say that they “failed to reject the null hypothesis” instead.

Page 6: Statistical Foundations: Hypothesis Testing

Hypothesis Testing Items of Interest

• Very important points about significance testing:

1. The term “significant” does not mean important, substantial, or worthwhile.

Page 7: Statistical Foundations: Hypothesis Testing

Points, continued

2. The null and alternative hypotheses are often constructed to be mutually exclusive. If one is true, the other must be false.

• As a consequence,
  – When you reject the null hypothesis, you accept the alternative.
  – When you fail to reject the null hypothesis, you reject the alternative.

• This may seem tricky because NHSTs do not test the research hypothesis per se.
  – Formally, only the null hypothesis is tested.

Page 8: Statistical Foundations: Hypothesis Testing

Points, continued

3. Because NHSTs are often used to make a yes/no decision about whether the null hypothesis is a viable explanation, mistakes can be made.

Page 9: Statistical Foundations: Hypothesis Testing

Errors in Hypothesis Testing

Page 10: Statistical Foundations: Hypothesis Testing

Errors in Inference using NHST

• NHST can lead to decisions which are not correct:

• Type I error: Your test is significant (p < .05), so you reject the null hypothesis, but the null hypothesis is actually true.

• Type II error: Your test is not significant (p > .05), so you don’t reject the null hypothesis, but you should have because it is false.
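One way to see both error types at work is a small simulation, sketched below under assumed conditions (a one-sample t-test, N = 30, and a hypothetical true effect of 0.3 SD when the null is false); none of these numbers come from the lecture.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps = 30, 5000
    type1 = type2 = 0

    for _ in range(reps):
        # Null actually true: the population mean really is 0.
        p_null = stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
        type1 += p_null < .05      # "significant" by accident -> Type I error

        # Null actually false: the population mean is really 0.3.
        p_alt = stats.ttest_1samp(rng.normal(0.3, 1.0, n), 0.0).pvalue
        type2 += p_alt >= .05      # a real effect missed -> Type II error

    print(f"Type I error rate  ~ {type1 / reps:.3f}")   # close to alpha = .05
    print(f"Type II error rate ~ {type2 / reps:.3f}")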

Page 11: Statistical Foundations: Hypothesis Testing

Errors in Inference using NHST

• The probability of making a Type I error is determined by the experimenter. Often called the alpha value. Usually set to 5%.

• The probability of making a Type II error is determined by the experimenter. Often called the beta value. Usually ignored by social science researchers.

Page 12: Statistical Foundations: Hypothesis Testing

Errors in Inference using NHST

• The converse of Type II error is called Power:
  – The probability of rejecting the null hypothesis when it is false, a correct decision.
  – 1 - beta

Page 13: Statistical Foundations: Hypothesis Testing

More on Power

• Power is strongly influenced by sample size.
  – With larger N, we are more likely to reject the null if it is false.
  – Power analyses are conducted to determine the size of a sample needed to reject a null hypothesis.

Page 14: Statistical Foundations: Hypothesis Testing

Inferential Errors and NHST

                                      Real World
                             Null is true       Null is false
Conclusion of the test:
  Null is true               Correct decision   Type II error
  Null is false              Type I error       Correct decision

Page 15: Statistical Foundations: Hypothesis Testing

Points of Interest

• The example we explored previously was an example of what is called a z-test of a sample mean.

• Significance tests have been developed for a number of statistics:
  – difference between two group means: t-test
  – difference between two or more group means: ANOVA
  – differences between proportions: chi-square
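For concreteness, the sketch below shows one way these tests map onto functions in scipy.stats; the lecture does not reference any particular software, and the data are tiny made-up samples.

    from scipy import stats

    group1 = [5.1, 4.9, 6.2, 5.8, 5.5]
    group2 = [4.2, 4.8, 5.0, 4.5, 4.9]
    group3 = [6.0, 6.3, 5.9, 6.5, 6.1]

    # Difference between two group means: independent-samples t-test.
    t_stat, p_t = stats.ttest_ind(group1, group2)

    # Difference between two or more group means: one-way ANOVA.
    f_stat, p_f = stats.f_oneway(group1, group2, group3)

    # Differences between proportions/frequencies: chi-square test on a 2x2 table.
    observed = [[20, 30], [35, 15]]
    chi2, p_chi, dof, expected = stats.chi2_contingency(observed)

    print(p_t, p_f, p_chi)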

Page 16: Statistical Foundations: Hypothesis Testing

How do we control Type I errors?

• The Type I error rate is typically controlled by the researcher.

• It is called the alpha rate, and corresponds to the probability cut-off that one uses in a significance test.

• By convention, researchers often use an alpha rate of .05.
  – In other words, they will only reject the null hypothesis when a statistic is likely to occur 5% of the time or less when the null hypothesis is true.

• In principle, any probability value could be chosen for making the accept/reject decision.
  – 5% is used by convention.

Page 17: Statistical Foundations: Hypothesis Testing

Type I errors

• What does 5% mean in this context?

• It means that we will only make a decision error 5% of the time if the null hypothesis is true.

• If the null hypothesis is false, the Type I error rate is undefined.

Page 18: Statistical Foundations: Hypothesis Testing

How do we control Type II errors?

• Type II errors can also be controlled by the experimenter.

• The Type II error rate is sometimes called beta.

• How can the beta rate be controlled? The easiest way to control Type II errors is by increasing the statistical power of a test.

Page 19: Statistical Foundations: Hypothesis Testing

Statistical Power

• Statistical power is defined as the probability of rejecting the null hypothesis when it is false, a correct decision (1 - beta).

• Power is strongly influenced by sample size. With a larger N, we are more likely to reject the null hypothesis if it is truly false.
  – (As N increases, the standard error shrinks. Sampling error becomes less problematic, and true differences are easier to detect.)
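A short back-of-the-envelope sketch of this point, assuming a one-sample z-test and a true effect of .5 SD (an assumption for illustration, not a value from the slide): as N grows, the standard error shrinks and approximate power rises.

    import math
    from scipy import stats

    d = 0.5   # assumed true effect: a mean difference of .5 SD

    for n in (10, 30, 80, 200):
        se = 1.0 / math.sqrt(n)                      # standard error shrinks as N grows
        power = 1 - stats.norm.cdf(1.96 - d / se)    # approximate two-tailed z-test power
        print(f"N = {n:3d}: SE = {se:.3f}, power ~ {power:.2f}")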

Page 20: Statistical Foundations: Hypothesis Testing

Power and correlation

• This graph shows how the power of the significance test for a correlation varies as a function of sample size.

• Notice that when N = 80, there is about an 80% chance of correctly rejecting the null hypothesis (beta = .20).

• When N = 45, we only have a 50% chance of making the correct decision, a coin toss (beta = .50).

[Figure: power of the test of a correlation (population r = .30) plotted against sample size; y-axis = power (0 to 1.0), x-axis = sample size (50 to 200).]
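The curve described above can be approximated analytically. The sketch below uses the Fisher z approximation for the sampling distribution of a correlation, an assumption on my part rather than the method behind the slide's figure; it reproduces the roughly 50% power at N = 45 and roughly 80% power at N = 80 quoted above.

    import math
    from scipy import stats

    def corr_power(rho, n, alpha=.05):
        """Approximate power of the two-tailed test that the population correlation is 0."""
        z_rho = math.atanh(rho) * math.sqrt(n - 3)     # effect on the Fisher-z scale
        z_crit = stats.norm.ppf(1 - alpha / 2)
        return (1 - stats.norm.cdf(z_crit - z_rho)) + stats.norm.cdf(-z_crit - z_rho)

    for n in (45, 80, 130, 200):
        print(f"N = {n:3d}: power ~ {corr_power(.30, n):.2f}")   # ~.52 at N=45, ~.78 at N=80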

Page 21: Statistical Foundations: Hypothesis Testing

Power and correlation

• Power also varies as a function of the size of the correlation.

• When the population correlation is large (e.g., .80), it requires fewer subjects to correctly reject the null hypothesis that the population correlation is 0.

• When the population correlation is smallish (e.g., .20), it requires a large number of subjects to correctly reject the null hypothesis.

• When the population correlation is 0, the probability of rejecting the null is constant at 5% (alpha). Here “power” is technically undefined because the null hypothesis is true.

[Figure: power curves for population correlations r = .80, .60, .40, .20, and .00 plotted against sample size; y-axis = power (0 to 1.0), x-axis = sample size (50 to 200).]

Page 22: Statistical Foundations: Hypothesis Testing

Low Power Studies

• Because correlations in the .2 to .4 range are typically observed in non-experimental research, one would be wise not to trust research based on sample sizes less than 60ish.

• Why? Because such research only stands a 50% chance of yielding the correct decision, if the null is false. It would be more efficient (and, importantly, just as accurate) to flip a coin to make the decision rather than collecting data and using a significance test.

[Figure: the same power curves (r = .80 to .00) against sample size (50 to 200); y-axis = power, x-axis = sample size.]

Page 23: Statistical Foundations: Hypothesis Testing

A Sad Fact

• In 1962, Jacob Cohen surveyed all articles in the Journal of Abnormal and Social Psychology and determined that the typical power of research conducted in this area was 53%.

• An even sadder fact: In 1989, Sedlmeier and Gigerenzer surveyed studies in the same journal (now called the Journal of Abnormal Psychology) and found that the power had decreased slightly.

• Researchers, unfortunately, pay little attention to power. As a consequence, the Type II error rate of research in psychology is likely to be dangerously high, maybe as high as 50%.

Page 24: Statistical Foundations: Hypothesis Testing

Power in Research Design

• Power is important to consider, and should be used to design research projects.
  – Given an educated guess about what the population parameter might be (e.g., a correlation of .30, a mean difference of .5 SD), one can determine the number of subjects needed for a desired level of power (see the sketch below).
  – Cohen and others recommend that researchers try to obtain a power level of about 80%.
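One way to do this kind of sample-size planning is with the power routines in statsmodels, a tool not mentioned in the lecture; the sketch below asks how many subjects per group are needed to detect an assumed mean difference of .5 SD in a two-group design with alpha = .05 and power = .80.

    from statsmodels.stats.power import TTestIndPower

    # Solve for the per-group N that gives 80% power for an effect of d = 0.5
    # at a two-tailed alpha of .05.
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.80, alternative='two-sided')
    print(f"Roughly {n_per_group:.0f} subjects per group")   # about 64 per group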

Page 25: Statistical Foundations: Hypothesis Testing

Power in Research Design

• Thus, if one used an alpha level of 5% and collected enough subjects to ensure a power of 80% for an assumed effect, one would know, before the study was done, what the theoretical error rates are for the statistical test.

• Although these error rates correspond to long-run outcomes, one could get a sense of whether the research design was a credible one: whether it is likely to minimize the two kinds of errors that are possible in NHST and, correspondingly, maximize the likelihood of making a correct decision.

Page 26: Statistical Foundations: Hypothesis Testing

Misconceptions About Hypothesis Testing

Page 27: Statistical Foundations: Hypothesis Testing

Three Common Misinterpretations of Significance Tests and p-values

1. The p-value indicates the probability that the results are due to sampling error or “chance.”

2. A statistically significant result is a “reliable” result.

3. A statistically significant result is a powerful, important result.

Page 28: Statistical Foundations: Hypothesis Testing

Misinterpretation #1

• The p-value is a conditional probability: the probability of observing a specific range of sample statistics GIVEN (i.e., conditional upon) that the null hypothesis is true. P(D|H0).

• This is not equivalent to the probability of the null hypothesis being true, given the data.

P(H0|D) ≠ P(D|H0)
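A standard way to make this distinction concrete (not part of the original slide) is Bayes' theorem, which shows that getting from P(D|H0) to P(H0|D) requires the prior probability of the null and the overall probability of the data, neither of which a significance test supplies:

    \[
      P(H_0 \mid D) \;=\; \frac{P(D \mid H_0)\, P(H_0)}{P(D)}
    \]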

Page 29: Statistical Foundations: Hypothesis Testing

Misinterpretation #2

• Is a significant result a “reliable,” easily replicated result?

• Not necessarily. The p-value is a poor indicator of the replicability of a finding.

• Replicability (assuming a real effect exists, that is, that the null hypothesis is false) is primarily a function of statistical power.

Page 30: Statistical Foundations: Hypothesis Testing

Misinterpretation #2

• If a study had a statistical power equivalent to 80%, what is the probability of obtaining a “significant” result twice?

• The probability of two independent events both occurring is the simple product of the probability of each of them occurring.
  – .80 × .80 = .64

• If power = 50%? .50 × .50 = .25

• Bottom line: The likelihood of replicating a result is determined by statistical power, not the p-value derived from a significance test. When power of the test is low, the likelihood of a long-run series of replications is even lower.

Page 31: Statistical Foundations: Hypothesis Testing

Misinterpretation #3

• Is a significant result a powerful, important result?

• Not necessarily.

• The importance of the result, of course, depends on the issue at hand, the theoretical context of the finding, etc.

Page 32: Statistical Foundations: Hypothesis Testing

Misinterpretation #3

• We can measure the practical or theoretical significance of an effect using an index of effect size.

• An effect size is a quantitative index of the strength of the relationship between two variables.

• Some common measures of effect size are correlations, regression weights, t-values, and R-squared.

Page 33: Statistical Foundations: Hypothesis Testing

Misinterpretation #3

• Importantly, the same effect size can have different p-values, depending on the sample size of the study.

• For example, a correlation of .30 would not be statistically significant with a sample size of 30, but would be statistically significant with a sample size of 130.

• Bottom line: The p-value is a poor way to evaluate the practical “significance” of a research result.
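A quick hedged check of this claim, using the standard t-statistic for a correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2); the two Ns are the ones quoted above.

    import math
    from scipy import stats

    def corr_pvalue(r, n):
        # Standard t-statistic for testing a correlation against 0.
        t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
        return 2 * stats.t.sf(abs(t), df=n - 2)        # two-tailed p-value

    for n in (30, 130):
        # Not significant at N = 30 (p ~ .11); significant at N = 130 (p < .001).
        print(f"r = .30, N = {n:3d}: p = {corr_pvalue(.30, n):.4f}")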

Page 34: Statistical Foundations: Hypothesis Testing

Wrapping Up

• Today was another fun lecture about the philosophy of hypothesis testing.

• We do hypothesis testing all the time.

  – That doesn’t make it something without error, though.

Page 35: Statistical Foundations: Hypothesis Testing

Next Time

• Office hours today (1pm-4pm, 449 Fraser).

• Lab tonight (examples of hypothesis tests).

• Hypothesis testing example.

• Confidence Intervals (Ch 6.8 – 6.11).