Chapter 3: Statistical Significance Testing Warner (2007). Applied statistics: From bivariate through multivariate. Sage Publications, Inc.

Chapter 3: Statistical Significance Testing

Warner (2007). Applied statistics:

From bivariate through multivariate.

Sage Publications, Inc.

The process in NHST (Null Hypothesis Significance Testing)

1. Formulate a null hypothesis

For example, here is a null hypothesis about the population mean human body temperature (in degrees Fahrenheit):

H0: μ = μhyp = 98.6

People widely assume that mean normal body temperature for humans is 98.6 degrees. Is this assumption correct?

Steps in NHST continued:

2. For this research question, we use the one sample t test to evaluate whether the mean body temperature in one sample (M) differs significantly from the hypothesized value for the population mean, μhyp = 98.6.

The form of this t ratio is as follows:

t = (M - μhyp) / SEM

Chapter 2 described how to obtain SEM from s and N

(sample standard deviation s and sample size N)
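The t ratio can be sketched in Python as follows. This is a minimal illustration, not code from Warner (2007). The function names are hypothetical, and SEM is taken to be s / √N. The values of M and N come from the body temperature example later in these slides; s = 0.73 is an assumed value, because the slides do not report s.

```python
import math

def standard_error_of_mean(s, n):
    """SEM = s / sqrt(N), from the sample standard deviation s and sample size N."""
    return s / math.sqrt(n)

def one_sample_t(m, mu_hyp, s, n):
    """t = (M - mu_hyp) / SEM: distance of M from mu_hyp in standard-error units."""
    return (m - mu_hyp) / standard_error_of_mean(s, n)

# M = 98.25, mu_hyp = 98.6 and N = 130 follow the slides; s = 0.73 is ASSUMED.
t = one_sample_t(m=98.25, mu_hyp=98.6, s=0.73, n=130)
print(f"t({130 - 1}) = {t:.2f}")
```

A negative t means the sample mean lies below the hypothesized value.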

Verbal interpretation of: t = (M - μhyp) / SEM

This t ratio tells us: How far away from μhyp is M, in number of standard errors (SEM)?

If t is “large”, we reject H0 and conclude that M differs significantly from μhyp.

If t is close to zero or small, we do not reject H0.

Next question: what is our criterion for a “large” value of t?

Logic of NHST continued

3. Next we need to establish a criterion for statistical significance (i.e., how large must the obtained t ratio be to judge the difference between M and μhyp “statistically significant”?).

This criterion (the critical value of t) depends on our choice of α level, the choice of a one versus two tailed test, and the degrees of freedom for our sample.

Questions about criteria for statistical significance:

What is the usual α level?

What does it mean to say “we used α = .05 as the criterion for significance”?

What different versions of H1 can be considered?

How do the “reject” regions in the distribution of values of t differ depending on your choice of H1?

Logic of NHST continued

4. After we have established a criterion for statistical significance (that is, after we decide on an alpha level and a one or two tailed test, and figure out the reject regions), we look at the values of M, s, and N in the sample data; we calculate a value of t; and we evaluate this obtained value of t relative to the reject regions based on the alpha level.
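The comparison in step 4 can be sketched as a small decision rule. This is an illustrative sketch only: the function name and the `tails` labels are hypothetical, the critical value is assumed to have been looked up beforehand (from a t table or software) as a positive number, and the example t values are made up.

```python
def nhst_decision(t_obtained, t_critical, tails="two"):
    """Compare an obtained t with reject regions fixed before looking at the data.

    t_critical is the positive critical value for the chosen alpha level and df.
    tails: "two" (nondirectional), "upper" (H1: mu > mu_hyp), "lower" (H1: mu < mu_hyp).
    """
    if tails == "two":
        reject = abs(t_obtained) >= t_critical
    elif tails == "upper":
        reject = t_obtained >= t_critical
    elif tails == "lower":
        reject = t_obtained <= -t_critical
    else:
        raise ValueError("tails must be 'two', 'upper', or 'lower'")
    return "reject H0" if reject else "do not reject H0"

# Hypothetical values: a critical t of 2.0 and two possible obtained t ratios.
print(nhst_decision(2.5, 2.0))   # falls in a reject region
print(nhst_decision(1.0, 2.0))   # does not
```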

Specific Example

From Shoemaker (1996), use the following information to set up a significance test:

H0: μ = μhyp = 98.6

H1: μ ≠ 98.6

α = .05 (two tailed)

N = 130, df = 129

Given these values, set up a diagram to show the “reject regions” for values of t.
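One way to find the boundaries of the reject regions without a printed t table is sketched below. It is a rough numerical sketch under stated assumptions, not a method from the slides. It integrates the t density with Simpson's rule and finds the critical value by bisection, and all function names are hypothetical. In practice a t table or a statistics package would normally be used.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) = 0.5 + integral of the density from 0 to x (Simpson's rule)."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha, df, two_tailed=True):
    """Positive critical t: the upper-tail area beyond it equals alpha (or alpha / 2)."""
    tail = alpha / 2 if two_tailed else alpha
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection on the upper-tail probability
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

crit = t_critical(alpha=0.05, df=129)
print(f"Reject H0 if t <= {-crit:.3f} or t >= {crit:.3f}")
```

For df = 129 and α = .05 (two tailed), the result should be close to the tabled critical value of about 1.98.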

Evaluating sample results:

Shoemaker (1996) reported the following outcome for simulated body temperature data (these data show the same pattern as data in a published medical study cited by Shoemaker):

M = 98.25, t(129) = -5.45, p < .001

What conclusions can be drawn from this result?

When do problems arise in NHST?

Null hypothesis significance testing essentially involves a series of conditional “if…” statements.

If we set the alpha level and choose a directional or nondirectional test before we look at our data….

And if our data meet the assumptions required for the use of parametric statistics (e.g. scores are quantitative, nearly normally distributed, etc.)…

And if our sample is drawn randomly from the population of interest, and is representative of the population about which we want to make inferences…

Conditional if’s involved in NHST continued:

And if we do only one statistical significance test…

And if we avoid the temptation to change the criteria for significance (such as the α level) after looking at our sample data…

If and only if these conditions are met, then theoretically, using the reject regions set up early in the process of NHST, we should reject H0 only 5% of the time when H0 is actually correct; that is, our risk of committing a Type I error should be limited to 5%.
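The 5% figure can be checked with a small simulation. This is a sketch under assumed conditions, not an analysis from the slides: scores are drawn from a normal population in which H0 is true, each simulated study does exactly one two-tailed test, and the critical value 2.045 (df = 29, α = .05) is taken from a standard t table. The population values, the seed, and the function name are hypothetical.

```python
import math
import random
import statistics

def simulate_type_i_error_rate(n_studies=4000, n=30, mu=98.6, sigma=0.7,
                               t_critical=2.045, seed=1):
    """Proportion of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Random sample from a normal population whose mean equals mu_hyp.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = statistics.mean(sample)
        sem = statistics.stdev(sample) / math.sqrt(n)
        t = (m - mu) / sem
        if abs(t) >= t_critical:  # reject region fixed in advance
            rejections += 1
    return rejections / n_studies

rate = simulate_type_i_error_rate()
print(f"Proportion of Type I errors: {rate:.3f}")
```

Under these ideal conditions the proportion should land near .05, with some variation from one seed to another.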

What happens when one or more of these conditions are not satisfied?

In actual research it is fairly common for researchers to select an alpha level after they have examined the t test outcome; to compute means and t tests for data that have non-normal distribution shapes or that violate other assumptions for the use of test statistics such as t, F, and r; to discard outlier data points from the data set if the initial significance test outcome is not significant; or to run large numbers of statistical significance tests.

What is the consequence of violating the assumptions for NHST?

When the ideal conditions for NHST are not met, our “real” risk of Type I error may be quite different from (and often much higher than) the “nominal” or “theoretical” risk of Type I error that corresponds to the stated α level.

This is often called “inflated risk of Type I error”.
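One source of inflation, running many tests, can be illustrated with a short calculation. The sketch below assumes that the k tests are independent and that H0 is true for every one of them. Under those assumptions, the chance of at least one Type I error is 1 - (1 - α)^k. The function name is hypothetical.

```python
def familywise_error_rate(alpha, k):
    """P(at least one Type I error) across k independent tests, each run at alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    risk = familywise_error_rate(0.05, k)
    print(f"{k:2d} tests at alpha = .05 -> risk of at least one Type I error = {risk:.2f}")
```

With real data the tests are often correlated, so this formula is only a rough guide to how quickly the risk can grow.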

What can we do to limit inflated risk of Type I error?

How does each of these procedures help us to limit inflated risk of Type I error?

1. Making sure that the sample is representative of any population about which inferences are to be made

2. Setting criteria for “statistical significance” decision before we look at the data (values of M and t for our sample)

How does each of these procedures help us to limit inflated risk of Type I error?

3. Limit the number of hypotheses and statistical significance tests (to think about: why is it often easier to do this in experimental research than in survey or non experimental studies?)

4. Use of Bonferroni corrected per comparison α levels

How does each of these procedures help us to limit inflated risk of Type I error?

5. Replication of result across additional studies

6. Cross validation of result within a study
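The Bonferroni correction named in item 4 can be sketched as follows. This is a minimal illustration: the function names and the p values are hypothetical. The only rule used is that the per comparison α equals the experiment-wise α divided by the number of tests, k.

```python
def bonferroni_alpha(alpha_experimentwise, k):
    """Per comparison alpha level when k significance tests are run."""
    return alpha_experimentwise / k

def significant_after_bonferroni(p_values, alpha_experimentwise=0.05):
    """Flag each p value as significant or not at the corrected level."""
    alpha_pc = bonferroni_alpha(alpha_experimentwise, len(p_values))
    return [p < alpha_pc for p in p_values]

# Hypothetical p values from four tests; the corrected level is .05 / 4 = .0125.
p_values = [0.001, 0.02, 0.04, 0.30]
flags = significant_after_bonferroni(p_values)
print(flags)
```

Only the first of the four tests would be judged significant here, although three of them have p < .05 before correction.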

Questions to discuss:

What conclusions can we draw when we obtain a non significant outcome for a one sample t test?

What conclusions can we draw when we obtain a statistically significant outcome for a one sample t test?

Reporting recommendations

It is important to provide additional information, and not to report a t test in isolation.

For a one sample t test, the research report should include:

The values of the sample statistics (M, s, N)

The t ratio and its degrees of freedom.

Reporting recommendations

A statement whether the t ratio is statistically significant at the pre-determined α level

and/or

An exact p value can be reported (along with an indication whether it is one or two tailed).

Reporting recommendations, continued:

An indication of effect size or magnitude of difference.

For example, for the one sample t test, we can set up Cohen’s d:

d = (M - μhyp) / s

In words, d tells us: what was the difference between M and μhyp, in number of standard deviations?
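Cohen's d for the one sample case can be sketched as below. The function names are hypothetical. Because the slides do not report s for the body temperature example, the second function uses an algebraic shortcut: since t = (M - μhyp) / (s / √N), it follows that d = t / √N, so d can be recovered from the reported t and N.

```python
import math

def cohens_d(m, mu_hyp, s):
    """d = (M - mu_hyp) / s: the difference in standard-deviation units."""
    return (m - mu_hyp) / s

def cohens_d_from_t(t, n):
    """Equivalent form for the one sample t test: d = t / sqrt(N)."""
    return t / math.sqrt(n)

# Reported values from the body temperature example: t(129) = -5.45, N = 130.
d = cohens_d_from_t(t=-5.45, n=130)
print(f"d = {d:.2f}")
```

The sign of d only shows the direction of the difference (here, M below μhyp); its absolute value describes the size of the effect.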

Reporting recommendations, continued:

A Confidence Interval based on the sample mean should also be included as part of results.
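A confidence interval of this kind has the form M ± (critical t × SEM). The sketch below is illustrative only. The function name is hypothetical, s = 0.73 is an assumed value (the slides do not report s), and the critical value 1.98 is an approximate table value for df = 129 with α = .05, two tailed.

```python
import math

def confidence_interval(m, s, n, t_critical):
    """Return (lower, upper) limits: M -/+ t_critical * SEM, with SEM = s / sqrt(N)."""
    sem = s / math.sqrt(n)
    margin = t_critical * sem
    return m - margin, m + margin

# M and N follow the body temperature example; s and t_critical are ASSUMED.
lower, upper = confidence_interval(m=98.25, s=0.73, n=130, t_critical=1.98)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

If the hypothesized value (98.6) falls outside the 95% interval, that agrees with rejecting H0 at α = .05, two tailed.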

Statistical power: Notice that, given a specific set of numerical values for M, s, and μhyp, the magnitude of SEM, and therefore the size of the t ratio, depends upon N (sample size).

Given a sample size N, we can (roughly) predict the size of t if we can make reasonably accurate guesses about the value of d.

Due to sampling error, and our inability to know the exact values of M and s before we collect data, we cannot predict the value of t exactly.

However, there are statistical power tables that tell us the (approximate) probability of obtaining a t value large enough to reject H0, as a function of effect size (d) and N. The probability of (correctly) rejecting H0 when H0 is false is called statistical power.
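In place of a power table, a rough calculation can be sketched as follows. This is an approximation under stated assumptions, not the method behind published power tables. It treats the t ratio as normally distributed around its expected value, d × √N, and uses a normal critical value in place of the critical t. Exact power for the t test uses the noncentral t distribution, so these values can differ somewhat, especially for small N. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """Approximate two-tailed power for a one sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    expected_t = d * math.sqrt(n)  # roughly where t is centered when the effect is d
    # Probability of landing in the upper or the lower reject region.
    return z.cdf(expected_t - z_crit) + z.cdf(-expected_t - z_crit)

# Hypothetical effect sizes and sample sizes.
for d, n in [(0.2, 32), (0.5, 32), (0.5, 64)]:
    print(f"d = {d}, N = {n}: approximate power = {approx_power(d, n):.2f}")
```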

Questions about statistical power:

Several factors influence statistical power for a one sample t test. How does statistical power change (increase/ decrease) for each of the following changes?

(In every question, we assume that all other terms included in the t ratio remain the same.)

Questions about statistical power: does it increase/decrease/stay the same?

As d (effect size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____.

As N (sample size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____.

As the α level is made smaller, for example, if we change α from .05 to .01, statistical power ____.

Questions about statistical power, continued:

If we know ahead of time that the effect size d is very small, what does this tell us about the N we will need in order to have adequate statistical power?

If we know ahead of time that the effect size d is very large, what does this tell us about the N we will need in order to have adequate statistical power?
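A rough planning calculation can help in thinking about these two questions. The sketch below rests on assumptions that do not come from the slides: a normal approximation to the t test, a two tailed α of .05, and a target power of .80 (a common convention). The function name is hypothetical, and a power table or dedicated software would give more exact values.

```python
import math
from statistics import NormalDist

def approx_required_n(d, power=0.80, alpha=0.05):
    """Approximate N needed to reach the target power for effect size d."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two tailed test
    z_power = z.inv_cdf(power)          # how far expected t must exceed the cutoff
    return math.ceil(((z_alpha + z_power) / abs(d)) ** 2)

# Hypothetical effect sizes, from small to large.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: approximate N = {approx_required_n(d)}")
```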

Some logical problems with NHST

NHST does not tell us: “Given the sample mean M obtained in our study, how likely is it that H0 is correct?”.

Instead, a significance test tells us: “If we assume that the null hypothesis is true, how likely or unlikely is the value of M that we obtained in our study?”

The nature of NHST:

Often, researchers want to reject H0 (this is almost always the case when we set up hypotheses about relationships between variables; it is less often true for tests about a single population mean).

Often, researchers hope to obtain a value of M far away from μhyp, and a value of t that is far away from 0, because these are outcomes that would be unlikely to occur if H0 is true.

The logic of NHST more generally:

In later chapters, a typical null hypothesis corresponds to an assumption that there is no relationship between a predictor and an outcome variable.

Usually researchers hope to reject this null hypothesis.

This type of logic is awkward for several reasons:

Reasons why NHST logic is problematic

In everyday reasoning, people have a strong preference for setting up hypotheses that they believe and then searching for confirmatory evidence.

In NHST, researchers usually set up a null hypothesis they do not believe, and then look for disconfirmatory evidence.

This runs counter to our everyday habits, and also involves a double negative: rejection of the null hypothesis is interpreted as support for a belief that the variables in a study are related, but this conclusion is logically somewhat problematic.

Additional reasons why NHST may be problematic in practice

Particularly in non experimental studies that include measurements for large numbers of variables, researchers often run large numbers of statistical significance tests; in these situations, unless precautions are taken to limit risk of Type I error, the p values obtained using NHST methods may greatly underestimate the true risk of Type I error.

Conclusion:

Some professional associations (such as the American Psychological Association) have evaluated the problems that can arise in the use of NHST. They stopped short of recommending that researchers abandon NHST; it can be useful as a means of trying to rule out sampling error as a highly likely explanation for the outcomes in a study.

The APA now recommends that we do not report significance test results alone. In addition to statistical significance tests, we should report descriptive data for all groups (e.g., M, s, N), confidence intervals, and effect size information. This additional information provides a better basis for readers to evaluate the outcomes of studies.