Chapter 3: Statistical Significance Testing Warner (2007). Applied statistics: From bivariate through multivariate. Sage Publications, Inc

The process in NHST (Null Hypothesis Significance Tests)

1. Formulate a null hypothesis

For example, here is a null hypothesis about population mean human body temperature (in degrees Fahrenheit):

H0: μ = μhyp = 98.6

People widely assume that mean normal body temperature for humans is 98.6 degrees. Is this assumption correct?

Steps in NHST continued:

2. For this research question we use the one sample t test to evaluate whether the mean body temperature in one sample (M) differs significantly from the hypothesized value for the population mean, μhyp = 98.6.

The form of this t ratio is as follows:

t = (M - μhyp) / SEM

Chapter 2 described how to obtain SEM from s and N

(sample standard deviation s and sample size N)
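As a minimal sketch of the t ratio above, the Python snippet below computes SEM = s / √N and then t. The function name `one_sample_t` is hypothetical. M = 98.25 and N = 130 are the Shoemaker (1996) values quoted later in these slides; s = 0.73 is an assumed value used only for illustration, because s is not given here.

```python
import math

def one_sample_t(m, mu_hyp, s, n):
    """Return (SEM, t) for a one sample t test."""
    sem = s / math.sqrt(n)      # standard error of the mean
    t = (m - mu_hyp) / sem      # distance of M from mu_hyp, in SEM units
    return sem, t

# s = 0.73 is an assumed illustrative value, not taken from the slides.
sem, t = one_sample_t(m=98.25, mu_hyp=98.6, s=0.73, n=130)
print(f"SEM = {sem:.4f}, t = {t:.2f}")
```

With this assumed s, the obtained t comes out close to the t(129) = -5.45 reported later; the small difference reflects rounding in the assumed s.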

Verbal interpretation of: t = (M - μhyp) / SEM

This t ratio tells us: How far away from μhyp is M, in number of Standard Errors (SEM)?

If t is “large”, we reject H0 and conclude that M differs significantly from μhyp.

If t is close to zero or small, we do not reject H0.

Next question: what is our criterion for a “large” value of t?

Logic of NHST continued

3. Next we need to establish a criterion for statistical significance: how large must the obtained t ratio be for us to judge the difference between M and μhyp “statistically significant”?

This criterion (the critical value of t) depends on our choice of α level, the choice of a one versus two tailed test, and the degrees of freedom for our sample.

Questions about criteria for statistical significance:

What is the usual α level?

What does it mean to say “we used α = .05 as the criterion for significance”?

What different versions of H1 can be considered?

How do the “reject” regions in the distribution of values of t differ depending on your choice of H1?

Logic of NHST continued

4. After we have established a criterion for statistical significance (that is, after we decide on an alpha level and a one or two tailed test, and figure out the reject regions), we look at the values of M, s, and N in the sample data; we calculate a value of t; and we evaluate this obtained value of t relative to the reject regions based on the alpha level.

Specific Example

From Shoemaker (1996), use the following information to set up a significance test:

H0: μ = μhyp = 98.6

H1: μ ≠ 98.6

α = .05 (two tailed)

N = 130 df = 129

Given these values, set up a diagram to show the “reject regions” for values of t.
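The reject regions can also be located numerically. The sketch below uses only the Python standard library: it integrates the Student's t density with Simpson's rule and finds the critical value by bisection. The helper names (`t_pdf`, `upper_tail`, `critical_t`) are hypothetical, and the numerical method is just one convenient way to avoid a printed t table.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_t(alpha, df, two_tailed=True):
    """Positive t value that cuts off the per-tail share of alpha."""
    tail = alpha / 2 if two_tailed else alpha
    lo, hi = 0.0, 50.0
    for _ in range(60):         # bisection on the tail area
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

crit_two = critical_t(0.05, 129, two_tailed=True)
crit_one = critical_t(0.05, 129, two_tailed=False)
print(f"two tailed: reject H0 if t < {-crit_two:.3f} or t > {crit_two:.3f}")
print(f"one tailed: critical t = {crit_one:.3f} in the predicted direction")
```

For df = 129 and a two tailed α of .05, the reject regions lie beyond roughly ±1.98. A one tailed test puts the whole 5% in one tail, so its critical value is smaller in absolute size.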

Evaluating sample results:

Shoemaker (1996) reported the following outcome for simulated body temperature data (these data show the same pattern as data in a published medical study cited by Shoemaker):

M = 98.25, t(129) = -5.45, p < .001

What conclusions can be drawn from this result?
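One way to check the reported "p < .001" is to compute the two tailed p value for t(129) = -5.45 directly. The sketch below again uses Simpson's rule on the t density with only the standard library; `t_pdf` and `two_tailed_p` are hypothetical helper names, not part of any package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Area beyond |t| in both tails of the t distribution."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3    # area between 0 and |t|
    return 2 * (0.5 - central_half)

alpha = 0.05
t_obtained, df = -5.45, 129         # values reported from Shoemaker (1996)
p = two_tailed_p(t_obtained, df)
reject_h0 = p < alpha
print(f"p = {p:.2e}, reject H0: {reject_h0}")
```

Because p falls far below α = .05, the obtained t lies in the lower reject region and H0 is rejected; the sample mean of 98.25 is significantly lower than 98.6.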

When do problems arise in NHST?

Null hypothesis significance testing essentially involves a series of conditional “if…” statements.

If we set the alpha level and choose a directional or nondirectional test before we look at our data….

And if our data meet the assumptions required for the use of parametric statistics (e.g. scores are quantitative, nearly normally distributed, etc.)…

And if our sample is drawn randomly from the population of interest, and is representative of the population about which we want to make inferences…

Conditional if’s involved in NHST continued:

And if we do only one statistical significance test…

And if we avoid the temptation to change the criteria for significance (such as the α level) after looking at our sample data…

If and only if these conditions are met, then theoretically, using the reject regions set up early in the process of NHST, we should reject H0 only 5% of the time when H0 is actually correct; that is, our risk of committing a Type I error should be limited to 5%.
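The 5% claim can be illustrated with a small simulation in which H0 is true by construction and all of the conditions hold. This is only a sketch: the population SD (0.73), the number of simulated studies, and the random seed are assumptions chosen for illustration, and the critical value 1.979 is the two tailed t table value for α = .05 and df = 129.

```python
import math
import random
import statistics

random.seed(1)                  # fixed seed so the run is repeatable

MU_HYP = 98.6                   # H0 is true in this simulation
SIGMA = 0.73                    # assumed population SD (illustrative)
N = 130
T_CRIT = 1.979                  # two tailed critical t, alpha = .05, df = 129
N_STUDIES = 4000

rejections = 0
for _ in range(N_STUDIES):
    # Random sample from a normal population whose mean equals MU_HYP
    sample = [random.gauss(MU_HYP, SIGMA) for _ in range(N)]
    m = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(N)
    t = (m - MU_HYP) / sem
    if abs(t) > T_CRIT:         # t falls in a reject region: a Type I error
        rejections += 1

type_i_rate = rejections / N_STUDIES
print(f"Proportion of Type I errors: {type_i_rate:.3f}")
```

The proportion of rejections should land near .05, which is what "risk of Type I error limited to 5%" means in the long run.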

What happens when one or more of these conditions are not satisfied?

In actual research it is fairly common for researchers to select an alpha level after they have examined the t test outcome; to compute means and t tests for data that have non-normal distribution shapes or that violate other assumptions for the use of test statistics such as t, F, and r; to discard outlier data points from the data set if the initial significance test outcome is not significant; or to run large numbers of statistical significance tests.

What is the consequence of violating the assumptions for NHST?

When the ideal conditions for NHST are not met, our “real” risk of Type I error may be quite different from (and often much higher than) the “nominal” or “theoretical” risk of Type I error that corresponds to the stated α level.

This is often called “inflated risk of Type I error”.

What can we do to limit inflated risk of Type I error?

How does each of these procedures help us to limit inflated risk of Type I error?

1. Making sure that the sample is representative of any population about which inferences are to be made

2. Setting criteria for the “statistical significance” decision before we look at the data (values of M and t for our sample)

3. Limiting the number of hypotheses and statistical significance tests (to think about: why is it often easier to do this in experimental research than in survey or nonexperimental studies?)

4. Use of Bonferroni-corrected per-comparison α levels


5. Replication of result across additional studies

6. Cross validation of result within a study
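Items 3 and 4 in the list above can be made concrete with a little arithmetic. The sketch below assumes the k tests are independent, in which case the risk of at least one Type I error is 1 - (1 - α)^k; the Bonferroni correction divides the overall α by k. The function names and the choice of k = 10 are hypothetical.

```python
def familywise_error(alpha_per_test, k):
    """Risk of at least one Type I error across k independent tests."""
    return 1 - (1 - alpha_per_test) ** k

def bonferroni_alpha(alpha_overall, k):
    """Per-comparison alpha under the Bonferroni correction."""
    return alpha_overall / k

k = 10                          # hypothetical number of significance tests
uncorrected = familywise_error(0.05, k)
per_test = bonferroni_alpha(0.05, k)
corrected = familywise_error(per_test, k)

print(f"Risk with 10 tests at alpha = .05 each: {uncorrected:.3f}")
print(f"Bonferroni per-comparison alpha:        {per_test:.4f}")
print(f"Risk with 10 tests at corrected alpha:  {corrected:.3f}")
```

With ten tests at α = .05 each, the risk of at least one Type I error is about .40; testing each at .005 brings it back to just under .05.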

Questions to discuss:

What conclusions can we draw when we obtain a nonsignificant outcome for a one sample t test?

What conclusions can we draw when we obtain a statistically significant outcome for a one sample t test?

Reporting recommendations

It is important to provide additional information, and not to report a t test in isolation.

For a one sample t test, the research report should include:

The values of the sample statistics (M, s, N)

The t ratio and its degrees of freedom.

Reporting recommendations

A statement of whether the t ratio is statistically significant at the predetermined α level

and/or

An exact p value can be reported (along with an indication whether it is one or two tailed).

Reporting recommendations, continued:


An indication of effect size or magnitude of difference.

For example, for the one sample t test, we can set up Cohen’s d:

d = (M - μhyp) / s

In words, d tells us: what was the difference between M and μhyp in number of standard deviations?
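Because SEM = s / √N, the formulas for t and d in these slides imply t = d × √N, so d can be recovered from a reported t ratio as d = t / √N. The sketch below applies this to the Shoemaker (1996) values given earlier (M = 98.25, t(129) = -5.45, N = 130); the function names are hypothetical.

```python
import math

def cohens_d(m, mu_hyp, s):
    """Cohen's d for a one sample test: mean difference in SD units."""
    return (m - mu_hyp) / s

def d_from_t(t, n):
    """Recover d from a one sample t ratio, using t = d * sqrt(N)."""
    return t / math.sqrt(n)

d = d_from_t(t=-5.45, n=130)
s_implied = (98.25 - 98.6) / d  # SD implied by the reported M and t
print(f"d = {d:.3f}, implied s = {s_implied:.3f}")
```

Here d is about -0.48: the sample mean lies roughly half a standard deviation below 98.6.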


A Confidence Interval based on the sample mean should also be included as part of results.
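A 95% confidence interval for the population mean has the form M ± (critical t × SEM). The sketch below backs SEM out of the reported Shoemaker (1996) values, since SEM = (M - μhyp) / t, and uses 1.979 as the two tailed critical t for α = .05 with df = 129 (a t table value, not given in the slides). The function name is hypothetical.

```python
def confidence_interval(m, sem, t_crit):
    """Return (lower, upper) limits of the CI around the sample mean."""
    margin = t_crit * sem
    return m - margin, m + margin

M, MU_HYP, T_OBTAINED = 98.25, 98.6, -5.45
sem = (M - MU_HYP) / T_OBTAINED # SEM implied by the reported t ratio
T_CRIT = 1.979                  # table value: alpha = .05 two tailed, df = 129

lower, upper = confidence_interval(M, sem, T_CRIT)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print("Contains 98.6:", lower <= MU_HYP <= upper)
```

The interval runs from about 98.12 to 98.38 and does not contain 98.6, which agrees with the decision to reject H0 at α = .05, two tailed.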

Statistical power: Notice that, given a specific set of numerical values for M, s, and μhyp, the magnitude of SEM, and therefore the size of the t ratio, depends upon N (sample size).

Given a sample size N, we can (roughly) predict the size of t if we can make reasonably accurate guesses about the value of d.

Due to sampling error, and our inability to know the exact values of M and s before we collect data, we cannot predict the value of t exactly.

However, there are statistical power tables that tell us the (approximate) probability of obtaining a t value large enough to reject H0, as a function of effect size (d) and N. The probability of (correctly) rejecting H0 when H0 is false is called statistical power.
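In place of a printed power table, power can be approximated in a few lines. The sketch below treats the t ratio as approximately normal with mean d × √N, which is a simplification (exact power uses the noncentral t distribution) and is slightly optimistic when N is small. The function name and the (d, N) pairs are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(d, n, alpha=0.05):
    """Approximate power of a two tailed one sample t test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value for two tailed alpha
    shift = abs(d) * math.sqrt(n)       # expected size of the t ratio
    # Probability of landing in either reject region when the true effect is d
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for d, n in [(0.2, 25), (0.5, 25), (0.5, 50), (0.8, 25)]:
    print(f"d = {d}, N = {n}: power = {approx_power(d, n):.2f}")
print(f"d = 0.5, N = 25, alpha = .01: power = {approx_power(0.5, 25, alpha=0.01):.2f}")
```

Changing one argument at a time (d, N, or alpha) while holding the others fixed is a quick way to check answers to the questions that follow.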

Questions about statistical power:

Several factors influence statistical power for a one sample t test. How does statistical power change (increase/ decrease) for each of the following changes?

(In every question, we assume that all other terms included in the t ratio remain the same.)

Questions about statistical power: does it increase/decrease/stay the same?

As d (effect size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____.

As N (sample size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____.

As the α level is made smaller, for example, if we change from .05 to .01, statistical power ____.

Questions about statistical power, continued:

If we know ahead of time that the effect size d is very small, what does this tell us about the N we will need in order to have adequate statistical power?

If we know ahead of time that the effect size d is very large, what does this tell us about the N we will need in order to have adequate statistical power?
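A rough planning formula makes the link between d and N explicit: under a normal approximation, N ≈ ((z for α + z for power) / d)². The sketch below assumes a two tailed α of .05 and a target power of .80 (a common convention, not stated in these slides). The function name is hypothetical, and the result slightly understates the N an exact t-based calculation would give.

```python
import math
from statistics import NormalDist

def required_n(d, power=0.80, alpha=0.05):
    """Approximate N for a two tailed one sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # cutoff for the reject region
    z_power = z.inv_cdf(power)          # extra distance needed to reach the target power
    return math.ceil(((z_alpha + z_power) / d) ** 2)

n_small = required_n(0.2)       # small effect
n_large = required_n(0.8)       # large effect
print(f"d = 0.2 needs N of about {n_small}; d = 0.8 needs N of about {n_large}")
```

Because N depends on 1 / d², a fourfold reduction in d raises the required N roughly sixteenfold.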

Some logical problems with NHST

NHST does not tell us: “Given the sample mean M obtained in our study, how likely is it that H0 is correct?”.

Instead, a significance test tells us: “If we assume that the null hypothesis is true, how likely or unlikely is the value of M that we obtained in our study?”

The nature of NHST:

Often, researchers want to reject H0 (this is almost always the case when we set up hypotheses about relationships between variables; it is less often true for tests about a single population mean).

Often, researchers hope to obtain a value of M far away from μhyp, and a value of t that is far away from 0, because these are outcomes that would be unlikely to occur if H0 is true.

The logic of NHST more generally:

In later chapters, a typical null hypothesis corresponds to an assumption that there is no relationship between a predictor and an outcome variable.

Usually researchers hope to reject this null hypothesis.

This type of logic is awkward for several reasons:

Reasons why NHST logic is problematic

In everyday reasoning, people have a strong preference for setting up hypotheses that they believe and then searching for confirmatory evidence.

In NHST, researchers usually set up a null hypothesis they do not believe, and then look for disconfirmatory evidence.

This runs counter to our everyday habits, and also involves a double negative (rejection of the null hypothesis is interpreted as support for a belief that the variables in a study are related – but this conclusion is logically somewhat problematic.)

Additional reasons why NHST may be problematic in practice

Particularly in nonexperimental studies that include measurements for large numbers of variables, researchers often run large numbers of statistical significance tests; in these situations, unless precautions are taken to limit risk of Type I error, the p values obtained using NHST methods may greatly underestimate the true risk of Type I error.

Conclusion:

Some professional associations (such as the American Psychological Association) have evaluated the problems that can arise in the use of NHST. They stopped short of recommending that researchers abandon NHST; it can be useful as a means of trying to rule out sampling error as a highly likely explanation for the outcomes in a study.

Conclusion:

The APA now recommends that we do not report significance test results alone. In addition to statistical significance tests, we should report descriptive data for all groups (e.g., M, s, N), confidence intervals, and effect size information. This additional information provides a better basis for readers to evaluate the outcomes of studies.
