
Chapter 3: Statistical Significance Testing. Warner (2007). Applied Statistics: From Bivariate Through Multivariate. Sage Publications, Inc.




  • The process in NHST (Null Hypothesis Significance Testing): 1. Formulate a null hypothesis. For example, here is a null hypothesis about the population mean human body temperature (in degrees Fahrenheit): H0: μhyp = 98.6. People widely assume that mean normal body temperature for humans is 98.6 degrees. Is this assumption correct?

  • Steps in NHST, continued: 2. For this research question we use the one-sample t test to evaluate whether the mean body temperature in one sample (M) differs significantly from this hypothesized value for the population mean, μhyp = 98.6. The t ratio has the following form: t = (M - μhyp) / SEM. Chapter 2 described how to obtain SEM from s and N (the sample standard deviation s and the sample size N).
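    The t ratio above can be sketched in a few lines of Python. The values of M and N below match the Shoemaker example discussed later in the chapter, but s = 0.73 is an assumed standard deviation chosen for illustration (the chapter does not report s):

```python
from math import sqrt

def one_sample_t(m, mu_hyp, s, n):
    """One-sample t ratio: t = (M - mu_hyp) / SE_M, where SE_M = s / sqrt(N)."""
    se_m = s / sqrt(n)
    return (m - mu_hyp) / se_m

# Illustrative values; s = 0.73 is assumed, not reported in the chapter.
t = one_sample_t(m=98.25, mu_hyp=98.6, s=0.73, n=130)
# t comes out near -5.5, close to the t(129) = -5.45 reported later
```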

  • Verbal interpretation of t = (M - μhyp) / SEM:

    This t ratio tells us: how far away from μhyp is M, in number of standard errors (SEM)?

    If t is large, we reject H0 and conclude that M differs significantly from μhyp. If t is close to zero or small, we do not reject H0. Next question: what is our criterion for a large value of t?

  • Logic of NHST, continued: 3. Next we need to establish a criterion for statistical significance (i.e., how large must the obtained t ratio be to judge the difference between M and μhyp statistically significant?). This criterion (the critical value of t) depends on our choice of α level, the choice of a one- versus two-tailed test, and the degrees of freedom for our sample.

  • Questions about criteria for statistical significance: What is the usual α level?

    What does it mean to say we used α = .05 as the criterion for significance?

    What different versions of H1 can be considered?

    How do the reject regions in the distribution of values of t differ depending on your choice of H1?

  • Logic of NHST, continued: 4. After we have established a criterion for statistical significance (that is, after we decide on an alpha level and a one- or two-tailed test, and figure out the reject regions), we look at the values of M, s, and N in the sample data; we calculate a value of t; and we evaluate this obtained value of t relative to the reject regions based on the alpha level.

  • Specific example: From Shoemaker (1996), use the following information to set up a significance test: H0: μhyp = 98.6; H1: μhyp ≠ 98.6; α = .05 (two-tailed); N = 130; df = 129. Given these values, set up a diagram to show the reject regions for values of t.
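    A minimal sketch of this decision rule. Python's standard library has no t-distribution quantile function, so the standard normal cutoff stands in; with df = 129 the approximation is close:

```python
from statistics import NormalDist

# df = 129 is large, so the t distribution is close to standard normal;
# the stdlib has no t quantiles, so the normal cutoff approximates it.
alpha = 0.05
t_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

def decision(t_obtained):
    """Two-tailed decision rule for H0 at alpha = .05."""
    return "reject H0" if abs(t_obtained) > t_crit else "do not reject H0"
```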

  • Evaluating sample results: Shoemaker (1996) reported the following outcome for simulated body temperature data (these data show the same pattern as data in a published medical study cited by Shoemaker): M = 98.25, t(129) = -5.45, p < .001

    What conclusions can be drawn from this result?

  • When do problems arise in NHST? Null hypothesis significance testing essentially involves a series of conditional if statements: If we set the alpha level and choose a directional or nondirectional test before we look at our data; and if our data meet the assumptions required for the use of parametric statistics (e.g., scores are quantitative, nearly normally distributed, etc.); and if our sample is drawn randomly from the population of interest and is representative of the population about which we want to make inferences ...

  • Conditional ifs involved in NHST, continued: ... and if we do only one statistical significance test; and if we avoid the temptation to change the criteria for significance (such as the α level) after looking at our sample data. If and only if these conditions are met, then theoretically, using the reject regions set up early in the process of NHST, we should reject H0 only 5% of the time when H0 is actually correct; that is, our risk of committing a Type I error should be limited to 5%.
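    The claim that the Type I error rate stays near 5% when all of these conditions hold can be checked with a small simulation. Everything here (the population mean and SD, the trial count) is made up for illustration:

```python
import random
from math import sqrt
from statistics import mean, stdev

# When H0 is true, sampling is random, and the criterion is fixed in
# advance, H0 should be rejected on only about 5% of repeated studies.
random.seed(0)
MU, SIGMA, N, TRIALS = 98.6, 0.7, 130, 2000
T_CRIT = 1.96   # large-df (normal) approximation of the cutoff

rejections = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    t = (mean(sample) - MU) / (stdev(sample) / sqrt(N))
    if abs(t) > T_CRIT:
        rejections += 1

type1_rate = rejections / TRIALS   # lands near 0.05
```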

  • What happens when one or more of these conditions are not satisfied? In actual research it is fairly common for researchers to select an alpha level after they have examined the t test outcome; to compute means and t tests for data that have non-normal distribution shapes or that violate other assumptions for the use of test statistics such as t, F, and r; to discard outlier data points from the data set if the initial significance test outcome is not significant; or to run large numbers of statistical significance tests.

  • What is the consequence of violating the assumptions for NHST? When the ideal conditions for NHST are not met, our real risk of Type I error may be quite different from (and often much higher than) the nominal or theoretical risk of Type I error that corresponds to the stated α level. This is often called inflated risk of Type I error. What can we do to limit inflated risk of Type I error?
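    One way to see how quickly the risk inflates: if k independent tests are each run at α = .05, the chance of at least one Type I error across the family is 1 - (1 - α)^k. A sketch:

```python
def familywise_alpha(alpha_pc, k):
    """Chance of at least one Type I error across k independent tests,
    each run at per-comparison level alpha_pc."""
    return 1 - (1 - alpha_pc) ** k

inflated = familywise_alpha(0.05, 20)   # about 0.64 for 20 tests
```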

  • How does each of these procedures help us to limit inflated risk of Type I error? 1. Making sure that the sample is representative of any population about which inferences are to be made. 2. Setting the criteria for the statistical significance decision before we look at the data (the values of M and t for our sample).

  • How does each of these procedures help us to limit inflated risk of Type I error? 3. Limiting the number of hypotheses and statistical significance tests (to think about: why is it often easier to do this in experimental research than in survey or nonexperimental studies?). 4. Using Bonferroni-corrected per-comparison α levels.
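    The Bonferroni correction divides the familywise α across the planned tests, which pulls the overall risk back under the nominal level. A sketch:

```python
def bonferroni_alpha(alpha_family, k):
    """Per-comparison alpha under the Bonferroni correction."""
    return alpha_family / k

a_pc = bonferroni_alpha(0.05, 10)    # 0.005 for 10 planned tests
familywise = 1 - (1 - a_pc) ** 10    # about 0.049, back under .05
```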

  • How does each of these procedures help us to limit inflated risk of Type I error? 5. Replication of the result across additional studies.

    6. Cross-validation of the result within a study.

  • Questions to discuss: What conclusions can we draw when we obtain a nonsignificant outcome for a one-sample t test?

    What conclusions can we draw when we obtain a statistically significant outcome for a one-sample t test?

  • Reporting recommendations: It is important to provide additional information and not to report a t test in isolation. For a one-sample t test, the research report should include: the values of the sample statistics (M, s, N); the t ratio and its degrees of freedom.

  • Reporting recommendations, continued: A statement of whether the t ratio is statistically significant at the predetermined α level.

    An exact p value can be reported (along with an indication of whether it is one- or two-tailed).


  • Reporting recommendations, continued: An indication of effect size or magnitude of difference. For example, for the one-sample t test, we can set up Cohen's d: d = (M - μhyp) / s. In words, d tells us: what was the difference between M and μhyp, in number of standard deviations?
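    Cohen's d for this design is a one-liner. As before, s = 0.73 is an assumed value used only for illustration:

```python
def cohens_d(m, mu_hyp, s):
    """d = (M - mu_hyp) / s: the difference in standard-deviation units."""
    return (m - mu_hyp) / s

# s = 0.73 is assumed, not reported in the chapter.
d = cohens_d(98.25, 98.6, 0.73)   # about -0.48, a medium-sized effect
```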

  • Reporting recommendations, continued: A confidence interval based on the sample mean should also be included as part of the results.

  • Statistical power: Notice that, given a specific set of numerical values for M, s, and μhyp, the magnitude of SEM, and therefore the size of the t ratio, depends on N (the sample size).

  • Given a sample size N, we can (roughly) predict the size of t if we can make reasonably accurate guesses about the value of d. Due to sampling error, and our inability to know the exact values of M and s before we collect data, we cannot predict the value of t exactly. However, there are statistical power tables that tell us the (approximate) probability of obtaining a t value large enough to reject H0, as a function of effect size (d) and N. The probability of (correctly) rejecting H0 when H0 is false is called statistical power.
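    The link between d, N, and t can be made explicit: substituting SEM = s / sqrt(N) into the t ratio gives t = d * sqrt(N). A sketch, with d = 0.2 as an arbitrary example of a small effect:

```python
from math import sqrt

def expected_t(d, n):
    """t = d * sqrt(N), from substituting SE_M = s / sqrt(N) into the t ratio."""
    return d * sqrt(n)

# The same small effect is far from the two-tailed cutoff (about 1.96)
# at N = 25 but well past it at N = 400:
t_small_n = expected_t(0.2, 25)    # 1.0
t_large_n = expected_t(0.2, 400)   # 4.0
```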

  • Questions about statistical power: Several factors influence statistical power for a one-sample t test. How does statistical power change (increase/decrease) for each of the following changes? (In every question, we assume that all other terms included in the t ratio remain the same.)

  • Questions about statistical power: does it increase, decrease, or stay the same? As d (effect size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____. As N (sample size) increases, assuming that all other terms in the t ratio remain the same, statistical power ____. As the α level is made smaller, for example, if we change α from .05 to .01, statistical power ____.

  • Questions about statistical power, continued:If we know ahead of time that the effect size d is very small, what does this tell us about the N we will need in order to have adequate statistical power? If we know ahead of time that the effect size d is very large, what does this tell us about the N we will need in order to have adequate statistical power?

  • Some logical problems with NHST: NHST does not tell us: given the sample mean M obtained in our study, how likely is it that H0 is correct?

    Instead, a significance test tells us: if we assume that the null hypothesis is true, how likely or unlikely is the value of M that we obtained in our study?

  • The nature of NHST: Often, researchers want to reject H0 (this is almost always the case when we set up hypotheses about relationships between variables; it is less often true for tests about a single population mean). Often, researchers hope to obtain a value of M far away from μhyp, and a value of t that is far away from 0, because these are outcomes that would be unlikely to occur if H0 is true.

  • The logic of NHST more generally: In later chapters, a typical null hypothesis corresponds to an assumption that there is no relationship between a predictor and an outcome variable. Usually researchers hope to reject this null hypothesis. This type of logic is awkward for several reasons:

  • Reasons why NHST logic is problematic: In everyday reasoning, people have a strong preference for setting up hypotheses that they believe and then searching for confirmatory evidence. In NHST, researchers usually set up a null hypothesis they do not believe, and then look for disconfirmatory evidence. This runs counter to our everyday habits and also involves a double negative (rejection of the null hypothesis is interpreted as support for a belief that the variables in a study are related, but this conclusion is logically somewhat problematic).

  • Additional reasons why NHST may be problematic in practice: Particularly in nonexperimental studies that include measurements for large numbers of variables, researchers often run large numbers of statistical significance tests; in these situations, unless precautions are taken to limit the risk of Type I error, the p values obtained using NHST methods may greatly underestimate the true risk of Type I error.

  • Conclusion: Some professional associations (such as the American Psychological Association) have evaluated the problems that can arise in the use of NHST. They stopped short of recommending that researchers abandon NHST; it can be useful as a means of trying to rule out sampling error as a highly likely explanation for the outcomes in a study.

  • Conclusion: The APA now recommends that we not report significance test results alone. In addition to statistical significance tests, we should report descriptive data for all groups (e.g., M, s, N); confidence intervals; and effect size information. This additional information provides a better basis for readers to evaluate the outcomes of studies.

    Shoemaker (1996) has posted an article on the web, including simulated sample data, for this research question. These data can also be used to review confidence intervals, to examine the correlation between body temperature and heart rate, and to examine gender differences in mean temperature and heart rate.

    If we obtain a random sample from the population of human adults, compute a mean body temperature (M) for this sample, and evaluate M relative to this hypothesized population mean, we can decide whether there is a statistically significant difference between the sample mean and this hypothesized population mean. Shoemaker cited an actual study in which the mean body temperature for a sample of healthy adults differed significantly from the value that is widely believed to be the population norm (98.6). The authors of this study questioned whether this standard for normal body temperature, based on research done about 100 years ago, may be incorrect. We would want to see replications of this result before drawing conclusions. However, their results suggest that the population mean for normal body temperature might be somewhat lower than 98.6 degrees.

    Conventionally, α = .05 is the most popular choice, following standards suggested by Sir Ronald Fisher.

    When we say α = .05, this implies that our reject region(s) for values of t will consist of the most extreme 5% of the distribution of values of t; it also implies that, in theory, if all assumptions are met and all rules for NHST are followed, we have a 5% risk of rejecting H0 when H0 is true (a 5% risk of Type I error).

    H1 may be any of the following: H1: μhyp ≠ 98.6; H1: μhyp < 98.6; H1: μhyp > 98.6.

    Each of these corresponds to a different reject region in the distribution of t: H1: μhyp ≠ 98.6 is two-tailed (reject if t falls in the bottom 2.5% or the top 2.5%); H1: μhyp < 98.6 is one-tailed (reject if t falls in the bottom 5%); H1: μhyp > 98.6 is one-tailed (reject if t falls in the top 5%).
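    The one- and two-tailed cutoffs differ because the 5% reject region is either split between the tails or concentrated in one. A sketch using the large-sample (normal) approximation, since the Python stdlib has no t distribution:

```python
from statistics import NormalDist

# Large-sample approximation of the alpha = .05 cutoffs; with large df,
# z quantiles stand in for t quantiles.
z = NormalDist()
two_tailed_cut = z.inv_cdf(1 - 0.05 / 2)   # about 1.96: reject if |t| > cut
one_tailed_cut = z.inv_cdf(1 - 0.05)       # about 1.645: reject in one tail only
```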

    The distribution of t converges to the normal distribution as N increases. For a normal distribution, the reject regions for a two-tailed test with α = .05 are:

    Reject H0 if t < -1.96.

    Do not reject H0 if t is between -1.96 and +1.96.

    Reject H0 if t > +1.96.

    For this sample, mean body temperature was significantly different from the hypothesized population mean of 98.6 degrees, using α = .05, two-tailed, as the criterion for significance.

    Of course, this is not conclusive disproof that the population mean normal body temperature is 98.6. Shoemaker's data were simulated (to mimic the outcome of an actual study that found a similar significant difference between the sample mean body temperature and this historically traditional estimate of the population mean).

    If numerous studies based on representative samples consistently report that sample means are significantly lower than 98.6, however, we may want to revise the standard for average body temperature. Evidence suggests that the population mean may be closer to 98.2 than to 98.6 degrees.

    A possible additional discussion point: if all the temperature values were converted to a different scale, such as Centigrade, would this change any of our conclusions?

    Students should be reminded that:

    Either decision (reject Ho, do not reject Ho) is associated with some risk of error.

    When we reject Ho, that decision may be an instance of Type I error.

    When we do not reject Ho, that decision may be an instance of Type II error.

    Furthermore, the results of any single study may be invalid for many reasons, such as a nonrepresentative sample, errors in data collection or analysis, experimenter bias, and so forth.

    No single study should be regarded as conclusive. It is more useful to think of each study as one more piece of evidence in a growing body of evidence; our degree of belief in a hypothesis should be somewhat stronger when there is a large amount of high-quality research evidence consistent with the hypothesis.

    Note that if SPSS reports p = .000 to three decimal places, it would not be appropriate to report p = .000 as an exact p value. The p value represents a theoretical risk of committing a Type I error, and although this risk decreases as N increases, in theory this risk is never zero. It would be more appropriate to report p < .001 when SPSS shows sig = .000.

    The American Psychological Association Publication Manual now recommends reporting exact obtained p values. There is controversy about the use of significance test procedures.

    A more traditional and conservative view is that an obtained t ratio either does, or does not, meet the criterion for statistical significance. People who adhere to this view tend to report p < .05 or p > .05.

    A different view, which the APA has implicitly endorsed in its revised guidelines, is that p = .05 is not a cliff, and that it is useful to report exact obtained p values. It may be reasonable, in some exploratory research, to set alpha levels somewhat higher than the conventional .05. When a researcher reports, for example, p = .06, this information makes it possible for the reader to apply the traditional/conventional standards (and decide the outcome was not significant), or the reader may apply some other alpha level, such as .10 (and decide that this outcome is statistically significant).

    The American Psychological Association Publication Manual now calls for the inclusion of effect size information. Unlike a t ratio, an effect size index such as Cohen's d is not dependent on sample size.

    Effect size information can help researchers judge whether obtained differences are large enough to be of any clinical or practical use.

    Inclusion of effect size information in individual studies also facilitates meta-analysis. A meta-analysis is a quantitative review of past research; effect sizes or other quantitative outcomes are combined across many studies in order to evaluate overall effect size.

    APA guidelines now call for the inclusion of confidence intervals for all important empirical results.

    This implies that, assuming the M - μhyp difference is not exactly 0 and that s is nonzero:

    If you just make N small enough, you will obtain a value of t that is small enough to be judged not statistically significant.

    If you just make N large enough, you will obtain a value of t that is large enough to be judged statistically significant.
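    The second claim can be made concrete with the identity t = d * sqrt(N) (which follows from SEM = s / sqrt(N)): for any nonzero effect, some N pushes t past the cutoff. A sketch, with d = 0.05 chosen arbitrarily as a tiny effect and 1.96 as the large-sample two-tailed cutoff:

```python
from math import sqrt

def t_for(d, n):
    """t = d * sqrt(N) for a one-sample t test (since SE_M = s / sqrt(N))."""
    return d * sqrt(n)

# Keep growing N until the tiny effect crosses the two-tailed cutoff.
n = 1
while abs(t_for(0.05, n)) <= 1.96:
    n += 1
# n is now the smallest sample size at which d = 0.05 is judged
# "statistically significant" under this approximation.
```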