
Mayo ASA -1-

Statistical Skepticism: How to use significance tests effectively

Deborah Mayo

Virginia Tech

October 12, 2017

If you are to use statistical significance tests effectively you must be in a position to respond to expected critical challenges:

Why are you using a statistical test of significance rather than some other tool?

• It makes sense to reject unitarianism (the idea that a single tool suits every inferential job).

• What type of context would ever make statistical (significance) testing just the ticket?

Mayo ASA -2-

(1) Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?

Mayo ASA -3-

(1) Why use a tool that infers from a single (arbitrary) P-value that pertains to a statistical hypothesis H0 to a research claim H*?

RESPONSE: We don’t.

“[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, p. 14)

• This is the basis for falsification in science.

Mayo ASA -4-

• Were H0 a reasonable description of the process, then with very high probability you would not be able to regularly produce statistically significant results.

• So if you do, it’s evidence H0 is false in the particular manner probed (argument from coincidence).
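A minimal simulation sketch of this argument from coincidence (my illustration, not from the slides, assuming a one-sided z-test of H0: μ = 0 with σ = 1 known): a single significant result is easy to get by luck under H0, but regularly producing them is astronomically improbable.

```python
# Sketch only: one-sided z-test of H0: mu = 0 vs mu > 0, sigma = 1 known.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n, trials = 0.05, 30, 100_000

samples = rng.normal(0.0, 1.0, size=(trials, n))   # data generated under H0
z = samples.mean(axis=1) * np.sqrt(n)
p_values = norm.sf(z)

per_study = (p_values < alpha).mean()
print(per_study)          # ~0.05: one "lucky" significant result under H0 is easy
print(per_study ** 5)     # ~3e-7: five independent significant results under H0 is not
```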

Mayo ASA -5-

• Even a genuine statistical effect isn’t automatically evidence for a substantive H*.

• H* makes claims that haven’t been probed by the statistical test (cliché: correlation isn’t causation).

• Neyman-Pearson (N-P) tests restrict the inference to an alternative statistical claim; this was an improvement on Fisher.

Mayo ASA -6-

(2) Why use an incompatible hybrid (of Fisher and N-P)?

Mayo ASA -7-

(2) Why use an incompatible hybrid (of Fisher and N-P)?

RESPONSE: They fall under the umbrella of tools that use a method’s sampling distribution to assess and control its error-probing capacities:

• confidence intervals, N-P and Fisherian tests, resampling, randomization.

• there’s an inference, and it’s qualified by how well or poorly tested that inference is.

Incompatibilist positions keep to caricatures:

• Fisherians can’t use power (or sensitivity).

• N-P testers can’t use P-values; in fact, all used attained significance levels.

Mayo ASA -8-

“It is good practice to determine … the smallest significance level … at which the hypothesis would be rejected for the given observation. This number, the so-called P-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.” (Lehmann and Romano 2005, pp. 63-4)

An attained P-value is also an error probability: the P-value “is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as being just decisive” for issuing such a report (Cox and Hinkley 1974, p. 66).
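A quick numerical check of this reading (my sketch, assuming the same one-sided z-test as above): if we treat the observed result as just decisive, the probability of erroneously declaring evidence against a true H0 is exactly the attained P-value.

```python
# Sketch only: the attained P-value read as an error probability.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 50

z_obs = 2.0                    # hypothetical observed test statistic
p_obs = norm.sf(z_obs)         # attained (one-sided) P-value ~ 0.023

# How often would data generated under H0 be at least as extreme as what we observed?
z_sim = rng.normal(0.0, 1.0, size=(200_000, n)).mean(axis=1) * np.sqrt(n)
print(p_obs, (z_sim >= z_obs).mean())   # both ~0.023
```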

Mayo ASA -9-

• Some say: Fisher was evidential, Neyman cared only for long-run error control.

• Both tribes use error probabilities to measure performance and evidence.

• The two schools mathematically dovetail; it’s up to us to interpret them.

• Drop personality labels and the NHST acronym: “statistical tests” or “error statistical tests” will do.

Mayo ASA -10-

(3) Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions”, and violates the likelihood principle (LP)? You should condition on the data.

Mayo ASA -11-

(3) Why apply a method that uses error probabilities, the sampling distribution, researcher “intentions”, and violates the likelihood principle (LP)? You should condition on the data.

RESPONSE 1: If I condition on the actual data, this precludes error probabilities.

• I use them to assess how well or poorly tested claims are by the data at hand.

• What bothers you when cherry pickers selectively report?

• It isn’t a problem about long runs: you can’t say, about the case at hand, that it has done a good job of avoiding the sources of misinterpreting data.

Mayo ASA -12-

You need a principle of reasoning that makes sense of this. Informally:

I haven’t been given evidence for a genuine effect if the method makes it very easy to find some impressive-looking effect, even if spurious. (A small simulation of how easy this is appears below.)

It’s not enough that data x fit H; the method must have had a reasonable ability to uncover flaws, if present.

This is the minimal requirement for severe testing.
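A minimal sketch (my illustration, with hypothetical numbers) of how easy an impressive-looking but spurious effect is to find when you get many tries: examine 20 independent null outcomes per study and report only the smallest P-value.

```python
# Sketch only: 20 null outcomes per "study"; report just the best-looking one.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, k, trials = 30, 20, 10_000

z = rng.normal(0.0, 1.0, size=(trials, k, n)).mean(axis=2) * np.sqrt(n)
best_p = norm.sf(z).min(axis=1)       # the cherry-picked (smallest) P-value

print((best_p < 0.05).mean())         # ~0.64, far above the nominal 0.05
```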

Mayo ASA -13-

So you can readily agree with those who say:

“P values can only be computed once the sampling plan is fully known and specified in advance.” (Wagenmakers 2007, p. 784)

“In fact, Bayes factors can be used in the complete absence of a sampling plan… .” (Bayarri, Benjamin, Berger, Sellke 2016, p. 100)

RESPONSE 2: We’re in very different contexts. I’m in one that led to the advice of a “21 word solution”:

Mayo ASA -14-

“We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.” (Simmons, Nelson, and Simonsohn 2012, p. 4)

• I’m in the setting of a (skeptical) consumer of statistics.

• Have you given yourself lots of extra chances in the “forking paths” (Gelman and Loken 2014) between data and inference?

Mayo ASA -15-

• The value of preregistered reports (a key strategy to promote replication) is that your appraisal of the evidence is altered when you see the history (e.g., they had 6 primary endpoints, but the one reported is not among them).

• There’s a tension between the preregistration mindset and the position that it makes no sense for frequentists to adjust the P-value in the face of data dredging and other biasing selection effects, because such considerations have nothing to do with the evidence.

• There are many ways to correct P-values, appropriate for different contexts (Bonferroni, false discovery rates); a small sketch of two follows this list.

• The main thing is to have an alert that the reported P-values are invalid or questionable.
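A minimal sketch (my illustration, with made-up P-values) of the two corrections named above: the Bonferroni adjustment, which controls the family-wise error rate, and the Benjamini-Hochberg step-up procedure, which controls the false discovery rate.

```python
# Sketch only: two standard multiple-testing adjustments on made-up P-values.
import numpy as np

p = np.array([0.001, 0.008, 0.020, 0.041, 0.090, 0.300, 0.550])
m, alpha = len(p), 0.05

# Bonferroni: reject H_i when p_i <= alpha / m (controls family-wise error).
bonferroni_reject = p <= alpha / m

# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m)*alpha; reject the k smallest.
order = np.argsort(p)
below = p[order] <= alpha * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

print(bonferroni_reject)   # only the very smallest P-value survives
print(bh_reject)           # rejects more, controlling the FDR rather than the FWER
```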

Mayo ASA -16-

RESPONSE 3: I don’t want to relinquish my strongest criticism against researchers whose findings are the result of fishing expeditions.

• Wanting to lambast them, while promoting an account that downplays error probabilities, critics turn to other means (e.g., giving the claims low prior degrees of belief).

• That might work in some cases.

• But the researcher deserving criticism can deflect it by saying a Bayesian can always counter an effect this way.

• Post-data subgroups are often believable; that’s what makes them seductive.

• We need to be able to say “that’s a plausible claim, but this is a lousy test of it.”

Mayo ASA -17-

(4) Why use methods that overstate evidence against a null hypothesis?

Mayo ASA -18-

(4) Why use methods that overstate evidence against a null hypothesis?

RESPONSE: Whether P-values exaggerate “depends on one’s philosophy of statistics and the precise meaning given to the terms involved. … that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … Nonetheless, many other statisticians do not accept these quantities as gold standards, … Thus, from this frequentist perspective, P values do not overstate evidence … .” (Greenland, Senn, Rothman, Carlin, Poole, Goodman, Altman 2016, p. 342)

Mayo ASA -19-

RESPONSE (cont.)

• From the error statistical perspective, it’s the Bayesian or Bayes factor assessment that exaggerates.

• P-values can also agree with the posterior: in short, there is wide latitude for Bayesian assignments.

• In the cases of disagreement, many charge that the problem is not P-values but the high prior:

“concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible… .” (Casella and R. Berger 1987a, p. 111)

“The testing of a point null hypothesis is one of the most misused statistical procedures. …[Most of the time] there is a direction of interest in many experiments, and saddling an experimenter with a two-sided test would not be appropriate.” (ibid., p. 106)

Mayo ASA -20-

• You needn’t rule out an assessment from a different school, nor assume your significance-level assessments are debunked.

• They’re measuring different things.

• The danger is when terms are confused.

Mayo ASA -21-

(5) Why do you use a method that presupposes the underlying statistical model?

Mayo ASA -22-

(5) Why do you use a method that presupposes the underlying statistical model?

RESPONSE: Au contraire. I use (simple) significance tests because I want to test my statistical assumptions, and perhaps falsify them.

Here’s George Box, a Bayesian eclecticist:

“Some check is needed on [the fact that] some pattern or other can be seen in almost any set of data or facts. This is the object of diagnostic checks [which] require frequentist theory [of] significance tests… .” (Box 1983, p. 57)
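A minimal sketch (my illustration, not Box’s) of such a diagnostic check: a simple significance test aimed at the model’s own assumptions, here testing the normality of regression residuals, which can statistically falsify the assumed error distribution.

```python
# Sketch only: a significance test used as a diagnostic check on model assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = np.linspace(0, 10, 80)
y = 2.0 + 0.5 * x + rng.standard_t(df=2, size=80)   # heavy-tailed errors: normality is false

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

stat, p = stats.shapiro(residuals)   # Shapiro-Wilk test of the normality assumption
print(p)   # a small P-value indicates the assumed error distribution is inadequate
```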

Mayo ASA -23-

Andrew Gelman develops “‘falsificationist Bayesianism’, … with an interpretation of probability that can be seen as frequentist in a wide sense and with an error statistical approach to testing assumptions… .” (Gelman and Hennig 2017, p. 25)

There’s a Bayesian P-value research program that some turn to in checking the accordance of a model (no explicit alternative).

Mayo ASA -24-

(6) Why use a measure that doesn’t report effect sizes?

Mayo ASA -25-

(6) Why use a measure that doesn’t report effect sizes?

RESPONSE:

• Here, supplements to statistical tests are needed.

• We interpret statistically significant results, or “reject H0”, in terms of indications of a discrepancy γ from H0.

• Could also use confidence distributions (confidence intervals at several levels).
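A minimal sketch (my illustration, with hypothetical numbers) of reporting indicated discrepancies rather than a bare “reject”: one-sided lower confidence bounds for μ at several levels trace out a confidence distribution.

```python
# Sketch only: lower confidence bounds at several levels for a one-sided z-test.
import numpy as np
from scipy.stats import norm

xbar, sigma, n = 0.40, 1.0, 100        # hypothetical observed mean, sigma known
se = sigma / np.sqrt(n)

for level in (0.50, 0.80, 0.90, 0.95, 0.975):
    lower = xbar - norm.ppf(level) * se
    print(f"mu > {lower:.3f}  (confidence level {level:.3f})")
```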

Mayo ASA -26-

(7) Why do you use a method that doesn’t provide posterior probabilities?

Mayo ASA -27-

(7) Why do you use a method that doesn’t provide posterior probabilities?

Here you can answer a question with a question:

RESPONSE 1: Which notion of a posterior do you recommend? Note that posteriors aren’t provided by comparative accounts: Bayes factors, likelihood ratios, or model selections.

• The most prevalent Bayesian accounts are default/non-subjective (with data dominant in some sense).

• There is no agreement on suitable priors.

• Even the ordering of parameters will yield different priors.

Mayo ASA -28-

• Default priors are not expressing uncertainty or degree of belief; “…in most cases, they are not even proper probability distributions in that they often do not integrate [to] one.” (Bernardo 1997, pp. 159-160)

If priors are not probabilities, “what interpretation is justified for the posterior?” (Cox 2006, p. 77)

• Coherent updating goes by the board.

Mayo ASA -29-

RESPONSE 2 (for testers):

• Scientists don’t seek hypotheses that are highly probable (in the formal sense).

• They want methods that very probably would have unearthed discrepancies and improvements; the probability attaches to the method.

• A Bayesian posterior requires a catchall factor: all hypotheses that could explain the data, whereas I just want to split off a piece (or variant of a theory) to test.

• A silver lining to a difference in aims (well-probed hypotheses vs. highly probable ones): use different tools depending on context.

• Better than trying to reconcile very different goals.

Mayo ASA -30-

Upshot: I’m dealing with a problem where I need to scrutinize which inferences are warranted (a normative task). Methods must be

• able to block inferences that violate minimal severity,

• directly altered by biasing selection effects (e.g., cherry-picking),

• able to falsify claims statistically,

• able to test statistical model assumptions.

“A world beyond p < .05” suggests a world without error statistical testing; if you’re using such a test, you’ll need to respond to expected critical challenges.

Mayo ASA -31-

NOTE, SLIDE 15: Benjamini and Hochberg (1995) develop what’s called the false discovery rate (FDR): the expected proportion of the rejected hypotheses that are erroneously rejected. The same term is sometimes used in a different manner requiring prior probabilities. Westfall and Young (1993), one of the earliest treatments of P-value adjustments, develop resampling methods to adjust for both multiple testing and multiple modeling.

NOTE 1, SLIDE 16:

“Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is in … it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox: A frequentist analysis that yields strong evidence in support of the experimental hypothesis can be contradicted by a misguided Bayesian analysis that concludes that the same data are more likely under the null.” (Bem et al. 2011, p. 717)
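A minimal numerical sketch of this point (my illustration, with hypothetical numbers): for a result sitting two standard errors from a sharp null (two-sided p ≈ 0.05), the Bayes factor comes to favor H0 more and more strongly as the prior on the alternative is made more diffuse.

```python
# Sketch only: Lindley-Jeffreys behaviour for a sharp null vs a diffuse alternative.
import numpy as np
from scipy.stats import norm

n, sigma = 100, 1.0
se = sigma / np.sqrt(n)
xbar = 2.0 * se                      # observed mean two standard errors from 0

print(2 * norm.sf(xbar / se))        # two-sided P-value ~ 0.046

for tau in (0.1, 1.0, 10.0):         # prior sd of mu under the alternative
    m0 = norm.pdf(xbar, 0, se)                        # marginal likelihood under H0: mu = 0
    m1 = norm.pdf(xbar, 0, np.sqrt(tau**2 + se**2))   # marginal under H1: mu ~ N(0, tau^2)
    print(tau, m0 / m1)              # Bayes factor for H0 grows as the prior spreads out
```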

NOTE 2, SLIDE 16: They can be developed from either Fisherian or N-P perspectives. FEV: Frequentist Principle of Evidence (Mayo and Cox 2006/2010); SEV: severity (Mayo and Spanos 2006, 2011). For a one-sided test of a mean:

• FEV/SEV, an insignificant result: A moderate (non-small) P-value indicates the absence of a discrepancy δ from H0, just to the extent that it’s probable the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.

• FEV/SEV: a significant result is evidence of discrepancy δ from H0, if and only if, there is a high probability the test would have yielded a less significant result, were a discrepancy δ absent.
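A minimal sketch (my illustration, with hypothetical numbers) of how the severity assessment for a significant result might be computed for a one-sided z-test of H0: μ ≤ 0 with σ known: SEV(μ > μ1) is the probability the test would have produced a worse fit than the one observed, were μ only μ1.

```python
# Sketch only: severity curve for inferring mu > mu1 after a significant one-sided z-test.
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 100
se = sigma / np.sqrt(n)
xbar = 0.25                       # hypothetical observed mean (z = 2.5, p ~ 0.006)

def severity(mu1):
    """Probability of a worse fit (smaller observed mean) than xbar, were mu = mu1."""
    return norm.cdf((xbar - mu1) / se)

for mu1 in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"SEV(mu > {mu1:.2f}) = {severity(mu1):.3f}")
# High severity for small discrepancies (mu > 0), low for ones as large as mu > 0.25.
```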

NOTE, SLIDE 19: Casella and R. Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability of H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H0|x) … are due to a large extent to the large value of [the prior of .5 to H0] that was used.” The most common uses of a point null, asserting the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “J. Berger and Delampady admit…, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).

NOTE, SLIDE 28: Diagnostic Screening: In a very different paradigm, the prior stems from considering your research hypothesis as randomly selected from a population of null hypotheses to be tested. Where relevant, the posterior might give a long-run performance assessment (perhaps over all science, or journals), not an assessment of well-testedness.

Mayo ASA -32-

REFERENCES:

Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology, 72, 90-103.

Bem, D., Utts, J. and Johnson, W. (2011). “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality and Social Psychology, 101(4), 716-719.

Benjamini, Y. and Hochberg, Y. (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing”, Journal of the Royal Statistical Society, B, 57, 289-300.

Bernardo, J. (1997). “Non-informative Priors Do Not Exist: A Discussion”, Journal of Statistical Planning and Inference, 65, 159-89.

Box, G. (1983). “An Apology for Ecumenism in Statistics”, in Box, G., Leonard, T. and Wu, D. (eds.), Scientific Inference, Data Analysis, and Robustness, New York: Academic Press, 51-84.

Casella, G., and Berger, R. (1987a). “Reconciling Bayesian and Frequentist Evidence in the One-sided Testing Problem”, Journal of the American Statistical Association, 82(397), 106-11.

Casella, G., and Berger, R. (1987b). “Comment on Testing Precise Hypotheses by J. O. Berger and M. Delampady”, Statistical Science, 2(3), 344-347.

Cox, D. R. (2006). Principles of Statistical Inference, Cambridge: Cambridge University Press.

Cox, D. R. and Hinkley, D. (1974). Theoretical Statistics, London: Chapman and Hall.

Cox, D. R. and Mayo, D. (2010). “Objectivity and Conditionality in Frequentist Inference”, in Mayo, D. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press, 276-304.

Fisher, R. A. (1947). The Design of Experiments (4th ed.), Edinburgh: Oliver and Boyd.

Mayo ASA -33-

Gelman, A. and Hennig, C. (2017). “Beyond Subjective and Objective in Statistics” (with discussion), Journal of the Royal Statistical Society (in press).

Gelman, A. and Loken, E. (2014). “The Statistical Crisis in Science”, American Scientist, 102(6), 460-65.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., et al. (1989/1990). The Empire of Chance: How Probability Changed Science and Everyday Life, Cambridge: Cambridge University Press.

Goodman, S. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine, 130(12), 1005-13.

Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S., Altman, D. (2016). “Statistical Tests, P values, Confidence Intervals, and Power: A Guide to Misinterpretations”, European Journal of Epidemiology, 31(4), 337-350.

Lehmann, E. and Romano, J. (2005). Testing Statistical Hypotheses, 3rd ed., New York: Springer.

Mayo, D. (1996). Error and the Growth of Experimental Knowledge, Chicago: University of Chicago Press.

Mayo, D. (2018, forthcoming). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, New York/Cambridge: Cambridge University Press.

Mayo, D. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive Inference”, in Rojo, J. (ed.), Optimality: The Second Erich L. Lehmann Symposium, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), 49, 77-97. Reprinted in Mayo, D. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, Cambridge: Cambridge University Press, 247-275.

Mayo, D. and Spanos, A. (2006). “Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction”, British Journal for the Philosophy of Science, 57(2), 323-57.

Mayo ASA -34-

Mayo, D. and Spanos, A. (2011). “Error Statistics”, in Bandyopadhyay, P. and Forster, M. (eds.), Philosophy of Statistics, 7, Handbook of the Philosophy of Science, The Netherlands: Elsevier, 152-198.

Senn, S. (2002). “A Comment on Replication, P-values and Evidence, S. N. Goodman, Statistics in Medicine 1992; 11:875-879”, Statistics in Medicine, 21(16), 2437-2444.

Simmons, J., Nelson, L. and Simonsohn, U. (2012). “A 21 Word Solution”, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology, 26(2), 4-7.

Wagenmakers, E.-J. (2007). “A Practical Solution to the Pervasive Problems of P values”, Psychonomic Bulletin & Review, 14(5), 779-804.

Westfall, P. and Young, S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment, New York: Wiley.