Final APS Mayo / 1
Error Statistical Control: Forfeit at your Peril Deborah Mayo
• A central task of philosophers of science is to address conceptual, logical, and methodological discomforts of scientific practices—still we're rarely called in.
• Psychology has always been more self-conscious than most.
• To its credit, replication crises led to programs to restore credibility: fraud busting, reproducibility studies.
• There are proposed methodological reforms—many welcome, some of them quite radical.
• Without a better understanding of the problems, many reforms are likely to leave us worse off.
1. A Paradox for Significance Test Critics
Critic 1: It's much too easy to get a small P-value.
Critic 2: We find it very difficult to replicate the small P-values others found.
Is it easy or is it hard?
R.A. Fisher: it's easy to lie with statistics by selective reporting (he called it the "political principle").
Sufficient finagling—cherry-picking, P-hacking, significance seeking—may practically guarantee a researcher's preferred hypothesis gets support, even if it's unwarranted by evidence.
2. Bad Statistics
Severity Requirement: If data x0 agree with a hypothesis H, but the test procedure had little or no capability, i.e., little or no probability, of finding flaws with H (even if H is incorrect), then x0 provide poor evidence for H.
Such a test we would say fails a minimal requirement for a stringent or severe test.
• This seems utterly uncontroversial.
• Methods that scrutinize a test’s capabilities, according to their severity, I call error statistical.
• Existing error probabilities (confidence levels, significance levels) may, but need not, provide severity assessments.
• A new name is needed: "frequentist," "sampling theory," "Fisherian," "Neyman-Pearsonian" are all too associated with hard-line views.
3. Two main views of the role of probability in inference
Probabilism. To provide an assignment of degree of probability, confirmation, support or belief in a hypothesis, absolute or comparative, given data x0 (e.g., Bayesian, likelihoodist)—with due regard for inner coherency.
Performance. To ensure long-run reliability of methods: coverage probabilities, control of the relative frequency of erroneous inferences in a long-run series of trials (behavioristic Neyman-Pearson).
What happened to using probability to assess the error-probing capacity by the severity criterion?
• Neither “probabilism” nor “performance” directly captures it.
• Good long-run performance is a necessary, not a sufficient, condition for avoiding insevere tests.
• The problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, barn hunting, are not problems about long-runs.
• It's that we cannot say about the case at hand that it has done a good job of avoiding the sources of misinterpreting data.
• Probabilism says H is not warranted unless it’s true or probable (or increases probability, makes firmer–some use Bayes rule, but its use doesn’t make you Bayesian).
• Performance says H is not warranted unless it stems from a method with low long-run error.
• Error Statistics (Probativism) says H is not warranted unless something (a fair amount) has been done to probe ways we can be wrong about H.
• If you assume probabilism is required for inference, it follows error probabilities are relevant for inference only by misinterpretation. False!
• Error probabilities have a crucial role in appraising well-testedness, which is very different from appraising believability, plausibility, confirmation.
• It's crucial to be able to say: H is highly believable or plausible but poorly tested. [Both H and not-H may be poorly tested.]
• Probabilists can allow for the distinct task of severe testing
• It’s not that I’m keen to defend many common uses of significance tests; my work in philosophy of statistics has been to provide the long sought “evidential interpretation” (Birnbaum) of frequentist methods, to avoid classic fallacies.
• It's just that the criticisms are based on serious misunderstandings of the nature and role of these methods; consequently, so are many "reforms."
• Note: The severity construal blends testing and subsumes (improves) interval estimation, but I keep to testing talk to underscore the probative demand.
4. Biasing selection effects
One function of severity is to identify which selection effects are problematic (not all are).
Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed.
Picking up on these alterations is precisely what enables error statistics to be self-correcting—let me illustrate.
Capitalizing on Chance
We often see articles on fallacious significance levels: When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level…Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin, 1970, p. 104)
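Selvin's 64% figure can be checked directly. A minimal sketch (the setup is from the quote; the code itself is mine): compute the probability that at least one of twenty independent comparisons comes out "significant at the 5 percent level" when every null hypothesis is true.

```python
# Nominal vs. actual significance with 20 comparisons (Selvin's example).
# If each of 20 independent tests has a .05 chance of a false alarm,
# the chance that at least one fires is 1 - (1 - .05)^20.
nominal_level = 0.05
n_comparisons = 20

actual_level = 1 - (1 - nominal_level) ** n_comparisons
print(f"nominal: {nominal_level:.2f}, actual: {actual_level:.2f}")  # actual ≈ 0.64
```

Reporting only the one "significant" difference hides the other nineteen tries, which is exactly the blurring of computed and actual levels described below.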
• This is from a contributor to Morrison and Henkel’s Significance Test Controversy way back in 1970!
• They were clear on the fallacy: blurring the “computed” or “nominal” significance level, and the “actual” or “warranted” level.
• There are many more ways you can be wrong with hunting (different sample space).
• Nowadays, we’re likely to see the tests blamed for permitting such misuses (instead of the testers).
• Even worse are those statistical accounts where the abuse vanishes!
What defies scientific sense? On some views, biasing selection effects are irrelevant….
Stephen Goodman (epidemiologist): "Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of 'objectivity' that is often made for the P-value." (1999, p. 1010)
5. Likelihood Principle (LP)
The vanishing act takes us to the pivot point around which much debate in philosophy of statistics revolves: in probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:

P(x0; H1)/P(x0; H0)

Different forms: posterior probabilities, positive B-boost, Bayes factor. The data x0 is fixed, while the hypotheses vary.
Savage on the LP:
"According to Bayes's theorem, P(x|µ)…constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|µ) and P(y|µ) are proportional functions of µ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of µ…" (Savage 1962, p. 17)
All error probabilities violate the LP (even without selection effects): "Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space." (Lindley 1971, p. 436)
That is why properties of the sampling distribution of test statistic d(X) disappear for accounts that condition on the particular data x0.
Paradox of Optional Stopping: error-probing capabilities are altered not just by cherry picking and data dredging, but also via data-dependent stopping rules:
We have a random sample from a Normal distribution with mean µ and standard deviation σ, Xi ~ N(µ, σ²), 2-sided:

H0: µ = 0 vs. H1: µ ≠ 0.

Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule:
Keep sampling until H0 is rejected at the .05 level, i.e., keep sampling until |X̄| ≥ 1.96 σ/√n.
"Trying and trying again": having failed to rack up a 1.96 SD difference after, say, 10 trials, the researcher went on to 20, 30, and so on until finally obtaining a 1.96 SD difference.
Nominal vs. actual significance levels: with n fixed, the type 1 error probability is .05. With this stopping rule, the actual significance level differs from, and will be greater than, .05.
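The gap between nominal and actual levels under this stopping rule is easy to exhibit by simulation. A minimal sketch with assumed toy settings (σ = 1, checks beginning at n = 10, a cap of 200 observations, 2000 replications); none of these numbers come from the slides.

```python
import math
import random

random.seed(1)

def rejects_with_optional_stopping(max_n=200, min_n=10):
    """Sample from N(0, 1) (so H0 is true) and keep going until
    |Xbar| >= 1.96/sqrt(n), i.e. nominal .05 rejection, or the cap."""
    total, n = 0.0, 0
    while n < max_n:
        total += random.gauss(0, 1)
        n += 1
        if n >= min_n and abs(total / n) >= 1.96 / math.sqrt(n):
            return True  # "significant" at the nominal .05 level
    return False

trials = 2000
rate = sum(rejects_with_optional_stopping() for _ in range(trials)) / trials
print(f"rejection rate under a true null: {rate:.2f}")  # well above .05
```

With n fixed the rate would hover at .05; trying and trying again inflates it several-fold, which is exactly what the nominal level fails to report.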
Jimmy Savage (1959 forum) audaciously declared:
“optional stopping is no sin”
so the problem must be with significance levels (because they pick up on it).
On the other side: Peter Armitage, who had brought up the problem, also uses biblical language
“thou shalt be misled”
if thou dost not know the person tried and tried again. (72)
Where the Bayesians here claim:
"This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels" (in the sense of Neyman and Pearson) (Edwards, Lindman, and Savage 1963, p. 239).
The frequentists: while it may restore "simplicity and freedom," it does so at the cost of being unable to adequately control probabilities of misleading interpretations of data (Birnbaum).
6. Current Reforms are Probabilist
Probabilist reforms to replace tests (and CIs) with likelihood ratios, Bayes factors, HPD intervals, or to just lower the P-value (so that the maximal likely alternative gets a .95 posterior), while ignoring biasing selection effects, will fail.
The same p-hacked hypothesis can occur in Bayes factors; optional stopping can exclude true nulls from HPD intervals.
With one big difference: your direct basis for criticism and possible adjustments has just vanished.
To repeat: properties of the sampling distribution of d(X) disappear for accounts that condition on the particular data.
7. How might probabilists block intuitively unwarranted inferences? (Consider first the subjective.)
When we hear there's statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences), some probabilists claim: you see, if our beliefs were mixed into the interpretation of the evidence, we wouldn't be fooled.
We know these things are unbelievable, a subjective Bayesian might say. That could work in some cases (though it still wouldn’t show what researchers had done wrong)—battle of beliefs.
It wouldn't help with our most important problem (the two critics): how to distinguish the warrant for a single hypothesis H arrived at with different methods (e.g., one has biasing selection effects; another, registered results and precautions)?
Besides, as committees investigating questionable practices know, researchers really do sincerely believe their hypotheses. So now you've got two sources of flexibility: priors and biasing selection effects (which can no longer be criticized).
8. Conventional: Bayesian-Frequentist reconciliations?
The most popular probabilisms these days are non-subjective, default, reference:
• because of the difficulty of eliciting subjective priors,
• the reluctance of scientists to allow subjective beliefs to overshadow the information provided by data.
Default, or reference, priors are designed to prevent prior beliefs from influencing the posteriors.
• A classic conundrum: no general non-informative prior exists, so most are conventional.
“The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, p. 299)
• Prior probability: An undefined mathematical construct for obtaining posteriors (giving highest weight to data, or satisfying invariance, or matching frequentists, or….).
Leading conventional Bayesians (J. Berger) still tout their methods as free of concerns with selection effects and stopping rules (the stopping rule principle).
There are some Bayesians who don’t see themselves as fitting under either the subjective or conventional heading, and may even reject probabilism…..
Before concluding… I don't ignore fallacies of current methods.
9. How the severity analysis avoids classic fallacies
Fallacies of Rejection: Statistical vs. Substantive Significance
i. Take statistical significance as evidence of a substantive theory H* that explains the effect.
ii. Infer a discrepancy from the null beyond what the test warrants.
(i) Handled with severity: flaws in the substantive alternative H* have not been probed by the test, so the inference from a statistically significant result to H* fails to pass with severity.
Merely refuting the null hypothesis is too weak to corroborate substantive H*: "we have to have 'Popperian risk', 'severe test' [as in Mayo], or what philosopher Wesley Salmon called 'a highly improbable coincidence'" (Meehl and Waller 2002, p. 184).
• NHSTs (supposedly) allow moving from statistical to substantive significance; if so, they exist only as abuses of tests: they are not licensed by any legitimate test.
• Severity applies informally: much more attention to these quasi-formal statistical–substantive links: Do those proxy variables capture the intended treatments? Do the measurements reflect the theoretical phenomenon?
Fallacies of Rejection: (ii) Infer a discrepancy beyond what's warranted: severity sets up a discrepancy parameter γ (never just report a P-value).
A statistically significant effect may not warrant a meaningful effect—especially with n sufficiently large: the large-n problem.
• Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2).
What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn't go off unless the house is fully ablaze? The larger sample size is like the one that goes off with burnt toast.
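The fire-alarm point can be made numerically. A sketch with illustrative numbers (not from the slides), for a one-sided Normal test with σ = 1: assume a just-significant result, x̄ = 1.96/√n, and compute the severity with which the claim "µ > 0.2" passes at several sample sizes. Severity drops as n grows: the same α-significant result warrants less of a discrepancy.

```python
import math

# SEV(mu > gamma) after a just-significant result in test T+
# (H0: mu <= 0, sigma = 1): SEV = P(Xbar <= xbar; mu = gamma),
# with xbar = 1.96*sigma/sqrt(n). Toy values throughout.

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def severity(n, gamma, sigma=1.0, z_alpha=1.96):
    xbar = z_alpha * sigma / math.sqrt(n)   # just-significant sample mean
    return phi((xbar - gamma) * math.sqrt(n) / sigma)

for n in (25, 100, 400):
    print(f"n = {n:3d}: SEV(mu > 0.2) = {severity(n, gamma=0.2):.3f}")
```

At n = 25 the claim µ > 0.2 passes with high severity; at n = 400 it passes with very low severity, because at that sample size even a trivial discrepancy would routinely trigger the .05 "alarm."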
Fallacy of Non-Significant Results: Insensitive Tests
• Negative results do not warrant a 0 discrepancy from the null, but we can use severity to rule out discrepancies that, with high probability, would have resulted in a larger difference than observed.
• Akin to power analysis (Cohen), but sensitive to x0.
• We sometimes hear that negative results are uninformative: not so. There's no point in running replication research if your account views negative results as uninformative.
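A numerical sketch of using severity on a negative result, with assumed toy values (σ = 1, n = 100, observed mean x̄ = 0.1, which is not significant at .05): SEV(µ < µ1) = P(X̄ > x̄; µ = µ1), so discrepancies well beyond the observed mean are ruled out with high severity.

```python
import math

# Severity with which "mu < mu1" passes given a nonsignificant result.
# Toy values: sigma = 1, n = 100, observed mean xbar = 0.1.
# SEV(mu < mu1) = P(Xbar > xbar; mu = mu1) = Phi((mu1 - xbar)*sqrt(n)/sigma).

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sev_upper(mu1, xbar=0.1, sigma=1.0, n=100):
    return phi((mu1 - xbar) * math.sqrt(n) / sigma)

for mu1 in (0.15, 0.2, 0.3):
    print(f"SEV(mu < {mu1}) = {sev_upper(mu1):.3f}")
```

Here µ < 0.3 passes with severity near .98: had the discrepancy been as large as 0.3, a bigger observed difference would almost surely have occurred, so the negative result informatively rules it out.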
Confidence Intervals also require supplementing.
There's a duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level (but that doesn't make them well-warranted).
• Still too dichotomous: in/out, plausible/not plausible (permits fallacies of rejection/non-rejection)
• Justified in terms of long-run coverage
• All members of the CI treated on a par
• Fixed confidence levels (SEV needs several benchmarks)
• Estimation is important, but we need tests for distinguishing real and spurious effects, and for checking assumptions of statistical models
10. Error Statistical Control: Forfeit at your Peril
• The role of error probabilities in inference is not long-run error control, but to severely probe flaws in your inference today.
• It’s not a high posterior probability in H that’s wanted (however construed) but a high probability our procedure would have unearthed flaws in H.
• Many reforms are based on assuming a philosophy of probabilism
• The danger is that some reforms may enable rather than directly reveal illicit inferences due to biasing selection effects.
Mayo and Cox (2010): Frequentist Principle of Evidence (FEV); SEV: Mayo and Spanos (2006)
• FEV/SEV (insignificant result): a moderate P-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., d(X) > d(x0)) were a discrepancy δ to exist.
• FEV/SEV (significant result): d(X) > d(x0) is evidence of a discrepancy δ from H0 if, and only if, there is a high probability the test would have d(X) < d(x0) were a discrepancy as large as δ absent.
Test T+: Normal testing: H0: µ ≤ µ0 vs. H1: µ > µ0, σ known.
(FEV/SEV): If d(x0) is not statistically significant, then µ < M0 + kε σ/√n passes test T+ with severity (1 − ε).
(FEV/SEV): If d(x0) is statistically significant, then µ > M0 + kε σ/√n passes test T+ with severity (1 − ε),
where P(d(X) > kε) = ε.
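These benchmarks can be computed for a few choices of ε. A sketch under assumed toy values (M0 = 0, σ = 1, n = 100, none from the slides); the quantile kε is found by bisection so only the standard library is needed.

```python
import math

# Severity benchmarks for test T+ after a nonsignificant result:
# mu < M0 + k_eps*sigma/sqrt(n) passes with severity (1 - eps),
# where P(Z > k_eps) = eps. Toy values: M0 = 0, sigma = 1, n = 100.

def k(eps):
    """Upper-eps quantile of the standard normal, by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        tail = 0.5 * (1 - math.erf(mid / math.sqrt(2)))  # P(Z > mid)
        lo, hi = (mid, hi) if tail > eps else (lo, mid)
    return (lo + hi) / 2

M0, sigma, n = 0.0, 1.0, 100
for eps in (0.5, 0.16, 0.025):
    bound = M0 + k(eps) * sigma / math.sqrt(n)
    print(f"eps = {eps}: mu < {bound:.3f} passes with severity {1 - eps:.3f}")
```

Reporting several such benchmarks, rather than a single fixed level, is what distinguishes a severity assessment from a lone P-value or confidence interval.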
REFERENCES:
Armitage, P. 1962. "Contribution to Discussion." In The Foundations of Statistical Inference: A Discussion, edited by L. J. Savage. London: Methuen.
Barnard, G. A. 1972. "The Logic of Statistical Inference (Review of 'The Logic of Statistical Inference' by Ian Hacking)." British Journal for the Philosophy of Science 23 (2): 123–132.
Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the Editor).” Nature 225 (5237) (March 14): 1033.
Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press.
Edwards, W., H. Lindman, and L. J. Savage. 1963. “Bayesian Statistical Inference for Psychological Research.” Psychological Review 70 (3): 193–242.
Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” Journal of the Royal Statistical Society, Series B (Methodological) 17 (1) (January 1): 69–78.
Goodman, S. N. 1999. "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130: 1005–1013.
Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto: Holt, Rinehart and Winston.
Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press.
Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357.
Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcom R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier.
Meehl, Paul E., and Niels G. Waller. 2002. “The Path Analysis Controversy: A New Statistical Approach to Strong Appraisal of Verisimilitude.” Psychological Methods 7 (3): 283–300.
Morrison, Denton E., and Ramon E. Henkel, ed. 1970. The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
Pearson, E. S., and J. Neyman. 1930. "On the Problem of Two Samples." In Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. First published in Bul. Acad. Pol. Sci., 73–96.
Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion. London: Methuen.
Selvin, H. 1970. "A Critique of Tests of Significance in Survey Research." In The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter.