PSYCH 710 Review of Statistical Inference Week 1 Prof. Patrick Bennett



Role of statistics in research

• Statistical methods help us to collect, organize, summarize, analyze, interpret, & present data

• “The role of statistics is not to discover truth. The role of statistics is to resolve disagreements between people.” - Milton Friedman

• “...the purpose of statistics is to organize a useful argument from quantitative evidence, using a form of principled rhetoric.” - Robert P. Abelson

Ways of Collecting Data

• Designed Experiments -

- effects of independent variables on dependent variables

- random assignment of “subjects” to conditions

• Correlational Studies -

- associations among predictor & criterion variables

- “subjects” come with their own set of variables

• Both types can be combined into a single study/analysis: ANCOVA

Independent Variables

• Most Psychology studies collect data in designed experiments

• Experiments usually involve collection of data in various experimental conditions

- conditions differ in terms of 1 or more independent variables

- e.g., conditions in memory experiment defined by:

‣ type of items (faces vs words) being studied

‣ time interval between study & test phases

Dependent Variable

• The variable(s) that are measured and that constitute the data, or results, to be analyzed.

• Designed experiments measure the effects of independent variables on dependent variables

Examples of dependent variables

[Fig. 5 from Andrews et al. (2015): total (intracellular + extracellular) serotonin content in several brain regions declines with chronic citalopram treatment; reprinted with permission from Bosker et al. (2010).]

Serotonin levels in cortical tissue. Figure 5 in Andrews et al, 2015, Neurosci & Biobehav Rev

[Page excerpt and Figs. 2-4 from Hussain et al. (2009), Vision Research: the Day x Bin interaction was significant, F(7,182) = 6.97, p < .0001, f = .31, with more within-session learning on Day 1 than on Day 2. The figures plot proportion correct in the texture- and face-identification tasks as a function of trial number (Day 1) and bin (105 trials per bin) for groups receiving 0, 1, 5, 10, 20, or 40 trials per condition; error bars are ±1 SEM.]

Accuracy in a face identification task. Hussain et al, 2009, Vis Research.

[Page excerpt and Figure 1 from Baxter & Dukas (2017): to separate the effects of male age and mating experience on aggression in fruit flies, males aged 1, 4, and 7 days were tested. Panels show mean (+ SE) aggression frequency (s/min) per arena in the three age treatments, and the cumulative proportions of males that force-copulated with teneral females and that mated with recently mated females.]

Aggression frequency in male fruit flies. Baxter & Dukas, 2017, Animal Behaviour

Examples of dependent variables include reaction time, response accuracy, number of items recalled in memory test, event-related brain potentials, heart rate, number of offspring, number of aggressive or affiliative behaviours, etc.

Hypothetical Recognition Memory Experiment

                          Retention Interval
  Study Items      short (1 minute)        long (1 hour)
  faces            # of items correctly    # of items correctly
                   recognized during test  recognized during test
  words            # of items correctly    # of items correctly
                   recognized during test  recognized during test

2 independent variables, with 2 levels on each independent variable

1 dependent variable measured in 4 experimental conditions defined by combinations of independent variables
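The 2x2 design above can be written down in R (the language used later in these notes); the variable and level names here are illustrative, not part of the original slides:

```r
# Each row of the data frame is one experimental condition, defined by
# a combination of the two independent variables.
design <- expand.grid(study_items = c("faces", "words"),
                      retention   = c("short", "long"))
# The same dependent variable is measured in every condition.
design$dv <- "# items correctly recognized"
nrow(design)  # 4 conditions: 2 levels x 2 levels
```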

Correlational Studies

• Correlational studies measure the association between variables

• Usually, the variables are not manipulated by investigator

- each event/subject comes with own set of variables

- but values on variables differ across events/subjects

• Regression measures association between predictor & criterion variables

- e.g., measure the association between annual income (criterion variable) and parent’s income, years of education, race, gender (predictor variables)
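As a sketch of how such a regression might look in R, with simulated data and illustrative variable names (the coefficients and sample size are assumptions for the example):

```r
# Simulate a criterion variable (income) that depends on two
# predictor variables; then estimate the associations with lm().
set.seed(1)
n <- 100
education     <- rnorm(n, mean = 14, sd = 2)    # predictor
parent_income <- rnorm(n, mean = 60, sd = 15)   # predictor
income <- 5 + 2 * education + 0.3 * parent_income + rnorm(n, sd = 5)
fit <- lm(income ~ education + parent_income)
coef(fit)  # intercept plus one slope per predictor
```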

Random Assignment

• In Psychology, designed experiments use subjects that also come with their own set of intrinsic characteristics

• These characteristics (personality, motivation, intelligence, etc.) probably affect dependent variable

• HOWEVER, in most experiments, subjects are randomly assigned to experimental conditions

• So, effects of subject differences should be UNRELATED TO EFFECTS OF INDEPENDENT VARIABLES

- big advantage of designed experiments over correlational studies
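A minimal sketch of random assignment in R (the number of subjects, number of conditions, and group labels are assumptions for illustration):

```r
# Randomly assign 20 subjects to 4 conditions, 5 per condition.
# Randomization makes intrinsic subject characteristics unrelated,
# on average, to the independent variables.
set.seed(2)
conditions <- rep(c("A", "B", "C", "D"), each = 5)
assignment <- sample(conditions)  # random permutation across subjects
table(assignment)                 # 5 subjects in each condition
```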

Modes of Statistical Analyses

• Descriptive vs. Inferential

• Exploratory vs. Confirmatory

Descriptive Statistics

• Descriptive statistics:

- describes important characteristics of the sample

- uses graphs & statistics (e.g., mean or standard deviation)
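A minimal sketch in R (the scores are made up for illustration):

```r
# Descriptive statistics summarize important characteristics of a sample.
scores <- c(96, 101, 99, 88, 103, 95, 94, 100)
mean(scores)    # sample mean: 97
sd(scores)      # sample standard deviation: approximately 4.78
# hist(scores)  # a graph is another common descriptive summary
```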

[Page excerpt and Figs. 2-4 from Hussain et al. (2009), Vision Research, shown again here as an example of descriptive statistics: proportion correct in the texture- and face-identification tasks plotted as a function of trial number and bin for the 0-, 1-, 5-, 10-, 20-, and 40-trials groups; error bars are ±1 SEM.]

Mean face identification accuracy in 5 groups plotted as a function of the number of trials. Hussain et al, 2009, Vis Research.

Perceptual Learning

[Page excerpt and Fig. 6 from Gaspar et al. (2008), Vision Research: face identification thresholds were measured for upright, inverted, contrast-reversed, and combined inverted-plus-contrast-reversed faces in no-noise and high-noise conditions. Relative to normal faces, observers needed approximately 50% more contrast to identify inverted faces, 63% more for contrast-reversed faces, and 100% more in the combined condition; the inversion and contrast-reversal effects did not differ across noise levels. Error bars are ±1 SEM.]

Mean face identification thresholds in six experimental conditions. Error bars are ±1 SEM. Gaspar et al, 2008, Vis Research.

Face Perception

Inferential Statistics

• Inferential statistics:

- uses sample to make claims about a population

- e.g., estimate population parameters from sample statistics

- e.g., investigate differences among populations by examining differences among groups/samples [effect size]

Exploratory vs Confirmatory Analyses

• Exploratory Data Analysis (Descriptive)

- first major proponent was John Tukey

- goal: discover & summarize interesting aspects of data

- explore data to discover interesting hypotheses to test

• Confirmatory Data Analysis (Inferential)

- data are gathered and analyzed to evaluate specific a priori hypotheses

- example: clinical drug trials

• Important not to confuse the two types of analyses

- will discuss this issue in the context of hypothesis testing (chapter 8)

Inference: Samples to Populations

• Population: all events (subjects, scores, etc.) of interest

• Sample: subset of population

- random sample: each member of population has equal chance of being selected

- convenience sample (e.g., psychology undergraduates)

• Inference depends on quality of relation between sample & population. Is sample representative of population?

Bennett, PJ PSYCH 710 Hypothesis Testing

sample: described by sample statistics, the mean and variance (Ȳ, s²)

population: described by population parameters, the mean and variance (µ, σ²)

sampling distributions

Central Limit Theorem

[Figure 1: The theoretical sampling distribution of the mean (probability density plotted as a function of score, 90 to 110).]

be Normal regardless of the shape of the population distribution as long as sample size is sufficiently large.¹ We will assume, therefore, that the sample mean is a random variable that follows a Normal distribution.

Finally, we are in a position to determine if our observation (Ȳ = 93) is unusual (assuming the null hypothesis is true and in the absence of additional information). According to the null hypothesis, sample means will be distributed Normally with a mean of 100 and a standard deviation of 2.236. This distribution (actually, the probability density function) is shown in Figure 1. Inspection of Figure 1 indicates that most of the scores drawn from this sampling distribution should be between 95 and 105. In other words, the probability of getting a score that is less than 95, or greater than 105, should be low. In fact, the probability of randomly selecting a score that is less than 95 is equal to the area under the part of the curve in Figure 1 that lies to the left of 95 (i.e., the lower tail of the function; see Figure 2). Similarly, the probability of selecting a score that is greater than 105 is equal to the area under the curve that is to the right of 105 (i.e., the upper tail of the function). The joint probability, i.e., the probability of selecting a score that is less than 95 or greater than 105, is equal to the sum of the two individual probabilities. In R,² the probability of selecting a score that is less than 95 or greater than 105 is calculated using the following commands:

> pnorm(95, mean=100, sd=2.236, lower.tail=TRUE)
[1] 0.01267143   # prob of getting a score <= 95
> pnorm(105, mean=100, sd=2.236, lower.tail=FALSE)
[1] 0.01267143   # prob of getting a score >= 105 (equal, by symmetry)
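Building on the pnorm() commands above, the two tail probabilities can be summed to obtain the joint probability described in the text:

```r
# Probability of a sample mean at least as far from 100 as the one
# observed, under the null hypothesis (mean = 100, SE = 2.236).
lower <- pnorm(95,  mean = 100, sd = 2.236, lower.tail = TRUE)
upper <- pnorm(105, mean = 100, sd = 2.236, lower.tail = FALSE)
lower + upper            # approximately 0.0253
# The Normal distribution is symmetric, so the two tails are equal:
all.equal(lower, upper)
```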

¹ The key phrase here is “sufficiently large”. For some types of population distributions, the sample size must be very large for the sampling distribution of the mean to follow a normal distribution. In other words, although the Central Limit Theorem is true, in some cases the sampling distribution of the mean can deviate significantly from normality even with sample sizes that are large relative to those typically used in psychology experiments (Wilcox, 2002).

² R is a software environment for statistical computing that is free and runs on several common operating systems. It can be downloaded at http://cran.r-project.org.


Review of Hypothesis Testing

PSYCH 710

September 11, 2017

0 Testing means

0.1 Known population mean and variance

Consider the following scenario: The population of a small town was unknowingly exposed to an environmental toxin over the course of several years. There is a possibility that exposure to the toxin during pregnancy adversely affects cognitive development, which eventually leads to lower verbal IQ. To determine if this has happened to the children of this town, you use a standardized test to measure verbal intelligence in a random sample of 20 children who were exposed to the toxin. Based on extensive testing with typical children, the mean and standard deviation of the scores in the population of typical children are 100 and 10, respectively. The mean score of your sample was 93. Given this information, should we conclude that our sample of twenty scores was drawn from the population of typical scores?

We will answer this question in what appears to be a roundabout way. We will start by assuming that there was no effect of the toxin, and therefore that our sample of scores was drawn from the population of typical scores. Hence, our null hypothesis is that the data were drawn from a population with a mean of 100 (µ = 100) and a standard deviation of 10 (σ = 10). Next, we have to evaluate whether our observation (i.e., the sample mean is 93) is unusual, or unlikely, given the assumption that the null hypothesis is true. If the observation is unlikely, then we reject the null hypothesis (µ = 100) in favor of the alternative hypothesis that µ ≠ 100. In your textbook, the null and alternative hypotheses often are displayed thusly:

H0 : µ = 100

H1 : µ ≠ 100

How can we determine if our observation is unlikely under the assumption that the null hypothesis is true? Recall that the mean (µ_Ȳ) and standard deviation (σ_Ȳ) of the distribution of means – otherwise known as the sampling distribution of the mean – are related to the mean (µ) and standard deviation (σ) of the individual scores in the population by the equations

µ_Ȳ = µ (1)

σ_Ȳ = σ/√n (2)

where n is sample size. Therefore, for the current example

µ_Ȳ = 100 (3)

σ_Ȳ = 10/√20 = 2.236 (4)
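The standard error in equation (4) is simple arithmetic; as a quick check, here is the computation in Python (the notes themselves use R, so this is just an illustrative sketch):

```python
import math

# population mean, population SD, and sample size from the example
mu, sigma, n = 100, 10, 20

mu_ybar = mu                    # mean of the sampling distribution of the mean
sd_ybar = sigma / math.sqrt(n)  # standard error of the mean

print(mu_ybar, round(sd_ybar, 3))  # 100 2.236
```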

Next, we note that if verbal IQ scores are distributed Normally, then the distribution of sample means will also be Normal. Moreover, the Central Limit Theorem implies that the distribution of sample means will be approximately Normal even when the individual scores are not, provided that the sample size is sufficiently large.


[Figure: demonstration that sample means tend toward normality. Panels show a uniform density of scores and, for sample sizes of 2, 4, and 8, histograms of the sample means ("should be symmetrical") together with normal QQ plots ("should be a straight line").]

Density Functions & Probability

Example: Normal Distribution

[Figure: a Normal density with a criterion C marked on the x axis; the areas to the left and right of C are labelled P(x ≤ C) and P(x ≥ C).]

For continuous distributions, the probability of selecting a value of x that is less than or greater than some criterion, C, equals the area beneath the curve and to the left or right of C, respectively.

Figure 2: The probability of randomly selecting a value of x that is ≤ C – i.e., P(x ≤ C) – corresponds to the area under the probability density function that is to the left of C. P(x ≥ C) equals the area under the curve that is to the right of C.

> pnorm(105,mean=100,sd=2.236,lower.tail=FALSE)

[1] 0.01267143 # prob of getting a score >= 105

> 0.01267143+0.01267143

[1] 0.02534286 # prob of getting a score <= 95 OR >= 105

In other words, the probability of selecting a score that is less than 95 or greater than 105 is only 0.025. Because the total probability is 1, the probability of selecting a score that is between 95 and 105 is 1 − 0.025 = 0.975. Given their relatively low probability, it is reasonable to assert that scores that fall below 95 or above 105 are unusual. By this criterion, our observed mean score of 93 is unusual, and we therefore reject the null hypothesis that µ = 100 in favor of the alternative hypothesis µ ≠ 100.
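These tail probabilities can be reproduced without R; a minimal Python sketch using the standard library's NormalDist (an illustration, not the notes' own code):

```python
from statistics import NormalDist

# sampling distribution of the mean under H0: mu = 100, SE = 2.236
dist = NormalDist(mu=100, sigma=2.236)

p_low = dist.cdf(95)        # P(mean <= 95), about 0.0127
p_high = 1 - dist.cdf(105)  # P(mean >= 105), about 0.0127
p_tails = p_low + p_high    # total probability in the two tails

print(round(p_tails, 3))  # 0.025
```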

In our example, we considered any score falling below 95 or above 105 to be unusual. It is important to note, however, that getting an unusual score does not necessarily mean that the null hypothesis is false. After all, unusual scores are possible even when the null hypothesis is true. In fact, we expect to obtain an unusual score with a probability of .025 when the null hypothesis is true. Hence, it is possible that our decision to reject the null hypothesis is incorrect. This type of error — rejecting the null hypothesis when it is true — is called a Type I error. The probability of making this error is determined by the criteria we use to define a score as unusual. In this case, we used criteria (i.e., below 95 or above 105) which would lead to a Type I error 2.5% of the time. The probability of making a Type I error is referred to as α (i.e., alpha), and so we would say that the Type I error rate, or α, is .025 for this statistical test.

It is standard practice in Psychology to set α to either 0.05 or 0.01. If we set α = .05, then our decision criteria would be 95.6 and 104.4:

> qnorm(.025,mean=100,sd=2.236,lower.tail=TRUE) # qnorm... not pnorm

[1] 95.61752 # the probability of getting a score <= 95.6 is 0.025...

> pnorm(95.61752,mean=100,sd=2.236,lower.tail=TRUE)

[1] 0.02499999

> qnorm(.025,mean=100,sd=2.236,lower.tail=FALSE)

[1] 104.3825 # the probability of getting a score >= 104.38 is 0.025...
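The same decision criteria can be computed with the inverse normal CDF in Python's standard library; a sketch mirroring the qnorm calls above:

```python
from statistics import NormalDist

# sampling distribution of the mean under H0
dist = NormalDist(mu=100, sigma=2.236)

lo = dist.inv_cdf(0.025)  # lower decision criterion (qnorm lower tail)
hi = dist.inv_cdf(0.975)  # upper decision criterion (qnorm upper tail)

print(round(lo, 2), round(hi, 2))  # 95.62 104.38
```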



z tests

• used to make inference about population mean when population variance is known

• when scores are distributed normally (Central Limit Theorem)

- z is distributed normally with mean=0 and std dev=1

- 95% of values fall between ±1.96

- 99% of values fall between ±2.58


> pnorm(104.3825,mean=100,sd=2.236,lower.tail=FALSE)

[1] 0.02499946

> 0.02499999 + 0.02499946 # the total probability of getting a score <=95.6 OR >=104.38

[1] 0.04999945

Now, any score that is less than 95.6 or greater than 104.4 leads to the rejection of the null hypothesis. Notice that the range of acceptable scores — which do not cause us to reject the null hypothesis — is smaller than before. In other words, we are more likely to reject the null hypothesis even when it is true. This change in the Type I error rate makes sense because we increased α from .025 to .05. If we set α = .01, then our decision criteria would be 94.2 and 105.8, and any score that is outside that range leads to the rejection of the null hypothesis. Now the Type I error rate, .01, is lower than before.

0.2 standardized scores & z tests

In the previous example, I used a computer to calculate the decision criteria for α = .05 and α = .01. Before computers were readily available — yes, there was such a time — people looked up the decision values in published tables. It would be impossible to publish tables for every possible case, and therefore people used a table of standard normal deviates or z scores. This section shows how to use such a table to conduct a z test.

Any value, Y, can be converted to a standard score using the formula

z = (Y − µ)/σ (5)

Notice that a z score equals the number of standard deviations that Y is from µ. When Y is drawn from a normal distribution, then z is distributed as a Normal variable with µ = 0 and σ = 1.

We can convert our observed mean score from the previous example into a z score – z = (93 − 100)/2.236 = −3.13 – which implies that the observed mean is 3.13 standard deviations below the expected value of the mean. Now we want to know if our observed z score is unusual, given the assumption that the null hypothesis is true. If the null hypothesis is true, then z will be between ±1.96 95% of the time and between ±2.58 99% of the time. Therefore, using the criteria of ±1.96 to reject the null hypothesis will yield a Type I error rate of 0.05, whereas the criteria of ±2.58 correspond to a Type I error rate of 0.01. Our observed z score falls outside of both sets of criteria, and so the null hypothesis is rejected. It used to be standard practice to indicate which α level was used by writing that the null hypothesis was rejected (p < .05) or (p < .01). Nowadays, scientists are encouraged to publish the exact p value for their observed statistic. In our case, the probability of drawing a z that is outside the range of ±3.13 is 0.00175, and so we would report the statistical result by writing “the sample mean differed significantly from 100 (z = −3.13, p = 0.00175) and so the null hypothesis µ = 100 was rejected”.
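The z score and its exact two-tailed p value can be verified numerically; a Python sketch of the calculation described above (the notes use R):

```python
import math
from statistics import NormalDist

ybar, mu, sigma, n = 93, 100, 10, 20
se = sigma / math.sqrt(n)   # standard error, 2.236
z = (ybar - mu) / se        # standardized sample mean

# two-tailed p value: probability of a z at least this extreme
p = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2))  # -3.13
print(round(p, 5))
```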

0.3 t tests

In the previous example, we knew that σ = 10. But in most cases we do not know σ, and therefore it must be estimated from the data:

σ̂ = s = √( Σᵢ(Yᵢ − Ȳ)² / (n − 1) )

The estimate, s, can be used to calculate t, which is similar to a z score:

t = (Ȳ − µ)/(s/√n) (6)
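Equation (6) can be illustrated with a worked computation. The scores below are hypothetical, invented purely to show the mechanics of estimating s and forming t in Python:

```python
import math
from statistics import mean, stdev

# hypothetical sample of verbal IQ scores (illustration only)
scores = [88, 95, 101, 79, 90, 104, 86, 97, 93, 92]

mu0 = 100                 # mean under the null hypothesis
s = stdev(scores)         # sample SD: the estimate of sigma (n - 1 denominator)
n = len(scores)
t = (mean(scores) - mu0) / (s / math.sqrt(n))

print(round(t, 3))
```

Note that `stdev` uses the n − 1 denominator, matching the formula for s above.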


z test example

• sample mean = 93 (n=20)

• population standard deviation = 10

• is sample drawn from a population with a mean = 100?

• calculate z and compare to “critical values” of ±1.96 or ±2.58


Parameters of Normal Sampling Distribution

Assume Means are Normally Distributed



Calculate z:

z = (93 − 100)/2.236 = −3.13

Our sample mean is 3.13 standard deviations below the expected mean.

t tests

[Figure: histograms of the sampling distributions of the sample variance and the sample mean for samples of size 5 and 50 drawn from a normal distribution with µ = 0 and σ = 1, with the median of each distribution marked.]

Figure 3: Sampling distributions of the variance and the mean for sample sizes of 5 (top) and 50 (bottom). The histograms show the variance and mean of 5,000 samples drawn from a normal distribution with µ = 0 and σ = 1. The sampling distribution of the variance, but not the mean, is positively skewed, particularly for small samples. The positive skew implies that the median of the variance sampling distribution is less than 1. Hence, although the mean of the sample variance is 1, and therefore the sample variance (s²) is an unbiased estimate of the population variance (σ² = 1), s² is less than σ² in more than 50% of the samples.
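The claim in the caption, that s² is unbiased yet falls below σ² in more than half of the samples, can be checked with a small simulation. A minimal Python sketch (the notes use R; this is an independent illustration):

```python
import random
from statistics import mean, median, variance

random.seed(1)  # fixed seed so the demonstration is reproducible

# 5000 samples of size 5 from a standard normal distribution,
# keeping the sample variance (n - 1 denominator) of each sample
sample_vars = [variance([random.gauss(0, 1) for _ in range(5)])
               for _ in range(5000)]

print(round(mean(sample_vars), 2))  # close to 1: s^2 is unbiased
print(median(sample_vars) < 1)      # True: the distribution is skewed
```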



estimate of population standard deviation


t statistic

Although t looks similar to z, t is NOT distributed normally because the sampling distribution of the variance (& std dev) is skewed, especially for small sample sizes


t distribution(s)

[Figure: probability density functions of the t distribution for df = 2, 10, and 30, plotted as a function of t.]

Figure 4: Probability density functions for the t distribution with 2, 10, and 30 degrees-of-freedom.



t statistic

• t distribution is a family of distributions.

• Low degrees-of-freedom: t has heavier tails than normal distribution

• t & normal distributions increasingly similar as df increases


s² is an unbiased estimator of σ², which means that the expected value of s² is σ². However, the sampling distribution of s² is positively skewed, and therefore s² < σ² more than 50% of the time. Consequently, t is not distributed as a standard normal deviate: in fact, extreme values of t (i.e., much less than or much greater than zero) will occur more frequently than would be expected if t followed a normal distribution.

Think about the effect this will have on our decision to reject or not reject the null hypothesis: extreme values of t are more likely than we would predict if we used the methods described in the previous section. Therefore, the inflation of t, if left uncorrected, would make it more likely that we would mistakenly conclude that our sample mean is unusual. What we need is a more accurate description of the distribution of t.
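This inflation of the tails is easy to see in R by comparing tail probabilities of the normal and t distributions (a sketch; the cutoff of 2.5 and the degrees of freedom are arbitrary choices for illustration):

```r
# P(T > 2.5) for a t distribution with few degrees of freedom is larger
# than P(Z > 2.5) for the standard normal distribution
pnorm(2.5, lower.tail = FALSE)        # normal upper tail
pt(2.5, df = 5, lower.tail = FALSE)   # t upper tail, df = 5: heavier
pt(2.5, df = 50, lower.tail = FALSE)  # df = 50: nearly normal
```

With only 5 degrees of freedom the upper-tail probability is several times larger than the normal value; with 50 degrees of freedom the two are nearly identical.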

William Gosset showed that (Ȳ − µ)/(s/√n) follows the so-called t distribution (Student, 1908). The t "distribution" is actually a family of symmetrical distributions that are centered on zero (Figure 4). Different members of the t family are distinguished by a parameter called the degrees of freedom, which can be thought of as the number of independent pieces of information that are used to estimate a parameter. In this case, n numbers are used to estimate s, but the same numbers also are used to estimate Ȳ. Because s depends on the prior estimation of Ȳ, the number of independent pieces of information relevant to the calculation of s is n − 1. When the degrees of freedom are greater than about 20, the t distribution is essentially identical to the standard normal distribution (i.e., the z distribution). When there are fewer than 20 degrees of freedom, the t distribution has thicker tails than the z distribution (i.e., extreme values are more likely).
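The constraint behind n − 1 can be demonstrated directly: deviations from the sample mean must sum to zero, so once any n − 1 of them are known, the last one is fixed. A quick check in R, using the IQ scores from these notes (a sketch):

```r
# deviations from the sample mean always sum to zero (to rounding error),
# so only n - 1 of them are free to vary
y   <- c(96, 94, 102, 86, 100, 92, 105, 86, 111,
         82, 96, 79, 79, 85, 99, 97, 105, 98, 94, 73)
dev <- y - mean(y)
sum(dev)             # effectively 0
# hence the last deviation is determined by the first n - 1:
tail(dev, 1)
-sum(head(dev, -1))  # same value
```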

The logic of using the t distribution to evaluate the null hypothesis is similar to the logic used in the z test. The only difference is that the range of "typical" values of t generally will differ from ±1.96 and ±2.576. The so-called critical values of t are listed in Table A.1 in your textbook. For our sample, df = 20 − 1 = 19. If we set α = .05, then we can use Table A.1 to determine that the critical values of t are ±2.09. Let's assume that, for our data, s = 14. Therefore,

t = (93 − 100) / (14/√20) = −2.236

The observed value of t falls outside ±2.09, so we reject the null hypothesis that µ = 100 (t = −2.236, p < .05). As with the z test, modern practice is to report the exact p value, which is given by most statistical software. Therefore, we would write "the sample mean differed significantly from 100 (t(19) = −2.236, p = 0.0376)". The number in parentheses is the degrees of freedom.
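These values can be reproduced in R: qt() returns critical values of the t distribution, and pt() returns cumulative probabilities (a sketch using the numbers from the example above):

```r
# t statistic for the worked example (ybar = 93, mu0 = 100, s = 14, n = 20)
t.obs <- (93 - 100) / (14 / sqrt(20))  # -2.236
# critical values for alpha = .05, df = 19 (cf. Table A.1)
qt(c(0.025, 0.975), df = 19)           # approximately -2.09 and +2.09
# exact two-tailed p value: area in both tails beyond |t.obs|
p <- pt(t.obs, df = 19) + pt(-t.obs, df = 19, lower.tail = FALSE)
p                                      # approximately 0.0375
```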

The following command shows how a t test can be done in R, assuming that the data are stored in the variable my.scores, that the null hypothesis is µ = 100, and that it is a two-tailed test (i.e., that the alternative hypothesis is µ ≠ 100):

> my.scores <- c(96, 94, 102, 86, 100, 92, 105, 86, 111,
                 82, 96, 79, 79, 85, 99, 97, 105, 98, 94, 73)
> t.test(x=my.scores, alternative="two.sided", mu=100)

0.4 two-tailed vs. one-tailed tests

Our t.test example evaluated the null hypothesis that µ = 100. In fact, this hypothesis does not quite capture the essence of the experimental question, which was whether exposure to the toxin had reduced verbal IQ. A better set of null and alternative hypotheses would reflect this directional aspect of the question:

H0: µ ≥ 100
H1: µ < 100

As in the original example, the null hypothesis is formulated in a way that is opposite to our expectations of the effect of the toxin. Also, the null and alternative hypotheses are mutually exclusive (i.e., only one can be true) and exhaustive (i.e., one must be true). The difference between the original hypotheses and this new set is that the latter are directional. Now, only scores that are less than 100 tend to favor H1, and




therefore we need only one criterion to decide if our observed score is unusually far below 100. If we are doing a z test, for example, then the criterion would be −1.64 for α = 0.05, and −2.33 for α = 0.01. If we were doing a t test in R, we would type

> t.test(x=my.scores,alternative="less", mu=100)

On the other hand, if we wanted to determine if exposure to the toxin had increased IQ, then we wouldevaluate the hypotheses

H0: µ ≤ 100

H1 : µ > 100

with the R command

> t.test(x=my.scores,alternative="greater", mu=100)
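The one-tailed critical values quoted above come from the quantile functions of the normal and t distributions (a sketch; df = 19 matches the sample in these notes):

```r
# one-tailed (lower) critical values for the z test
qnorm(0.05)        # -1.645  (alpha = .05)
qnorm(0.01)        # -2.326  (alpha = .01)
# corresponding one-tailed critical value for a t test with df = 19
qt(0.05, df = 19)  # -1.729
```

Note that the one-tailed t criterion (−1.729) is less extreme than the two-tailed criterion (−2.09) for the same α, which is why directional tests reject H0 more readily.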

0.5 Type I vs. Type II errors

We have said that rejecting the null hypothesis when it is true is a Type I error, and that the probability of making such an error is α. Another kind of error occurs when we fail to reject the null hypothesis when it is false. This type of mistake is called a Type II error, and the probability of it occurring is called β. In fact, the two states of the world regarding H0 (i.e., it is either True or False) combined with the two possible decisions we can make regarding H0 mean that there are four possible decision outcomes (see Table 1). Obviously, we would like to maximize our correct decisions, and so we should minimize the probabilities of making both Type I (α) and Type II (β) errors. Of course, Type I errors can be minimized by adopting very small levels of α. Unfortunately, adopting a small α will (all other things being equal) lead to an increase in β.

Table 1: Possible outcomes of hypothesis testing.

decision            H0 is True             H0 is False
reject H0           Type I error (p = α)   Correct (p = 1 − β = power)
do not reject H0    Correct (p = 1 − α)    Type II error (p = β)

The probability of rejecting H0 when it is false is referred to as the power of a statistical test, and it equals 1 − β. Obviously, the power of a test depends on the size of the effect being studied. In our IQ example, for instance, it would be much easier to detect the effect of the toxin if it reduced IQ by 50 points than if it reduced IQ by only 1 point. Power also generally increases with increasing sample size. Do you see why? (Hint: Think about what happens to the variation among sample means as sample size increases). Finally, the power of a test declines as α decreases. The power of a t test can be calculated using R's power.t.test() command. The following example shows how to use the command to compute the power of a two-sided t test:

> power.t.test(n=20, delta=5, sd=10, sig.level=0.05, type="one.sample", alternative="two.sided")

     One-sample t test power calculation

              n = 20
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.5644829
    alternative = two.sided





The power is 0.56. What does this probability mean? If our sample size is 20, and if the scores are selected from a population that has a standard deviation of 10 and a mean that is 5 less than the mean in the null hypothesis, then the probability is 0.56 that we will (correctly) reject the null hypothesis. So, if the population mean of our IQ scores is 95, then we have a 56% chance of correctly rejecting H0 when we use a sample size of 20. If we evaluate a one-tailed null hypothesis, then the power increases to 0.69:

> power.t.test(n=20, delta=5, sd=10, sig.level=0.05, type="one.sample", alternative="one.sided")

     One-sample t test power calculation

              n = 20
          delta = 5
             sd = 10
      sig.level = 0.05
          power = 0.6951493
    alternative = one.sided

This increase in power is one very good reason for using one-tailed tests whenever possible.
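The value returned by power.t.test() can also be approximated by simulation: draw many samples from the alternative distribution, run the t test on each, and count how often H0 is rejected (a sketch; the number of simulated experiments and the seed are arbitrary choices):

```r
# Monte Carlo estimate of the power of a two-sided one-sample t test
# (n = 20, true mean = 95, sd = 10, H0: mu = 100, alpha = .05)
set.seed(1)
n.sims <- 5000
reject <- replicate(n.sims, {
  y <- rnorm(n = 20, mean = 95, sd = 10)  # sample under the alternative
  t.test(y, mu = 100)$p.value < 0.05      # TRUE if H0 is rejected
})
mean(reject)  # proportion of rejections: approximately 0.56
```

The proportion of simulated experiments that reject H0 converges on the analytic power (0.5645) as n.sims grows.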

0.6 what do p-values mean?

A p value indicates the probability of getting our score (Ȳ = 93), or one that is more extreme (i.e., further from the mean), assuming that the null hypothesis is true. We can write this more formally with the notation that is used to express conditional probabilities: the probability of getting our data (or data that are more extreme) given that the null hypothesis (H0) is true is P(data|H0). Notice that this probability is not quite the same as the probability that H0 is true given our data, or P(H0|data), which is usually what we want to know. It is vitally important that you understand that these two conditional probabilities are not the same:

P(data|H0) ≠ P(H0|data)

In fact, there is no way to derive P(H0|data) from P(data|H0) in the absence of other information about the a priori probability that H0 is true. Another common mistake is to interpret a p value as indicating the probability that a particular finding would be replicated: for instance, that a p value of .01 means that if the experiment were conducted many times then a significant result would be obtained 99% of the time. Such an interpretation is incorrect: the probability of obtaining a statistically significant result when the null hypothesis is false depends on the effect size, sample size, and the alpha level. Indeed, given the moderate-to-low power of many psychological experiments, the actual probability of obtaining a significant result typically is much lower than a naïve guess based on the p value. Finally, there is a tendency to interpret statistical tests (incorrectly) in a binary fashion: a significant test is interpreted as showing that an effect is real, whereas a non-significant test indicates the effect is literally zero. These and other issues have been discussed by many statisticians and scientists in what is now a large literature on the use and abuse of p values, as well as the limitations of the null hypothesis testing procedure that I have described in these notes (e.g., Altman and Bland, 1995; Cohen, 1994; Dixon, 2003; Gelman, 2013; Krantz, 1999; Loftus, 1996; Lykken, 1968).
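To see how far apart P(data|H0) and P(H0|data) can be, Bayes' theorem can be applied with some illustrative numbers (a sketch; the prior probability of 0.5 is an assumption made only for this example, and 0.56 reuses the power computed earlier in these notes):

```r
# How probable is H0 given a significant result?
# P(H0 | sig) = P(sig|H0) P(H0) / [ P(sig|H0) P(H0) + P(sig|H1) P(H1) ]
p.sig.H0 <- 0.05   # alpha: P(significant result | H0 true)
p.sig.H1 <- 0.56   # power: P(significant result | H0 false)
prior.H0 <- 0.50   # assumed prior probability that H0 is true
post.H0  <- (p.sig.H0 * prior.H0) /
            (p.sig.H0 * prior.H0 + p.sig.H1 * (1 - prior.H0))
post.H0            # approximately 0.082, not 0.05
```

Even with these generous assumptions, the posterior probability of H0 (about 0.08) is not the same as α, and it changes whenever the prior or the power changes.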

It also is important that you realize that p values are correct only if the assumptions underlying the statistical tests are true. A t test, for example, assumes that the scores are drawn independently from a normal population. If these assumptions are correct, then the p values obtained with a t test will be correct. If the distribution deviates from a normal distribution, or if the scores are not independent, then the p values may be very misleading. Other statistical procedures (the analysis of variance, for example) make more assumptions about the data. Again, the p values are exactly correct only if all of the assumptions are true. If one or more assumptions are false, then the p values are, at best, only approximately correct. In general, it is unlikely that all of the assumptions of a statistical test are strictly true, and so it is unlikely that the p values are exactly correct. It makes you wonder if p values of 0.04, 0.05, and 0.06 differ in any meaningful way...


Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151-159.

Abstract: Most theories in the areas of personality, clinical, and social psychology predict no more than the direction of a correlation, group difference, or treatment effect. Since the null hypothesis is never strictly true, such predictions have about a 50-50 chance of being confirmed by experiment when the theory in question is false, since the statistical significance of the result is a function of the sample size. Confirmation of a single directional prediction should usually add little to one's confidence in the theory being tested. Most theories should be tested by multiple corroboration and most empirical generalizations by constructive replication. Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.
