Things I’ve learned about Experimental Design and Data Analysis in 2007


Paper of the Year 2007

• Wine (Bacardi / Malibu) etc as prizes

• EVERYONE to participate please (PIs to enforce !)

• Name a paper that
   • you think is the best this year (pain, regeneration or other)
   • has inspired your work or will guide it next year
   • others should know about

• First to nominate via PDF upload to PainShared/Paper of the Year 2007

• Please also print a few copies and leave on the shelves in the lunch area

• Presentations and prize giving in January 2008

• You vote who wins!

Things I’ve learned about Experimental Design and Data Analysis in 2007

Or -- The Good, The Bad and the Ugly

The Good

STATISTICS HELL

http://www.statisticshell.com/statisticshell.html

Dr Andy Field (University of Sussex)

Dr Dom Spina (KCL)

Prof Steve Dunnett (U Cardiff)

References.....

• Dr Michael Festing BSc MSc PhD CStat FIBiol !!!!

• www.isogenic.info

http://www.isogenic.info/

Steve Dunnett’s Handout – all you need to know about stats on a single page

At meetings, how to guess if it’s significant

Cumming et al., 2007 JCB 177:7-11

If the gap between the tips of the SE error bars is more than 1 standard error, then the groups are probably different at a significance level of 0.05

n.b. less reliable where n is nearer 3

The Bad

Experiments will always be biased unless you act

Randomize the ‘experimental units’

Use blinding strategies

Tell us in talks if you did.... and how...

In lab meetings, please don’t just show a graph and say.....

“Drug had an effect.”

Show us the stats and PROVE your point

Repeated measures ANOVA (F 2,16)=1.24; p=0.04

Tukey post hoc t-test between drug and control only significant at week 5 (p<0.05)

(groups, ‘n’s, experimental design and error bars would be nice too)

The Ugly

Presenting data in lab meets

choose an appropriate average

mean (arithmetic or geometric): depict with SEM, SD or CI

median (less sensitive to outliers): depict with box plots

mode

In lab meetings, why not show us the data?

It helps us understand the real nature of the data – especially outliers

Festing & Altman 2002

Tell us why you chose that control ….

Many people use “vehicle” as a control for a recombinant protein / antibody / enzyme.

Why not use some other recombinant protein?
• same species
• similar MW (diffusion)
• similar stability

e.g. counters any non-specific effect of the recombinant protein (immunogenic response?)

More Good

What test should I use?

• Steve Dunnett’s one page summary

• Andy Field’s flow charts

• next bit is stolen from Dom Spina. Bit lengthy....

T tests: to assess whether means of two groups differ

DON’T use multiple t-tests if you have more than two groups!

BEFORE DOING THE EXPERIMENT

decide on your Null hypothesis and level of significance

Can you predict the direction of change? e.g. the therapy should increase axon growth

If not, then use a two tailed test

ONE TAILED T-TESTS INCREASE YOUR POWER (the p value is halved when the effect is in the predicted direction).

Assumption 1 for t tests

data values are normally distributed

if not, then try transforming or a non-parametric test

+/- 1.96 SD=95%

Assumption 2 for t tests

variances are equal (“homogeneous”) in the two groups

check with a graph (or formal test)

[Figure: heights of plants (cm) for the “present” and “newer” groups]

Heights of plants (cm)

present   newer
48.20     52.30
54.60     57.40
58.30     55.60
47.80     53.20
51.40     61.30
52.00     58.00
55.20     59.80
49.10     54.80
49.90
52.60

            present   newer
Mean        51.91     56.55
Variance    11.36     9.89
Std dev     3.37      3.144

F = 1.15 (df = 9,7), P > 0.05
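The variance-ratio check on the plant heights can be reproduced with a few lines of Python. This is only a sketch: it assumes NumPy and SciPy are available, the variable names are made up for illustration, and doubling the upper tail to get a two-sided p value is one common convention rather than something the slide specifies.

```python
import numpy as np
from scipy import stats

# Plant heights (cm) from the slide
present = np.array([48.2, 54.6, 58.3, 47.8, 51.4, 52.0, 55.2, 49.1, 49.9, 52.6])
newer = np.array([52.3, 57.4, 55.6, 53.2, 61.3, 58.0, 59.8, 54.8])

# Sample variances (ddof=1 divides by n - 1)
var_present = present.var(ddof=1)
var_newer = newer.var(ddof=1)

# Variance ratio with the larger variance on top
f_ratio = var_present / var_newer
df_num, df_den = len(present) - 1, len(newer) - 1

# Two-sided p value: double the upper tail (assumed convention), capped at 1
p_two_sided = min(1.0, 2 * stats.f.sf(f_ratio, df_num, df_den))

print(f"F = {f_ratio:.2f} (df = {df_num},{df_den}), p = {p_two_sided:.2f}")
```

The ratio comes out close to the F = 1.15 quoted above, so there is no evidence against equal variances here.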

Three types of t test

single sample - comparing experimental data with a known population (RARE)

t test for independent samples - comparing data from two different groups

paired t test - used for paired or ‘before and after’ data

some examples......

Thinking about experimental design and statistics BEFORE an experiment is important.....

here’s why....

Jo measures the amount of energy expended (kcal) in subjects during exercise.
‘Can we reject the null hypothesis that pre-loading with drink makes no difference in the amount of energy expended?’

Water         Gatorade
Alan  178     Bob    180
Cath  189     Dave   191
Eric  208     Freda  210
Gav   200     Hugo   202
Irin  188     Jake   189
Lucy  176     Mike   177
Nata  190     Olaf   191
Phil  201     Rob    202
Stu   187     Toby   188
Una   200     Viv    201

[Figure: kilocalories expended, water vs Gatorade]

Unpaired t-test has to be used

Null hypothesis: The sports drink will have no effect on kilocalories burnt

Group 1 (water): 191.7 ± 10.4
Group 2 (sports drink): 193.1 ± 10.5
(mean ± SD)

t statistic calculated as -0.3
degrees of freedom (n1 + n2 - 2): 18
One- or two-tailed P value? Two-tailed

p = 0.768, not significant

Can use a one-tailed t-test if you predict the direction of the effect a priori

Directional hypothesis: The sports drink will increase kilocalories burnt

Group 1 (water): 191.7 ± 10.4
Group 2 (sports drink): 193.1 ± 10.5

t statistic calculated as -0.3
degrees of freedom (n1 + n2 - 2): 18
One- or two-tailed P value? One-tailed

p = 0.768 / 2 = 0.384, closer but still not significant

Suppose Jo had used a paired design (using half as many subjects) and got the same data.

Water Gatorade

        Water   Gatorade
Alan    178     180
Cath    189     191
Eric    208     210
Gav     200     202
Irin    188     189
Lucy    176     177
Nata    190     191
Phil    201     202
Stu     187     188
Una     200     201

[Figure: kilocalories expended, water vs Gatorade, same subjects]

Can now use a paired t-test

Null hypothesis: The sports drink will have no effect on kilocalories burnt

Group 1 (water): 191.7 ± 10.4
Group 2 (sports drink): 193.1 ± 10.5

t statistic calculated as -8.57
degrees of freedom (number of pairs - 1): 9
One- or two-tailed P value? Two-tailed

p < 0.001, highly significant

(rough sketch)
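Jo's two analyses can be rerun side by side. The sketch below assumes SciPy is installed and uses `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`; the variable names are illustrative. The one-tailed p value is obtained by halving the two-tailed one, as on the slide, which is only valid when the effect is in the predicted direction.

```python
import numpy as np
from scipy import stats

# Energy expended (kcal); same numbers used for both designs
water = np.array([178, 189, 208, 200, 188, 176, 190, 201, 187, 200], dtype=float)
gatorade = np.array([180, 191, 210, 202, 189, 177, 191, 202, 188, 201], dtype=float)

# Design 1: 20 different subjects, so an unpaired (independent samples) t-test
t_unpaired, p_unpaired = stats.ttest_ind(water, gatorade)

# One-tailed p value, halving the two-tailed value (effect is in the predicted direction)
p_one_tailed = p_unpaired / 2

# Design 2: 10 subjects measured under both conditions, so a paired t-test
t_paired, p_paired = stats.ttest_rel(water, gatorade)

print(f"unpaired: t = {t_unpaired:.2f}, two-tailed p = {p_unpaired:.3f}, one-tailed p = {p_one_tailed:.3f}")
print(f"paired:   t = {t_paired:.2f}, two-tailed p = {p_paired:.2g}")
```

The numbers are identical in both designs; only the pairing changes, and that alone moves the result from nowhere near significant to highly significant.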

Non-parametric statistics (aka distribution-free statistics)

make no assumptions about underlying distribution of the data

most based on ranking methods

less powerful than parametric tests

if you have two groups then.....

Non-parametric test       Parametric equivalent
Wilcoxon test             Student’s paired t-test
Mann-Whitney test         Student’s (unpaired) t-test

Non-parametric test       Parametric equivalent
Mann-Whitney test         Student’s unpaired t-test
Wilcoxon test             Student’s paired t-test
Kruskal-Wallis            ANOVA
Friedman’s test           Repeated measures ANOVA
Spearman’s correlation    Pearson correlation

WHAT IF I HAVE MORE THAN TWO GROUPS?

KEY RULE: DON’T USE MULTIPLE T-TESTS. VIOLATES INDEPENDENCE.
Risks getting false positive results.

ANYHOW, ANOVA AND K-W ARE MORE POWERFUL.

Non Parametrics: because real measurements are often not normally distributed

In an experiment to determine latency to respond to thermal stimulus (a hotplate at 49 °C) female mice were pretreated with either vehicle (0.9% methylcellulose) or various doses of PD117302 (a kappa opioid agonist) 30 min before testing. Reaction times (s) were recorded by an observer unaware of the animals’ treatment and the following results obtained:

Control   1 mg/kg po   3 mg/kg po   9 mg/kg po
60        52           30.5         60
29.4      26.5         56.4         60
14.8      22.2         43.5         48.8
29.2      31.6         55.5         46
17.5      23.6         60           60
19.5      19.7         31.6         60
29        20.5         21.3         60

measurements are not normally distributed

[Figure: reaction time (s) against dose of PD117302 (control, 1 mg/kg, 3 mg/kg, 9 mg/kg)]

Kruskal-Wallis statistic = 13.4, P = 0.0038
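For anyone who wants to rerun the hotplate analysis, here is a minimal sketch using `scipy.stats.kruskal`, which ranks all the observations together and applies a correction for ties. SciPy is an assumption (the slide does not say which package was used), and the variable names are illustrative.

```python
from scipy import stats

# Hotplate reaction times (s) from the slide, one list per dose group
control = [60, 29.4, 14.8, 29.2, 17.5, 19.5, 29]
dose_1 = [52, 26.5, 22.2, 31.6, 23.6, 19.7, 20.5]
dose_3 = [30.5, 56.4, 43.5, 55.5, 60, 31.6, 21.3]
dose_9 = [60, 60, 48.8, 46, 60, 60, 60]

# Kruskal-Wallis test across the four groups (tie-corrected H statistic)
h_stat, p_value = stats.kruskal(control, dose_1, dose_3, dose_9)

print(f"Kruskal-Wallis H = {h_stat:.1f}, p = {p_value:.4f}")
```

A significant result only says that at least one group differs; a post hoc comparison is still needed to say which doses differ from control.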

Single factor ANOVA

When testing the hypothesis that several sample means vary amongst themselves more than should be expected on the basis of random sampling, then use ANOVA

Comparison of systematic differences and error variance (experimental error)

Advantages of ANOVA

• Test the null hypothesis of several means (reduces time and effort)

• Increases the sensitivity of a test (increase in degrees of freedom)

• Avoid violation of one of the major assumptions of comparison between different means (assumption of independent comparisons)

Heart rate was monitored in the absence and presence of various drugs

Beats/min

control   Drug 1   Drug 2
203       180      160
178       155      142
147       139      148
162       135      156
190       165      172

One way ANOVA: results look like this

                 SS      df    MS     F      P
Between group    1444    2     721    2.23   0.15
Within group     3879    12    323
Total            5321    14
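The heart-rate ANOVA can be checked with `scipy.stats.f_oneway`. This is a sketch that assumes SciPy is available and treats the three columns as independent groups, as the one-way table does; the variable names are illustrative.

```python
from scipy import stats

# Heart rate (beats/min) from the slide
control = [203, 178, 147, 162, 190]
drug_1 = [180, 155, 139, 135, 165]
drug_2 = [160, 142, 148, 156, 172]

# One-way (single factor) ANOVA across the three groups
f_stat, p_value = stats.f_oneway(control, drug_1, drug_2)

print(f"F(2,12) = {f_stat:.2f}, p = {p_value:.2f}")
```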

Comparisons following ANOVA

Post-hoc analysis

• Bonferroni correction
• Dunnett’s test (vs control)
• Tukey
• Newman-Keuls test
• Scheffé

Pitfalls in Multiple Comparison testing (type I error)

Number of independent   Probability of obtaining one or    Threshold to keep overall
null hypotheses         more P values < 0.05 by chance     risk of type I error = 0.05
1                       5%                                 0.05
2                       10%                                0.025
3                       14%                                0.017
4                       19%                                0.0127
5                       23%                                0.0102
100                     99%                                0.0005
N                       $100\,(1 - 0.95^{N})$              $1 - 0.95^{1/N}$
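The two formulas in the bottom row of the table are easy to evaluate for any number of tests. The sketch below uses hypothetical function names and assumes, as the table does, that the null hypotheses are independent.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n independent true null hypotheses."""
    return 1 - (1 - alpha) ** n_tests


def per_test_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall type I error at alpha."""
    return 1 - (1 - alpha) ** (1 / n_tests)


for n in (1, 2, 3, 4, 5, 100):
    print(f"{n:>3} tests: {100 * familywise_error(n):.0f}% chance of a false positive, "
          f"per-test threshold {per_test_threshold(n):.4f}")
```

The per-test threshold is very close to the simpler Bonferroni value of 0.05 / N, which is why the two are often used interchangeably.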

Transformation of Data which appears non-Gaussian

Type of data    Distribution             Normalizing transform
Count           Poisson distribution     √C
Proportion      Binomial distribution    Arcsine √P
Measurement     Lognormal                Log (M)
Duration                                 1/D

Repeated Measures ANOVA

Matched paired t-test is a more powerful test than the independent groups t test (variability is reduced)

Repeated measures ANOVA is used to compare means resulting from multiple tests of the same or matched subjects

Repeated Measures ANOVA: single factor

Plasma cholesterol levels were measured in 7 subjects taking three drugs.

Test the null hypothesis that all three drugs have similar effects on plasma cholesterol levels.

Subject   Drug 1   Drug 2   Drug 3
1         164      152      178
2         202      181      222
3         143      136      132
4         210      194      216
5         228      219      245
6         173      159      182
7         161      157      165

mean      183      171      191
stdev     30.71    28.5     38.5
sem       11.61    10.8     14.6

                     SS       df    MS       F      P
Between subjects     18730    6     3122
Within subjects
   drugs             1454     2     727      12.6   0.0011
   error             695.3    12    57.94
Total                20880    20

The null hypothesis is therefore rejected. There is a significant treatment effect.
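The sums of squares in this table can be rebuilt by hand, which shows where the extra power comes from: variation between subjects is removed before the drug effect is tested. The sketch below assumes NumPy and SciPy and uses illustrative variable names. It also runs an ordinary one-way ANOVA on the same numbers, as if 21 separate subjects had been used, purely for comparison.

```python
import numpy as np
from scipy import stats

# Plasma cholesterol: rows are subjects 1-7, columns are drugs 1-3
chol = np.array([
    [164, 152, 178],
    [202, 181, 222],
    [143, 136, 132],
    [210, 194, 216],
    [228, 219, 245],
    [173, 159, 182],
    [161, 157, 165],
], dtype=float)

n_subj, n_drug = chol.shape
grand_mean = chol.mean()

# Partition the total variation
ss_total = ((chol - grand_mean) ** 2).sum()
ss_subjects = n_drug * ((chol.mean(axis=1) - grand_mean) ** 2).sum()
ss_drugs = n_subj * ((chol.mean(axis=0) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_subjects - ss_drugs

df_drugs = n_drug - 1
df_error = (n_subj - 1) * (n_drug - 1)

# Repeated measures F: drug effect tested against the within-subject error
f_rm = (ss_drugs / df_drugs) / (ss_error / df_error)
p_rm = stats.f.sf(f_rm, df_drugs, df_error)

# Same numbers treated as 21 independent subjects (one-way ANOVA)
f_oneway, p_oneway = stats.f_oneway(chol[:, 0], chol[:, 1], chol[:, 2])

print(f"repeated measures: F({df_drugs},{df_error}) = {f_rm:.1f}, p = {p_rm:.4f}")
print(f"one-way:           F = {f_oneway:.2f}, p = {p_oneway:.2f}")
```

In the one-way analysis the large between-subject variation stays in the error term and swamps the drug effect.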

[Figure: blood cholesterol (mg/dL) for drugs 1, 2 and 3, shown as a box plot and as a line plot]

Repeated measures more powerful than one way ANOVA

Single factor analysis of the cholesterol data, assuming 21 subjects were recruited to the study:

                        SS       df    MS       F
Between group (drug)    1454     2     727.0    0.67
Within group            19430    18    1079
Total                   20880    20

Carry over effect is a potential confounding factor that must be avoided

The value at one time point is likely to influence successive time points (i.e. if a test at one time point gives a significant result, it is likely that tests performed closer in time will also give significant results)

Cautionary note (see Matthews et al. 1990)

Lots more Good

Experimental design......

Question: Does drug A reduce pain behaviour in rats relative to control drug?

Suppose you use 20 male rats (n=10, 10)

with df = N - 2 = 18

threshold/critical t0.05 = 2.10

Your observed / measured value of t needs to be higher (in absolute value) than the threshold t to provide evidence your results didn’t happen by chance (i.e. were sig).

SMART IDEA: How to increase generality

If I want info about male and female rats do I need to test 40 animals in total for same power?

No. Factorial design enhances power

                           Factor A (Sex)
                           M     F
Factor B      control      5     5
(treatment)   drug A       5     5

df = (N-1) - (T-1)
df = 19 - 3
df = 16

threshold t0.05 = 2.12

Even better would be Repeated measures design

5 males, 5 females:
df = 9 - 3 = 6
threshold t0.05 = 2.45
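The threshold t values for these designs can be looked up from the t distribution rather than from printed tables. The sketch below assumes SciPy and a two-tailed test at the 0.05 level; the dictionary keys are just labels for the three designs described above.

```python
from scipy import stats

# Error degrees of freedom for the three designs discussed above
designs = {
    "20 males, two groups": 18,
    "2 x 2 factorial, 20 rats": 16,
    "repeated measures, 10 rats": 6,
}

# Two-tailed critical t at alpha = 0.05 is the 97.5th percentile of the t distribution
critical_t = {name: stats.t.ppf(0.975, df) for name, df in designs.items()}

for name, df in designs.items():
    print(f"{name}: df = {df}, threshold t = {critical_t[name]:.2f}")
```

Going from 18 to 16 degrees of freedom barely moves the threshold, which is why splitting the same 20 animals across both sexes costs almost nothing.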

Why has no-one ever told me about randomised block design?

Do you do big in vivo experiments with many animals (e.g. 6 each day for 5 days)?

Do you do cell culture experiments on different days and then combine the results?

If so, then SESSION is a factor and contributes variability to your experiment.

You shouldn’t just ignore this!

In fact, you can even IMPROVE power to detect an effect

..... you need ........

Randomised block design


Randomised block design?

Consider an experiment that will take three days.
There are three treatments (A, B, C) with three mice per group.

First, randomize the treatments per block. e.g.

Block (day)   Treatment for mouse
              1   2   3
1             A   B   C
2             B   C   A
3             C   A   B

You then do two way ANOVA without interaction (block is a random, not fixed, factor).
This is easy and I can show you how. IT IMPROVES POWER TO DETECT AN EFFECT
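Drawing up the allocation for each block takes only a few lines. This is a sketch using Python's standard `random` module; the function name and the seed are made up for illustration, and each block simply receives every treatment once in a freshly shuffled order.

```python
import random


def randomised_blocks(treatments, n_blocks, seed=None):
    """Return {block number: treatment order}, shuffled independently per block."""
    rng = random.Random(seed)
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)  # every treatment appears exactly once per block
        layout[block] = order
    return layout


# Three days (blocks), three treatments, one mouse per treatment per day
layout = randomised_blocks(["A", "B", "C"], n_blocks=3, seed=2007)
for block, order in layout.items():
    print(f"Day {block}: mouse 1 -> {order[0]}, mouse 2 -> {order[1]}, mouse 3 -> {order[2]}")
```

Fixing the seed makes the allocation reproducible, so it can be recorded in the lab book before the experiment starts.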


Suppose you’ve run a big experiment.

Three treatments (A, B, C) were delivered to groups of 20 stroke rats on three different surgical days. You thus have two factors (treatment and session/block). A month later, you measure reaching ability.

Treatment   Block 1   Block 2   Block 3
A           10        12        7
B           7         8         5
C           5         6         4

By eye we can see that reaching ability is best in A (independent of block).
But reaching is also best for block 2 across groups (smaller lesions that day?)

take home message: include “session” in your ANOVA

The two-way ANOVA table for these data (calculated by computer) is as follows:

Source             DF   SS       MS       F       p
Session            2    16.889   8.444    13.82   0.016
Treatment          2    33.556   16.778   27.45   0.005
Error / residual   4    2.444    0.611
Total              8    52.889

The two-way ANOVA without interaction partitions the total variation into parts associated with treatments, session and error..... tells you about how big the effect of session (block) was...... and there is therefore less variation in the “error” term

From these results it is clear that there were large block differences (p=0.016), implying that including “block” as a variable was worthwhile: it will have improved the power of the experiment.

There were statistically significant treatment effects i.e. the null hypothesis that there are no differences among the three treatment means is rejected at p=0.005.
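The table above can be reproduced by partitioning the sums of squares directly. The sketch assumes NumPy and SciPy, one value per treatment per block (as in the data table), and illustrative variable names. It is not necessarily how the original table was computed.

```python
import numpy as np
from scipy import stats

# Reaching scores: rows are treatments A, B, C; columns are blocks (sessions) 1-3
scores = np.array([
    [10, 12, 7],
    [7, 8, 5],
    [5, 6, 4],
], dtype=float)

n_treat, n_block = scores.shape
grand_mean = scores.mean()

# Partition the total variation into treatment, session (block) and error
ss_total = ((scores - grand_mean) ** 2).sum()
ss_treat = n_block * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_block = n_treat * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_treat - ss_block

df_treat, df_block = n_treat - 1, n_block - 1
df_error = df_treat * df_block  # no interaction term is fitted
ms_error = ss_error / df_error

f_treat = (ss_treat / df_treat) / ms_error
f_block = (ss_block / df_block) / ms_error
p_treat = stats.f.sf(f_treat, df_treat, df_error)
p_block = stats.f.sf(f_block, df_block, df_error)

print(f"Session:   F = {f_block:.2f}, p = {p_block:.3f}")
print(f"Treatment: F = {f_treat:.2f}, p = {p_treat:.3f}")
```

If session were left out, its 16.889 units of variation would sit in the error term and the treatment F would be much smaller.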


take home message: include “session” in your ANOVA

Andy’s papers (Lew 2007 BJP 152:299-303) proves this

If you disagree, you have to find me a reference.........

Next bit is controversial, so see:

http://www.isogenic.info/html/6__experimental_unit.html


What’s my “n” ?

n, the experimental unit, is the entity in the experiment “which can be assigned at random and independently to a treatment” (Festing et al., 1998 ATLA 26, 283-301)

“does my virus boost growth of neurites in vitro?”

two mice provide cells for six dishes

3 are randomly selected for Control virus
3 are randomly selected for Experimental virus

then measure lengths of neurites of 100 neurons per well

WHAT’S MY n ?
(ask first, what is the entity)

What’s my “n”?

The experimental unit is “dish”.

There are six of them.

You take means or medians per well and then do stats with these.

Counting more neurons doesn’t increase “n”, however many hours you already spent.

Planning experiments with more dishes WILL increase “n”.

If you disagree, you have to find me a reference.........

“If cells from an animal are cultured in a number of dishes that can be assigned to different in vitro treatments, then the dish of cells is the experimental unit” (Festing & Altman, 2002).
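In practice this means collapsing the neuron-level measurements to one number per dish before running any test. The sketch below uses simulated neurite lengths, because the slides give no raw data: the means, the spread and the seed are all invented for illustration, and NumPy and SciPy are assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# HYPOTHETICAL data: 3 dishes per virus, 100 neurite lengths (um) per dish
control_dishes = rng.normal(loc=100, scale=20, size=(3, 100))
virus_dishes = rng.normal(loc=120, scale=20, size=(3, 100))

# One summary value per dish: the dish, not the neuron, is the experimental unit
control_means = control_dishes.mean(axis=1)
virus_means = virus_dishes.mean(axis=1)

n_control, n_virus = len(control_means), len(virus_means)
df = n_control + n_virus - 2

# The t-test is run on the per-dish means, so n = 3 per group and df = 4
result = stats.ttest_ind(control_means, virus_means)

print(f"n = {n_control} and {n_virus}, df = {df}, t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

Feeding all 600 neurite lengths into the t-test instead would give df = 598 and a far smaller p value, but it would be treating neurons from the same dish as independent when they are not.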


Replicate experiments

Suppose you run the cell culture experiment four times over one month.

Is this n=4? How to analyse?

It’s a randomized block factorial design again.

Thus two way ANOVA, with block and drug as factors.

You’ll learn if there was an effect of session (and it will be parcelled out).
The study will be more powerful.

Replicate samples

A “replicate” refers to repetition of measurement on one sample. Tells you about variation due to measurement.

If you run three PCR reactions using the same primers and the same cDNA, then you have three replicates.

If you run three PCR reactions using the same primers and cDNA from three different animals, then you have no replicates, but n=3 and you can do stats.


How generalisable are my results ?

Thus it’s possible to run an experiment with multiple “n” using tissue derived from only a single mouse.

“n” tells you about the number of units in an experiment, and NOT necessarily about the number of animals used.

“p” tells you how likely it is that you would get results at least this extreme (using these samples) by chance alone if there were no real effect (e.g. 0.05)

So how far are your results predictive of the rodent population? Would you get these results if you’d used a different mouse?

If you wish to know this, you have to design your experiments properly!!!! e.g. run the cell culture experiment four times each time with different mice


Crowd participation: What’s my “n” ?

4 rats per cage, one drinking bottle per cage

10 cages:
5 cages given control in drinking water
5 cages given drug in drinking water

What’s my ‘n’?

n is 5 and 5.

You take the mean (or median) per cage.
Use these numbers in a t-test (or Mann-Whitney).

Plea bargain

I promise never to give a talk on this again if you all

1. report your experimental design clearly
   i. did you randomise?
   ii. did you measure / analyse blind?
   iii. did you use randomised block design?

2. show raw data as well as means plus errors
   i. show us your outliers!

3. describe what statistical methods you used
   i. parametric? non parametric?
   ii. what posthocs? how about multiple comparisons?

Thanks for listening!

Good science: Experimental design

Can I pair my samples?

Paired (within) designs are more powerful.

e.g.
• twin study
• two inbred (isogenic) animals
• same well of cells first with drug A, then with drug B
• same rat, L5 ipsilateral to nerve injury and L5 contralateral

Unpaired (between) designs are less powerful

e.g.
• brother and sister
• two different wells, one with drug A, one with drug B

Ask yourself:
• can I use twins?
• can I use inbreds?
• can I pre- and post- measure the same well or animal?
• can I compare ipsi and contra from the same animal?

Ravers

Suppose you wanted to know if crystal meth kills more ravers than ecstasy.

Ethical issues aside, if you could directly test humans in your study, who would you choose?

One option would be to recruit 16 students that are age- and sex- matched.

PTO.....


Unrelated students

Control Treated

Student 1 Student 9

Student 2 Student 10








But genetic differences are sources of variability

A powerful option would be to recruit genetically identical twins.....

Control Treated

Twin 1A Twin 1B

Twin 2A Twin 2B

Twin 3A Twin 3B

Twin 4A Twin 4B

Twin 5A Twin 5B

Twin 6A Twin 6B

Twin 7A Twin 7B

Twin 8A Twin 8B


Suppose you use outbred rats

Control Treated

Sprague 1 Sprague 9

Sprague 2 Sprague 10








Inbred rats are isogenic – like twins – so more powerful

Control Treated

F344 1 F344 9

F344 2 F344 10

F344 3 F344 11

F344 4 F344 12

F344 5 F344 13

F344 6 F344 14

F344 7 F344 15

F344 8 F344 16


So why use outbred rats?

“To model the variability in the human population”

this could be done as follows


But what I didn’t know..... you can use multiple isogenics

Control Treated

F344 1 F344 2

F344 3 F344 4

LEW 1 LEW 2

LEW 3 LEW 4

DA 1 DA 2

DA 3 DA 4

WKY 1 WKY 2

WKY 3 WKY 4


Pair them beforehand and then use ANOVA

You can also pair sexes as well as strains in the same experiment.

If this is compatible with your outcome measures, then you will learn about differences between genotype and differences between sex for free.

IT DOES NOT REQUIRE EXTRA n

Michael Festing

You’d do ANOVA using two factors (strain and treatment)

Worked example

Suppose 16 outbred Sprague Dawley rats are treated with a drug or a control and something is measured.

Control Treated

12 16

15 17

18 15

9 15

7 9

16 19

15 18

10 14

Mean 12.75 15.38

Worked example

Two-sample t-test

          N   Mean    StDev   SE Mean
Control   8   12.75   3.85    1.4
Treated   8   15.38   3.07    1.1

Estimate for difference is -2.63
95% confidence interval for difference is (-6.36, 1.11)
T-Value = -1.51, P-Value = 0.153, DF = 14. THUS NOT SIGNIFICANT
Both use Pooled StDev = 3.48

Worked example

Suppose exactly same data is obtained but using PAIRS of different inbred rats

Strain Control Treated Difference

DA 12 16 -4

F344 15 17 -2

LEW 18 15 3

WKY 9 15 -6

BDIX 7 9 -2

BUF 16 19 -3

ACI 15 18 -3

MNR 10 14 -4

Mean 12.75 15.38 -2.625

Worked example

Paired t-test (One-Sample, using differences)

             N   Mean     StDev   SE Mean
Difference   8   -2.625   2.615   0.925

95.0% confidence interval of difference (-4.813, -0.437)
T value -2.84, p=0.025. THIS IS SIGNIFICANT

Conclusion: Uncontrolled genetic variation reduces the power of the first experiment, leading to more negative results or the need to use larger sample sizes
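The paired analysis is just a one-sample t-test on the within-strain differences, so it can be worked through step by step. The sketch assumes NumPy and SciPy and uses illustrative variable names. The confidence interval is built from the t distribution with n - 1 degrees of freedom and may differ from the printed one in the last decimal place through rounding.

```python
import numpy as np
from scipy import stats

# One control and one treated rat from each of eight inbred strains
control = np.array([12, 15, 18, 9, 7, 16, 15, 10], dtype=float)
treated = np.array([16, 17, 15, 15, 9, 19, 18, 14], dtype=float)

# Within-strain differences (control minus treated, as in the table)
diff = control - treated
n = len(diff)

mean_diff = diff.mean()
se_diff = diff.std(ddof=1) / np.sqrt(n)

# One-sample t-test of the differences against zero
t_value = mean_diff / se_diff
p_value = 2 * stats.t.sf(abs(t_value), df=n - 1)

# 95% confidence interval for the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff

print(f"mean difference = {mean_diff:.3f}, SE = {se_diff:.3f}")
print(f"t = {t_value:.2f}, p = {p_value:.3f}, 95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval excludes zero, the paired design detects the effect that the two-sample analysis of the same numbers missed.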


Mead’s Resource Equation; what n should I choose?

“The total information in an experiment involving N experimental units may be represented by the total variation based on (N-1) degrees of freedom (df). In the general experimental situation this total variation is divided into three components, each serving a different function.”

These three sources of variation consist of:

• Treatment
• Block
• Error

Mead (1988) says experiments should be designed to give a good estimate of error, but should not be so big that they waste resources, i.e. the error degrees of freedom should be somewhere between 10 and 20. Above this there are diminishing returns

[Figure: critical value of t0.05 (y axis, 0 to 15) plotted against degrees of freedom (n - 1; x axis, 10, 20, 60), with the optimum size for an experiment* marked]

Increasing n number is a good approach: within limits!

* Mead (1988). The Design of Experiments. Cambridge, NY. Cambridge University Press

Resource equation; what n should I choose?

DF = n-1. So if there are 20 rats in an experiment the total df will be 19. If there are 6 treatments, then the treatments df will be 5.

The method is extremely easy to use as it boils down to the very simple equation: E=N-B-T, where E is the error df and should be between 10 and 20, N is the total df, B is the blocks df, and T is the treatments df.

In a non-blocked design the equation reduces to E=N-T, which should be 10-20. This is simply: the total number of animals minus the number of treatments should be between ten and twenty.

Example: suppose an experiment is planned with four treatments, with eight animals per group (32 rats total). In this case N=31, B=0, T=3, so E=28.

Conclusion: this experiment is a bit too large, and six animals per group might be more appropriate.
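The resource equation is simple enough to wrap in a small helper for planning. This is a sketch with a hypothetical function name: it takes counts of animals, treatments and blocks, converts each to degrees of freedom, and returns E = N - B - T.

```python
def error_df(n_units, n_treatments, n_blocks=1):
    """Error degrees of freedom E = N - B - T from Mead's resource equation."""
    total_df = n_units - 1           # N
    treatment_df = n_treatments - 1  # T
    block_df = n_blocks - 1          # B (0 for a non-blocked design)
    return total_df - block_df - treatment_df


# Four treatments, non-blocked design, 8 versus 6 animals per group
for per_group in (8, 6):
    e = error_df(n_units=4 * per_group, n_treatments=4)
    verdict = "within 10-20" if 10 <= e <= 20 else "outside 10-20"
    print(f"{per_group} per group: E = {e} ({verdict})")
```

With six animals per group E is 20, at the top of the recommended range, which matches the conclusion above.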
