Things I’ve learned about Experimental Design and Data Analysis in 2007
Paper of the Year 2007
• Wine (Bacardi / Malibu) etc as prizes
• EVERYONE to participate please (PIs to enforce !)
• Name a paper that
   • you think is the best this year (pain, regeneration or other)
   • that has inspired your work or will guide it next year
   • others should know about
• First to nominate via PDF upload to PainShared/Paper of the Year 2007
• Please also print a few copies and leave on the shelves in the lunch area
• Presentations and prize giving in January 2008
• You vote who wins!
Things I’ve learned about Experimental Design and Data Analysis in 2007
Or -- The Good, The Bad and the Ugly
The Good
STATISTICS HELL
http://www.statisticshell.com/statisticshell.html
Dr Andy Field (University of Sussex)
Dr Dom Spina (KCL)
Prof Steve Dunnett (U Cardiff)
References.....
• Dr Michael Festing BSc MSc PhD CStat FIBiol !!!!
• www.isogenic.info
Steve Dunnett’s Handout – all you need to know about stats on a single page
At meetings, how to guess if it’s significant
Cumming et al., 2007 JCB 177:7-11
If the gap between the tips of the SE bars is >1 standard error then the groups are probably different at a significance level of 0.05
n.b. less reliable where n is nearer 3
The Bad
Experiments will always be biased unless you act
Randomize the ‘experimental units’
Use blinding strategies
Tell us in talks if you did.... and how...
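One way to do both at once, as a minimal sketch: shuffle the treatment labels across the experimental units, then hand the experimenter a sheet that only shows uninformative codes. The animal IDs, the seed and the "solution 1 / solution 2" codes are made up for illustration; in practice somebody not involved in the measurements should hold the key.

```python
import random

def randomise_units(unit_ids, treatments, seed=None):
    """Assign units to treatments in equal numbers, in random order."""
    if len(unit_ids) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    per_group = len(unit_ids) // len(treatments)
    labels = [t for t in treatments for _ in range(per_group)]
    rng.shuffle(labels)
    return dict(zip(unit_ids, labels))

def blind_codes(treatments, seed=None):
    """Map each treatment to an uninformative code, in random order."""
    rng = random.Random(seed)
    codes = [f"solution {i + 1}" for i in range(len(treatments))]
    rng.shuffle(codes)
    return dict(zip(treatments, codes))

# Hypothetical example: 16 rats, two treatments
rats = [f"rat{i:02d}" for i in range(1, 17)]
allocation = randomise_units(rats, ["control", "drug"], seed=2007)
key = blind_codes(["control", "drug"], seed=2007)   # kept by a colleague

# What the experimenter sees: animal ID and coded solution only
blinded_sheet = {rat: key[group] for rat, group in allocation.items()}
for rat, code in blinded_sheet.items():
    print(rat, code)
```

Fixing the seed makes the allocation reproducible, so you can show in a talk exactly how it was done.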
In lab meetings, please don’t just show a graph and say.....
“Drug had an effect.”
Show us the stats and PROVE your point
Repeated measures ANOVA: F(2,16) = 1.24; p = 0.04
Tukey post hoc test between drug and control only significant at week 5 (p<0.05)
(groups, ‘n’s, experimental design and error bars would be nice too)
The Ugly
Presenting data in lab meets
choose an appropriate average
• mean (arithmetic or geometric): depict with SEM, SD or CI
• median (less sensitive to outliers): depict with box plots
• mode
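As a quick illustration of how much the choice of average matters, here is a small sketch using Python's standard `statistics` module on the control-group hotplate latencies that appear in the non-parametric example later in these slides. One animal hit the 60 s cut-off, and that single value pulls the arithmetic mean well away from the median.

```python
import math
import statistics

# Control-group hotplate latencies (s) from the non-parametric example in these slides
latencies = [60, 29.4, 14.8, 29.2, 17.5, 19.5, 29]

arithmetic_mean = statistics.mean(latencies)
geometric_mean = statistics.geometric_mean(latencies)   # needs Python 3.8+
median = statistics.median(latencies)
sd = statistics.stdev(latencies)                         # sample SD (n - 1 denominator)
sem = sd / math.sqrt(len(latencies))

print(f"arithmetic mean = {arithmetic_mean:.1f} s")
print(f"geometric mean  = {geometric_mean:.1f} s")
print(f"median          = {median:.1f} s")
print(f"SD = {sd:.1f} s, SEM = {sem:.1f} s")
```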
In lab meetings, why not show us the data?
It helps us understand the real nature of the data – especially outliers
Festing & Altman 2002
Tell us why you chose that control ….
Many people use “vehicle” as a control for a recombinant protein / antibody / enzyme.
Why not use some other recombinant protein?
• same species
• similar MW (diffusion)
• similar stability
e.g. counters any non-specific effect of the recombinant protein (immunogenic response?)
More Good
What test should I use?
• Steve Dunnett’s one page summary
• Andy Field’s flow charts
• next bit is stolen from Dom Spina. Bit lengthy....
T tests
to assess whether means of two groups differ
DON’T use multiple t-tests if you have more than two groups!
BEFORE DOING THE EXPERIMENT
decide on your Null hypothesis and level of significance
Can you predict the direction of change? e.g. the therapy should increase axon growth
If not, then use a two tailed test
ONE TAILED T-TESTS HALVE YOUR P VALUE (if the change is in the predicted direction), SO THEY INCREASE YOUR POWER.
Assumption 1 for t tests
data values are normally distributed
if not, then try transforming or a non-parametric test
mean +/- 1.96 SD covers 95% of a normal distribution
Assumption 2 for t tests
variances are equal (“homogeneous”) in the two groups
check with a graph (or formal test)
[Figure: heights of plants (cm), axis 45 to 65, for the "present" and "newer" groups]
heights of plants (cm)
present   newer
48.20     52.30
54.60     57.40
58.30     55.60
47.80     53.20
51.40     61.30
52.00     58.00
55.20     59.80
49.10     54.80
49.90
52.60

            present   newer
Mean        51.91     56.55
variance    11.36      9.89
std dev      3.37      3.144
F = 1.15 (df = 9,7) P > 0.05
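The formal check is just a ratio of the two sample variances. A minimal sketch with the plant heights above (the variable names are mine; looking up the P value for F is left to tables or your stats package):

```python
import statistics

# Heights of plants (cm) from the slide
present = [48.2, 54.6, 58.3, 47.8, 51.4, 52.0, 55.2, 49.1, 49.9, 52.6]
newer = [52.3, 57.4, 55.6, 53.2, 61.3, 58.0, 59.8, 54.8]

var_present = statistics.variance(present)   # sample variance (n - 1 denominator)
var_newer = statistics.variance(newer)

# Put the larger variance on top so that F >= 1
(big, n_big), (small, n_small) = sorted(
    [(var_present, len(present)), (var_newer, len(newer))], reverse=True
)
f_ratio = big / small
df = (n_big - 1, n_small - 1)

print(f"variances: present = {var_present:.2f}, newer = {var_newer:.2f}")
print(f"F = {f_ratio:.2f} (df = {df[0]},{df[1]})")
```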
Three types of t test
• single sample - comparing experimental data with a known population (RARE)
• t test for independent samples - comparing data from two different groups
• paired t test - used for paired or ‘before and after’ data
some examples......
Thinking about experimental design and statistics BEFORE an experiment is important.....
here’s why....
Water          Gatorade
Alan  178      Bob    180
Cath  189      Dave   191
Eric  208      Freda  210
Gav   200      Hugo   202
Irin  188      Jake   189
Lucy  176      Mike   177
Nata  190      Olaf   191
Phil  201      Rob    202
Stu   187      Toby   188
Una   200      Viv    201
Jo measures the amount of energy expended (kcal) in subjects during exercise.
‘Can we reject the null hypothesis that pre-loading with drink makes no difference in the amount of energy expended?’
[Figure: energy expended (kilocalories), axis 50 to 200, for the water and Gatorade groups]
Unpaired t-test has to be used
Null hypothesis: The sports drink will have no effect on kilocalories burnt
Group 1 (water): 191.7 ± 10.4
Group 2 (Gatorade): 193.1 ± 10.5
(mean ± sd)
t statistic calculated as -0.3
degrees of freedom (n1+n2-2): 18
One- or two-tailed P value? Two-tailed
p=0.768 not significant
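For anyone who wants to see where these numbers come from, here is a from-scratch sketch of the unpaired (equal variance) t-test on Jo's data. It uses only the standard library, so the P value is obtained by numerically integrating the t density with Simpson's rule; that is my own shortcut for illustration, and a stats package will do the same job more robustly.

```python
import math
import statistics

water = [178, 189, 208, 200, 188, 176, 190, 201, 187, 200]
gatorade = [180, 191, 210, 202, 189, 177, 191, 202, 188, 201]

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed P: 1 - 2 * (area under the density from 0 to |t|), by Simpson's rule."""
    upper = abs(t)
    h = upper / steps                      # steps must be even
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

def unpaired_t(a, b):
    """Student's t-test for independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

t, df = unpaired_t(water, gatorade)
p_two = two_tailed_p(t, df)
# Halving is only legitimate if the direction was predicted a priori
# and the observed difference lies in that direction
p_one = p_two / 2

print(f"t = {t:.2f}, df = {df}")
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```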
Can use one tailed t-test if you predict the direction of the effect a priori
Null hypothesis: The sports drink will not increase kilocalories burnt
(alternative hypothesis: the sports drink will increase kilocalories burnt)
Group 1 (water): 191.7 ± 10.4
Group 2 (Gatorade): 193.1 ± 10.5
(mean ± sd)
t statistic calculated as -0.3
degrees of freedom (n1+n2-2): 18
One- or two-tailed P value? One-tailed
p=0.768 / 2 = 0.384 closer but still not significant
Suppose Jo had used a paired design (using half as many subjects) and got the same data.
Subject   Water   Gatorade
Alan      178     180
Cath      189     191
Eric      208     210
Gav       200     202
Irin      188     189
Lucy      176     177
Nata      190     191
Phil      201     202
Stu       187     188
Una       200     201
Can now use paired t-test
Null hypothesis: The sports drink will have no effect on kilocalories burnt
Group 1 (water): 191.7 ± 10.4
Group 2 (Gatorade): 193.1 ± 10.5
(mean ± sd)
t statistic calculated as -8.57
degrees of freedom (number of pairs - 1): 9
One- or two-tailed P value? Two-tailed
p<0.001 highly significant
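The reason the same numbers suddenly become significant is that the paired test works on the within-subject differences, which vary far less than the subjects do. A minimal sketch, comparing the observed t against the tabulated two-tailed critical value for 9 degrees of freedom (2.262 at the 0.05 level, taken from standard t tables rather than computed here):

```python
import math
import statistics

water = [178, 189, 208, 200, 188, 176, 190, 201, 187, 200]
gatorade = [180, 191, 210, 202, 189, 177, 191, 202, 188, 201]

# One difference per subject: this is what the paired test analyses
differences = [w - g for w, g in zip(water, gatorade)]
n = len(differences)

mean_diff = statistics.mean(differences)
se_diff = statistics.stdev(differences) / math.sqrt(n)
t_paired = mean_diff / se_diff
df = n - 1

# Two-tailed critical t for df = 9 at the 0.05 level (standard t tables)
T_CRIT_DF9 = 2.262
significant = abs(t_paired) > T_CRIT_DF9

print(f"mean difference = {mean_diff:.1f} kcal, SE = {se_diff:.3f}")
print(f"t = {t_paired:.2f}, df = {df}, significant at 0.05: {significant}")
```

Note how small the SE of the differences is compared with the group SDs of about 10 kcal.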
(rough sketch)
Non-parametric statistics (aka distribution-free statistics)
• make no assumptions about underlying distribution of the data
• most based on ranking methods
• less powerful than parametric tests
if you have two groups then.....
Non Parametric Test      Parametric equivalent
Wilcoxon test            Student’s paired t-test
Mann-Whitney test        Student’s (unpaired) t-test
Non Parametric Test        Parametric equivalent
Mann-Whitney test          Student’s non-paired t-test
Wilcoxon test              Student’s paired t-test
Kruskal-Wallis             ANOVA
Friedman’s test            Repeated measures ANOVA
Spearman’s correlation     Pearson correlation
WHAT IF I HAVE MORE THAN TWO GROUPS?
KEY RULE: DON’T USE MULTIPLE T-TESTS. VIOLATES INDEPENDENCE. Risks getting false positive results.
ANYHOW, ANOVA AND K-W ARE MORE POWERFUL.
Non Parametrics: because real measurements are often not normally distributed
In an experiment to determine latency to respond to thermal stimulus (a hotplate at 49°C) female mice were pretreated with either vehicle (0.9% methylcellulose) or various doses of PD117302 (a kappa opioid agonist) 30 min before testing. Reaction times (s) were recorded by an observer unaware of the animals’ treatment and the following results obtained:
Control   1 mg/kg po   3 mg/kg po   9 mg/kg po
60        52           30.5         60
29.4      26.5         56.4         60
14.8      22.2         43.5         48.8
29.2      31.6         55.5         46
17.5      23.6         60           60
19.5      19.7         31.6         60
29        20.5         21.3         60
measurements are not normally distributed
[Figure: reaction time (sec), axis 0 to 75, for control, 1 mg/kg, 3 mg/kg and 9 mg/kg]
[Figure: response time (s), axis 0 to 75, against dose of PD117302 (control, 1 mg/kg, 3 mg/kg, 9 mg/kg)]
Kruskal Wallis statistic = 13.4, P = 0.0038
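To show what "based on ranking" means in practice, here is a sketch of the Kruskal-Wallis calculation on the hotplate data: pool all values, rank them (ties share the average rank), then compare rank sums between groups. The P value uses the chi-squared approximation with 3 degrees of freedom, written in closed form so that no stats library is needed; the helper names are my own.

```python
import math

# Hotplate reaction times (s) from the slide; 60 s is the cut-off
groups = {
    "control": [60, 29.4, 14.8, 29.2, 17.5, 19.5, 29],
    "1 mg/kg": [52, 26.5, 22.2, 31.6, 23.6, 19.7, 20.5],
    "3 mg/kg": [30.5, 56.4, 43.5, 55.5, 60, 31.6, 21.3],
    "9 mg/kg": [60, 60, 48.8, 46, 60, 60, 60],
}

def kruskal_wallis(samples):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(x for s in samples for x in s)
    n_total = len(pooled)
    rank_of = {}       # value -> average rank
    tie_term = 0       # sum of (t^3 - t) over groups of tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    rank_part = sum(sum(rank_of[x] for x in s) ** 2 / len(s) for s in samples)
    h = 12 / (n_total * (n_total + 1)) * rank_part - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))

def chi2_sf_3df(x):
    """Upper-tail probability of chi-squared with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

h = kruskal_wallis(list(groups.values()))
p = chi2_sf_3df(h)   # df = number of groups - 1 = 3
print(f"Kruskal-Wallis H = {h:.1f}, P = {p:.4f}")
```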
Single factor ANOVA
When testing the hypothesis that several sample means vary amongst themselves more than should be expected on the basis of random sampling, then use ANOVA
Comparison of systematic differences and error variance (experimental error)
Advantages of ANOVA
• Test the null hypothesis of several means (reduces time and effort)
• Increases the sensitivity of a test (increase in degrees of freedom)
• Avoid violation of one of the major assumptions of comparison between different means (assumption of independent comparisons)
Heart rate was monitored in the absence and presence of various drugs
Beats/min
control   Drug 1   Drug 2
203       180      160
178       155      142
147       139      148
162       135      156
190       165      172
One way ANOVA: results look like this
                 SS     df   MS    F      P
Between group    1444    2   721   2.23   0.15
within group     3879   12   323
total            5321   14
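The table is nothing more than sums of squares. A short sketch that rebuilds it from the heart rate data (small rounding differences from the slide are expected). Instead of computing P, it compares F with the tabulated critical value F0.05(2,12) = 3.89 from standard F tables.

```python
import statistics

# Heart rate (beats/min) from the slide
heart_rate = {
    "control": [203, 178, 147, 162, 190],
    "drug 1": [180, 155, 139, 135, 165],
    "drug 2": [160, 142, 148, 156, 172],
}

def one_way_anova(samples):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [x for s in samples for x in s]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in samples)
    ss_within = sum((x - statistics.mean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f

ss_b, ss_w, df_b, df_w, f = one_way_anova(list(heart_rate.values()))

F_CRIT_2_12 = 3.89   # critical F for df = (2, 12) at the 0.05 level (standard tables)
print(f"between: SS = {ss_b:.0f}, df = {df_b}, MS = {ss_b / df_b:.0f}")
print(f"within:  SS = {ss_w:.0f}, df = {df_w}, MS = {ss_w / df_w:.0f}")
print(f"F = {f:.2f}, significant at 0.05: {f > F_CRIT_2_12}")
```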
Comparisons following ANOVA
Post-hoc analysis
• Bonferroni Correction
• Dunnett’s test (vs control)
• Tukey
• Newman-Keuls test
• Scheffe
Pitfalls in Multiple Comparison testing (type I error)
Number of independent   Probability of obtaining one or     Threshold to keep overall
Null Hypotheses         more P values less than 0.05        risk of type 1 error = 0.05
                        by chance
1                       5%                                  0.05
2                       10%                                 0.025
3                       14%                                 0.017
4                       19%                                 0.0127
5                       23%                                 0.0102
100                     99%                                 0.0005
N                       100(1.0 - 0.95^N)                   1.0 - 0.95^(1/N)
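The two formulas in the last row are easy to play with. A small sketch (function names are mine); the Bonferroni threshold, 0.05/N, is included alongside because it is the usual slightly more conservative shortcut to the same thing.

```python
def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one P < alpha among n independent true null hypotheses."""
    return 1 - (1 - alpha) ** n_tests

def per_test_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall type I error at alpha."""
    return 1 - (1 - alpha) ** (1 / n_tests)

def bonferroni_threshold(n_tests, alpha=0.05):
    """Simpler, slightly more conservative approximation: alpha / N."""
    return alpha / n_tests

for n in (1, 2, 3, 4, 5, 100):
    print(f"N = {n:3d}: risk = {100 * familywise_error(n):5.1f}%, "
          f"threshold = {per_test_threshold(n):.4f}, "
          f"Bonferroni = {bonferroni_threshold(n):.4f}")
```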
Transformation of Data which appears non-Gaussian
Type of data       Distribution            Normalizing transform
Count (C)          Poisson distribution    √C
Proportion (P)     Binomial distribution   Arcsine √P
Measurement (M)    Lognormal               Log (M)
Duration (D)                               1/D
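In code these transforms are one-liners. A sketch using Python's `math` module; the slide does not say which base the logarithm should be, so base 10 here is an assumption (any base works for normalising), and the example values are made up.

```python
import math

def transform_count(c):
    """Counts (Poisson): square root."""
    return math.sqrt(c)

def transform_proportion(p):
    """Proportions between 0 and 1 (binomial): arcsine of the square root, in radians."""
    return math.asin(math.sqrt(p))

def transform_measurement(m):
    """Lognormal measurements: logarithm (base 10 assumed here)."""
    return math.log10(m)

def transform_duration(d):
    """Durations: reciprocal."""
    return 1 / d

# Hypothetical values, for illustration only
print(transform_count(9))          # square root of a count of 9
print(transform_proportion(0.5))   # arcsine transform of a proportion of 0.5
print(transform_measurement(100))  # log10 of a measurement of 100
print(transform_duration(4))       # reciprocal of a duration of 4
```

After transforming, check the distribution again with a graph before running the parametric test.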
Repeated Measures ANOVA
Matched paired t-test a more powerful test than the independent groups t test (variability is reduced)
Repeated measures ANOVA is used to compare means resulting from multiple tests of the same or matched subjects
Repeated Measures ANOVA: single factor
Plasma cholesterol levels were measured in 7 subjects taking three drugs.
Test the null hypothesis that all three drugs have similar effects on plasma cholesterol levels.
Subject   Drug 1   Drug 2   Drug 3
1         164      152      178
2         202      181      222
3         143      136      132
4         210      194      216
5         228      219      245
6         173      159      182
7         161      157      165
mean      183      171      191
stdev     30.71    28.5     38.5
sem       11.61    10.8     14.6
                    SS      df   MS      F      P
Between subjects    18730    6   3122
within subjects
   drugs             1454    2   727     12.6   0.0011
   error            695.3   12   57.94
Total               20880   20
The null hypothesis is therefore rejected. There is a significant treatment effect.
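Here is a sketch of how the repeated measures table is built from the cholesterol data, and what happens to F if you ignore the fact that the same seven subjects were measured three times. The only difference between the two analyses is whether the between-subject variation is removed from the error term. Function and key names are my own.

```python
# Plasma cholesterol; rows = subjects 1-7, columns = drug 1, drug 2, drug 3
cholesterol = [
    [164, 152, 178],
    [202, 181, 222],
    [143, 136, 132],
    [210, 194, 216],
    [228, 219, 245],
    [173, 159, 182],
    [161, 157, 165],
]

def repeated_measures_anova(rows):
    """Single-factor repeated measures ANOVA; returns sums of squares, df and F."""
    n_subjects, n_treatments = len(rows), len(rows[0])
    values = [x for row in rows for x in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_subjects = n_treatments * sum(
        (sum(row) / n_treatments - grand_mean) ** 2 for row in rows)
    ss_treatments = n_subjects * sum(
        (sum(col) / n_subjects - grand_mean) ** 2 for col in zip(*rows))
    ss_error = ss_total - ss_subjects - ss_treatments
    df_treatments = n_treatments - 1
    df_error = (n_subjects - 1) * (n_treatments - 1)
    ms_treatments = ss_treatments / df_treatments
    # If subjects are ignored, their variation stays in the error term
    df_within = len(values) - n_treatments
    f_ignoring_subjects = ms_treatments / ((ss_total - ss_treatments) / df_within)
    return {
        "ss_subjects": ss_subjects, "ss_treatments": ss_treatments,
        "ss_error": ss_error, "ss_total": ss_total,
        "df_treatments": df_treatments, "df_error": df_error,
        "F": ms_treatments / (ss_error / df_error),
        "F_ignoring_subjects": f_ignoring_subjects,
    }

rm = repeated_measures_anova(cholesterol)
print(f"SS subjects = {rm['ss_subjects']:.0f}, SS drugs = {rm['ss_treatments']:.0f}, "
      f"SS error = {rm['ss_error']:.1f}")
print(f"repeated measures F = {rm['F']:.1f} (df = {rm['df_treatments']},{rm['df_error']})")
print(f"F if subjects are ignored = {rm['F_ignoring_subjects']:.2f}")
```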
[Figure: box plot of blood cholesterol (mg/dL), axis 100 to 300, for drug 1, drug 2 and drug 3]
[Figure: line plot of cholesterol (mg/dL), axis 100 to 300, for drugs 1, 2 and 3]
Repeated measures more powerful than one way ANOVA
Single factor analysis of the cholesterol data, assuming 21 subjects were recruited to the study:
                        SS      df   MS      F
Between group (drug)    1454     2   727.0   0.67
Within group            19430   18   1079
Total                   20880   20
Carry over effect is a potential confounding factor that must be avoided
The value at one time point is likely to influence successive time points (i.e. if a test at one time point gives a significant result, it is likely that tests performed closer in time will also give significant results)
Cautionary note (see Mathews et al. 1990)
Lots more Good
Experimental design......
SMART IDEA: How to increase generality
Question: Does drug A reduce pain behaviour in rats relative to control drug?
Suppose you use 20 male rats (n=10, 10)
with df = N - 2 = 18
threshold/critical t0.05 = 2.10
Your observed / measured value of t needs to be higher (in absolute value) than the threshold t to provide evidence your results didn’t happen by chance (i.e. were sig).
If I want info about male and female rats do I need to test 40 animals in total for same power?
No. Factorial design enhances power
                         Factor A (Sex)
                         M    F
Factor B      control    5    5
(treatment)   drug A     5    5
df = (N-1) - (T-1)
df = 19 - 3
df = 16
threshold t0.05 = 2.12
Even better would be Repeated measures design
5 males, 5 females:
df = 9 - 3 = 6
threshold t0.05 = 2.45
Why has no-one ever told me about randomised block design?
Do you do big in vivo experiments with many animals (e.g. 6 each day for 5 days)?
Do you do cell culture experiments on different days and then combine the results?
If so, then SESSION is a factor and contributes variability to your experiment.
You shouldn’t just ignore this!
In fact, you can even IMPROVE power to detect an effect
..... you need ........
Randomised block design
Randomised block design?
Consider an experiment that will take three days.
There are three treatments (A, B, C) with three mice per group.
First, randomize the treatments per block. e.g.
Block (day)   Treatment for mouse
              1   2   3
1             A   B   C
2             B   C   A
3             C   A   B
You then do two way ANOVA without interaction (block is a random, not fixed, factor).
This is easy and I can show you how. IT IMPROVES POWER TO DETECT AN EFFECT
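Drawing up the layout takes a few lines. A minimal sketch: within each block (day), every treatment appears exactly once and the order is shuffled independently. The function name and seed are arbitrary choices for illustration.

```python
import random

def randomised_block_layout(treatments, n_blocks, seed=None):
    """Return one randomly ordered copy of the treatments per block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)
        rng.shuffle(order)        # independent shuffle within each block
        layout.append(order)
    return layout

layout = randomised_block_layout(["A", "B", "C"], n_blocks=3, seed=1)
for day, order in enumerate(layout, start=1):
    assignments = ", ".join(f"mouse {i} = {t}" for i, t in enumerate(order, start=1))
    print(f"Block (day) {day}: {assignments}")
```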
Suppose you’ve run a big experiment.
Three treatments (A, B, C) were delivered to groups of 20 stroke rats on three different surgical days. You thus have two factors (treatment and session/block). A month later, you measure reaching ability.
Treatment   Block 1   Block 2   Block 3
A           10        12        7
B           7         8         5
C           5         6         4
By eye we can see that reaching ability is best in A (independent of block).
But reaching is also best for block 2 across groups (smaller lesions that day?)
take home message: include “session” in your ANOVA
The two-way ANOVA table for these data (calculated by computer) is as follows:
Source             DF   SS       MS       F       p
Session             2   16.889    8.444   13.82   0.016
Treatment           2   33.556   16.778   27.45   0.005
Error / residual    4    2.444    0.611
Total               8   52.889
The two-way ANOVA without interaction partitions the total variation into parts associated with treatments, session and error..... tells you about how big the effect of session (block) was...... and there is therefore less variation in the “error” term
From these results it is clear that there were large block differences (p=0.016), implying that including “block” as a variable was worthwhile: it will have improved the power of the experiment.
There were statistically significant treatment effects i.e. the null hypothesis that there are no differences among the three treatment means is rejected at p=0.005.
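If you want to check the table by hand, the partition is simple enough to sketch in a few lines. It takes the 3 x 3 reaching scores above (one value per treatment per block) and splits the total variation into treatment, session and error; the P values are left to a stats package. Names are my own.

```python
# Reaching scores; rows = treatments, columns = blocks (sessions) 1-3
reaching = {
    "A": [10, 12, 7],
    "B": [7, 8, 5],
    "C": [5, 6, 4],
}

def two_way_anova_no_interaction(table):
    """Two-way ANOVA without interaction, one observation per cell."""
    rows = list(table.values())
    n_treat, n_block = len(rows), len(rows[0])
    values = [x for row in rows for x in row]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_treat = n_block * sum((sum(row) / n_block - grand_mean) ** 2 for row in rows)
    ss_block = n_treat * sum((sum(col) / n_treat - grand_mean) ** 2 for col in zip(*rows))
    ss_error = ss_total - ss_treat - ss_block
    df_treat, df_block = n_treat - 1, n_block - 1
    df_error = df_treat * df_block
    ms_error = ss_error / df_error
    return {
        "ss_treatment": ss_treat, "ss_block": ss_block,
        "ss_error": ss_error, "ss_total": ss_total,
        "df_error": df_error,
        "F_treatment": (ss_treat / df_treat) / ms_error,
        "F_block": (ss_block / df_block) / ms_error,
    }

result = two_way_anova_no_interaction(reaching)
print(f"SS session = {result['ss_block']:.3f}, SS treatment = {result['ss_treatment']:.3f}, "
      f"SS error = {result['ss_error']:.3f}")
print(f"F session = {result['F_block']:.2f}, F treatment = {result['F_treatment']:.2f}")
```

Leaving session out would push its sum of squares into the error term and shrink the F for treatment.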
take home message: include “session” in your ANOVA
Andy’s paper (Lew 2007 BJP 152:299-303) proves this
If you disagree, you have to find me a reference.........
Next bit is controversial, so
http://www.isogenic.info/html/6__experimental_unit.html
What’s my “n” ?
n, the experimental unit, is the entity in the experiment “which can be assigned at random and independently to a treatment” (Festing et al., 1998 ATLA 26, 283-301)
“does my virus boost growth of neurites in vitro?”
two mice provide cells for six dishes
3 are randomly selected for Control virus
3 are randomly selected for Experimental virus
then measure lengths of neurites of 100 neurons per well
WHAT’S MY n ? (ask first, what is the entity)
What’s my “n”?
The experimental unit is “dish”.
There are six of them.
You take means or medians per well and then do stats with these.
Counting more neurons doesn’t increase “n”, however many hours you already spent.
Planning experiments with more dishes WILL increase “n”.
If you disagree, you have to find me a reference.........
“If cells from an animal are cultured in a number of dishes that can be assigned to different in vitro treatments, then the dish of cells is the experimental unit” (Festing & Altman, 2002).
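In practice this means collapsing the neuron-level measurements to one number per dish before any test is run. A sketch with HYPOTHETICAL neurite lengths (three neurons per dish are shown instead of 100, purely to keep it short):

```python
import statistics

# HYPOTHETICAL neurite lengths (µm), keyed by (virus, dish); not real data
neurite_lengths = {
    ("control virus", "dish 1"): [52, 61, 48],
    ("control virus", "dish 2"): [55, 49, 58],
    ("control virus", "dish 3"): [60, 57, 51],
    ("experimental virus", "dish 4"): [70, 66, 75],
    ("experimental virus", "dish 5"): [64, 72, 69],
    ("experimental virus", "dish 6"): [77, 71, 68],
}

# One summary value per dish: the dish, not the neuron, is the experimental unit
dish_means = {}
for (virus, dish), lengths in neurite_lengths.items():
    dish_means.setdefault(virus, []).append(statistics.mean(lengths))

n_per_group = {virus: len(means) for virus, means in dish_means.items()}
print(n_per_group)   # the "n" that goes into the t-test or Mann-Whitney
```

Measuring more neurons per dish makes each dish mean more precise, but the lists passed to the test still have three entries each.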
Replicate experiments
Suppose you run the cell culture experiment four times over one month.
Is this n=4? How to analyse?
It’s a randomized block factorial design again.
Thus two way ANOVA, with block and drug as factors.
You’ll learn if there was an effect of session (and it will be parcelled out).
The study will be more powerful.
Replicate samples
A “replicate” refers to repetition of measurement on one sample. Tells you about variation due to measurement.
If you run three PCR reactions using the same primers and the same cDNA, then you have three replicates.
If you run three PCR reactions using the same primers and cDNA from three different animals, then you have no replicates, but n=3 and you can do stats.
How generalisable are my results ?
Thus it’s possible to run an experiment with multiple “n” using tissue derived from only a single mouse.
“n” tells you about the number of units in an experiment, and NOT necessarily about the number of animals used.
“p” tells you how likely it is that you got these results (using these samples) by chance (e.g. 0.05)
So how far are your results predictive of the rodent population? Would you get these results if you’d used a different mouse?
If you wish to know this, you have to design your experiments properly!!!! e.g. run the cell culture experiment four times each time with different mice
Crowd participation: What’s my “n” ?
4 rats per cage, one drinking bottle per cage
10 cages
• 5 cages given control in drinking water
• 5 cages given drug in drinking water
What’s my ‘n’?
n is 5 and 5.
You take the mean (or median) per cage. Use these numbers in a t-test (or Mann Whitney).
Plea bargain
I promise never to give a talk on this again if you all
1. report your experimental design clearly
   i. did you randomise?
   ii. did you measure / analyse blind?
   iii. did you use randomised block design?
2. show raw data as well as means plus errors
   i. show us your outliers!
3. describe what statistical methods you used
   i. parametric? non parametric?
   ii. what posthocs? how about multiple comparisons?
Thanks for listening!
Good science: Experimental design
Can I pair my samples?
Paired (within) designs are more powerful.
e.g.
• twin study
• two inbred (isogenic) animals
• same well of cells first with drug A, then with drug B
• same rat, L5 ipsilateral to nerve injury and L5 contralateral
Unpaired (between) designs are less powerful
e.g.
• brother and sister
• two different wells, one with drug A, one with drug B
Ask yourself:
• can I use twins?
• can I use inbreds?
• can I pre- and post- measure the same well or animal?
• can I compare ipsi and contra from the same animal?
Ravers
Suppose you wanted to know if crystal meth kills more ravers than ecstasy.
Ethical issues aside, if you could directly test humans in your study, who would you choose?
One option would be to recruit 16 students that are age- and sex- matched.
PTO.....
Unrelated students
Control Treated
Student 1 Student 9
Student 2 Student 10
Student 3 Student 11
Student 4 Student 12
Student 5 Student 13
Student 6 Student 14
Student 7 Student 15
Student 8 Student 16
But genetic differences are sources of variability
A powerful option would be to recruit genetically identical twins.....
Control Treated
Twin 1A Twin 1B
Twin 2A Twin 2B
Twin 3A Twin 3B
Twin 4A Twin 4B
Twin 5A Twin 5B
Twin 6A Twin 6B
Twin 7A Twin 7B
Twin 8A Twin 8B
Suppose you use outbred rats
Control Treated
Sprague 1 Sprague 9
Sprague 2 Sprague 10
Sprague 3 Sprague 11
Sprague 4 Sprague 12
Sprague 5 Sprague 13
Sprague 6 Sprague 14
Sprague 7 Sprague 15
Sprague 8 Sprague 16
Inbred rats are isogenic – like twins – so more powerful
Control Treated
F344 1 F344 9
F344 2 F344 10
F344 3 F344 11
F344 4 F344 12
F344 5 F344 13
F344 6 F344 14
F344 7 F344 15
F344 8 F344 16
So why use outbred rats?
“To model the variability in the human population”
this could be done as follows
But what I didn’t know..... you can use multiple isogenics
Control Treated
F344 1 F344 2
F344 3 F344 4
LEW 1 LEW 2
LEW 3 LEW 4
DA 1 DA 2
DA 3 DA 4
WKY 1 WKY 2
WKY 3 WKY 4
Pair them beforehand and then use ANOVA
You can also pair sexes as well as strains in the same experiment.
If this is compatible with your outcome measures, then you will learn about differences between genotype and differences between sex for free.
IT DOES NOT REQUIRE EXTRA n
Michael Festing
You’d do ANOVA using two factors (strain and treatment)
Worked example
Suppose 16 outbred Sprague Dawley rats are treated with a drug or a control and something is measured.
Control Treated
12 16
15 17
18 15
9 15
7 9
16 19
15 18
10 14
Mean 12.75 15.38
Worked example
Two-sample t-test
           N   Mean    StDev   SE Mean
Control    8   12.75   3.85    1.4
Treated    8   15.38   3.07    1.1
Estimate for difference is -2.63
95% confidence interval for difference is (-6.36, 1.11)
T-Value = -1.51, P-Value = 0.153, DF = 14
THUS NOT SIGNIFICANT
Both use Pooled StDev = 3.48
Worked example
Suppose exactly same data is obtained but using PAIRS of different inbred rats
Strain Control Treated Difference
DA 12 16 -4
F344 15 17 -2
LEW 18 15 3
WKY 9 15 -6
BDIX 7 9 -2
BUF 16 19 -3
ACI 15 18 -3
MNR 10 14 -4
Mean 12.75 15.38 -2.625
Worked example
Paired t-test (One-Sample, using differences)
             N   Mean     StDev   SE Mean
Difference   8   -2.625   2.615   0.925
95.0% confidence interval of difference (-4.813, -0.437)
T value -2.84
p=0.025 THIS IS SIGNIFICANT
Conclusion: Uncontrolled genetic variation reduces the power of the first experiment, leading to more negative results or the need to use larger sample sizes
Mead’s Resource Equation; what n should I choose?
"The total information in an experiment involving N experimental units may be represented by the total variation based on (N-1) degrees of freedom (df). In the general experimental situation this total variation is divided into three components, each serving a different function."
These three sources of variation consist of:
• Treatment
• Block
• Error
Mead (1988) says experiments should be designed to give a good estimate of error, but should not be so big that they waste resources, i.e. the error degrees of freedom should be somewhere between 10 and 20. Above this there are diminishing returns
Increasing n number is a good approach: within limits!
[Figure: critical value of t0.05 (axis 0 to 15) against degrees of freedom (n - 1; 10, 20, 60), marking the optimum size for an experiment*]
* Mead (1988). The Design of Experiments. Cambridge, NY. Cambridge University Press
Resource equation; what n should I choose?
DF =n-1. So if there are 20 rats in an experiment the total df will be 19. If there are 6 treatments, then the treatments df will be 5.
The method is extremely easy to use as it boils down to the very simple equation: E = N - B - T, where E is the error df and should be between 10 and 20, N is the total df, B is the blocks df, and T is the treatments df.
In a non-blocked design the equation reduces to E = N - T, which should be 10-20. Put simply: the total number of animals minus the number of treatments should be between ten and twenty.
Example: suppose an experiment is planned with four treatments, with eight animals per group (32 rats total). In this case N=31, B=0, T=3, so E=28.
Conclusion: this experiment is a bit too large, and six animals per group might be more appropriate.
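The resource equation is simple enough to turn into a two-line helper for planning. A sketch (function names are mine; the 10 to 20 window is the one quoted from Mead above):

```python
def error_df(n_animals, n_treatments, n_blocks=1):
    """Mead's resource equation E = N - B - T, with each term in degrees of freedom."""
    total_df = n_animals - 1          # N
    block_df = n_blocks - 1           # B (0 for a non-blocked design)
    treatment_df = n_treatments - 1   # T
    return total_df - block_df - treatment_df

def verdict(e):
    """Compare E with the recommended window of 10 to 20."""
    if e < 10:
        return "too small"
    if e > 20:
        return "larger than needed"
    return "about right"

for per_group in (8, 6, 4):
    e = error_df(n_animals=4 * per_group, n_treatments=4)
    print(f"4 treatments x {per_group} animals: E = {e} ({verdict(e)})")
```

With blocking, pass `n_blocks` as well; for example three treatments on three days with nine animals gives `error_df(9, 3, 3)`, which matches the error df of 4 in the session ANOVA table earlier.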