Study design and sampling (and more on ANOVA/regression)
Tron Anders Moger
7.10.2007
Recall: Could put data in a table like this:
• Each type of test was given three times for each type of subject

Group: Test type (columns). Block: Subject type (rows). Cell: three scores, e.g. 65 68 62.

Subject type   Profile fit   Mindbender   Psych Out
Poor           65 68 62      69 71 67     75 75 78
Fair           74 79 76      72 69 69     70 69 65
Good           64 72 65      68 73 75     78 82 80
Excellent      83 82 84      78 78 75     76 77 75
Testing different types of wheat in a field
[Figure: a field divided into three strips, one per group: Wheat 1, Wheat 2, Wheat 3]
Interested in finding out if different types of wheat yield different crops
Outcome: E.g. wheat in pounds per acre
Your field resembles an ANOVA data matrix!
One-way ANOVA: Testing if mean crop per acre is different for different types of wheat!
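The one-way ANOVA test for the wheat field can be sketched in a few lines of Python. The yields below are made up for illustration (the slides give no numbers) and the function name `one_way_anova` is my own; the sketch only shows how MSG, MSW and the F statistic are formed.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (MSG, MSW, F) for a list of groups, each a list of observations."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (wheat types)
    n = len(all_values)    # total number of observations

    # Between-group sum of squares: group means around the total mean
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)

    msg = ssg / (k - 1)
    msw = ssw / (n - k)
    return msg, msw, msg / msw

# Hypothetical yields in pounds per acre, three plots per wheat type
wheat_yields = {
    "Wheat 1": [2400, 2420, 2440],
    "Wheat 2": [2440, 2460, 2480],
    "Wheat 3": [2480, 2500, 2520],
}

msg, msw, f_stat = one_way_anova(list(wheat_yields.values()))
print(f"MSG = {msg:.1f}, MSW = {msw:.1f}, F = {f_stat:.2f}")
```

The F value would then be compared with an F distribution with k-1 and n-k degrees of freedom.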
More complex designs:
• Want to test different fertilizers also
[Figure: the field divided into a 3 x 3 grid. Group (columns): Wheat 1, Wheat 2, Wheat 3. Block (rows): Fertilizer 1, Fertilizer 2, Fertilizer 3]
Do different wheat types give different crops? Do different fertilizers give different crops? Two-way ANOVA!
Does e.g. fertilizer 1 work better for wheat 1 than for wheat 2 and 3? Is there interaction between wheat and fertilizer? Two-way ANOVA with interaction!
Groups and blocks
• In the example: Arbitrary if we put wheat type in group or block
• Equally interested in wheat and fertilizer effects in the example
• Another example: Want to test 3 different treatments for e.g. asthma
• Only interested in treatment effect (Group)
• Design a study for one-way ANOVA, everyone’s happy
Are we happy?
• Is an asthma patient an asthma patient no matter what?
• Different types of asthma patients could give different treatment effects (this is serious for pharma companies)
• If we do one-way ANOVA, we won’t find out!
• Specifically, if we sample patients at random, we might end up with 5 patients of the type that responds badly, and 50 patients of the other types
• Results for the 5 patients will drown in the results for the others, so we won’t even suspect that something is wrong
• Blocking variable: Ensure that you sample e.g. 30 patients of each type
Asthma: Two-way ANOVA
• Still only really interested in the treatment effect
• But, would like to control for the confounding effect of type of patient
• Model for one-way ANOVA: $X_{ij} = \mu + G_i + \varepsilon_{ij}$
• $\mu$ is total mean, $G_i$ is group effect, $\varepsilon_{ij}$ is $N(0, \sigma^2)$
• $\sigma^2$ includes variation due to everything else, including patient type
• Only effect we describe in the model is the treatment effect
Asthma: Two-way ANOVA cont’d
• Two-way ANOVA model: $X_{ijl} = \mu + G_i + B_j + I_{ij} + \varepsilon_{ijl}$
• Describe both treatment effect ($G_i$), patient type effect ($B_j$) and interaction ($I_{ij}$)
• Remove variation due to patient type from $\sigma^2$ (and from its primary estimator, MSE)
• Means that $\sigma^2_{\text{two-way}} < \sigma^2_{\text{one-way}}$
• Recall: Test for treatment effect ($G_i = 0$) compares MSG to MSE (MSG/MSE ~ F-dist), reject if sufficiently large
• Similar tests for the other effects, but based on MSB and MSI
Asthma: Two-way ANOVA cont’d
• If there is a treatment effect, MSG will be a biased estimator for $\sigma^2$
• If there is a block effect, denominator MSE will be smaller here than MSW for one-way ANOVA
• Value of test statistic will be larger!
• Easier to get significant effects! (More power)
• Also get more correct estimates for the group means (because of the sampling)
• Similar to regression: The more significant variables you include in your model, the greater $R^2$ becomes, and you get more correct estimates for the regression coefficients
• $R^2$ increases because $\sigma^2$ decreases the more variables you include
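The effect of blocking on the error term can be checked on the test-score table from the recall slide. The sketch below assumes that table reads with test types Profile fit, Mindbender and Psych Out as groups and subject types Poor to Excellent as blocks; the variable names are my own. It compares the one-way error mean square (MSW) with the two-way error mean square (MSE).

```python
from statistics import mean

# Scores per test type (group) and subject type (block), three replicates per cell.
# Assumed reading of the table on the recall slide.
scores = {
    "Profile fit": {"Poor": [65, 68, 62], "Fair": [74, 79, 76],
                    "Good": [64, 72, 65], "Excellent": [83, 82, 84]},
    "Mindbender":  {"Poor": [69, 71, 67], "Fair": [72, 69, 69],
                    "Good": [68, 73, 75], "Excellent": [78, 78, 75]},
    "Psych Out":   {"Poor": [75, 75, 78], "Fair": [70, 69, 65],
                    "Good": [78, 82, 80], "Excellent": [76, 77, 75]},
}

def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

# Pool the cells of each test type for the one-way analysis
groups = {g: [x for cell in blocks.values() for x in cell]
          for g, blocks in scores.items()}
all_values = [x for values in groups.values() for x in values]
n, k = len(all_values), len(groups)
n_cells = sum(len(blocks) for blocks in scores.values())
grand_mean = mean(all_values)

# Mean square for groups: the same numerator in both analyses (balanced design)
msg = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values()) / (k - 1)

# One-way ANOVA: error is all variation within a test type, block effect included
msw = sum(sum_sq_dev(v) for v in groups.values()) / (n - k)

# Two-way ANOVA with interaction: error is the variation within each cell only
mse = sum(sum_sq_dev(cell) for blocks in scores.values()
          for cell in blocks.values()) / (n - n_cells)

print(f"one-way: MSW = {msw:.2f}, F = {msg / msw:.2f}")
print(f"two-way: MSE = {mse:.2f}, F = {msg / mse:.2f}")
```

On these data MSE comes out far smaller than MSW (roughly 6 against 34), so the F statistic for test type is correspondingly larger in the two-way analysis.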
ANOVA and linear regression
• Regression: Split the distance from each data point to the total mean into:
– 1. Distance from mean to regression line
– 2. Distance from regression line to data point
• Got sums of squares SSR, SSE and SST
• Used for estimation and measuring how close data points were to regression line ($R^2$)
• However, also used for an F-test on whether all $B_i = 0$ (from slide of detailed explanations of SPSS output)
• This is ANOVA in linear regression!
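As an illustration of the decomposition, here is a minimal sketch for simple linear regression with one independent variable. The data are hypothetical and the function name `regression_anova` is my own; the point is that SST = SSR + SSE, and that $R^2$ and the F statistic are both built from these sums of squares.

```python
from statistics import mean

def regression_anova(x, y):
    """Fit y = b0 + b1*x by least squares and return the ANOVA quantities."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # mean to regression line
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # regression line to data point
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # mean to data point
    k = 1                                                    # one independent variable
    f_stat = (ssr / k) / (sse / (n - k - 1))
    return {"SSR": ssr, "SSE": sse, "SST": sst, "R2": ssr / sst, "F": f_stat}

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.9, 6.4, 7.8, 10.5, 11.6]
result = regression_anova(x, y)
print({name: round(value, 3) for name, value in result.items()})
```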
Design differences: ANOVA and regression
• Wheat/fertilizer example, additional confounders: Earth quality or amount of sun could vary across the field
• ANOVA (as covered here): Can only have two independent variables in the model. Control for other things e.g. by repeating the study until all types of wheat have been grown in each part of the field
• Regression: Collect information on earth quality and sun amounts, and include in the model
Conclusion, regression vs ANOVA
• Regression allows for explicit modelling of more than two independent variables
• Also, in regression, independent variables can be continuous, whereas in ANOVA they have to be categorical
• Hence, regression is more flexible than ANOVA
Confounding in regression
• Two hospitals: A and B
• Measure cost per patient for treating some disease
• For some reason, males cost $1000000 more on average than females
• At hospital A, 80% of patients are males
• At hospital B, 20% of patients are males
• Do a regression of cost vs hospital, what will the results indicate?
• Do a regression of cost vs hospital and gender, what will the results indicate?
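One way to reason about these two questions is to work out the expected costs directly. The sketch below assumes, purely for illustration, that the hospitals themselves do not differ at all and that a female patient costs a hypothetical $5000000 in both; only the gender mix and the $1000000 male surcharge come from the slide.

```python
MALE_EXTRA = 1_000_000      # from the slide: males cost $1000000 more on average
FEMALE_COST = 5_000_000     # hypothetical baseline, assumed equal in both hospitals
share_male = {"A": 0.80, "B": 0.20}

def expected_cost(hospital, gender=None):
    """Expected cost per patient; gender is 'male', 'female' or None (all patients)."""
    if gender == "male":
        return FEMALE_COST + MALE_EXTRA
    if gender == "female":
        return FEMALE_COST
    # No gender given: average over the hospital's own gender mix
    return FEMALE_COST + MALE_EXTRA * share_male[hospital]

# Cost vs hospital only: the gender mix shows up as a "hospital effect"
crude_difference = expected_cost("A") - expected_cost("B")

# Cost vs hospital and gender: compare like with like
adjusted_difference = expected_cost("A", "male") - expected_cost("B", "male")

print(f"A minus B, ignoring gender:  {crude_difference:,.0f}")
print(f"A minus B, males vs males:   {adjusted_difference:,.0f}")
```

Under these assumptions hospital A looks $600000 more expensive per patient when gender is left out, and not more expensive at all once gender is in the model.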
Confounding, regression vs ANOVA
• In regression, adjust for confounding by including the confounder in the model
• In ANOVA, adjust for the confounder by using it as a blocking variable, or by making sure in the sampling that the confounder is equally distributed in e.g. both hospitals
Interaction in regression (and ANOVA)
• Recall the model with main effects only:
• Have fitted the model: Birth weight = 2500.174 + 4.238*mother’s weight - 270.013*smoking status
• If your mother weighs 100 lbs, what is the estimated effect of smoking?
• If your mother weighs 200 lbs, what is the estimated effect of smoking?
Coefficients (Model 1; dependent variable: birthweight)
B and Std. Error: unstandardized coefficients. Beta: standardized coefficients. Bounds: 95% confidence interval for B.

                   B          Std. Error   Beta     t        Sig.    Lower Bound   Upper Bound
(Constant)         2500.174   230.833               10.831   0.000   2044.787      2955.561
weight in pounds   4.238      1.690        0.178    2.508    0.013   0.905         7.572
smoking status     -270.013   105.590      -0.181   -2.557   0.011   -478.321      -61.705
Model with interaction:
• Get bwt = 2347 + 5.41*mwt + 47.87*smoking - 2.46*mwt*smoking
• Mwt = 100 lbs vs mwt = 200 lbs for non-smokers: bwt = 2888 g and bwt = 3428 g, difference = 540 g
• Mwt = 100 lbs vs mwt = 200 lbs for smokers: bwt = 2690 g and bwt = 2985 g, difference = 295 g
Coefficients (Model 1; dependent variable: birthweight)
B and Std. Error: unstandardized coefficients. Beta: standardized coefficients. Bounds: 95% confidence interval for B.

                   B          Std. Error   Beta     t        Sig.    Lower Bound   Upper Bound
(Constant)         2347.507   312.717               7.507    0.000   1730.557      2964.457
weight in pounds   5.405      2.335        0.227    2.315    0.022   0.798         10.012
smoking status     47.867     451.163      0.032    0.106    0.916   -842.220      937.953
smkwht             -2.456     3.388        -0.223   -0.725   0.470   -9.140        4.229
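The two fitted models can be compared numerically. The sketch below plugs the unrounded coefficients from the two SPSS tables into small Python functions (the function names are my own, and smoking status is assumed to be coded 0 for non-smokers and 1 for smokers); it reproduces the slide's figures up to rounding.

```python
def bwt_main_effects(mwt, smoking):
    """Main-effects model: birth weight in grams from mother's weight in lbs."""
    return 2500.174 + 4.238 * mwt - 270.013 * smoking

def bwt_interaction(mwt, smoking):
    """Model with the mwt*smoking interaction term (smkwht in the SPSS output)."""
    return 2347.507 + 5.405 * mwt + 47.867 * smoking - 2.456 * mwt * smoking

for mwt in (100, 200):
    effect_main = bwt_main_effects(mwt, 1) - bwt_main_effects(mwt, 0)
    effect_int = bwt_interaction(mwt, 1) - bwt_interaction(mwt, 0)
    print(f"mwt = {mwt} lbs: smoking effect {effect_main:.1f} g (main effects), "
          f"{effect_int:.1f} g (interaction)")

# Effect of 100 extra lbs of mother's weight, by smoking status
slope_non_smokers = bwt_interaction(200, 0) - bwt_interaction(100, 0)
slope_smokers = bwt_interaction(200, 1) - bwt_interaction(100, 1)
print(f"100 lbs more: {slope_non_smokers:.1f} g (non-smokers), {slope_smokers:.1f} g (smokers)")
```

In the main-effects model the estimated smoking effect is the same at every mother's weight; in the interaction model it depends on the weight.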
What does this mean?
• Mother’s weight has a greater impact on birth weight for non-smokers than for smokers (or the other way round)
• We see that the slope is steeper for non-smokers
• However, interaction is non-significant
[Figure: scatter plot of birthweight (0 to 5000) against weight in pounds (50 to 250), with a separate fitted line for each smoking status (0 and 1); R Sq Linear = 0.023 and R Sq Linear = 0.042]
Interaction in hospital example:
• Two hospitals: A and B (coded as A=0, B=1)
• Gender: females=0, males=1
• Measure cost per patient for treating some disease
• Say, we get the model:
• Cost in $ = 5000000 + 1000000*Gender - 300000*Hospital
• Estimated cost of a male compared to a female in hospital A: $1000000
• Estimated cost of a male compared to a female in hospital B: $1000000
• What if males really are $500000 more expensive to treat than females in hospital A, but $1500000 more expensive to treat in hospital B?
• Have to include an interaction term to model it
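Written out, the model with an interaction term could look as follows. The symbols $\beta_0, \dots, \beta_3$ are my own notation; only the two gender differences are implied by the what-if figures above, so $\beta_0$ and $\beta_2$ are left unspecified.

```latex
\[
\text{Cost} = \beta_0 + \beta_1\,\text{Gender} + \beta_2\,\text{Hospital}
            + \beta_3\,(\text{Gender} \times \text{Hospital})
\]
\[
\text{Male vs female in hospital A (Hospital} = 0\text{):}\quad \beta_1 = 500000
\]
\[
\text{Male vs female in hospital B (Hospital} = 1\text{):}\quad \beta_1 + \beta_3 = 1500000
\;\Rightarrow\; \beta_3 = 1000000
\]
```

The gender effect is no longer a single number: it is $\beta_1$ in hospital A and $\beta_1 + \beta_3$ in hospital B.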
Designing a study:
• Ideally, should know in advance:
– The basic hypotheses you want to test
– What information you need in order to test the hypotheses
– Which population do you want the results to apply for?
– How to collect that information; sampling, design
– If regression: What important confounders do you need information on
Sampling in practice
• Newbold mentions:
1. Information required? Has the study been done before? Is it possible to get the information?
2. Relevant population?
3. Sample selection? Random? Systematic? Stratified?
4. Obtaining information? Interviews? Questionnaires?
5. Inferences from sample? Which methods?
6. Conclusions? How to present your results?
• Nonsampling errors: Missing data, dishonest or inaccurate answers, low reliability or validity
Reliability and validity
• Validity of a research instrument: The degree to which it measures what you are interested in measuring
• Reliability of a research instrument: The extent to which repeated measurements under constant conditions will give the same result
• A research instrument may be reliable, but not valid
Types of sampling
• Simple random sampling: Select subjects at random
• Every subject in the population has the same probability of being sampled
– Ex: One-way analysis of asthma patients
• If large enough sample, gives you a representative sample compared to the population
• Problem: If small sample, will give too few data on interesting sub-groups
• Systematic sampling: As random sampling, but you include e.g. every 5th subject in your sample
Types of sampling cont’d:
• Stratified sampling: Want to ensure that interesting sub-groups of the population are sampled in sufficient numbers (over-sampled)
• Divide the population into K strata, randomly sample $n_i$ from each stratum
• Ex: Asthma patients, two-way ANOVA
• Problems: How many patients in each stratum?
• Cluster sampling: Similar to stratified sampling, but considers geographical units
• Divide the population into M clusters, randomly sample m of them
• Include all subjects in the sampled clusters
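The sampling schemes described so far can be sketched with Python's standard `random` module. The population, the strata and the clinic clusters below are hypothetical, and the function names are my own; the sketch only illustrates how the selection step differs between the schemes.

```python
import random

def simple_random_sample(population, n, rng):
    """Every subject has the same probability of being sampled."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Random starting point, then every step-th subject."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, n_per_stratum, rng):
    """Sample a fixed number of subjects from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, m, rng):
    """Sample m clusters at random and include all subjects in them."""
    chosen = rng.sample(sorted(clusters), m)
    return [subject for name in chosen for subject in clusters[name]]

rng = random.Random(2007)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 patients, 100 of them of a rare type
strata = {"rare type": [f"R{i}" for i in range(100)],
          "common type": [f"C{i}" for i in range(900)]}
population = strata["rare type"] + strata["common type"]
# Hypothetical clusters: 10 clinics with 100 patients each
clusters = {f"clinic {c}": population[c * 100:(c + 1) * 100] for c in range(10)}

srs = simple_random_sample(population, 60, rng)
systematic = systematic_sample(population, 5, rng)
stratified = stratified_sample(strata, 30, rng)
clustered = cluster_sample(clusters, 2, rng)

rare_in_srs = sum(1 for subject in srs if subject.startswith("R"))
print(f"rare-type patients in simple random sample of 60: {rare_in_srs}")
print({name: len(sample) for name, sample in stratified.items()})
```

The stratified sample always contains 30 patients of each type, whereas the number of rare-type patients in the simple random sample varies from draw to draw.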
Types of sampling cont’d:
• Two-phase sampling: Carry out an initial pilot study, where only a small sample is collected
• Then proceed with collecting the main sample
• Advantages: Get initial estimates on effects
• Initial estimates on variance in data -> sample size, how much data do you need to reject H0?
• Disadvantages: Costly, time-consuming
• NOTE: Most methods I’ve mentioned require adjusted formulas for estimation, described in the book
Some study types
• Observational studies
– Cross-sectional studies
– Cohort studies
– Longitudinal studies
– Panel data
– Case / control studies
• Experimental studies
– Randomized, controlled experiments (blind, double-blind)
– Interventions
Cross-sectional studies
• Examines a sample of persons, at a single time point
• Time effects rely on memory of respondents
• Good for estimating prevalence
• Difficult for rare diseases
• Response rate bias
Cohort studies and longitudinal studies
• A sample (cohort) is followed over some time period.
• If queried at specific timepoints: Longitudinal study
• Gives better information about causal effects, as report of events is not based on memory
• Requires that a substantial group develops disease, and that substantial groups differ with respect to risk factors
• Problem: Long time perspective
Panel data
• Data collected for the same sample, at repeated time points
• Corresponds to longitudinal epidemiological studies
• A combination of cross-sectional data and time series data
• Increasingly popular study type
Case – control studies
• Starts with a set of sick individuals (cases), and adds a set of controls, for comparison
• Retrospective study: Start with finding cases and controls, then dig into their past and find out what made them cases and controls
• Cases and controls should be from the same populations
• Matching controls
• Cheap, good method for rare diseases
• Problem: Bias from selection, recall bias
Epidemiology
• Epidemiology is the study of diseases in a population
– prevalence
– incidence, mortality
– survival
• Goals
– describe occurrence and distribution
– search for causes
– determine effects in experiments
Measures of risk in epidemiology
• Relative risk (used for prospective studies)
• Odds ratio (used for retrospective studies)
               Abortions   No abortions   Total
Op. nurses     10          26             36
Other nurses   3           31             34
Total          13          57             70
Op-nurses cont’d:
• Relative risk: Proportion of abortions among op. nurses divided by proportion of abortions among others
$RR = \frac{10/36}{3/34} = 3.1$
• Odds ratio:
Odds for abortion among op. nurses: 10/26
Odds for abortion among other nurses: 3/31
• Gives the odds ratio:
$OR = \frac{10/26}{3/31} = 4.0$
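The same arithmetic as a short Python sketch, using the counts from the op. nurses example (10 abortions among 36 op. nurses, 3 among 34 other nurses). The dictionary and function names are my own.

```python
# Counts from the op. nurses example
abortions = {"op": 10, "other": 3}
no_abortions = {"op": 26, "other": 31}

def risk(group):
    """Proportion of abortions in the group."""
    return abortions[group] / (abortions[group] + no_abortions[group])

def odds(group):
    """Abortions divided by non-abortions in the group."""
    return abortions[group] / no_abortions[group]

relative_risk = risk("op") / risk("other")
odds_ratio = odds("op") / odds("other")
print(f"RR = {relative_risk:.1f}, OR = {odds_ratio:.1f}")  # RR = 3.1, OR = 4.0
```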
Correcting for finite population in estimations
• Our estimates of for example population variances, population proportions, etc. assumed an ”infinite” population
• When the population size N is comparable to the sample size n, a correction factor is necessary.
• Used if n > 0.05N
• Examples:
– Variance of population mean estimate: $\hat{\sigma}^2_{\bar{X}} = \frac{s^2}{n} \cdot \frac{N - n}{N}$
– Variance of population proportion estimate: $\hat{\sigma}^2_{\hat{p}} = \frac{\hat{p}(1 - \hat{p})}{n - 1} \cdot \frac{N - n}{N}$
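A minimal sketch of the two corrected estimators, as I read the formulas on this slide: the usual variance estimate multiplied by the factor (N - n)/N. The function names and the example numbers are hypothetical.

```python
def var_of_mean_estimate(s2, n, N):
    """Estimated variance of the sample mean, with finite population correction."""
    return s2 / n * (N - n) / N

def var_of_proportion_estimate(p_hat, n, N):
    """Estimated variance of the sample proportion, with finite population correction."""
    return p_hat * (1 - p_hat) / (n - 1) * (N - n) / N

def needs_correction(n, N):
    """Rule of thumb from the slide: correct when n > 0.05 N."""
    return n > 0.05 * N

# Hypothetical example: sample of 50 from a population of 200, sample variance 100
N, n, s2 = 200, 50, 100
print(needs_correction(n, N))               # the sample is 25% of the population
print(s2 / n)                               # uncorrected estimate
print(var_of_mean_estimate(s2, n, N))       # corrected estimate, smaller
print(var_of_proportion_estimate(0.4, n, N))
```

When the whole population is sampled (n = N) the correction factor is zero, which matches the intuition that nothing is left to be uncertain about.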
Determining sample size
• An important part of experimental planning
• The answer will generally depend on the parameters you want to estimate in the first place, so only a rough estimate is possible
• However, a rough estimate may sometimes be very important to do
• A pilot study may be very helpful
Sample size for means (large samples)
• We want to estimate mean $\mu$
• We want a confidence interval to extend a distance a from the estimate
• We guess at the population variance $\sigma^2$
• A sample size estimate: $n = \frac{Z^2_{\alpha/2}\,\sigma^2}{a^2} \approx \frac{4\sigma^2}{a^2}$ at 95% confidence level
• Small samples: If we have a population of size N, and want a specified $\sigma^2_{\bar{X}}$, we get $n = \frac{N\sigma^2}{(N - 1)\,\sigma^2_{\bar{X}} + \sigma^2}$
Example: Have dental costs increased since 1995?
• Want to compare dental costs in 1995 (adjusted to 2006-kroner) and 2006
• Could do a paired sample t-test. How many individuals do we need to ask?
• We believe a difference of 1500 kroner is important
• From experience, we think $\sigma$ for the difference is 2500 kroner
• Need $4 \cdot 2500^2 / 1500^2 \approx 11.1$, i.e. at least 12 individuals, to find a significant difference if our assumptions are correct
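The dental-cost calculation as a small Python sketch. The function name is my own; the default z = 2 is the rounded 95% value that gives the factor 4 in the large-sample formula.

```python
import math

def sample_size_for_mean(sigma, a, z=2.0):
    """Smallest n for which z * sigma / sqrt(n) <= a (large-sample formula)."""
    return math.ceil((z * sigma / a) ** 2)

# Dental example: sigma = 2500 kroner, important difference a = 1500 kroner
n_dental = sample_size_for_mean(sigma=2500, a=1500)
print(n_dental)  # 12
```

With z = 1.96 instead of the rounded 2, the same inputs give 11, so the approximation errs slightly on the safe side.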
Sample size for proportions (large samples)
• We want to estimate proportion P
• We want a confidence interval to extend a distance a from the estimate
• Recall: CI for P: $P \pm Z_{\alpha/2}\sqrt{P(1 - P)/n}$
• A sample size estimate: $n = \frac{Z^2_{\alpha/2}\,P(1 - P)}{a^2} \approx \frac{4P(1 - P)}{a^2}$ at 95% confidence level
• Largest possible value of this expression is $1/a^2$ (P=0.5)
Example: Poll
• Want to estimate the proportion voting Labour with 95% confidence interval extending ±3%
• Need to include at most $1/0.03^2 \approx 1112$ people in our study
• Would probably stick with 1112 if we don’t have any reason to believe P is far from 0.5
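The poll calculation as a Python sketch, again with the rounded z = 2 for the 95% level. The function name is my own, and the value P = 0.3 in the second call is purely hypothetical, to show how the required sample shrinks when P is believed to be away from 0.5.

```python
import math

def sample_size_for_proportion(a, p=0.5, z=2.0):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= a (large-sample formula)."""
    return math.ceil(z ** 2 * p * (1 - p) / a ** 2)

n_worst_case = sample_size_for_proportion(a=0.03)       # P = 0.5, the worst case
n_if_p_is_30 = sample_size_for_proportion(a=0.03, p=0.3)
print(n_worst_case, n_if_p_is_30)  # 1112 934
```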
Next time:
• More on time-series analysis from chapter 19
• Presentation of results: How do you do it?• Recap of the different methods we’ve learnt