Study design and sampling (and more on ANOVA/regression) Tron Anders Moger 7.10.2007


Recall: Could put data in a table like this:

• Each type of test was given three times for each type of subject

Group: Test type (columns). Block: Subject type (rows). Cell: the three scores for one combination, e.g. 65 68 62.

Subject type | Profile fit | Mindbender | Psych Out
Poor | 65 68 62 | 69 71 67 | 75 75 78
Fair | 74 79 76 | 72 69 69 | 70 69 65
Good | 64 72 65 | 68 73 75 | 78 82 80
Excellent | 83 82 84 | 78 78 75 | 76 77 75

Testing different types of wheat in a field

[Figure: a field of plants divided into three strips, one per group: Wheat 1, Wheat 2, Wheat 3]

Interested in finding out if different types of wheat yield different crops

Outcome: E.g. wheat in pounds per acre

Your field resembles an ANOVA data matrix!

One-way ANOVA: Testing if mean crop per acre is different for different types of wheat!

More complex designs:

• Want to test different fertilizers also

[Figure: the field divided into a 3 x 3 grid of plots. Group (columns): Wheat 1, Wheat 2, Wheat 3. Block (rows): Fertilizer 1, Fertilizer 2, Fertilizer 3]

Do different wheat types give different crops? Do different fertilizers give different crops? Two-way ANOVA!

Does e.g. fertilizer 1 work better for wheat 1 than for wheat 2 and 3? Is there interaction between wheat and fertilizer? Two-way ANOVA with interaction!

Groups and blocks

• In the example: Arbitrary if we put wheat type in group or block

• Equally interested in wheat and fertilizer effects in the example

• Another example: Want to test 3 different treatments for e.g. asthma

• Only interested in treatment effect (Group)

• Design a study for one-way ANOVA, everyone’s happy

Are we happy?

• Is an asthma patient an asthma patient no matter what?

• Different types of asthma patients could give different treatment effects (this is serious for pharma companies)

• If we do one-way ANOVA, we won’t find out!

• Specifically, if we sample patients at random, might end up with 5 patients of the type that responds badly, and 50 patients of the other types

• Results for the 5 patients will drown in the results for the others, so we won’t even suspect that something is wrong

• Blocking variable: Ensure that you sample e.g. 30 patients of each type

Asthma: Two-way ANOVA

• Still only really interested in the treatment effect

• But, would like to control for the confounding effect of type of patient

• Model for one-way ANOVA: $X_{ij} = \mu + G_i + \varepsilon_{ij}$

• $\mu$ is total mean, $G_i$ is group effect, $\varepsilon_{ij}$ is $N(0, \sigma^2)$

• $\sigma^2$ includes variation due to everything else, including patient type

• Only effect we describe in the model is the treatment effect

Asthma: Two-way ANOVA cont’d

• Two-way ANOVA model: $X_{ijl} = \mu + G_i + B_j + I_{ij} + \varepsilon_{ijl}$

• Describe both treatment effect ($G_i$), patient type effect ($B_j$) and interaction ($I_{ij}$)

• Remove variation due to patient type from $\sigma^2$ (and from its primary estimator, MSE)

• Means that $\sigma^2_{\text{two-way}} < \sigma^2_{\text{one-way}}$

• Recall: Test for treatment effect ($G_i = 0$) compares MSG to MSE (MSG/MSE ~ F-dist), reject if sufficiently large

• Similar tests for the other effects, but based on MSB and MSI

Asthma: Two-way ANOVA cont’d

• If there is a treatment effect, MSG will be a biased estimator for $\sigma^2$

• If there is a block effect, denominator MSE will be smaller here than MSW for one-way ANOVA

• Value of test statistic will be larger!

• Easier to get significant effects! (More power)

• Also get more correct estimates for the group means (because of the sampling)

• Similar to regression: The more significant variables you include in your model, the greater $R^2$ becomes, and you get more correct estimates for the regression coefficients

• $R^2$ increases because $\sigma^2$ decreases the more variables you include
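The effect of blocking on the F-test can be checked numerically. The sketch below is illustrative only: it uses the test-score data from the recalled table at the start of the slides (not asthma data), and the assignment of scores to cells is an assumption based on reading that flattened table. It compares the one-way error term MSW with the two-way error term MSE for the same group effect.

```python
import numpy as np

# scores[block, group, replicate]
# blocks = subject types (Poor, Fair, Good, Excellent)
# groups = test types (Profile fit, Mindbender, Psych Out)
# Cell assignment is an assumed reading of the table in the slides.
scores = np.array([
    [[65, 68, 62], [69, 71, 67], [75, 75, 78]],
    [[74, 79, 76], [72, 69, 69], [70, 69, 65]],
    [[64, 72, 65], [68, 73, 75], [78, 82, 80]],
    [[83, 82, 84], [78, 78, 75], [76, 77, 75]],
], dtype=float)

n_blocks, n_groups, n_reps = scores.shape
grand = scores.mean()
group_means = scores.mean(axis=(0, 2))
block_means = scores.mean(axis=(1, 2))
cell_means = scores.mean(axis=2)

# Sums of squares for the two-way model with interaction
sst = ((scores - grand) ** 2).sum()
ssg = n_blocks * n_reps * ((group_means - grand) ** 2).sum()
ssb = n_groups * n_reps * ((block_means - grand) ** 2).sum()
sse = ((scores - cell_means[:, :, None]) ** 2).sum()
ssi = sst - ssg - ssb - sse

msg = ssg / (n_groups - 1)
# One-way ANOVA: block and interaction variation stay in the error term
msw_one_way = (sst - ssg) / (scores.size - n_groups)
# Two-way ANOVA: only within-cell variation is left in the error term
mse_two_way = sse / (n_blocks * n_groups * (n_reps - 1))

f_one_way = msg / msw_one_way
f_two_way = msg / mse_two_way
print(f"MSW (one-way) = {msw_one_way:.2f}, F = {f_one_way:.2f}")
print(f"MSE (two-way) = {mse_two_way:.2f}, F = {f_two_way:.2f}")
```

The numerator MSG is the same in both tests; only the denominator changes, which is why the two-way F statistic for the group effect is larger here.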

ANOVA and linear regression

• Regression: Split the distance from each data point to the total mean into:

– 1. Distance from mean to regression line

– 2. Distance from regression line to data point

• Got sums of squares SSR, SSE and SST

• Used for estimation and measuring how close data points were to regression line ($R^2$)

• However, also used for an F-test on whether all $\beta_i = 0$ (From slide of detailed explanations of SPSS output)

• This is ANOVA in linear regression!

Design differences: ANOVA and regression

• Wheat/fertilizer example, additional confounders: Earth quality or amount of sun could vary across the field

• ANOVA: The one-way and two-way models covered here allow at most two independent variables. Control for other things e.g. by repeating the study until all types of wheat have been grown in each part of the field

• Regression: Collect information on earth quality and sun amounts, and include in the model

Conclusion, regression vs ANOVA

• Regression allows for explicit modelling of more than two independent variables

• Also, in regression, independent variables can be continuous, whereas in ANOVA they have to be categorical

• Hence, regression is more flexible than ANOVA

Confounding in regression

• Two hospitals: A and B

• Measure cost per patient for treating some disease

• For some reason, males cost \$1000000 more on average than females

• At hospital A, 80% of patients are males

• At hospital B, 20% of patients are males

• Do a regression of cost vs hospital, what will the results indicate?

• Do a regression of cost vs hospital and gender, what will the results indicate?
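One way to answer the two questions is to build a small noise-free data set and fit both regressions. This is a sketch under stated assumptions: 100 patients per hospital, a hypothetical base cost of \$5000000 for females, no true hospital effect, and the coding A=0, B=1; the helper names `make_hospital` and `ols` are illustrative.

```python
import numpy as np

def make_hospital(hospital, n_males, n_females):
    """Rows of (hospital, male, cost); cost depends on gender only."""
    base_cost, male_extra = 5_000_000, 1_000_000  # base cost is hypothetical
    rows = [(hospital, 1, base_cost + male_extra)] * n_males
    rows += [(hospital, 0, base_cost)] * n_females
    return rows

# Hospital A (coded 0): 80% males. Hospital B (coded 1): 20% males.
data = np.array(make_hospital(0, 80, 20) + make_hospital(1, 20, 80), dtype=float)
hospital, male, cost = data.T

def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

crude = ols(cost, hospital)           # cost vs hospital
adjusted = ols(cost, hospital, male)  # cost vs hospital and gender

print("Hospital coefficient, crude:   ", round(crude[1]))
print("Hospital coefficient, adjusted:", round(adjusted[1]))
print("Gender coefficient, adjusted:  ", round(adjusted[2]))
```

In this construction the crude regression attributes the gender mix to the hospitals (B appears cheaper), while the adjusted regression puts the whole difference on gender.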

Confounding, regression vs ANOVA

• In regression, adjust for confounding by including the confounder in the model

• In ANOVA, adjust for the confounder by using it as a blocking variable, or by making sure in the sampling that the confounder is equally distributed in e.g. both hospitals

Interaction in regression (and ANOVA)

• Recall the model with main effects only:

• Have fitted the model: Birth weight = 2500.174 + 4.238*mother’s weight - 270.013*smoking status

• If your mother weighs 100 lbs, what is the estimated effect of smoking?

• If your mother weighs 200 lbs, what is the estimated effect of smoking?

Coefficients (a)

Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, lower bound | 95% CI for B, upper bound
(Constant) | 2500.174 | 230.833 | | 10.831 | .000 | 2044.787 | 2955.561
weight in pounds | 4.238 | 1.690 | .178 | 2.508 | .013 | .905 | 7.572
smoking status | -270.013 | 105.590 | -.181 | -2.557 | .011 | -478.321 | -61.705

a. Dependent Variable: birthweight

Model with interaction:

• Get bwt = 2347 + 5.41*mwt + 47.87*smoking - 2.46*mwt*smoking

• Mwt=100 lbs vs mwt=200 lbs for non-smokers: bwt=2888g and bwt=3428g, difference=540g

• Mwt=100 lbs vs mwt=200 lbs for smokers: bwt=2690g and bwt=2985g, difference=295g

Coefficients (a)

Model 1 | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95% CI for B, lower bound | 95% CI for B, upper bound
(Constant) | 2347.507 | 312.717 | | 7.507 | .000 | 1730.557 | 2964.457
weight in pounds | 5.405 | 2.335 | .227 | 2.315 | .022 | .798 | 10.012
smoking status | 47.867 | 451.163 | .032 | .106 | .916 | -842.220 | 937.953
smkwht | -2.456 | 3.388 | -.223 | -.725 | .470 | -9.140 | 4.229

a. Dependent Variable: birthweight
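The questions about mothers weighing 100 and 200 lbs can be worked through by plugging values into the two fitted equations. The sketch below uses the unrounded coefficients from the SPSS output; the function names are illustrative, and smoking status is assumed to be coded 0 for non-smokers and 1 for smokers.

```python
def bwt_main_effects(mwt, smoker):
    """Fitted model with main effects only (birth weight in grams)."""
    return 2500.174 + 4.238 * mwt - 270.013 * smoker

def bwt_interaction(mwt, smoker):
    """Fitted model with a mother's weight x smoking interaction term."""
    return 2347.507 + 5.405 * mwt + 47.867 * smoker - 2.456 * mwt * smoker

for mwt in (100, 200):
    effect_main = bwt_main_effects(mwt, 1) - bwt_main_effects(mwt, 0)
    effect_inter = bwt_interaction(mwt, 1) - bwt_interaction(mwt, 0)
    print(f"mwt={mwt} lbs: smoking effect {effect_main:.1f} g (main effects), "
          f"{effect_inter:.1f} g (interaction)")

# Effect of 100 extra lbs of mother's weight, by smoking status
diff_nonsmokers = bwt_interaction(200, 0) - bwt_interaction(100, 0)
diff_smokers = bwt_interaction(200, 1) - bwt_interaction(100, 1)
print(f"100 lbs more: {diff_nonsmokers:.1f} g (non-smokers), {diff_smokers:.1f} g (smokers)")
```

With main effects only, the estimated smoking effect is the same at every mother's weight; with the interaction term it changes with mother's weight.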

What does this mean?

• Mother’s weight has a greater impact on birthweight for non-smokers than for smokers (or the other way round)

• We see that the slope is steeper for non-smokers

• However, interaction is non-significant

[Figure: scatter plot of birthweight (0 to 5000) against weight in pounds (50 to 250), with a separate fitted line for each smoking status (0 and 1); R Sq Linear = 0.023 and R Sq Linear = 0.042]

Interaction in hospital example:

• Two hospitals: A and B (coded as A=0, B=1)

• Gender: females=0, males=1

• Measure cost per patient for treating some disease

• Say, we get the model: Cost in \$ = 5000000 + 1000000*Gender - 300000*Hospital

• Estimated cost of a male compared to a female in hospital A: \$1000000

• Estimated cost of a male compared to a female in hospital B: \$1000000

• What if males really are \$500000 more expensive to treat than females in hospital A, but \$1500000 more expensive to treat in hospital B?

• Have to include an interaction term to model it
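A small numeric sketch of this point follows. The four cell costs are hypothetical, chosen so that males cost \$500000 more in hospital A and \$1500000 more in hospital B, and so that a main-effects fit on equally sized cells reproduces the model above. Adding the Gender*Hospital product recovers the two different gender effects.

```python
import numpy as np

# One observation per (gender, hospital) cell, i.e. equally sized cells.
# Cell costs are hypothetical illustration values.
gender = np.array([0.0, 1.0, 0.0, 1.0])    # females=0, males=1
hospital = np.array([0.0, 0.0, 1.0, 1.0])  # A=0, B=1
cost = np.array([5_250_000.0, 5_750_000.0, 4_450_000.0, 5_950_000.0])

def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

main = ols(cost, gender, hospital)
inter = ols(cost, gender, hospital, gender * hospital)

print("Main effects only: gender effect in both hospitals =", round(main[1]))
print("With interaction:  gender effect in hospital A     =", round(inter[1]))
print("With interaction:  gender effect in hospital B     =", round(inter[1] + inter[3]))
```

The main-effects model can only report the average gender effect; the interaction coefficient carries the difference between the hospitals.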

Designing a study:

• Ideally, should know in advance:

– The basic hypotheses you want to test

– What information you need in order to test the hypotheses

– Which population do you want the results to apply for?

– How to collect that information; sampling, design

– If regression: What important confounders do you need information on

Sampling in practice

• Newbold mentions:

1. Information required? Has the study been done before? Is it possible to get the information?

2. Relevant population?

3. Sample selection? Random? Systematic? Stratified?

4. Obtaining information? Interviews? Questionnaires?

5. Inferences from sample? Which methods?

6. Conclusions? How to present your results?

• Nonsampling errors: Missing data, dishonest or inaccurate answers, low reliability or validity

Reliability and validity

• Validity of a research instrument: The degree to which it measures what you are interested in measuring

• Reliability of a research instrument: The extent to which repeated measurements under constant conditions will give the same result

• A research instrument may be reliable, but not valid

Types of sampling

• Simple random sampling: Select subjects at random

• Every subject in the population has the same probability of being sampled

– Ex: One-way analysis of asthma patients

• If large enough sample, gives you a representative sample compared to the population

• Problem: If small sample, will give too few data on interesting sub-groups

• Systematic sampling: As random sampling, but you include e.g. every 5th subject in your sample

Types of sampling cont’d:

• Stratified sampling: Want to ensure that interesting sub-groups of the population are sampled in sufficient numbers (over-sampled)

• Divide the population into K strata, randomly sample $n_i$ from each stratum

• Ex: Asthma patients, two-way ANOVA

• Problems: How many patients in each stratum?

• Cluster sampling: Similar to stratified sampling, but considers geographical units

• Divide the population into M clusters, randomly sample m of them

• Include all subjects in the sampled clusters
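The four sampling schemes described so far can be sketched on a made-up population. Everything about the population below is hypothetical (300 patients, a rare patient type for every 10th patient, 10 clinics acting as geographical clusters); the point is only how each scheme selects subjects.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 300 patients in 10 clinics of 30 patients each;
# every 10th patient is of a rare type.
population = [
    {"id": i, "type": "rare" if i % 10 == 0 else "common", "clinic": i // 30}
    for i in range(300)
]

# Simple random sampling: every subject has the same chance of selection
simple = rng.sample(population, 30)

# Systematic sampling: random starting point, then every 5th subject
step = 5
start = rng.randrange(step)
systematic = population[start::step]

# Stratified sampling: sample a fixed number from each stratum,
# which over-samples the rare type
stratified = []
for stratum in ("rare", "common"):
    members = [p for p in population if p["type"] == stratum]
    stratified += rng.sample(members, 15)

# Cluster sampling: sample m = 2 of the M = 10 clinics, include everyone in them
chosen_clinics = rng.sample(range(10), 2)
cluster = [p for p in population if p["clinic"] in chosen_clinics]

def count_rare(sample):
    return sum(p["type"] == "rare" for p in sample)

print("rare patients, simple random:", count_rare(simple), "of", len(simple))
print("rare patients, stratified:   ", count_rare(stratified), "of", len(stratified))
```

Stratification guarantees 15 rare-type patients, whereas a simple random sample of 30 would be expected to contain only about 3.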

Types of sampling cont’d:

• Two-phase sampling: Carry out an initial pilot study, where only a small sample is collected

• Then proceed with collecting the main sample

• Advantages: Get initial estimates on effects

• Initial estimates on variance in data -> sample size, how much data do you need to reject H0?

• Disadvantages: Costly, time-consuming

• NOTE: Most methods I’ve mentioned require adjusted formulas for estimation, described in the book

Some study types

• Observational studies

– Cross-sectional studies

– Cohort studies

– Longitudinal studies

– Panel data

– Case / control studies

• Experimental studies

– Randomized, controlled experiments (blind, double-blind)

– Interventions

Cross-sectional studies

• Examines a sample of persons, at a single timepoint

• Time effects rely on memory of respondents

• Good for estimating prevalence

• Difficult for rare diseases

• Response rate bias

Cohort studies and longitudinal studies

• A sample (cohort) is followed over some time period.

• If queried at specific timepoints: Longitudinal study

• Gives better information about causal effects, as report of events is not based on memory

• Requires that a substantial group develops disease, and that substantial groups differ with respect to risk factors

• Problem: Long time perspective

Panel data

• Data collected for the same sample, at repeated time points

• Corresponds to longitudinal epidemiological studies

• A combination of cross-sectional data and time series data

• Increasingly popular study type

Case – control studies

• Starts with a set of sick individuals (cases), and adds a set of controls, for comparison

• Retrospective study – Start with finding cases and controls, then dig into their past and find out what made them cases and controls

• Cases and controls should be from the same population

• Matching controls

• Cheap, good method for rare diseases

• Problem: Bias from selection, recall bias

Epidemiology

• Epidemiology is the study of diseases in a population

– prevalence

– incidence, mortality

– survival

• Goals

– describe occurrence and distribution

– search for causes

– determine effects in experiments

Measures of risk in epidemiology

• Relative risk (used for prospective studies)

• Odds ratio (used for retrospective studies)

Group | Abortions | No abortions | Total
Op.nurses | 10 | 26 | 36
Other nurses | 3 | 31 | 34
Total | 13 | 57 | 70

Op-nurses cont’d:

• Relative risk: Proportion of abortions among op.nurses divided by proportion of abortions among others

$$RR = \frac{10/36}{3/34} = 3.1$$

• Odds ratio: Odds for abortion among op.nurses: 10/26. Odds for abortion among other nurses: 3/31

• Gives the odds ratio:

$$OR = \frac{10/26}{3/31} = 4.0$$
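The two measures can be recomputed directly from the 2 x 2 counts (10 abortions and 26 without among op.nurses, 3 and 31 among other nurses). The function names below are illustrative; this is just the arithmetic from the slide written out.

```python
def relative_risk(exposed_events, exposed_total, other_events, other_total):
    """Ratio of the event proportions in the two groups."""
    return (exposed_events / exposed_total) / (other_events / other_total)

def odds_ratio(exposed_events, exposed_nonevents, other_events, other_nonevents):
    """Ratio of the odds (events / non-events) in the two groups."""
    return (exposed_events / exposed_nonevents) / (other_events / other_nonevents)

# Op.nurses: 10 abortions, 26 without (36 total)
# Other nurses: 3 abortions, 31 without (34 total)
rr = relative_risk(10, 36, 3, 34)
odds = odds_ratio(10, 26, 3, 31)

print(f"RR = {rr:.1f}, OR = {odds:.1f}")
```

The odds ratio is further from 1 than the relative risk here because the event is not rare among op.nurses (10 of 36).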

Correcting for finite population in estimations

• Our estimates of, for example, population variances, population proportions, etc. assumed an "infinite" population

• When the population size N is comparable to the sample size n, a correction factor is necessary.

• Used if n > 0.05N

• Examples:

– Variance of population mean estimate:

$$\hat{\sigma}^2_{\bar{X}} = \frac{s^2}{n} \cdot \frac{N-n}{N}$$

– Variance of population proportion estimate (with $\hat{p}$ the sample proportion):

$$\hat{\sigma}^2_{\hat{p}} = \frac{\hat{p}(1-\hat{p})}{n-1} \cdot \frac{N-n}{N}$$
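A short sketch of the correction in use. It assumes the estimators are $s^2/n \cdot (N-n)/N$ for the mean and $\hat{p}(1-\hat{p})/(n-1) \cdot (N-n)/N$ for the proportion, which is one reading of the slide's formulas; the population size, sample size, sample variance and sample proportion are made-up illustration values.

```python
def needs_correction(n, N):
    """Rule of thumb from the slide: correct when n > 0.05 N."""
    return n > 0.05 * N

def var_mean_fpc(s2, n, N):
    """Estimated variance of the sample mean with finite population correction."""
    return (s2 / n) * (N - n) / N

def var_proportion_fpc(p_hat, n, N):
    """Estimated variance of the sample proportion with finite population correction."""
    return p_hat * (1 - p_hat) / (n - 1) * (N - n) / N

# Hypothetical numbers: population of 1000, sample of 100
N, n = 1000, 100
s2, p_hat = 25.0, 0.3

print("Correction needed:", needs_correction(n, N))
print("Var of mean, uncorrected vs corrected:", s2 / n, var_mean_fpc(s2, n, N))
print("Var of proportion, corrected:", var_proportion_fpc(p_hat, n, N))
```

The factor (N - n)/N is 0.9 in this example, so ignoring it would overstate the variance by about 11%.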

Determining sample size

• An important part of experimental planning

• The answer will generally depend on the parameters you want to estimate in the first place, so only a rough estimate is possible

• However, a rough estimate may sometimes be very important to do

• A pilot study may be very helpful

Sample size for means (large samples)

• We want to estimate the mean $\mu$

• We want a confidence interval to extend a distance a from the estimate

• We guess at the population variance $\sigma^2$

• A sample size estimate (the approximation holds at 95% confidence level):

$$n = \frac{Z_{\alpha/2}^2 \, \sigma^2}{a^2} \approx \frac{4\sigma^2}{a^2}$$

• Small samples: If we have a population of size N, and want a specified $\sigma^2_{\bar{X}}$, we get

$$n = \frac{N\sigma^2}{(N-1)\sigma^2_{\bar{X}} + \sigma^2}$$

Example: Have dental costs increased since 1995?

• Want to compare dental costs in 1995 (adjusted to 2006-kroner) and 2006

• Could do a paired sample t-test. How many individuals do we need to ask?

• We believe a difference of 1500 kroner is important

• From experience, we think $\sigma$ for the difference is 2500 kroner

• Need $4 \cdot 2500^2 / 1500^2 \approx 11.1$, i.e. at least 12 individuals to find a significant difference if our assumptions are correct
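The dental-cost calculation can be reproduced with a small helper that rounds up to whole individuals. The default z = 2 mirrors the slide's approximation to 1.96. The finite-population call at the end uses hypothetical inputs (N = 1000, target standard deviation of the mean 750 kroner) purely to show the second formula, $n = N\sigma^2 / ((N-1)\sigma^2_{\bar{X}} + \sigma^2)$.

```python
import math

def sample_size_mean(sigma, a, z=2.0):
    """n = z^2 * sigma^2 / a^2, rounded up to a whole individual."""
    return math.ceil(z ** 2 * sigma ** 2 / a ** 2)

def sample_size_mean_finite(N, sigma, sigma_mean):
    """n = N * sigma^2 / ((N - 1) * sigma_mean^2 + sigma^2), rounded up."""
    return math.ceil(N * sigma ** 2 / ((N - 1) * sigma_mean ** 2 + sigma ** 2))

# Dental costs: sigma = 2500 kroner, important difference a = 1500 kroner
n_dental = sample_size_mean(sigma=2500, a=1500)
print("Dental example, z = 2:   ", n_dental)
print("Dental example, z = 1.96:", sample_size_mean(2500, 1500, z=1.96))

# Hypothetical finite population of N = 1000
print("Finite population:", sample_size_mean_finite(N=1000, sigma=2500, sigma_mean=750))
```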

Sample size for proportions (large samples)

• We want to estimate proportion P

• We want a confidence interval to extend a distance a from the estimate

• Recall: CI for P is $\hat{P} \pm Z_{\alpha/2}\sqrt{\hat{P}(1-\hat{P})/n}$

• A sample size estimate (the approximation holds at 95% confidence level):

$$n = \frac{Z_{\alpha/2}^2 \, P(1-P)}{a^2} \approx \frac{4P(1-P)}{a^2}$$

• Largest possible value of this expression is $1/a^2$ (P=0.5)

Example: Poll

• Want to estimate the proportion voting Labour with a 95% confidence interval extending ±3%

• Need to include at most $1/0.03^2 \approx 1111.1$, i.e. 1112 people in our study

• Would probably stick with 1112 if we don’t have any reason to believe P is smaller than 0.5
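The poll calculation, and the effect of assuming a value of P other than 0.5, can be sketched as follows. The function name is illustrative, z = 2 is the slide's approximation at the 95% level, and P = 0.3 is a hypothetical alternative guess.

```python
import math

def sample_size_proportion(a, p=0.5, z=2.0):
    """n = z^2 * p * (1 - p) / a^2, rounded up to a whole person."""
    return math.ceil(z ** 2 * p * (1 - p) / a ** 2)

# Worst case P = 0.5 with a margin of 3 percentage points
n_worst_case = sample_size_proportion(a=0.03)
print("P = 0.5:", n_worst_case)

# A hypothetical prior guess of P = 0.3 gives a smaller required sample
n_guess = sample_size_proportion(a=0.03, p=0.3)
print("P = 0.3:", n_guess)
```

Because P(1-P) is largest at P = 0.5, planning with P = 0.5 is the conservative choice.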

Next time:

• More on time-series analysis from chapter 19

• Presentation of results: How do you do it?• Recap of the different methods we’ve learnt