
M&M Ch 6 Introduction to Inference OVERVIEW



Introduction to Inference* -- Bayes' Theorem: Haemophilia

Brother has haemophilia => Probability (WOMAN is Carrier) = 0.5.
New data: her son is Normal (NL).
Update: Prob[Woman is Carrier, given her son is NL] = ??

Inference is about parameters (populations) or general mechanisms -- or future observations. It is not about data (samples) per se, although it uses data from samples. Might think of inference as statements about a universe most of which one did not observe.

[Diagram: Bayes' Theorem applied to the WOMAN's carrier status]

1. PRIOR [prior to knowing the status of her son]:
   Prob(CARRIER) = 0.5 ; Prob(NOT CARRIER) = 0.5

2. LIKELIHOOD [Prob(son is NL | status of WOMAN)], from the observed data (son is NL, not H):
   Prob(NL | CARRIER) = 0.5 ; Prob(NL | NOT CARRIER) = 1.0

3. Products of PRIOR and LIKELIHOOD:
   CARRIER: 0.5 x 0.5 = 0.25 ; NOT CARRIER: 0.5 x 1.0 = 0.50

   POSTERIOR, given that the son is NL [products scaled to add to 1]:
   Prob(CARRIER | NL) = 0.25/0.75 = 0.33 ; Prob(NOT CARRIER | NL) = 0.50/0.75 = 0.67
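A minimal sketch of this prior x likelihood -> posterior calculation, in Python; the numbers are exactly those in the diagram above:

```python
# Prior, likelihood, and posterior for the haemophilia example.
prior = {"Carrier": 0.5, "Not Carrier": 0.5}            # before knowing son's status
likelihood_NL = {"Carrier": 0.5, "Not Carrier": 1.0}    # Prob[son is NL | status]

products = {s: prior[s] * likelihood_NL[s] for s in prior}   # 0.25 and 0.50
total = sum(products.values())                               # scaling factor, 0.75
posterior = {s: products[s] / total for s in products}

print(posterior)   # {'Carrier': 0.333..., 'Not Carrier': 0.666...}
```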

Two main schools or approaches:

Bayesian [ not even mentioned by M&M ]

• Makes direct statements about parameters and future observations

• Uses previous impressions plus new data to update impressions about parameter(s)

e.g. everyday life; medical tests: pre- and post-test impressions

Frequentist

• Makes statements about observed data (or statistics from data) (used indirectly [but often incorrectly] to assess evidence against certain values of a parameter)

• Does not use previous impressions or data outside of the current study (meta-analysis is changing this)

e.g.

• Statistical Quality Control procedures [for Decisions]
• Sample survey organizations: Confidence intervals
• Statistical Tests of Hypotheses

Unlike Bayesian inference, there is no quantified pre-test or pre-data "impression"; the ultimate statements are about data, conditional on an assumed null or other hypothesis. Thus, an explanation of a p-value must start with the conditional "IF the parameter is ..., the probability that the data would ..."

The book "Statistical Inference" by Michael W. Oakes is an excellent introduction to this topic and to the limitations of frequentist inference.


Bayesian Inference for a quantitative parameter ... Bayesian Inference in general

E.g. interpretation of a measurement that is subject to intra-personal (incl. measurement) variation. Say we know the pattern of inter-personal and intra-personal variation. Adapted from Irwig (JAMA 266(12):1678-85, 1991).

[Figure: MY MEAN CHOLESTEROL µ, in three panels]

1. PRIOR: p(µ) (we know there is substantial intra-personal & measurement variation).

2. DATA: one measurement (•) on ME. LIKELIHOOD, i.e. f(• | µ) for various values of µ (3 shown here: 'under-estimate?', 'on target?', 'over-estimate?'); uses the known model for the variation of measurements around µ.

3. POSTERIOR for µ: P(µ | •), the products of PRIOR and LIKELIHOOD (scaled). The posterior is a composite of the prior and the data (•).

In general:

• Interest is in a parameter θ.

• Have prior information concerning θ in the form of a prior distribution with probability density function p(θ). [To distinguish, might use lower case p for PRIOR.]

• Obtain new data x (x could be a single piece of information or more complex).

• The likelihood of the data for any contemplated value of θ is given by

      L[ x | θ ] = prob[ x | θ ]

• The posterior probability for θ, GIVEN x, is calculated as:

      P( θ | x ) = L[ x | θ ] p(θ) / ∫ L[ x | θ ] p(θ) dθ

  [To distinguish, might use UPPER CASE P for POSTERIOR.] The denominator is a summation/integration (the ∫ sign) over the range of θ and serves as a scaling factor that makes P(θ | x) integrate to 1.

In the Bayesian approach, post-data statements of uncertainty about θ are made directly from the function P(θ | x).
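A hedged numerical sketch of this scaling step: a grid approximation to P(θ | x), assuming -- purely for illustration, none of these numbers come from the handout -- a Gaussian prior and a single Gaussian measurement:

```python
import numpy as np

# Grid approximation of P(theta | x) = L[x|theta] p(theta) / integral of L p dtheta.
theta = np.linspace(100, 400, 601)                    # candidate values of theta
prior = np.exp(-0.5 * ((theta - 220) / 40) ** 2)      # p(theta), up to a constant
x, sigma_w = 260.0, 30.0                              # one new measurement, its SD
likelihood = np.exp(-0.5 * ((x - theta) / sigma_w) ** 2)

posterior = prior * likelihood
posterior /= posterior.sum() * (theta[1] - theta[0])  # scale so it integrates to 1

print(theta[np.argmax(posterior)])   # mode ~245.5, between the prior mean 220 and x = 260
```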


Re: Previous 2 examples of Bayesian inference

Cholesterol example:

θ = my mean cholesterol level.

p[θ] = ? In the absence of any knowledge about me, one would have to take as a prior the distribution of mean cholesterol levels for the population of my age and sex.

x = one cholesterol measurement on me.

Assume that if a person's mean is θ, the variation around θ would be Gaussian with standard deviation σw. (The Bayesian argument does not insist on Gaussian-ness.) So L[ X=x | my mean is θ ] is obtained by evaluating the height of the Gaussian(θ, σw) curve at X = x.

P[θ | X = x] = L[ X = x | θ ] p[θ] / ∫ L[ X = x | θ ] p[θ] dθ

If intra-individual variation is Gaussian with SD σw, and the prior is Gaussian with mean µ0 and SD σb [b for between individuals], then the mean of the posterior distribution is a weighted average of µ0 and x, with weights inversely proportional to the squares of σb and σw respectively. So, the less the intra-individual and lab variation, the more the posterior is dominated by the measurement x on the individual --- and vice versa.

Haemophilia example:

θ = possible status of the woman: "Carrier" or "Not a carrier".

p(θ = Carrier) = 0.5 ; p(θ = Not a Carrier) = 0.5

x = status of son.

L[ x=Normal | Woman is Carrier ] = 0.5
L[ x=Normal | Woman is Not Carrier ] = 1

P(θ = Carrier | x=Normal)
   = L[x=N | θ=C] p[θ=C] / { L[x=N | θ=C] p[θ=C] + L[x=N | θ=Not C] p[θ=Not C] }

[This is the equation for the predictive value of a diagnostic test with binary results.]
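The weighted-average statement for the Gaussian-Gaussian case can be made concrete in a short sketch; the numbers are the same illustrative ones used in the grid sketch earlier, not data from the handout:

```python
# Closed-form version of the cholesterol example: Gaussian prior (mean mu0, SD sigma_b)
# combined with one Gaussian measurement x (SD sigma_w). Numbers are illustrative only.
mu0, sigma_b = 220.0, 40.0   # prior: population mean, between-person SD
x, sigma_w = 260.0, 30.0     # my one measurement, within-person (incl. lab) SD

w_prior, w_data = 1 / sigma_b**2, 1 / sigma_w**2   # weights = inverse variances
post_mean = (w_prior * mu0 + w_data * x) / (w_prior + w_data)
post_sd = (w_prior + w_data) ** -0.5

print(round(post_mean, 1), round(post_sd, 1))   # ~245.6, 24.0: pulled from x toward mu0
```

The smaller σw is relative to σb, the closer the posterior mean sits to the measurement x, exactly as the paragraph above says.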


M&M Ch 6.1 Introduction to Inference ... Estimating with Confidence

(Frequentist) Confidence Interval (CI) or Interval Estimate for a parameter θ

Formal definition:

A level 1 - α Confidence Interval for a parameter θ is given by two statistics, θ_Lower and θ_Upper, such that, when θ is the true value of the parameter,

      Prob ( θ_Lower ≤ θ ≤ θ_Upper ) = 1 - α

Large-sample CI's

Many large-sample CI's are of the form

      θ^ ± multiple of SE(θ^)   or   f⁻¹[ f(θ^) ± multiple of SE( f(θ^) ) ],

where f is some function of θ^ which has a close-to-Gaussian distribution, and f⁻¹ is the inverse function (the other motivation is variance stabilization; cf A&B ch 11).

Examples of the latter:

      θ = odds ratio:    f = ln ;      f⁻¹ = exp
      θ = proportion π:  f = arcsine ; f⁻¹ = reverse
                         f = logit ;   f⁻¹ = exp(•)/[1+exp(•)]

• A CI is a statistic: a quantity calculated from a sample.

• Usually use α = 0.01 or 0.05 or 0.10, so that the "level of confidence", 1 - α, is 99% or 95% or 90%. We will also use "α" for tests of significance (there is a direct correspondence between confidence intervals and tests of significance).
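A sketch of the f / f⁻¹ idea for θ = odds ratio, with f = ln and f⁻¹ = exp; the 2x2 counts a, b, c, d are invented for illustration and are not from the handout:

```python
import math

# Large-sample CI for an odds ratio, built on the log scale and back-transformed.
a, b, c, d = 20, 80, 10, 90
or_hat = (a * d) / (b * c)                      # 2.25
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)    # large-sample SE of ln(OR)

z = 1.96                                        # 95% level
lo = math.exp(math.log(or_hat) - z * se_log_or)
hi = math.exp(math.log(or_hat) + z * se_log_or)
print(or_hat, (round(lo, 2), round(hi, 2)))     # CI ~ (0.99, 5.09), asymmetric around 2.25
```

Note how the back-transformed limits are asymmetric around θ^ even though the log-scale interval is symmetric.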

Method of Constructing a 100(1 - α)% CI (in general):

[Figure: the sampling distribution of the (point) estimate is drawn twice, once centred at θ_Lower -- relative to which the observed estimate looks like an "Over"-estimate? -- and once centred at θ_Upper -- relative to which it looks like an "Under"-estimate?]

NB: the shapes of the distributions may differ at the 2 limits and thus yield asymmetric limits: see e.g. the CI for π, based on the binomial. Notice also the use of the concept of tail area (p-value) to construct the CI.

• Technically, we should say that we are using a procedure which is guaranteed to cover the true θ in a fraction 1 - α of applications. If we were not fussy about the semantics, we might say that any particular CI has a 1 - α chance of covering θ.

• For a given amount of sample data, the narrower the interval from L to U, the lower the degree of confidence in the interval, and vice versa.

The meaning of a CI is often "massacred" in the telling... users slip into Bayesian-like statements without realizing it.

STATISTICAL CORRECTNESS

The Frequentist CI (statistic) is the SUBJECT of the sentence (speak of the long-run behaviour of CI's).

In Bayesian parlance, the parameter is the SUBJECT of the sentence [speak of where parameter values lie].

The book "Statistical Inference" by Oakes is good here.


SD's* for "Large Sample" CI's for specific parameters

   parameter     estimate     SD*(estimate)
   µx            x̄            σx / √n
   µ∆X           d̄            σd / √n
   π             p            √( π[1-π] / n )
   µ1 - µ2       x̄1 - x̄2      √( σ1²/n1 + σ2²/n2 )
   π1 - π2       p1 - p2      √( π1[1-π1]/n1 + π2[1-π2]/n2 )

The last two are of the form √( SD1² + SD2² ).

Semantics behind a Confidence Interval (e.g.)

Probability is 1 – α that x̄ falls within zα/2 [or tα/2] SD(x̄) of µx (see footnote 1).

Probability is 1 – α that µx "falls" within zα/2 [or tα/2] SD(x̄) of x̄ (see footnote 2).

Writing z for zα/2 (or tα/2), the algebraic chain is:

   Pr { µx – z SD(x̄) ≤ x̄ ≤ µx + z SD(x̄) } = 1 – α
   Pr { – z SD(x̄) ≤ x̄ – µx ≤ + z SD(x̄) } = 1 – α
   Pr { + z SD(x̄) ≥ µx – x̄ ≥ – z SD(x̄) } = 1 – α
   Pr { x̄ + z SD(x̄) ≥ µx ≥ x̄ – z SD(x̄) } = 1 – α
   Pr { x̄ – z SD(x̄) ≤ µx ≤ x̄ + z SD(x̄) } = 1 – α

1 This is technically correct, because the subject of the sentence is the statistic x̄. The statement is about the behaviour of x̄.

2 This is technically incorrect, because the subject of the sentence is the parameter µx. In the Bayesian approach the parameter is the subject of the sentence. In the special case of "prior ignorance" [e.g. if one had just arrived from Mars], the incorrectly stated frequentist CI is close to a Bayesian statement based on the posterior density function p(µx | data).

Technically, we are not allowed to "switch" from one to the other [it is not like saying "because St Lambert is 5 km from Montreal, thus Montreal is 5 km from St Lambert"]. Here µx is regarded as a fixed (but unknowable) constant; it doesn't "fall" or "vary around" any particular spot; in contrast, you can think of the statistic x̄ "falling" or "varying around" the fixed µx.

* In practice, measures of individual (unit) variation about θ {e.g. σx, √(π[1-π]), ...} are replaced by estimates (e.g. sx, √(p[1-p]), ...) calculated from the sample data, and if sample sizes are small, adjustments are made to the "multiple" used in the multiple of SD(θ^). To denote the "substitution", some statisticians and texts (e.g. ...) use the term "SE" rather than SD; others (e.g. Colton, Armitage and Berry) use the term SE for the SD of a statistic -- whether one 'plugs in' SD estimates or not. Notice that M&M delay using SE until p504 of Ch 7.
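A quick sketch of the last row of the table (π1 - π2), with the π's replaced by the p's from the sample -- the 'plug-in' substitution described in the footnote. The counts are illustrative, not from the handout:

```python
import math

# Large-sample CI for a difference of two proportions, using plug-in SE's.
p1, n1 = 0.47, 1025
p2, n2 = 0.39, 472

se1 = math.sqrt(p1 * (1 - p1) / n1)
se2 = math.sqrt(p2 * (1 - p2) / n2)
se_diff = math.sqrt(se1**2 + se2**2)     # the sqrt(SD1^2 + SD2^2) form

z = 1.96
diff = p1 - p2
print(round(diff - z * se_diff, 3), round(diff + z * se_diff, 3))
```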


Clopper-Pearson 95% CI for π based on the observed proportion 4/12

[Figure: two binomial(n=12) distributions over the counts 0, 1, ..., 12.
 Binomial at π_upper = 0.65 -- probs for k = 2, ..., 12: .001 .005 .019 .059 .128 .204 .237 .195 .109 .037 .006 -- with lower-tail Prob[ ≤ 4 ] = P = 0.025.
 Binomial at π_lower = 0.10 -- probs for k = 0, ..., 5: .282 .377 .230 .085 .021 .004 -- with upper-tail Prob[ ≥ 4 ] = P = 0.025.]

Notice that Prob[4] is counted twice, once in each tail.

The use of CI's based on mid-P values, where Prob[4] is counted only once, is discussed in Miettinen's Theoretical Epidemiology and in §4.7 of Armitage and Berry's text.

See the similar graph in Fig 4.5, p 120, of A&B.
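A sketch of how those two π values can be found by inverting the binomial tail probabilities, here using scipy's binomial CDF and a root-finder:

```python
from scipy.stats import binom
from scipy.optimize import brentq

# Clopper-Pearson construction for 4/12: find pi_lower with upper-tail
# Prob[X >= 4] = 0.025, and pi_upper with lower-tail Prob[X <= 4] = 0.025.
n, x = 12, 4

pi_lower = brentq(lambda p: 1 - binom.cdf(x - 1, n, p) - 0.025, 1e-9, 1 - 1e-9)
pi_upper = brentq(lambda p: binom.cdf(x, n, p) - 0.025, 1e-9, 1 - 1e-9)

print(round(pi_lower, 3), round(pi_upper, 3))   # ~0.099 and ~0.651, as in the figure
```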

Constructing a Confidence Interval for µ

Assumptions & steps (simplified for didactic purposes; a numerical sketch follows step 5):

(1) Assume (for now) that we know the SD (σ) of the Y values in the population. If we don't know it, suppose we take a "conservative", or larger than necessary, estimate of σ.

(2) Assume also that either (a) the Y values are normally distributed or (b) (if not) n is large enough so that the Central Limit Theorem guarantees that the distribution of possible ȳ's is well enough approximated by a Gaussian distribution.

(3) Choose the degree of confidence (say 90%).

(4) From a table of the Gaussian Distribution, find the z value such that 90% of the distribution is between -z and +z. Some 5% of the Gaussian distribution is above, or to the right of, z = 1.645, and a corresponding 5% is below, or to the left of, z = -1.645.

(5) Compute the interval ȳ ± 1.645 SD(ȳ), i.e., ȳ ± 1.645 σ/√n.
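The promised sketch of steps (1)-(5); the sample values are invented for illustration:

```python
import math

# 90% CI for mu with sigma assumed known (steps 1-5 above).
sigma, n = 15.0, 36        # (1) 'known' (or conservative) SD; sample size
ybar = 102.0               # observed sample mean
z = 1.645                  # (4) middle 90% of the Gaussian lies between -z and +z

half_width = z * sigma / math.sqrt(n)        # (5) 1.645 SD(ybar)
print(ybar - half_width, ybar + half_width)  # the 90% CI
```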

Warning: Before observing ȳ, we can say that there is a 90% probability that the ȳ we are about to observe will be within ±1.645 SD(ȳ)'s of µ. But, after observing ȳ, we cannot reverse this statement and say that there is a 90% probability that µ is in the calculated interval.

We can say that we are USING A PROCEDURE IN WHICH SIMILARLY CONSTRUCTED CI's "COVER" THE CORRECT VALUE OF THE PARAMETER (µ in our example) 90% OF THE TIME. The term "confidence" is a statistical legalism to indicate this semantic difference.

Polling companies who say "polls of this size are accurate to within so many percentage points 19 times out of 20" are being statistically correct -- they emphasize the procedure rather than what has happened in this specific instance. Polling companies (or reporters) who say "this poll is accurate ... 19 times out of 20" are talking statistical nonsense -- this specific poll is either "right" or "wrong"! On average, 19 polls out of 20 are "correct". But this poll cannot be right on average 19 times out of 20!


M&M Ch 6.2 Introduction to Inference ... Tests of Significance

(Frequentist) Tests of Significance

Use: To assess the evidence provided by sample data in favour of a pre-specified claim or 'hypothesis' concerning some parameter(s) or data-generating process. As with confidence intervals, tests of significance make use of the concept of a sampling distribution.

Example 1 (see R. A. Fisher, Design of Experiments, Chapter 2): the tea-tasting lady, below.

Example 2:

In 1949 a divorce case was heard in which the sole evidence of adultery was that a baby was born almost 50 weeks after the husband had gone abroad on military service. [Preston-Jones vs. Preston-Jones, English House of Lords]

To quote the court: "The appeal judges agreed that the limit of credibility had to be drawn somewhere, but on medical evidence 349 (days), while improbable, was scientifically possible." So the appeal failed.

[Figure: Pregnancy duration, 17000 cases > 27 weeks (quoted in Guttmacher's book); histogram of % (0 to 30) by week, for weeks 28 to 46.]

In the U.S. [Lockwood vs. Lockwood, 19??], a 355-day pregnancy was found to be 'legitimate'.

STATISTICAL TEST OF SIGNIFICANCE

The LADY CLAIMS SHE CAN TELL WHETHER the MILK WAS POURED FIRST or the MILK WAS POURED SECOND.

BLIND TEST: of 8 cups, 4 have the milk poured first (MILK) and 4 have the tea poured first (TEA); the lady must say which 4 are which.

   LADY SAYS                                       if just guessing,
   (of the 4 MILK cups / of the 4 TEA cups)        probability of this result
   4 MILK, 0 TEA  /  0 MILK, 4 TEA                   1/70
   3 MILK, 1 TEA  /  1 MILK, 3 TEA                  16/70
   2 MILK, 2 TEA  /  2 MILK, 2 TEA                  36/70
   1 MILK, 3 TEA  /  3 MILK, 1 TEA                  16/70
   0 MILK, 4 TEA  /  4 MILK, 0 TEA                   1/70
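These guessing probabilities are hypergeometric (the lady chooses 4 cups out of 8, of which 4 are truly 'milk first'); a short check:

```python
from math import comb

# Number of 'MILK first' cups correctly identified, if just guessing.
for k in range(4, -1, -1):
    ways = comb(4, k) * comb(4, 4 - k)       # choose k of the 4 MILK, 4-k of the 4 TEA
    print(f"{k} of 4 correct: {ways}/{comb(8, 4)}")   # 1/70, 16/70, 36/70, 16/70, 1/70
```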

Other Examples:

3. Quality Control (it has given us terminology)
4. Taste-tests (see exercises)
5. Adding water to milk... see M&M2 Example 6.6 p448
6. Water divining... see M&M2 exercise 6.44 p471
7. Randomness of the U.S. Draft Lottery of 1970... see M&M2 Example 6.6, p105-107, and 447-
8. Births in New York City after the "Great Blackout"
9. John Arbuthnot's "argument for divine providence"
10. US Presidential elections: Taller vs. Shorter Candidate.


Elements of a Statistical Test ... illustrated with the Preston-Jones case

The ingredients and the methods of procedure in a statistical test are:

1. A claim about a parameter (or about the shape of a distribution, or the way a lottery works, etc.). Note that the null and alternative hypotheses are usually stated using Greek letters, i.e. in terms of population parameters, and in advance of (and indeed without any regard for) sample data. [Some have been known to write hypotheses of the form H: ȳ = ..., thereby ignoring the fact that the whole point of statistical inference is to say something about the population in general, and not about the sample one happens to study. It is worth remembering that statistical inference is about the individuals one DID NOT study, not about the ones one did. This point is brought out in the absurdity of a null hypothesis that states that, in a triangle taste test, exactly p = 0.333... of the n = 10 individuals to be studied will correctly identify the one of the three test items that is different from the two others.]

   Preston-Jones case: the (unknown) parameter is the DATE OF CONCEPTION.
   H0: DATE ≤ date HUSBAND LEFT (use = as the 'best case')
   Ha: DATE > date HUSBAND LEFT

2. A probability model (in its simplest form, a set of assumptions) which allows one to predict how a relevant statistic from a sample of data might be expected to behave under H0.

   Preston-Jones case: a probability model for the statistic -- Gaussian? Empirical?

3. A probability level or threshold or dividing point below which (i.e. close to a probability of zero) one considers that an event with this low probability 'is unlikely' or 'is not supposed to happen with a single trial' or 'just doesn't happen'. This pre-established limit of extreme-ness is referred to as the "α (alpha) level" of the test.

   Preston-Jones case: an (a priori) "limit of extreme-ness" relative to H0 -- for the judge to decide. Note that extreme-ness is measured as a conditional probability, not in days.


Elements of a Statistical Test (continued)

4. A sample of data, which under H0 is expected to follow the probability laws in (2).

   Preston-Jones case: data = date of delivery.

5. The most relevant statistic (e.g. ȳ if interested in inference about the parameter µ).

   Preston-Jones case: the date of delivery (the same as the raw data: n = 1).

6. The probability of observing a value of the statistic as extreme or more extreme (relative to that hypothesized under H0) than we observed. This is used to judge whether the value obtained is either 'close to', i.e. 'compatible with', or 'far away from', i.e. 'incompatible with', H0. The 'distance from what is expected under H0' is usually measured as a tail area or probability and is referred to as the "P-value" of the statistic in relation to H0.

   Preston-Jones case: P-value = upper tail area: Prob[ 349 or 350 or 351 ... days ]: quite small. (A hedged numerical sketch appears after this list.)

7. A comparison of this "extreme-ness" or "unusualness" or "amount of evidence against H0" or P-value with the agreed-on "threshold of extreme-ness". If it is beyond the limit, H0 is said to be "rejected". If it is not-too-small, H0 is "not rejected". These two possible decisions about the claim are reported as "the null hypothesis is rejected at the α significance level" or "the null hypothesis is not rejected at a significance level of 5%".

   Preston-Jones case: the judge didn't tell us his threshold, but it must have been smaller than that calculated in 6.

Note: the p-value does not take into account any other 'facts', prior beliefs, testimonials, etc. in the case. But the judge probably used them in his overall decision (just like the jury did in the OJ case).
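The promised sketch of element 6. The handout does not give a distribution for gestation length, so the Gaussian model and its mean and SD below are purely hypothetical, chosen only to show the mechanics of an upper-tail p-value:

```python
from scipy.stats import norm

# Hypothetical model for gestation length under H0 (conception at departure):
# Gaussian with mean 280 days and SD 10 days -- assumed values, not from the handout.
mean_days, sd_days = 280.0, 10.0
observed = 349

p_upper_tail = norm.sf(observed, loc=mean_days, scale=sd_days)  # Prob[349 or more]
print(p_upper_tail)   # 'quite small' -- far out in the upper tail
```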


M&M Ch 6.3 and 6.4 Introduction to Inference ... Use and Misuse of Statistical Tests

"Operating" Characteristics of a Statistical TestThe quantities (1 - β) and (1 - α) are the "sensitivity(power)" and "specificity" of the statistical test.Statisticians usually speak instead of the complements ofthese probabilities, the false positive fraction (α ) and thefalse negative fraction (β) as "Type I" and "Type II" errorsrespectively [It is interesting that those involved indiagnostic tests emphasize the correctness of the testresults, whereas statisticians seem to dwell on the errors ofthe tests; they have no term for 1-α ].

As with diagnostic tests, there are 2 ways statistical testcan be wrong:

1) The null hypothesis was in fact correct but the

sample was genuinely extreme and the null

hypothesis was therefore (wrongly) rejected.

2) The alternative hypothesis was in fact correct but

the sample was not incompatible with the null

hypothesis and so it was not ruled out. Note that all of the probabilities start with (i.e. areconditional on knowing) the truth. This is exactlyanalogous to the use of sensitivity and specificity ofdiagnostic tests to describe the performance of the tests,conditional on (i.e. given) the truth. As such, they describeperformance in a "what if" or artificial situation, just assensitivity and specificity are determined under 'lab'conditions.

The probabilities of the various test results can be put inthe same type of 2x2 table used to show thecharacteristics of a diagnostic test.

Result of Statistical Test

"Negative"(do not

reject H0)

"Positive"(reject H0 in

favour of Ha) So just as we cannot interpret the result of a Dx testsimply on basis of sensitivity and specificity, likewise wecannot interpret the result of a statistical test in isolationfrom what one already thinks about the null/alternativehypotheses.

H0 1 - α α

TRUTH

Ha β 1 - β


Interpretation of a "positive statistical test"

It should be interpreted in the same way as a "positive diagnostic test", i.e. in the light of the characteristics of the subject being examined. The lower the prevalence of disease, the lower is the post-test probability that a positive diagnostic test is a "true positive". Similarly with statistical tests. We are now no longer speaking of sensitivity = Prob( test + | Ha ) and specificity = Prob( test - | H0 ) but rather, the other way round, of Prob( Ha | test + ) and Prob( H0 | test - ), i.e. of positive and negative predictive values, both of which involve the "background" from which the sample came.

A Popular Misapprehension: It is not uncommon to see or hear seemingly knowledgeable people state that

   "the P-value (or alpha) is the probability of being wrong if, upon observing a statistically significant difference, we assert that a true difference exists".

But if one follows the analogy with diagnostic tests, this statement is like saying that "1-minus-specificity is the probability of being wrong if, upon observing a positive test, we assert that the person is diseased". We know [from dealing with diagnostic tests] that we cannot turn Prob( test | H ) into Prob( H | test ) without some knowledge about the unconditional or a-priori Prob( H )'s.

Glantz (in his otherwise excellent text) and Brown (Am J Dis Child 137: 586-591, 1983 -- on reserve) are two authors who have made statements like this. For example, Brown, in an otherwise helpful article, says (italics and strike-through by JH):

   "In practical terms, the alpha of .05 means that the researcher, during the course of many such decisions, accepts being wrong one in about every 20 times that he thinks he has found an important difference between two sets of observations." 1

1 [Incidentally, there is a second error in this statement: it has to do with equating a "statistically significant" difference with an important one... minute differences in the means of large samples will be statistically significant.]

The influence of "background" is easily understood if one considers an example such as a testing program for potential chemotherapeutic agents. Assume a certain proportion P are truly active, and that statistical testing of them uses Type I and Type II errors of α and β respectively. A certain proportion of all the agents will test positive, but what fraction of these "positives" are truly positive? It obviously depends on α and β, but it also depends in a big way on P, as is shown below for the case of α = 0.05, β = 0.2 (reproduced in the short calculation after this section):

   P               -->   0.001     .01      .1       .5
   TP = P(1-β)     -->   .00080    .0080    .080     .400
   FP = (1-P)(α)   -->   .04995    .0495    .045     .025
   Ratio TP : FP   -->   ≈ 1:62    ≈ 1:6    ≈ 2:1    ≈ 16:1

Note that the post-test odds TP:FP is

   P(1-β) : (1-P)(α)  =  { P : (1-P) } × [ (1-β)/α ]

                      =  PRIOR odds × function of the TEST's characteristics

i.e. it has the form of a "prior odds" P : (1-P), the "background" of the study, multiplied by a "likelihood ratio positive" which depends only on the characteristics of the statistical test. The text by Oakes is helpful here.


"SIGNIFICANCE" The difference between two treatments is 'statistically significant' if itis sufficiently large that it is unlikely to have risen by chance alone.The level of significance is the probability of such a large differencearising in a trial when there really is no difference in the effects ofthe treatments. (But the lower the probability, the less likely is it thatthe difference is due to chance, and so the more highly significant isthe finding.)

notes prepared by FDK Liddell, ~1970

And then, even if the cure should be performed, how can he be surethat this was not because the illness had reached its term, or a resultof chance, or the effect of something else he had eaten or drunk ortouched that day, or the merit of his grandmother's prayers?Moreover, even if this proof had been perfect, how many times wasthe experiment repeated? How many times was the long string ofchances and coincidences strung again for a rule to be derived fromit?

Michel de Montaigne 1533-1592

• Statistical significance does not imply clinical importance.

• Even a very unlikely (i.e. highly significant) difference may beunimportant.

The same arguments which explode the Notion of Luck may, on theother side, be useful in some Cases to establish a due comparisonbetween Chance and Design. We may imagine Chance and Designto be as it were in Competition with each other for the production ofsome sorts of Events, and may calculate what Probability there is,that those Events should be rather owing to one than to the other...From this last Consideration we may learn in many Cases how todistinguish the Events which are the effect of Chance, from thosewhich are produced by Design.

Abraham de Moivre: 'Doctrine of Chances' (1719)

• Non-significance does not mean no real difference exists.

• A significant difference is not necessarily reliable.

• Statistical significance is not proof that a real difference exists.

• There is no 'God-given' level of significance. What level would yourequire before being convinced:

a to use a drug (without side effects) in the treatment of lungcancer?

b that effects on the foetus are excluded in a drug whichdepresses nausea in pregnancy?

If we... agree that an event which would occur by chance only oncein (so many) trials is decidedly 'significant', in the statistical sense,we thereby admit that no isolated experiment, however significant initself, can suffice for the experimental demonstration of any naturalphenomenon; for the 'one chance in a million' will undoubtedlyoccur, with no less and no more than its appropriate frequency,however surprised we may be that it should occur to us.

R A Fisher 'The Design of Experiments'(First published 1935)

c to go on a second stage of a series of experiments with rats?

• Each statistical test (i.e. calculation of level of significance, orunlikelihood of observed difference) must be strictly independentof every other such test. Otherwise, the calculated probabilities willnot be valid. This rule is often ignored by those who:

- measure more than on response in each subject- have more than two treatment groups to compare- stop the experiment at a favourable point.


The (many) ways to (in)correctly describe a confidence interval and to talk about p-values (q's from 2nd edition of M&M Chapter 6; answers anonymous)

Below are some previous students' answers to questions from the 2nd Edition of Moore and McCabe, Chapter 6. For each answer, say whether the statement/explanation is correct and why. [Comments follow each answer, marked •.]

Question 6.2

A New York Times poll on women's issues interviewed 1025 women and 472 men randomly selected from the United States, excluding Alaska and Hawaii. The poll found that 47% of the women said they do not get enough time for themselves.

(a) The poll announced a margin of error of ±3 percentage points for 95% confidence in conclusions about women. Explain to someone who knows no statistics why we can't just say that 47% of all adult women do not get enough time for themselves.

(b) Then explain clearly what "95% confidence" means.

(c) The margin of error for results concerning men was ±4 percentage points. Why is this larger than the margin of error for women?

Students' answers (and comments):

1. "True value will be between 43 & 50% in 95% of repeated samples of the same size."
   • No. The estimate will be between µ – margin & µ + margin in 95% of samples.

2. "Pollsters say their survey method has a 95% chance of producing a range of percentages that includes π."
   • Good. Emphasizes average performance in repeated applications of the method.

3. "If this same poll were repeated many times, then 95 of every 100 such polls would give a range that included 47%."
   • No! See 1.

4. "You're pretty sure that the true percentage π is within 3% of 47%. '95% confidence' means that 95% of the time, a random poll of this size will produce results within 3% of π."
   • Bayesians would object (and rightly so!) to this use of the "true parameter" as the subject of the sentence. They would insist you use the statistic as the subject of the sentence and the parameter as the object.

5. "If took 100 different samples, in 95% of cases the sample proportion will be between 44% and 50%."
   • NO! The sample proportion will be between truth – 3% & truth + 3% in 95% of them.

6. "With this one sample taken, we are sure 95 times out of 100 that 41-53% of the women surveyed do not get enough time for themselves."
   • NO. 95/100 times the estimate will be within 3% of π, i.e., the estimate will be in the interval π – margin to π + margin. The method used gives correct results 95% of the time.

7. "In 95 of 100 comparable polls, expect 44-50% of women will give the same answer."
   • NO. Same answer? As what?

   "Given a parameter, we are 95% sure that the mean of this parameter falls in a certain interval."
   • Not given a parameter (ever). If we were, wouldn't need this course! "Mean of a parameter" makes no sense in frequentist inference.

8. "Using the poll procedure in which the CI, or rather the true percentage, is within +/- 3, you cover the true percentage 95% of the times it is applied."
   • A bit muddled... but "correct in 95% of applications" is accurate.

9. "Confident that a poll (such) as this one would have measured correctly that the true proportion lies between in 95%."
   • ??? [I have trouble parsing this!] In 95% of applications/uses, polls like these come within ±3% of the truth.

10. "95% chance that the info is correct for between 44 and 50% of women."
   • ??? 95% confidence is in the procedure that produced the interval 44-50.

11. "95% confidence -> 95% of the time, the proportion given is the good proportion (if we interviewed other groups)."
   • "Correct in 95% of applications." Good to connect the 95% with the long run, not specifically with this one estimate. Always ask yourself: what do I mean by "95% of the time"? If you substitute "applications" for "time", it becomes clearer.

12. "It means that 47%, give or take 3%, is an accurate estimate of the population mean 19 times out of 20 such samplings."
   • ??? 95% of applications of the CI give the correct answer. How can the same interval 47% ± 3 be accurate in 19 but not in the other 1?

Q6.4: "This result is trustworthy 19 times out of 20."
   • ??? "This" result: cf. the distinction between "my operation is successful 19 times out of 20..." and "operations like the one to be done on me are successful 19 times out of 20".

"95% of all samples we could select would give intervals between 8669 and 8811."
   • Surely not!


Question 6.18

The Gallup Poll asked 1571 adults what they considered to be the most serious problem facing the nation's public schools; 30% said drugs. This sample percent is an estimate of the percent of all adults who think that drugs are the schools' most serious problem. The news article reporting the poll result adds, "The poll has a margin of error -- the measure of its statistical accuracy -- of three percentage points in either direction; aside from this imprecision inherent in using a sample to represent the whole, such practical factors as the wording of questions can affect how closely a poll reflects the opinion of the public in general" (The New York Times, August 31, 1987).

The Gallup Poll uses a complex multistage sample design, but the sample percent has approximately a normal distribution. Moreover, it is standard practice to announce the margin of error for a 95% confidence interval unless a different confidence level is stated.

(a) The announced poll result was 30% ± 3%. Can we be certain that the true population percent falls in this interval?

(b) Explain to someone who knows no statistics what the announced result 30% ± 3% means.

(c) This confidence interval has the same form we have met earlier: estimate ± Z* σ_estimate. (Actually σ is estimated from the data, but we ignore this for now.) What is the standard deviation σ_estimate of the estimated percent?

(d) Does the announced margin of error include errors due to practical problems such as undercoverage and nonresponse?

Students' answers (and comments):

1. "This means that the population result will be between 27% and 33% 19/20 times."
   • NO! The population result is wherever it is, and it doesn't move. Think of it as if it were the speed of light.

2. "95% of the time the actual truth will be between 30 ± 3%, and 5% [of the time] it will be false."
   • It either is or it isn't... the truth doesn't vary over samplings.

3. "If this poll were repeated very many times, then 95 of 100 intervals would include 30%."
   • NO. 95% of polls give an answer within 3% of the truth, NOT within 3% of the mean in this sample.

4. "Interval of true values ranges b/w 27% + 33%."
   • ??? There is only one true value. AND, it isn't 'going' or 'ranging' or 'moving' anywhere!

5. "Confident that in repeated samples the estimate would fall in this range 95/100 times."
   • NO. The estimate falls within 3% of π in 95% of applications.

6. "95% of intervals will contain the true parameter value and 5% will not. Cannot know whether the result of applying a CI to a particular set of data is correct."
   • GOOD. Say "Cannot know whether the CI derived from a particular set of data is correct." Know about the behaviour of the procedure! If not from Mars (i.e. if you use past info), you might be able to bet more intelligently on whether it does or not.

7. "In 1/20 times, the question will yield answers that do not fall into this interval."
   • No. In 5% of applications, the estimate will be more than 3% away from the true answer. See 1, 2, 3 above.

8. "This type of poll will give an estimate of 27 to 33% 19 times out of 20 times."
   • NO. Won't give 27 to 33 19/20 times. The estimate will be within ±3 of the truth in 19/20 applications.

9. "5% risk that µ is not in this interval."
   • ??? If an after-the-fact statement, somewhat inaccurate.

10. "95 out of 100 times when doing the calculations, the result 27-33% would appear."
   • No it wouldn't. See 1, 2, 3, 7.

11. "95% prob the computed interval will cover the parameter."
   • Accurate if viewed as a prediction.

12. "The true popl'n mean will fall within the interval 27-33 in 95% of samples drawn."
   • NO. The true popl'n mean will not "fall" anywhere. It's a fixed, unknowable constant. Estimates may fall around it.


Question 6.22

In each of the following situations, a significance test for a population mean µ is called for. State the null hypothesis H0 and the alternative hypothesis Ha in each case.

(a) Experiments on learning in animals sometimes measure how long it takes a mouse to find its way through a maze. The mean time is 18 seconds for one particular maze. A researcher thinks that a loud noise will cause the mice to complete the maze faster. She measures how long each of 10 mice takes with a noise as stimulus.

(b) The examinations in a large accounting class are scaled after grading so that the mean score is 50. A self-confident teaching assistant thinks that his students this semester can be considered a sample from the population of all students he might teach, so he compares their mean score with 50.

(c) A university gives credit in French language courses to students who pass a placement test. The language department wants to know if students who get credit in this way differ in their understanding of spoken French from students who actually take the French courses. Some faculty think the students who test out of the courses are better, but others argue that they are weaker in oral comprehension. Experience has shown that the mean score of students in the courses on a standard listening test is 24. The language department gives the same listening test to a sample of 40 students who passed the credit examination to see if their performance is different.

Students' answers (and comments):

1. "Ho: Is there good evidence against the claim that πmale > πfemale? Ha: Fail to give evidence against the claim that πmale > πfemale."
   • NO. Hypotheses do not include statements about data or evidence. This student mixed parameters and statistics/data. Put Ho, Ha in terms of the parameters πmale vs πfemale only; H's have nothing to do with new data; evidence has to do with p-values, data.

2. "x̄ = average time of 10 mice w/ loud noise. Ho: µ - x̄ = 0, or µ = x̄."
   • NO! Ho must be in terms of parameter(s). IT MUST NOT SPEAK OF DATA.

3. "Ho: a loud noise has no effect on the rapidity of the mouse to find its way through the maze."
   • OK if being generic, but not if it makes a prediction about a specific mouse (it sounds like this student was talking about a specific mouse). H is about the mean of a population, i.e. about mice (plural). It is not about the 10 mice in the study!

Question 6.24

A randomized comparative experiment examined whether a calcium supplement in the diet reduces the blood pressure of healthy men. The subjects received either a calcium supplement or a placebo for 12 weeks. The statistical analysis was quite complex, but one conclusion was that "the calcium group had lower seated systolic blood pressure (P=0.008) compared with the placebo group." Explain this conclusion, especially the P-value, as if you were speaking to a doctor who knows no statistics. (From R.M. Lyle et al., "Blood pressure and metabolic effects of calcium supplement in normotensive white and black men," Journal of the American Medical Association, 257 (1987), pp. 1772-1776.)

Students' answers (and comments):

1. "The P-value is a probability: 'P=0.008' means 0.8%. It is the probability, assuming the null hypothesis is true, that a sample (similar in size and characteristics to those in the study) would have an average BP this far (or further) below the placebo group's average BP. In other words, if the null hypothesis is really true, what's the chance 2 groups of subjects would have results this different or more different?"
   • Not bad!

2. "Only a 0.008 chance of finding this difference by chance if, in the population, there really was no difference between the treatment and control groups."
   • Good!

3. "The p-value of .008 means that the probability of the observed results, if there is, in fact, no difference between the 'calcium' and 'placebo' groups, is 8/1000 or 1/125."
   • Good, but would change it to "the observed results or results more extreme".


(Question 6.24 answers, continued)

4. "The p-value measures the probability or chance that the calcium supplement had no effect."
   • No. First, Ho and Ha refer not just to the n subjects studied, but to all subjects like them. They should be stated in the present (or even future) tense. Second, the p-value is about data, under the null H. It is not about the credibility of Ho or Ha.

5. "There is strong evidence that Ca supplement in the diet reduces the blood pressure of healthy men. The probability of this being a wrong conclusion, according to the procedure and data collected, is only 0.008 (i.e. 0.8%)."
   • Stick to the "official wording": ... IF Ca makes no ∆ to average BP, the chance of getting ... Notice the illegal attempt to make the p-value into a predictive value -- about as illegal as a statistician trying to interpret a medical test that gave a reading in the top percentile of the 'healthy' population -- without even seeing the patient!

6. "Only 0.8% chance that the lower BP in the Calcium group is lower than placebo due to chance."
   • If Ca++ does nothing, then the prob. of obtaining a result ≥ this extreme is ... The wording borders on the illegal.

7. "The chance that the supplement made no change or raised the B/P is very slim."
   • NO! The p-value is a conditional statement, predicated on (calculated on the supposition that) Ca makes no difference to µ. H's are often stated in the present tense; the p-value is more 'after the data', in the 'past-conditional' tense. Again, wording bordering on the illegal.

8. "There is a 0.8% chance that this difference is due to chance alone and a 99.2% chance that this difference is a true difference."
   • Not really... Just like the previous statements, this type of language is crossing over into Bayes-Speak.

9. "There has been a significant reduction in the BP of the treated group... there's only a probability of 0.8% that this is due to chance alone."
   • NO. Cannot talk about the cause... Can say "IF no other cause than chance, then the prob. of getting a difference ≥ this size is ..."

Question 6.32

The level of calcium in the blood in healthy young adults varies with mean about 9.5 milligrams per deciliter and standard deviation about σ = 0.4. A clinic in rural Guatemala measures the blood calcium level of 180 healthy pregnant women at their first visit for prenatal care. The mean is x̄ = 9.57. Is this an indication that the mean calcium level in the population from which these women come differs from 9.5?

(a) State Ho and Ha.

(b) Carry out the test and give the P-value, assuming that σ = 0.4 in this population. Report your conclusion.

(c) The 95% confidence interval for the mean calcium level µ in this population is obtained from the margin of error, namely 1.96 × 0.4/√180 = 0.058, i.e. as 9.57 ± 0.058, or 9.512 to 9.628. We are confident that µ lies quite close to 9.5. This illustrates the fact that a test based on a large sample (n=180 here) will often declare even a small deviation from Ho to be statistically significant.

Students' answers (and comments):

1. "95% of the time, the mean will be included in the interval of 9.512 to 9.628, and 5% [of the time] it will be missed."
   • No. See 1, 2, 3 in Q6.18 above.

2. "Ho: There is no difference between the sample area and the population area: H0: µ = x̄. Ha: There is a significant difference between the sample mean and the population area."
   • NO. This is quite muddled. Unless one takes a census, there will always -- because of sampling variability -- be some non-zero difference between x̄ and µ. The question posed is whether the "mean calcium level (µ) in the population from which these women come differs from 9.5". ALSO: must state H's in terms of PARAMETERS. Here there is one population. If two populations, identify them by subscripts, e.g. Ho: µarea1 = µarea2. "Significant" is used to interpret data (and can be roughly paraphrased as "evidence that the true parameter is non-zero"). Do not use "significant" when stating hypotheses.

   PS: A professor in the dept. of Math and Statistics questioned what we in Epi and Biostat are teaching, after he saw, in a grant proposal submitted by one of our doctoral students (now a faculty member!), a statement of the type H0: µ = x̄. Please do not give our critic any such ammunition! -- JH
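A short check of the arithmetic behind Question 6.32 (b) and (c), using the handout's own numbers:

```python
import math
from scipy.stats import norm

# z-test and 95% CI for mu, with sigma taken as 0.4 (as the question instructs).
sigma, n, xbar, mu0 = 0.4, 180, 9.57, 9.5

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_two_sided = 2 * norm.sf(abs(z))
margin = 1.96 * sigma / math.sqrt(n)

print(round(z, 2), round(p_two_sided, 3))                 # z ~ 2.35, P ~ 0.019 ('about 2%')
print(round(xbar - margin, 3), round(xbar + margin, 3))   # ~ 9.512 to 9.628
```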


(Question 6.32 answers, continued)

3. "95% CI: µ ± 1.96 σ/√180"
   • NO. The CI for µ is x̄ ± 1.96 σ/√180 !!!! If we knew µ, we would say µ ± 0 !! -- and we wouldn't need a statistics course!

4. "Ho: µ = x̄ = 9.57; Ha: µ ≠ x̄ = 9.57"
   • NO. Cannot use sample values in hypotheses. Must use parameters.

5. "µ differs from 9.5, and the probability that this difference is only due to chance is 2%."
   • Correct to say that "we found evidence that µ differs from 9.5". In frequentist inference, one can speak probabilistically only about data (such as x̄). This mis-speak illustrates that we would indeed prefer to speak about µ rather than about the data in a sample. We should indeed start to adopt a formal Bayesian way of speaking, and not 'mix our metaphors' as we commonly do when we try to stay within the confines of the frequentist approach. What does "this difference" mean? Should not speak about the probability that this difference is only due to chance.

6. "Ho: µ = 9.5; Ha: µ ≠ 9.5"
   • Correct. Notice that Ho & Ha say nothing about what you will find in your data.

7. Q6.44: "Ho: x̄ = 0.5; Ha: x̄ > 0.5"
   • NO. Must write H's in terms of parameters!

8. "There is a 0.96% probability that this difference is due to chance alone."
   • NO. This shorthand is so short that it misleads. If you want to keep it short, say something like "the difference is larger than expected under sampling variation alone". Don't get into the attribution of cause.

Rather than leave this column blank...
http://www.stat.psu.edu/~resources/Cartoons/
http://watson.hgen.pitt.edu/humor/


M&M Ch 7.1 (FREQUENTIST) Inference for µ

Inference for µ: A&B Ch 7.1; Colton Ch 4; "Student"'s t distribution

Small n, with σ replaced by s (an estimate of σ) in CI's and tests => "Student"'s t distribution.

(1) Assume that either (a) the Y values are normally distributed or (b) if not, n is large enough so that the Central Limit Theorem guarantees that the distribution of possible ȳ's is well enough approximated by a Gaussian distribution.

(2) Choose the desired degree of confidence [50%, 80%, 90%, 99%, ...] as before.

(3) Proceed as above, except use the t distribution rather than Z -- find the t value such that xx% of the distribution is between –t and +t. The cutpoints for %-iles of the t distribution vary with the amount of data used to estimate σ.

"Student"'s t distribution (continued)

The t distribution is the (conceptual) distribution (histogram of the sampling distribution) one gets if one were to:

• take samples (of a given size n) from a Normal(µ, σ) distribution;
• form the quantity t = (x̄ – µ) / (s/√n) from each sample;
• compile a histogram of the results.

Properties:

• t is symmetric around 0 (just like Z = (x̄ – µ)/(σ/√n)).

• t has a shape like that of the Z distribution, but with SD slightly larger than unity, i.e. slightly flatter & more wide-tailed; Var(t) = df/(df – 2).

• Its shape becomes indistinguishable from the Z distribution as n -> ∞ (in fact as n goes much beyond 30).

• Instead of x̄ ± 1.96 σ/√n to enclose µ with 95% confidence, we need:

      Multiple     n      degrees of freedom ('df')
      ± 3.182      4        3
      ± 2.228     11       10
      ± 2.086     21       20
      ± 2.042     31       30
      ± 1.980    121      120
      ± 1.96       ∞        ∞

Or, in Gosset's own words (W.S. Gosset, 1908):

"Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically [i.e. by simulation]. The material I used was a ... table containing the height and left middle finger measurements of 3000 criminals.... The measurements were written out on 3000 pieces of cardboard, which were then very thoroughly shuffled and drawn at random... each consecutive set of 4 was taken as a sample... [i.e. n=4 above]... and the mean [and] standard deviation of each sample determined.... This provides us with two sets of... 750 z's on which to test the theoretical results arrived at. The height and left middle finger... table was chosen because the distribution of both was approximately normal..."

• Test of µ = µ0:  Ratio = (x̄ – µ0) / (s/√n).     CI for µ:  x̄ ± t s/√n.
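The multiples in the table above, and a t-based CI, can be computed directly; the sample values in the second half are invented for illustration:

```python
import math
from scipy.stats import t

# The 95% t multiples for the df's tabulated above.
for df in (3, 10, 20, 30, 120):
    print(df, round(t.ppf(0.975, df), 3))   # 3.182, 2.228, 2.086, 2.042, 1.980

# CI for mu from a sample mean x̄, SD s, and size n (illustrative numbers):
xbar, s, n = 50.0, 8.0, 11
mult = t.ppf(0.975, n - 1)                  # 2.228 for 10 df
half = mult * s / math.sqrt(n)
print(round(xbar - half, 1), round(xbar + half, 1))
```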


WORKED EXAMPLE: CI and Test of Significance

Response of interest: D = INCREASE IN HOURS OF SLEEP with DRUG.

Test: H0: µD = 0 vs Halt: µD ≠ 0; α = 0.05 (2-sided).

Data:

              HOURS of SLEEP†         DIFFERENCE
   Subject    DRUG      PLACEBO       Drug - Placebo
      1       6.1       5.2             0.9
      2       7.0       7.9            -0.9
      3       8.2       3.9             4.3
      4        .         .              2.9
      5        .         .              1.2
      6        .         .              3.0
      7        .         .              2.7
      8        .         .              0.6
      9        .         .              3.6
     10        .         .             -0.5

† NOTE: I deliberately omitted the full data on the drug and placebo conditions: all we need for the analysis are the 10 differences.

d̄ = 1.78 ; SD of the 10 differences, SD[d] = 1.77.

Test statistic = (1.78 - 0) / (1.77/√10) = 3.18. Critical ratio: |t9| = 2.26.

Since 3.18 > 2.26, "Reject" H0.

95% CI for µD = 1.78 ± t9 × 1.77/√10 = 1.78 ± 1.26 = 0.5 to 3.0 hours.

What if we are not sure the d's come from a Gaussian distribution? [For t, we need Gaussian data or (via the CLT) a Gaussian statistic (d̄).]

WORKED EXAMPLE: C P G Barker, The Lancet, Vol 345, April 22, 1995, p 1047.

Posture, blood flow, and prophylaxis of venous thromboembolism

Sir -- Ashby and colleagues (Feb 18, p 419) report adverse effects of posture on femoral venous blood flow. They noted a moderate reduction in velocity when a patient was sitting propped up at 35° in a hospital bed and a further pronounced reduction when the patient was sitting with legs dependent. Patients recovering from operations are often asked to sit in a chair with their feet elevated on a footrest. The footrests used in most hospitals, while raising the feet, compress the posterior aspect of the calf. Such compression may be important in the aetiology of venous thromboembolism. We investigated the effect of a footrest on blood flow in the deep veins of the calf by dynamic radionuclide venography.

Calf venous blood flow was measured in fifteen young (18-31 years) healthy male volunteers. 88 MBq technetium-99m-labelled pertechnetate in 1 mL saline was injected into the lateral dorsal vein of each foot, with ankle tourniquets inflated to 40 mm Hg, and the time the bolus took to reach the lower border of the patella was measured (Sophy DSX Rectangular Gamma Camera). Each subject had one foot elevated with the calf resting on the footrest and the other plantegrade on the floor as a control. The mean transit time of the bolus to the knee was 24.6 s (SE 2.2) for elevated feet and 14.8 s (SE 2.2) for control feet [see figure overleaf]. The mean delay was 9.9 s (95% CI 7.8-12.0).

Simple leg elevation without hip flexion increases leg venous drainage and femoral venous blood flow. The footrest used in this study raises the foot by extension at the knee with no change in the hip position. Ashby and colleagues' findings suggest that such elevation without calf compression would produce an increase in blood flow. Direct pressure on the posterior aspect of the calf therefore seems to be the most likely reason for the reduction in flow we observed. Sitting cross-legged also reduces calf venous blood flow, probably by a similar mechanism. If venous stasis is important in the aetiology of venous thrombosis, the practice of nursing patients with their feet elevated on footrests may need to be reviewed.

JH's analysis of the raw data [data abstracted by eye, so my calculations won't match exactly those in the text]:

d̄ (SD) = 9.8 (4.1) ;  t = (9.8 - 0) / (4.1/√15) = 9.8/1.0 = 9.8 > t14,0.05 = 2.145 ; the difference is 'off the t-scale'.

95% CI for µD: 9.8 ± 2.145 × 1.0 = 7.7 to 11.9 s.
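A sketch of the first worked example's arithmetic; the 10 differences are exactly those in the table above:

```python
import math
from statistics import mean, stdev
from scipy.stats import t

# Paired analysis of the drug-minus-placebo differences in hours of sleep.
d = [0.9, -0.9, 4.3, 2.9, 1.2, 3.0, 2.7, 0.6, 3.6, -0.5]

n, dbar, sd = len(d), mean(d), stdev(d)
t_stat = (dbar - 0) / (sd / math.sqrt(n))
mult = t.ppf(0.975, n - 1)                     # ~2.26 for 9 df

print(round(dbar, 2), round(sd, 2), round(t_stat, 2))   # 1.78, 1.77, 3.18
print(round(dbar - mult * sd / math.sqrt(n), 1),
      round(dbar + mult * sd / math.sqrt(n), 1))        # 0.5 to 3.0 hours
```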


WORKED EXAMPLE: Leg Elevation (continued)

[Figure: transit time (s), 0 to 50, under the 'No FootRest' and 'FootRest' conditions, by subject.]

   No FootRest   FootRest   Delay
       38           48        10
       26           32         6
       21           28         7
       18           27         9
       16           21         5
       15           22         7
       14           25        11
       12           28        16
       12           31        19
       12           25        13
       11           20         9
        8           13         5
        7           17        10
        7           14         7
        5           18        13

   mean    14.8      24.6      9.8
   SD       8.5       8.7      4.1
   SEM      2.2       2.2      1.0

Remarks: Whereas the mean of the 15 differences between the 2 conditions is arithmetically equal to the difference of the 2 means of 15, the SE of the mean of these 15 differences is not the same as the SE of the difference of two independent means. In general,

      Var( ȳ1 – ȳ2 ) = Var( ȳ1 ) + Var( ȳ2 ) – 2 Covariance( ȳ1, ȳ2 )

The authors continue to report the SE of each of the 2 means, but these are of little use here, since we are not interested in the means per se, but in the mean difference.

Calculating Var( ȳ1 – ȳ2 ) = Var( ȳ1 ) + Var( ȳ2 ) assumes that we used one set of 15 subjects for the No FootRest condition and a different set of 15 for the FootRest condition -- a much noisier contrast. As it is, even this inefficient analysis would have sufficed here, because the 'signal' was so much greater than the 'noise'.

See the article On Reserve on the display of data from pairs.

Sample Size for CI's and tests involving µ

n to yield a (2-sided) CI with margin of error m at confidence level 1 – α (see M&M p 447); a short numerical sketch follows:

      |<-- margin of error -->|
      (-----------•-----------)

• large-sample CI: x̄ ± Z SE(x̄) = x̄ ± m

• SE(x̄) = σ/√n, so...

      n = σ² Z² / m²

Remarks: If n is small, replace Zα/2 by tα/2. Typically, one won't know σ, so use a guesstimate. In planning n for the example just discussed, the authors might have had pilot data on inter-leg differences in transit time -- with both legs in the No FootRest position. Sometimes, one has to 'ask around' as to what the SD of the d's will be. It is always safer to assume a higher SD than might turn out to be the case.
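The promised sketch of n = σ²Z²/m². The SD guesstimate is the SD of the delays in the example above; the target margin of error of 2 s is an assumed value chosen only for illustration:

```python
import math

# Sample size for a CI with a chosen margin of error.
sigma_guess = 4.1   # SD of the d's, from the leg-elevation worked example
z = 1.96            # 95% confidence
m = 2.0             # desired margin of error, in seconds (assumed)

n = (sigma_guess**2) * (z**2) / m**2
print(math.ceil(n))  # round up to the next whole subject
```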


Sample Size for CI's and tests involving µ, cont'd

n for power 1 – β if the true mean is ∆ units from µ0 (the test value); type I error = α (cf Colton p142, or the CRC table on the next page):

[Figure: two sampling distributions of x̄, centred at µ0 and at µalt, with ∆ = µalt – µ0; the tail areas α/2 (under µ0, beyond the critical value) and β (under µalt, short of it) are marked.]

Need Zα/2 SE(x̄) – Zβ SE(x̄) ≤ ∆. Substitute SE(x̄) = σ/√n and solve for n:

      n = { Zα/2 – Zβ }² σ² / ∆²

If power is > 0.5, then β < 0.5, and Zβ < 0.

e.g. α = 0.05, β = 0.2  =>  Zα/2 = 1.96, Zβ = –0.84.

Technically, if n is small, use the t-test... see the table on the next page.

The question of what ∆ to use is not a matter of statistics or samples, or of what the last guy found in a study, but rather "the difference that makes a difference" -- i.e. it is a clinical judgement, and includes the impact, cost, alternatives, etc. It is the ∆ that, IF TRUE, would lead to a difference in management, or a substantial risk, or whatever.

Sign Test for a median

Test: H0: MedianD = 0 vs Halt: MedianD ≠ 0; α = 0.05 (2-sided).

      DIFFERENCE          SIGN
      Drug – Placebo      of d
       0.9                 +
      -0.9                 –
       4.3                 +
       2.9                 +
       1.2                 +
       3.0                 +
       2.7                 +
       0.6                 +
       3.6                 +
      -0.5                 –
      ∑: 8 +, 2 –

Reference distribution: Binomial [ n=10; π(+) = 0.5 ]. See Table C (last column of p T9) or the Sign Test Table which I have provided in the Chapter on Distribution-free Methods.

Upper tail: Prob( ≥ 8 '+' | π = 0.5 ) = 0.0439 + 0.0098 + 0.0010 = 0.0547

2 tails: P = 0.0547 + 0.0547 = 0.1094

P > 0.05 (2-sided). (The sign test is less powerful than the t-test.)
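The sign-test tail areas above, computed from the binomial reference distribution:

```python
from scipy.stats import binom

# Sign test: 8 '+' out of 10 non-zero differences, reference Binomial(10, 0.5).
n, plus = 10, 8

p_upper = binom.sf(plus - 1, n, 0.5)   # Prob( >= 8 '+' ) = P(8)+P(9)+P(10)
p_two_sided = 2 * p_upper

print(round(p_upper, 4), round(p_two_sided, 4))   # 0.0547, 0.1094
```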

In the above example on blood flow, the fact that all 15/15 had delays makes any formal test unnecessary... the "Intra-Ocular Traumatic Test" says it all. [Q: could it be that they always raised the left leg, and blood flow is less in the left leg? I doubt it, but I ask the question just to point out that just because we find a numerical difference doesn't necessarily mean that we know what caused the difference.]

A famous scientist begins by removing one leg from an insect and, in an accent I cannot reproduce on paper, says "quick march". The insect walks briskly. The scientist removes another leg, and again, on being told "quick march", the insect walks along... This continues until the last leg has been removed, and the insect no longer walks. Whereupon the scientist, again in an accent I cannot convey here, pronounces: "There! It goes to prove my theory: when you remove the legs from an insect, it cannot hear you anymore!"


Number of Observations to Ensure Specified Power (1−β) if using the 1-sample or paired t-test of a Mean

            α = 0.005 (1-sided)           α = 0.025 (1-sided)           α = 0.05 (1-sided)
            α = 0.01  (2-sided)           α = 0.05  (2-sided)           α = 0.1  (2-sided)
∆/σ    β =  .01  .05  .1   .2   .5   β =  .01  .05  .1   .2   .5   β =  .01  .05  .1   .2   .5     [POWER = 1−β]

0.2          -    -    -    -    -         -    -    -    -   99         -    -    -    -   70
0.3          -    -    -  134   78         -    -  119   90   45         -  122   97   71   32
0.4          -  115   97   77   45       117   84   68   51   26       101   70   55   40   19
0.5        100   75   63   51   30        76   54   44   34   18        65   45   36   27   13
0.6         71   53   45   36   22        53   38   32   24   13        46   32   26   19    9
0.7         53   40   34   28   17        40   29   24   19   10        34   24   19   15    8
0.8         41   32   27   22   14        31   22   19   15    9        27   19   15   12    6
0.9         34   26   22   18   12        25   19   16   12    7        21   15   13   10    5
1.0         28   22   19   16   10        21   16   13   10    6        18   13   11    8    5
1.2         21   16   14   12    8        15   12   10    8    5        13   10    8    6    -
1.4         16   13   12   10    7        12    9    8    7    -        10    8    7    5    -
1.6         13   11   10    8    6        10    8    7    6    -         8    6    6    -    -
1.8         12   10    9    8    6         8    7    6    -    -         7    6    -    -    -
2.0         10    8    8    7    5         7    6    5    -    -         6    -    -    -    -
2.5          8    7    6    6    -         6    -    -    -    -         -    -    -    -    -
3.0          7    6    6    5    -         5    -    -    -    -         -    -    -    -    -

(cells marked "-" were outside the range transcribed from the CRC table)

∆/σ = (µalt − µ0)/σ = "Signal"/"Noise"

Table entries transcribed from Table IV.3 of CRC Tables of Probability and Statistics. Table IV.3 tabulates the n's for Signal/Noise ratios in increments of 0.1, and also includes entries for alpha = 0.01 (1-sided) / 0.02 (2-sided).

See also Colton, page 142

Sample sizes are based on t-tables, and so are slightly larger (and more realistic, when n is small) than those given by the z-based formula: n = (zα/2 − zβ)² (σ/∆)².


"Definitive Negative" Studies? Starch blockers--their effect on calorie absorption from a high-starch meal.

Abstract

It has been known for more than 25 years that certain plant foods, such as kidney beans and wheat, contain a substance that inhibits the activity of salivary and pancreatic amylase. More recently, this antiamylase has been purified and marketed for use in weight control under the generic name "starch blockers." Although this approach to weight control is highly popular, it has never been shown whether starch-blocker tablets actually reduce the absorption of calories from starch. Using a one-day calorie-balance technique and a high-starch (100 g) meal (spaghetti, tomato sauce, and bread), we measured the excretion of fecal calories after normal subjects had taken either placebo or starch-blocker tablets. If the starch-blocker tablets had prevented the digestion of starch, fecal calorie excretion should have increased by 400 kcal. However, fecal calorie excretion was the same on the two test days (mean ± S.E.M., 80 ± 4 as compared with 78 ± 2). We conclude that starch-blocker tablets do not inhibit the digestion and absorption of starch calories in human beings.

Table 1. Standard Test Meal.

Ingredients:
 Spaghetti (dry weight)* ........ 100 g
 Tomato sauce ................... 112 g
 White bread ..................... 50 g
 Margarine ....................... 10 g
 Water .......................... 250 g
 51CrCl3 ......................... 4 µCi
Dietary constituents†:
 Protein ......................... 19 g
 Fat ............................. 14 g
 Carbohydrate (starch) .......... 108 g (97 g)

* Boiled for seven minutes in 1 liter of water.
† Determined by adding the food-table contents of each item.

Table 2. Results in Five Normal Subjects on Days of Placebo and Starch-Blocker Tests.

             Placebo Test Day                        Starch-Blocker Test Day
           Duplicate    Rectal      Marker         Duplicate    Rectal      Marker
Subject    Test Meal*   Effluent    Recovery       Test Meal    Effluent    Recovery
           kcal         kcal        %              kcal         kcal        %
  1          664          81        97.8             665          76        96.6
  2          675          84        95.2             672          84        98.3
  3          682          80        97.4             681          73        94.4
  4          686          67        95.5             675          75       103.6
  5          676          89        96.3             687          83       106.9
Means        677          80        96.4             676          78       100
±S.E.M.      ±4           ±4        ±0.5             ±4           ±2        ±2

* Does not include calories contained in the three placebo tablets (each tablet, 1.2±0.1 kcal) or in the three Carbo-Lite tablets (each tablet, 2.8±0.1 kcal) that were ingested with each test meal.

[Overview of Methods: The one-day calorie-balance technique begins with a preparatory washout in which the entire gastrointestinal tract is cleansed of all food and fecal material by lavage with a special calorie-free, electrolyte-containing solution. The subject then eats the test meal, which includes 51CrCl3 as a non-absorbable marker. After 14 hours, the intestine is cleansed again by a final washout. The rectal effluent is combined with any stool (usually none) that has been excreted since the meal was eaten. The energy content of the ingested meal and of the rectal effluent is determined by bomb calorimetry. The completeness of stool collection is evaluated by recovery of the non-absorbable marker.]

Bo-Linn GW et al. New England Journal of Medicine 307(23):1413-6, 1982 Dec 2.
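[Aside -- not in the original article or notes: a Python sketch of the paired analysis behind the "80 ± 4 vs 78 ± 2" comparison, using the five subjects' rectal-effluent values from Table 2.]

import math
from scipy.stats import t

placebo = [81, 84, 80, 67, 89]              # rectal effluent, kcal
blocker = [76, 84, 73, 75, 83]
d = [b - p for b, p in zip(blocker, placebo)]
n = len(d)
mean_d = sum(d) / n
sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))
half = t.ppf(0.975, n - 1) * sd_d / math.sqrt(n)
print(f"{mean_d:.1f} +/- {half:.1f} kcal")  # -2.0 +/- 7.7 : nowhere near the 400 kcal claim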

[Figure: the study's 95% CI for "kcal blocked", hugging 0, plotted on an axis running from −100 to 400 kcal -- far from the company's claim of about 400 kcal.]

For a good paper on the topic of 'negative' studies, see Powell-Tuck J, "A defence of the small clinical trial: evaluation of three gastroenterological studies." British Medical Journal Clinical Research Ed. 292(6520):599-602, 1986 Mar 1. (Resources for Ch 7)


M&M Ch 7.2 (FREQUENTIST) Inference for µ1 − µ2

Inference for µ1 − µ2 : A&B Ch 7.2; Colton Ch 4

Somewhat more complex than simply replacing σ1 and σ2 by s1 and s2 as estimates of the σ's in CI's and tests. We need to distinguish two theoretical situations (unfortunately seldom clearly distinguishable in practice):

σ1 = σ2 = σ : use a "pooled" estimate s of σ [see M&M page 550] [think of s²pooled as a weighted average of s1² and s2²]; the t-table is accurate (if Gaussian data).

σ1 ≠ σ2 : use separate estimates of σ1 and σ2, and adjust the d.f. downwards from (n1 + n2 − 2) to compensate for the inaccuracy:
 Option 1 (p 549): "software approximation"*
 Option 2 (p 541), for hand use: df = Minimum[ n1 − 1, n2 − 1 ]
[M&M are the only ones I know who suggest this option 2; I think they do so because the undergraduates they teach may not be motivated enough to use equation 7.4, page 549, to calculate the reduced degrees of freedom... I agree that the only time one should use option 2 is the 1st time, when learning about the t-test.]
* The SAS manual says that in its TTEST procedure it uses Satterthwaite's approximation [p. 549 of M&M] for the reduced degrees of freedom unless the user specifies otherwise.

WORKED EXAMPLE: CI and Test of Significance

Y = Fall in BP over 8 weeks [mm Hg] with Modest Reduction in Dietary Salt Intake in Mild Hypertension (Lancet, 25 Feb 1989)

Test: H0: µY(Normal Sodium Diet) = µY(Low Sodium Diet)
 vs Halt: µY(Normal Sodium Diet) ≠ µY(Low Sodium Diet)
α = 0.05 (2-sided); β = 0.20 ==> Power = 80% if | µY(Nl) − µY(Low) | ≥ 2 mm DBP.

Data given -- Mean (SEM) Fall in SBP:
 "Normal" Group (n=53): −0.6 (1.0)
 "Low" Group (n=50): −6.1 (1.1)

Reconstruct the s²'s via the relation s² = n SEM² -- Mean (s²) Fall in SBP:
 "Normal" (n=53): −0.6 (53)
 "Low" (n=50): −6.1 (60.5)

s²pooled = ( [52]53 + [49]60.5 ) / ( [52] + [49] ) = 56.63 ; s = √56.63 = 7.52

t = ( −6.1 − [−0.6] ) / ( 7.52 √( 1/53 + 1/50 ) ) = −3.71 vs t101, .05 = 1.98

Adjustments are not a big issue if sample sizes are large or variances similar.
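[Aside -- not in the original notes: the pooled-variance calculation above, as a Python sketch working from the published summary statistics.]

import math

n1, mean1, sem1 = 53, -0.6, 1.0             # "Normal" sodium group
n2, mean2, sem2 = 50, -6.1, 1.1             # "Low" sodium group
s2_1, s2_2 = n1 * sem1**2, n2 * sem2**2     # s^2 = n * SEM^2
s2_pooled = ((n1 - 1)*s2_1 + (n2 - 1)*s2_2) / (n1 + n2 - 2)
se_diff = math.sqrt(s2_pooled * (1/n1 + 1/n2))
print(round(s2_pooled, 2))                  # 56.64 (56.63 above, from truncation)
print(round((mean2 - mean1) / se_diff, 2))  # -3.71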


Calculation of the t-test using separate variances

Had we used the separate s²'s in each sample, we would have calculated

t = ( −6.1 − [−0.6] ) / √( 53/53 + 60.5/50 ) = −3.70

This is equivalent to calculating

t = ( −6.1 − [−0.6] ) / √( SE1² + SE2² ) = ( −6.1 − [−0.6] ) / √( 1.1² + 1.0² ) = −3.70

M&M suggest that the appropriate df for t are:
 Option 1 (via their eqn. 7.4): 99.5
 Option 2 (smaller df): 49

Either way, the t ratio is far beyond the α = 0.05 point of the null distribution. Notice that the reduction in df is minimal here, because the two sample variances are quite close.

Incidentally, as per their power calculations, the primary response variable was DBP. Mean (SEM) Fall in DBP in the 2 samples: "Normal" Group (n=53): −0.9 (0.6); "Low" Group (n=50): −3.7 (0.6).

t = ( −3.7 − [−0.9] ) / √( 0.6² + 0.6² ) = −3.3

"Eye test": Judging the overlap of two independent CI's

 |----------x----------|
            |----------x----------|   Overlapping CI's

How far apart do two independent x̄'s, say x̄1 and x̄2, have to be for a formal statistical test, using an alpha of 0.05 two-sided, to be statistically significant?

Need... | x̄1 − x̄2 | ≥ 1.96 √( {SE[x̄1]}² + {SE[x̄2]}² ) if using a z-test*

If the 2 SEM's are about the same size (as they would be if the 2 n's, and the per-unit variability, were about the same), then ... [as in exercise X in Chapter 5]

Need... | x̄1 − x̄2 | ≥ 1.96 √2 SE[each x̄], i.e. | x̄1 − x̄2 | ≥ 2.77 SE[each x̄]

* If using t rather than z, the multiple would be somewhat higher than 1.96, so that when multiplied by √2 it might be higher than 2.77, closer to 3. Thus a rough answer to the question could be | x̄1 − x̄2 | ≥ 3 SE[each x̄].

Thus, we can be reasonably sure that if the CI's do not overlap (i.e. if x̄1 and x̄2 are more than about 3 SE[each x̄] apart), the difference between them is statistically significant at the alpha = 0.05 level.

This also means that even when two 100(1−α)% CI's overlap slightly, as above, the difference between the two means could still be statistically significant at the α level. This is why Moses, in his article on graphical displays (see reserve material), advocates plotting the 2 CI's formed by
 x̄1 ± 1.5 SE[x̄1] and x̄2 ± 1.5 SE[x̄2].
[estimate ± 1.5 SE(estimate) corresponds to an 86% CI if using the Z distribution]

Note: the above logic applies to other symmetric CI's too.
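[Aside -- not in the original notes: the separate-variances ("Satterthwaite"/Welch) version in Python, built directly from the two SEM's; it reproduces both the t of −3.70 and M&M's Option 1 df of 99.5.]

import math
from scipy.stats import t as tdist

n1, se1 = 53, 1.0                          # SEM's, SBP example
n2, se2 = 50, 1.1
t_stat = (-6.1 - (-0.6)) / math.sqrt(se1**2 + se2**2)
df = (se1**2 + se2**2)**2 / (se1**4/(n1 - 1) + se2**4/(n2 - 1))  # M&M eqn 7.4
print(round(t_stat, 2), round(df, 1))      # -3.70, 99.5
print(2 * tdist.sf(abs(t_stat), df))       # 2-sided P, far below 0.05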


Inferences regarding means --- Summary

Situation: 1 Population; object µ (sample of n)
 σ known (or large n): CI: x̄ ± z σx/√n   Test of µ0: z = ( x̄ − µ0 ) / ( σx/√n )
 σ unknown: CI: x̄ ± t(n−1) sx/√n   Test of µ0: t(n−1) = ( x̄ − µ0 ) / ( sx/√n )

Situation: 1 Population, paired data; object ∆ (= µ1 − µ2 under the 2 conditions); sample of n within-pair differences {d = x1 − x2}
 σd known: CI: d̄ ± z σd/√n   Test of ∆0: z = ( d̄ − ∆0 ) / ( σd/√n )
 σd unknown: CI: d̄ ± t(n−1) sd/√n   Test of ∆0: t(n−1) = ( d̄ − ∆0 ) / ( sd/√n )

Situation: 2 Populations; object µ1 − µ2 (independent samples of n1 and n2)
 σ's known: CI: x̄1 − x̄2 ± z √( σ1²/n1 + σ2²/n2 )   Test of ∆0: z = ( x̄1 − x̄2 − ∆0 ) / √( σ1²/n1 + σ2²/n2 )
 σ's unknown: CI: x̄1 − x̄2 ± t(df) √( s²/n1 + s²/n2 )   Test of ∆0: t(df) = ( x̄1 − x̄2 − ∆0 ) / √( s²/n1 + s²/n2 ), with s² the pooled variance

Notes:
• Pooled s² = [ (n1−1)s1² + (n2−1)s2² ] / [ (n1−1) + (n2−1) ] (a weighted average of the two s²'s)
• df = (n1−1) + (n2−1) = n1 + n2 − 2
• If it appears that σ1² is very different from σ2², then a "separate variances" t-test is used, with the df reduced to account for the differing σ²'s.


Sample Size for a CI for µ1 − µ2

n's to produce a CI for µ1 − µ2 with a pre-specified margin of error m:

• large-sample CI: x̄1 − x̄2 ± Z SE( x̄1 − x̄2 ) = x̄1 − x̄2 ± margin of error
• SE( x̄1 − x̄2 ) = √( σ²/n1 + σ²/n2 )
• if equal n's are used, then

 n per group = 2 σ² Zα/2² / [margin of error]²

example:
* 95% CI for the difference in mean Length of Stay (LOS);
* desired margin of error for the difference: 0.5 days;
* anticipated SD of individual LOS's, in each situation: 5 days.

95% -> α = 0.05 -> Zα/2 = 1.96
 n per group = 2 • 5² • 1.96² / [0.5]² ≅ 800

Contrast the formula for the test (below) with the formula for the CI: the CI formula involves no null and alternative values for the comparative parameter; notice also the absence of beta. See the reference to Greenland [below].

Sample Size for a test of µ1 versus µ2

Test H0: µ1 = µ2 vs Ha: µ1 ≠ µ2.
n's for power of 100(1−β)% if µ1 − µ2 = ∆; Prob(type I error) = α (cf. Colton p 145 or CRC tables)

• if equal n's are used, then ...

 n per group = 2 {Zα/2 − Zβ}² σ² / ∆² = 2 (Zα/2 − Zβ)² { σ/∆ }²

Note that if β < 0.5, then Zβ < 0 (also, Zβ is always 1-sided).

example: α = 0.05 (2-sided) and β = 0.2 ... Zα/2 = 1.96; Zβ = −0.84;
 2(Zα/2 − Zβ)² = 2{1.96 − (−0.84)}² ≈ 16, i.e.
 n per group ≈ 16 • {noise/signal ratio}²

These formulae are easily programmed in a spreadsheet. There are also specialized software packages for sample size and statistical power; see the web page under Resources for Chapter 7.

Greenland S. "On sample-size and power calculations for studies using confidence intervals". American Journal of Epidemiology 128(1):231-7, 1988 Jul. Abstract: A recent trend in epidemiologic analysis has been away from significance tests and toward confidence intervals. In accord with this trend, several authors have proposed the use of expected confidence intervals in the design of epidemiologic studies. This paper discusses how expected confidence intervals, if not properly centered, can be misleading indicators of the discriminatory power of a study. To rectify such problems, the study must be designed so that the confidence interval has a high probability of not containing at least one plausible but incorrect parameter value. To achieve this end, conventional formulas for power and sample size may be used. Expected intervals, if properly centered, can be used to design uniformly powerful studies but will yield sample-size requirements far in excess of previously proposed methods.
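[Aside -- not in the original notes: both formulas as Python functions, so the LOS example and the "16 × (noise/signal)²" shortcut can be checked. Function names are mine.]

import math
from scipy.stats import norm

def n_per_group_ci(sigma, margin, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(2 * sigma**2 * z**2 / margin**2)

def n_per_group_test(sigma, delta, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)            # 1.96
    z_b = norm.ppf(1 - power)                # -0.84 when power = 0.80
    return math.ceil(2 * (z_a - z_b)**2 * (sigma / delta)**2)

print(n_per_group_ci(sigma=5, margin=0.5))   # 769, i.e. the "~800" LOS example
print(n_per_group_test(sigma=5, delta=2))    # 99, vs 16*(5/2)^2 = 100 from the shortcut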


Number of Observations PER GROUP to Ensure Specified Power (1−β) if using the 2-sample t-test of 2 Means

            α = 0.005 (1-sided)           α = 0.025 (1-sided)           α = 0.05 (1-sided)
            α = 0.01  (2-sided)           α = 0.05  (2-sided)           α = 0.1  (2-sided)
∆/σ    β =  .01  .05  .1   .2   .5   β =  .01  .05  .1   .2   .5   β =  .01  .05  .1   .2   .5     [POWER = 1−β]

0.2          -    -    -    -    -         -    -    -    -    -         -    -    -    -  137
0.3          -    -    -    -    -         -    -    -    -   87         -    -    -    -   61
0.4          -    -    -    -   85         -    -    -  100   50         -    -  108   78   35
0.5          -    -    -   96   55         -  106   86   64   32         -   88   70   51   23
0.6          -  101   85   67   39       104   74   60   45   23        89   61   49   36   16
0.7        100   75   63   50   29        76   55   44   34   17        66   45   36   26   12
0.8         77   56   49   39   23        57   42   34   26   14        50   35   28   21   10
0.9         62   46   39   31   19        47   34   27   21   11        40   28   22   16    8
1.0         50   38   32   26   15        38   27   23   17    9        33   23   18   14    7
1.2         36   27   23   18   11        27   20   16   12    7        23   16   13   10    5
1.4         27   20   17   14    9        20   15   12   10    6        17   12   10    8    4
1.6         21   16   14   11    7        16   12   10    8    5        14   10    8    6    4
1.8         17   13   11   10    6        13   10    8    6    4        11    8    7    5    -
2.0         14   11   10    8    6        11    8    7    6    4         9    7    6    4    -
2.5         10    8    7    6    4         8    6    5    4    -         6    5    4    3    -
3.0          8    6    6    5    4         6    5    4    4    -         5    4    3    -    -

(cells marked "-" were outside the range transcribed from the CRC table)

∆/σ = (µ1 − µ2)/σ = "Signal"/"Noise"

Table entries transcribed from Table IV.4 of CRC Tables of Probability and Statistics. Table IV.4 tabulates the n's for Signal/Noise ratios in increments of 0.1, and also includes entries for alpha = 0.01 (1-sided) / 0.02 (2-sided).

See also Colton, page 145

Sample sizes are based on t-tables, and so are slightly larger (and more realistic) than those given by the z-based formula: n/group = 2 (zα/2 − zβ)² (σ/∆)².

See later (in Chapter 8) for unequal sample sizes i.e. n1 ≠ n2


Inference concerning a single π (M&M §8.1; updated Dec 26, 2003)

Parameter: π , a proportion -- e.g. the proportion ...
• with undiagnosed hypertension / seeing an MD during a 1-year span
• responding to a therapy
• still breast-feeding at 6 months
• of pairs where the response on treatment > the response on placebo
• of US presidential elections where the taller candidate was expected to win
• of twin pairs where the L-handed twin dies first
• able to tell imported from domestic beer in a "triangle taste test"
• who get a headache after drinking red wine
• (of all cases, exposed and unexposed) where the case was "exposed" [a function of the rate ratio & of the relative sizes of the exposed & unexposed denominators; a CONDITIONAL analysis (i.e. "fix" the # of cases), used for the CI for the ratio of 2 rates, especially in 'extreme' data configurations... e.g. # of seroconversions in the RCT of the HPV16 Vaccine, NEJM Nov 21, 2002: 0/11084.0 W-Y in the vaccinated gp. versus 41/11076.9 W-Y in the placebo gp.]

Statistic: the proportion p = y/n in a sample of size n.

Inferences from y/n to π:

FREQUENTIST -- via Confidence Intervals and Tests
• Confidence Interval: where is π? -- supplies a NUMERICAL answer (a range)
• Evidence (P-value) against H0: π = 0.xx
• Test of Hypothesis: is the P-value < a preset α? -- supplies a YES/NO answer (uses Prob[data | H0])

Notice the link between a 100(1−α)% CI and a two-sided test of significance with a preset α. If the true π were < πlower, there would be less than a 2.5% probability of obtaining, in a sample of 20, this many (11) or more respondents; likewise, if the true π were > πupper, there would be less than a 2.5% probability of obtaining, in a sample of 20, this many (11) or fewer respondents. The 100(1−α)% CI for π includes all those parameter values such that, if the observed data were tested against them, the p-value (2-sided) would not be < α.

BAYESIAN -- via the posterior probability distribution for, and probabilistic statements concerning, π itself
• point estimate: median, mode, ...
• interval estimate: credible intervals, ...
Software: • "Bayesian Inference for Proportion (Excel)", Resources Ch 8 • First Bayes { http://www.epi.mcgill.ca/Joseph/courses.html }

(FREQUENTIST) Confidence Interval for π from a proportion p = x/n

1. "Exact" (not as "awkward to work with" as M&M p586 say they are): tables [Documenta Geigy, Biometrika, ...], nomograms, software.

e.g. what fraction π will return a 4-page questionnaire? 11/20 returns on a pilot test, i.e. p = 11/20 = 0.55.

95% CI (from the CI-for-a-proportion table, Ch 8 Resources): 32% to 77%. [To save space, the table gives CI's only for p ≤ 0.5, so get the CI for the π of non-returns: the point estimate is 9/20 or 45%, and its CI is 23% to 68% {1st row, middle column of the X=9 block}. Turn this back into 100−68 = 32% to 100−23 = 77% returns.]

95% CI (Biometrika nomogram): 32% to 77%. [The nomogram uses c for the numerator; enter through the lower x-axis if p ≤ 0.5; in our case p = 0.55, so enter the nomogram from the top at c/n = 0.55, near the upper right corner; travel downwards until you hit the bowed line marked 20 (the 5th line from the top) and exit towards the rightmost border at πlower ≈ 0.32; go back and travel downward until you hit the companion bowed line marked 20 (the 5th line from the bottom) and exit towards the rightmost border at πupper ≈ 0.77.]

Others may use other names for the numerator and the statistic, or use symmetry (Binomial[y, n, π] <--> Binomial[n−y, n, 1−π]) to save space. The nomogram on the next page shows the full range, but uses an approximation.

e.g. An experimental drug gives p = 0/14 successes in 14 patients => π = ??
95% CI for π (from the table): 0% to 23%. The CI "rules out" (with 95% confidence) the possibility that π > 23%. [One might use a 1-sided CI if one is interested in putting just an upper bound on a risk: e.g. what is the upper bound on π = the probability of getting HIV from an HIV-infected dentist? See the JAMA article on "zero numerators" by Hanley and Lippman-Hand (in Resources for Chapter 8).]

cf. also A&B §4.7; Colton §4. Note that JH's notes use p for the statistic, π for the parameter.
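[Aside -- not in the original notes: the "exact" (Clopper-Pearson) limits via the Binomial-Beta relation in scipy; this reproduces both the 11/20 and the 0/14 intervals quoted above. The function name is mine.]

from scipy.stats import beta

def exact_ci(x, n, conf=0.95):
    a = 1 - conf
    lo = beta.ppf(a/2, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - a/2, x + 1, n - x) if x < n else 1.0
    return lo, hi

print(exact_ci(11, 20))   # (0.315, 0.769) -> the "32% to 77%"
print(exact_ci(0, 14))    # (0.0, 0.232)   -> the "0% to 23%"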


CI for π -- using a nomogram (many books of statistical tables have fuller versions)

[Nomogram: 95% CI for π, read from the observed proportion p. Both axes run from 0 to 1; paired bowed curves are drawn for sample sizes n = 20, 50, 100, 200, 400 and 1000.]

(Asymmetric) CI in the above Nomogram -- approximate formula:

 πlower, πupper = { 2np + z² ± z √( 4np − 4np² + z² ) } / { 2(n + z²) }   (cf. later page)

See Biometrika Tables for Statisticians for the "exact" Clopper-Pearson version, calculated so that Binomial Prob[ ≥ p | πlower ] = Prob[ ≤ p | πupper ] = 0.025 exactly.
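[Aside -- not in the original notes: the nomogram's approximate formula (Wilson's score interval) as a Python function.]

import math

def wilson_ci(p, n, z=1.96):
    centre = (2*n*p + z*z) / (2*(n + z*z))
    half = z * math.sqrt(4*n*p - 4*n*p*p + z*z) / (2*(n + z*z))
    return centre - half, centre + half

print(wilson_ci(11/20, 20))   # ~(0.34, 0.74), vs the exact (0.32, 0.77)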


(FREQUENTIST) Confidence Interval for π from a proportion p = x/n

1. Exactly, but by trial and error, via SOFTWARE with the Binomial probability function; or exactly, and directly, using a table of (or a SOFTWARE function that gives) the percentiles of the F distribution.

See the spreadsheet "CI for a Proportion (Excel spreadsheet, based on exact Binomial model)" under Resources for Chapter 8. In this sheet one can obtain the direct solution, or get there by trial and error. Inputs in bold may be changed. The spreadsheet opens with the example of an observed proportion p = 11/20.

One can obtain these limits by trial and error (e.g. in the spreadsheet), or directly, using the link between the Binomial and F tail areas (also implemented in the spreadsheet). The basis for the latter is explained by Liddell (method, and reference, given at the bottom of the Table of 95% CI's). The general "Clopper-Pearson" method for obtaining a Binomial-based CI for a proportion is explained in the 607 Notes for Chapter 6.

NOTES
• Turn the spreadsheet of Binomial Probabilities (Table C) 'on its side' to get CI(π) .. simply find those columns (π values) for which the probability of the observed proportion is small.
• Read horizontally, the Nomogram [previous page] shows the variability of proportions from SRS samples of size n. [Very close in style to the table of Binomial Probabilities, except that only the central 95% range of variation is shown, & all n's are on the same diagram.]
• Read vertically, it shows:
 - CI -> symmetry as p -> 0.5 or n -> ∞ [in fact, as np & n(1−p) -> ∞]
 - widest uncertainty at p = 0.5 => can use this as a 'worst case scenario'
 - cf. the "± 4 percentage points" in the 'blurb' with Gallup polls of size n ≈ 1000. Polls are usually cluster (rather than SR) samples, and so have bigger margins of error [wider CI's] than predicted from the Binomial.

95% CI? IC? ... Comment dit-on... ?

[La Presse, Montréal, 1993; translated from the French:] "The Gallup Institute recently asked a representative sample of the Canadian population to rate how the federal government was dealing with various economic and general problems. For 59 per cent of respondents, the Liberals are not doing an effective job in this area, while 30 per cent state the opposite view and eleven per cent offer no opinion. The same question has been put by Gallup on 16 occasions between 1973 and 1990, and only once, in 1973, was the proportion of Canadians who said they were dissatisfied with the government's handling of the economy below 50 per cent. The poll's conclusions are based on 1009 interviews carried out between May 2 and 9, 1994, with Canadians aged 18 and over. A sample of this size gives results accurate to within 3.1 percentage points, 19 times out of 20. The margin of error is larger for the regions, owing to the smaller sample sizes; for example, the 272 interviews carried out in Quebec give a margin of error of 6 percentage points, 19 times out of 20."
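[Aside -- not in the original notes: the "trial and error via software" route, automated -- solve the two tail-area equations for πlower and πupper with a root-finder.]

from scipy.stats import binom
from scipy.optimize import brentq

x, n, a = 11, 20, 0.05
# pi_lower solves Prob(X >= x | pi) = a/2 ; pi_upper solves Prob(X <= x | pi) = a/2
lower = brentq(lambda p: binom.sf(x - 1, n, p) - a/2, 1e-9, 1 - 1e-9)
upper = brentq(lambda p: binom.cdf(x, n, p) - a/2, 1e-9, 1 - 1e-9)
print(round(lower, 3), round(upper, 3))    # 0.315, 0.769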


2. CI for π based on the "large-n" behaviour of p, or of a function of p

[Figure: sampling distributions of p at πlower ≈ 0.27 and at πupper ≈ 0.33, each with a 2.5% tail beyond the observed p = 0.30. The "usual" CI calculates the SD at 0.30, rather than at the lower and upper limits.]

CI: p ± z SE(p) = p ± z √( p[1−p]/n )

e.g. p = 0.3, n = 1000:
 95% CI for π = 0.3 ± 1.96 √( 0.3[0.7]/1000 ) = 0.30 ± 1.96(0.015) = 0.30 ± 0.03 = 30% ± 3%

Note: the "± 3%" is pronounced and written as "± 3 percentage points", to avoid giving the impression that it is 3% of 30%. SE-based CI's (sometimes referred to in texts and software output as "Wald" CI's) use the same SE for the upper and lower limits -- they calculate one SE, at the point estimate, rather than two separate SE's, calculated at each of the two limits.

"Large-sample n": How large is large?
• A rule of thumb: when the expected no. of positives, np, and the expected no. of negatives, n(1−p), are both bigger than 5 (or 10, if you read M&M).
• JH's rule: when you can't find the CI tabulated anywhere!
• If the distribution is not 'crowded' into one corner (cf. the shapes of the binomial distributions in the Binomial spreadsheet -- in Resources for Ch 5), i.e., if, with the symmetric Gaussian approximation, neither of the tails of the distribution spills over a boundary (0 or 1 if proportions; 0 or n if on the count scale).

See M&M p383 and A&B §2.7 on the Gaussian approximation to the Binomial.

From SAS:
DATA CI_propn;
INPUT n_pos n;
LINES;
300 1000
;
PROC genmod data=CI_propn;
 model n_pos/n = / dist=binomial link=identity waldci;
RUN;

From Stata:
Immediate command: cii 1000 300
Or, using a datafile (glm doesn't like a file with 1 'observation', so split it across 2 'observations'):
clear
input n_pos n
140 500
160 500
end
glm n_pos, family(binomial n) link(identity)


2. CI for π based on the "large-n" behaviour of p ... continued

Other, more accurate and more theoretically correct, large-sample (Gaussian-based) constructions

The "usual" approach is to form a symmetric CI as point estimate ± a multiple of the SE. This is technically incorrect in the case of a distribution, such as the binomial, with a variance that changes with the parameter being measured. In the construction of CI's [see the diagram on page 1 of the material on Ch 6.1] there are two distributions involved: the binomial at πupper and the binomial at πlower. They have different shapes and different SD's in general. Approaches i and ii (below) take this into account.

i Based on the Gaussian approximation to the binomial distribution, but with SD's calculated at the limits, SD = √( π[1−π]/n ), rather than at the point estimate itself {the "usual" CI uses SD = √( p[1−p]/n )}.

If we define the CI for π as (πL, πU), where
 Prob[ sample proportion ≥ p | πL ] = α/2 and Prob[ sample proportion ≤ p | πU ] = α/2,
and use Gaussian approximations to Binomial(n, πL) and Binomial(n, πU), then solving
 p = πL + zα/2 √( πL[1−πL]/n ) and p = πU − zα/2 √( πU[1−πU]/n )
for πL and πU leads to asymmetric 100(1−α)% limits of the form

 πL, πU = { 2np + z² ± z √( 4np − 4np² + z² ) } / { 2(n + z²) }

Rothman (2002, p132) attributes this method i to Wilson 1927.

ii Based on the Gaussian distribution of a variance-stabilizing transformation of the binomial, again with SD's calculated at the limits rather than at the point estimate itself:

 [ sin( sin⁻¹(√p) − z/(2√n) ) ]² , [ sin( sin⁻¹(√p) + z/(2√n) ) ]²

(as in most calculators, sin⁻¹ and the argument of sin[·] are measured in radians).

E.g. with α = 0.05, so that z = 1.96, we get:

Method      n=10, p=0.0    n=10, p=0.3    n=20, p=0.15    n=40, p=0.075
i           [0.00, 0.28]   [0.11, 0.60]   [ 0.05, 0.36]   [ 0.03, 0.20]
ii          [0.09, 0.09]   [0.07, 0.60]   [ 0.03, 0.33]   [ 0.01, 0.18]
"usual"     [0.00, 0.00]   [0.02, 0.58]   [–0.01, 0.31]   [–0.01, 0.16]
Binomial*   [0.00, 0.31]   [0.07, 0.65]   [ 0.03, 0.38]   [ 0.02, 0.20]

* from Mainland

References: • Fleiss, Statistical Methods for Rates and Proportions • Miettinen, Theoretical Epidemiology, Chapter 10.


iii CI for π based on the "large-n" behaviour of the logit transformation of the proportion

PARAMETER: LOGIT[π] = log[ODDS] = log[ π / (1−π) ]
 = log[ "Proportion POSITIVE" / "Proportion NEGATIVE" ]
STATISTIC: logit[p] = log[odds] = log[ p / (1−p) ]
(Here, log = 'natural' log, i.e. to base e, which some write as ln.)
(UPPER CASE/Greek = parameter; lower case/Roman = statistic.)

Reverse transformation (to get back from LOGIT to π)...
 π = ODDS / (1 + ODDS) = exp[LOGIT] / (1 + exp[LOGIT]); and likewise p <-- logit
 πLOWER = exp[LOWER limit of LOGIT] / (1 + exp[LOWER limit of LOGIT]); πUPPER likewise.
The anti-logit, exp[logit]/(1 + exp[logit]), is what Greenland calls the "expit" function.

SE[logit] = √( 1/#positive + 1/#negative )

e.g. p = 3/10 => estimated odds = 3/7 => logit = log[3/7] = −0.85
SE[logit] = √( 1/3 + 1/7 ) = 0.69
CI in the LOGIT scale: −0.85 ± 1.96 × 0.69 = {−2.2, 0.5}
CI in the π scale: { exp[−2.2]/(1+exp[−2.2]), exp[0.5]/(1+exp[0.5]) } = {0.10, 0.62}

iv CI for π based on the "large-n" behaviour of the log transformation of the proportion

PARAMETER: log[π]
STATISTIC: log[p]
Reverse transformation (to get back from log[π] to π): π = antilog[ log[π] ] = exp[ log[π] ]; and likewise p <-- log[p]; πLOWER = exp[ LOWER limit of log[π] ]; πUPPER likewise.

SE[ log[p] ] = √( 1/#positive − 1/#total )

Limits for π from p = 3/10: exp[ log(3/10) ± z × √( 1/3 − 1/10 ) ]

Exercises:
1 Verify that you get the same answer by calculator and by software.
2 Even with these logit and log transformations, the Gaussian distribution is not accurate at such small sample sizes as 3/10. Compare their performance (against the exact methods) for various sample sizes and numbers positive.

From SAS (logit link):
DATA CI_propn;
INPUT n_pos n;
LINES;
3 10
;
PROC genmod data=CI_propn;
 model n_pos/n = / dist=binomial link=logit waldci;
RUN;

From Stata (logit link):
clear
input n_pos n
1 5
2 5
end
glm n_pos, family(binomial n) link(logit)

From SAS (log link): as above, but with link=log.
From Stata (log link): as above, but with link(log).


The "Margin of Error blurb", introduced (legislated) in the mid 1980's

The Nielsen system for TV ratings in the U.S.A. (excerpt from an article on "Pollsters" in an airline magazine):

"...Nielsen uses a device that, at one-minute intervals, checks to see if the TV set is on or off and to which channel it is tuned. That information is periodically retrieved via a special telephone line and fed into the Nielsen computer center in Dunedin, Florida. With these two samplings, Nielsen can provide a statistical estimate of the number of homes tuned in to a given program. A rating of 20, for instance, means that 20 percent, or 16 million of the 80 million households, were tuned in.

To answer the criticism that 1,200 or 1,500 are hardly representative of 80 million homes or 220 million people, Nielsen offers this analogy: Mix together 70,000 white beans and 30,000 red beans and then scoop out a sample of 1000. The mathematical odds are that the number of red beans will be between 270 and 330, or 27 to 33 percent of the sample, which translates to a 'rating' of 30, plus or minus three, with a 20-to-1 assurance of statistical reliability. The basic statistical law wouldn't change even if the sampling came from 80 million beans rather than just 100,000." ...

Why, if the U.S. has a 10-times-bigger population than Canada, do pollsters use the same size samples of approximately 1,000 in both countries? Answer: it depends on WHAT IT IS THAT IS BEING ESTIMATED. With n = 1,000, the SE or uncertainty of an estimated PROPORTION of 0.30 is indeed 0.03, or 3 percentage points. However, if one is interested in the NUMBER of households tuned in to a given program, the best estimate is 0.3N, where N is the number of units in the population (N = 80 million in the U.S. or N = 8 million in Canada). The uncertainty in the 'blown up' estimate of the TOTAL NUMBER tuned in is blown up accordingly, so that e.g. the estimated NUMBER of households is

 U.S.A.: 80,000,000 [0.3 ± 0.03] = 24,000,000 ± 2,400,000
 Canada: 8,000,000 [0.3 ± 0.03] = 2,400,000 ± 240,000

2.4 million is a 10-times-bigger absolute uncertainty than 240,000. Our intuition about needing a bigger sample for a bigger universe probably stems from absolute errors rather than relative ones (which in our case remain at 0.03 in 0.3, or 240,000 in 2.4 million, or 2.4 million in 24 million, i.e. at 10%, irrespective of the size of the universe). It may help to think of why we do not take bigger blood samples from bigger persons: the reason is that we are usually interested in concentrations rather than in absolute amounts, and concentrations are like proportions.

Montreal Gazette, August 8, 1981 -- NUMBER OF SMOKERS RISES BY FOUR POINTS: GALLUP POLL

Compared with a year ago, there appears to be an increase in the number of Canadians who smoked cigarettes in the past week -- up from 41% in 1980 to 45% today. The question asked over the past few years was: "Have you yourself smoked any cigarettes in the past week?" Here is the national trend (smoked cigarettes in the past week):
 Today 45%; 1980 41; 1979 44; 1978 47; 1977 45; 1976 not asked; 1975 47; 1974 52.
Men (50% vs. 40% for women), young people (54% vs. 37% for those > 50) and Canadians of French origin (57% vs. 42% for English) are the most likely smokers. Today's results are based on 1,054 personal in-home interviews with adults, 18 years and over, conducted in June.

The Gazette, Montreal, Thursday, June 27, 1985 -- 39% OF CANADIANS SMOKED IN PAST WEEK: GALLUP POLL

Almost two in every five Canadian adults (39 per cent) smoked at least one cigarette in the past week -- down significantly from the 47 per cent who reported this 10 years ago, but at the same level found a year ago. Here is the question asked fairly regularly over the past decade: "Have you yourself smoked any cigarettes in the past week?" The national trend shows (smoked cigarettes in the past week):
 1985 39%; 1984 39; 1983 41; 1982* 42 (* smoked regularly or occasionally); 1981 45; 1980 41; 1979 44; 1978 47; 1977 45; 1975 47.
Those < 50 are more likely to smoke cigarettes (43%) than are those 50 years or over (33%). Men (43%) are more likely to be smokers than women (36%). Results are based on 1,047 personal, in-home interviews with adults, 18 years and over, conducted between May 9 and 11. A sample of this size is accurate within a 4-percentage-point margin, 19 in 20 times.


Test of the Hypothesis that π = some test value π0

1. n small enough -> Binomial Tables/Spreadsheet

i.e. if testing H0: π = π0 vs Ha: π ≠ π0 [or Ha: π > π0], and if we observe x/n, then calculate
 Prob( observed x, or an x that is more extreme | π0 ),
using Ha to specify which x's are "more extreme", i.e. provide even more evidence for Ha and against H0.

Or use the correspondence between a 100(1−α)% CI and a test which uses an alpha level of α, i.e. check whether the CI obtained from the CI table or nomogram includes the π value being tested. [There may be slight discrepancies between test and CI: the methods used to construct CI's don't always correspond exactly to those used for tests.]

2. Large n: Gaussian Approximation

Test π = π0: z = ( p − π0 ) / SE[p] = ( p − π0 ) / √( π0[1−π0]/n )

Note that the test uses a variance based on the (specified) π0. The "usual" CI uses a variance based on the (observed) p.

(Dis)Continuity Correction†

Because we approximate a discrete distribution [where p takes on the values 0/n, 1/n, 2/n, ... n/n, corresponding to the integer values (0, 1, 2, ..., n) in the numerator of p] by a continuous Gaussian distribution, authors have suggested a 'continuity correction' (or, if you are more precise in your language, a 'discontinuity' correction). This is the same concept as we saw back in §5.1, where we said that a binomial count of 8 became the interval (7.5, 8.5) on the interval scale. Thus, e.g., if we want to calculate the probability that a proportion out of 10 is ≥ 8, we need the probability of ≥ 7.5 on the continuous scale.

If we work with the count itself in the numerator, this amounts to reducing the absolute deviation y − nπ0 by 0.5. If we work in the proportion scale, the absolute deviation is reduced by 0.5/n, viz.

 zc = ( |y − nπ0| − 0.5 ) / SE[y] = ( |y − nπ0| − 0.5 ) / √( nπ0[1−π0] )
or
 zc = ( |p − π0| − 0.5/n ) / SE[p] = ( |p − π0| − 0.5/n ) / √( π0[1−π0]/n )

† Colton [who has a typo in the formula on p ___] and A&B deal with this; M&M do not, except to say on p386-7 that "because most statistical purposes do not require extremely accurate probability calculations, we do not emphasize use of the continuity correction". There are some 'fundamental' problems here that statisticians disagree on. The "Mid-P" material (below) gives some of the flavour of the debate.

e.g. 1 A common question is whether there is evidence against the proposition that a proportion π = 1/2 [testing preferences and discrimination in psychophysical matters, e.g. therapeutic touch; McNemar's test for discordant pairs when comparing proportions in a pair-matched study; the 'non-parametric' Sign Test for assessing intra-pair differences in measured quantities, ...]. Because of the special place of the Binomial at π = 1/2, the tail probabilities have been calculated and tabulated. See the table entitled "Sign Test" in the chapter on Distribution-free Methods.

M&M (2nd paragraph, p 592) say that "we do not often use significance tests for a single proportion, because it is uncommon to have a situation where there is a precise proportion that we want to test". But they forget paired studies, and even the sign test for matched pairs, which they themselves cover in section 7.1, page 521. They give just 1 exercise (8.18) where they ask you to test π = 0.5 vs π > 0.5.

e.g. 2 Another example, dealing with responses in a setup where the "null" is π = 1/3, the "Triangle Taste Test", is described on the next page.


EXAMPLE of Testing π: THE TRIANGLE TASTE TEST

As part of the preparation for a double-blind RCT of lactase-reduced infant formula on infant crying behaviour, the experimental formulation was tested for its similarity in taste to the regular infant formula. n mothers in the waiting room at MCH were given the TRIANGLE TASTE TEST: each was given 3 coded formula samples -- 2 containing the regular formula and 1 the experimental one -- and told that "2 of these samples are the same and one sample is different"; p = y/n correctly identify the odd sample. Should the researcher be worried that the experimental formula does not taste the same? (Assume infants are no more or less taste-discriminating than their mothers.) [Study by Ron Barr, Montreal Children's Hospital]

The null hypothesis being tested is
 H0: π(correctly identified samples) = 0.33 against Ha: π > 0.33
[here, for once, it is difficult to imagine a 2-sided alternative -- unless mothers were very taste-discriminating but wished to confuse the investigator].

We consider two situations: the real study with n = 12, and a hypothetical larger sample of n = 120 for illustration.

• 5 of n = 12 mothers correctly identified the odd sample, i.e. p = 5/12 = 0.42.

Degree of evidence against H0
 = Prob(5 or more correct | π = 0.33) ... a Σ of 8 probabilities
 = 1 − Prob(4 or fewer correct | π = 0.33) ... a shorter Σ of only 5
 = 1 − [ P(0) + P(1) + P(2) + P(3) + P(4) ] = 0.37*

* These calculations can be done easily, even on a calculator or spreadsheet, without any combinatorials:
 P(0) = 0.67^12 = 0.008
 P(1) = 12 × 0.33 × P(0) / [1 × 0.67] = 0.048
 P(2) = 11 × 0.33 × P(1) / [2 × 0.67] = 0.131
 P(3) = 10 × 0.33 × P(2) / [3 × 0.67] = 0.215
 P(4) = 9 × 0.33 × P(3) / [4 × 0.67] = 0.238
 Σ = 0.640, so Prob(5 or more correct | π = 0.33) = 1 − 0.64 = 0.36; with π = 1/3 un-rounded, the probability is 0.37. One can also obtain it directly via Excel, using the function 1 - BINOMDIST(4, 12, 0.33333, TRUE). Using n = 12 and p = 0.30 in Table C gives 0.28; using p = 0.35 gives 0.42; interpolation gives 0.37 approx.

So, by conventional criteria (Prob < 0.05 is considered a cutoff for evidence against H0), there is not a lot of evidence to contradict the H0 of taste similarity of the regular and experimental formulae.

With a sample size of only n = 12, however, we cannot rule out the possibility that a sizable fraction of mothers could truly distinguish the two. Our observed proportion of 5/12 projects to a one-sided 95% CI of "as many as 65% in the population get it right". In this worst-case scenario, assuming that the percentage of right answers in the population is a mix of a proportion πcan who can really tell and one third of the remaining (1 − πcan) who get it right by guessing, we equate
 0.65 = πcan + (1 − πcan)/3,
giving an upper bound of πcan = (0.65 − 0.33)/(2/3) = 0.48, or 48%.

• 50 of 120 (p = 0.42) mothers identified the odd sample.

Test π = 0.33: z = ( 0.42* − 0.33 ) / √( 0.33[1−0.33]/120 ) = 2.1
So P = Prob[ ≥ 50 | π = 0.33 ] = Prob[ Z ≥ 2.1 ] = 0.018

* We treat the proportion 50/120 as a continuous measurement; in fact it is based on an integer numerator, 50; we should treat 50 as 49.5 to 50.5, so "≥ 50" is really "> 49.5". The probability of obtaining 49.5/120 or more is the probability of Z ≥ (0.4125 − 0.33)/√( 0.33[1−0.33]/120 ) = 1.9. With n = 120, the continuity correction does not make a large difference; however, with smaller n, and its coarser grain, the continuity correction [which makes differences smaller] is more substantial.
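[Aside -- not in the original notes: the exact tail probability and the Gaussian approximation, in scipy; the 0.33's and 0.42 mirror the rounding used above.]

import math
from scipy.stats import binom, norm

print(round(binom.sf(4, 12, 1/3), 2))      # 0.37 = Prob( >= 5 of 12 | pi = 1/3 )

z = (0.42 - 0.33) / math.sqrt(0.33 * 0.67 / 120)
print(round(z, 1), round(norm.sf(z), 3))   # 2.1, 0.018 -- the n = 120 case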


Sample Size for CI's and Tests involving π

n to yield a (2-sided) CI with margin of error m at confidence level 1−α (see M&M p 593, Colton p161)

 |---- margin of error ---->|
 (-------------•-------------) CI

• see the CI's as a function of n in the tables and nomograms
• (or) large-sample CI: p ± Zα/2 SE(p) = p ± m
 SE(p) = √( p[1−p]/n ), so... n = p[1−p] Zα/2² / m²
If unsure, use the largest SE, i.e. the one at p = 0.5, i.e.
 n = 0.25 Zα/2² / m²   [1.c]

n for power 1−β to "detect" a population proportion π1 that is ∆ units from π0 (the test value); type I error = α (Colton p 161)

 n = { Zα/2 √( π0[1−π0] ) − Zβ √( π1[1−π1] ) }² / ∆²   [1.t]
 ≈ { Zα/2 − Zβ }² { √( π̄[1−π̄] ) / ∆ }²  where π̄ is the average of π0 and π1
 = { Zα/2 − Zβ }² { σ0/1 / ∆ }²

Notes: Zβ will be negative; the formula is the same as for testing µ. (See also homegrown exercise # ___ )

Worked example 1: sample size for a Test that π(preferences) = 0.5 vs ≠ 0.5, or for a Sign Test that the median difference = 0

Test: H0: MedianD = 0 vs Halt: MedianD ≠ 0, i.e. H0: π(+) = 0.5 vs Halt: π(+) > 0.5; α = 0.05 (2-sided).
For power 1−β against Halt: π(+) = 0.65, say:
 [at π̄ = the average of 0.5 & 0.65, √( π̄[1−π̄] ) = 0.494]
 n ≈ { Zα/2 − Zβ }² { 0.494/0.15 }²
With α = 0.05 (2-sided) & β = 0.2 ... Zα/2 = 1.96; Zβ = −0.84; (Zα/2 − Zβ)² = {1.96 − (−0.84)}² ≈ 8, i.e.
 n ≈ 8 { 0.494/0.15 }² = 87

Worked example 2: sample size for the Taste Test: π(correct) = 1/3 vs > 1/3

If we set α = 0.05 (hardliners might allow a 1-sided test here), then Zα = 1.645; if we want 90% power, then Zβ = −1.28. Then, using eqn [1.t] above, the n's for 90% power against
 π(correct) = 0.4, 0.5, 0.6, 0.7, 0.8
are, respectively,
 n = 400, 69, 27, 14, 8.
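[Aside -- not in the original notes: eqn [1.t] as a Python function (name mine). It gives 85 for worked example 1; the cruder average-π̄ version used above gives 87.]

import math
from scipy.stats import norm

def n_one_proportion(pi0, pi1, alpha=0.05, power=0.80, sides=2):
    z_a = norm.ppf(1 - alpha / sides)
    z_b = norm.ppf(1 - power)              # negative when power > 0.5
    num = (z_a * math.sqrt(pi0*(1 - pi0)) - z_b * math.sqrt(pi1*(1 - pi1)))**2
    return math.ceil(num / (pi1 - pi0)**2)

print(n_one_proportion(0.50, 0.65))        # 85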


Inference concerning 2 π's (M&M §8.2; updated Dec 14, 2003)

Parameters: π1 and π2 .... the π's are proportions / prevalences / risks.

3 comparative measures / parameters:

 PARAMETER                                              estimate
 (Risk or Prevalence) Difference: π1 − π2               p1 − p2
 (Risk or Prevalence) Ratio: π1 / π2                    p1/p2 ; work in the log scale: log[p1/p2] = log[p1] − log[p2]
 Odds Ratio: { π1/(1−π1) } / { π2/(1−π2) }              odds1/odds2 ; work in the log scale: log[odds1/odds2] = logit1 − logit2

Large-sample CI for a COMPARATIVE MEASURE / PARAMETER (if the 2 estimates are uncorrelated)

IN GENERAL (if the calculations are done in a transformed scale, one must back-transform):
 estimate1 − estimate2 ± z SE[ estimate1 − estimate2 ]
 = estimate1 − estimate2 ± z √( Var[estimate1] + Var[estimate2] )

IN PARTICULAR

Difference:
 p1 − p2 ± z SE[ p1 − p2 ]   Remember: SE's don't add! Their squares do!
 = p1 − p2 ± z √( SE²[p1] + SE²[p2] )
 = p1 − p2 ± z √( p1[1−p1]/n1 + p2[1−p2]/n2 )   cf. Rothman 2002 p135, Eqn 7-2

Ratio:
 anti-log[ log(p1/p2) ± z SE[ log[p1] − log[p2] ] ]
 = anti-log[ log(p1/p2) ± z √( SE²[ log[p1] ] + SE²[ log[p2] ] ) ]   cf. Rothman 2002 p135, Eqn 7-3
 From §8.1: SE²[ log[p1] ] = Var[ log[p1] ] = 1/#positive1 − 1/#total1 ; SE²[ log[p2] ] = 1/#positive2 − 1/#total2.

Odds Ratio:
 anti-log[ log(oddsRatio) ± z SE[ logit1 − logit2 ] ]
 = anti-log[ log(oddsRatio) ± z √( SE²[logit1] + SE²[logit2] ) ]   cf. Rothman 2002 p139, Eqn 7-6
 From §8.1: SE²[logit1] = 1/#positive1 + 1/#negative1 ; SE²[logit2] = 1/#positive2 + 1/#negative2.
 Var[log of OR estimate] = 1/a + 1/b + 1/c + 1/d ==> "Woolf's Method": CI[ODDS RATIO].
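[Aside -- not in the original notes: all three CI's from a generic 2x2 table as one Python function (name mine); the printed check values are the Vitamin C figures worked out later in this section.]

import math
from scipy.stats import norm

def two_by_two_cis(a, b, c, d, conf=0.95):
    # rows: group 1 (a +ve, b -ve), group 2 (c +ve, d -ve)
    z = norm.ppf(1 - (1 - conf) / 2)
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    se_rd = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)          # risk difference
    rd = (p1 - p2 - z*se_rd, p1 - p2 + z*se_rd)
    se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n2)          # ratio, log scale
    rr = tuple(math.exp(math.log(p1/p2) + s*z*se_log_rr) for s in (-1, 1))
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)            # odds ratio (Woolf)
    oratio = tuple(math.exp(math.log(a*d/(b*c)) + s*z*se_log_or) for s in (-1, 1))
    return rd, rr, oratio

# Vitamin C data, with 'cold' as the +ve outcome:
print(two_by_two_cis(302, 105, 335, 76))   # RR CI ~(0.85, 0.98); OR CI ~(0.47, 0.91)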


Large-sample Test of π1 = π2

A Test of π1 = π2 is equivalent to a Test of π1 − π2 = 0 (Risk or Prevalence Difference = 0), the same as a Test of π1/π2 = 1 (Risk or Prevalence Ratio = 1), and the same as a Test of { π1/(1−π1) } / { π2/(1−π2) } = 1 (ODDS Ratio = 1).

The generic 2x2 contingency table:

            +ve        -ve         both
sample 1    y1 (%)     n1 − y1     n1 (100%)
sample 2    y2 (%)     n2 − y2     n2 (100%)
total       y  (%)     n − y       n  (100%)

z = ( p1 − p2 − {∆=0} ) / SE[ p1 − p2 ]†
 = ( p1 − p2 ) / √( p[1−p]/n1 + p[1−p]/n2 )††
 = ( p1 − p2 ) / √( p[1−p] { 1/n1 + 1/n2 } )

[use the estimate p = ( n1 p1 + n2 p2 ) / ( n1 + n2 ) = total +ve / total in the test]

† Continuity correction: use |p1 − p2| − [ 1/(2n1) + 1/(2n2) ] in the numerator.
†† Variances add if the proportions are uncorrelated.

Examples:

1 Bromocriptine for unexplained 1º infertility (BMJ 1979)
                 became pregnant   did not   total no. couples
 Bromocriptine       7 (29%)          17        24 (100%)
 Placebo             5 (22%)          18        23 (100%)
 total              12 (26%)          35        47 (100%)

2 Vitamin C and the common cold (CMAJ Sept 1972 p 503)
              no colds     ≥ 1 cold    total subjects
 Vitamin C    105 (26%)      302         407 (100%)
 Placebo       76 (18%)      335         411 (100%)
 total        181 (22%)      637         818 (100%)

3 Stroke Unit vs Medical Unit for Acute Stroke in the elderly? Patient status at hospital discharge (BMJ 27 Sept 1980)
                indept.      dependent    total no. pts
 Stroke Unit     67 (66%)       34          101 (100%)
 Medical Unit    46 (51%)       45           91 (100%)
 total          113 (59%)       79          192 (100%)


Rx for primary infertility

95% CI for ∆π: 0.29 − 0.22 ± z √( 0.29 • 0.71/24 + 0.22 • 0.78/23 )
 = 0.07 ± 1.96 × 0.13 = 0.07 ± 0.25

test ∆π = 0:
 z = ( 0.29 − 0.22 ) / √( 0.26[0.74] { 1/24 + 1/23 } ) = 0.07/0.13 = 0.55
 P = 0.58 (2-sided)

Stroke Unit vs Medical Unit

95% CI for ∆π: 0.66 − 0.51 ± z √( 0.66 • 0.34/101 + 0.51 • 0.49/91 )
 = 0.15 ± 1.96 × 0.07 = 0.15 ± 0.14

test ∆π = 0: [carrying several decimal places, for comparison with χ² later]
 z = ( 0.6634 − 0.5054 ) / √( 0.5885 • 0.4115 • { 1/101 + 1/91 } ) = 0.1580/0.0711 = 2.22
 [estimate of the hypothesized common π = total +ve / total = 113/192 = 0.5885]
 P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)

Vitamin C and the common cold

95% CI for ∆π: 0.26 − 0.18 ± z √( 0.26 • 0.74/407 + 0.18 • 0.82/411 )
 = 0.08 ± 1.96 × 0.03 = 0.08 ± 0.06

test ∆π = 0: [carrying several decimal places, for comparison with χ² later]
 z = ( 0.258 − 0.185 ) / √( 0.221[0.779] { 1/407 + 1/411 } ) = 0.073/0.029 = 2.52 (2.517 = √6.337)
 P = 0.006 (1-sided); P = 0.012 (2-sided)

Fall in BP with Reduction in Dietary Salt

Response of interest: Y = Achieve DBP < 90? [0/1]
H0: π(Y=1 | Normal Sodium Diet) = π(Y=1 | Low Sodium Diet)
Halt: π(Y=1 | Normal Sodium Diet) ≠ π(Y=1 | Low Sodium Diet)
α = 0.05 (2-sided)

Proportion achieving DBP < 90 mm Hg: "Normal" Group: 11/50 (22%); "Low" Group: 17/53 (32%)

 z = ( 0.32 − 0.22 ) / √( 0.27 • 0.73 • [ 1/53 + 1/50 ] )

i.e. |z| = 1.14, which is < Zα/2 = 1.96, and so the observed difference of 10 percentage points is "N.S."
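[Aside -- not in the original notes: the pooled-variance z-test as a Python function (name mine), reproducing two of the worked examples above.]

import math
from scipy.stats import norm

def two_prop_z(y1, n1, y2, n2):
    p1, p2 = y1/n1, y2/n2
    p = (y1 + y2) / (n1 + n2)              # pooled estimate under H0
    z = (p1 - p2) / math.sqrt(p*(1 - p)*(1/n1 + 1/n2))
    return z, 2 * norm.sf(abs(z))

print(two_prop_z(67, 101, 46, 91))     # Stroke Unit: z = 2.22, P = 0.026
print(two_prop_z(105, 407, 76, 411))   # Vitamin C:  z = 2.52, P = 0.012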


CI for the Risk Ratio (Rel. RISK) or Prevalence Ratio (cf. Rothman 2002 p135), and CI for the ODDS RATIO (cf. Rothman 2002 p139)

Example: Vitamin C and the common cold (CMAJ Sept 1972 p 503) ... REVISITED

              no colds      ≥ 1 cold      total subjects
 Vitamin C    105 (26%)     302 (74%)       407 (100%)
 Placebo       76 (18%)     335 (82%)       411 (100%)
 total        181 (22%)     637 (78%)       818 (100%)

                                        Vitamin C    Placebo
 had cold(s)                               302          335
 avoided colds                             105           76
 # with cold(s) for every 1
   who avoided colds                    2.88 (:1)    4.41 (:1)
 odds of cold(s)                           2.88         4.41

RR^ = Prob[ cold | Vitamin C ] / Prob[ cold | Placebo ] = 74% / 82% = 0.91

CI[RR]: antilog{ log[0.91] ± z SE[ log[p1] − log[p2] ] }
 = antilog{ log[0.91] ± z √( SE²[ log[p1] ] + SE²[ log[p2] ] ) }
From 8.1: SE²[ log[p1] ] = Var[ log[p1] ] = 1/302 − 1/407 = 0.000854
          SE²[ log[p2] ] = Var[ log[p2] ] = 1/335 − 1/411 = 0.000552
CI[RR]: antilog{ log[0.91] ± 1.96 √( 0.000854 + 0.000552 ) } = antilog{ log[0.91] ± 0.073 } = 0.85 to 0.98

Shortcut: calculate exp{ z×SE[log RR^] } and use it as a multiplier and divider of RR^. In our e.g., exp{ z×SE[log RR^] } = exp{0.073} = 1.076. Thus {RRLOWER, RRUPPER} = {0.91/1.076, 0.91×1.076} = {0.85, 0.98}. You can use this shortcut whenever you are working with log-based CI's that you convert back to the original scale; there they become "multiply-divide" symmetric rather than "plus-minus" symmetric.

odds Ratio = 2.88 / 4.41 = 0.65 >>> OR^ = 0.65

CI[OR] = anti-log[ log[oddsRatio] ± z SE[ logit1 − logit2 ] ]
From 8.1: SE²[logit1] = 1/#positive1 + 1/#negative1 ; SE²[logit2] = 1/#positive2 + 1/#negative2
SE[ logit1 − logit2 ] = √( (1/302 + 1/105) + (1/335 + 1/76) ) = 0.17
z SE[ logit1 − logit2 ] = 1.96 × 0.17 = 0.33
anti-log[ log[0.65] ± 0.33 ] = exp[ −0.43 ± 0.33 ] = 0.47 to 0.90
>>>>> OR^ = 0.65; CI[OR] = {0.47 to 0.90}

From SAS (output gives both RR and OR). Be CAREFUL as to rows/cols: the index exposure category must be the 1st row; the reference exposure category must be the 2nd. If necessary, use FORMAT to have the table come out this way (note the trick to reverse rows/cols). SAS doesn't know whether the design is case-control or cohort.

PROC FORMAT;
 VALUE onefirst 0="z_0" 1="a_1";
DATA CI_RR_OR;
INPUT vitC cold n_people;
LINES;
1 1 302
1 0 105
0 1 335
0 0 76
;
PROC FREQ data=CI_RR_OR ORDER=FORMATTED;
 TABLES vitC*cold / CMH;
 WEIGHT n_people;
 FORMAT vitC cold onefirst. ;
RUN;

From Stata:
Immediate: csi 302 335 105 76   (cs stands for 'cohort study')
Or, from a dataset:
input vit_c cold n_people
1 1 302
1 0 105
0 1 335
0 0 76
end
cs cold vit_c [freq = n_people]

For the odds ratio: cci 302 335 105 76, woolf   (cc stands for 'case control study')
or: cc cold vit_c [freq = n_people], woolf


"Test-based" CI's ... IN GENERAL "Test-based" CI's for... IN PARTICULAR

Preamble • Difference of 2 proportions π1 - π2 (Risk or Prevalence Difference)In 1959, when Mantel and Haenszel developed their summary Odds Ratiomeasure over 2 or more strata, they did not supply a CI to accompany thispoint estimate. From 1955 onwards, the main competitor was the weightedaverage (in the log OR scale) and accompanying CI obtained by Woolf'. Butthis latter method has problems with strata where one or more cellfrequencies are zero. In 1976, Miettinen developed the "test-based"method for epidemiologic situations where the summary point estimate iseasily calculated, the standard error estimate is unknown or hard tocompute, but where a statistical test of the null value of the parameter ofinterest (derived by aggregating a "sub-statistic" from each stratum) isalready available. Although the 1886 development, by Robins, Breslowand Greenland, of a direct standard error for the log of the Mantel-HaenszelOR estimator, the "test-based" CI is still used (see A&B KKM).

Observe : p1 & p2 and (maybe via p-value) the calculated value of x2

This implies that

Sqrt[observed x2 value] = observed x value = observed z value;

But... observed z statistic = ( p1 – p2 ) / SE [ p1 – p2 ] .

So... SE [ p1 – p2 ] = p1 – p2

observed z statistic use +ve sign

95% CI for p1 - p2 :

( p1 – p2 ) ± {z value for 95%} × SE[ p1 – p2 ]i.e.,...

( p1 – p2 ) ± {z value for 95%} × p1 – p2

observed z statistic

i.e., after re-arranging terms..

( p1 – p2 ) {1 ± z value for 95%

observed z statistic }or, in terms of a reported chi-square statistic

Even though its main usefulness is for summaries over strata, the idea canbe explained using a simpler and familiar (single starum) example, thecomparison of two independent means using a z-test with large df (theprinciple does not depend on t vs. z). Suppose all that was reported wasthe difference in sample means, and the 2-sided p-value associated with atest of the null hypothesis that the mean difference was zero. From thesample means, and the p-value, how could we obtain a 95%CI for thedifference in the "population' means? The trick is to1 work back (using a table of the normal distribution) from the p-value to

the corresponding value of the z-statistic (the number of standarderrors that the difference in sample means is from zero);

2 divide this observed difference by the observed z value, to get thestandard error of the difference in sample means, and

( p1 – p2 ) {1 ± z value for 95%

Sqrt[observed chi-square statistic] }3 use the observed difference, and the desired multiple (1.645 for 90%

CI, 1.96 for 95% etc.) to create the CI.

The same procedure is directly applicable for the difference of twoindependently estimated proportions. If one tests the (null) differenceusing a z-test, one can obtain the SE of the difference by dividing theobserved difference in proportions by the z statistic; if the difference wastested by a chi-square statistic, one can obtain the z-statistic by taking thesquare root of the observed chi-square value (authors call this square rootan observed 'chi' value). Either way, the observed z-value leads directly tothe SE, and from there to the CI.

See Section 12.3 of Miettinen's "Theoretical Epidemiology"

Technically, when the variance is a function of the parameter (as isthe case with binary response data), the test-based CI is mostaccurate close to the Null. However, as you can verify by comparingtest-based CIs with CI's derived in other ways, the inaccuracies arenot as extreme as textbooks and manuals (e.g. Stata) suggest.

This is worked out in the next example, where it is assumed that the nullhypothesis is tested via a chi-squared (x2) test


"Test-based" CI's for... IN PARTICULAR "Test-based" CI's for... IN PARTICULAR

• Ratio of 2 proportions π1/π2

( Risk Ratio; Prevalence Ratio; Relative Risk; "RR" )

• Ratio of 2 odds π1/[1–π1] and π2/[1–π2]

( Odds Ratio; "OR" )

Observe :(i) rr = p1/p2 and

(ii) (maybe via p-value) the value of x2 statistic (H0: RR=1)

Observe :(i) or = p1/[1–p1] / p2/[1–p2 ] ( " a×d / b×c " ) and

(ii) (maybe via p-value) the value of x2 statistic (H0: OR=1)

>>> Sqrt[observed x2 value] = observed x value = observed z value

In log scale, in relation to log[RRnull] = 0, observed z value would be:

observed z value = ( log[rr] – 0) / SE[ log[rr] ]

>>> Sqrt[observed x2 value] = observed x value = observed z value

In log scale, in relation to log[ORnull] = 0, observed z value would be:

observed z value = ( log[or] – 0) / SE[ log[or] ]

This implies that

SE[ log[rr] ] = log[rr] / observed z value   [use +ve sign]

95% CI for log[RR]:

log[rr] ± {z value for 95%} × SE[ log[rr] ]
i.e., log[rr] ± {z value for 95%} × log[rr] / observed z value
i.e., after re-arranging terms,
log[rr] × { 1 ± (z value for 95%) / (observed z statistic) }

Likewise, SE[ log[or] ] = log[or] / observed z value   [use +ve sign]

95% CI for log[OR]:

log[or] ± {z value for 95%} × SE[ log[or] ]
i.e., log[or] ± {z value for 95%} × log[or] / observed z value
i.e., after re-arranging terms,
log[or] × { 1 ± (z value for 95%) / (observed z statistic) }

Going back to the RR (or OR) scale, by taking antilogs*...

95% CI for RR:  rr to the power of { 1 ± (z value for 95%) / (observed z statistic) }

95% CI for OR:  or to the power of { 1 ± (z value for 95%) / (observed z statistic) }

See Section 13.3 of Miettinen's "Theoretical Epidemiology"

* antilog[ log[a] × b ] = exp[ log[a] × b ]= { exp[log[a]] } to power of b = a to power of b
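A matching sketch for the ratio versions (again an editorial illustration; the rr and chi-square values are invented):

from math import sqrt

def test_based_ci_ratio(ratio, chi2_obs, z_star=1.96):
    # rr (or or) raised to the powers 1 +/- z*/chi, with chi = sqrt(observed chi-square)
    chi = sqrt(chi2_obs)
    return ratio ** (1 - z_star / chi), ratio ** (1 + z_star / chi)

# e.g. a (hypothetical) rr = 2.0 whose test of RR = 1 gave a chi-square of 9.0 (chi = 3):
print(test_based_ci_ratio(2.0, 9.0))   # -> about (1.27, 3.14)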


Sample Size considerations... CI(π1 – π2): n's to produce a CI for the difference in π's of pre-specified margin of error m at a stated confidence level.

• large-sample CI: p1 – p2 ± Zα/2 × SE(p1 – p2) = p1 – p2 ± m

• SE(p1 – p2) = √[ p1{1–p1}/n1 + p2{1–p2}/n2 ]

Simplify by using an average p; if one uses equal n's, then

n per group = 2p{1–p} × Zα/2² / [margin of error]²

M&M use the fact that if p = 1/2 then p(1–p) = 1/4, and so 2p(1–p) = 1/2, so the above equation becomes

[max] n per group = Zα/2² / ( 2 × [margin of error]² )

Sample Size considerations... Test involving πT and πC: Test H0: πT = πC vs Ha: πT ≠ πC; n's for power 1–β if πT = πC + ∆; prob[type I error] = α.

n per group = { Zα/2 × √[2πC{1–πC}] – Zβ × √[πC{1–πC} + πT{1–πT}] }² / ∆²   (see Colton p 168)

≈ 2(Zα/2 – Zβ)² × { √[π̄{1–π̄}] / ∆ }² = 2{Zα/2 – Zβ}² × { σ0/1 / ∆ }²

where π̄ is the average of πC and πT, and σ0/1 = √[π̄{1–π̄}].

e.g. α = 0.05 (2-sided) & β = 0.2 ... Zα/2 = 1.96; Zβ = –0.84, so

2(Zα/2 – Zβ)² = 2{1.96 – (–0.84)}² ≈ 16, i.e.

n per group ≈ 16 × π̄{1–π̄} / ∆²

So n ≈ 100 for the T group and n ≈ 100 for the C group if πT = 0.6 and πC = 0.4.

See Sample Size Requirements for Comparison of 2 Proportions (from the text by Smith and Morrow) under Resources for Chapter 8.
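A small Python sketch of the approximate "n per group" formula, using the πT = 0.6 vs πC = 0.4 example above (editorial illustration; the function name is ad hoc):

from math import ceil

def n_per_group(pi_T, pi_C, z_alpha2=1.96, z_beta=-0.84):
    # ~ 2 (Z_alpha/2 - Z_beta)^2 pbar(1-pbar) / delta^2; Z_beta is negative when beta < 0.5
    pbar = (pi_T + pi_C) / 2
    delta = pi_T - pi_C
    return ceil(2 * (z_alpha2 - z_beta) ** 2 * pbar * (1 - pbar) / delta ** 2)

print(n_per_group(0.6, 0.4))   # 98; the notes round 2(1.96 + 0.84)^2 = 15.68 up to 16, giving ~100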


Effect of Unequal Sample Sizes ( n1 ≠ n2 ) on precision of estimated differences

If we write the SE of an estimated difference in mean responses as σ × √( 1/n1 + 1/n2 ), where σ is the (average) per-unit variability of the response, then we can establish the following principles:

1 If costs and other factors (including unit variability) are equal, and if both types of units are equally scarce or equally plentiful, then for a given total sample size of n = n1 + n2, an equal division of n, i.e. n1 = n2, is preferable, since it yields a smaller SE(estimated difference in means) than any non-symmetric division. However, the SE is relatively unaffected until the ratio exceeds 70:30. This is seen in the first table below, which gives the value of √( 1/n1 + 1/n2 ) = SE(estimated difference in means) for various combinations of n1 and n2 adding to 100 (the 100 itself is arbitrary) and assuming σ = 1 (also arbitrary).

2 If one type of unit is much scarcer, and thus the limiting factor, then it makes sense to choose all (say n1) of the available scarcer units, and some n2 ≥ n1 of the other type. The greater n2 is, the smaller the SE of the estimated difference. However, there is a 'law of diminishing returns' once n2 is more than a few multiples of n1. This is seen in the second table below, which gives the value of √( 1/n1 + 1/n2 ) for n1 fixed (arbitrarily) at 50 and n2 ranging from 1 × n1 to 100 × n1 (and beyond); again, we assume σ = 1.

n1    n2    SE(estimated difference in means)    % increase in SE over SE(50:50)*
50    50        0.200                                  --
60    40        0.204                                  2.1%
65    35        0.210                                  4.8%
70    30        0.218                                  9.1%
75    25        0.231                                 15.5%
80    20        0.250                                 25.0%
85    15        0.280                                 40.0%

* if the sample sizes are split π:(1–π), the % increase is 50/√[π(1–π)] – 100.

n1    n2     Ratio (K)   SE(µ1 – µ2)   SE(K:1) as % of SE(1:1)   SE(K:1)/SE(∞:1)*
50     50       1.0        0.2000             --                     1.414
50     75       1.5        0.1825            91.3%                   1.290
50    100       2.0        0.1732            86.6%                   1.225
50    150       3.0        0.1633            81.6%                   1.155
50    200       4.0        0.1581            79.1%                   1.118
50    250       5.0        0.1549            77.5%                   1.095
50    300       6.0        0.1527            76.4%                   1.080
50    400       8.0        0.1500            75.0%                   1.061
50    500      10.0        0.1483            74.2%                   1.049
50   1000      20.0        0.1449            72.4%                   1.025
50   5000     100.0        0.1421            71.1%                   1.005
50     ∞         ∞         0.1414            70.7%                   1

* calculated as √[(K+1)/K]; 'efficiency' = K/(K+1)

Note: these principles apply to both measurement and count data
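Both tables can be reproduced with a few lines of Python (editorial illustration):

from math import sqrt, inf

def se_diff(n1, n2, sigma=1.0):
    # SE of a difference in means: sigma * sqrt(1/n1 + 1/n2)
    return sigma * sqrt(1 / n1 + (0 if n2 == inf else 1 / n2))

# principle 1: symmetric vs asymmetric splits of a total of 100
for n1 in (50, 60, 70, 80, 85):
    print(n1, 100 - n1, round(se_diff(n1, 100 - n1), 3))

# principle 2: n1 fixed at 50; diminishing returns as K = n2/n1 grows
for K in (1, 1.5, 2, 4, 10, inf):
    print(K, round(se_diff(50, 50 * K), 4))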


Sample size calculation when using unequal sample sizes to estimate / test difference in 2 means or proportions

For power (sensitivity) 1–β, and specificity 1–α (2-sided), the sample sizes n1 and n2 have to be such that

Zα/2 × SE( x̄1 – x̄2 ) – Zβ × SE( x̄1 – x̄2 ) = ∆

(if β < 0.5, then Zβ will be negative). If we assume equal per-unit variability, σ, of the x's in the 2 populations, we can write the requirement as

Zα/2 × σ√( 1/n1 + 1/n2 ) – Zβ × σ√( 1/n1 + 1/n2 ) = ∆.

If we rewrite 1/n1 + 1/n2 as (1/n1) × {1 + n1/n2} and rearrange, we get

n1 = {1 + n1/n2} × (Zα/2 – Zβ)² × { σ/∆ }²

or, denoting n2/n1 by K,

n1 = {1 + 1/K} × (Zα/2 – Zβ)² × { σ/∆ }²,  i.e.  n1 = { (K+1)/K } × (Zα/2 – Zβ)² × { σ/∆ }².

Notes:

a. If K = 1, so that n1 = n2, then we get the familiar "2" at the front of the sample size formula.

b. The same factor applies for proportions: if we use σ0/1 = √[ π̄{1 – π̄} ] as an "average" standard deviation for the individual 0's and 1's in each population, we get the approximate formula

n1 ≈ { (K+1)/K } × (Zα/2 – Zβ)² × { π̄[1 – π̄] / ∆² }.
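In code (an editorial sketch of the final formula; names are ad hoc):

from math import ceil

def n1_unequal(sigma, delta, K=1.0, z_alpha2=1.96, z_beta=-0.84):
    # n1 = {(K+1)/K} (Z_alpha/2 - Z_beta)^2 (sigma/delta)^2, with n2 = K * n1
    n1 = ((K + 1) / K) * (z_alpha2 - z_beta) ** 2 * (sigma / delta) ** 2
    return ceil(n1), ceil(K * n1)

print(n1_unequal(sigma=1.0, delta=0.5, K=1))   # (63, 63): K = 1 gives the familiar '2' up front
print(n1_unequal(sigma=1.0, delta=0.5, K=4))   # (40, 157): scarcer group smaller, n2 ~ 4 x n1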


Add-ins for M&M §8 and §9 statistics for epidemiology

Sample Size considerations... Test involving OR

Test H0: OR = 1 vs. Ha: OR = ORalt :

n's for power 1–β if OR = ORalt; prob[type I error] = α

Key points: ln[or] is most precise when all 4 cells are of equal size; so...

1 Increasing the control:case ratio leads to diminishing marginal gains in precision. To see this... examine the function

1/(# of cases) + 1/(multiple of this # of controls)

for various values of the "multiple" [like we did back in Chapter 8, for the "effect of unequal sample sizes"].

Here I use ln for natural log (elsewhere I have used log; I use them interchangeably)

Work in the ln(or) scale;  SE[ ln(or) ] = √( 1/a + 1/b + 1/c + 1/d )

Need:  Zα/2 × SE[ ln(or) | H0 ] + Zβ × SE[ ln(or) | alt ] < "∆",

where "∆" = ln(ORalt)

[Diagram: the two sampling distributions of ln(or) -- one centred at ln[OR] = 0 (the null), the other at ln[ORalt] = ∆ -- with the distances Zα/2 × SE[ ln(or) | H0 ] and Zβ × SE[ ln(or) | alt ] marked off against ∆ = ln[ORalt].]

2 The more unequal the distribution of the etiologic / preventive factor, the less precise the estimate. Examine the functions

1/(# of exposed cases) + 1/(# of unexposed cases)

and

1/(# of exposed controls) + 1/(# of unexposed controls)

Reading the graphs on the next page (note the log scale for the observed or):

Take as an example the study in the middle panel, with 200 cases and an exposure prevalence of 8%. Say that the Type I error rate is set at α = 0.05 (2-sided), so that the upper critical value (the one that cuts off the top 2.5% of the null distribution) is close to or = 2. Draw a vertical line at this critical value, and examine how much of each non-null distribution falls to the right of it. This area to the right of the critical value is the power of the study, i.e., the probability of obtaining a significant or when in fact the indicated non-null value of OR is correct. The two curves at each OR value are for studies with 1 (grey) and 4 (black) controls/case. Note that the OR values 1, 1.5, 2.25 and 3.375 are also on a log scale.

Power is larger if...

i the non-null OR >> 1 (cf. 1.5 vs 2.25 vs 3.375);

ii the exposure is common (cf. 2% vs 8% vs 32%), though not near-universal;

iii one uses more cases (cf. 100 vs 200 vs 400), and more controls/case (1 vs 4).

To obtain sample sizes: substitute the expected a, b, c, d values under the null and alternative into the SE's and solve for the numbers of cases and controls; a sketch of such a calculation follows.

References: Schlesselman; Breslow and Day, Volume II, ...
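A rough Python sketch of such a power calculation (an editorial illustration, not the notes' exact method: it takes the control exposure prevalence as given, fills in expected cells under the alternative, and -- as a simplification -- uses the same Woolf SE on both sides of the inequality):

from math import sqrt, log
from statistics import NormalDist

def cc_power(n_cases, prev, OR, controls_per_case=1, alpha=0.05):
    n_ctl = controls_per_case * n_cases
    odds_ctl = prev / (1 - prev)                    # exposure odds among controls
    p_case = OR * odds_ctl / (1 + OR * odds_ctl)    # exposure prevalence among cases
    a, b = n_cases * p_case, n_cases * (1 - p_case) # expected exposed / unexposed cases
    c, d = n_ctl * prev, n_ctl * (1 - prev)         # expected exposed / unexposed controls
    se = sqrt(1/a + 1/b + 1/c + 1/d)                # Woolf SE of ln(or) from expected cells
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z - log(OR) / se)

print(round(cc_power(200, 0.08, 2.25, controls_per_case=4), 2))   # ~0.94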


Factors affecting variability of estimates from, and statistical power of, case-control studies

[Figure (jh 1995-2003): nine panels -- cases 100, 200, 400 by exposure prevalence 2%, 8%, 32% -- each showing the sampling distributions of the observed or (log scale, 0.25 to 8) for underlying OR values 1, 1.5, 2.25 and 3.375 (also log-spaced), with one curve for 1 control/case (grey) and one for 4 controls/case (black).]


The "Exact" Test for 2 x 2 tables [material taken from A&B §4.9]

Even with the continuity correction there will be some doubt about the adequacy of the χ2 approximation when the frequencies are particularly small. An exact test was suggested almost simultaneously in the mid-1930s by R. A. Fisher, J. O. Irwin and F. Yates. It consists in calculating the exact probabilities of the possible tables described in the previous subsection. The probability of a table with frequencies

a    b  | r1
c    d  | r2
------------
c1   c2 | N

is given by the formula

P[ a | r1, r2, c1, c2 ] = ( r1! r2! c1! c2! ) / ( N! a! b! c! d! )      (4.25)

This is, in fact, the probability of the observed cell frequencies conditional on the observed marginal totals, under the null hypothesis of no association between the row and column classifications. Given any observed table, the probabilities of all tables with the same marginal totals can be calculated, and the P value for the significance test calculated by summation. Example 4.14 illustrates the calculations and some of the difficulties of interpretation which may arise. The data in Table 4.6, due to M. Hellman, are discussed by Yates (1934).

Table 4.6 Data on malocclusion of teeth in infants (Yates, 1934)

              Normal teeth   Malocclusion    Total
Breast-fed          4             16           20
Bottle-fed          1             21           22
                ------         ------       ------
Total               5             37           42

There are six possible tables with the same marginal totals as those observed, since neither a nor c (in the notation given above) can fall below 0 or exceed 5, the smallest marginal total in the table. The cell frequencies in each of these tables are shown in Table 4.7. Below them are shown the probabilities of these tables, calculated under the null hypothesis.

Table 4.7 Cell frequencies in tables with the same marginal totals as those in Table 4.6

0 20 | 20    1 19 | 20    2 18 | 20    3 17 | 20    4 16 | 20    5 15 | 20
5 17 | 22    4 18 | 22    3 19 | 22    2 20 | 22    1 21 | 22    0 22 | 22
5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42    5 37 | 42

a       0        1        2        3        4        5
Pa    0.0310   0.1720   0.3440   0.3096   0.1253   0.0182

The probabilities of the various tables are calculated in the following way*: the probability that a = 0 is, from (4.25),

P0 = ( 20! 22! 5! 37! ) / ( 42! 0! 20! 5! 17! ) = 0.03096.

Tables of log factorials (Fisher and Yates, 1963, Table XXX) are often useful for this calculation, and many scientific calculators have a factorial key (although it may only function correctly for integers less than 70). Alternatively the expression for P0 can be calculated without factorials by repeated multiplication and division after cancelling common factors:

P0 = ( 22 × 21 × 20 × 19 × 18 ) / ( 42 × 41 × 40 × 39 × 38 ) = 0.03096.

The probabilities for a = 1, 2, ..., 5 can be obtained in succession. Thus,

P1 = (5 × 20)/(1 × 18) × P0,    P2 = (4 × 19)/(2 × 19) × P1,  etc.

The results are shown above.

[Notes from JH:

1. The 5 possible 2x2 tables from the tea-tasting experiment, with all marginal totals = 4, are another example of this hypergeometric distribution.

* 2. Don't worry about the formula and the factorials; Excel has this function built in. It is called the hypergeometric probability function. It is like the binomial, except that instead of specifying p, one specifies the size of the POPULATION and the NUMBER OF POSITIVES IN THE POPULATION. For example, to get P1 above, one would ask for HYPGEOMDIST(a; r1; c1; N).

The spreadsheet "Fisher's Exact test" uses this function; to use the spreadsheet, simply type in the 4 cell frequencies a, b, c, and d. The spreadsheet will calculate the probability for each possible table. Then you can find the tail areas yourself.]
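The same probabilities take only a few lines of Python (editorial illustration; math.comb plays the role of the factorials):

from math import comb

def hypergeom_p(a, r1, c1, N):
    # P[a | margins] = C(c1, a) C(N - c1, r1 - a) / C(N, r1)
    # (the same quantity as Excel's HYPGEOMDIST(a; r1; c1; N))
    return comb(c1, a) * comb(N - c1, r1 - a) / comb(N, r1)

# Table 4.6 margins: r1 = 20 breast-fed, c1 = 5 with normal teeth, N = 42
probs = [hypergeom_p(a, 20, 5, 42) for a in range(6)]
print([round(p, 4) for p in probs])   # [0.031, 0.172, 0.344, 0.3096, 0.1253, 0.0182]
print(round(probs[4] + probs[5], 4))  # one-sided P for the observed a = 4: 0.1435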


The "Exact" Test for 2 x 2 tables continued...

This is the complete conditional distribution for the observed marginal totals, and the probabilities sum to unity as would be expected. Note the importance of carrying enough significant digits in the first probability to be calculated; the above calculations were carried out with more decimal places than recorded, by retaining each probability in the calculator for the next stage. The observed table has a probability of 0.1253. To assess its significance we could measure the extent to which it falls into the tail of the distribution by calculating the probability of that table or of one more extreme. For a one-sided test the procedure clearly gives P = 0.1253 + 0.0182 = 0.1435. The result is not significant at even the 10% level.

For a two-sided test the other tail of the distribution must be taken into account, and here some ambiguity arises. Many authors advocate that the one-tailed P value should be doubled. In the present example, the one-tailed test gave P = 0.1435 and the two-tailed test would give P = 0.2870. An alternative approach is to calculate P as the total probability of tables, in either tail, which are at least as extreme as that observed in the sense of having a probability at least as small. In the present example we should have

P = 0.1253 + 0.0182 + 0.0310 = 0.1745.

The first procedure is probably to be preferred on the grounds that a significant result is interpreted as strong evidence for a difference in the observed direction, and there is some merit in controlling the chance probability of such a result to no more than half the two-sided significance level. The tables of Finney et al. (1963) enable one-sided tests at various significance levels to be made without computation, provided the frequencies are not too great.

To calculate the mid-P value, only half the probability of the observed table is included, and we have

mid-P = 0.5(0.1253) + 0.0182 = 0.0808

as the one-sided value; the two-sided value may be obtained by doubling this, to give 0.1617.

The results of applying the exact test in this example may be compared with those obtained by the χ2 test with Yates's correction. We find X2 = 2.39 (P = 0.12) without correction and X2C = 1.14 (P = 0.29) with correction. The probability level of 0.29 for X2C agrees well with the two-sided value 0.29 from the exact test, and the probability level of 0.12 for X2 is a fair approximation to the exact mid-P value of 0.16.

Cochran (1954) recommends the use of the exact test, in preference to the χ2 test with continuity correction, (i) if N < 20, or (ii) if 20 < N < 40 and the smallest expected value is less than 5. With modern scientific calculators and statistical software the exact test is much easier to calculate than previously, and should be used for any table with an expected value less than 5.

The exact test, and therefore the χ2 test with Yates's correction for continuity, have been criticized over the last 50 years on the grounds that they are conservative, in the sense that a result significant at, say, the 5% level will be found in less than 5% of hypothetical repeated random samples from a population in which the null hypothesis is true. This feature was discussed in §4.7, and it was remarked that the problem was a consequence of the discrete nature of the data and causes no difficulty if the precise level of P is stated. Another source of criticism has been that the tests are conditional on the observed margins, which frequently would not all be fixed. For example, in Example 4.14 one could imagine repetitions of sampling in which 20 breast-fed infants were compared with 22 bottle-fed infants, but in many of these samples the number of infants with normal teeth would differ from 5. The conditional argument is that, whatever inference can be made about the association between breast-feeding and tooth decay, it has to be made within the context that exactly five children had normal teeth. If this number had been different then the inference would have been made in this different context, but that is irrelevant to inferences that can be made when there are five children with normal teeth. Therefore, we do not accept the various arguments that have been put forward for rejecting the exact test based on consideration of possible samples with different totals in one of the margins. The issues were discussed by Yates (1984) and in the ensuing discussion, and by Barnard (1989) and Upton (1992), and we will not pursue this point further. Nevertheless, the exact test and the corrected χ2 test have the undesirable feature that the average value of the significance level, when the null hypothesis is true, exceeds 0.5. The mid-P value avoids this problem, and so is more appropriate when combining results from several studies (see §4.7).


The "Exact" Test for 2 x 2 tables continued...

As for a single proportion, the mid-P value corresponds to an uncorrected χ2 test, whilst the exact P value corresponds to the corrected χ2 test. The confidence limits for the difference, ratio or odds ratio of two proportions based on the standard errors given by (4.14), (4.17) or (4.19) respectively are all approximate, and the approximate values will be suspect if one or more of the frequencies in the 2 x 2 table are small. Various methods have been put forward to give improved limits, but all of these involve iterations and are tedious to carry out on a calculator. The odds ratio is the easiest case. Apart from exact limits, which involve an excessive amount of calculation, the most satisfactory limits are those of Cornfield (1956); see Example 16.1 and Breslow and Day (1980, §4.3) or Fleiss (1981, §5.6). For the ratio of two proportions a method was given by Koopman (1984) and Miettinen and Nurminen (1985) which can be programmed fairly readily. The confidence interval produced gives a good approximation to the required confidence coefficient, but the two tail probabilities are unequal due to skewness. Gart and Nam (1988) gave a correction for skewness, but this is tedious to calculate. For the difference of two proportions a method was given by Mee (1984) and Miettinen and Nurminen (1985). This involves more calculation than for the ratio limits, and again there could be a problem due to skewness (Gart and Nam, 1990).

• Fisher's exact test is usually used just as a test*; if one is interested in the difference ∆ = π1 – π2, the conditional approach does not yield a corresponding confidence interval for ∆. [It does provide one for the comparative odds ratio parameter ψ = { π1/(1–π1) } ÷ { π2/(1–π2) }.]

• Thus, one can find anomalous situations where the (conditional) test provides P > 0.05, making the difference 'not statistically significant', whereas the large-sample (unconditional) CI for ∆, computed as p1 – p2 ± z × SE(p1 – p2), does not overlap 0, and so would indicate that the difference is 'statistically significant'. [* see the Breslow and Day text, Vol I, §4.2, for CI's for ψ derived from the conditional distribution]

• See the letter from Begin & Hanley re 1/20 mortality with pentamidine vs 5/20 with Trimethoprim-Sulfamethoxazole in patients with Pneumocystis carinii Pneumonia. Annals Int Med 106:474, 1987.

• Miettinen's test-based method of forming CI's, while it can have some drawbacks, keeps the correspondence between test and CI and avoids such anomalies.

Notes by JH

• The word "exact" means that the p-values are calculated using a finite discrete reference distribution -- the hypergeometric distribution (cousin of the binomial) -- rather than using large-sample approximations. It doesn't mean that it is the correct test. [see the comment by A&B in their section dealing with mid-P values]

While greater accuracy is always desirable, this particular test uses a 'conditional' approach that not all statisticians agree with. Moreover, compared with some unconditional competitors, the test is somewhat conservative, and thus less powerful, particularly if sample sizes are very small.

• This illustrates one important point about parameters related to binary data: with means of interval data, we typically deal just with differences*; however, with binary data, we often switch between differences and ratios, either because the design of the study forces us to use odds ratios (case-control studies), or because the most readily available regression software uses a ratio (i.e. logistic regression for odds ratios), or because one is easier to explain than the other, or because one has a more natural interpretation (e.g. in assessing the cost per life saved of a more expensive and more efficacious management modality, it is the difference in, rather than the ratio of, mortality rates that comes into the calculation). [* the sampling variability of estimated ratios of means of interval data is also more difficult to calculate accurately]

• Two versions of an unconditional test for H0: π1 = π2 are available: Liddell; Suissa and Shuster.


FISHER'S EXACT TEST IN A DOUBLE-BLIND STUDY OF SYMPTOM PROVOCATION TO DETERMINE FOOD SENSITIVITY (N Engl J Med 1990; 323:429-33.)

Abstract

Background: Some claim that food sensitivities can best be identified by intradermal injection of extracts of the suspected allergens to reproduce the associated symptoms. A different dose of an offending allergen is thought to "neutralize" the reaction.

Methods: To assess the validity of symptom provocation, we performed a double-blind study that was carried out in the offices of seven physicians who were proponents of this technique and experienced in its use. Eighteen patients were tested in 20 sessions (two patients were tested twice) by the same technician, using the same extracts (at the same dilutions with the same saline diluent) as those previously thought to provoke symptoms during unblinded testing. At each session three injections of extract and nine of diluent were given in random sequence. The symptoms evaluated included nasal stuffiness, dry mouth, nausea, fatigue, headache, and feelings of disorientation or depression. No patient had a history of asthma or anaphylaxis.

Results: The responses of the patients to the active and control injections were indistinguishable, as was the incidence of positive responses: 27 percent of the active injections (16 of 60) were judged by the patients to be the active substance, as were 24 percent of the control injections (44 of 180). Neutralizing doses given by some of the physicians to treat the symptoms after a response were equally efficacious whether the injection was of the suspected allergen or saline. The rate of judging injections as active remained relatively constant within the experimental sessions, with no major change in the response rate due to neutralization or habituation.

Conclusions: When the provocation of symptoms to identify food sensitivities is evaluated under double-blind conditions, this type of testing, as well as the treatments based on "neutralizing" such reactions, appears to lack scientific validity. The frequency of positive responses to the injected extracts appears to be the result of suggestion and chance.

Table 1: Responses of 18 Patients Forced to Decide Whether Injections Contained an Active Ingredient or Placebo

          Active Injection       Placebo Injection        P
Pt.No.*   resp    no resp        resp    no resp        Value†
 3          2        1             1        8            0.13
 1          2        1             2        7            0.24
14a         2        1             2        7            0.24
12          1        2             0        9            0.25
16          2        1             3        6            0.36
18          2        1             4        5            0.50
14b         1        2             2        7            0.87
 4          1        2             2        7            0.87
 5          1        2             2        7            0.87
 9          0        3             0        9             --
 2a         0        3             1        8            0.75
13          0        3             1        8            0.75
15          1        2             3        6            0.76
 6          0        3             2        7            0.55
 8          0        3             2        7            0.55
17          1        2             5        4            0.50
 2b         0        3             3        6            0.38
 7          0        3             3        6            0.38
10          0        3             3        6            0.38
11          0        3             3        6            0.38

* Patients were numbered in the order they were studied. The order in the table is related to the degree to which the results agree with the hypothesis that patients could distinguish active injections from placebo injections. The results listed below those of Patient 9 do not support this hypothesis: placebo injections were identified as active at a higher rate than were true active injections. The letters a and b denote the first and second testing sessions, respectively, in Patients 2 and 14.

† Calculated according to Fisher's exact test, which assumes that the hypothesized direction of effect is the same as the direction of effect in the data. Therefore, when the effect is opposite to the hypothesis, as it is for the data below those of Patient 9, the P value computed is testing the null hypothesis that the results obtained were due to chance, as compared with the possibility that the patients were more likely to judge a placebo injection as active than an active injection.

[From the article's other footnotes: ID denotes intradermal, and SC subcutaneous. The value is the P value associated with the test of whether the common odds ratio (the odds ratio for all patients) is equal to 1.0. The common odds ratio was equal to 1.13 (computed according to the Mantel-Haenszel test).]


Notes on P-Values from Fisher's Exact Test in previous article

Patient no. 3                          Patient no. 18
              Response                               Response
              +    –   Total                         +    –   Total
Active        2    1   |  3            Active        2    1   |  3
Placebo       1    8   |  9            Placebo       4    5   |  9
             -----------                            -----------
              3    9                                 6    6

All possible tables with a total of 3 +ve responses (a = # of +ve among the 3 active):

a            0              1            2       3
tables     0 3 / 3 6     1 2 / 2 7    2 1 / 1 8   3 0 / 0 9
prob        0.382          0.491        0.123    0.005
(pt #)   (2b,7,10,11)   (14b, 4, 5)     (3)
P-Value*    1.0            0.618        0.128    0.005

via the recurrence: P0 = (9·8·7)/(12·11·10) = 0.382; P1 = P0 × (3·3)/(1·7) = 0.491; P2 = P1 × (2·2)/(2·8) = 0.123; P3 = P2 × (1·1)/(3·9) = 0.005.

All possible tables with a total of 6 +ve responses:

a            0       1       2       3
tables     0 3/6 3  1 2/5 4  2 1/4 5  3 0/3 6
prob        0.091   0.409   0.409   0.091
(pt #)              (17)    (18)
P-Value     1.0     0.909   0.500   0.091   (1-sided, as above)

P0 = (6·5·4)/(12·11·10) = 0.091; P1 = P0 × (3·6)/(1·4) = 0.409; P2 = P1 × (2·5)/(2·5) = 0.409; P3 = P2 × (1·4)/(3·6) = 0.091.

Patient no. 1
              +    –   Total
Active        2    1   |  3
Placebo       2    7   |  9
             -----------
              4    8

All possible tables with a total of 4 +ve responses:

a            0       1        2       3
tables     0 3/4 5  1 2/3 6  2 1/2 7  3 0/1 8
prob        0.255   0.510    0.218   0.018
(pt #)              (15)    (1, 14a)
P-Value     1.0     0.745    0.236   0.018

P0 = (8·7·6)/(12·11·10) = 0.255; P1 = P0 × (3·4)/(1·6) = 0.510; P2 = P1 × (2·3)/(2·7) = 0.218; P3 = P2 × (1·2)/(3·8) = 0.018.

In Table 1, the P-values for patients below patient 9 are calculated as 1-sided, but guided by the opposite Halt from that used for the patients in the upper half of the table, i.e. by Halt: π of +ve responses with Active < π of +ve responses with Placebo.

It appears that the authors decided the "sided-ness" of the Halt after observing the data!!! and that they used different Halt's for different patients!!!

(* 1-sided, guided by Halt: π of +ve responses with Active > π of +ve responses with Placebo)
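These patient-specific one-sided P-values are easy to verify in Python (editorial sketch, reusing the hypergeometric probability from the earlier page):

from math import comb

def fisher_one_sided(a, r1, c1, N, upper=True):
    # sum the hypergeometric probabilities of tables as or more extreme than a
    p = lambda k: comb(c1, k) * comb(N - c1, r1 - k) / comb(N, r1)
    ks = range(a, min(r1, c1) + 1) if upper else range(0, a + 1)
    return sum(p(k) for k in ks)

# Patient 3: a = 2 of r1 = 3 active injections +ve; c1 = 3 +ve among N = 12
print(round(fisher_one_sided(2, 3, 3, 12), 3))   # 0.127 (Table 1 rounds to 0.13)
# Patient 18: same a and r1, but c1 = 6 +ve in total
print(round(fisher_one_sided(2, 3, 6, 12), 3))   # 0.500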


Fisher's Exact Test: political and ecological correctness ... M&M §8.2, updated Dec 14 2003

The Namibian government expelled the authors from Namibia following the publication of this article; the reason given was that their "data and conclusions were premature". jh ¶ Do you agree?

Since 1900 the world's population has increased from about 1.6 to over 5 billion; the U.S. population has kept pace, growing from nearly 75 to 260 million. While the expansion of humans and environmental alterations go hand in hand, it remains uncertain whether conservation programs will slow our biotic losses. Current strategies focus on solutions to problems associated with diminishing and less continuous habitats, but in the past, when habitat loss was not the issue, active intervention prevented extirpation. Here we briefly summarize intervention measures and focus on tactics for species with economically valuable body parts, particularly on the merits and pitfalls of biological strategies tried for Africa's most endangered pachyderms, rhinoceroses.

Given the inadequacies of protective legislation and enforcement, Namibia, Zimbabwe, and Swaziland are using a controversial preemptive measure, dehorning (Fig. D), with the hope that complete devaluation will buy time for implementing other protective measures (7). In Namibia and Zimbabwe, two species, black and white rhinos (Ceratotherium simum), are dehorned, a tactic resulting in sociological and biological uncertainty: Is poaching deterred? Can hornless mothers defend calves from dangerous predators?

On the basis of our work in Namibia during the last 3 years (8) and comparative information from Zimbabwe, some data are available. Horns regenerate rapidly, about 8.7 cm per animal per year, so that 1 year after dehorning the regrown mass exceeds 0.5 kg. Because poachers apparently do not prefer animals with more massive horns (8), frequent and costly horn removal may be required (9). In Zimbabwe, a population of 100 white rhinos, with at least 80 dehorned, was reduced to less than 5 animals in 18 months (10). These discouraging results suggest that intervention by itself is unlikely to eliminate the incentive for poaching. Nevertheless, some benefits accrue when governments, rather than poachers, practice horn harvesting, since less horn enters the black market. Whether horn stockpiles may be used to enhance conservation remains controversial, but mortality risks associated with anesthesia during dehorning are low (5).

Biologically, there have also been problems. Despite media attention and a bevy of allegations about the soundness of dehorning (11), serious attempts to determine whether dehorning is harmful have been remiss. A lack of negative effects has been suggested because (i) horned and dehorned individuals have interacted without subsequent injury; (ii) dehorned animals have thwarted the advance of dangerous predators; (iii) feeding is normal; and (iv) dehorned mothers have given birth (12). However, most claims are anecdotal and mean little without attendant data on demographic effects. For instance, while some dehorned females give birth, it may be that these females were pregnant when first immobilized. Perhaps others have not conceived or have lost calves after birth. Without knowing more about the frequency of mortality, it seems premature to argue that dehorning is effective.

We gathered data on more than 40 known horned and hornless black rhinos in the presence and absence of dangerous carnivores in a 7,000 km2 area of the northern Namib Desert and on 60 horned animals in the 22,000 km2 Etosha National Park. On the basis of over 200 witnessed interactions between horned rhinos and spotted hyenas (Crocuta crocuta) and lions (Panthera leo) we saw no cases of predation, although mothers charged predators in about 45% of the cases. Serious interspecific aggression is not uncommon elsewhere in Africa, and calves missing ears and tails have been observed from South Africa, Kenya, Tanzania, and Namibia (13).

To evaluate the vulnerability of dehorned rhinos to potential predators, we developed an experimental design using three regions:

• Area A had horned animals with spotted hyenas and occasional lions [ ... ]
• Area B had dehorned animals lacking dangerous predators
• Area C consisted of dehorned animals that were sympatric with hyenas only.

Populations were discrete and inhabited similar xeric landscapes that averaged less than 125 mm of precipitation annually. Area A occurred north of a country-long veterinary cordon fence, whereas animals from areas B and C occurred to the south or east, and no individuals moved between regions.

                A     B     C
survived††      4     3     0
died            0     0     3
                4     3     3

The differences in calf survivorship were remarkable. All three calves in area C died within 1 year of birth, whereas all calves survived for both dehorned females living without dangerous predators (area B; n = 3) and for horned mothers in area A (n = 4). Despite admittedly restricted samples, the differences are striking [Fisher's (3 x 2) exact test, P = 0.017; area B versus C, P = 0.05; area A versus C, P = 0.029]††. The data offer a first assessment of an empirically derived relation between horns and recruitment.

Our results imply that hyena predation was responsible for calf deaths, but other explanations are possible. If drought affected one area to a larger extent than the others, then calves might be more susceptible to early mortality. This possibility appears unlikely because all of western Namibia has been experiencing drought and, on average, the desert rhinos in one area were in no poorer bodily condition than those in another. Also, the mothers who lost calves were between 15 to 25 years old, suggesting that they were not first-time, inexperienced mothers (14). What seems more likely is that the drought-induced migration of more than 85% of the large herbivore biomass (kudu, springbok, zebra, gemsbok, giraffe, and ostrich) resulted in hyenas preying on an alternative food, rhino neonates, when mothers with regenerating horns could not protect them.

Clearly, unpredictable events, including drought, may not be anticipated on a short-term basis. Similarly, it may not be possible to predict when governments can no longer fund antipoaching measures, an event that may have led to the collapse of Zimbabwe's dehorned white rhinos. Nevertheless, any effective conservation actions must account for uncertainty. In the case of dehorning, additional precautions must be taken. [ ... ]

†† All possible tables with the observed margins, and their null (hypergeometric) probabilities:

B vs C     survived (B, C):   3,0    2,1    1,2    0,3     row total 3
           died     (B, C):   0,3    1,2    2,1    3,0     row total 3
           prob:              1/20   9/20   9/20   1/20

(observed: survived 3,0 -> one-sided P = 1/20 = 0.05)

A vs C     survived (A, C):   4,0    3,1    2,2    1,3     row total 4
           died     (A, C):   0,3    1,2    2,1    3,0     row total 3
           prob:              1/35   12/35  18/35  4/35

(observed: survived 4,0 -> one-sided P = 1/35 = 0.029)


Add-ins for M&M §8 and §9 [ updated Jan 3 2004 ] stratified data

Combining measures from several strata [cf. M&M3 §2.6, A&B3 §16.2, Rothman2002 Chapter 8]

Why not just add (sum, Σ) the 'a' frequencies across tables (strata), the 'b' frequencies across tables, ... the 'd' frequencies across tables, to make a single 2 x 2 table with entries

Σa   Σb
Σc   Σd

and use these 4 cell counts to perform the analyses?

e.g. 1 Batting Averages of Gehrig and Ruth (see the book "Innumeracy" by Paulos, and an early chapter in the text "Statistics" by Freedman et al)

                        Gehrig         Ruth
1st half of season       .290    <     .300
2nd half of season       .390    <     .400
––––––––––––––––––––––––––––––––––––––––––––
Entire season            .357    >     .333  !!!

Explanation:
                        Gehrig         Ruth
1st half:  hits            29            60
           AT BATs        100           200
2nd half:  hits            78            40
           AT BATs        200           100
––––––––––––––––––––––––––––––––––––––––––––
Entire season: hits       107           100
           AT BATs        300           300

Two features, involving time, created this 'paradox':
-1- batting averages increased from the 1st to the 2nd half of the season;
-2- Ruth had a greater proportion of his AT BATs in the 1st half than Gehrig.

e.g. 2 Numbers of Applicants (n), and Admission rates (%), to Berkeley Graduate School

               Men                 Women
Faculty     n    % admitted     n    % admitted
A          825      62         108      82
B          560      63          25      68
C          325      37         593      34
D          417      33         375      35
E          191      28         393      27
F          373       6         341       7
Combined  2691      44        1835      30

Paradox: π(admission | male) > π(admission | female) overall but, by and large, faculty by faculty, it's the other way!!!

Explanation: women are more likely than men to apply to the faculties that admit lower proportions of applicants.

Remedy: aggregate the within-strata comparisons [like vs. like], rather than make comparisons with aggregated raw data -- see next for classical ways of doing this; MH stands for "Mantel-Haenszel".

For other examples:

1. See Moore and McCabe (3rd ed) §2.6 (The Perils of Aggregation, including Simpson's paradox). They speak of 'lurking' variables; in epidemiology we speak of 'confounding' variables.

2. See Rothman2002, p1 (death rates, Panama vs. Sweden) and p2 (20-year mortality in female smokers and non-smokers in Whickham, England).

Simpson's paradox is an extreme form of confounding. Some textbooks give made-up examples; see the web site for course 626 for several real examples.


Story 4: Does Smoking Improve Survival? -- in the EESEE Expansion Modules on the website for the text (link from course description) [also in Rothman2002, with finer age-categories]

http://WWW.WHFREEMAN.COM/STATISTICS/IPS/EESEE4/EESEES4.HTM

A survey concerned with thyroid and heart disease was conducted in 1972-74 in a district near Newcastle, United Kingdom by Tunbridge et al (1977). A follow-up study of the same subjects was conducted twenty years later by Vanderpump et al (1996). Here we explore data from the survey on the smoking habits of 1314 women who were classified as being a current smoker or as never having smoked at the time of the original survey. Of interest is whether or not they survived until the second survey.

Results: The following tables summarize the results of the experiment [note from JH: we would not call it "an experiment"; mathematical statisticians call any process for generating data "an experiment"].

Table 1: Relationship between smoking habits and 20-year survival in 1314 women (582 Smokers, 732 Non-Smokers)

                     Smoking Status
Survival Status    Smoker    Non-Smoker
Dead                 139         230
Alive                443         502

Compared...

Risk = #dead/#Total:   139/582 = 23.9%   vs   230/732 = 31.4%      Diff: –7.5%   Ratio: 0.76
Odds = #dead/#alive:   139/443 = 0.314(:1)   vs   230/502 = 0.458(:1)            Ratio: 0.68*

* shortcut: or = (a × d)/(b × c) = (139 × 502)/(230 × 443) = 69778/101890 = 0.68

A message the tobacco companies would love us to believe!

Table 2: Twenty-year survival status for 1314 women categorized by age and smoking habits at the time of the original survey.

Age Group                          Smoking Status
(Years)    Survival Status     Smoker    Non-Smoker
18-44      Dead                  19          13
           Alive                269         327        (or = 1.78)
44-64      Dead                  78          52
           Alive                167         147        (or = 1.32)
>64        Dead                  42         165
           Alive                  7          28        (or = 1.02)

The odds ratio is > 1 in each age group! Why the contradictory results?
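The crude ('everyone lumped together') and stratum-specific odds ratios can be checked in a few lines of Python (editorial illustration):

strata = {"18-44": (19, 13, 269, 327),   # (a, b, c, d) = (dead S, dead NS, alive S, alive NS)
          "44-64": (78, 52, 167, 147),
          ">64":   (42, 165, 7, 28)}

odds_ratio = lambda a, b, c, d: (a * d) / (b * c)

crude = [sum(col) for col in zip(*strata.values())]
print(round(odds_ratio(*crude), 2))            # 0.68 -- smoking looks 'protective'
for age, cells in strata.items():
    print(age, round(odds_ratio(*cells), 2))   # 1.78, 1.32, 1.02 -- all above 1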


Adjustment (compare like with like, i.e. Σ within-category estimates**)

Weighted averages [explicit weights (w's)] -- precision-based (inverse variance) or investigator-chosen ("standardized"):

Mean Difference:   Σ w × ȳ(index) – Σ w × ȳ(ref) = Σ w × ( ȳ(index) – ȳ(ref) )

Risk Difference:   Σ w × risk(index) – Σ w × risk(ref) = Σ w × ( risk(index) – risk(ref) )

Odds Ratio ("Woolf" method, precision-based):
exp[ Σ w × logodds(index) – Σ w × logodds(ref) ] = exp[ Σ w × log[odds ratio] ]   (all logs to base e)
where w = 1 / Var[ log[odds ratio] ] = 1 / ( 1/a + 1/b + 1/c + 1/d )

e.g. 8.1 in Rothman p147, for the risk difference (precision weighting):
Var[risk diff] is proportional to 1/N0 + 1/N1 = (N0 + N1)/(N0·N1), so that the denominator contribution, i.e., the weight, is
w = 1/Var = (N0·N1)/(N0 + N1) = (N0·N1)/T,
and the numerator contribution is
( risk(index) – risk(ref) ) × w = ( a/N1 – b/N0 ) × (N0·N1)/T = ( a·N0 – b·N1 ) / T   (after some algebra).

Ratio** estimators ("M-H") [implicit precision weighting]:

Risk Ratio:   Σ { #cases(index) × DENOM(ref) / DENOM(total) }  ÷  Σ { #cases(ref) × DENOM(index) / DENOM(total) }

Rate Ratio:   the same, except that the denominators are amounts of person-time, not persons.

Odds Ratio [case-control study]:   Σ { #cases(index) × "denom"(ref) / "size"(total) }  ÷  Σ { #cases(ref) × "denom"(index) / "size"(total) }
-- the same as the risk and rate ratios above, except that the "denominators" are partial (pseudo-) ones estimated from a denominator series* ("controls"); "size"(total) refers to the size of the (stratum-specific) case series and denominator series combined. *The MODERN way to view case-control studies.

Odds Ratio [cohort / prevalence study]:   Σ { #cases(index) × #"rest"(ref) / total }  ÷  Σ { #cases(ref) × #"rest"(index) / total }
It is not that common to use this measure, since the odds ratio is more cumbersome to explain, and less 'natural'. One might use it to maintain comparability with the results of a log-odds (logistic) regression. If # cases a small fraction ...

Note: computational formulae are often constructed to minimize the number of steps and avoid division, and so may hide the real structure of the estimator.


**NOTE ON RATIO ESTIMATORS: Even though one could (if all denominators were obligingly non-zero) rewrite the ratio estimator as a weighted average of ratios, this would run counter to Mantel's express wishes: to calculate just one ratio at the end, i.e. a ratio of two sums, rather than a sum of ratios. The main reason is statistical stability: imagine a (simpler, non-comparative) situation where one wished to estimate the overall sex ratio in small day-care facilities. Would you average the ratios from each facility, or take a single ratio of the total number of males to the total number of females? The caveat does not apply to absolute differences, where the difference of two weighted averages (same set of weights for both) is the same as the weighted average of the differences.

Matched pairs: the limiting case of finely stratified data

Examples: pair-matched case-control studies; mother -> infant transmission of HIV in twins, in relation to order of delivery; & others... [see the 607 notes for Ch 9]. ALSO: case-crossover studies (self-matched case-control studies), e.g. Redelmeier: auto accidents, while on/off a cell phone when driving.

The 4 possibilities for the 2 pair-members are (using the generic 2 x 2 table; the 2nd row might be a 'denominator series' of 1 per case*):

            Category of Determinant
Outcome     Index        Reference      Total
Yes         a = 1 or 0   b = 1 or 0       1
No          c = 0 or 1   d = 0 or 1       1
              1              1          T = 2

* In a matched (self- or other-) case-control study, the "denominator series" is not limited to 1 "probe-for-exposure" per case... one could ask about "usual" exposure (e.g. % of time usually exposed) or sample several "person-moments" ['controls'] per case, i.e. the 2nd-row total could be > 2.

The contributions to orMH from the 4 possibilities are:

Outcome*   Index  Ref  Tot                  a×d/T   b×c/T   No. pairs
Yes          1     1  | 1
No           0     0  | 1
             1     1  | 2                     0       0       "A"

Yes          1     0  | 1
No           0     1  | 1
             1     1  | 2                    1/2      0       "B"

Yes          0     1  | 1
No           1     0  | 1
             1     1  | 2                     0      1/2      "C"

Yes          0     0  | 1
No           1     1  | 1
             1     1  | 2                     0       0       "D"

Odds Ratio estimator = ( A×0 + B×(1/2) + C×0 + D×0 ) / ( A×0 + B×0 + C×(1/2) + D×0 ) = B/C

Tabular format for displaying matched-pair data:

COHORT STUDY                       Result in Other PAIR Member
                                     +ve      –ve
Result in One      +ve                A        B
PAIR Member        –ve                C        D         Total # PAIRS: n

e.g. the response of the same subject in each of 2 conditions (self-paired); the responses of a matched pair, one in 1 condition, 1 in the other; ∆'s in paired responses on an interval scale, reduced to the sign of ∆.

CASE-CONTROL STUDY                 Exposure in "Control"
                                     +ve      –ve
Exposure in        +ve                A        B
"Case"             –ve                C        D         Total PAIRS: n
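A tiny Python sketch of why the M-H estimator collapses to B/C over pair-strata (editorial illustration with made-up pair counts):

def matched_pair_or(pairs):
    # each pair is a stratum (a, b, c, d) with T = 2; orMH = sum(a*d/T) / sum(b*c/T)
    num = sum(a * d / 2 for a, b, c, d in pairs)
    den = sum(b * c / 2 for a, b, c, d in pairs)
    return num / den

# e.g. 10 type-'B' pairs and 4 type-'C' pairs (types 'A' and 'D' contribute 0 to both sums):
pairs = [(1, 0, 0, 1)] * 10 + [(0, 1, 1, 0)] * 4
print(matched_pair_or(pairs))   # 2.5 = B/C = 10/4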


Standardization of Rates [proportion-type and incidence-type] [explicit, investigator-selected weights]

SMR = Total # cases observed / Total # cases expected

    = Σ # observed / Σ # expected   (*)

    = Σ observed # / Σ { ref. rate × exposed PT }

    = Σ { observed rate × exposed PT } / Σ { ref. rate × exposed PT }

    = Σ { observed rate × w } / Σ { ref. rate × w },   with w = exposed PT

• It is usual to first calculate the standardized rate for the index category (of the determinant) and the standardized rate for the reference category (of the determinant) separately, then compare the standardized rates.

• If one uses the confounder distribution in one of the two compared determinant categories as the common set of weights, then the standardized rate in this category remains unchanged from the crude rate in this category. See the worked example comparing death rates in Quebec males in 1971 and 1991 in the document "Direct" and "Indirect" Standardization: 2 sides of the same coin? (.pdf) under "Material from previous years" on the c626 web page. This is an interesting local case of natural confounding: relative to that of 20 years earlier, the crude mortality rate in 1991 was 1.00 (i.e. unchanged), yet in every age category the rate in 1991 was at least 10% lower, and in many age groups more than 20% lower, than in 1971 (in the table, the rate ratios in bold are 71/91, so take their reciprocals to see the rate ratios 91/71).

If one starts again from (*), one can show that the SMR can also be represented as a weighted average of rate ratios [as was mentioned in the footnote to the Quebec table*]:

SMR = Σ # observed / Σ # expected

    = Σ { obs. rate × exposed PT } / Σ # expected

    = Σ { (obs. rate / ref. rate) × ref. rate × exposed PT } / Σ # expected     (divide & multiply by the ref. rate)

    = Σ { (obs. rate / ref. rate) × # expected } / Σ # expected  =  a weighted average of rate ratios
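Numerically (an editorial sketch with invented stratum data, labelled as such):

# hypothetical strata: (observed cases, person-time, reference rate per unit of PT)
strata = [(40, 1000.0, 0.020), (60, 2000.0, 0.035)]

obs_total = sum(o for o, pt, ref in strata)
exp_total = sum(ref * pt for o, pt, ref in strata)    # sum of ref. rate x exposed PT
print(round(obs_total / exp_total, 2))                # SMR = 100/90 ~ 1.11

# the same number, written as the expected-count-weighted average of rate ratios:
wtd = sum(((o / pt) / ref) * (ref * pt) for o, pt, ref in strata) / exp_total
print(round(wtd, 2))                                  # 1.11 again, as the algebra promises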

• Read Rothman's comment (p159) about uniformity of effect (e.g. a constant rate ratio across age groups in the Quebec example). Why, in his last sentence in that paragraph, does he seem to "allow" a weighted average of very different rate ratios if they were derived from standardization, but NOT if they were derived from (precision-weighted) pooling?

• Rothman (p161) emphasizes how "silly" the term "indirect" standardization, used with the standardized mortality ratio, is. He correctly points out that "the calculations for any rate standardization, 'direct' or 'indirect', are basically the same". He leaves it as an exercise (Q4, page 166) to work out what the weights are in the so-called "indirect" standardization used to compute an SMR (or SIR).

*cf. Liddell FD. The measurement of occupational mortality. Br J Ind Med. 1960 Jul;17:228-33.

Hint: write the SMR (with Σ denoting the sum over strata) as ...


Table 2 again: twenty-year survival status for the 1314 women, by age and smoking habits. Worked-out calculations (see the same calculations on the spreadsheet; see Rothman Ch 8) for:

• the Mantel-Haenszel summary odds ratio, orMH, and the Mantel-Haenszel (chi-square) test of OR1 = OR2 = OR3 = 1;
• Woolf: exp[ weighted average of ln or's ], with Var[ weighted ave ] = 1 / {sum of weights}.

(* r1, r2 are row totals; c1, c2 are column totals)

                                       Summary OR        Test statistic*                      Woolf's method
Age     Surv           Non-            a·d/n   b·c/n     E[a|H0]     Var[a|H0]                ln or   Var[ln or]       Weight        W × ln or
Group   Status  Smoker Smoker    n                       r1·c1/n     r1·r2·c1·c2/{n²(n–1)}    (1)     1/a+1/b+1/c+1/d  (2)=1/Var     (1)×(2)
18-44   Dead      19     13
        Alive    269    327     628     9.89    5.57      14.7         7.6                    0.575    0.1363           7.335         4.218
        (or = 1.78)
44-64   Dead      78     52
        Alive    167    147     444    25.82   19.56      71.7        22.8                    0.278    0.0448          22.30          6.199
        (or = 1.32)
>64     Dead      42    165
        Alive      7     28     242     4.86    4.77      41.9         4.9                    0.018    0.2084           4.798         0.086
        (or = 1.02)
Sum    (Σa = 139)              1314    40.57   29.90     128.3        35.2                                             34.433        10.503

orMH = Σ(a·d/n) / Σ(b·c/n) = 40.57/29.90 = 1.36

X²MH (1 df) = {Σa – ΣE[a|H0]}² / ΣVar[a|H0] = {139 – 128.3}²/35.2 = 3.24 ;   XMH = √3.24 = 1.80

Woolf: weighted ave. of ln or's = 10.503/34.433 = 0.305 ;  exp[weighted ave. of ln or's] = exp[0.305] = 1.36

(Miettinen) test-based 100(1–α)% CI for OR:  orMH^(1 ± zα/2/XMH) = 1.36^(1 ± 1.96/1.80) = 0.97 to 1.89   (95% CI)

(Woolf) 100(1–α)% CI for OR:  exp[ {weighted ave. of ln or's} ± zα/2 × √(1/34.433) ] = exp[ 0.305 ± 1.96 × 0.170 ] = 0.97 to 1.89
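All the numbers in this panel can be verified with a short Python script (editorial illustration):

from math import log, exp, sqrt

strata = [(19, 13, 269, 327), (78, 52, 167, 147), (42, 165, 7, 28)]   # (a, b, c, d)

n = lambda a, b, c, d: a + b + c + d
or_mh = sum(a*d/n(a,b,c,d) for a,b,c,d in strata) / sum(b*c/n(a,b,c,d) for a,b,c,d in strata)

A = sum(a for a, b, c, d in strata)
E = sum((a+b)*(a+c)/n(a,b,c,d) for a,b,c,d in strata)
V = sum((a+b)*(c+d)*(a+c)*(b+d)/(n(a,b,c,d)**2*(n(a,b,c,d)-1)) for a,b,c,d in strata)
x2_mh = (A - E) ** 2 / V

w = lambda a, b, c, d: 1 / (1/a + 1/b + 1/c + 1/d)       # Woolf weights
wsum = sum(w(*s) for s in strata)
woolf = sum(w(*s) * log(s[0]*s[3]/(s[1]*s[2])) for s in strata) / wsum

print(round(or_mh, 2), round(x2_mh, 2), round(exp(woolf), 2))                     # 1.36 3.24 1.36
print([round(or_mh ** (1 + s * 1.96 / sqrt(x2_mh)), 2) for s in (-1, 1)])         # test-based CI
print([round(exp(woolf + s * 1.96 / sqrt(wsum)), 2) for s in (-1, 1)])            # Woolf CI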


Via SAS:

data sasuser.simpson;
input age $ i_smoke i_dead number;
lines;
18-44 1 1 19
18-44 1 0 269
18-44 0 1 13
18-44 0 0 327
44-64 1 1 78
44-64 1 0 167
44-64 0 1 52
44-64 0 0 147
64-  1 1 42
64-  1 0 7
64-  0 1 165
64-  0 0 28
;
run;

options ls = 75 ps = 50; run;
proc freq data=sasuser.simpson;
tables age * i_smoke * i_dead /
  nocol norow nopercent cmh expected;
weight number;  /* weight indicates multiples */
run;

Output (excerpts). Each stratum's table of I_SMOKE by I_DEAD shows the observed and (under H0) expected frequencies; e.g. CONTROLLING FOR AGE=18-44: non-smokers 327 alive (expected 322.68) and 13 dead (17.325); smokers 269 alive (273.32) and 19 dead (14.675); totals 596, 32, 628. Similarly for AGE=44-64 (147, 52 / 167, 78; expected 140.73, 58.266 / 173.27, 71.734) and AGE=64- (28, 165 / 7, 42; expected 27.913, 165.09 / 7.0868, 41.913).

SUMMARY STATISTICS FOR I_SMOKE BY I_DEAD, CONTROLLING FOR AGE

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Alt. Hypothesis           DF    Value    Prob
Nonzero Correlation        1    3.239    0.072
Row Mean Scores Differ     1    3.239    0.072
General Association        1    3.239    0.072

Estimates of Common Relative Risk (Row1/Row2)
Type of Study           Method             Estimate     95% Conf Bounds
Case-Control            Mantel-Haenszel     1.357       0.973   1.892
  (Odds Ratio)          Logit               1.357       0.971   1.894
Cohort (Col1 Risk)      Mantel-Haenszel     1.047       0.996   1.101
                        Logit               1.034       0.998   1.072
Cohort (Col2 Risk)      Mantel-Haenszel     0.864       0.738   1.013
                        Logit               0.953       0.849   1.071

Confidence bounds for M-H estimates are test-based.

Breslow-Day Test for Homogeneity of the Odds Ratios: Chi-Square = 0.950, DF = 2, Prob = 0.622

Total Sample Size = 1314

See ... for the SAS 'trick' to produce tables in an orientation that gives the ratios of interest (use PROC FORMAT to associate other values with each actual value; then use the ORDER=FORMATTED option in PROC FREQ).


Via Stata:

clear
input str5 age i_smoke i_dead number
18_44 1 1 19
18_44 1 0 269
18_44 0 1 13
18_44 0 0 327
44_64 1 1 78
44_64 1 0 167
44_64 0 1 52
44_64 0 0 147
64_ 1 1 42
64_ 1 0 7
64_ 0 1 165
64_ 0 0 28
end

cc i_dead i_smoke [freq=number], by(age)

   age |   OR     [95% CI]      M-H Weight
-------+----------------------------------------------
 18_44 |  1.78   .87   3.61        5.57    (Cornfield)
 44_64 |  1.32   .87   1.99       19.56    (Cornfield)
   64_ |  1.02   .42   2.43        4.77    (Cornfield)
-------+----------------------------------------------
 Crude |   .68   .53    .88                (Cornfield)
 M-H combined | 1.36   .97   1.90
------------------+-----------------------------------
Test of homogeneity (M-H)   chi2(2) = 0.95   Pr>chi2 = 0.6234
Test that combined OR = 1:  Mantel-Haenszel chi2(1) = 3.24   Pr>chi2 = 0.0719

Also available...
cc i_dead i_smoke [freq=number], by(age) woolf
cc i_dead i_smoke [freq=number], by(age) tb      * tb = "test-based"

[The Robins-Breslow-Greenland SE for ln orMH is not programmed here; see the worked example in the spreadsheet (under Resources, Ch 9).]

Aggregating Odds Ratios (OR's) ... Woolf's Method

Recall, for data from a single 2x2 table:  or = (a·d)/(b·c),  and  SE[ ln(or) ] = √( 1/a + 1/b + 1/c + 1/d ).

For data from several (K) 2x2 tables (Σ: summation over strata):

ln(orWoolf) = Σ wk × ln(ork) / Σ wk   (a weighted average),  with wk = 1 / Var[ ln(ork) ]   (weight ∝ 1/variance; note Var = SE²)

SE[ ln(orWoolf) ] = 1/√(Σ wk) = √( Var*/K )   [see derivation #]   (Var*: the harmonic mean of the K Var's)

CI[ OR ] = exp{ CI[ ln(OR) ] }

# Derivation: Var[ Σ{w × ln or} / Σw ] = (1/Σw)² × Σ{w² × Var[ln or]} = (1/Σw)² × Σ{w} = 1/Σw   [since Var[ln or] = 1/w]

References: A&B Ch 4.8 and 16; Schlesselman; KKM; Rothman...

Summary Risk Ratio and Summary Rate Ratio: see Rothman pp 147- (Risk Ratio) and pp 153- (Rate Ratio).


Berkeley Data: M:F comparative parameters -- Odds Ratio (OR), Risk Ratio (RR) and Risk Difference (R∆)

(Using KKM Table 17.16 notation)        E     Ē
                                  D     a     b   | m1
                                  D̄     c     d   | m0
                                        n1    n0  | n

For each faculty: p(M) = a/n1, p(W) = b/n0; R∆ = p(M) – p(W); or = (a·d)/(b·c); rr = (a·n0)/(b·n1); var(R∆)*; w = 1/var.

Fac.  Admitted?   Men  Women | All    p(M)  p(W)   R∆      or    a·d/n  b·c/n    rr   a·n0/n  b·n1/n  var(R∆)*     w     w·R∆
A     Y           512    89  |  601   0.62  0.82  –0.20   0.35   10.4   29.9    0.75   59.3    78.7   1.63E-3    614    –125
      N           313    19  |  332
      All         825   108  |  933
B     Y           353    17  |  370   0.63  0.68  –0.05   0.80    4.8    6.0    0.93   15.1    16.3   9.12E-3    110      –5
      N           207     8  |  215
      All         560    25  |  585
C     Y           120   202  |  322   0.37  0.34  +0.03   1.13   51.1   45.1    1.08   77.5    71.5   1.10E-3    913      26
      N           205   391  |  596
      All         325   593  |  918
D     Y           138   131  |  269   0.33  0.35  –0.02   0.92   42.5   46.1    0.95   65.3    69.0   1.14E-3    879     –16
      N           279   244  |  523
      All         417   375  |  792
E     Y            53    94  |  147   0.28  0.24  +0.04   1.22   27.1   22.2    1.16   35.7    30.7   1.51E-3    661      25
      N           138   299  |  437
      All         191   393  |  584
F     Y            22    24  |   46   0.06  0.07  –0.01   0.83    9.8   11.8    0.84   10.5    12.5   3.41E-4   2935     –33
      N           351   317  |  668
      All         373   341  |  714
All   Y          1198   557  | 1755   0.44  0.30  +0.14   1.84                  1.47
      N          1493  1278  | 2771
      All        2691  1835  | 4526
                                      Σ:                        145.8  161.1          263.4   278.7             6113    –129

ORMH = 145.8/161.1 = 0.91      RRMH = 263.4/278.7 = 0.94      R∆w = Σw·R∆ / Σw = –129/6113 = –0.02

* var(R∆) = sum of the 2 binomial variances
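The three summary estimates can be reproduced in Python (editorial illustration):

faculties = {"A": (512, 89, 313, 19),   "B": (353, 17, 207, 8),      # (a, b, c, d) =
             "C": (120, 202, 205, 391), "D": (138, 131, 279, 244),   # (admitted M, admitted W,
             "E": (53, 94, 138, 299),   "F": (22, 24, 351, 317)}     #  rejected M, rejected W)

or_num = or_den = rr_num = rr_den = wsum = wrd = 0.0
for a, b, c, d in faculties.values():
    n1, n0 = a + c, b + d                     # men, women applying
    N = n1 + n0
    or_num += a * d / N; or_den += b * c / N
    rr_num += a * n0 / N; rr_den += b * n1 / N
    p1, p0 = a / n1, b / n0
    wt = 1 / (p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)   # 1 / var(R-delta)
    wsum += wt; wrd += wt * (p1 - p0)

print(round(or_num / or_den, 2))   # ORMH ~ 0.90 (0.91 from the table's 1-decimal sums)
print(round(rr_num / rr_den, 2))   # RRMH ~ 0.94
print(round(wrd / wsum, 3))        # weighted risk difference ~ -0.021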


Test of equal M:F admission rates; Confidence Intervals for ORMH (Berkeley data, KKM and A&B notation; cf. Rothman2002, Table 8.4, p152)

CI for ORMH [notation from A&B p461] -- method of Robins, Breslow & Greenland 1986*

For each faculty compute E[a|H0] = n1·m1/n and Var[a|H0] = n1·n0·m1·m0 / {n²(n–1)}, together with P = (a+d)/n, Q = (b+c)/n, R = a·d/n, S = b·c/n:

Faculty   E[a|H0]   Var[a|H0]     P      Q      R       S      P·R    P·S    Q·R    Q·S
A          531.4      21.9       0.57   0.43   10.4    29.9     5.9   17.0    4.5   12.9
B          354.2       5.6       0.62   0.38    4.8     6.0     3.0    3.7    1.8    2.3
C          114.0      47.9       0.56   0.44   51.1    45.1    28.5   25.1   22.7   20.0
D          141.6      44.3       0.48   0.52   42.5    46.1    20.5   22.3   22.0   23.9
E           48.1      24.3       0.60   0.40   27.1    22.2    16.4   13.4   10.8    8.8
F           24.0      10.8       0.47   0.53    9.8    11.8     4.6    5.6    5.1    6.2
Σ         1213.4     154.7                    145.8   161.1    78.9   87.1   66.9   74.1
                                              (R+)    (S+)

TEST of M = F:

X²MH = { Σa – ΣE[a|H0] }² / ΣVar[a|H0] = {1198 – 1213.4}²/154.7 = 1.52

This MH X² of 1.52 is "NS" in the χ² 1 df distribution.

* Var[ ln ORMH ] = ΣP·R / (2·R+²)  +  Σ[P·S + Q·R] / (2·R+·S+)  +  ΣQ·S / (2·S+²)

= 78.9/(2 × 145.8²) + (87.1 + 66.9)/(2 × 145.8 × 161.1) + 74.1/(2 × 161.1²) = 0.0066

ln ORMH = ln 0.91 = –0.10 ;  SE[ ln ORMH ] = √0.0066 = 0.08

CI[ ln ORMH ] = –0.10 ± z × 0.08 = –0.26 to 0.06 (95%)

CI[ ORMH ] = exp[–0.26] to exp[0.06] = 0.77 to 1.06

[Rothman2002, p152 uses different notation: A&B {R, S, P, Q} -> Rothman {G, H, P, Q}]

CI for ORMH, "test-based" (Miettinen 1976):

Chi-MH = | ln orMH | / SE[ ln orMH ]   ===>   SE[ ln orMH ] = | ln orMH | / Chi-MH   { 0.10/√1.52 = 0.08 }

CI[ ln ORMH ] = ln orMH ± z × SE[ ln orMH ] ;  CI[ ORMH ] = exp[ CI for ln ] = orMH^(1 ± z/Chi-MH) = orMH^(1 ± 1.96/1.23) in our example

CI for R∆ (continued from the previous page; see Rothman2002, p162): SE[ R∆w ] = 1/√Σw = 1/√6113 = 0.013 ;  CI[ R∆w ] = –0.02 ± z × 0.013


Inference from 2 way Tables M&M §9 REVISED Dec 2003 Analysis of 2 x 2 tables ... chi-square test

Recall some earlier examples, and -- more generally -- the sampling models that generate them.

More generally: cross-classification of a single sample of size n with respect to two characteristics, say A and B ("second" model in M&M p 641) -- a test of the independence of two characteristics:

          B1     B2     Total
A1       n11    n12      nA1
A2       n21    n22      nA2
Total    nB1    nB2       n

e.g. Montreal Metropolitan Population by knowledge of the official languages. Data collected by Statistics Canada at the 1996 census [numbers rounded, so subtotals do not sum exactly to the total]:

                          B: English?
A: Français?         Yes            No           Total
Oui               1,634,785      1,309,150     2,943,935
Non                 280,205         63,500       343,705
Total             1,914,990      1,372,650     3,287,645

Cohort study: fixed/variable follow-up; person (P) or P-time denominators (cross-sectional study: the document states rather than events):

                   event         non-event     Total Persons       or Total Person-Time
                 (or state)      (or state)    D = Denominator        D = Denominator
"exposed" (1)       c1                              D1                      D1
not exposed (0)     c0                              D0                      D0
                     c                               D                       D
             (c = "cases", the numerator)

e.g. Stroke Unit vs. Medical Unit for Acute Stroke in the elderly? Patient status at hospital discharge (BMJ 27 Sept 1980):

               indept.      dependent     total no. pts
Stroke Unit       67            34             101
Medical Unit      46            45              91
             -------------------------------------------
total            113 (58.9%)    79             192

Case-control study: person- or person-time "quasi-denominators":

                  "c = cases"     quasi-denominators,     or quasi-denominators,
                   numerator           persons                 Person-Time
"exposed" (1)         c1                 d1                       d1
not exposed (0)       c0                 d0                       d0
                                  (d = a sample of D)      (d = a sample of D)

e.g. Bone mineral density and body composition in boys with distal forearm fractures (J Pediatr 2001 Oct;139(4):509-15):

                        Fracture?
                     Yes        No
Overweight?  Yes      36        14
             No       64        86
                     100       100

e.g. "Pour battre Roy, mieux vaut lancer bas" ["To beat Roy, better to shoot low"] (LA PRESSE, Montreal, Thursday 21 April 1994 ... cf. Course 626). Over the twenty playoff games last year, the Canadiens gave up 51 goals... Of the 51 goals allowed by the best goaltender in the world, the numbers entering the high, middle and low parts of the net were:

High     10   (20%)
Middle    5   (10%)
Low      36   (70%)
         51  (100%)


e.g. languages in Montreal: statistics from a 2 x 2 table via SAS PROC FREQ

Create the SAS file via the Program Editor (could also type directly into INSIGHT):

    options ls = 75 ps = 50; run;

    data sasuser.lang_mtl;
      input Francais $ English $ number;   /* one line per cell ("number")       */
                                           /* instead of one line per individual */
                                           /* $ after a name: character variable */
    lines;
    Oui Yes 1634785
    Oui No  1309150
    Non Yes  280205
    Non No    63500
    ;
    run;

    proc freq data=sasuser.lang_mtl;
      tables Francais * English / all cellchi2 expected;   /* turn on all output */
      weight number;           /* use weight to indicate "multiples" */
    run;

Output. In each cell, PROC FREQ prints: (Observed) Frequency ... "obs" for short; Expected (under H0: independence) ... "exp" for short; (Cell Chi-Square); Percent; Row Pct; Col Pct:

    TABLE OF FRANCAIS BY ENGLISH

    FRANCAIS        ENGLISH
                   |No      |Yes     |  Total
    ---------------+--------+--------+
    Non            |  63500 | 280205 |  343705
                   | 143503 | 200202 |
                   |  44602 |  31970 |
                   |   1.93 |   8.52 |   10.45
                   |  18.48 |  81.52 |
                   |   4.63 |  14.63 |
    ---------------+--------+--------+
    Oui            |1309150 |1634785 | 2943935
                   |1229147 |1714788 |
                   | 5207.3 | 3732.5 |
                   |  39.82 |  49.73 |   89.55
                   |  44.47 |  55.53 |
                   |  95.37 |  85.37 |
    ---------------+--------+--------+
    Total           1372650  1914990  3287640
                      41.75    58.25   100.00

Then, via SAS INSIGHT: [Mosaic plot of FRANCAIS (Non/Oui) by ENGLISH (No/Yes)].

Expected frequency:

    143503 = 10.45% of 1372650 = (343705 × 1372650) / 3287640
           = (RowTotal × ColTotal) / OverallTotal

Cell chi-square:

    44602 = (observed freq. – expected freq.)² / expected freq.
          = (63500 – 143503)² / 143503

    Statistic      DF                  Value       Prob*
    ------------------------------------------------------------------
    Chi-Square     (2-1) × (2-1) = 1   85511.881   0.001

    X² = Σ (obs – exp)² / exp

DF = "degrees of freedom"; Σ is over all 4 cells.

* Clearly, p-values are not relevant here.
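A minimal Python sketch (not from the notes; scipy assumed) that reproduces the expected frequencies, cell chi-squares, and overall X² that PROC FREQ prints:

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: Francais Non, Oui; columns: English No, Yes
    obs = np.array([[  63500,  280205],
                    [1309150, 1634785]])

    x2, p, df, expected = chi2_contingency(obs, correction=False)
    cell_chi2 = (obs - expected)**2 / expected
    print(expected)    # expected[0,0] = 343705*1372650/3287640 = 143503
    print(cell_chi2)   # cell_chi2[0,0] = (63500-143503)^2/143503 = 44602
    print(x2, df)      # X2 = 85512 on (2-1)(2-1) = 1 df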


e.g. Stroke Unit vs. Medical Unit for Acute Stroke in the elderly? Patient status at hospital discharge (BMJ 27 Sept 1980). Observed:

                    independent      dependent    total no. pts
    Stroke Unit       67 (66.3%)        34            101
    Medical Unit      46 (50.5%)        45             91
    total            113 (58.9%)        79            192

SAS program:

    data sasuser.str_unit;
      input Unit $ Status $ number;
    lines;
    Stroke  Indep 67
    Stroke  Dep   34
    Medical Indep 46
    Medical Dep   45
    ;
    run;

    proc freq data=sasuser.str_unit;
      weight number;
      tables Unit * Status / chisq cmh relrisk riskdiff nopercent nocol;
    run;

"Expected" numbers [in bold in the original] under H0: % discharged "independent" unaffected by type of unit:

                    independent      dependent    total no. pts
    Stroke Unit      59.4 (58.9%)      41.6           101
    Medical Unit     53.6 (58.9%)      37.4            91
                    113  (58.9%)       79             192

    ¶ X² = {67 – 59.4}²/59.4 + {34 – 41.6}²/41.6 + {46 – 53.6}²/53.6 + {45 – 37.4}²/37.4
         = 4.9268   [to be referred to the χ²(1 df) distribution]

SAS output:

    UNIT        STATUS
    Frequency |
    Row Pct   |Dep     |Indep   | Total
    Medical   |  45    |  46    |  91      <- INDEX category (see NOTE)
              | 49.45  | 50.55  |
    Stroke    |  34    |  67    | 101      <- REFERENCE category
              | 33.66  | 66.34  |
    Total        79      113     192

    Statistic                      DF    Value    Prob
    ------------------------------------------------------
    Chi-Square                      1    4.927    0.026
    Continuity Adj. Chi-Square      1    4.296    0.038
    Mantel-Haenszel Chi-Square      1    4.901    0.027
    Fisher's Exact Test    Left: 0.991   Right: 0.019   2-Tail: 0.029

Generic formula for X²:

    X² = Σ { observed – expected }² / expected   [over all cells!]

The expected number in the cell in row i and column j is:

    (total of row i • total of column j) / overall total

    Column 2 Risk Estimates
                     Risk    ASE     95% Conf Bounds    95% Conf Bounds
                                     (Asymptotic)       (Exact)
    Row 1            0.505   0.052   0.403    0.608     0.399    0.612
    Row 2            0.663   0.047   0.571    0.756     0.562    0.754
    Total            0.589   0.036   0.519    0.658     0.515    0.659
    Row 1 - Row 2   -0.158   0.070  -0.296   -0.020

Continuity-corrected X²†:

    X²c = Σ { | observed – expected | – 0.5 }² / expected

NOTE: FREQ uses the upper row as the INDEX category and the lower row as the REFERENCE category.

    Estimates of the Relative Risk (Row1/Row2)
    Type of Study         Value                  95% Confidence Bounds
    Cohort (Col2 Risk)    0.762 (50.55/66.34)    0.596    0.975

¶ Use X² to refer to the calculated statistic in a sample, χ² for the distribution.

† (Yates') continuity correction is used to reflect the fact that the binomial counts are discrete and that their probabilities are being approximated by intervals (count – 0.5, count + 0.5). The uncorrected X² is overly liberal, i.e. it produces a distribution of discrepancies that is larger than the tabulated χ² distribution ... hence the reduction of each absolute deviation | observed – expected | by 0.5.

cf. the z-test for a difference of 2 proportions ... in Chapter 8:

    z = (0.6634 – 0.5054) / √[ 0.5885 • 0.4115 • (1/101 + 1/91) ] = 0.1580/0.0711 = 2.22

    P = Prob[ |Z| ≥ 2.22 ] = 0.026 (2-sided)

    z² = 2.22² = 4.93   (same as X² with 1 degree of freedom!)
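A minimal Python sketch (not in the notes; scipy assumed) verifying the z² = X² equivalence numerically on the stroke data:

    import numpy as np
    from scipy.stats import norm

    p1, n1 = 67/101, 101      # stroke unit, proportion independent
    p2, n2 = 46/91,  91       # medical unit
    p = (67 + 46) / (n1 + n2)                     # pooled proportion, 0.5885
    z = (p1 - p2) / np.sqrt(p*(1-p)*(1/n1 + 1/n2))
    print(z, z**2)                                # 2.22 and 4.93 = uncorrected X2
    print(2 * norm.sf(abs(z)))                    # 2-sided P = 0.026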


OTHER EXAMPLES

Note: In the following examples, in order to keep the formulae uncluttered, I have not shown the continuity correction. In some examples, as in the one above, the sample sizes are large and so the continuity correction makes only a small change. In some others, as in the milk immunoglobulin example below, it makes a big difference. However, don't do as I do; do as I say -- use the continuity correction routinely. That way, editors and referees won't accuse you of trying to make your p-values more impressive by not using the correction.

Protection by milk immunoglobulin concentrate against oral challenge with enterotoxigenic E. coli (NEJM May 12, 1988, p1240):

                                     developed    did        Total
                                     diarrhea*    not*       subjects
    received milk immunoglobulin      0 (4.5)     10 (5.5)      10
    concentrate
    received control concentrate      9 (4.5)      1 (5.5)      10
    All                               9 (45%)     11            20

* the numbers 4.5, 5.5, 4.5 and 5.5 in parentheses are the "Expected" numbers, calculated on the (null) hypothesis H0: rate not changed by immunoglobulin concentrate.

    x² = {0 – 4.5}²/4.5 + {9 – 4.5}²/4.5 + {10 – 5.5}²/5.5 + {1 – 5.5}²/5.5
       = {4.5}² ( 1/4.5 + 1/4.5 + 1/5.5 + 1/5.5 ) = 16.4

    Continuity-corrected x² = { |0 – 4.5| – 0.5 }²/4.5 + ...
       = {4.0}² ( 1/4.5 + 1/4.5 + 1/5.5 + 1/5.5 ) = 12.9

Do infant formula samples shorten the duration of breast-feeding? Bergevin Y, Dougherty C, Kramer MS. Lancet. 1983 May 21;1(8334):1148-51. % still breastfeeding at 1 month in an RCT which withheld free formula samples [normally given by baby-food companies to mothers leaving hospital with their infants] from a random half of those studied:

                     breastfeeding     not breastfeeding    Total mothers
    given sample      175 (77%)            52                   227
    no sample         182 (84%)            35                   217
                      357 (80.4%)          87                   444

"Expected" numbers under H0: rate not changed by giving samples:

                     breastfeeding     not breastfeeding    Total mothers
    given sample      182.5 (80.4%)        44.5                 227
    no sample         174.5 (80.4%)        42.5                 217
                      357   (80.4%)        87                   444

    X² = {175 – 182.5}²/182.5 + {182 – 174.5}²/174.5
       + {52 – 44.5}²/44.5 + {35 – 42.5}²/42.5
       = 3.22   ["NS" at the 0.05 level, even with the uncorrected X²]

    X² = 3.22 <--> |Z| = √3.22 = 1.79 ;  Prob( |Z| > 1.79 ) = 2 × 0.0367 = 0.0734

Attack rates of ophthalmia neonatorum among exposed newborns receiving silver nitrate or tetracycline (NEJM March 11, 1998, 653-7):

                                   Silver Nitrate    Tetracycline
    exposed to N. gonorrhœae          5 / 71            2 / 66
      attack rate                     (7%)              (3%)
    exposed to C. trachomatis        10 / 99            8 / 111
      attack rate                    (10%)              (7%)


Inference from 2 way Tables M&M §9 Notes on X2 tests and on analysis of binary data in general

• If you use x², it must be based on counts, not on %'s.

• A short-cut method of calculation for the 2x2 table, with 'generic' entries a, b, c, d, row totals r1, r2, column totals c1, c2, and overall total N, is (with the stroke data as e.g.):

[Figure: density of Z ~ N(0,1) and of Z² ~ Chi-Square(df=1); Z values of ±0.5, ±1, ±1.5, ±2, ±2.5 map to Z² values 0.25, 1, 2.25, 4, 6.25.]

    x² = N { a•d – b•c }² / ( r1•r2•c1•c2 )
       = 192 { 67•45 – 34•46 }² / ( 101•91•113•79 ) = 4.93

For the continuity-corrected version, the shortcut formula is:

    x²c = N { | a•d – b•c | – N/2 }² / ( r1•r2•c1•c2 )
        = 192 { |67•45 – 34•46| – 96 }² / ( 101•91•113•79 ) = 4.30

where | x | means 'the absolute value of x'.

The formulae involve the crossproducts a•d and b•c. If their ratio (the empirical odds ratio) is 1, their difference is zero. The direction of the difference in proportions is given by the sign of ad – bc.

These formulae avoid fractions -- one doesn't see expectations or deviations, or the magnitude of the difference. Presumably for this reason, some books, such as Norman and Streiner's Statistics: the Bare Essentials, classify x² as "non-parametric" or "distribution-free", and so put it in the non-parametric chapter. After their first edition, I pointed out to them that the above use of the chi-square test is as a test of the difference in two binomial proportions ... how much more parametric or distribution-specific can that be? Look up the index in the latest edition to see if my arguments convinced them.
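A minimal Python sketch (not in the notes) checking the shortcut formulas against the 4-cell observed/expected version for the stroke-unit table:

    a, b, c, d = 67, 34, 46, 45
    N = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d

    x2_short = N * (a*d - b*c)**2 / (r1*r2*c1*c2)              # 4.93
    x2_corr  = N * (abs(a*d - b*c) - N/2)**2 / (r1*r2*c1*c2)   # 4.30

    # the long way: sum of (obs - exp)^2 / exp over all 4 cells
    exp = [r1*c1/N, r1*c2/N, r2*c1/N, r2*c2/N]
    obs = [a, b, c, d]
    x2_long = sum((o - e)**2 / e for o, e in zip(obs, exp))
    print(x2_short, x2_long, x2_corr)    # 4.93  4.93  4.30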

• The uncorrected version of the 2-sided z-test for comparing two proportions gives the same p-value as the uncorrected version of the x² test. One can check that Z² = x². Likewise, the corrected version of the 2-sided z-test for comparing 2 proportions gives the same p-value as the corrected version of the x² test.

The chi-square random variable (r.v.) is the square of the N(0,1) r.v. The very high "probability density", and the rapid change in this density, just to the right of Z² = 0 (cf. diagram) is a result of the Z -> Z² transformation. For example, the 3.98% of the "probability mass" between Z=0 and Z=0.1 is transferred to the small interval 0² to 0.1², or 0 to 0.01, a width of 0.01 (an identical amount gets transferred from the Z interval -0.1 to 0). The 3.94% of the "probability mass" between Z=0.1 and Z=0.2 is transferred to the small interval 0.1² to 0.2², or 0.01 to 0.04, a width of 0.03. There is an identical transfer from the Z interval -0.2 to -0.1, for an "average" chi-square density of approx. (2 x 0.0394)/0.03 = 2.6. You can track the 'block transfers' using the different shades.

• By construction, the x² is a 2-sided test, unless one uses X = √x² and refers it to the z-table.

• There are other chi-square distributions (with df > 1 ). See later.


• If you use x², it must be based on all cells, not just on numerators -- unless the more common type of outcome is so much more common that the contribution of those cells is negligible.

Mantel-Haenszel Test Statistic for a single 2 x 2 table

Preamble

• The x² test is a large-sample test, i.e. it is always an approximation. Since x²(1 df) is just Z², one is no more exact than the other.

Some of the paragraphs on the next page more appropriately belong in the notes for Ch 8.2, where Fisher's exact test was introduced, but the reasons become important when we come to using the above formulation of the x² test for a single table to combine evidence over several (possibly sparse) 2x2 tables. Cochran first proposed combining evidence from 2 x 2 tables in 1954. His aim was to combine a small number of 'large' tables, and he did not anticipate that this technique could also be used to combine a large number of quite 'small' 2 x 2 tables (each one with quite sparse information), with the combination of data from n matched pairs as the limiting case. Thus, he was a wee bit careless about variances. It took the now famous Mantel-Haenszel paper of 1959 to make a variance correction that for 'large' tables was trivial, but for matched-pair tables was critical.

• In t-tests, n = 30 is often considered 'large enough' for large-sample procedures -- it depends on the skewness of the data & whether the Central Limit Theorem will "Gaussianize" the distribution of the statistic. For 0/1 data, such as those above, the 'real' or 'effective' sample sizes are not the denominators 71 and 66, but rather the numerators 5 and 2 [e.g. changing the 2 to a 1 would make the ratio of attack rates appear twice as good]. It doesn't matter if the 5 and 2 came from n's of 710 and 660, or 7100 and 6600. The 'effective sample size' for binary data is the number of subjects having the less common outcome.

• The "guidelines" (such as they are) about when it is appropriate to usex2 are based on "Expected" numbers not on the observednumbers. One quoted rule [often used by computer programs togenerate warning messages ] is that the expected numbers in most ofthe cells should exceed 5 for the χ2 to be accurate. Thus, the 2x2 tableon the left below will generate a 'warning' ; that on the right will not.

5 2 1 1166 64 11 1

SAS and others rightly acknowledge Cochran's role in the test statistic, calling it the 'Cochran-Mantel-Haenszel' or 'CMH' statistic. (Indeed, 'CMH' is the option one uses with PROC FREQ to obtain the summary measures and the overall test statistic.) This formulation is also the one most commonly used for the log-rank test used to compare two survival curves.

• The regular uncorrected x² test statistic for a single 2x2 table can be written in a seemingly very different format, as

    x² = { a – E[a | H0] }² / Variance[ a | H0 ]
       = { a – E[a | H0] }² / ( r1•r2•c1•c2 / N³ )

The Variance in the denominator of this statistic can be viewed as arising from a statistical model in which the 2 compared proportions are separate independent random variables, i.e. the 'unconditional' or '2-independent-binomials' model.

Just like the formula with 4 O's and 4 E's, this format is not as calculator-friendly as the shortcut (integer-only) one. But ... cf. Mantel-Haenszel.

Most consider that the biggest legacy of the "M-H" paper is the Mantel-Haenszel summary measure (point estimate) of the Odds Ratio. We will come back a little later to this issue of combining data from 2 x 2 tables.


Mantel-Haenszel Test Statistic for a single 2 x 2 table

Preamble: Conditional vs. Unconditional? continued... In the separate binomials model, the only marginal totals that are fixed ahead of time are the two sample sizes. In most instances, this model reflects reality. The only exception I know of is the design exemplified by the psychophysics study of the lady tasting tea. If she is told that there are 4 cups where the tea is poured first, and 4 where it is poured second, then she will arrange her responses so that there are 4 of each. Thus, in this instance, both the row totals and the column totals are "fixed" ahead of time, and so it makes sense that the (frequentist) inference be limited to the (only) 5 possible data tables that have all margins fixed.

This is the statistical model behind Fisher's exact test, and indeed Fisher used the tea-tasting example to explain it. But this test is now used for data situations where one cannot -- at least ahead of time -- consider both sets of marginal totals fixed. For example, in the food sensitivity study in the ch 8.2 notes, from the answers given, it appears that the subjects were not told that there were three injections of extract and nine of diluent, but the authors used the conditional test anyway.

In fact there are many ways to circumvent these objections without having to 'condition' on all margins, and there is still considerable debate, much of it philosophical, on this 100 years after analyses of 2x2 data were first introduced. However, since we often combine information from data arranged as matched pairs or 'finely stratified' strata, we do need to consider this one setting where conditioning is the 'right thing to do'. In the example here, there will be only 1 large table, so the difference will not be important. But when we come to matched pairs, the implications are large.

Many of the reasons put forward for using the conditional test based on all margins fixed (i.e. the hypergeometric model, with only one random variable) involve practicality rather than adherence to a coherent set of inferential principles. They mostly have to do with one of the following 'supposed' difficulties: (a) using the normal approximation when the expected numbers are low; (b) the fact that there are two parameters, but one is only interested in their difference, or ratio, or odds ratio, and so the 'remaining' parameter is just a 'nuisance'; (c) how to order or rank the tables by their degree of evidence against H0 -- for example, in a 2x2 table with n0=23 and n1=24 (as in the bromocryptine and infertility study), there are theoretically 24 x 25 = 600 possible tables; however, if one -- after the fact -- restricts the analysis to only those tables where the total number of "successes" is 12 (12 pregnancies), then there are only 13 possible tables (see notes and Excel spreadsheet for Fisher's exact test), and, by reducing the problem from a 2-dimensional one to a 1-dimensional one, it also becomes possible to more easily rank the tables by their degree of evidence against H0, something that is supposedly more difficult when the tables are simultaneously arrayed along both dimensions; and (d) a fourth reason, which I will illustrate with the Marvin Zelen "Marbles in the Folger's Coffee Can" model, is that, after the fact, it is much easier to empirically -- and heuristically -- demonstrate a low p-value using the single-random-variable, conditional (hypergeometric) model than it is with the '2-separate-binomials' model.


Mantel-Haenszel Test Statistic for a single 2 x 2 table

Details

In the conditional model, with both margins fixed, there is only one cell entry that can vary independently. Without loss of generality, we focus on the frequency in the 'a' cell. Then, under the null hypothesis,

    a ~ Hypergeometric[parameters given by the marginal totals]

i.e. by r1 = Row1Total, r2, c1 = Col1Total, c2, and N = OverallTotal. Thus

    E[ a | H0 ] = r1 × c1 / N

    Variance(condn'l)[ a | H0 ] = { r1•r2•c1•c2 } / { N²(N–1) }

Example

In our stroke vs. medical unit example above, the marginal totals were r1=101, r2=91, c1=113, and c2=79, so N=192. These yield the "excess in the a cell" of

    67 – (101•113)/192 = 67 – 59.44 = 7.56

and conditional variance

    { 101•91•113•79 } / { 192²(191) } = 11.6528

giving

    X²MH = 7.56² / 11.6528 = 4.901,

in agreement with the printout from PROC FREQ in SAS.

Under the null, the expectation is the same under the conditional model as under the unconditional one. Note however the difference in the variance: under the conditional model it uses N²(N–1) rather than N³, reflecting the different pattern of variation in the frequency in the 'a' (and consequently in the other 3) cell(s) if all margins are fixed (vs. what would happen if the lady were not told "4 1st; 4 2nd").

Note:

The MH test does not use a continuity correction with {a – E[a]}. Part of the justification for this is that when the point estimate of the odds ratio falls at the null, i.e. a•d = b•c, so that E[a | H0] = a, it would be good if the test statistic also had a value of zero. A continuity correction would force the test statistic to have a positive value even when the "observed a" = the "expected under the null"!

The test statistic using this conditional variance can be computed as a Z statistic,

    Z = X = "chi" = ( a – E[ a | H0 ] ) / SD(condn'l)[ a | H0 ] ,

which has the same form as the critical ratios used in the z-tests for proportions or means, or as the more traditional square

    X²MH = { a – E[ a | H0 ] }² / Variance(condn'l)[ a | H0 ] .
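A minimal Python sketch (not in the notes; scipy assumed) obtaining the conditional mean and variance directly from the hypergeometric distribution:

    from scipy.stats import hypergeom

    a = 67
    r1, r2, c1, c2 = 101, 91, 113, 79
    N = r1 + r2

    # a ~ Hypergeometric: N items in total, c1 'successes', r1 draws
    h = hypergeom(N, c1, r1)
    E, V = h.mean(), h.var()
    print(E, V)               # 59.44 and 11.6528, as in the example
    print((a - E)**2 / V)     # X2_MH = 4.901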


Inference from 2 way Tables M&M §9 2x2, 2x1, 1x2 and 1x1 Tables†

2x2: samples reasonably equal in size, two types of outcome common; e.g. outcomes in the trial of stroke vs. medical unit.

                    BAD         GOOD        Total persons or
                    OUTCOME     OUTCOME     Total Person-time
    sample 1        bad1        good1       n1
    sample 2        bad2        good2       n2
                    bad         good        n

"Expected" numbers of outcomes under H0: rates not different (split the events across the 2 samples in the ratio n1 : n2):

    x² = {bad1 – exp[bad1]}²/exp[bad1] + {good1 – exp[good1]}²/exp[good1]
       + {bad2 – exp[bad2]}²/exp[bad2] + {good2 – exp[good2]}²/exp[good2]

1x2: 1 sample; two types of outcome common; e.g. male and female births with specific timing of conception.

                    "LESS GOOD"    GOOD       Total persons or
                    OUTCOME        OUTCOME    Total Person-time
    sample          less_good      good       n

"Expected" numbers of outcomes under H0: rate not different from an EXTERNAL rate (use the EXTERNAL rate, based on a LARGE amount of data (e.g. national rates), to calculate the expected split of events). If one uses an internal comparison, then we have the full 2 x 2 table.

    x² = {less_good – exp[less_good]}²/exp[less_good] + {good – exp[good]}²/exp[good]

2x1: samples large and reasonably equal in size, BAD outcome uncommon; e.g. leukemias and breast cancers.

                    BAD         GOOD        Total persons or
                    OUTCOME     OUTCOME     Total Person-time
    sample 1        bad1        MOST        n1
    sample 2        bad2        MOST        n2
                    bad         MOST        n

"Expected" numbers of outcomes under H0: rates not different (only need the ratio n1 : n2 to get the expected split of the BAD events):

    x² ≈ {bad1 – exp[bad1]}²/exp[bad1] + minimal contribution
       + {bad2 – exp[bad2]}²/exp[bad2] + minimal contribution

1x1: 1 large sample, BAD outcome uncommon; e.g. 78 cancers observed in the Alberta study, 83.5 expected.

                    BAD         GOOD        Total persons or
                    OUTCOME     OUTCOME     Total Person-time
    sample          bad         MOST        n

"Expected" number of outcomes under H0: rate not different from an EXTERNAL rate (use the EXTERNAL rate to calculate the expected number of BAD events):

    x² ≈ {bad – exp[bad]}²/exp[bad]   ( + minimal contribution )

This x² = {observed – expected}²/expected is equivalent to the large-sample approximation to the Poisson distribution [A&B §4.10], i.e.

    z = (observed – expected) / √expected ,  so that  z² = {observed – expected}²/expected = x²

[see A&B §4.10; WE WILL REVISIT THIS 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING EFFECT MEASURES for INCIDENCE RATES]

† This terminology is my own: don't try it out on an editor!


e.g. Development of leukemia during a 6-year period following drug rx for cancer:

                   leukemia    not       Total persons
    drug rx           14       2053          2067
    no drug rx         1       1565          1566
                      15       3618          3633

"Expected" numbers of leukemia under H0: rate not increased by drug:

                   leukemia    not         Total persons
    drug rx          8.53      2058.47         2067
    no drug rx       6.47      1559.53         1566
                    15         3618            3633

    x² = {14 – 8.53}²/8.53 + {2053 – 2058.47}²/2058.47
       + {1 – 6.47}²/6.47 + {1565 – 1559.53}²/1559.53 = 8.17

e.g. Breast Cancer in women repeatedly exposed to multiple X-ray fluoroscopies (Boice and Monson, 1977):

                    cancers    Women-years (WY)
    exposed            41         28,010
    not exposed        15         19,017
                       56         47,027

"Expected" numbers of breast cancers under H0: rate not affected by X-rays:

                    cancers    Women-years (WY)
    exposed          33.4         28,010
    not exposed      22.6         19,017
                     56           47,027

    χ² = {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 + {deviation}²/28K + {deviation}²/19K
       ≈ {41 – 33.4}²/33.4 + {15 – 22.6}²/22.6 = 4.29   [3.74 with continuity correction]

Equivalent to testing whether the 56 events could split in this extreme or more extreme a way: under H0 one would expect the split to be (apart from random variation) in the ratio WY(exposed) : WY(non-exposed).

e.g. MI in the first 56 months of the US MDs' study of aspirin:

                   MI       not          Total MDs
    aspirin        104      remainder      11K
    placebo        189      remainder      11K
                   293      remainder      22K

"Expected" numbers of MI under H0: rate not affected by aspirin:

                   MI       not       Total MDs
    aspirin        146.5    rest        11K
    placebo        146.5    rest        11K
                   293      rest        22K

    x² = {104 – 146.5}²/146.5 + {189 – 146.5}²/146.5 + {42.5}²/11K + {–42.5}²/11K = 24.7

In effect, testing whether the 293 MIs could distribute this unevenly if one used a (fair) coin.

[WE WILL REVISIT THE 2x1 TABLE, AND THE 1x1 TABLE, WHEN COMPUTING EFFECT MEASURES for INCIDENCE RATES]
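A minimal Python sketch (not in the notes; numpy assumed) reproducing the 2x1 calculation above; the Alberta 1x1 z is computed here for illustration:

    import numpy as np

    # 2x1: Boice & Monson breast-cancer data, person-time denominators
    obs = np.array([41, 15])
    exp = 56 * np.array([28010, 19017]) / 47027     # split 56 cases by the WY ratio
    print(((obs - exp)**2 / exp).sum())             # about 4.3 (4.29 with the rounded 33.4, 22.6)

    # 1x1: 78 cancers observed in the Alberta study, 83.5 expected
    o, e = 78, 83.5
    z = (o - e) / np.sqrt(e)                        # large-sample Poisson z
    print(z, z**2)                                  # z2 = x2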

M&M Ch 8.1 Inference on π ... page 10

Page 76: M&M Ch 6 Introduction to Inference OVERVIEW

Inference from 2 way Tables M&M §9 Comparison of Proportions --- Paired Data

(McNemar) Test of equality of proportions:
e.g. • response of the same subject in each of 2 conditions (self-paired)
• responses of a matched pair, one in 1 condition, 1 in the other
• ∆'s in paired responses on an interval scale, reduced to the sign of the ∆

-1- discard the concordant pairs (+,+) and (–,–) as being "un-informative" (this point is somewhat controversial)
-2- analyze the split of the (b+c) discordant pairs (under H0, expect 50:50)

                                   Result in Other PAIR Member
                                   Positive    Negative    Total PAIRS
    Result in One    Positive         a           b
    PAIR Member      Negative         c           d
                                                           n PAIRS

or, for the case-control version:

                                   Exposure in "Control"
                                   Positive    Negative    Total PAIRS
    Exposure in      Positive         a           b
    "Case"           Negative         c           d
                                                           n PAIRS

Example: HIV in twins in relation to order of delivery [Lancet, Dec 14 '91]. Mother -> infant transmission of HIV infection: 66 sets of twins:

                                   Result in 2nd-born Twin
                                   HIV +    HIV –    Total Sets
    Result in        HIV +          10       18
    1st-born Twin    HIV –           4       34
                                                      66 Sets

To analyze the 'split' of the discordant pairs:

(if n small)
    Binomial probabilities with "n" = b+c and π = 0.5
    (Table C or the Table for the Sign Test is helpful here)

(if n larger)
    • Z test of the observed proportion p = b/(b+c) vs π = 0.5
    • χ² test of the observed 1x2 table [ b | c ] versus the 1x2 table
      expected if H0 holds, [ (b+c)/2 | (b+c)/2 ]
    Note that the Z² and x² are equivalent.

Extreme situations (one or the other / forced choice, e.g. exercise 8.18, or who dies first among twin pairs discordant for handedness):

                          Shorter
                          Won     Lost    Total PAIRS
    Taller    Won          -       b
              Lost         c       -
                                           n PAIRS

Can also turn this table 'inside-out' and analyze using the case-control approach:

                          Loser
                          Taller    Shorter    Total PAIRS
    Winner    Taller        -         b
              Shorter       c         -
                                                n PAIRS


(McNemar) Test of equality of proportions: worked example

                                   Result in 2nd-born Twin
                                   HIV +    HIV –    Total Sets
    Result in        HIV +          10       18
    1st-born Twin    HIV –           4       34
                                                      66 Sets

Gaussian (or equivalently, Chi-square) approximation to the Binomial:

    Z = (18 – E[b]) / SD[b] = (18 – 22/2) / √(22 × 0.5 × 0.5)
      = (18 – 11) / √5.5 = 2.98 ;    Z² = 7²/5.5 = 8.91

    Prob( |Z| > 2.98 ) = 0.003 ;  from Table F, Prob[ X²(1 df) > 8.91 ] ≈ 0.003

χ² test of the observed 1x2 table [ 18 | 4 ] versus the 1x2 table expected [ 11 | 11 ] if H0 holds:

    X² = {18 – 11}²/11 + {4 – 11}²/11 = 7²/5.5 = 8.91

Analysis using the exact Binomial: Binomial probabilities with "n" = b+c = 22 discordant pairs.

Under H0 -- that order makes no difference to the likelihood of HIV transmission -- the split among these 22 should be like that obtained by tossing 22 coins, each with π(first-born is the one to have the HIV transmitted) = 0.5. The Binomial(n=22, π=0.5) distribution is not available in Table C, but can be obtained from Excel. Of interest is the sum of the probabilities for 18/22, 19/22, 20/22, 21/22 and 22/22 (1-sided), then doubled if dealing with a 2-sided alternative, i.e. 2 × (0.00174 + 0.00036 + negligible terms) = 0.004.

Notice that because of the symmetry involved in testing π = 0.5 versus a 2-sided alternative, the test statistics have a particularly simple form:

    Z² = x² = ( b – c )² / ( b + c ) ;   (18 – 4)²/22 = 8.91 in our example

With continuity correction:

    Z²c = x²c = ( |b – c| – 1 )² / ( b + c ) = 169/22 = 7.68   (2-sided P ≈ 0.005)

Q: Why a continuity correction of 1 rather than the usual 0.5?
A: The difference b – c jumps in 2's rather than 1's (e.g. if b+c = 18, then b – c = 18, 16, 14, ..., -14, -16, -18).
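A minimal Python sketch (not in the notes; scipy assumed) reproducing the McNemar calculations, including the exact binomial P-value:

    import numpy as np
    from scipy.stats import binom, norm

    b, c = 18, 4                                   # discordant pairs
    x2  = (b - c)**2 / (b + c)                     # 8.91
    x2c = (abs(b - c) - 1)**2 / (b + c)            # 7.68
    print(x2, x2c)
    print(2 * norm.sf(np.sqrt(x2)))                # approximate 2-sided P = 0.003

    # exact: P[X >= 18] for X ~ Binomial(22, 0.5), doubled
    p_exact = 2 * binom.sf(b - 1, b + c, 0.5)
    print(p_exact)                                 # 0.004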


Inference from 2 way Tables M&M §9 Inferences regarding proportions --- Summary

1 Population: π  (sample of n; p = y/n are "positive")

    CI for π
      without Gaussian approximation: nomograms / tables / spreadsheet
                                      (cf. asymmetric CI; see notes on 8.1)
      with Gaussian approximation:    p ± z √( p[1 – p]/n )

    Test π = π0
      without: Binomial distribution
      with:    z = (p – π0) / √( π0[1 – π0]/n )    {z² = x²}

2 Populations (matched samples), or 1 population under 2 conditions:
  OR = [π1/(1 – π1)] / [π2/(1 – π2)]
  (sample of n pairs: n(++) = a, n(+–) = b, n(–+) = c, n(––) = d;
   b + c = n' ;  or-hat (i.e. est. of OR) = b/c)

    CI for OR
      without: b ~ Bin( n', [OR/(1 + OR)] )
      with:    CI for OR/(1 + OR) => CI for OR

    Test π1 = π2, i.e. Test OR = OR0
      without: b ~ Bin( n', [OR0/(1 + OR0)] )
      with:    z = ( b/n' – OR0/[1 + OR0] ) / SE[ b/n' | H0 ]    or x²

2 Populations: π1 and π2  (independent samples of n1 and n2)

    CI for ∆ = π1 – π2
      without: Miettinen and Nurminen
      with:    p1 – p2 ± z √( p1[1 – p1]/n1 + p2[1 – p2]/n2 )
               (also via Binomial regression** ... RD)

    CI for RR
      with:    cf. Rothman p134; regression (RR)

    CI for OR
      without: Conditional
      with:    Condn'l [Approx.] / Woolf / Miettinen
               (or via Binomial (logistic) regression**)

    Test RR0 or OR0 or ∆0
      without: Fisher's Exact Test (cond'nl);
               Unconditional methods (Suissa and Schuster);
               Permutational (StatExact software)
      with:    z = ( [p1 – p2] – ∆0 ) / √( p[1 – p]/n1 + p[1 – p]/n2 )  (*)   {or x²}

Notes:

(*) p in the combined data = (n1 p1 + n2 p2)/(n1 + n2) = Σ numerators / Σ denominators (a weighted average of the two p's)

** Binomial Regression: extension [to come] of the 1-parameter binomial regression models described in the notes for 8.1


Inference from 2 way Tables M&M §9 Tests of Association --- Tables with r rows c columns

e.g. • independence of classification on 2 variables
• similarity of multinomial profiles

(generic) Relationship between one factor (rows) and another (columns) in n observations, cross-classified into an r(=# of rows) x c(=# of columns) table:

            Col1     Col2    ...    Colc     Total
    Row1    n11      n12     ...    n1c      Nrow1
    Row2    n21      n22     ...    n2c      Nrow2
    ...     ...      ...     ...    ...      ...
    Rowr    nr1      nr2     ...    nrc      Nrowr
    Total   Ncol1    Ncol2   ...    Ncolc    N

    χ²df = ∑ { observed – expected }² / expected

• expected number in a cell = (Nrow • Ncolumn) / N
• the summation is over all r x c cells
• degrees of freedom (df) = (r–1)(c–1). In e.g. 1 below, r=3, c=3 => df = 4.

(e.g. 1) Relationship between laterality of hand and laterality of eye (measured by astigmatism, acuity of vision, etc.) in 413 subjects cross-classified into a 3x3 table [data from Woo, Biometrika 2A 79-148]:

                    Left-eyed    Ambiocular    Right-eyed    Total
    Left handed        34            62            28          124
    Ambidextrous       27            28            20           75
    Right handed       57           105            52          214
    Total             118           195           100          413

Analyzing data from ORDERED categories

Using a chi-square test for the following 2x3 table ignores the ordered nature of the responses.

(e.g. 2) Quality of sleep before elective operation [BMJ]:

                      Bad    Reasonably good    Good    Total
    Patients given
    Triazolam          2           17            12       31
    Patients given
    Placebo            8           15             8       31
    Total             10           32            20       62

See the article by Moses L et al., NEJM 311 442-448, 1984 (also published as a chapter in Medical Uses of Statistics by J Bailar and F Mosteller).

(e.g. 3) Outcome after 2 to 7 days of Rx in 20 patients with chronic oral candidiasis:

                      Outcome category
                      1 (good)    2    3    4 (poor)    Total
    Clotrimazole         6        3    1       0          10
    Placebo              1        0    0       9          10
    Total                7        3    1       9          20

Any dichotomization of outcomes loses information and statistical power. Moses et al. suggest using the Mann-Whitney U test (also known as the Wilcoxon Rank Sum test) to take account of the ordered nature of the response categories.

The χ² statistic measures the deviation from independence of the row and column classifications (e.g. 1) and the dissimilarity of the distributions (profiles) of responses (e.g. 2 and 3). However, omnibus chi-square tests (H0: identical response profiles) with large df are seldom of interest, since the alternative hypothesis (profiles are not identical) is so broad, and the chi-square tests are invariant to the ordering of the rows and columns. More often, a specific alternative hypothesis is of interest; omnibus tests penalize one for looking in all directions, when in fact one's focus is narrower, and aiming to pick up a specific 'signal'. The next 2 examples (>2 ORDERED response categories in each of 2 groups; binary responses in >2 ORDERED exposure categories) are a more fruitful step in this direction.
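A minimal Python sketch (not in the notes; scipy assumed) computing the omnibus chi-square for the 2x3 sleep-quality table; note df = (2–1)(3–1) = 2, and the statistic is unchanged if the ordered columns are shuffled:

    import numpy as np
    from scipy.stats import chi2_contingency

    sleep = np.array([[2, 17, 12],    # Triazolam: Bad, Reasonably good, Good
                      [8, 15,  8]])   # Placebo
    x2, p, df, exp = chi2_contingency(sleep, correction=False)
    print(x2, df, p)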


Inference from 2 way Tables M&M §9 Test for trend in (Response) Proportions [from A&B §12.2]

Suppose that, in a k x 2 contingency table, the k groups fall into a natural order. They may correspond to different values, or groups of values, of a quantitative variable like age; or they may correspond to qualitative categories, such as severity of a disease, which can be ordered but not readily assigned a numerical value. The usual χ²(k–1) test is designed to detect differences between the k proportions --- without taking the 'ordering' of the rows into account. It is an 'omnibus' test and is unchanged even if we interchange the order of the columns. More specifically, one might ask whether there is a significant trend in these proportions from group 1 to group k. Let us assign a quantitative variable, x, to the k groups. If the definition of the groups uses such a variable, this can be chosen to be x. If the definition is qualitative, x can take integer values from 1 to k. The notation is as follows:

    Group    X      Pos    Neg        Total    proportion positive
    1        x1     r1     n1 – r1    n1       p1
    2        x2     r2     n2 – r2    n2       p2
    .        .      .      .          .        .
    i        xi     ri     ni – ri    ni       pi
    .        .      .      .          .        .
    k        xk     rk     nk – rk    nk       pk
    ---      ---    ---    -------    ---      ---
    All             R      N – R      N        P (= R/N)

Example [jh]: Distribution of subjects with polluted-water-exposure-related symptoms among Competitors and Employees, and Relative Risk (RR) according to number of falls in the water. Data from the article "Health Hazards Associated with Windsurfing on Polluted Water", AJPH 76 690-691, 1986 -- research conducted at the Windsurfer Western Hemisphere Championship held over 9 days in August 1984. During the championships, the same single-menu meals were served to both competitors and employees.

    Groups of             No. of subjects    No.
    subjects              with symptoms      without    Total    RD      RR     OR
    Employees (ref gp)      8 (20%)            33         41     --      1.0     1.0
    Competitors:
      0-10 falls           15 (44%)            19         34     24%     2.3     3.3
      11-20 falls           9 (45%)            11         20     25%     2.3     3.4
      21-30 falls          10 (71%)             4         14     51%     3.7    10.3
      > 30 falls           10 (100%)            0         10     80%     5.1    inf.

The χ²(1) statistic for trend, X²(1), which forms part of the overall X², can be computed as follows:

    X²(1 df) = N { N∑ri xi – R∑ni xi }² / ( R{N – R} [ N∑ni xi² – (∑ni xi)² ] )

Any dichotomization of exposure loses information and statistical power. The authors correctly used the chi-square test for trend, yielding χ²(1df) = 25.3, P = 10^-6. I get 24.58 with the "spacing" 0, 5, 15, 25 and 40. SAS*, using the "Cochran-Armitage Trend Test" with the same spacing, gives a Z statistic of -4.969 (Z² = 24.69). The entire variation among the 5 proportions in the table (ignoring ordering) is approximately X²(4 df) = 27, but it is almost all explained by the exposure gradient. In smaller datasets, even if the overall X² is not significant, the trend portion can be. In this e.g. there was such a strong relationship that even the overall test was significant. The same is true in the example overleaf (dealing with birth date and sporting success), where again the sample sizes are large and the signal strong.

* Syntax, from SAS, if there is 1 line of data for each of the 119 individuals:

    PROC FREQ DATA= ... ;
      TABLES falls*sick / TREND;

If instead you enter a variable (say "number") to indicate how many persons had each exposure/response pattern, then the required syntax is:

    proc freq DATA= ... ;
      tables falls*sick / trend;
      weight number;

From Stata:

    input falls ill number
     0 0 33
     0 1  8
     5 0 19
     5 1 15
    15 0 11
    15 1  9
    25 0  4
    25 1 10
    40 0  0
    40 1 10
    end

    tabodds ill falls [freq=number]

PS: If you look up A&B, you will find another x² [Eqn. 12.2]. This value, calculated as the difference between the trend and the overall x² statistics, can be used to test whether there is serious non-linear variation over and above the linear trend.


Example: Birth date and sporting success

[Figure: No. of players (75 to 200) born in each birthdate quarter (Aug-Oct, Nov-Jan, Feb-Apr, May-Jul): relationship between birthdate and participation rates in the Dutch soccer league. Note the ordinate begins at 75 players.]

SCIENTIFIC CORRESPONDENCE in NATURE • VOL 368 • 14 APRIL 1994 p592

Sir — I have found a significant relationship between birth date and success in tennis and soccer. In the Netherlands and England, players born early in the competition year are more likely to participate in national soccer leagues. The high incidence of elite athletes born in the first quarter of the competition year can be explained by the effects of age-group position.

In organized sport, talent is considered predominantly in terms of physical skills, and the influence of social and psychological factors is often ignored or underestimated¹. Various studies have investigated the psychological characteristics of elite athletes², but none has looked for an effect of age. I discovered a strikingly skewed distribution of the dates of birth of 12- to 16-year-old tennis players in the top rankings of the Dutch youth league. Half of a sample of 60 tennis players were born in the first 3 months of the year.

This discovery led me to consider the distribution of the dates of birth of professional soccer players. In the Netherlands, there are two leagues comprising a total of 36 clubs. I found a striking difference between participation rates of those born in August and July. The Dutch soccer competition year starts on the first of August. A chi-square test indicates that the distribution is not uniform (P<0.001); and a regression analysis demonstrates a clear linear relationship between month of birth and number of participants. The dates of birth of 621 players, compiled into quarters, are shown in the figure. This relationship cannot be attributed to the distribution of births in the Netherlands, as this is highly uniform.

We also inspected the distribution of the dates of birth of English football players in league clubs in the period 1991-92 (ref. 3). Birth dates for all players were tabulated by month and compiled into quarters. The results (table) show the significant effect of date of birth on the participation rate of soccer players within each of the national leagues, indicating that, as in the Netherlands, significantly more football players are born in the first quarter of the competition year (which starts in September in England).

PARTICIPATION RATES IN ENGLISH SOCCER LEAGUES

                  Players in birthdate quarters                    Statistics
    League        Sep-Nov   Dec-Feb   Mar-May   Jun-Aug   Total   Chi-Square   Sig. Level
    FA premier      288       190       147       136       761      75.5      P<0.0001
    Division 1      264       169       154       147       734      48.47     P<0.0001
    Division 2      251       168       123       131       673      61.11     P<0.0001
    Division 3      217       169       121       102       609      52.38     P<0.0001
    Total         1,020       696       545       516     2,777     230.77     P<0.0001

There is a known relationship between date of birth and educational achievement⁵, implying that the younger children in any school year group are at a disadvantage compared to the older children. Children who participate in sports are also placed in age groups, and my results imply many athletes in organized sports may never get a fair chance because of this method of classification. Very little attention has been drawn to this problem. One of the few studies done in this area analysed the dates of birth of young Canadian hockey players in the 1983-84 season⁶. Players possessing a relative age advantage (born in the months January-June) were more likely to participate in minor hockey and more likely to play for top teams than players born in July-December.

More than 20 years ago, this journal published an article concerning the relationship between season of birth and cognitive development⁷. The authors attributed this relationship to a fault in the British educational system. A similar relationship was found⁵ in the Netherlands. Despite this, no action was undertaken to change the educational system. One can only hope that this will not be the case for sports.

References: 1. Dudink A. Eur J High Ability 1, 144-150 (1990). 2. Dudink A & Bakker F. Ned. Tschr. Psychol 48, 55-69 (1993). 3. Rollin J. Rothmans Football Yearbook 1992-93 (Headline, London, 1992). 4. Shearer E. Educ Res 10, 51-56 (1967). 5. Doornbos K. Date of birth and scholastic performance (Wolters-Noordhoff, Groningen, 1971). 6. Barnsley R.H. & Thompson A.H. Can. J. Behav. Sci 20, 167-176 (1988). 7. Williams Ph., Davies P., Evans R. & Ferguson N. Nature 228, 1033-1036 (1970).

Ad Dudink, Faculty of Psychology, University of Amsterdam, 1018 WB 3 Amsterdam, The Netherlands

-----------------

For an example of an analysis of seasonal variation, see the article by H T Sørensen et al., "Does month of birth affect risk of Crohn's disease in childhood and adolescence?", p 907, BMJ VOLUME 323, 20 OCTOBER 2001, bmj.com (copy of the article, and the associated dataset, on the course 626 website).


Correlation M&M §2.2

References: A&B Ch 5,8,9,10; Colton Ch 6, M&M Chapter 2.2 Measures of Correlation

Similarities between Correlation and Regression Loose Definition of Correlation:

• Both involve relationships between pair of numerical variables. Degree to which, in observed (x,y) pairs, y value tends to belarger than average when x is larger (smaller) than average; extentto which larger than average x's are associated with larger(smaller) than average y's

• Both: "predictability", "reduction in uncertainty"; "explanation".

• Both involve straight line relationships [can get fancier too].Pearson Product-Moment Correlation Coefficient

Differences

Context Symbol CalculationCorrelation Regression

sample ofn pairs

rxy

∑{xi – x– }{yi – y– }

( ∑{xi – x– }2 ) ( ∑{yi – y–}2 )

Symmetric

(doesn't matter which ison Y, which on X axis)

Directional

(matters which is on Y,which on X axis)

"universe"of all pairs

ρxy E{ (X – µX ) (Y – µY) }

E{ (X – µX )2 } E{ (Y – µY)2 }Chose n 'objects';measure (X,Y) on each

(i) Choose n objects onbasis of their X values;measure their Y; or

(ii) Choose objects, (aswith correlation);measure (X,Y)

Regard X value as'fixed'; .

Notes: ° ρ: Greek letter r, pronounced 'rho' ;° E : Expected value ;° µ: Greek letter 'mu'; denotes mean in universe.° Think of r as an average product of scaled deviations [M&M

p127 use n-1 because the two SDs involved in creating Z scores

implicitly involve 1/√(n-1); result is same as above]Can be extended to non-straight line relationships Spearman's (Non-parametric) Rank Correlation Coefficient

Can relate Y to multipleX variables. x -> rank replace x's by their ranks (1=smallest to n=largest)

Dimensionless (no units)( – 1 to + 1 )

∆Y/∆X units e.g., Kg/cm y -> rank replace y's by their ranks (1=smallest to n=largest)

THEN calculate Pearson correlation for n pairs of ranks

(see later)


Correlation² is a measure of how much the variance of Y is reduced by knowing the value of X (or vice versa):

    Var( Y | X ) = Var( Y ) × ( 1 – ρ² )
    Var( X | Y ) = Var( X ) × ( 1 – ρ² )

ρ² is called the "coefficient of determination". A large ρ² (i.e. ρ close to -1 or +1) -> close linear association of the X and Y values; one is far less uncertain about the value of one variable if told the value of the other.

Positive correlation: larger than ave. X's with larger than ave. Y's; smaller than ave. X's with smaller than ave. Y's.
Negative: larger than ave. X's with smaller than ave. Y's; smaller than ave. X's with larger than ave. Y's.
None: larger than ave. X's 'equally likely' to be coupled with larger as with smaller than ave. Y's.

See the article by Chatillon on the "Balloon Rule" for visually estimating r (cf. Resources for Session 1, course 678 web page).

[Figure: scatterplots showing how r ranges from -1 (negative correlation) through 0 (zero correlation) through +1 (positive correlation); r is not tied to the x or y scale. Quadrant diagram, with axes at ave(X) and ave(Y), of the PRODUCTS of deviations: X-deviation –, Y-deviation + => product –; X-deviation +, Y-deviation – => product –; X-deviation –, Y-deviation – => product +; X-deviation +, Y-deviation + => product +.]

If the X and Y scores are standardized to have mean 0 and unit SD (=1), it can be seen that ρ is like a "rate of exchange", i.e. the value of a standard deviation's worth of X in terms of PREDICTED standard deviation units of Y. If we know an observation is ZX SD's from µX, then the least squares prediction of the observation's ZY value (i.e. relative to µY) is given by

    predicted ZY = ρ • ZX

Notice the regression towards the mean: ρ is always less than 1 in absolute value, and so the predicted ZY is closer to 0 (or, equivalently, the predicted Y is closer to µY) than the ZX was to 0 (or X was to µX).


Inferences re ρ [based on a sample of n (x,y) pairs]

Naturally, the observed r in any particular sample will not exactly match the ρ in the population (i.e. the coefficient one would get if one included everybody). The quantity r varies from one possible sample of n to another possible sample of n, i.e. r is subject to sampling fluctuations about ρ.¹

A question all too often asked of one's data is whether there is evidence of a non-zero correlation between 2 variables. To test this, one sets up the null hypothesis that ρ is zero and determines the probability, calculated under this null hypothesis that ρ = 0, of obtaining an r more extreme than we observed. If the null hypothesis is true, r would just be "randomly different" from zero, with the amount of the random variation governed by n. This discrepancy of r from 0 can be measured as

    t = r √(n – 2) / √(1 – r²)

and should, if the null hypothesis ρ = 0 is true, follow a t distribution with n-2 df.

[Colton's table A5 gives the smallest r which would be considered evidence that ρ ≠ 0. For example, if n=20, so that df = 18, an observed correlation of 0.44 or higher, or between -0.44 and -1, would be considered statistically significant at the P=0.05 level (2-sided). NB: this t-test assumes that the pairs are from a Bivariate Normal distribution. Also, it is valid only for testing ρ = 0, not for testing any other value of ρ.]

Other common questions: given that r is based only on a sample, what interval should I put around r so it can be used as a (say 95%) confidence interval for the "true" coefficient ρ? Or (answerable by the same technique): one observes a certain r1; in another population, one observes a value r2. Is there evidence that the ρ's in the 2 populations we are studying are unequal?

From our experience with the binomial statistic, which is limited to {0,n} or {0,1}, it is no surprise that the r statistic, limited as it is to {minus 1, plus 1}, also has a pattern of sampling variation that is not symmetric unless ρ is right in the middle, i.e. unless ρ = 0. The following transformation of r will lead to a statistic which is approximately normal even if the ρ('s) in the population(s) we are studying is (are) quite distant from 0:

    (1/2) ln { (1 + r) / (1 – r) }   [where ln is log to the base e, or natural log]

It is known as Fisher's transformation of r; the observed r, transformed to this new scale, should be compared against a Gaussian distribution with

    mean = (1/2) ln { (1 + ρ) / (1 – ρ) }   and   SD = 1/√(n – 3) .

¹ JH has seen many a researcher scan a matrix of correlations, highlighting those with a small p-value and hoping to make something of them. But very often, that ρ was non-zero was never in doubt; the more important question is how non-zero the underlying ρ really was. A small p-value (from maybe a feeble r but a large n!) should not be taken as evidence of an important ρ! JH has also observed several disappointed researchers who mistakenly see the small p-values and think they are the correlations! (The p-values associated with the test of ρ = 0 are often printed under the correlations.)

Interesting example where r ≠ 0, and not by chance alone! The 1970 U.S. DRAFT LOTTERY during the Vietnam War: see Moore and McCabe pp 113-114, along with the spreadsheet under Resources for Chapter 10, where the lottery is simulated using random numbers (Monte Carlo method).


Inferences re ρ [continued...]

e.g. 2a: Testing H0: ρ = 0.5. Observe r = 0.4 in a sample of n = 20. Compute

    Z = [ (1/2) ln{ (1 + 0.4)/(1 – 0.4) } – (1/2) ln{ (1 + 0.5)/(1 – 0.5) } ] / ( 1/√(n – 3) )

and compare with Gaussian (0,1) tables. Extreme values of the standardized Z are taken as evidence against H0. Often, the alternative hypothesis concerning ρ is 1-sided, of the form ρ > some quantity.

e.g. 2b: Testing H0: ρ1 = ρ2, given r1 & r2 in independent samples of n1 & n2. Remembering that "variances add; SD's do not", compute the test statistic

    Z = [ (1/2) ln{ (1 + r1)/(1 – r1) } – (1/2) ln{ (1 + r2)/(1 – r2) } – 0 ] / √( 1/(n1 – 3) + 1/(n2 – 3) )

and compare with Gaussian (0,1) tables.

e.g. 2c: 100(1–α)% CI for ρ, e.g. from an r = 0.4 in a sample of n = 20. By solving the double inequality

    –z(α/2) ≤ [ (1/2) ln{ (1 + r)/(1 – r) } – (1/2) ln{ (1 + ρ)/(1 – ρ) } ] / ( 1/√(n – 3) ) ≤ z(α/2)

so that the middle term is ρ, we can construct a CI for ρ:

    [bounds] = [ 1 + r – {1 – r} e^( ± 2 z(α/2) / √(n – 3) ) ] / [ 1 + r + {1 – r} e^( ± 2 z(α/2) / √(n – 3) ) ]

Worked e.g.: 95% CI(ρ) based on r = 0.55 in a sample of n = 12. With α = 0.05, z(α/2) = 1.96, the lower & upper bounds for ρ are:

    = [ 1 + 0.55 – {1 – 0.55} e^( ± 2 • 1.96 / √9 ) ] / [ 1 + 0.55 + {1 – 0.55} e^( ± 2 • 1.96 / √9 ) ]

    = [ 1.55 – 0.45 e^(±1.307) ] / [ 1.55 + 0.45 e^(±1.307) ]

    = (1.55 – 0.45 • 3.69)/(1.55 + 0.45 • 3.69) ,  (1.55 – 0.45 / 3.69)/(1.55 + 0.45 / 3.69)

    = –0.04 to 0.84

This CI, which overlaps zero, agrees with the test of ρ = 0 described above. For if we evaluate 0.55 √(12 – 2) / √(1 – 0.55²), we get a value of 2.08, which is not as extreme as the tabulated t10, 0.05 (2-sided) value of 2.23.

Note: There will be some slight discrepancies between the t-test of ρ = 0 and the z-based CI's. The latter are only approximate. Note also that both assume we have data which have a bivariate Gaussian distribution.
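A minimal Python sketch (not in the notes; scipy assumed) reproducing the t-test of ρ = 0 and the Fisher-transformation CI for r = 0.55, n = 12:

    import numpy as np
    from scipy.stats import t

    r, n = 0.55, 12
    t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    print(t_stat, 2 * t.sf(abs(t_stat), df=n - 2))   # 2.08, P > 0.05

    z  = np.arctanh(r)                   # (1/2) ln[(1+r)/(1-r)], Fisher's transform
    hw = 1.96 / np.sqrt(n - 3)
    lo, hi = np.tanh([z - hw, z + hw])   # back-transform the z-scale bounds
    print(lo, hi)                        # about -0.03 to 0.85 (the notes round to -0.04, 0.84)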


(Partial) NOMOGRAM for 95% CI's for ρ; n = 10, 15, 25, 50, 150.

It is based on Fisher's transformation of r. In addition to reading it vertically to get a CI for ρ (vertical axis) based on an observed r (horizontal axis), one can also use it to test whether an observed r is compatible with, or significantly different at the α = 0.05 level from, some specific value, ρ0 say, on the vertical axis: simply read across from ρ = ρ0 and see if the observed r falls within the horizontal range appropriate to the sample size involved. Note that this test of a nonzero ρ is not possible via the t-test. Books of statistical tables have fuller nomograms.

Shown: the CI if one observes r = 0.5 (o) with n = 25.

One could also use the nomogram to gauge the approx. 95% limits of variation for the correlation in a draft lottery. The n = 366 is a little more than 2.44 times the n = 150 here. So the (horizontal) variations around ρ = 0 should be only 1/√2.44, or 64%, as wide as those shown here for n = 150. Thus the 95% range of r would be approx. -0.1 to +0.1. (Since X and Y are uniform, rather than Gaussian, the theory may be a little "off".) The observed r was -0.23.

[Figure: nomogram; horizontal axis: observed r, from -0.8 to 1; vertical axis: rho, from -1 to 1; one pair of curves per n = 10, 15, 25, 50, 150, computed as {1+r-(1-r)Exp[±2z/Sqrt[n-3]]} / {1+r+(1-r)Exp[±2z/Sqrt[n-3]]}.]


Spearman's (Non-parametric) Rank Correlation Coefficient

How Calculated:
(i) replace the x's and y's by their ranks (1 = smallest to n = largest);
(ii) calculate the Pearson correlation using the n pairs of ranks.

Equivalently,

    rSpearman = 1 – 6∑d² / ( n{n² – 1} )
    { d = ∆ between the "X rank" & the "Y rank" for each observation }

Advantages:
• Easy to do manually (if the ranking is not a chore);
• Less sensitive to outliers (x -> rank ==> the variance is fixed, for a given n). Extreme {xi – x̄} or {yi – ȳ} can exert considerable influence on rPearson.
• Picks up on non-linear patterns: e.g. for a set of points that rises monotonically but not in a straight line, rSpearman is 1, whereas rPearson is less.

Correlations -- obscured and artifactual

(i) Diluted / attenuated / obscured
1. Relationship, in McGill Engineering students, between their first-year university grades and their CEGEP grades.
2. Relationship between the heights of offspring and the heights of their parents: X = average height of the 2 parents; Y = height of offspring (ignoring the sex of the offspring). Galton's solution: 'transmute' female heights to male heights: 'transmuted' height = height × 1.08.

(ii) Artifact / artificially induced
1. Blood pressure of unrelated (male, female) 'couples'.
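A minimal Python sketch (not in the notes; scipy assumed, with illustrative data) showing that the 6∑d² shortcut, the Pearson-on-ranks definition, and scipy's spearmanr agree:

    import numpy as np
    from scipy.stats import spearmanr, pearsonr, rankdata

    x = np.array([1, 2, 3, 4, 5, 6, 7])
    y = x**3                                 # monotone but curved

    rho, _ = spearmanr(x, y)
    rank_r, _ = pearsonr(rankdata(x), rankdata(y))
    d = rankdata(x) - rankdata(y)
    n = len(x)
    shortcut = 1 - 6 * (d**2).sum() / (n * (n**2 - 1))
    print(rho, rank_r, shortcut)             # all 1.0
    print(pearsonr(x, y)[0])                 # Pearson r < 1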


Regression M&M §2.3 and §10

Uses
• Curve fitting
• Summarization ('model')
• Description
• Prediction
• Explanation
• Adjustment for 'confounding' variables

Technical Meaning
• [originally] simply a line of 'best fit' to data points
• [nowadays] the Regression line is the LINE that connects the CENTRES of the distributions of the Y's at each X value
• not necessarily a straight line; it could be curved, as with growth charts
• not necessarily the µY|X's used as CENTRES; one could use medians etc.
• strictly speaking, one hasn't completed the description unless one characterizes the variation around the centres of the Y distributions at each X
• inference is not restricted to the distributions of Y's at which we make some observations; it applies to the distributions of Y's at all unobserved X values in between.

Live singleton births, Canada 1986 (source: Arbuckle & Sherman, CMAJ 140 157-161, 1989); weight percentiles by gestational age:

                 MALES                             FEMALES
    Age,    Tot.      %-ile weight, g      Tot.      %-ile weight, g
    wk      No.       10th   50th   90th   No.       10th   50th   90th
    25        100      651    810    950      73      604    750    924
    30        257    1 156  1 530  2 214     216    1 040  1 485  2 001
    35      1 840    2 060  2 570  3 140   1 454    1 950  2 460  3 040
    40     68 102    3 020  3 570  4 160  67 149    2 900  3 430  4 000
    42     10 309    3 200  3 770  4 390   9 636    3 060  3 610  4 190

[Figure: BIRTH WEIGHT (1000g to 4000g) vs GESTATIONAL AGE (weeks 30-40): distribution of MALE birth weights at each week; medians (50th %ile) for MALES and for FEMALES.]

Examples (with appropriate caveats)
• Birth weight (Y) in relation to gestational age (X)
• Blood pressure (Y) in relation to age (X)
• Cardiovascular mortality (Y) in relation to water hardness (X)?
• Cancer incidence (Y) in relation to some exposure (X)?
• Scholastic performance (Y) vis-a-vis amount of TV watched (X)

Caveat: there is no guarantee that a simple straight-line relationship will be adequate. Also, in some instances the relationship might change with the type of X and Y variables used to measure the two phenomena being studied; also, the relationship may be more artifact than real (see later for inference).


Simple Linear† Regression (one X): fitting a straight line to data by the Least Squares Method

The most common method is that of Least Squares. Note that least squares can be thought of as just a curve-fitting method and doesn't have to be thought of in a statistical (or random variation or sampling variation) context. Other, more statistically oriented methods include the method of minimum Chi-Square (matching observed and expected counts according to a measure of discrepancy) and the method of Maximum Likelihood (finding the parameters that made the data most likely). Each has a different criterion of "best fit".

Equation

• µY|X = α + β X   or   ∆µY|X / ∆X = β = "rise"/"run"

In Practice: one rarely sees an exact straight-line relationship in health science applications:

1 - While physicists are often able to examine the relationship between Y and X in a laboratory with all other things being equal (i.e. controlled or held constant), medical investigators largely are not. The universe of (X,Y) pairs is very large and any 'true' relationship is disturbed by countless uncontrollable (and sometimes un-measurable) factors. In any particular sample of (X,Y) pairs these distortions will surely be operating.

2 - The true relationship (even if we could measure it exactly) may not be a simple straight line.

3 - The measuring instruments may be faulty or inexact (using 'instruments' in the broadest sense).

One always tries to have the investigation sufficiently controlled that the 'real' relationship won't be 'swamped' by factors 1 and 3, and that the background "noise" will be small enough that alternative models (e.g. curvilinear relationships) can be distinguished from one another.

Least Squares Approach:

• Consider a candidate slope (b) and intercept (a) and predict that the Y value accompanying any X = x is ŷ = a + b•x. The observed y value will deviate from this "predicted" or "fitted" value by an amount d = y – ŷ. We wish to keep this deviation as small as possible, but we must try to strike a balance over all the data points. Again, just as when calculating variances, it is easier to work with squared deviations¹:

  d² = (y – ŷ)²

We weight all deviations equally (whether they be the ones in the middle or at the extremes of the x range), using Σd² = Σ(y – ŷ)² to measure the overall (or average) discrepancy of the points from the line.

------------------------------------------------------------------------------------------------------

† Linear here means linear in the parameters. The equation y = B•Cˣ can be made linear in the parameters by taking logs, i.e. log[y] = log[B] + x log[C]; y = a + b•x + c•x² is already linear in the parameters a, b and c. The following model cannot be made linear in the parameters α, β, γ:

  proportion dying = α + (1 – α) / (1 + exp{β – γ log[dose]})

¹ There are also several theoretical advantages to least squares estimates over others based, for example, on least absolute deviations: they are the most precise of all the possible estimates one could get by taking linear combinations of y's.
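To make the footnote concrete, a minimal sketch (with hypothetical B and C) showing that regressing log y on x recovers the parameters of y = B•Cˣ:

    import numpy as np

    # Hypothetical (x, y) pairs generated from y = B * C**x with B = 2, C = 1.5
    x = np.array([0, 1, 2, 3, 4], dtype=float)
    y = 2.0 * 1.5 ** x

    # Taking logs makes the model linear in the parameters:
    #   log y = log B + x * log C
    logC, logB = np.polyfit(x, np.log(y), 1)
    print("B ~", np.exp(logB), "  C ~", np.exp(logC))   # recovers 2 and 1.5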


• From all the possible candidates for slope (b) and intercept (a), we choose the particular values a and b which make this sum of squares (the sum of squared deviations of 'fitted' from 'observed' Y's) a minimum, i.e. we search for the a and b that give us the least squares fit.

• Fortunately, we don't have to use trial and error to arrive at the 'best' a and b. Instead, it can be shown by calculus, or algebraically, that the a and b which minimize Σd² are:

  b = β^ = Σ{xi – x̄}{yi – ȳ} / Σ{xi – x̄}²  =  rxy • sy/sx

  a = α^ = ȳ – b x̄

[Note that a least-squares fit of the regression line of X on Y would give a different set of values for the slope and intercept: the slope of the line of x on y is rxy • sx/sy. One needs to be careful, when using a calculator or computer program, to specify which is the explanatory variable (X) and which is the predicted variable (Y).]

It is helpful, for testing whether there is evidence of a non-zero slope, to think of the simplest of all regression models, namely the horizontal straight line

  µY|X = α + 0 • X = the constant α.

This is a re-statement of the fact that the sum of squared deviations around a constant horizontal line at height 'a' is smallest when 'a' = the mean.

[We don't always use the mean as the best 'centre' of a set of numbers. Imagine waiting for one of several elevators with doors in a row along one wall; you do not know which one will arrive next, and so want to stand in the 'best' place no matter which one comes next. Where to stand depends on the criterion being optimized: if you want to minimize the maximum distance, stand in the middle between the one on the extreme left and the one on the extreme right; if you wish to minimize the average distance, where do you stand? If, for some reason, you want to minimize the average squared distance, where do you stand? If the elevator doors are not equally spaced from each other, what then?]

Meaning of the intercept parameter (a):

Unlike the slope parameter (which represents the increase/decrease in µY|X for every unit increase in x), the intercept does not always have a 'natural' interpretation. It depends on where the x-values lie in relation to x = 0, and may represent part of what is really the mean Y. For example, the regression line for fuel economy of cars (Y) in relation to their weight (x) might be

  µY|weight = 60 mpg – 0.01•weight in lbs   [0.01 mpg/lb]

but there are no cars weighing 0 lbs. It would be better to write the equation in relation to some 'central' value for weight, e.g. 3500 lbs; then the same equation can be cast as

  µY|weight = 25 – 0.01•(weight – 3500)
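A minimal sketch of these formulas on hypothetical data, confirming that the 'cross-product' form of b and the rxy • sy/sx form agree:

    import numpy as np

    # Hypothetical data; any (x, y) pairs would do
    x = np.array([30., 35., 40., 45., 50.])
    y = np.array([118., 121., 127., 126., 134.])   # e.g. BP at each age

    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    a = ybar - b * xbar

    # Equivalent form: b = r * sy / sx
    r = np.corrcoef(x, y)[0, 1]
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    print(b, r * sy / sx)        # identical
    print("intercept:", a, " slope:", b)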


The anatomy of a slope: some re-expressions

Consider the formula:  slope = b = Σ{x – x̄}{y – ȳ} / Σ{x – x̄}²

Without loss of generality, and for simplicity, assume ȳ = 0.

If we have 3 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3), then x1 – x̄ = –1; x2 – x̄ = 0; x3 – x̄ = +1, so

  slope = b = [{–1}y1 + {0}y2 + {+1}y3] / [{–1}² + {0}² + {+1}²]

  i.e.  slope = (y3 – y1) / (x3 – x1)

Note that y2 contributes to ȳ, and thus to an estimate of the average y (i.e. the level), but not to the slope.

Another way to think of the slope: rewrite b = Σ{x – x̄}{y – ȳ} / Σ{x – x̄}² as

  b = Σ {x – x̄}² • [{y – ȳ}/{x – x̄}] / Σ{x – x̄}²  =  Σ weight • [{y – ȳ}/{x – x̄}] / Σ weight

i.e. a weighted average, with weight {x – x̄}² for the estimate {y – ȳ}/{x – x̄} of the slope.

Yet another way to think of the slope: if there are 4 x's, 1 unit apart (e.g. x1 = 1; x2 = 2; x3 = 3; x4 = 4), then

  x1 – x̄ = –1.5;  x2 – x̄ = –0.5;  x3 – x̄ = +0.5;  x4 – x̄ = +1.5

and so

  slope = b = [{–1.5}y1 + {–0.5}y2 + {+0.5}y3 + {+1.5}y4] / [{–1.5}² + {–0.5}² + {+0.5}² + {+1.5}²]

  i.e.  slope = 1.5{y4 – y1}/5 + 0.5{y3 – y2}/5

  i.e.  slope = (9/10) • {y4 – y1}/{x4 – x1} + (1/10) • {y3 – y2}/{x3 – x2}

i.e. a weighted average of the slope from datapoints 1 and 4 and that from datapoints 2 and 3, with weights proportional to the squares of their distances on the x axis, {x4 – x1}² = 9 and {x3 – x2}² = 1.

In general, b is a weighted average of all the pairwise slopes (yi – yj)/(xi – xj), with weights proportional to {xi – xj}². E.g., with 4 x's 1 unit apart, denoting by b1&2 the slope obtained from {x2,y2} & {x1,y1}, etc.,

  b = [1•b1&2 + 4•b1&3 + 9•b1&4 + 1•b2&3 + 4•b2&4 + 1•b3&4] / [1 + 4 + 9 + 1 + 4 + 1 = 20]

jh 6/94
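This weighted-average identity is easy to check numerically; a minimal sketch (hypothetical y values):

    import numpy as np
    from itertools import combinations

    x = np.array([1., 2., 3., 4.])
    y = np.array([2.0, 2.9, 4.2, 4.8])   # hypothetical

    # Ordinary least-squares slope
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    # Weighted average of all pairwise slopes, weights (xi - xj)^2
    num = den = 0.0
    for i, j in combinations(range(len(x)), 2):
        w = (x[i] - x[j]) ** 2
        num += w * (y[i] - y[j]) / (x[i] - x[j])
        den += w
    print(b, num / den)   # the two agree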


Inferences regarding Simple Linear Regression

How reliable are
(i) the (estimated) slope,
(ii) the (estimated) intercept,
(iii) the predicted mean Y at a given X,
(iv) the predicted y for a (future) individual with a given X,

when they are based on data from a sample? i.e. how much would these estimated quantities change if they were based on a different random sample [with the same x values]?

We can use the concept of sampling variation to (i) describe the 'uncertainty' in our estimates via CONFIDENCE INTERVALS or (ii) carry out TESTS of significance on the parameters (slope, intercept, predicted mean).

We can describe the degree of reliability of (or, conversely, the degree of uncertainty in) an estimated quantity by the standard deviation of the possible estimates produced by different random samples of the same size from the same x's. We call this (obviously conceptual) S.D. the standard error of the estimated quantity (just like the standard error of the mean when estimating µ). [It is helpful to think of the slope as an average difference in means for 2 groups that are 1 x-unit apart.]

Notes: The regression line refers to the relationship between the average Y at a given X and the X, and not to individual Y's vs X. Obviously, of course, if the individual Y's are close to the average Y, so much the better!

The size of the standard error will depend on
1. how 'spread apart' the x's are;
2. how good a fit the regression line really is (i.e. how small the unexplained variation about the line is);
3. how large the sample size, n, is.

Factors affecting reliability (in more detail)

1. The spread of the X's: The best way to get a reliable estimate of the slope is to take Y readings at X's that are quite a distance from each other. E.g. in estimating the "per year increase in BP over the 30-50 yr. age range", it would be better to take X = 30, 35, 40, 45, 50 than to take X = 38, 39, 40, 41, 42. Any individual fluctuations will 'throw off' the slope much less if the X's are far apart.

[Figure: BP vs AGE (30-50), panels (a) and (b). Thick line: real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation when we study 1 individual at each of two ages (a) spaced closer together, (b) spaced further apart.]

The above argument would suggest studying individuals at the extremes of the X values of interest. We do this if we are sure that the relationship is a linear one. If we are not sure, it is wiser -- if we have a choice in the matter -- to take a 3-point distribution.

There is a common misapprehension that a Gaussian distribution of X values is desirable for estimating a regression slope of Y on X. In fact, the 'inverted U' shape of the Gaussian is the least desirable!


Factors affecting reliability (continued)

2. The (vertical) variation about the regression line: Again, consider BP and age, and suppose that indeed the average BP of all persons aged X + 1 is β units higher than the average BP of all persons aged X, and that this linear relationship

  average BP of persons aged X = α + β • X   (average of Y's at a given X = intercept + slope • X)

holds over the age span 30-50. Obviously, everybody aged x = 32 won't have the exact same BP; some will be above the average of 32-yr-olds, some below. Likewise for the different ages x = 30, ..., 50. In other words, at any x there will be a distribution of y's about the average for age X. Obviously, how wide this distribution is about α + β•X will have an effect on what slopes one could find in different samples (measure the vertical spread around the line by σ).

[Figure: BP vs AGE (30-50), panels (a) and (b). Thick line: real (true) relation between average BP at age X and X; thin lines: possible apparent relationships because of individual variation when we study 1 individual at each of two ages when the within-age distributions have (a) a narrow spread, (b) a wider spread.]

NOTE: For unweighted regression, we should have roughly the same spread of Y's at each X.

3. Sample Size (n): A larger n will make it more difficult for the types of extreme and misleading estimates caused by 1) poor X spread and 2) large variation in Y about µY|X to occur. Clearly, it may be possible to spread the x's out so as to maximize their variance (and thus reduce the n required), but it may not be possible to change the magnitude of the variation about µY|X (unless there are other known factors influencing BP). Hence the need for a reasonably stable estimated ŷ [i.e. estimate of µY|X].


Standard Errors

  SE(b) = SE(β^) = σ / √[Σ{xi – x̄}²]

  SE(a) = SE(α^) = σ • √[ 1/n + x̄² / Σ{xi – x̄}² ]

(Note: there is a negative correlation between a and b.)

We don't usually know σ, so we estimate it from the data, using the scatter of the y's from the fitted line (i.e. the SD of the residuals).

If we examine the structure of SE(b), we see that it reflects the 3 factors discussed above: (i) a large spread of the x's makes the contribution of each observation to Σ{xi – x̄}² large, and since this is in the denominator, it reduces the SE; (ii) a small vertical scatter is reflected in a small σ, and since this is in the numerator, it also reduces the SE of the estimated slope; (iii) a large sample size means that Σ{xi – x̄}² is larger, and like (i) this reduces the SE.

The formula, as written, tends to hide this last factor; note that Σ{xi – x̄}² is what we use to compute the spread of a set of x's -- we simply divide it by n–1 to get a variance and then take the square root to get the sd. To make the point here, simplify n–1 to n and write Σ{xi – x̄}² ≈ n•var(x), so that √[Σ{xi – x̄}²] ≈ √n • sd(x), and the equation for the SE simplifies to

  SE(b) = σ / [√n • sd(x)] = [SDy|x / SDx] / √n

with n in its familiar place in the denominator of the SE (even in more complex SE's, this is where n is usually found!).

The structure of SE(a): In addition to the factors mentioned above, all of which come in again in the expected way, there is the additional factor of x̄²; since this is in the numerator, it increases the SE. This is natural in that, if the data, and thus x̄, are far from x = 0, then any imprecision in the estimate of the slope will project backwards to a large imprecision in the estimated intercept. Also, if one uses 'centered' x's, so that x̄ = 0, the formula for the SE reduces to

  SE(a) = σ • √(1/n) = σ/√n

and we recognize this as SE(ȳ) -- not surprisingly, since ȳ is the 'intercept' for centered data.

CI's & Tests of Significance for α^ and β^ are based on the t-distribution (or Gaussian Z's if n is large):

  CI:  β^ ± tn–2 • SE(β^);   test of H0: β = β0:  tn–2 = [β^ – β0] / SE(β^)

and likewise for α^.
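A minimal sketch putting these formulas together (hypothetical BP-vs-age data; σ is estimated from the residuals with n–2 degrees of freedom):

    import numpy as np
    from scipy import stats

    x = np.array([30., 35., 40., 45., 50.])
    y = np.array([118., 121., 127., 126., 134.])   # hypothetical
    n = len(x)

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    resid = y - (a + b * x)
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))   # SD of residuals
    se_b = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))
    se_a = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / np.sum((x - x.mean()) ** 2))

    t_crit = stats.t.ppf(0.975, df=n - 2)
    print("b:", b, "95% CI:", (b - t_crit * se_b, b + t_crit * se_b))
    print("test of H0: beta = 0:  t =", b / se_b,
          " P =", 2 * stats.t.sf(abs(b / se_b), df=n - 2))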


Standard Error for the Estimated µY|X, or 'average Y at X'

We estimate the 'average Y at X', µY|X, by α^ + β^•X. Since the estimate is based on two estimated quantities, each of which is subject to sampling variation, it contains the uncertainty of both:

  SE(estimated average Y at X) = σ • √[ 1/n + {X – x̄}² / Σ{xi – x̄}² ]

Again, we must use an estimate σ^ of σ.

First-time users of this formula suspect that it has a missing Σ, or an x instead of an x̄, or something. There is no typographical error, and indeed, if one examines it closely, it makes sense. X refers to the x-value at which one is estimating the mean -- it has nothing to do with the actual x's in the study which generated the estimated coefficients, except that the closer X is to the center of the data, the smaller the quantity {X – x̄}, and thus the quantity {X – x̄}², and thus the SE, will be. Indeed, if we estimate the average Y right at X = x̄, the estimate is simply ȳ (since the fitted line goes through [x̄, ȳ]) and its SE will be

  σ • √[ 1/n + {x̄ – x̄}² / Σ{xi – x̄}² ]  =  σ • √(1/n)  =  σ/√n  =  SE(ȳ).

cf. data on sleeping through the night; alcohol levels and eye speed.

Confidence Interval for an individual Y at X

A certain percentage P% of individuals are within tP • σ of the mean µY|X = α + β • X, where tP is a multiple, depending on P, from the t table (or, if n is large, the Z table). However, we are not quite certain where exactly the mean α + β • X is -- the best we can do is estimate, with a certain P% confidence, that it is within tP • SE(α^ + β^•X) of the point estimate α^ + β^•X. The uncertainty concerning the mean and the natural variation of individuals around the mean -- wherever it is -- combine in the expression for the estimated P% range of individual variation, which is as follows:

  α^ + β^•X ± t • σ • √[ 1 + 1/n + {X – x̄}² / Σ{xi – x̄}² ]

Both the CI for the estimated mean and the CI for individuals (i.e. the estimated percentiles of the distribution) are bow-shaped when drawn as a function of X. They are narrowest at X = x̄ and fan out from there. One needs to be careful not to confuse the much narrower CI for the mean with the much wider CI for individuals. If one can see the raw data, it is usually obvious which is which -- the CI for individuals is almost as wide as the raw data themselves.


PERIOPERATIVE NORMOTHERMIA TO REDUCE THE INCIDENCE OF SURGICAL-WOUND INFECTION AND SHORTEN HOSPITALIZATION

Andrea Kurz, M.D., Daniel I. Sessler, M.D., and Rainer Lenhardt, M.D., for the Study of Wound Infection and Temperature Group¹

Abstract

Background. Mild perioperative hypothermia, which is common during major surgery, may promote surgical-wound infection by triggering thermoregulatory vasoconstriction, which decreases subcutaneous oxygen tension. Reduced levels of oxygen in tissue impair oxidative killing by neutrophils and decrease the strength of the healing wound by reducing the deposition of collagen. Hypothermia also directly impairs immune function. We tested the hypothesis that hypothermia both increases susceptibility to surgical-wound infection and lengthens hospitalization.

Methods. Two hundred patients undergoing colorectal surgery were randomly assigned to routine intraoperative thermal care (the hypothermia group) or additional warming (the normothermia group). The patients' anesthetic care was standardized, and they were all given cefamandole and metronidazole. In a double-blind protocol, their wounds were evaluated daily until discharge from the hospital and in the clinic after two weeks; wounds containing culture-positive pus were considered infected. The patients' surgeons remained unaware of the patients' group assignments.

Results. The mean (±SD) final intraoperative core temperature was 34.7±0.6°C in the hypothermia group and 36.6±0.5°C in the normothermia group (P<0.001). Surgical-wound infections were found in 18 of 96 patients assigned to hypothermia (19 percent) but in only 6 of 104 patients assigned to normothermia (6 percent, P=0.009). The sutures were removed one day later in the patients assigned to hypothermia than in those assigned to normothermia (P=0.002), and the duration of hospitalization was prolonged by 2.6 days (approximately 20 percent) in the hypothermia group (P=0.01).

Conclusions. Hypothermia itself may delay healing and predispose patients to wound infections. Maintaining normothermia intraoperatively is likely to decrease the incidence of infectious complications in patients undergoing colorectal resection and to shorten their hospitalizations. (N Engl J Med 1996;334:1209-15. May 9)

WOUND infections are common and serious complications of anesthesia and surgery. A wound infection can prolong hospitalization by 5 to 20 days and substantially increase medical costs.[1,2] In patients undergoing colon surgery, the risk of such an infection ranges from 3 to 22 percent, depending on such factors as the length of surgery and underlying medical problems.[3] Mild perioperative hypothermia (approximately 2°C below the normal core body temperature) is common in colon surgery.[4] It results from anesthetic-induced impairment of thermoregulation,[5,6] exposure to cold, and altered distribution of body heat.[7] Although it is rarely desired, intraoperative hypothermia is usual because few patients are actively warmed.[8]

Hypothermia may increase patients' susceptibility to perioperative wound infections by causing vasoconstriction and impaired immunity. The presence of sufficient intraoperative hypothermia triggers thermoregulatory vasoconstriction,[9] and postoperative vasoconstriction is universal in patients with hypothermia.[10] Vasoconstriction decreases the partial pressure of oxygen in tissues, which lowers resistance to infection in animals[11,12] and humans (unpublished data). There is decreased microbial killing, partly because the production of oxygen and nitroso free radicals is oxygen-dependent in the range of the partial pressures of oxygen in wounds.[13,14] Mild core hypothermia can also directly impair immune functions, such as the chemotaxis and phagocytosis of granulocytes, the motility of macrophages, and the production of antibody.[15,16] Mild hypothermia, by decreasing the availability of tissue oxygen, impairs oxidative killing by neutrophils. And mild hypothermia during anesthesia lowers resistance to inoculations with Escherichia coli[17] and Staphylococcus aureus[18] in guinea pigs.

¹ From the Thermoregulation Research Laboratory and the Department of Anesthesia, University of California, San Francisco (A.K., D.I.S.); and the Departments of Anesthesiology and General Intensive Care, University of Vienna, Vienna, Austria (A.K., D.I.S., R.L.). Address reprint requests to Dr. Sessler at the Department of Anesthesia, 374 Parnassus Ave., 3rd Fl., University of California, San Francisco, CA 94143-0648. Supported in part by grants (GM49670 and GM27345) from the National Institutes of Health, by the Joseph Drown and Max Kade Foundations, and by Augustine Medical, Inc. The authors do not consult for, accept honorariums from, or own stock or stock options in any company whose products are related to the subject of this research. Presented in part at the International Symposium on the Pharmacology of Thermoregulation, Giessen, Germany, August 17-22, 1994, and at the Annual Meeting of the American Society of Anesthesiologists, Atlanta, October 21-25, 1995.


Vasoconstriction-induced tissue hypoxia may decrease the strength of the healing wound independently of its ability to reduce resistance to infection. The formation of scar requires the hydroxylation of abundant proline and lysine residues to form the cross-links between strands of collagen that give healing wounds their tensile strength.[19] The hydroxylases that catalyze this reaction are dependent on oxygen tension,[20] making collagen deposition proportional to the partial pressure of arterial oxygen in animals[21] and to oxygen tension in wound tissue in humans.[22]

Although safe and inexpensive methods of warming are available,[8] perioperative hypothermia remains common.[23] Accordingly, we tested the hypothesis that mild core hypothermia increases both the incidence of surgical-wound infection and the length of hospitalization in patients undergoing colorectal surgery.

METHODS

With the approval of the institutional review board at each participating institution and written informed consent from the patients, we studied patients 18 to 80 years of age who underwent elective colorectal resection for cancer or inflammatory bowel disease. Patients scheduled for abdominal-perineal pull-through procedures were included, but not those scheduled for minor colon surgery (e.g., polypectomy or colostomy performed as the only procedure). The criteria for exclusion from the study were any use of corticosteroids or other immunosuppressive drugs (including cancer chemotherapy) during the four weeks before surgery; a recent history of fever, infection, or both; serious malnutrition (serum albumin less than 3.3 g per deciliter, a white-cell count below 2500 cells per cubic millimeter, or the loss of more than 20 percent of body weight); or bowel obstruction.

The number of patients required for this trial was estimated on the basis of a preliminary study in which 80 patients undergoing elective colon surgery were randomly assigned to hypothermia (mean [±SD] temperature, 34.4±0.4°C) or normothermia (involving warming with forced air and fluid to a mean temperature of 37±0.3°C). The number of wound infections (as defined by the presence of pus and a positive culture) was evaluated by an observer unaware of the patients' temperatures and group assignments. Nine infections occurred in the 38 patients assigned to hypothermia, but there were only four in the 42 patients assigned to normothermia (P=0.16). Using the observed difference in the incidence of infection, we determined that an enrollment of 400 patients would provide a 90 percent chance of identifying a difference with an alpha value of 0.01.² We therefore planned to study a maximum of 400 patients, with the results to be evaluated after 200 and 300 patients had been studied. The prospective criterion for ending the study early was a difference in the incidence of surgical-wound infection between the two groups with a P value of less than 0.01. To compensate for the two initial analyses, a P value of 0.03 would be required when the study of 400 patients was completed. The combined risk of a type I error was thus less than 5 percent.[24]

² italics by jh ... it does not make a lot of sense to use the difference one observes in a pilot study [or, for that matter, what is in the literature] as a guide to what would be important [clinically important difference].

Study Protocol

The night before surgery, each patient underwent a standard mechanical bowel preparation with an electrolyte solution. Intraluminal antibiotics were not used, but treatment with cefamandole (2 g intravenously every
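For readers who want to reproduce the sample-size arithmetic that jh's footnote questions: a minimal sketch of the usual normal-approximation formula for comparing two proportions, plugging in the pilot rates (9/38 and 4/42) and the stated alpha (0.01, two-sided) and power (90 percent). This is an assumed reconstruction, not necessarily the authors' exact method, but it lands near the 400 patients quoted:

    from scipy.stats import norm

    p1, p2 = 9 / 38, 4 / 42          # pilot infection rates (hypothermia, normothermia)
    alpha, power = 0.01, 0.90
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)

    pbar = (p1 + p2) / 2
    n_per_group = ((z_a * (2 * pbar * (1 - pbar)) ** 0.5 +
                    z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
                   / (p1 - p2) ** 2)
    print(round(n_per_group))        # roughly 200 per group, i.e. ~400 in all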


eight hours) and metronidazole (500 mg intravenously every eight hours) was started during the induction of anesthesia; this treatment was maintained for about four days postoperatively. Anesthesia was induced with thiopental sodium (3 to 5 mg per kilogram of body weight), fentanyl (1 to 3 µg per kilogram), and vecuronium bromide (0.1 mg per kilogram). The administration of isoflurane (in 60 percent nitrous oxide) was titrated to maintain the mean arterial blood pressure within 20 percent of the preinduction values. Additional fentanyl was administered on the completion of surgery, to improve analgesia when the patient emerged from anesthesia.

At the time of the induction of anesthesia, each patient was randomly assigned to one of the following two temperature-management groups with computer-generated codes maintained in numbered, sealed, opaque envelopes: the normothermia group, in which the patients' core temperatures were maintained near 36.5°C, and the hypothermia group, in which the core temperature was allowed to decrease to approximately 34.5°C. In both groups, intravenous fluids were administered through a fluid warmer, but the warmer was activated only in the patients assigned to extra warming. Similarly, a forced-air cover (Augustine Medical, Eden Prairie, Minn.) was positioned over the upper body of every patient, but it was set to deliver air at the ambient temperature in the hypothermia group and at 40°C in the normothermia group. Cardboard shields and sterile drapes were positioned in such a way that the surgeons could not discern the temperature of the gas inflating the cover. Shields were also positioned over the switches governing the fluid heater and the forced-air warmer so that their settings were not apparent to the operating-room personnel. The temperatures were not controlled postoperatively, and the patients were not informed of their group assignments.

The patients were hydrated aggressively during and after surgery, because hypovolemia decreases wound perfusion and increases the incidence of infection.[25,26] We administered 15 ml of crystalloid per kilogram per hour throughout surgery and replaced the volume of blood lost with either crystalloid in a 4:1 ratio or colloid in a 2:1 ratio. Fluids were administered intravenously at rates of 3.5 ml per kilogram per hour for the first 24 postoperative hours and 2 ml per kilogram per hour for the subsequent 24 hours. Leukocyte-depleted blood was administered as the attending surgeon considered appropriate.

Supplemental oxygen was administered through nasal prongs at a rate of 6 liters per minute during the first three postoperative hours and was then gradually eliminated while oxygen saturation was maintained at more than 95 percent. To minimize the decrease in wound perfusion due to activation of the sympathetic nervous system, postoperative pain was treated with piritramide (an opioid), the administration of which was controlled by the patient.

The attending surgeons, who were unaware of the patients' group assignments and core temperatures, determined when to begin feeding them again after surgery, remove their sutures, and discharge them from the hospital. The timing of discharge was based on routine surgical considerations, including the return of bowel function, the control of any infections, and adequate healing of the incision.

Measurements

The patients' morphometric characteristics and smoking history were recorded. The preoperative laboratory evaluation included a complete


blood count; determinations of the prothrombin time and

partial-thromboplastin time; measurements of serum albumin, total protein,

and creatinine; and liver-function tests. The risk of infection was scored

with a standardized algorithm taken from the Study on the Efficacy of

Nosocomial Infection Control (SENIC) of the Centers for Disease Control

and Prevention; in this scoring system, one point each is assigned for the

presence of three or more diagnoses, surgery lasting two hours or more,

surgery at an abdominal site, and the presence of a contaminated or

infected wound.[2] The scoring system was modified slightly from its

original form by the use of the diagnoses made at admission, rather than

discharge. The risk of infection was quantified further with the use of the

National Nosocomial Infection Surveillance System (NNISS), a scoring

system in which the patient's risk of infection was predicted on the basis of

the type of surgery, the patient's physical-status rating on a scale developed

by the American Society of Anesthesiologists, and the duration of surgery

[3]
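The modified SENIC scoring described here is simple enough to encode directly; a toy sketch (the function name and inputs are hypothetical, introduced only to illustrate the point assignment in the text):

    def senic_score(n_diagnoses: int, surgery_hours: float,
                    abdominal_site: bool, contaminated_wound: bool) -> int:
        """Modified SENIC risk score as described in the text:
        one point each for >= 3 diagnoses (at admission), surgery lasting
        two hours or more, an abdominal site, and a contaminated/infected
        wound."""
        return (int(n_diagnoses >= 3) + int(surgery_hours >= 2)
                + int(abdominal_site) + int(contaminated_wound))

    print(senic_score(4, 3.1, True, False))   # -> 3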

Thermal comfort was evaluated at 20-minute intervals for 6 hours

postoperatively with a 100-mm visual-analogue scale on which 0 mm

denoted intense cold, 50 mm denoted thermal comfort, and 100 mm

denoted intense warmth. The degree of surgical pain was evaluated

similarly, except that 0 mm denoted no pain and 100 mm the most intense

pain imaginable. Shivering was assessed qualitatively, on a scale on which

0 denoted no shivering; 1, mild or intermittent shivering; 2, moderate

shivering, and 3, continuous, intense shivering. All the qualitative

assessments were made by observers unaware of the patients' group

assignments and core temperatures.

The patients' surgical wounds were evaluated daily during hospitalization

and again two weeks after surgery by a physician who was unaware of the

group assignments. Wounds were suspected of being infected when pus

could be expressed from the surgical incision or aspirated from a loculated

mass inside the wound. Samples of pus were obtained and cultured for

aerobic and anaerobic bacteria, and wounds were considered infected when

the culture was positive for pathogenic bacteria. All the wound infections

diagnosed within 15 days of surgery were included in the data analysis.

Core temperatures were measured at the tympanic membrane

(Mallinckrodt Anesthesiology Products, St. Louis), with values recorded

preoperatively, at 10-minute intervals intraoperatively, and at 20-minute

intervals for 6 hours during recovery. Arteriovenous-shunt flow was

quantified by subtracting the skin temperature of the fingertip from that of

the forearm, with values exceeding 0°C indicating thermoregulatory

vasoconstriction.[27] End-tidal concentrations of isoflurane and carbon

dioxide were recorded at 10-minute intervals during anesthesia.

Measurements of arterial blood pressure and heart rate were recorded

similarly during anesthesia and for six hours thereafter. Oxyhemoglobin

saturation was measured by pulse oximetry.

Wound healing and infections were also evaluated by the ASEPSIS

system,[28] in which a score is calculated as the weighted sum of points

assigned to the following factors: the duration of antibiotic administration,

the drainage of pus during local anesthesia, the debridement of the wound

during general anesthesia, the presence of a serous discharge, the presence

of erythema, the presence of a purulent exudate, the separation of deep

tissues, the isolation of bacteria from fluid discharged from the wound,

and a duration of hospitalization exceeding 14 days. Scores exceeding 20

on this scale indicate wound infection. As an additional indicator of


infection, preoperative differential white-cell counts were compared with

counts obtained on postoperative days 1, 3, and 6.

Collagen deposition in the wound was evaluated in a subgroup of 30 patients in the normothermia group and 24 patients in the hypothermia group. A 10-cm expanded polytetrafluoroethylene tube (Impra, International Polymer Engineering, Tempe, Ariz.) was inserted subcutaneously several centimeters lateral to the incision at the completion of surgery. On the seventh postoperative day, the tube was removed and assayed for hydroxyproline, a measure of collagen deposition.[29] The ingrowth of collagen in such tubes is proportional to the tensile strength of the healing wound[29] and the subcutaneous oxygen tension.[22]

Statistical Analysis

Outcomes were evaluated on an intention-to-treat basis. The number of postoperative wound infections in each study group and the proportion of smokers among the infected patients were analyzed by Fisher's exact test. Scores for wound healing, the number of days of hospitalization, the extent of collagen deposition, postoperative core temperatures, and potential confounding factors were evaluated by unpaired, two-tailed t-tests. Factors that potentially contributed to infection were included in a univariate analysis. Those that correlated significantly with infection were then included in a multivariate logistic regression with backward elimination; a P value of less than 0.25 was required for a factor to be retained in the analysis.

All the results are presented as means ±SD. A P value of less than 0.01 was required to indicate a significant difference in our major outcomes (the incidence of infection and the duration of hospitalization); a P value of less than 0.005 was considered to indicate a significant difference in postoperative temperature (to compensate for multiple comparisons); for all other data, a P value of less than 0.05 was considered to indicate a statistically significant difference.

RESULTS

Patients were enrolled in the study from July 1993 through March 1995; 155 were evaluated at the University of Vienna, 30 at the University of Graz, and 15 at Rudolfstiftung Hospital. According to the investigational protocol, the study was stopped after 200 patients were enrolled, because the incidence of surgical-wound infection in the two study groups differed with an alpha level of less than 0.01. One hundred four patients were assigned to the normothermia group, and 96 to the hypothermia group. An audit confirmed that the patients had been properly assigned to the groups and that the slight disparity in numbers was present in the original computer-generated randomization codes. All the patients allowed their wounds to be evaluated daily during hospitalization. Ninety-four percent returned for the two-week clinic visit after discharge; those who did not were evenly distributed between the study groups and mostly returned to visit the private offices of their attending surgeons. The wound status of these patients was determined by calling the physician. No previously unidentified wound infections were detected in the clinic for the first time.

Table 1 shows that the characteristics, diagnoses, types of surgical procedure, duration of surgery, hemodynamic values, and types of anesthesia of the patients in the two study groups were similar. Nor did smoking status, the results of preoperative laboratory tests, or preoperative laboratory values differ significantly between the groups. The patients assigned to hypothermia required more transfusions of allogeneic blood


(P=0.01). Intraoperative vasoconstriction was observed in 74 percent of the patients assigned to hypothermia but in only 6 percent of those assigned to normothermia (P<0.001). Core temperatures at the end of surgery were significantly lower in the hypothermia group than in the normothermia group (34.7±0.6 vs. 36.6±0.5°C, P<0.001), and they remained significantly different for more than five hours postoperatively (Fig. 1).

Postoperative vasoconstriction was observed in 78 percent of the patients in the hypothermia group; the vasoconstriction continued throughout the six-hour recovery period. In contrast, vasoconstriction, usually short-lived, was observed in only 22 percent of the patients in the normothermia group (P<0.001). Shivering was observed in 59 percent of the hypothermia group, but in only a few patients in the normothermia group. Thermal comfort was significantly greater in the normothermia group than in the hypothermia group (score on the visual-analogue scale one hour after surgery, 73±14 vs. 35±17 mm). The difference in thermal comfort remained statistically significant for three hours. Pain scores and the amount of opioid administered were virtually identical in the two groups at every postoperative measurement; hemodynamic values were also similar.

Table 1. Characteristics of the Patients in the Two Study Groups.*

                                           Normothermia   Hypothermia   P
CHARACTERISTIC                             (N = 104)      (N = 96)      Value
Male sex (no. of patients)                 58             50            0.70
Weight (kg)                                73±14          71±14         0.31
Height (cm)                                170±9          169±9         0.43
Age (yr)                                   61±15          59±14         0.33
History of smoking (no.)                   33             29            0.94
Diagnosis (no. of patients)
  Inflammatory bowel disease               10             8             0.94
  Cancer                                   94             88
  Duke's stage                                                          1.0
    A                                      29             30
    B                                      37             34
    C                                      26             21
    D                                      2              3
Operative site                                                          0.61
  Colon                                    59             51
  Rectum                                   35             37
Preoperative variables
  Core temperature (°C)                    36.8±0.4       36.7±0.4      0.08
  Hemoglobin (g/dl)                        12.6±2.3       12.7±2.0      0.74
Intraoperative variables
  Fentanyl administered (mg)               0.7±0.3        0.6±0.5       0.09
  End-tidal isoflurane (%)                 0.6±0.1        0.6±0.2       1.0
  Arterial blood pressure (mm Hg)          91±17          95±18         0.11
  Heart rate (beats/min)                   74±17          76±13         0.35
  Crystalloid (liters)                     3.3±1.5        3.2±0.9       0.57
  Colloid (liters)                         0.2±0.3        0.2±0.3       1.0
  Red-cell transfusion (no. of patients)   23             34            0.054
  Volume of blood transfused (units)       0.4±1.0        0.8±1.2       0.01
  Urine output (liters)                    0.6±0.4        0.7±0.4       0.08
  Duration of surgery (hr)                 3.1±1.0        3.1±0.9       1.0
  Ambient temperature (°C)                 21.9±1.2       22.1±0.9      0.19
  Oxyhemoglobin saturation (%)             97.3±1.5       97.5±1.3      0.32
  Final core temperature (°C)              36.6±0.5       34.7±0.6      <0.001

* Plus-minus values are means ± SD.   (Table continued on next page.)


Table 1. Characteristics of the Patients in the Two Study Groups. ... continued

                                           Normothermia   Hypothermia   P
                                           (N = 104)      (N = 96)      Value
Postoperative variables
  Hemoglobin (g/dl)                        11.7±1.9       11.6±1.4      0.67
  Prophylactic antibiotics (days)          3.7±1.9        3.6±1.4       0.67
  SENIC score (no. of patients)                                         0.98
    1                                      3              3
    2                                      95             88
    3                                      6              5
  NNISS score (no. of patients)                                         0.60
    0                                      32             31
    1                                      49             39
    3                                      23             26
  Infection rate predicted by NNISS (%)    8.9            8.8           —
  Oxyhemoglobin saturation (%)             98±1           98±1          1.0
  Piritramide (mg)†                        20±13          22±12         0.26

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial Infection Control, and NNISS National Nosocomial Infection Surveillance System.
† The administration of this analgesic agent was controlled by the patient.

The overall incidence of surgical-wound infection was 12 percent. Although the SENIC and NNISS scores for the risk of infection were similar in the two groups, there were only 6 surgical-wound infections in the normothermia group, as compared with 18 in the hypothermia group (P=0.009) (Table 2). Most positive cultures contained several different organisms; the major ones were E. coli (11 cultures), S. aureus (7), pseudomonas (4), enterobacter (3), and candida (3). Culture-negative pus was expressed from the wounds of two patients assigned to hypothermia and one patient assigned to normothermia. The ASEPSIS scores were higher in the hypothermia group than in the normothermia group (13±16 vs. 7±10, P=0.002) (Table 2); these scores exceeded 20 in 32 percent of the former but only 6 percent of the latter (P<0.001).

Table 2 shows that significantly more collagen was deposited near the wound in the patients in the normothermia group than in the patients in the hypothermia group (328±135 vs. 254±114 µg per centimeter). The patients assigned to hypothermia were first able to tolerate solid food one day later than those assigned to normothermia (P=0.006); similarly, the sutures were removed one day later in the patients assigned to hypothermia (P=0.002). The duration of hospitalization was 12.1±4.4 days in the normothermia group and 14.7±6.5 days in the hypothermia group (P=0.001). This difference was statistically significant even when the analysis was limited to the uninfected patients. In the normothermia group, the duration of hospitalization was 11.8±4.1 days in patients without infection and 17.3±7.3 days in patients with infection (P=0.003). In the hypothermia group, the duration of hospitalization was 13.5±4.5 days in patients without infection and 20.7±11.6 days in patients with infection (P<0.001).

In a univariate analysis, tobacco use, group assignment, surgical site, NNISS score, SENIC score, need for transfusion, and age were all correlated with the risk of infection. In a multivariate backward-elimination analysis, tobacco use, group assignment, surgical site, NNISS score, and age remained risk factors for infection (Table 3).

Four patients in the normothermia group and seven in the hypothermia group required admission to the intensive care unit (P=0.47), mainly because of wound dehiscence, colon perforation, and peritonitis. Two patients in each group died during the month after surgery. The incidence of infection was similar at each study hospital, and no one surgeon was associated with a disproportionate number of infections.
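As a check on the headline comparison (6 of 104 vs. 18 of 96 infections), scipy's implementation of Fisher's exact test — the test named in the Methods — applied to the 2×2 table should come close to the reported P = 0.009:

    from scipy.stats import fisher_exact

    #                 infected, not infected
    table = [[18, 96 - 18],    # hypothermia
             [ 6, 104 - 6]]    # normothermia
    odds_ratio, p = fisher_exact(table)
    print(odds_ratio, p)       # two-sided P should be close to the reported 0.009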


[Figure 1: core temperature (°C, 34-38) vs. time (hr): intraoperative (0-3 hr, plus 'Final') and postoperative (0-6 hr) curves for the Normothermia and Hypothermia groups.]

Fig 1. Core Temperatures during and after Colorectal Surgery in the Study Patients. The mean (±SD) final intraoperative core temperature was 34.7±0.6°C in the 96 patients assigned to hypothermia, who received routine thermal care, and 36.6±0.5°C in the 104 patients assigned to normothermia, who were given extra warming. The core temperatures in the two groups differed significantly at each measurement, except before the induction of anesthesia (first measurement) and after six hours of recovery.

Table 2. Postoperative Findings in the Two Study Groups.*

                                     Normothermia   Hypothermia   P
VARIABLE                             (N = 104)      (N = 96)      Value
All patients
  Infection - no. of patients (%)    6 (6)          18 (19)       0.009
  ASEPSIS score                      7±10           13±16         0.002
  Collagen deposition - µg/cm        328±135        254±114       0.04
  Days to first solid food           5.6±2.5        6.5±2.0       0.006
  Days to suture removal             9.8±2.9        10.9±1.9      0.002
  Days of hospitalization            12.1±4.4       14.7±6.5      0.001
Uninfected patients
  Number of patients                 98             78
  Days to first solid food           5.2±1.6        6.1±1.6       <0.001
  Days to suture removal             9.6±2.6        10.6±1.6      0.003
  Days of hospitalization            11.8±4.1       13.5±4.5      0.01

* Plus-minus values are means ± SD.

Table 3. Multivariate Analysis of Risk Factors for Surgical-Wound Infection.

RISK FACTOR                                          ODDS RATIO (95% CI)
Tobacco use (yes vs. no)                             10.5 (3.2-34.1)
Group assignment (hypothermia vs. normothermia)      4.9 (1.7-14.5)
Surgical site (rectum vs. colon)                     2.7 (0.9-7.6)
NNISS score (per unit increase)                      2.5 (1.2-5.3)
Age (per decade)                                     1.6 (1.0-2.4)

The postoperative hemoglobin concentrations did not differ significantly between the two groups (Table 1). On the first postoperative day, leukocytosis was impaired in the hypothermia group as compared with the normothermia group (white-cell count, 11,500±3500 vs. 13,400±2500 cells per cubic millimeter; P<0.001). On the third postoperative day, however, white-cell counts were significantly higher in the hypothermia group (10,100±3900 vs. 8900±2900 cells per cubic millimeter). The difference in values on the third day was not statistically significant when only uninfected patients were included in the analysis. By the sixth postoperative day, the white-cell counts were similar in the two groups.

Table 4. Postoperative Findings in the Study Patients According to Smoking Status.*

                                     Smokers        Nonsmokers    P
CHARACTERISTIC                       (N = 62)       (N = 138)     Value
Infection - no. of patients (%)      14 (23)        10 (7)        0.004
ASEPSIS score                        15±18          8±10          <0.001
Days to suture removal               10.9±3.5       10.1±2.0      0.04
Days of hospitalization              14.9±6.7       12.9±5.0      0.02
SENIC score (no. of patients)                                     0.25
  1                                  0              6
  2                                  58             125
  3                                  4              7
NNISS score (no. of patients)                                     0.08
  0                                  23             40
  1                                  30             58
  3                                  9              40

* Plus-minus values are means ± SD. SENIC denotes Study on the Efficacy of Nosocomial Infection Control, and NNISS National Nosocomial Infection Surveillance System.


Among smokers, the number of cigarettes smoked per day was similar in the two groups (22±20 in the hypothermia group vs. 22±14 in the normothermia group). The morphometric characteristics, anesthetic care, and SENIC and NNISS scores of smokers and nonsmokers were not significantly different. Nonetheless, the proportion of patients with wound infection was significantly higher among smokers (23 percent, or 14 of 62) than among nonsmokers (7 percent, or 10 of 138; P=0.004). Furthermore, the length of hospitalization was significantly greater among smokers (14.9±6.7 days, vs. 12.9±5.0 days among nonsmokers; P=0.02) (Table 4).

DISCUSSION

The initial hours after bacterial contamination are a decisive period for the establishment of infection.[25] In surgical patients, perioperative factors can contribute to surgical-wound infections, but the infection itself is usually not manifest until days later.

In our study, forced-air warming combined with fluid warming maintained normothermia in the treated patients, whereas the unwarmed patients had core temperatures approximately 2°C below normal.[8] Perioperative hypothermia persisted for more than four hours and thus included the decisive period for establishing an infection.[25,30] The patients with mild perioperative hypothermia had three times as many culture-positive surgical-wound infections as the normothermic patients. Moreover, the ASEPSIS scores showed that in the patients assigned to hypothermia the reduction in resistance to infection was twice that in the normothermia group.

The types of bacteria cultured from our patients' surgical wounds were similar to those reported previously.[2,3] These organisms are susceptible to oxidative killing, which is consistent with our hypothesis that hypothermia inhibits the oxidative killing of bacteria.[31] The overall incidence of infection in our study was approximately 35 percent higher than in previous reports.[3] One explanation for this relatively high incidence is that we considered all wounds draining pus that yielded a positive culture to be infected, although some may have been of minor clinical importance. The hospitalizations of infected patients were one week longer than those of patients without surgical-wound infections, however, indicating that most infections were substantial. Similar prolongation of hospitalization has been reported previously.[1,2]

It is interesting to note that hospitalization was also prolonged (by about two days) in the uninfected patients in the hypothermia group (Table 2). A number of factors influenced the decision to discharge patients, but healing of the incision (formation of a "healing ridge," for example) was among the most important. As is consistent with a delay in clinical healing, sutures were removed significantly later and the deposition of collagen (an index of scar formation and the strength of the healing wound) was significantly less in the hypothermia group than in the normothermia group. That the patients assigned to hypothermia required significantly more time before they could tolerate solid food is also consistent with impaired healing.

In Austria's medical system, administrative factors and costs of hospitalization do not influence the length of stay in the hospital. No data on individual costs are tabulated by the participating hospitals, and they are therefore not available for our patients. Nonetheless, the cost of a prolonged hospitalization must exceed the cost of fluid and forced-air warming (approximately $30 in the United States). In a managed-care situation, the duration of hospitalization might have differed less, or not at all. However, our data suggest that patients kept at normal temperatures during surgery would be better prepared for discharge at a fixed time than those allowed to become hypothermic.

Mild hypothermia can increase blood loss and the need for transfusion during surgery.[38] In vitro studies suggest that perioperative hypothermia may aggravate surgical bleeding by impairing the function of platelets and the activity of clotting factors.[39,40] Blood transfusions may increase susceptibility to surgical-wound infections by impairing immune function.[41] Our patients assigned to hypothermia required significantly more allogeneic blood to maintain postoperative hemoglobin concentrations than did the patients assigned to normothermia. However, we administered only leukocyte-depleted blood, and multivariate regression analysis indicated that a requirement for transfusion did not independently contribute to the incidence of wound infection. It is thus unlikely that the differences in the incidence of infection in the two groups we studied resulted from transfusion-mediated immunosuppression.

Among all 200 patients in our study, those who smoked had three times more surgical-wound infections and significantly longer hospitalizations than the nonsmokers. Similar data have been reported previously.[32] Numerous factors contributed to these results; one may have been that smoking markedly lowers oxygen tension in tissue for nearly an hour after each cigarette.[33] (Thermoregulatory vasoconstriction produces a similar reduction.[34]) The distribution of factors known to influence infection was similar between smokers and nonsmokers, but the smokers may have had other behavioral or physiologic factors predisposing them to infection. The prevalence of smoking was similar in the two study groups.

Other factors may have influenced the patients' susceptibility to wound infections, such as arterial hypoxemia, hypovolemia, the concentration of the anesthetic used, and vasoconstriction resulting from pain-induced stress.[25,26,35,36] However, the administration of oxygen, oxyhemoglobin saturation, fluid balance, hemodynamic responses, end-tidal concentrations of anesthetic, pain scores, and quantities of opioid administered were all similar between the two groups. These factors are therefore not likely to have confounded our results. It is also unlikely that exaggerated bacterial growth aggravated the infections in the hypothermia group, because small reductions in temperature actually decrease growth in vitro.[37]

In summary, this double-blind, randomized study indicates that intraoperative core temperatures approximately 2°C below normal triple the incidence of wound infection and prolong hospitalization by about 20 percent. Maintaining intraoperative normothermia is thus likely to decrease infectious complications and shorten hospitalization in patients undergoing colorectal surgery.

We are indebted to Heinz Scheuenstahl for the collagen-deposition analysis; to Helene Ortmann, M.D., Andrea Hubacek, M.D., Michael Zimpfer, M.D., and Gerhard Pavecic for their generous assistance; and to Mallinckrodt Anesthesiology Products, Inc., for the donation of thermometers and thermocouples.

APPENDIX

The following investigators also participated in this study: patient safety and data auditing: H.W. Hopf and T.K. Hunt (University of California, San Francisco); site directors: G. Polak (Hospital Rudolfstiftung, Vienna, Austria) and W. Kroll (University of Graz, Graz, Austria); patient care: E. Lackner and R. Fuegger (University of Vienna); data acquisition: E. Narzt (University of Vienna), C. Wol~b (University of Vienna), E. Marker (University of Vienna), A. Bekar (Orthopedic Hospital, Speising, Vienna), H. Kaloud (University of Graz), U. Stratil (Hospital Rudolfstiftung), and R. Csepan (University of Vienna); wound evaluation: V. Goll (University of Vienna), G.S. Bayer (University of Vienna), and P. Steindorfer (University of Graz); and data management: B. Petschnigg (University of Vienna).


REFERENCES

1. Bremmelgaard A, ... Computer-aided surveillance of surgical infections and identification of risk factors. J Hosp Infect 1989;13:1-18.
2. Haley RW, ... Identifying patients at high risk of surgical wound infection: a simple multivariate index of patient susceptibility and wound contamination. Am J Epidemiol 1985;121:206-15.
3. Culver DH, ... Surgical wound infection rates by wound class, operative procedure, and patient risk index. Am J Med 1991;91:152S-157S.
4. Frank SM, ... Epidural versus general anesthesia, ambient operating room temperature, and patient age as predictors of inadvertent hypothermia. Anesthesiology 1992;77:252-7.
5. Matsukawa T, ... Propofol linearly reduces the vasoconstriction and shivering thresholds. Anesthesiology 1995;82:1169-80.
6. Annadata RS, ... Desflurane slightly increases the sweating threshold but produces marked, nonlinear decreases in the vasoconstriction and shivering thresholds. Anesthesiology 1995;83:1205-11.
7. Matsukawa T, ... Heat flow and distribution during induction of general anesthesia. Anesthesiology 1995;82:662-73.
8. Kurz A, ... Forced-air warming maintains intraoperative normothermia better than circulating-water mattresses. Anesth Analg 1993;77:89-95.
9. Ozaki M, ... Nitrous oxide decreases the threshold for vasoconstriction less than sevoflurane or isoflurane. Anesth Analg 1995;80:1212-6.
10. Sessler DI, ... Physiologic responses to mild perianesthetic hypothermia in humans. Anesthesiology 1991;75:594-610.
11. Chang N, ... Comparison of the effect of bacterial inoculation in musculocutaneous and random-pattern flaps. Plast Reconstr Surg 1982;70:1-10.
12. Jonsson K, ... Oxygen as an isolated variable influences resistance to infection. Ann Surg 1988;208:783-7.
13. Hohn DC, ... The effect of O2 tension on microbicidal function of leucocytes in wound and in vitro. Surg Forum 1976;27:18-20.
14. Mader JT, ... A mechanism for the amelioration by hyperbaric oxygen of experimental staphylococcal osteomyelitis in rabbits. J Infect Dis 1980;142:915-22.
15. van Oss CJ, ... Effect of temperature on the chemotaxis, phagocytic engulfment, digestion and O2 consumption of human polymorphonuclear leucocytes. J Reticuloendothel Soc 1980;27:561-5.
16. Leijh PC, ... Kinetics of phagocytosis of Staphylococcus aureus and Escherichia coli by human granulocytes. Immunology 1979;37:453-65.
17. Sheffield CW, ... Mild hypothermia during isoflurane anesthesia decreases resistance to E. coli dermal infection in guinea pigs. Acta Anaesthesiol Scand 1994;38:201-5.
18. Sheffield CW, ... Mild hypothermia during halothane-induced anesthesia decreases resistance to Staphylococcus aureus dermal infection in guinea pigs. Wound Repair Regen 1994;2:48-56.
19. Prockop DJ, ... The biosynthesis of collagen and its disorders. N Engl J Med 1979;301:13-23.
20. De Jong L, ... Stoicheiometry and kinetics of the prolyl 4-hydroxylase partial reaction. Biochim Biophys Acta 1984;787:105-11.
21. Hunt TK, ... The effect of varying ambient oxygen tensions on wound metabolism and collagen synthesis. Surg Gynecol Obstet 1972;135:561-7.
22. Jonsson K, ... Tissue oxygenation, anemia, and perfusion in relation to wound healing in surgical patients. Ann Surg 1991;214:605-13.
23. Frank SM, ... The catecholamine, cortisol, and hemodynamic responses to mild perioperative hypothermia: a randomized clinical trial. Anesthesiology 1995;82:83-93.
24. Fleming TR, ... Designs for group sequential tests. Control Clin Trials 1984;5:348-61.
25. Miles AA, ... The value and duration of defense reactions of the skin to the primary lodgement of bacteria. Br J Exp Pathol 1957;38:79-96.
26. Jonsson K, ... Assessment of perfusion in postoperative patients using tissue oxygen measurements. Br J Surg 1987;74:263-7.
27. Rubinstein EH, ... Skin-surface temperature gradients correlate with fingertip blood flow in humans. Anesthesiology 1990;73:541-5.
28. Byrne DJ, ... Postoperative wound scoring. Biomed Pharmacother 1989;43:669-73.
29. Rabkin JM, ... Wound healing assessment by radioisotope incubation of tissue samples in PTFE tubes. Surg Forum 1986;37:592-4.
30. Classen DC, ... The timing of prophylactic administration of antibiotics and the risk of surgical wound infection. N Engl J Med 1992;326:281-6.
31. Babior BM, ... Chronic granulomatous diseases. Semin Hematol 1990;27:247-59.
32. Stopinski J, ... L'abus de nicotine et d'alcool ont-ils une influence sur l'apparition des infections bactériennes post-opératoires? [Do nicotine and alcohol abuse influence the occurrence of postoperative bacterial infections?] J Chir 1993;130:422-5.
33. Jensen JA, ... Cigarette smoking decreases tissue oxygen. Arch Surg 1991;126:1131-4.
34. Sheffield CW, ... Effect of thermoregulatory responses on subcutaneous oxygen tension. Wound Repair Regen (in press).
35. Knighton DR, ... Oxygen as an antibiotic: the effect of inspired oxygen on infection. Arch Surg 1984;119:199-204.
36. Moudgil GC, ... Influence of anaesthesia and surgery on neutrophil chemotaxis. Can Anaesth Soc J 1981;28:232-8.
37. Mackowiak PA. Direct effects of hyperthermia on pathogenic microorganisms: teleologic implications with regard to fever. Rev Infect Dis 1981;3:508-20.
38. Schmied H, ... Mild hypothermia increases blood loss and transfusion requirements during total hip arthroplasty. Lancet 1996;347:289-92.
39. Michelson AD, ... Reversible inhibition of human platelet activation by hypothermia in vivo and in vitro. Thromb Haemost 1994;71:633-40.
40. Rohrer MJ, ... Effect of hypothermia on the coagulation cascade. Crit Care Med 1992;20:1402-5.
41. Jensen LS, ... Postoperative infection and natural killer cell function following blood transfusion in patients undergoing elective colorectal surgery. Br J Surg 1992;79:513-6.

Page 107: M&M Ch 6 Introduction to Inference OVERVIEW

M&M Ch 3.1 & 3.2 Designing Experiments ... page 1

On Experimental Design

I constructed four miniature houses of worship

a Mohammedan mosque,
a Hindu temple,
a Jewish synagogue,
a Christian cathedral

and placed them in a row.

I then marked 15 ants with red paint and turned them loose. They made several trips to and fro, glancing in at the places of worship, but not entering. I then turned loose 15 more painted blue; they acted just as the red ones had done. I now gilded 15 and turned them loose. No change in the result; the 45 travelled back and forth in a hurry persistently and continuously visiting each fane, but never entering.

This satisfied me that these ants were without religious prejudices--just what I wished; for under no other conditions would my next and greater experiment be valuable.

I now placed a small square of white paper within the door of each fane;

upon the mosque paper I put a pinch of putty,
upon the temple paper a dab of tar,
upon the synagogue paper a trifle of turpentine,
upon the cathedral paper a small cube of sugar.

First I liberated the red ants. They examined and rejected the putty, the tar and the turpentine, and then took to the sugar with zeal and apparent sincere conviction.

I next liberated the blue ants, and they did exactly as the red ones had done.

The gilded ants followed. The preceding results were precisely repeated.

This seemed to prove that ants destitute of religious prejudice will always prefer Christianity to any other creed.

However, to make sure, I removed the ants and put putty in the cathedral and sugar in the mosque. I now liberated the ants in a body, and they rushed tumultuously to the cathedral.

I was very much touched and gratified, and went back in the room to write down the event.

But when I came back the ants had all apostatized and had gone over to the Mohammedan communion.

I saw that I had been too hasty in my conclusions, and naturally felt rebuked and humbled. With diminished confidence I went on with the test to the finish. I placed the sugar first in one house of worship then in another, till I had tried them all.

With this result: whatever Church I put the sugar in, that was the one the ants straightway joined.

This was true beyond a shadow of doubt: in religious matters the ant is the opposite of man, for man cares for but one thing, to find the only true Church; whereas the ant hunts for the one with the sugar in it.

from Mark Twain, "On Experimental Design" in Scott W.K. and L.L. Cummings, Readings in Organizational Behavior and Human Performance, Irwin: Homewood, Ill., p. 2, (1973).
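
The design lesson in Twain's story: in the first layout, "church" and "sugar" are perfectly confounded, so no amount of data can distinguish creed from bait; only rotating the sugar through every fane breaks the alias. A minimal sketch of both designs in Python (not part of the coursepack or of Twain; the function names, the 45 ants, and the 95% "follow the sugar" rate are invented for illustration):

    # Hypothetical simulation of Twain's ant "experiment".
    # Each simulated ant follows the sugar with high probability,
    # regardless of which church holds it -- creed plays no role.
    import random

    CHURCHES = ["mosque", "temple", "synagogue", "cathedral"]

    def run_trial(sugar_in, n_ants=45, p_follow=0.95):
        # Count where the ants end up: an ant joins the church holding
        # the sugar with probability p_follow, otherwise wanders into a
        # church at random.
        counts = {c: 0 for c in CHURCHES}
        for _ in range(n_ants):
            church = sugar_in if random.random() < p_follow else random.choice(CHURCHES)
            counts[church] += 1
        return counts

    random.seed(1)

    # Design 1 (confounded): the sugar sits only in the cathedral, so
    # "church" and "sugar" are perfectly aliased in the resulting data.
    print("confounded:", run_trial("cathedral"))

    # Design 2 (Twain's fix): rotate the sugar through every fane, so
    # the treatment varies independently of the site.
    for church in CHURCHES:
        print("sugar in", church, "->", run_trial(church))

Under the rotation, the winning fane tracks the sugar and not the creed -- the same conclusion Twain's ants reached.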