IE241: Introduction to Hypothesis Testing


Page 1: IE241:  Introduction to Hypothesis Testing

IE241: Introduction to Hypothesis Testing

Page 2: IE241:  Introduction to Hypothesis Testing

Topic                                                      Slide
Hypothesis testing ............................................ 3
Light bulb example ............................................ 4
Null and alternative hypotheses ............................... 5
Two types of error ............................................ 8
Decision rule ................................................ 11
  test statistic ............................................. 11
  critical region ............................................ 12
  power of the test .......................................... 17
Simple hypothesis testing .................................... 18
  Neyman-Pearson lemma ....................................... 19
  example .................................................... 21
Composite hypothesis testing ................................. 26
  example .................................................... 29
  likelihood ratio test ...................................... 34
    relationship to mean ..................................... 38
Examples of 1-sided composite hypotheses
  drug to help sleep ......................................... 42
  civil service exam ......................................... 44
    difference between two proportions ....................... 46
    effect of size of n ...................................... 51
  fertilizer to improve yield of corn ........................ 55
    test of two variances .................................... 59
    F distribution ........................................... 61
Bayes’ likelihood ratio test ................................. 66
  example .................................................... 67

Page 3: IE241:  Introduction to Hypothesis Testing

We said before that estimation of parameters was one of the two major areas of statistics. Now let’s turn to the second major area of statistics, hypothesis testing.

What is a statistical hypothesis? A statistical hypothesis is an assumption about the distribution of a random variable X: about its density f(x) if X is continuous or its probability function p(x) if X is discrete.

A test of a statistical hypothesis is a procedure for deciding whether or not to reject the hypothesis.

Page 4: IE241:  Introduction to Hypothesis Testing

Let’s look at an example.

A buyer of light bulbs bought 50 bulbs of each of two brands. When he tested them, Brand A had an average life of 1208 hours with a standard deviation of 94 hours. Brand B had a mean life of 1282 hours with a standard deviation of 80 hours. Are brands A and B really different in quality?

Page 5: IE241:  Introduction to Hypothesis Testing

We set up two hypotheses.

The first, called the null hypothesis Ho, is the hypothesis of no difference.

Ho: μA = μB

The second, called the alternative hypothesis Ha, is the hypothesis that there is a difference.

Ha: μA ≠ μB

Page 6: IE241:  Introduction to Hypothesis Testing

On the basis of the sample of 50 from each of the two populations of light bulbs, we shall either reject or not reject the hypothesis of no difference.

In statistics, we always test the null hypothesis. The alternative hypothesis is the default winner if the null hypothesis is rejected.

Page 7: IE241:  Introduction to Hypothesis Testing

We never really accept the null hypothesis; we simply fail to reject it on the basis of the evidence in hand.

Now we need a procedure to test the null hypothesis. A test of a statistical hypothesis is a procedure for deciding whether or not to reject the null hypothesis.

There are two possible decisions, reject or not reject. This means there are also two kinds of error we could make.

Page 8: IE241:  Introduction to Hypothesis Testing

The two types of error are shown in the table below.

                                 True state
Decision                Ho true              Ho false
Reject Ho               Type 1 error (α)     Correct decision
Do not reject Ho        Correct decision     Type 2 error (β)

Page 9: IE241:  Introduction to Hypothesis Testing

If we reject Ho when Ho is in fact true, then we make a Type 1 error. The probability of a Type 1 error is α.

If we do not reject Ho when Ho is really false, then we make a Type 2 error. The probability of a Type 2 error is β.

Page 10: IE241:  Introduction to Hypothesis Testing

Now we need a decision rule that will make the probability of the two types of error very small. The problem is that the rule cannot make both of them small simultaneously.

Because in science we have to take the conservative route and never claim that we have found a new result unless we are really convinced that it is true, we choose a very small α, the probability of type 1 error.

Page 11: IE241:  Introduction to Hypothesis Testing

Then among all possible decision rules given α, we choose the one that makes β as small as possible.

The decision rule consists of a test statistic and a critical region where the test statistic may fall. For means from a normal population, the test statistic is

$$ t = \frac{\bar X_A - \bar X_B}{s_{\bar X_A - \bar X_B}}, \qquad s_{\bar X_A - \bar X_B} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} $$

where the denominator is the standard deviation of the difference between two independent means.

Page 12: IE241:  Introduction to Hypothesis Testing

The critical region is a tail of the distribution of the test statistic. If the test statistic falls in the critical region, Ho is rejected.

Now, how much of the tail should be in the critical region? That depends on just how small you want α to be. The usual choice is α = .05, but in some very critical cases, α is set at .01.

Here we have just a non-critical choice of light bulbs, so we’ll choose α = .05. This means that the critical region has probability = .025 in each tail of the t distribution.

Page 13: IE241:  Introduction to Hypothesis Testing

For a t distribution with .025 in each tail, the critical value of t = 1.96, the same as z because the sample size is greater than 30. The critical region then is |t |> 1.96.

In our light bulb example, the test statistic is

$$ t = \frac{1282 - 1208}{\sqrt{\dfrac{80^2}{50} + \dfrac{94^2}{50}}} = \frac{74}{17.45} = 4.23 $$
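As a numerical check, here is a minimal Python sketch of the same computation (the use of scipy for the critical value is an addition, not part of the slides):

```python
import math
from scipy import stats

# Brand A: n = 50, mean 1208 h, s = 94 h; Brand B: n = 50, mean 1282 h, s = 80 h
n_a, mean_a, s_a = 50, 1208.0, 94.0
n_b, mean_b, s_b = 50, 1282.0, 80.0

# Standard deviation of the difference between two independent means
s_diff = math.sqrt(s_a**2 / n_a + s_b**2 / n_b)   # ~17.45

t = (mean_b - mean_a) / s_diff                    # 74 / 17.45 ~ 4.23

# Two-sided critical value at alpha = .05; with n > 30 this is ~1.96
crit = stats.norm.ppf(1 - 0.05 / 2)

print(f"t = {t:.2f}, critical value = {crit:.2f}, reject Ho: {abs(t) > crit}")
```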

Page 14: IE241:  Introduction to Hypothesis Testing

Now 4.23 is much greater than 1.96 so we reject the null hypothesis of no difference and declare that the average life of the B bulbs is longer than that of the A bulbs.

Because α = .05, we have 95% confidence in the decision we made.

Page 15: IE241:  Introduction to Hypothesis Testing

We cannot say that there is a 95% probability that we are right because we are either right or wrong and we don’t know which.

But there is such a small probability that t will land in the critical region if Ho is true that if it does get there, we choose to believe that Ho is not true.

If we had chosen α = .01, the critical value of t would be 2.58 and because 4.23 is greater than 2.58, we would still reject Ho. This time it would be with 99% confidence.

Page 16: IE241:  Introduction to Hypothesis Testing

How do we know that the test we used is the best test possible?

We have controlled the probability of Type 1 error. But what is the probability of Type 2 error in this test? Does this test minimize it subject to the value of α?

Page 17: IE241:  Introduction to Hypothesis Testing

To answer this question, we need to consider the concept of test power. The power of a statistical test is the probability of rejecting Ho when Ho is really false. Thus power = 1 − β.

Clearly if the test maximizes power, it minimizes the probability of Type 2 error β. If a test maximizes power for given α, it is called an admissible testing strategy.

Page 18: IE241:  Introduction to Hypothesis Testing

Before going further, we need to distinguish between two types of hypotheses.

A simple hypothesis is one where the value of the parameter under Ho is a specified constant and the value of the parameter under Ha is a different specified constant.

For example, if you test

Ho: μ = 0 vs Ha: μ = 10

then you have a simple hypothesis test.

Here you have a particular value for Ho and a different particular value for Ha.

Page 19: IE241:  Introduction to Hypothesis Testing

For testing one simple hypothesis Ha against the simple hypothesis Ho, a ground-breaking result called the Neyman-Pearson lemma provides the most powerful test.

λ is a likelihood ratio with the Ha parameter MLE in the numerator and the Ho parameter MLE in the denominator. Clearly, any value of λ > 1 would favor the alternative hypothesis, while values less than 1 would favor the null hypothesis.

$$ \lambda = \frac{L(\hat\theta_a)}{L(\hat\theta_0)} $$

Page 20: IE241:  Introduction to Hypothesis Testing

Basically, this likelihood ratio says that if there exists a critical region A of size α and a constant k such that

$$ \lambda = \frac{L_a}{L_o} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_o)} \ge k \quad \text{inside } A $$

and

$$ \lambda = \frac{L_a}{L_o} = \frac{\prod_{i=1}^{n} f(x_i;\theta_a)}{\prod_{i=1}^{n} f(x_i;\theta_o)} \le k \quad \text{outside } A $$

then A is a best (most powerful) critical region of size α.

Page 21: IE241:  Introduction to Hypothesis Testing

Consider the following example of a test of two simple hypotheses.

A coin is either fair or has P(H) = 2/3. Under Ho, P(H) = 1/2 and under Ha, P(H) = 2/3.

The coin will be tossed 3 times and a decision will be made between the two hypotheses. Thus X = number of heads = 0, 1, 2, or 3. Now let’s look at how the decision will be made.

Page 22: IE241:  Introduction to Hypothesis Testing

First, let’s look at the probability of Type 1 error α. In the table below, Ho ⇒ P(H) = 1/2 and Ha ⇒ P(H) = 2/3.

Now what should the critical region be?

X    P(X|Ho)    P(X|Ha)
0     1/8        1/27
1     3/8        6/27
2     3/8       12/27
3     1/8        8/27

Page 23: IE241:  Introduction to Hypothesis Testing

Under Ho, if X = 0, α = 1/8. Under Ho, if X = 3, α = 1/8. So if either of these two values is chosen as the critical region, the probability of Type 1 error would be the same.

Now what if Ha is true? If X = 0 is chosen as the critical region, the value of β = 26/27 because that is the probability that X ≠ 0.

On the other hand, if X = 3 is chosen as the critical region, the value of β = 19/27 because that is the probability that X ≠ 3.

Clearly, the better choice for the critical region is X=3 because that is the region that minimizes β for fixed α. So this critical region provides the more powerful test.
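A short sketch of this comparison, using scipy’s binomial pmf to reproduce the table and the two error probabilities (the variable names are illustrative):

```python
from scipy.stats import binom

n = 3
p0, pa = 1/2, 2/3          # Ho: fair coin; Ha: P(H) = 2/3

# Reproduce the probability table for X = number of heads
for x in range(n + 1):
    print(x, binom.pmf(x, n, p0), binom.pmf(x, n, pa))

# Critical region {X = 0}: alpha = 1/8, beta = P(X != 0 | Ha) = 26/27
alpha0, beta0 = binom.pmf(0, n, p0), 1 - binom.pmf(0, n, pa)

# Critical region {X = 3}: alpha = 1/8, beta = P(X != 3 | Ha) = 19/27
alpha3, beta3 = binom.pmf(3, n, p0), 1 - binom.pmf(3, n, pa)

print(alpha0, beta0)   # 0.125  0.963
print(alpha3, beta3)   # 0.125  0.704  -> {X = 3} minimizes beta for fixed alpha
```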

Page 24: IE241:  Introduction to Hypothesis Testing

In discrete variable problems like this, it may not be possible to choose a critical region of the desired α. In this illustration, you simply cannot find a critical region where α = .05 or .01.

This is seldom a problem in real-life experimentation because n is usually sufficiently large so that there is a wide variety of choices for critical regions.

Page 25: IE241:  Introduction to Hypothesis Testing

This problem, used to illustrate the general method for selecting the best test, was easy to discuss because there was only a single alternative to Ho.

Most problems involve more than a single alternative. Such hypotheses are called composite hypotheses.

Page 26: IE241:  Introduction to Hypothesis Testing

Examples of composite hypotheses:

Ho: μ = 0 vs Ha: μ ≠ 0

which is a two-sided Ha.

A one-sided Ha can be written as

Ho: μ = 0 vs Ha: μ > 0 or Ho: μ = 0 vs Ha: μ < 0

All of these hypotheses are composite because they include more than one value for Ha. And unfortunately, the size of β here depends on the particular alternative value of μ being considered.

Page 27: IE241:  Introduction to Hypothesis Testing

In the composite case, it is necessary to compare Type 2 errors for all possible alternative values under Ha. So now the size of Type 2 error is a function of the alternative parameter value θ.

So β(θ) is the probability that the sample point will fall in the noncritical region when θ is the true value of the parameter.

Page 28: IE241:  Introduction to Hypothesis Testing

Because it is more convenient to work with the critical region, the power function 1-β(θ) is usually used.

The power function is the probability that the sample point will fall in the critical region when θ is the true value of the parameter.

As an illustration of these points, consider the following continuous example.

Page 29: IE241:  Introduction to Hypothesis Testing

Let X = the time that elapses between two successive trippings of a Geiger counter in studying cosmic radiation. The density function is

$$ f(x;\theta) = \theta e^{-\theta x} $$

where θ is a parameter which depends on experimental conditions.

Under Ho, θ = 2. Now a physicist believes that θ < 2. So under Ha, θ < 2.

Page 30: IE241:  Introduction to Hypothesis Testing

Now one choice for the critical region is the right tail of the distribution, X ≥ 1, for which α = .135. That is,

$$ \alpha = \int_1^\infty 2e^{-2x}\,dx = .135 $$

Another choice is the left tail, X ≤ .07, for which α is also .135:

$$ \alpha = \int_0^{.07} 2e^{-2x}\,dx = .135 $$

Now let’s examine the power for the two competing critical regions.

Page 31: IE241:  Introduction to Hypothesis Testing

For the right-tail critical region X > 1,

$$ 1 - \beta(\theta) = \int_1^\infty \theta e^{-\theta x}\,dx = e^{-\theta} $$

and for the left-tail critical region X < .07,

$$ 1 - \beta(\theta) = \int_0^{.07} \theta e^{-\theta x}\,dx = 1 - e^{-.07\theta} $$

The graphs of these two functions are called the power curves for the two critical regions.

Page 32: IE241:  Introduction to Hypothesis Testing

These two power functions are plotted below. Note that the power function for the X > 1 region is always higher than the power function for the X < .07 region before they cross at θ = 2. Since the alternative θ values in the problem are all θ < 2, clearly the region X > 1 is superior.

[Figure: Power functions for the two critical regions X > 1 and X < .07, plotted as power vs. θ over 0 ≤ θ ≤ 4.]
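A minimal sketch that recreates the power curves from the two formulas above (numpy and matplotlib are assumed here; they are not part of the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.01, 4, 400)

power_right = np.exp(-theta)             # critical region X > 1
power_left = 1 - np.exp(-0.07 * theta)   # critical region X < .07

plt.plot(theta, power_right, label="critical region X > 1")
plt.plot(theta, power_left, label="critical region X < .07")
plt.axvline(2, linestyle="--")           # the curves cross near theta = 2
plt.xlabel("Theta")
plt.ylabel("Power")
plt.title("Power functions for two critical regions")
plt.legend()
plt.show()
```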

Page 33: IE241:  Introduction to Hypothesis Testing

Unfortunately, with two-sided composite alternative hypotheses, there is no best test that covers all alternative values.

Clearly, if the alternative were θa < θo, the left tail would be best, and if the alternative were θa > θo, the right tail would be best.

This shows that best critical regions exist only if the alternative hypothesis is suitably restricted.

Page 34: IE241:  Introduction to Hypothesis Testing

So for composite hypotheses, a new principle needs to be introduced to find a good test. This principle is called a likelihood ratio test:

$$ \lambda = \frac{L(\hat\theta_0)}{L(\hat\theta)} $$

where the denominator is the maximum of the likelihood function with respect to all the parameters, and the numerator is the maximum of the likelihood function after some or all of the parameters have been restricted by Ho.

Page 35: IE241:  Introduction to Hypothesis Testing

Consequently, the numerator can never exceed the denominator, so λ can assume values only between 0 and 1.

A value of λ close to 1 lends support to Ho because then it is clear that allowing the parameters to assume values other than those possible under Ho would not increase the likelihood of the sample values very much, if at all.

If, however, λ is close to 0, then the probability of the sample values of X is very low under Ho, and Ho is therefore not supported by the data.

Page 36: IE241:  Introduction to Hypothesis Testing

Because increasing values of λ correspond to increasing degrees of belief in Ho, λ may serve as a statistic for testing Ho, with small values leading to rejection of Ho.

Now the MLEs are functions of the values of the random variable X, so λ is also a function of these values of X and is therefore an observable random variable.

λ is often related to a statistic such as X̄ whose distribution is known, so it is not necessary to find the distribution of λ.

Page 37: IE241:  Introduction to Hypothesis Testing

Suppose we have a normal population with σ = 1 and we are interested in testing whether the mean θ = θo. That is, we test Ho: θ = θo vs Ha: θ ≠ θo, where X has density

$$ f(x;\theta) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\theta)^2} $$

Let’s see how we would construct a likelihood ratio test.

Page 38: IE241:  Introduction to Hypothesis Testing

In this case,

$$ L(\theta) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\theta)^2} $$

Since maximizing L(θ) is equivalent to maximizing log L(θ),

$$ \frac{d}{d\theta}\,\log L(\theta) = \sum_{i=1}^{n}(x_i-\theta) $$

so θ̂ = X̄ and therefore

$$ L(\hat\theta) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar X)^2} $$

Page 39: IE241:  Introduction to Hypothesis Testing

Under Ho, there are no parameters to be estimated, so

$$ L(\theta_o) = (2\pi)^{-n/2}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\theta_o)^2} $$

and λ then is

$$ \lambda = \frac{L(\theta_o)}{L(\hat\theta)} = e^{-\frac{1}{2}\left[\sum_{i=1}^{n}(x_i-\theta_o)^2 \,-\, \sum_{i=1}^{n}(x_i-\bar X)^2\right]} = e^{-\frac{n}{2}(\bar X - \theta_o)^2} $$

Page 40: IE241:  Introduction to Hypothesis Testing

This expression shows a relationship between λ and X̄, such that for each value of λ there are two critical values of X̄, symmetrical with respect to X̄ = θo.

So the 5% critical region for λ corresponds to the two 2.5% tails of the normal distribution given by

$$ \frac{|\bar X - \theta_o|}{1/\sqrt{n}} \ge 1.96 $$

So the likelihood ratio test is identical to the t test and serves as a compromise test when no best test is available.
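A small numerical sketch of this equivalence, with invented sample data: λ is a monotone decreasing function of |X̄ − θo|, so rejecting for small λ is the same as rejecting when |X̄ − θo|√n exceeds 1.96:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
theta_0 = 0.0
x = rng.normal(loc=0.5, scale=1.0, size=25)   # sigma = 1, as in the slides

n, xbar = len(x), x.mean()

# From the derivation above: lambda = exp(-(n/2) * (xbar - theta_0)^2)
lam = math.exp(-(n / 2) * (xbar - theta_0) ** 2)

# Equivalent normal test statistic: |xbar - theta_0| / (1 / sqrt(n))
z = abs(xbar - theta_0) * math.sqrt(n)

print(f"lambda = {lam:.4f}, |z| = {z:.2f}, reject Ho: {z > 1.96}")
```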

Page 41: IE241:  Introduction to Hypothesis Testing

It is because of the concept of power that we simply fail to reject the null hypothesis, rather than accept it, when the test statistic does not fall into the rejection region.

The reason is that if we had a more powerful test, we might have been able to reject Ho.

Now let’s look at some examples.

Page 42: IE241:  Introduction to Hypothesis Testing

As an example of a one-sided composite hypothesis test, suppose a new drug is available which claims to produce additional sleep. The drug is tested on 10 patients with the results shown.

We are testing the hypothesis Ho: μ = 0 vs Ha: μ > 0

Patient:        1    2    3    4    5    6    7    8    9   10
Hours gained: 0.7 -1.1 -0.2  1.2  0.1  3.4  3.7  0.8  1.8  2.0

Page 43: IE241:  Introduction to Hypothesis Testing

The mean hours gained = 1.24 and s = 1.45. So the t statistic is

$$ t = \frac{1.24 - 0}{1.45/\sqrt{10}} = 2.7 $$

which has 9 df.

For df = 9 and α = .05 (two-tailed), the required t = 2.262. Since our obtained t is greater than the required t, we can, with 95% confidence, reject Ho.

So in this case, even with only 10 patients, we can endorse the drug for obtaining longer sleep.
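For reference, a minimal scipy version of this one-sided test (the `alternative` keyword assumes scipy 1.6 or later):

```python
from scipy import stats

hours_gained = [0.7, -1.1, -0.2, 1.2, 0.1, 3.4, 3.7, 0.8, 1.8, 2.0]

# One-sample t test of Ho: mu = 0 vs Ha: mu > 0
res = stats.ttest_1samp(hours_gained, popmean=0, alternative="greater")
print(res.statistic)   # ~2.70, with 9 df
print(res.pvalue)
```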

Page 44: IE241:  Introduction to Hypothesis Testing

Now let’s take a second example. A civil service exam is given to a group of 200 candidates. Based on their total scores, the 200 candidates are divided into two groups, the top 30% and the bottom 70%.

Now consider the first question in the examination. In the upper 30% group, 40 had the right answer. In the lower 70% group, 80 had the right answer. Is the question a good discriminator between the top scorers and the lower scorers?

Page 45: IE241:  Introduction to Hypothesis Testing

To answer this question, we first set up the two hypotheses.

In this case, the null hypothesis is Ho: pu = pl and the alternative is

Ha: pu > pl because we would expect the upper group to do better than the lower group on all questions.

Page 46: IE241:  Introduction to Hypothesis Testing

In binomial situations, we must deal with proportions instead of counts unless the two sample sizes are the same.

The proportion of successes p = x/n may be assumed to be normally distributed with mean p and variance pq/n if n is large.

Page 47: IE241:  Introduction to Hypothesis Testing

Then the difference between two sample proportions may also be approximately normally distributed if n is large.

In this situation, μ(p₁−p₂) = p₁ − p₂ and

$$ \sigma^2_{p_1-p_2} = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} $$

Just as for the binomial distribution, the normal approximation will be satisfactory if each nᵢpᵢ exceeds 5 when p ≤ ½ and each nᵢqᵢ exceeds 5 when p > ½.

Page 48: IE241:  Introduction to Hypothesis Testing

The test statistic is

$$ t = \frac{p_u - p_l}{\sqrt{\dfrac{pq}{n_u} + \dfrac{pq}{n_l}}} $$

We need the common estimate of p under Ho to use in the denominator, so we use the estimate for the entire group.

So p = 120/200 = 3/5 = .6 and q = .4. The p for the upper group = 40/60 ≈ .67. The p for the lower group = 80/140 ≈ .57.

Page 49: IE241:  Introduction to Hypothesis Testing

So inserting our values into the test statistic, we get

$$ t = \frac{.67 - .57}{\sqrt{\dfrac{(.6)(.4)}{60} + \dfrac{(.6)(.4)}{140}}} = \frac{.10}{.076} = 1.32 $$

Our critical region is t > 1.96 because we have set α = .05 as the critical value. Because of the large sample size, t.05 = z.05.

Page 50: IE241:  Introduction to Hypothesis Testing

Because the obtained t = 1.32 is lower than the required t = 1.96, the data do not allow us to reject the null hypothesis.

So, given the data, we conclude that the first question is not a good one for distinguishing between the upper scorers and the lower scorers on the entire test.

Page 51: IE241:  Introduction to Hypothesis Testing

Now let’s look at our test problem again. Suppose instead of 200 candidates we tested 500, but kept everything else in the problem the same. Now

$$ t = \frac{.67 - .57}{\sqrt{\dfrac{(.6)(.4)}{150} + \dfrac{(.6)(.4)}{350}}} = \frac{.10}{.0478} = 2.092 $$

Now we will reject Ho because t = 2.092 is greater than 1.96, the critical value of t.
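A small sketch that reproduces both computations, for n = 200 and n = 500 (the helper function and its name are illustrative, and the .67/.57 inputs are the rounded values used in the slides):

```python
import math

def pooled_prop_z(p_u, p_l, n_u, n_l, p):
    """z statistic for a difference of two proportions, pooled p under Ho."""
    q = 1 - p
    se = math.sqrt(p * q / n_u + p * q / n_l)
    return (p_u - p_l) / se

# 200 candidates: upper group 60, lower group 140, pooled p = 120/200 = .6
print(pooled_prop_z(0.67, 0.57, 60, 140, 0.6))    # ~1.32 -> cannot reject

# 500 candidates, same proportions: upper 150, lower 350
print(pooled_prop_z(0.67, 0.57, 150, 350, 0.6))   # ~2.09 -> reject
```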

Page 52: IE241:  Introduction to Hypothesis Testing

This is why we never accept Ho, but only fail to reject it with the evidence in hand. It is always possible that a more powerful test will provide evidence to reject Ho.

But this leads to another question. If, theoretically, we can always keep increasing sample size, then eventually we will always be able to reject Ho. So why do the test to begin with?

Page 53: IE241:  Introduction to Hypothesis Testing

The reality is that you can’t keep increasing n in the real world because there are constraints on time, money, and manpower that prevent having n so large that rejection of Ho is a foregone conclusion.

We usually have to get by with the n we have available.

Page 54: IE241:  Introduction to Hypothesis Testing

Furthermore, even if we could get a larger sample size, there is no guarantee that everything else will remain the same.

The mean difference in the numerator could change. So could the variance estimates in the denominator.

So we do the test because there is no other choice.

Page 55: IE241:  Introduction to Hypothesis Testing

As another example, consider the application of a fertilizer to plots of farm ground and the effect it has on the yield of corn in bushels. The data are

The average yield for the treated plots = 6.0, with s2 = 0.0711. The average yield for the untreated plots = 5.7 with s2 =0.0267.

Treated:   6.2 5.7 6.5 6.0 6.3 5.8 5.7 6.0 6.0 5.8
Untreated: 5.6 5.9 5.6 5.7 5.8 5.7 6.0 5.5 5.7 5.5

Page 56: IE241:  Introduction to Hypothesis Testing

The test statistic is

$$ t = \frac{6.0 - 5.7}{\sqrt{\dfrac{0.0711}{10} + \dfrac{0.0267}{10}}} = \frac{.3}{.098883} = 3.0339 $$

So Ho can be rejected because α = .05 and t.05 = 2.101 with 18 df. When you test the difference between two means, df = nA − 1 + nB − 1.

So we can conclude that the fertilizer will help produce more bushels of corn.
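A minimal scipy check of this pooled-variance t test, using the plot data above:

```python
from scipy import stats

treated   = [6.2, 5.7, 6.5, 6.0, 6.3, 5.8, 5.7, 6.0, 6.0, 5.8]
untreated = [5.6, 5.9, 5.6, 5.7, 5.8, 5.7, 6.0, 5.5, 5.7, 5.5]

# Two-sample t test with pooled variance (18 df), as in the slides
res = stats.ttest_ind(treated, untreated, equal_var=True)
print(res.statistic)   # ~3.03
print(res.pvalue)
```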

Page 57: IE241:  Introduction to Hypothesis Testing

Now we can ask how many extra bushels of corn we will get with the fertilizer. The point estimate is .3, but we can find a 95% confidence interval around this estimate.

In the case of a confidence interval for the difference between two means,

$$ (\bar X_A - \bar X_B) \pm t_{.95}\, s_{\bar X_A - \bar X_B} = .3 \pm 2.101(.0989) = .3 \pm .208 = (.092,\ .508) $$
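And a short sketch of the confidence-interval arithmetic (scipy is used only to look up the two-tailed t critical value with 18 df):

```python
import math
from scipy import stats

mean_diff = 6.0 - 5.7
se_diff = math.sqrt(0.0711 / 10 + 0.0267 / 10)   # ~.0989

t_crit = stats.t.ppf(1 - 0.05 / 2, df=18)        # ~2.101

lo, hi = mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff
print(lo, hi)   # ~(.092, .508)
```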

Page 58: IE241:  Introduction to Hypothesis Testing

Because the sample size was only 10 for each group, we can’t say with any degree of confidence that the increase in yield is more than .092, but it may be as much as .508.

One caution about using small samples to test the difference between means is that t assumes equality of the two variances. If the samples are large, this assumption is unnecessary.

Page 59: IE241:  Introduction to Hypothesis Testing

Now how can we know if the two variances are equal? We can test them.

We already know that each variance is distributed as chi-square. Now how can we test to see if two variances are equal?

The answer is the F test.

Page 60: IE241:  Introduction to Hypothesis Testing

The F distribution is a ratio of two chi-square variables. So if s₁² and s₂² possess independent chi-square distributions with v₁ and v₂ df, respectively, then

$$ F = \frac{s_1^2/v_1}{s_2^2/v_2} $$

has the F distribution with v₁ and v₂ df.

Page 61: IE241:  Introduction to Hypothesis Testing

The F distribution is

$$ f(F) = c\, F^{(v_1-2)/2}\,(v_2 + v_1 F)^{-(v_1+v_2)/2} $$

where c is given by

$$ c = \frac{\Gamma\!\left(\tfrac{v_1+v_2}{2}\right)\, v_1^{v_1/2}\, v_2^{v_2/2}}{\Gamma\!\left(\tfrac{v_1}{2}\right)\Gamma\!\left(\tfrac{v_2}{2}\right)} $$

Page 62: IE241:  Introduction to Hypothesis Testing

Now let’s do the test to see if our two variances are equal. In the problem, the two variances are .0711 and .0267. So

$$ F = \frac{s_1^2}{s_2^2} = \frac{.0711}{.0267} = 2.66 $$

and there are 10 − 1 and 10 − 1 df. Is 2.66 greater than would be expected if the two variances were equal? To answer this, we must consult the F distribution.

Page 63: IE241:  Introduction to Hypothesis Testing

Now it turns out that the critical regions in the two tails have critical values that are reciprocals of each other. That is, if

$$ F = \frac{s_1^2}{s_2^2} $$

then

$$ \frac{1}{F} = \frac{s_2^2}{s_1^2} $$

Because of this reciprocal property, the procedure is always to place the larger variance over the smaller.

Then we can refer to the F distribution for the .025 critical region to see if the hypothesis of a common variance is to be rejected.

Page 64: IE241:  Introduction to Hypothesis Testing

For this case, with 9 and 9 df, the critical value of F = 4.025. There is an FINV function in EXCEL to find critical values for the F distribution.

Since the observed value of 2.66 is less than the critical value of 4.025, we cannot reject the null hypothesis of common variance.
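A one-line check of this critical value with scipy (f.ppf plays the role of EXCEL’s FINV here):

```python
from scipy import stats

F = 0.0711 / 0.0267                            # 2.66, larger variance on top
crit = stats.f.ppf(1 - 0.025, dfn=9, dfd=9)    # ~4.03 upper .025 critical value

print(F > crit)     # False -> cannot reject equal variances
print(1 / crit)     # ~.248, the matching lower-tail critical value
```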

Page 65: IE241:  Introduction to Hypothesis Testing

To see the reciprocal property in action, if we had placed the smaller variance over the larger, the observed ratio would be 1/2.66 = .376.

The critical value would be 1/F = 1/4.025 = .2484. In this case, for the left tail, the observed value should be less than the critical value. But here .376 > .2484, so we would not reject Ho.

Page 66: IE241:  Introduction to Hypothesis Testing

Another approach to an admissible test strategy is that developed by Bayes, which turns out to be a likelihood ratio test. Bayes’ formula is used to determine the likelihood of a hypothesis, given an outcome:

$$ P(H_i \mid D) = \frac{P(H_i)\,P(D \mid H_i)}{\sum_{j=1}^{k} P(H_j)\,P(D \mid H_j)} $$

This formula gives the likelihood of Hi given the data you actually got versus the total likelihood of every hypothesis given the data you got. So Bayes’ strategy is a likelihood ratio test.

Page 67: IE241:  Introduction to Hypothesis Testing

Consider an example where there are two identical boxes. Box 1 contains 2 red balls and Box 2 contains 1 red ball and 1 white ball.

Now a box is selected by chance and 1 ball is drawn from it. What is the probability that it was Box 1 that was selected if the ball that was drawn was red?

Let’s test this with Bayes’ formula.

Page 68: IE241:  Introduction to Hypothesis Testing

There are only two hypotheses here, so H1 = Box 1 and H2 = Box 2. The data, of course, = R. So we can find

$$ P(H_1 \mid R) = \frac{P(H_1)P(R \mid H_1)}{P(H_1)P(R \mid H_1) + P(H_2)P(R \mid H_2)} = \frac{(1/2)(1)}{(1/2)(1) + (1/2)(1/2)} = \frac{2}{3} $$

And we can find

$$ P(H_2 \mid R) = \frac{P(H_2)P(R \mid H_2)}{P(H_1)P(R \mid H_1) + P(H_2)P(R \mid H_2)} = \frac{(1/2)(1/2)}{(1/2)(1) + (1/2)(1/2)} = \frac{1}{3} $$

So we can see that the odds of the data favoring Box 1 over Box 2 are 2:1.
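A tiny sketch of the same posterior computation in plain Python (the dictionary labels are illustrative):

```python
# Box 1 holds 2 red balls; Box 2 holds 1 red and 1 white ball.
priors = {"Box1": 0.5, "Box2": 0.5}   # each box chosen by chance
p_red  = {"Box1": 1.0, "Box2": 0.5}   # P(R | box)

total = sum(priors[h] * p_red[h] for h in priors)            # P(R) = 3/4
posterior = {h: priors[h] * p_red[h] / total for h in priors}

print(posterior)   # {'Box1': 0.667, 'Box2': 0.333} -> 2:1 odds for Box 1
```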

Page 69: IE241:  Introduction to Hypothesis Testing

We are twice as likely to be right if we choose Box 1, but there is still some probability that it could be Box 2.

The reason we choose Box 1 is because it is more likely, given the data we have.

This is the whole idea behind likelihood ratio tests. We choose the hypothesis which has the greater likelihood, given the data we have. With other data, we might choose another hypothesis.