Testing Statistical Hypotheses
In statistical hypothesis testing, the basic problem is to decide
whether or not to reject a statement about the distribution of a
random variable.
The statement must be expressible in terms of membership in a
well-defined class.
The hypothesis can therefore be expressed by the statement that
the distribution of the random variable X is in the class
PH = {Pθ : θ ∈ ΘH}.
An hypothesis of this form is called a statistical hypothesis.
Testing is a statistical decision problem.
Issues
• optimality of tests: most powerful
• Neyman-Pearson Fundamental Lemma: the optimal procedure
for testing one simple hypothesis versus another simple hypoth-
esis
• uniformly optimal
∗ impose restrictions, such as unbiasedness or invariance
· find optimal tests under those restrictions
∗ define uniformity in terms of a global averaging
Issues
• general methods for constructing tests
• asymptotic properties of the likelihood ratio tests
• nonparametric tests
• sequential tests
• multiple tests
Statistical Hypotheses
We are given (or assume) a broad family of distributions,
P = {Pθ : θ ∈ Θ}.
As in other problems in statistical inference, the objective is to
decide whether the given observations arose from some subset
of distributions
PH ⊂ P.
The statistical hypothesis is a statement of the form
“the family of distributions is PH”,
where PH ⊂ P,
or perhaps
“θ ∈ ΘH”,
where ΘH ⊂ Θ.
Statistical Hypotheses
The full statement consists of two pieces, one part an assumption,
“assume the distribution of X is in the class P”, and the
other part the hypothesis, “θ ∈ ΘH, where ΘH ⊂ Θ”.
Given the assumptions, and the definition of ΘH, we often denote
the hypothesis as H, and write it as
H : θ ∈ ΘH.
Two Hypotheses
While to reject the hypothesis H would mean to decide that
θ ∉ ΘH, it is generally more convenient to formulate
the testing problem as one of deciding between two statements:
H0 : θ ∈ Θ0
and
H1 : θ ∈ Θ1,
where Θ0 ∩ Θ1 = ∅.
We do not treat H0 and H1 symmetrically; H0 is the hypothesis
(or “null hypothesis”) to be tested and H1 is the alternative.
This distinction is important in developing a methodology of
testing.
Tests of Hypotheses
To test the hypotheses means to choose one hypothesis or the other; that is, to make a decision, d.
We have a sample X from the relevant family of distributions and a statistic T(X).
A nonrandomized test procedure is a rule δ(X) that assigns two decisions to two disjoint subsets, C0 and C1, of the range of T(X).
We equate those two decisions with the real numbers 0 and 1, so δ(X) is a real-valued function,
δ(x) = 0 for T(x) ∈ C0,
δ(x) = 1 for T(x) ∈ C1.
Note for i = 0,1,
Pr(δ(X) = i) = Pr(X ∈ Ci).
We call C1 the critical region, and generally denote it by just C.
If δ(X) takes the value 0, the decision is not to reject; if δ(X)
takes the value 1, the decision is to reject.
If the range of δ(X) is {0,1}, the test is a nonrandomized test.
Sometimes it is useful to choose the range of δ(X) as some other
set of real numbers, such as {d0, d1}, or even a set with cardinality
greater than 2.
If the range is taken to be the closed interval [0,1], we can inter-
pret a value of δ(X) as the probability that the null hypothesis
is rejected.
If it is not the case that δ(X) equals 0 or 1 a.s., we call the test
a randomized test.
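A test procedure of this kind can be sketched in code. This is a minimal illustration under assumed choices (the sample mean as T, a cutoff c = 2, and a boundary randomization probability of 0.3 — none of these come from the text):

```python
import random

def T(x):
    # test statistic: the sample mean (an assumed choice)
    return sum(x) / len(x)

def delta_nonrandomized(x, c=2.0):
    # nonrandomized test: 1 (reject) if T(x) is in C1 = ]c, oo[, else 0
    return 1 if T(x) > c else 0

def delta_randomized(x, c=2.0, gamma=0.3):
    # randomized test: the value in [0,1] is the probability of rejection;
    # here we randomize only on the boundary T(x) = c
    t = T(x)
    if t > c:
        return 1.0
    if t == c:
        return gamma
    return 0.0

def decide(x, rng=random.random):
    # carry out the randomized test with an independent auxiliary draw
    return 1 if rng() < delta_randomized(x) else 0
```

A value of δ strictly between 0 and 1 is realized by the independent auxiliary experiment in `decide`, exactly as in the interpretation above.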
Errors in Decisions Made in Testing
There are four possibilities in a test of an hypothesis:
the hypothesis may be true, and the test may or may not reject
it,
or the hypothesis may be false, and the test may or may not
reject it.
The result of a statistical hypothesis test can be incorrect in two
distinct ways: it can reject a true hypothesis or it can fail to
reject a false hypothesis.
We call rejecting a true hypothesis a “type I error”, and failing
to reject a false hypothesis a “type II error”.
Errors
Our standard approach in hypothesis testing is to control the level of the probability of a type I error under the assumptions, and to try to find a test subject to that level that has a small probability of a type II error.
We call the maximum allowable probability of a type I error the “significance level”, and usually denote it by α.
We call the probability of rejecting the null hypothesis the power of the test, and will denote it by β.
If the alternative hypothesis is the true state of nature, the power is one minus the probability of a type II error.
It is clear that we can easily decrease the probability of one type of error (if its probability is positive) at the cost of increasing the probability of the other.
Errors
In a common approach to hypothesis testing under the given
assumptions on X, we choose α ∈]0,1[ and require that δ(X) be
such that
Pr(δ(X) = 1 | θ ∈ Θ0) ≤ α,
and, subject to this, find δ(X) so as to minimize
Pr(δ(X) = 0 | θ ∈ Θ1).
Optimality of a test δ is defined in terms of this constrained
optimization problem.
Notice that the restriction on the type I error applies ∀θ ∈ Θ0.
We call
sup_{θ∈Θ0} Pr(δ(X) = 1 | θ)
the size of the test.
Errors
In common applications, Θ0 ∪ Θ1 forms a connected region in IR^k,
Θ0 contains the set of common closure points of Θ0 and
Θ1, and Pr(δ(X) = 1 | θ) is a continuous function of θ; hence the
sup is generally a max.
If the size is less than the level of significance, the test is said
to be conservative, and in that case, we often refer to α as the
“nominal size”.
Example 1 Testing in an exponential distribution
Suppose we have observations X1, . . . , Xn i.i.d. as exponential(θ).
The Lebesgue PDF is
pθ(x) = θ−1e−x/θI]0,∞[(x),
with θ ∈]0,∞[.
Suppose now we wish to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
We know that X̄ is sufficient for θ.
A reasonable test may be to reject H0 if T(X) = X̄ > c, where c
is some fixed positive constant; that is,
δ(X) = I_]c,∞[(T(X)).
Knowing the distribution of X̄ to be gamma with shape parameter
n and scale θ/n, we can now work out
Pr(δ(X) = 1 | θ) = Pr(T(X) > c | θ),
which, for θ ≤ θ0, is the probability of a Type I error.
For θ > θ0,
1 − Pr(δ(X) = 1 | θ)
is the probability of a Type II error. These probabilities, as a
function of θ, are shown in the figure.
Performance of Test
[Figure: the probability of rejection as a function of θ, indicating
the type I error region over H0 (θ ≤ θ0) and, over H1 (θ > θ0),
the type II error and correct rejection probabilities.]
Now, for a given significance level α, we choose c so that
Pr(T (X) > c | θ ≤ θ0) ≤ α.
This is satisfied for c such that, for a random variable Y having a
gamma distribution with shape n and scale θ0/n, Pr(Y > c) = α.
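The constant c can be found numerically. For integer n, the gamma survival function reduces to a Poisson sum: Pr(Gamma(n, s) > c) = Pr(Poisson(c/s) ≤ n − 1). A sketch in Python, where the values n = 10, θ0 = 1, and α = 0.05 are illustrative assumptions:

```python
import math

def gamma_sf(c, n, scale):
    # Pr(Y > c) for Y ~ gamma(shape n, scale), integer n,
    # via the identity Pr(Y > c) = Pr(Poisson(c/scale) <= n - 1)
    lam = c / scale
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

def critical_value(n, theta0, alpha, lo=0.0, hi=100.0, tol=1e-12):
    # bisect for c with Pr(Y > c) = alpha, Y ~ gamma(n, theta0/n)
    scale = theta0 / n
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_sf(mid, n, scale) > alpha:
            lo = mid   # survival still above alpha: move the cutoff right
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = critical_value(10, 1.0, 0.05)   # illustrative: n = 10, theta0 = 1
```

Bisection works here because the survival function is strictly decreasing in c, so any bracketing interval converges to the unique c with Pr(Y > c) = α.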
p-Values
Note that there is a difference in choosing the test procedure, and in using the test.
The question of the choice of α comes back.
Does it make sense to choose α first, and then proceed to apply the test just to end up with a decision d0 or d1?
It is not likely that this rigid approach would be very useful for most objectives.
In statistical data analysis our objectives are usually broader than just deciding which of two hypotheses appears to be true.
On the other hand, if we have a well-developed procedure for testing the two hypotheses, the decision rule in this procedure could be very useful in data analysis.
p-Values
One common approach is to use the functional form of the rule,
but not to pre-define the critical region.
Then, given the same setup of null hypothesis and alternative,
to collect data X = x, and to determine the smallest value α(x)
at which the null hypothesis would be rejected.
The value α(x) is called the p-value of x associated with the
hypotheses.
The p-value indicates the strength of the evidence of the data
against the null hypothesis.
Example 2 Testing in an exponential distribution; p-value
Consider again the problem where we had observations X1, . . . , Xn
i.i.d. as exponential(θ), and wished to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
Our test was to reject H0 if T(X) = X̄ > c, where c was some fixed
positive constant chosen so that Pr(Y > c) = α, where Y is a
random variable having a gamma distribution with shape n and scale θ0/n.
Suppose instead of choosing c, we merely compute Pr(Y > x̄),
where x̄ is the mean of the set of observations.
This is the p-value for the null hypothesis and the given data.
If the p-value is less than a prechosen significance level α, then
the null hypothesis is rejected.
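As a numerical sketch (using an assumed data summary of n = 10 observations with mean x̄ = 1.5, and θ0 = 1, none of which come from the text), the p-value Pr(Y > x̄) can be computed from the gamma distribution of the sample mean, with the Poisson-sum form of the survival function for integer shape:

```python
import math

def gamma_sf(c, n, scale):
    # Pr(Y > c) for Y ~ gamma(shape n, scale), integer n
    lam = c / scale
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

def p_value(xbar, n, theta0):
    # p-value Pr(Y > xbar), where Y is distributed as the sample mean
    # of n i.i.d. exponential(theta0) observations: gamma(n, theta0/n)
    return gamma_sf(xbar, n, theta0 / n)

p = p_value(1.5, 10, 1.0)   # assumed illustrative values
```

Note the monotonicity this makes concrete: a larger observed mean gives a smaller p-value (stronger evidence against H0: θ ≤ θ0), and a larger θ0 makes the same x̄ less extreme.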
Example 3 Sampling in a Bernoulli distribution; p-values
and the likelihood principle revisited
We have considered the family of Bernoulli distributions that is formed from the class of the probability measures Pπ({1}) = π and Pπ({0}) = 1 − π on the measurable space (Ω = {0,1}, F = 2^Ω). Suppose now we wish to test
H0 : π ≥ 0.5 versus H1 : π < 0.5.
As we indicated before, there are two ways we could set up an experiment to make inferences on π.
One approach is to take a random sample of size n, X1, . . . , Xn, from the Bernoulli(π) distribution, and then use some function of that sample as an estimator.
An obvious statistic to use is the number of 1’s in the sample, that is, T = ∑ Xi.
To assess the performance of an estimator using T, we would first determine its distribution and then use the properties of that distribution to decide what would be a good estimator based on T.
A very different approach is to take a sequential sample, X1, X2, . . .,until a fixed number t of 1’s have occurred.
This yields N , the number of trials until t 1’s have occurred.
The distribution of T is binomial with parameters n and π; its PDF is
pT(t ; n, π) = \binom{n}{t} π^t (1 − π)^{n−t},  t = 0, 1, . . . , n.
The distribution of N is the negative binomial with parameters t and π, and its PDF is
pN(n ; t, π) = \binom{n−1}{t−1} π^t (1 − π)^{n−t},  n = t, t + 1, . . . .
Suppose we do this both ways.
We choose n = 12 for the first method and t = 3 for the second
method.
Now, suppose that for the first method, we observe T = 3 and
for the second method, we observe N = 12.
The ratio of the likelihoods does not involve π, so by the
likelihood principle, we should make the same conclusions about
π.
Let us now compute the respective p-values.
For the binomial setup we get p = 0.073 (using the R function
pbinom(3,12,0.5)), but for the negative binomial setup we get
p = 0.033 (using the R expression 1-pnbinom(8,3,0.5), in which the
first argument is the number of “failures” before the number of
“successes” specified in the second argument).
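These p-values can be verified directly from the two PDFs above. A sketch in Python rather than R, using the same n = 12, t = 3 data:

```python
from math import comb

# binomial p-value: small T favors H1: pi < 0.5, so p = Pr(T <= 3 | pi = 0.5)
p_binom = sum(comb(12, t) * 0.5**12 for t in range(4))

# negative binomial p-value: large N favors H1, so p = Pr(N >= 12 | pi = 0.5),
# with Pr(N = n) = C(n-1, t-1) pi^t (1-pi)^(n-t)
p_nbinom = 1.0 - sum(comb(n - 1, 2) * 0.5**n for n in range(3, 12))

# the likelihood ratio is free of pi:
# pT(3; 12, pi) / pN(12; 3, pi) = C(12, 3) / C(11, 2)
ratio = comb(12, 3) / comb(11, 2)
```

Rounded to three places these give 0.073 and 0.033, agreeing with the R computations, while the likelihood ratio is the constant 4 for every value of π.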
The p-values are different, and in fact, if we had decided to
perform the test at the α = 0.05 significance level, in one case
we would reject the null hypothesis and in the other case we
would not.
This illustrates a problem in the likelihood principle; it ignores
the manner in which information is collected.
The problem often arises in this kind of situation, in which we
have either an experiment whose design is completely independent
of the information gathered or an experiment whose conduct
depends on what is observed.
The latter type of experiment is often a type of Markov process
in which there is a stopping time.
Power of a Statistical Test
We call the probability of rejecting H0 the power of the test, and
denote it by β, or, for the particular test δ(X), βδ.
The power in the case that H1 is true is 1 minus the probability
of a type II error.
The probability of a type II error is generally a function of the
true distribution of the sample Pθ, and hence so is the power,
which we may emphasize by the notation βδ(Pθ) or βδ(θ).
We now can focus on the test under either hypothesis (that is,
under either subset of the family of distributions) in a unified
fashion.
Power of a Statistical Test
We define the power function of the test, for any given P ∈ P as
βδ(P) = EP (δ(X)).
Thus, minimizing the probability of a type II error is equivalent
to maximizing the power within Θ1.
Because the power is generally a function of θ, what does
maximizing the power mean?
Power of a Statistical Test
That is, maximize it for what values of θ?
Ideally, we would like a procedure that yields the maximum for
all values of θ; that is, one that is most powerful for all values
of θ.
We call such a procedure a uniformly most powerful or UMP
test.
For a given problem, finding such procedures, or establishing that
they do not exist, will be one of our primary objectives.
Decision Theoretic Approach
In the decision-theoretic formulation of a statistical procedure,
the decision space is {0,1}, corresponding respectively to not
rejecting and rejecting the hypothesis.
As in the decision-theoretic setup, we seek to minimize the risk:
R(P, δ) = E(L(P, δ(X))).
In the case of the 0-1 loss function and the four possibilities, the
risk is just the probability of either type of error.
We want a test procedure that minimizes the risk.
The issue above of a uniformly most powerful test is equivalent
to the issue of a uniformly minimum risk test.
Randomized Tests
Just as in the theory of point estimation, where we found randomized
procedures useful for establishing properties of estimators or as
counterexamples to some statement about a given estimator,
we can use randomized test procedures to establish properties of
tests.
While randomized estimators rarely have application in practice,
randomized test procedures can actually be used to increase the
power of a conservative test.
Use of a randomized test in this way would not make much sense
in real-world data analysis, but if there are regulatory conditions
to satisfy, it might be useful.
Randomized Tests
We define a function δR that maps X into the decision space, and we define a random experiment R that has two outcomes, associated with not rejecting the hypothesis or with rejecting the hypothesis, such that
Pr(R = d0) = 1 − δR(x)
and so
Pr(R = d1) = δR(x).
A randomized test can be constructed using a test δ(x) whose range is {d0, d1} ∪ DR, with the rule that if δ(x) ∈ DR, then the experiment R is performed with δR(x) chosen so that the overall probability of a type I error is the desired level.
After δR(x) is chosen, the experiment R is independent of the random variable to whose distribution the hypothesis applies.
Optimal Tests
Optimal tests are those that minimize the risk.
The risk considers the total expected loss.
In the testing problem, we generally prefer to restrict the prob-
ability of a type I error and then, subject to that, minimize the
probability of a type II error, which is equivalent to maximizing
the power under the alternative hypothesis.
An Optimal Test in a Simple Situation
First, consider the problem of picking the optimal critical region
C in a problem of testing the hypothesis that a discrete ran-
dom variable has the probability mass function p0(x) versus the
alternative that it has the probability mass function p1(x).
We will develop an optimal test for any given significance level
based on one observation.
For x such that p0(x) > 0, let
r(x) = p1(x)/p0(x),
and label the values of x for which r is defined so that
r(xr1) ≥ r(xr2) ≥ · · · .
Let N be the set of x for which p0(x) = 0 and p1(x) > 0.
Assume that there exists a j such that
∑_{i=1}^{j} p0(xri) = α.
If S is the set of x for which we reject the hypothesis, we see that the
significance level is
∑_{x∈S} p0(x),
and the power over the region of the alternative hypothesis is
∑_{x∈S} p1(x).
Then it is clear that if C = {xr1, . . . , xrj} ∪ N, then ∑_{x∈C} p1(x) is
maximized over all critical regions subject to the restriction on the size
of the test.
If there does not exist a j such that ∑_{i=1}^{j} p0(xri) = α, the rule is
to put xr1, . . . , xrj in C so long as
∑_{i=1}^{j} p0(xri) = α∗ < α.
We then define a randomized auxiliary test R with
Pr(R = d1) = δR(xr,j+1) = (α − α∗)/p0(xr,j+1).
It is clear that in this way ∑_{x∈C} p1(x) is maximized subject to the
restriction on the size of the test.
Example 4 Testing between two discrete distributions
Consider two distributions with support on a subset of {0, 1, 2, 3, 4, 5}.
Let p0(x) and p1(x) be the probability mass functions.
Based on one observation, we want to test H0 : p0(x) is the
mass function versus H1 : p1(x) is the mass function.
Suppose the distributions are as shown in the table, where we
also show the values of r and the labels on x determined by r.
x       0    1    2    3    4    5
p0     .05  .10  .15    0  .50  .20
p1     .15  .40  .30  .05  .05  .05
r       3    4    2    -  1/10  2/5
label   2    1    3    -    5    4
Thus, for example, we see xr1 = 1 and xr2 = 0. Also, N = {3}.
For given α, we choose C such that
∑_{x∈C} p0(x) ≤ α
and so as to maximize
∑_{x∈C} p1(x).
We find the optimal C by first ordering r(xr1) ≥ r(xr2) ≥ · · · and
then satisfying ∑_{x∈C} p0(x) ≤ α.
The ordered possibilities for C in this example are
{1} ∪ {3}, {1, 0} ∪ {3}, {1, 0, 2} ∪ {3}, · · · .
Notice that including N in the critical region does not cost us
anything (in terms of the type I error that we are controlling).
Now, for any given significance level, we can determine the op-
timum test based on one observation.
• Suppose α = .10. Then the optimal critical region is C =
{1, 3}, and the power at the alternative is βδ(p1) = .45.
• Suppose α = .15. Then the optimal critical region is C =
{0, 1, 3}, and the power at the alternative is βδ(p1) = .60.
• Suppose α = .05. We cannot put 1 in C with probability
1, but if we put 1 in C with probability 0.5, the α level is
satisfied, and the power at the alternative is βδ(p1) = .25.
• Suppose α = .20. We choose C = {0, 1, 3} with probability
2/3 and C = {0, 1, 2, 3} with probability 1/3. The α level is
satisfied, and the power at the alternative is βδ(p1) = .70.
All of these tests are most powerful based on one observation
for the given values of α.
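The nonrandomized part of this construction can be sketched programmatically: order the sample points by the likelihood ratio, always include N, and add points until the size would exceed α. (Boundary randomization, as in the α = .05 and α = .20 cases, is deliberately left out of this sketch.)

```python
# the two mass functions from the table in Example 4
p0 = {0: .05, 1: .10, 2: .15, 3: 0.0, 4: .50, 5: .20}
p1 = {0: .15, 1: .40, 2: .30, 3: .05, 4: .05, 5: .05}

def most_powerful_region(alpha):
    # points with p0 = 0 and p1 > 0 cost nothing under H0: always include
    C = {x for x in p0 if p0[x] == 0 and p1[x] > 0}
    size = 0.0
    # add points in order of decreasing likelihood ratio r = p1/p0
    for x in sorted((x for x in p0 if p0[x] > 0),
                    key=lambda x: p1[x] / p0[x], reverse=True):
        if size + p0[x] > alpha + 1e-12:
            break          # the next point would exceed the level
        C.add(x)
        size += p0[x]
    power = sum(p1[x] for x in C)
    return C, size, power
```

For α = .10 this returns C = {1, 3} with power .45, and for α = .15 it returns C = {0, 1, 3} with power .60, matching the cases above.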
We can extend this idea to tests based on two observations.
We see immediately that the ordered critical regions are
C1 = {1, 3} × {1, 3}, C1 ∪ ({1, 3} × {0, 3}), · · · .
Extending this direct enumeration would be tedious, but, at this
point we have grasped the implication: the ratio of the likeli-
hoods is the basis for the most powerful test.
This is the Neyman-Pearson Fundamental Lemma.
The Neyman-Pearson Fundamental Lemma
Example 4 illustrates the way we can approach the problem of
testing any simple hypothesis against another simple hypothesis.
Notice the pivotal role played by the ratio r.
This is a ratio of likelihoods.
For testing H0 that the distribution of X is P0 versus the al-
ternative H1 that the distribution of X is P1 given 0 < α < 1,
under very mild assumptions, the Neyman-Pearson Fundamen-
tal Lemma tells us that a test based on the ratio of likelihoods
exists, is most powerful, and is unique.
Proof. Let A be any critical region of size α. We want to prove
∫_C L(θ1) − ∫_A L(θ1) ≥ 0.
We can write this as
∫_C L(θ1) − ∫_A L(θ1) = ∫_{C∩A} L(θ1) + ∫_{C∩Aᶜ} L(θ1) − ∫_{A∩C} L(θ1) − ∫_{A∩Cᶜ} L(θ1)
 = ∫_{C∩Aᶜ} L(θ1) − ∫_{A∩Cᶜ} L(θ1).
By the given condition, L(θ1; x) ≥ kL(θ0; x) at each x ∈ C, so
∫_{C∩Aᶜ} L(θ1) ≥ k ∫_{C∩Aᶜ} L(θ0),
and L(θ1; x) ≤ kL(θ0; x) at each x ∈ Cᶜ, so
∫_{A∩Cᶜ} L(θ1) ≤ k ∫_{A∩Cᶜ} L(θ0).
Hence
∫_C L(θ1) − ∫_A L(θ1) ≥ k ( ∫_{C∩Aᶜ} L(θ0) − ∫_{A∩Cᶜ} L(θ0) ).
But
∫_{C∩Aᶜ} L(θ0) − ∫_{A∩Cᶜ} L(θ0) = ∫_{C∩Aᶜ} L(θ0) + ∫_{C∩A} L(θ0) − ∫_{C∩A} L(θ0) − ∫_{A∩Cᶜ} L(θ0)
 = ∫_C L(θ0) − ∫_A L(θ0)
 = α − α
 = 0.
Hence, ∫_C L(θ1) − ∫_A L(θ1) ≥ 0.
This simple statement of the Neyman-Pearson Lemma and its
proof should be in your bag of easy pieces.
The Lemma applies more generally by use of a random experi-
ment so as to achieve the level α.
Shao gives a clear statement and proof of this.
Generalizing the Optimal Test to
Hypotheses of Intervals
Although the Neyman-Pearson Lemma applies only to testing a
simple hypothesis versus a simple alternative (and hence “uniform”
properties do not make much sense there), it gives us a way of
determining whether a uniformly most powerful (UMP) test exists,
and if so, how to find one.
We are often interested in testing hypotheses in which either or
both of Θ0 and Θ1 are continuous regions of IR (or IRk).
We must look at the likelihood ratio as a function both of θ and
x.
The question is whether, for given θ0 and any θ1 > θ0 (or equiv-
alently any θ1 < θ0), the likelihood is monotone in some function
of x; that is, whether the family of distributions of interest is
parameterized by a scalar in such a way that it has a monotone
likelihood ratio (see Chapter 1 in the Companion notes).
In that case, it is clear that we can extend the test to be uniformly
most powerful for testing H0 : θ = θ0 against the alternative
H1 : θ > θ0 (or H1 : θ < θ0).
The exponential class of distributions is important because UMP
tests are easy to find for families of distributions in that class.
Discrete distributions are especially simple, but there is nothing
special about them.
As an example, work out the test for H0 : θ ≥ θ0 versus the al-
ternative H1 : θ < θ0 in a one-parameter exponential distribution.
The one-parameter exponential distribution, with density
θ⁻¹e^{−x/θ} over the positive reals, is a member of the exponential class.
Two easy pieces you should have are construction of a UMP for
the hypotheses in the one-parameter exponential (above), and
the construction of a UMP for testing H0 : π ≥ π0 versus the
alternative H1 : π < π0 in a binomial(π, n) distribution.
Use of Sufficient Statistics
It is a useful fact that if there is a sufficient statistic S(X) for
θ, and δ(X) is an α-level test for an hypothesis specifying values
of θ, then there exists an α-level test for the same hypothesis,
δ(S) that depends only on S(X), and which has power at least
as great as that of δ(X).
We see this by factoring the likelihoods.