Testing Statistical Hypotheses
In statistical hypothesis testing, the basic problem is to decide
whether or not to reject a statement about the distribution of a
random variable.
The statement must be expressible in terms of membership in a
well-defined class.
The hypothesis can therefore be expressed by the statement that
the distribution of the random variable X is in the class
PH = {Pθ : θ ∈ ΘH}.
An hypothesis of this form is called a statistical hypothesis.
Testing is a statistical decision problem.
Issues
• optimality of tests: most powerful
• Neyman-Pearson Fundamental Lemma: the optimal procedure
for testing one simple hypothesis versus another simple hypoth-
esis
• uniformly optimal
∗ impose restrictions, such as unbiasedness or invariance
· find optimal tests under those restrictions
∗ define uniformity in terms of a global averaging
Issues
• general methods for constructing tests
• asymptotic properties of the likelihood ratio tests
• nonparametric tests
• sequential tests
• multiple tests
Statistical Hypotheses
We are given (or assume) a broad family of distributions,
P = {Pθ : θ ∈ Θ}.
As in other problems in statistical inference, the objective is to
decide whether the given observations arose from some subset
of distributions
PH ⊂ P.
The statistical hypothesis is a statement of the form
“the family of distributions is PH”,
where PH ⊂ P,
or perhaps
“θ ∈ ΘH”,
where ΘH ⊂ Θ.
Statistical Hypotheses
The full statement consists of two pieces, one part an assumption,
“assume the distribution of X is in the class P”, and the
other part the hypothesis, “θ ∈ ΘH, where ΘH ⊂ Θ”.
Given the assumptions, and the definition of ΘH, we often denote
the hypothesis as H, and write it as
H : θ ∈ ΘH.
Two Hypotheses
While to reject the hypothesis H would mean to decide that
θ ∉ ΘH, it is generally more convenient to formulate
the testing problem as one of deciding between two statements:
H0 : θ ∈ Θ0
and
H1 : θ ∈ Θ1,
where Θ0 ∩ Θ1 = ∅.
We do not treat H0 and H1 symmetrically; H0 is the hypothesis
(or “null hypothesis”) to be tested and H1 is the alternative.
This distinction is important in developing a methodology of
testing.
Tests of Hypotheses
To test the hypotheses means to choose one hypothesis or the other; that is, to make a decision, d.
We have a sample X from the relevant family of distributions and a statistic T(X).
A nonrandomized test procedure is a rule δ(X) that assigns two decisions to two disjoint subsets, C0 and C1, of the range of T(X).
We equate those two decisions with the real numbers 0 and 1, so δ(X) is a real-valued function,
δ(x) = 0 for T(x) ∈ C0,
δ(x) = 1 for T(x) ∈ C1.
Note for i = 0,1,
Pr(δ(X) = i) = Pr(X ∈ Ci).
We call C1 the critical region, and generally denote it by just C.
If δ(X) takes the value 0, the decision is not to reject; if δ(X)
takes the value 1, the decision is to reject.
If the range of δ(X) is {0,1}, the test is a nonrandomized test.
Sometimes it is useful to choose the range of δ(X) as some other
set of real numbers, such as {d0, d1}, or even a set with cardinality
greater than 2.
If the range is taken to be the closed interval [0,1], we can inter-
pret a value of δ(X) as the probability that the null hypothesis
is rejected.
If it is not the case that δ(X) equals 0 or 1 a.s., we call the test
a randomized test.
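A test procedure of this kind can be sketched in code. This is a minimal illustration under assumed choices (the sample mean as T, a cutoff c = 2, and a boundary randomization probability of 0.3 — none of these come from the text):

```python
import random

def T(x):
    # test statistic: the sample mean (an assumed choice)
    return sum(x) / len(x)

def delta_nonrandomized(x, c=2.0):
    # nonrandomized test: 1 (reject) if T(x) is in C1 = ]c, oo[, else 0
    return 1 if T(x) > c else 0

def delta_randomized(x, c=2.0, gamma=0.3):
    # randomized test: the value in [0,1] is the probability of rejection;
    # here we randomize only on the boundary T(x) = c
    t = T(x)
    if t > c:
        return 1.0
    if t == c:
        return gamma
    return 0.0

def decide(x, rng=random.random):
    # carry out the randomized test with an independent auxiliary draw
    return 1 if rng() < delta_randomized(x) else 0
```

A value of δ strictly between 0 and 1 is realized by the independent auxiliary experiment in `decide`, exactly as in the interpretation above.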
Errors in Decisions Made in Testing
There are four possibilities in a test of an hypothesis:
the hypothesis may be true, and the test may or may not reject
it,
or the hypothesis may be false, and the test may or may not
reject it.
The result of a statistical hypothesis test can be incorrect in two
distinct ways: it can reject a true hypothesis or it can fail to
reject a false hypothesis.
We call rejecting a true hypothesis a “type I error”, and failing
to reject a false hypothesis a “type II error”.
Errors
Our standard approach in hypothesis testing is to control the level of the probability of a type I error under the assumptions, and to try to find a test subject to that level that has a small probability of a type II error.
We call the maximum allowable probability of a type I error the “significance level”, and usually denote it by α.
We call the probability of rejecting the null hypothesis the power of the test, and will denote it by β.
If the alternative hypothesis is the true state of nature, the power is one minus the probability of a type II error.
It is clear that we can easily decrease the probability of one type of error (if its probability is positive) at the cost of increasing the probability of the other.
Errors
In a common approach to hypothesis testing under the given
assumptions on X, we choose α ∈]0,1[ and require that δ(X) be
such that
Pr(δ(X) = 1 | θ ∈ Θ0) ≤ α,
and, subject to this, find δ(X) so as to minimize
Pr(δ(X) = 0 | θ ∈ Θ1).
Optimality of a test δ is defined in terms of this constrained
optimization problem.
Notice that the restriction on the type I error applies ∀θ ∈ Θ0.
We call
sup_{θ∈Θ0} Pr(δ(X) = 1 | θ)
the size of the test.
Errors
In common applications, Θ0 ∪ Θ1 forms a connected region in IR^k,
Θ0 contains the set of common closure points of Θ0 and
Θ1, and Pr(δ(X) = 1 | θ) is a continuous function of θ; hence the
sup is generally a max.
If the size is less than the level of significance, the test is said
to be conservative, and in that case, we often refer to α as the
“nominal size”.
Example 1 Testing in an exponential distribution
Suppose we have observations X1, . . . , Xn i.i.d. as exponential(θ).
The Lebesgue PDF is
pθ(x) = θ−1e−x/θI]0,∞[(x),
with θ ∈]0,∞[.
Suppose now we wish to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
We know that X̄ is sufficient for θ.
A reasonable test may be to reject H0 if T(X) = X̄ > c, where c
is some fixed positive constant; that is,
δ(X) = I_]c,∞[(T(X)).
Knowing the distribution of X̄ to be gamma with shape parameter
n and scale θ/n, we can now work out
Pr(δ(X) = 1 | θ) = Pr(T(X) > c | θ),
which, for θ ≤ θ0, is the probability of a Type I error.
For θ > θ0,
1 − Pr(δ(X) = 1 | θ)
is the probability of a Type II error. These probabilities, as a
function of θ, are shown in the figure.
Performance of Test
[Figure: the probability of rejection as a function of θ, indicating
the type I error region over H0 (θ ≤ θ0) and, over H1 (θ > θ0),
the type II error and correct rejection probabilities.]
Now, for a given significance level α, we choose c so that
Pr(T (X) > c | θ ≤ θ0) ≤ α.
This is satisfied for c such that, for a random variable Y having a
gamma distribution with shape n and scale θ0/n, Pr(Y > c) = α.
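The constant c can be found numerically. For integer n, the gamma survival function reduces to a Poisson sum: Pr(Gamma(n, s) > c) = Pr(Poisson(c/s) ≤ n − 1). A sketch in Python, where the values n = 10, θ0 = 1, and α = 0.05 are illustrative assumptions:

```python
import math

def gamma_sf(c, n, scale):
    # Pr(Y > c) for Y ~ gamma(shape n, scale), integer n,
    # via the identity Pr(Y > c) = Pr(Poisson(c/scale) <= n - 1)
    lam = c / scale
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

def critical_value(n, theta0, alpha, lo=0.0, hi=100.0, tol=1e-12):
    # bisect for c with Pr(Y > c) = alpha, Y ~ gamma(n, theta0/n)
    scale = theta0 / n
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_sf(mid, n, scale) > alpha:
            lo = mid   # survival still above alpha: move the cutoff right
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = critical_value(10, 1.0, 0.05)   # illustrative: n = 10, theta0 = 1
```

Bisection works here because the survival function is strictly decreasing in c, so any bracketing interval converges to the unique c with Pr(Y > c) = α.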
p-Values
Note that there is a difference in choosing the test procedure, and in using the test.
The question of the choice of α comes back.
Does it make sense to choose α first, and then proceed to apply the test just to end up with a decision d0 or d1?
It is not likely that this rigid approach would be very useful for most objectives.
In statistical data analysis our objectives are usually broader than just deciding which of two hypotheses appears to be true.
On the other hand, if we have a well-developed procedure for testing the two hypotheses, the decision rule in this procedure could be very useful in data analysis.
p-Values
One common approach is to use the functional form of the rule,
but not to pre-define the critical region.
Then, given the same setup of null hypothesis and alternative,
to collect data X = x, and to determine the smallest value α(x)
at which the null hypothesis would be rejected.
The value α(x) is called the p-value of x associated with the
hypotheses.
The p-value indicates the strength of the evidence of the data
against the null hypothesis.
Example 2 Testing in an exponential distribution; p-value
Consider again the problem where we had observations X1, . . . , Xn
i.i.d. as exponential(θ), and wished to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
Our test was to reject H0 if T(X) = X̄ > c, where c was some fixed
positive constant chosen so that Pr(Y > c) = α, where Y is a
random variable having a gamma distribution with shape n and scale θ0/n.
Suppose instead of choosing c, we merely compute Pr(Y > x̄),
where x̄ is the mean of the set of observations.
This is the p-value for the null hypothesis and the given data.
If the p-value is less than a prechosen significance level α, then
the null hypothesis is rejected.
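As a numerical sketch (using an assumed data summary of n = 10 observations with mean x̄ = 1.5, and θ0 = 1, none of which come from the text), the p-value Pr(Y > x̄) can be computed from the gamma distribution of the sample mean, with the Poisson-sum form of the survival function for integer shape:

```python
import math

def gamma_sf(c, n, scale):
    # Pr(Y > c) for Y ~ gamma(shape n, scale), integer n
    lam = c / scale
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(n))

def p_value(xbar, n, theta0):
    # p-value Pr(Y > xbar), where Y is distributed as the sample mean
    # of n i.i.d. exponential(theta0) observations: gamma(n, theta0/n)
    return gamma_sf(xbar, n, theta0 / n)

p = p_value(1.5, 10, 1.0)   # assumed illustrative values
```

Note the monotonicity this makes concrete: a larger observed mean gives a smaller p-value (stronger evidence against H0: θ ≤ θ0), and a larger θ0 makes the same x̄ less extreme.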
Example 3 Sampling in a Bernoulli distribution; p-values
and the likelihood principle revisited
We have considered the family of Bernoulli distributions that is formed from the class of the probability measures Pπ({1}) = π and Pπ({0}) = 1 − π on the measurable space (Ω = {0,1}, F = 2^Ω). Suppose now we wish to test
H0 : π ≥ 0.5 versus H1 : π < 0.5.
As we indicated before, there are two ways we could set up an experiment to make inferences on π.
One approach is to take a random sample of size n, X1, . . . , Xn, from the Bernoulli(π) distribution, and then use some function of that sample as an estimator.
An obvious statistic to use is the number of 1’s in the sample, that is, T = ∑ Xi.
To assess the performance of an estimator using T, we would first determine its distribution and then use the properties of that distribution to decide what would be a good estimator based on T.
A very different approach is to take a sequential sample, X1, X2, . . .,until a fixed number t of 1’s have occurred.
This yields N , the number of trials until t 1’s have occurred.
The distribution of T is binomial with parameters n and π; its PDF is
pT(t ; n, π) = \binom{n}{t} π^t (1 − π)^{n−t},  t = 0, 1, . . . , n.
The distribution of N is the negative binomial with parameters t and π, and its PDF is
pN(n ; t, π) = \binom{n−1}{t−1} π^t (1 − π)^{n−t},  n = t, t + 1, . . . .
Suppose we do this both ways.
We choose n = 12 for the first method and t = 3 for the second
method.
Now, suppose that for the first method, we observe T = 3 and
for the second method, we observe N = 12.
The ratio of the likelihoods does not involve π, so by the
likelihood principle, we should make the same conclusions about
π.
Let us now compute the respective p-values.
For the binomial setup we get p = 0.073 (using the R function
pbinom(3,12,0.5)), but for the negative binomial setup we get
p = 0.033 (using the R expression 1-pnbinom(8,3,0.5), in which the
first argument is the number of “failures” before the number of
“successes” specified in the second argument).
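These p-values can be verified directly from the two PDFs above. A sketch in Python rather than R, using the same n = 12, t = 3 data:

```python
from math import comb

# binomial p-value: small T favors H1: pi < 0.5, so p = Pr(T <= 3 | pi = 0.5)
p_binom = sum(comb(12, t) * 0.5**12 for t in range(4))

# negative binomial p-value: large N favors H1, so p = Pr(N >= 12 | pi = 0.5),
# with Pr(N = n) = C(n-1, t-1) pi^t (1-pi)^(n-t)
p_nbinom = 1.0 - sum(comb(n - 1, 2) * 0.5**n for n in range(3, 12))

# the likelihood ratio is free of pi:
# pT(3; 12, pi) / pN(12; 3, pi) = C(12, 3) / C(11, 2)
ratio = comb(12, 3) / comb(11, 2)
```

Rounded to three places these give 0.073 and 0.033, agreeing with the R computations, while the likelihood ratio is the constant 4 for every value of π.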
The p-values are different, and in fact, if we had decided to
perform the test at the α = 0.05 significance level, in one case
we would reject the null hypothesis and in the other case we
would not.
This illustrates a problem in the likelihood principle; it ignores
the manner in which information is collected.
The problem often arises in this kind of situation, in which we
have either an experiment whose design is completely independent
of the information gathered or an experiment whose conduct
depends on what is observed.
The latter type of experiment is often a type of Markov process
in which there is a stopping time.
Power of a Statistical Test
We call the probability of rejecting H0 the power of the test, and
denote it by β, or, for the particular test δ(X), βδ.
The power in the case that H1 is true is 1 minus the probability
of a type II error.
The probability of a type II error is generally a function of the
true distribution of the sample Pθ, and hence so is the power,
which we may emphasize by the notation βδ(Pθ) or βδ(θ).
We now can focus on the test under either hypothesis (that is,
under either subset of the family of distributions) in a unified
fashion.
Power of a Statistical Test
We define the power function of the test, for any given P ∈ P as
βδ(P) = EP (δ(X)).
Thus, minimizing the probability of a type II error is equivalent
to maximizing the power within Θ1.
Because the power is generally a function of θ, what does
maximizing the power mean?
Power of a Statistical Test
That is, maximize it for what values of θ?
Ideally, we would like a procedure that yields the maximum for
all values of θ; that is, one that is most powerful for all values
of θ.
We call such a procedure a uniformly most powerful or UMP
test.
For a given problem, finding such procedures, or establishing that
they do not exist, will be one of our primary objectives.
Decision Theoretic Approach
In the decision-theoretic formulation of a statistical procedure,
the decision space is {0,1}, corresponding respectively to not
rejecting and rejecting the hypothesis.
As in the decision-theoretic setup, we seek to minimize the risk:
R(P, δ) = E(L(P, δ(X))).
In the case of the 0-1 loss function and the four possibilities, the
risk is just the probability of either type of error.
We want a test procedure that minimizes the risk.
The issue above of a uniformly most powerful test is equivalent
to the issue of a uniformly minimum risk test.
Randomized Tests
Just as in the theory of point estimation, where we found randomized
procedures useful for establishing properties of estimators or as
counterexamples to some statement about a given estimator,
we can use randomized test procedures to establish properties of
tests.
While randomized estimators rarely have application in practice,
randomized test procedures can actually be used to increase the
power of a conservative test.
Use of a randomized test in this way would not make much sense
in real-world data analysis, but if there are regulatory conditions
to satisfy, it might be useful.
Randomized Tests
We define a function δR that maps X into the decision space, and we define a random experiment R that has two outcomes, associated with not rejecting the hypothesis or with rejecting the hypothesis, such that
Pr(R = d0) = 1 − δR(x)
and so
Pr(R = d1) = δR(x).
A randomized test can be constructed using a test δ(x) whose range is {d0, d1} ∪ DR, with the rule that if δ(x) ∈ DR, then the experiment R is performed with δR(x) chosen so that the overall probability of a type I error is the desired level.
After δR(x) is chosen, the experiment R is independent of the random variable to whose distribution the hypothesis applies.
Optimal Tests
Optimal tests are those that minimize the risk.
The risk considers the total expected loss.
In the testing problem, we generally prefer to restrict the prob-
ability of a type I error and then, subject to that, minimize the
probability of a type II error, which is equivalent to maximizing
the power under the alternative hypothesis.
An Optimal Test in a Simple Situation
First, consider the problem of picking the optimal critical region
C in a problem of testing the hypothesis that a discrete ran-
dom variable has the probability mass function p0(x) versus the
alternative that it has the probability mass function p1(x).
We will develop an optimal test for any given significance level
based on one observation.
For x such that p0(x) > 0, let
r(x) = p1(x)/p0(x),
and label the values of x for which r is defined so that
r(xr1) ≥ r(xr2) ≥ · · · .
Let N be the set of x for which p0(x) = 0 and p1(x) > 0.
Assume that there exists a j such that
∑_{i=1}^{j} p0(xri) = α.
If S is the set of x for which we reject the hypothesis, we see that the
significance level is
∑_{x∈S} p0(x),
and the power over the region of the alternative hypothesis is
∑_{x∈S} p1(x).
Then it is clear that if C = {xr1, . . . , xrj} ∪ N, then ∑_{x∈C} p1(x) is
maximized over all critical regions subject to the restriction on the size
of the test.
If there does not exist a j such that ∑_{i=1}^{j} p0(xri) = α, the rule is
to put xr1, . . . , xrj in C so long as
∑_{i=1}^{j} p0(xri) = α∗ < α.
We then define a randomized auxiliary test R with
Pr(R = d1) = δR(xr,j+1) = (α − α∗)/p0(xr,j+1).
It is clear that in this way ∑_{x∈C} p1(x) is maximized subject to the
restriction on the size of the test.
Example 4 Testing between two discrete distributions
Consider two distributions with support on a subset of {0, 1, 2, 3, 4, 5}.
Let p0(x) and p1(x) be the probability mass functions.
Based on one observation, we want to test H0 : p0(x) is the
mass function versus H1 : p1(x) is the mass function.
Suppose the distributions are as shown in the table, where we
also show the values of r and the labels on x determined by r.
x       0    1    2    3    4    5
p0     .05  .10  .15    0  .50  .20
p1     .15  .40  .30  .05  .05  .05
r       3    4    2    -  1/10  2/5
label   2    1    3    -    5    4
Thus, for example, we see xr1 = 1 and xr2 = 0. Also, N = {3}.
For given α, we choose C such that
∑_{x∈C} p0(x) ≤ α
and so as to maximize
∑_{x∈C} p1(x).
We find the optimal C by first ordering r(xr1) ≥ r(xr2) ≥ · · · and
then satisfying ∑_{x∈C} p0(x) ≤ α.
The ordered possibilities for C in this example are
{1} ∪ {3}, {1, 0} ∪ {3}, {1, 0, 2} ∪ {3}, · · · .
Notice that including N in the critical region does not cost us
anything (in terms of the type I error that we are controlling).
Now, for any given significance level, we can determine the op-
timum test based on one observation.
• Suppose α = .10. Then the optimal critical region is C =
{1, 3}, and the power at the alternative is βδ(p1) = .45.
• Suppose α = .15. Then the optimal critical region is C =
{0, 1, 3}, and the power at the alternative is βδ(p1) = .60.
• Suppose α = .05. We cannot put 1 in C with probability
1, but if we put 1 in C with probability 0.5, the α level is
satisfied, and the power at the alternative is βδ(p1) = .25.
• Suppose α = .20. We choose C = {0, 1, 3} with probability
2/3 and C = {0, 1, 2, 3} with probability 1/3. The α level is
satisfied, and the power at the alternative is βδ(p1) = .70.
All of these tests are most powerful based on one observation
for the given values of α.
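The nonrandomized part of this construction can be sketched programmatically: order the sample points by the likelihood ratio, always include N, and add points until the size would exceed α. (Boundary randomization, as in the α = .05 and α = .20 cases, is deliberately left out of this sketch.)

```python
# the two mass functions from the table in Example 4
p0 = {0: .05, 1: .10, 2: .15, 3: 0.0, 4: .50, 5: .20}
p1 = {0: .15, 1: .40, 2: .30, 3: .05, 4: .05, 5: .05}

def most_powerful_region(alpha):
    # points with p0 = 0 and p1 > 0 cost nothing under H0: always include
    C = {x for x in p0 if p0[x] == 0 and p1[x] > 0}
    size = 0.0
    # add points in order of decreasing likelihood ratio r = p1/p0
    for x in sorted((x for x in p0 if p0[x] > 0),
                    key=lambda x: p1[x] / p0[x], reverse=True):
        if size + p0[x] > alpha + 1e-12:
            break          # the next point would exceed the level
        C.add(x)
        size += p0[x]
    power = sum(p1[x] for x in C)
    return C, size, power
```

For α = .10 this returns C = {1, 3} with power .45, and for α = .15 it returns C = {0, 1, 3} with power .60, matching the cases above.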
We can extend this idea to tests based on two observations.
We see immediately that the ordered critical regions are
C1 = {1, 3} × {1, 3}, C1 ∪ ({1, 3} × {0, 3}), · · · .
Extending this direct enumeration would be tedious, but, at this
point we have grasped the implication: the ratio of the likeli-
hoods is the basis for the most powerful test.
This is the Neyman-Pearson Fundamental Lemma.
The Neyman-Pearson Fundamental Lemma
Example 4 illustrates the way we can approach the problem of
testing any simple hypothesis against another simple hypothesis.
Notice the pivotal role played by the ratio r.
This is a ratio of likelihoods.
For testing H0 that the distribution of X is P0 versus the al-
ternative H1 that the distribution of X is P1 given 0 < α < 1,
under very mild assumptions, the Neyman-Pearson Fundamen-
tal Lemma tells us that a test based on the ratio of likelihoods
exists, is most powerful, and is unique.
Proof. Let A be any critical region of size α. We want to prove
∫_C L(θ1) − ∫_A L(θ1) ≥ 0.
We can write this as
∫_C L(θ1) − ∫_A L(θ1) = ∫_{C∩A} L(θ1) + ∫_{C∩Aᶜ} L(θ1) − ∫_{A∩C} L(θ1) − ∫_{A∩Cᶜ} L(θ1)
 = ∫_{C∩Aᶜ} L(θ1) − ∫_{A∩Cᶜ} L(θ1).
By the given condition, L(θ1; x) ≥ kL(θ0; x) at each x ∈ C, so
∫_{C∩Aᶜ} L(θ1) ≥ k ∫_{C∩Aᶜ} L(θ0),
and L(θ1; x) ≤ kL(θ0; x) at each x ∈ Cᶜ, so
∫_{A∩Cᶜ} L(θ1) ≤ k ∫_{A∩Cᶜ} L(θ0).
Hence
∫_C L(θ1) − ∫_A L(θ1) ≥ k ( ∫_{C∩Aᶜ} L(θ0) − ∫_{A∩Cᶜ} L(θ0) ).
But
∫_{C∩Aᶜ} L(θ0) − ∫_{A∩Cᶜ} L(θ0) = ∫_{C∩Aᶜ} L(θ0) + ∫_{C∩A} L(θ0) − ∫_{C∩A} L(θ0) − ∫_{A∩Cᶜ} L(θ0)
 = ∫_C L(θ0) − ∫_A L(θ0)
 = α − α
 = 0.
Hence, ∫_C L(θ1) − ∫_A L(θ1) ≥ 0.
This simple statement of the Neyman-Pearson Lemma and its
proof should be in your bag of easy pieces.
The Lemma applies more generally by use of a random experi-
ment so as to achieve the level α.
Shao gives a clear statement and proof of this.
Generalizing the Optimal Test to
Hypotheses of Intervals
Although the Neyman-Pearson Lemma applies only to testing a
simple hypothesis versus a simple alternative (and hence “uniform”
properties do not make much sense there), it gives us a way of
determining whether a uniformly most powerful (UMP) test exists,
and if so, how to find one.
We are often interested in testing hypotheses in which either or
both of Θ0 and Θ1 are continuous regions of IR (or IRk).
We must look at the likelihood ratio as a function both of θ and
x.
The question is whether, for given θ0 and any θ1 > θ0 (or equiv-
alently any θ1 < θ0), the likelihood is monotone in some function
of x; that is, whether the family of distributions of interest is
parameterized by a scalar in such a way that it has a monotone
likelihood ratio (see Chapter 1 in the Companion notes).
In that case, it is clear that we can extend the test to be uniformly
most powerful for testing H0 : θ = θ0 against the alternative
H1 : θ > θ0 (or H1 : θ < θ0).
The exponential class of distributions is important because UMP
tests are easy to find for families of distributions in that class.
Discrete distributions are especially simple, but there is nothing
special about them.
As an example, work out the test for H0 : θ ≥ θ0 versus the al-
ternative H1 : θ < θ0 in a one-parameter exponential distribution.
The one-parameter exponential distribution, with density
θ⁻¹e^{−x/θ} over the positive reals, is a member of the exponential class.
Two easy pieces you should have are construction of a UMP for
the hypotheses in the one-parameter exponential (above), and
the construction of a UMP for testing H0 : π ≥ π0 versus the
alternative H1 : π < π0 in a binomial(π, n) distribution.
Use of Sufficient Statistics
It is a useful fact that if there is a sufficient statistic S(X) for
θ, and δ(X) is an α-level test for an hypothesis specifying values
of θ, then there exists an α-level test for the same hypothesis,
δ(S) that depends only on S(X), and which has power at least
as great as that of δ(X).
We see this by factoring the likelihoods.