Chapter Five Hypothesis Testing: Concepts

The Purpose of Hypothesis Testing
An Initial Look at Hypothesis Testing
Formal Hypothesis Testing
    Introduction
    Null and Alternate Hypotheses
    Procedure for Formal Hypothesis Tests
    Examples
Errors in Hypothesis Testing
    Introduction
    False Positive Errors
    False Negative Errors
    Summary: Choosing the Confidence Level
Chapter Checkpoint

The Purpose of Hypothesis Testing

The purpose of obtaining measurements of a chemical system is usually to draw some conclusions about the properties of the system. One of the simplest uses of statistics, one that has largely concerned us to this point, is to obtain an estimate of the system properties through the use of confidence intervals. This is an aspect of statistical estimation theory. Now, however, we turn our attention to decision theory, where we learn how we can use measurement statistics to draw general conclusions about chemical systems.

The following are examples of situations where we want to draw some kind of conclusion based on measurements:

• Two reactants are mixed, and the concentrations of the products are monitored as a function of time in order to determine the rate constant, k, of the reaction. You want to compare the results of your measurement with a value calculated from theory.

• You have just come up with a new synthetic procedure for a certain commercial product that you believe increases the yield over the currently accepted method. You measure the yield by both methods, and you find that your method gives a 65% yield while the older method gave a 60% yield. You must prove that your method is actually superior to the older method, and that the increase in yield is not due to the uncertainty in the measured values.

For a more detailed example, consider the following situation. Let’s say we obtain the following measurements of the pH of a particular solution:

pH measurements: 9.5, 9.9, 9.8

Now we wish to know whether it is possible to state, with confidence, that the pH of the solution is less than 10. If we can assume that the measurements are unbiased, we can restate this question in a form that can be evaluated with statistics, namely:

“is it true that pH < 10?”

Now, assuming no measurement bias, the fact that none of the measurements of pH is greater than 10 seems to support the notion that the true pH of the solution is less than 10. However, since the measurement of pH is a random variable, there is always a chance that the actual pH is indeed greater than 10, and that the three measurements, by random chance, all happen to be less than 10. In the same way, there is a chance that three coin flips in a row will come up tails, even though there is a fifty-fifty chance of getting heads on any single coin toss.

Our problem is this: at what point can we say that random variability is an unlikely explanation for the difference between the measured pH values and a fixed value (e.g., a pH of 10)? In other words, when do the measured values “differ significantly” from the fixed value? The meaning of the word “significantly” must be very clear: a statistically significant difference in the values would be a greater difference than could be reasonably explained by random error.

This is exactly the type of question that hypothesis testing answers. Hypothesis tests are sometimes called significance tests, since they detect “significant” differences in numbers, differences that are unlikely to be due to random chance.


An Initial Look at Hypothesis Testing

Let’s use an example to help us to see how we might derive conclusions using random variables (i.e., measurements).

Example 5.1
A cigarette manufacturer states that the nicotine level of its cigarettes is 14 mg per cigarette. You wish to test this claim. You collect a random sample of 5 cigarettes and test for nicotine content. The measured nicotine levels (in mg) of the cigarettes in the sample are

14.05, 14.33, 16.36, 18.55, 14.76.

Do these measurements indicate a nicotine level different from that claimed by the manufacturer?

Basically, what we would like to do is test the following statement:

Hypothesis: The true nicotine level of the cigarettes is different from that claimed (14 mg) by the manufacturer.

Let’s calculate the mean of the measured nicotine level.

$x = (14.05,\ 14.33,\ 16.36,\ 18.55,\ 14.76)^T$ mg   (measurements)

$\bar{x} = \operatorname{mean}(x) = 15.61$ mg

So the mean measured level of nicotine in the five cigarettes was 15.61 mg/cigarette. Obviously, this value is somewhat larger than the nicotine level stated by the manufacturer. The question is, however, is the difference between the nicotine levels “significant?” Do we have any justification for challenging the nicotine levels claimed by the manufacturer?

In order to answer this question, we need more information than simply the measurement average: we must also make use of the observed variability of the five measurements to construct a confidence interval.

$s_x = \operatorname{stdev}(x)$

$se = s_x / \sqrt{5} = 0.837$ mg   (standard error of the mean value)

$t = 2.7765$   (critical t-value for 4 degrees of freedom at the 5% level)

$\text{width} = t \cdot se = 2.32$ mg

$x_{lower} = \bar{x} - t \cdot se = 13.29$ mg   (lower boundary of CI)

$x_{upper} = \bar{x} + t \cdot se = 17.93$ mg   (upper boundary of CI)

In this instance, the 95% confidence interval is 15.61 ± 2.32 mg/cigarette. Recall exactly what this interval represents: assuming no bias, this range of values (13.29 → 17.93 mg) contains the true amount of nicotine in the cigarettes analyzed, with 95% probability.


Since the confidence interval calculated from the measurements on five cigarettes includes 14 mg, we cannot support the original hypothesis that the manufacturer’s claimed nicotine level is incorrect. In other words, the difference between the measurement mean of 15.61 mg and the manufacturer’s stated level of 14 mg is not significant.
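The interval calculation for example 5.1 can be reproduced with a short script. This is only a sketch using the Python standard library; the variable names are chosen here for illustration, and the critical t-value is simply copied from the t-table rather than computed.

```python
import math
import statistics

# Nicotine measurements from example 5.1 (mg per cigarette)
x = [14.05, 14.33, 16.36, 18.55, 14.76]
claimed = 14.0                      # manufacturer's stated level, mg

xbar = statistics.mean(x)           # sample mean
s_x = statistics.stdev(x)           # sample standard deviation (n - 1 denominator)
se = s_x / math.sqrt(len(x))        # standard error of the mean

t_crit = 2.7765                     # t-table value: 4 degrees of freedom, 95%, two-sided
width = t_crit * se
x_lower, x_upper = xbar - width, xbar + width

# Does the 95% confidence interval contain the claimed value?
claim_inside = x_lower <= claimed <= x_upper

print(f"mean = {xbar:.2f} mg, se = {se:.3f} mg")
print(f"95% CI: {x_lower:.2f} to {x_upper:.2f} mg; contains 14 mg? {claim_inside}")
```

Because the claimed value falls inside the interval, the script reaches the same conclusion as the text: the difference is not significant at the 95% level.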

Note that we must be very careful in how we phrase our conclusion. Even though the confidence interval includes the value 14 mg, we have not proven that the manufacturer’s claim is true. In other words:

• we do not prove that [nicotine] = 14 mg/cigarette. We can only state that there is a 95% probability that the true nicotine content is somewhere between 13.29 and 17.93 mg; our best estimate of the nicotine content is 15.61 mg.

• we cannot prove (with 95% probability) that [nicotine] ≠ 14 mg/cigarette, since the 95% confidence interval contains this value.

We have just had our first brush with hypothesis testing, where we use data (containing random error) from an experiment to test an assertion. This is obviously an important area of statistics, and one that we will discuss in detail.



Formal Hypothesis Testing

Introduction

In the last section, a confidence interval was constructed in order to test a specific hypothesis. In scientific endeavors, there are a wide variety of different types of hypotheses that may need to be tested using the results of one or more experiments. In this section, we will formalize the procedure to be used in hypothesis testing. Although the procedure may seem a little rigid, it can be adapted to almost any situation. The price for the general applicability of the procedure is the use of somewhat abstract language and concepts.

Null and Alternate Hypotheses

All hypothesis tests actually involve at least two statements, called the null hypothesis (H0) and the alternate (or working) hypothesis (H1). A statistical hypothesis is an assertion or conjecture concerning one or more population parameters. Basically, this step is a translation from words to population parameters. The null hypothesis, H0, will generally involve an equality and one or more population parameters. In our nicotine example, the null hypothesis would be:

null hypothesis H0: µx = 14 mg/cigarette

In other words, we accept as the null hypothesis the manufacturer’s claim that each cigarette contains 14 mg of nicotine. If the null hypothesis is true, and if there is no bias in the measurements, then the population mean µx of all measurements will be 14 mg. As you can see, the null hypothesis involves a population parameter (µx, the population mean of the measurements) and a statement of equality. As we will stress time and again, the null hypothesis cannot be proven. It is assumed as fact unless the data prove otherwise.

The alternate hypothesis, H1, will be a statement involving the same population parameters, in such a way that H1 and H0 cannot both be true. Usually the alternate hypothesis involves one of the following relational operators: ≠, <, or >. For our example,

alternate hypothesis H1: µx ≠ 14 mg/cigarette (two-tailed test)

Alternate hypotheses such as this one, with a “not equals” (≠) relationship, result in two-tailed tests. This statement claims that the measurement population mean is not 14 mg; if we assume no measurement bias, this hypothesis disputes the manufacturer’s claim of nicotine level.

The form of both hypotheses is very important, particularly that of the alternate hypothesis. This is because we are testing the alternate hypothesis in the hypothesis test procedure.

Suppose we actually suspect that the manufacturer is underestimating the nicotine level in the cigarettes; in this case, we would use the following alternate hypothesis:

a different alternate hypothesis H1: µx > 14 mg/cigarette (one-tailed test)

or, H1: “the true nicotine content is greater than 14 mg/cigarette”



This form of H1 would result in a slightly different hypothesis test. Alternate hypotheses such as this one, with a greater than (>) or less than (<) relationship, result in one-tailed tests.

In the hypothesis testing procedure, we assume that the null hypothesis is true, and it is not tested. The goal of the procedure is to test the assertion embodied by the alternate hypothesis, H1. If H1 is proven to be true, then obviously H0 will be false. This format is exactly the same as that of the US criminal legal system, as represented in the famous statement “innocent until proven guilty.” In statistical hypothesis testing, H0 is assumed to be true unless H1 can be proven to be true with reasonable certainty.

Procedure for Formal Hypothesis Tests

For easy reference, here is a list of the steps in hypothesis testing; each step will be discussed in detail.

1. Form the null hypothesis, H0, and the alternate hypothesis, H1, in terms of statistical population parameters.

2. Choose the desired confidence level. This choice is also sometimes expressed as the significance level.

3. Choose a test statistic and calculate it.

4. Calculate the critical values; alternatively, determine the P-value of the test statistic.

5. State the conclusion clearly, avoiding statistical jargon.

Step 1: State the null hypothesis (H0) and the alternate hypothesis (H1)

We have described the null and alternate hypotheses. Formulating these is the most difficult but also the most crucial part of the test procedure. Remember that we begin with an assumption that H0 is true, and that we are trying to test H1. We may be interested in either proving or disproving H1.

The following table gives the null hypotheses for three common statistical tests. Note that the null hypothesis always involves population parameters, and (in these cases) is expressed as an equality.



| Situation | Form of the null hypothesis | Answers the question: |
|---|---|---|
| comparison of a random variable, x, and a fixed value, k | H0: µx = k | Is there a significant difference between the mean of some measurements and some fixed value? |
| comparison of the means of two variables, x and y | H0: µx = µy | Is there a significant difference between the means of two sets of measurements? |
| comparison of the variances of two variables, x and y | H0: σx² = σy² | Is there a significant difference between the variances of two sets of measurements? |

The alternate hypotheses, H1, in these cases may involve an inequality (≠) or a relational operator (< or >). As discussed previously, the form of H1 determines whether we use a one-tailed or a two-tailed test.

Step 2: Choose the desired level of confidence/significance

Remember that any confidence interval has an associated confidence level. The purpose of a confidence interval is to “bracket” the possible values for a population parameter such as µx. Random variables always add a little “spice” (i.e., uncertainty) to any conclusion; there is always a chance that we are wrong, since random variables are, well, random. So the confidence level is needed to state the probability that the population parameter is truly contained within our confidence interval. It is a measure of how much we trust the interval, how “confident” we are in our result.

Since confidence intervals play a crucial role in hypothesis testing, it is not surprising that we generally choose a confidence level when testing assertions using the results of experiments, which are almost always random variables. The meaning of the confidence level in hypothesis testing is slightly different than in confidence intervals, however.

Consider our example. We have two competing hypotheses: H0: µx = 14 mg and H1: µx ≠ 14 mg. We are testing the alternate hypothesis, H1, and there are two possible outcomes:

1. We succeed in proving that H1 is true, in which case H0 is known to be false.

2. We fail to prove that H1 is true. [Remember! We cannot prove that H0 is true.]

The confidence level in hypothesis testing measures our certainty when we succeed in proving H1. It is the probability that the conclusion that H1 is true and H0 is false is correct. Let’s assume that we want to test at the 95% level for our example. That means that, if our test proves that the nicotine level is not 14 mg, there is a 95% probability that our data have led us to the proper conclusion.

You might wonder: why wouldn’t I want to be very certain in my conclusion? In other words, shouldn’t I always choose a high confidence level in hypothesis testing (at least 95%, and maybe 99% or even 99.9%)? We will defer a discussion of the appropriate confidence level in testing to later in the chapter. But for now, ask yourself this question: why don’t you similarly always choose a high confidence level in constructing confidence intervals? A 95% confidence interval is commonly given; why not always use 99%, or 99.9%? What effect would that have on the confidence interval? There are both advantages and disadvantages in choosing high confidence levels, as we will discover.

In statistics, the term significance level is probably more common than confidence level in hypothesis testing. The significance level (SL) is directly related to the confidence level (CL): SL = 100% − CL. Thus, instead of testing at the 95% confidence level, we may instead test at the 5% significance level and arrive at the same conclusions. Although we will tend to use the term “confidence level” in this text, you should be familiar with both terms.

Step 3: Choose a test statistic and calculate its value

The next step in hypothesis testing is to choose a statistic (the test statistic) appropriate for testing the hypotheses. The test statistic (like any statistic) is a value calculated in some manner from the data. Since the data presumably contain random error, the test statistic will likewise be a random variable. There are two requirements for a test statistic:

1. Its probability distribution must be known; preferably, tables of critical values exist for the statistic.

2. The test statistic should result in a reasonably “good” (or “efficient”) hypothesis test. What factors might make one test better than another? Let’s come back to that point in a little bit.

In example 5.1, the null and alternate hypotheses both deal with the population mean µx of the measurements, so it would seem that we could certainly use the sample mean of the measurements as the basis for the test statistic. In constructing a confidence interval for µx, the t-distribution is used (when σx is not known). This suggests that the following test statistic, T, could be used in this hypothesis test:

possible test statistic: $T = \dfrac{\bar{x}_n - 14}{s(\bar{x}_n)}$

The test statistic is the studentized sample mean. It has a t-distribution; if H0 is true, then µT = 0.

The sample mean is not the only possible basis of the test statistic. Instead, we could use the sample median, or some other form of weighted average. It turns out that for normally distributed data, the studentized sample mean is the best test statistic to use for hypothesis tests such as the one in example 5.1.

Let’s calculate the observed value for the test statistic for the five measurements in example 5.1:

$T_{obs} = \dfrac{\bar{x} - 14\ \text{mg}}{se} = 1.9243$   (the “studentized” mean: the number of standard deviations of the mean from 14 mg)

In this equation, “se” is the standard error of the sample mean, xbar. According to the observed test statistic, the mean of the measurements, 15.61 mg/cigarette, is 1.92 standard deviations from the manufacturer’s claimed value of 14 mg/cigarette.
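The observed test statistic can be computed directly from the raw data. The following is a minimal sketch using only the Python standard library; the function name `studentized_mean` is an illustrative choice, not something defined in this text.

```python
import math
import statistics

def studentized_mean(x, k):
    """Return (sample mean - k) divided by the standard error of the mean."""
    se = statistics.stdev(x) / math.sqrt(len(x))
    return (statistics.mean(x) - k) / se

# Nicotine measurements from example 5.1, tested against the claimed 14 mg
nicotine = [14.05, 14.33, 16.36, 18.55, 14.76]
t_obs = studentized_mean(nicotine, 14.0)
print(f"T_obs = {t_obs:.4f}")
```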



Step 4: Calculate the critical value(s) or the P-value

It is important to keep in mind that the null hypothesis, H0, is “innocent until proven guilty.” The probability distribution of the test statistic, T, assuming that the null hypothesis is true, is called the null distribution. The next step in hypothesis testing is to calculate the critical value(s) of the null distribution.

For two-tailed tests, such as the one we must use for example 5.1, there are two critical values. (One-tailed tests only have a single critical value.) The null distribution of T is a t-distribution with four degrees of freedom and a mean of zero. Recalling that we chose 95% as our confidence level, the critical values are

Tcrit = ± t4,0.025 = ± 2.7765

[Figure 5.1: null distribution plotted against the test statistic (horizontal axis: Test Statistic, from −4 to 4). The central 95% of the area lies between the lower critical value, T = −2.7765, and the upper critical value, T = +2.7765. Between the critical values: accept H0. Below the lower or above the upper critical value: accept H1, reject H0.]

Figure 5.1: Decision criteria for the hypothesis test for example 5.1. If the observed test statistic is above the upper critical value, or below the lower critical value, then we accept the alternate hypothesis, H1, and reject the null hypothesis, H0.

The critical values are the boundaries between two decision-making regions:

• the acceptance region, between the two critical values. If the test statistic assumes a value in this region, then the null hypothesis, H0, is accepted. We cannot prove the alternate hypothesis, H1, with the desired confidence level.



• the rejection region, where Tobs > Tupper or Tobs < Tlower. If the test statistic is in this region, then H0 is rejected and H1 is accepted. We have proven that H1 is true at the desired confidence level.

By inspecting the null distribution, we can see how the critical values are chosen, and we can understand the role of the confidence level in hypothesis testing. Figure 5.1 shows the situation for a two-tailed test at the 95% confidence level. We choose the critical values so that 95% of the area under the null distribution is between the critical values. What this means is that, if the null hypothesis is true, there is a 95% probability that the observed test statistic will be within the acceptance region.
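The decision rule for a two-tailed test amounts to a simple comparison. Here is a sketch of that rule; the function name and the wording of the returned strings are illustrative choices, and the critical value is taken from the t-table as in the text.

```python
def decide_two_tailed(t_obs, t_crit):
    """Apply the two-tailed decision rule for symmetric critical values +/- t_crit."""
    if t_obs > t_crit or t_obs < -t_crit:
        # rejection region: outside the interval between the critical values
        return "accept H1 (reject H0)"
    # acceptance region: between the two critical values
    return "accept H0 (H1 not proven)"

# Example 5.1: observed test statistic 1.9243, critical values +/- 2.7765
print(decide_two_tailed(1.9243, 2.7765))
```

For example 5.1 the observed value lies between the critical values, so the rule returns the "accept H0" branch.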

It is not strictly necessary to calculate the critical values. An alternative approach makes use of the concept of the P-value, which has been mentioned before. The P-value can be interpreted in terms of the null distribution; in particular, for a two-tailed test, the P-value is

two-tailed P-value: $P_{obs} = P(T > T_{obs}) + P(T < -T_{obs}) = 2 \cdot P(T > T_{obs})$

Consider example 5.1: the mean of five measurements of nicotine content was 15.61 mg/cigarette, which is 1.92 standard deviations from the manufacturer’s claimed value. Most statistical programs and spreadsheets will also calculate the P-value; for example 5.1, the two-tailed P-value is

Pobs = 0.1266

In other words, if the null hypothesis were true, there is a 12.66% probability that we would obtain a sample mean that is farther than 1.92 standard deviations from 14 mg/cigarette (in either direction).
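For readers without statistical software at hand, the two-tailed P-value can be approximated by integrating the t-distribution density numerically. The sketch below uses Simpson's rule with the Python standard library; this is a stand-in chosen here for illustration, not the method used by any particular program, and the function names `t_pdf` and `t_upper_tail` are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t      # the density is symmetric about zero

# Example 5.1: T_obs = 1.9243 with 4 degrees of freedom
p_two_tailed = 2 * t_upper_tail(1.9243, 4)
print(f"two-tailed P-value = {p_two_tailed:.4f}")
```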

The P-value is used instead of (or in addition to) critical values. It indicates the weight of the evidence in favor of the alternate hypothesis: the smaller the P-value, the less likely it is that random variability can account for the observed data.

To tie the P-value approach with the “critical region” approach, consider this: the P-value tells us the maximum value of the confidence level that we can adopt and still prove the alternate hypothesis. We calculate this value by

maximum confidence level: $CL = 100\% \cdot (1 - P_{obs})$

where CL is the confidence level as a percentage. For example 5.1, if we choose a confidence level of 87.34% or less, then we can prove that the alternate hypothesis is true. Of course, a smaller confidence level means that we are less confident of our conclusion, so we want a P-value as small as possible.

We may more directly interpret the P-value in terms of the significance level. The P-value is the smallest significance level at which we may accept the alternate hypothesis. Thus, in this example, we can prove H1 at the 12.66% significance level, at best. Remember: a smaller significance level means we are more certain of this conclusion.
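As a quick check of the arithmetic for example 5.1, using the two-tailed P-value of 0.1266 (the subscripted names below are introduced here only for illustration):

```latex
\[
CL_{\max} = 100\% \cdot (1 - P_{obs}) = 100\% \cdot (1 - 0.1266) = 87.34\%,
\qquad
SL_{\min} = 100\% - CL_{\max} = 12.66\%
\]
```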

Aside: calculating P-values in Excel



When the null distribution is a t-distribution, the P-value is calculated in Excel by using the TDIST() function:

calculating P-values in Excel: Pobs = tdist(Tobs, df, tails)

where Tobs is the observed value of the test statistic, df is the number of degrees of freedom of the t-distribution, and tails is either 1 or 2 (for one- or two-tailed P-values). For example 5.1, you would enter “= tdist(1.9243, 4, 2)” into any cell to obtain the two-tailed P-value.

Other Excel functions would be needed when the null distribution is not a t-distribution.

Step 5: State the conclusion

After we decide whether to accept H0 or H1, we must state our conclusion in a manner that is accurate and yet can be understood by anyone who does not have a background in statistics. Essentially, we must translate our conclusions from “statistic-ese” (e.g., “reject H0, accept H1”) to normal language. We should give both our conclusion and the confidence level, even though the confidence level is most properly understood in a statistics framework.

For example 5.1, we accepted H0; we couldn’t prove H1. In other words, our conclusion would be:

We cannot prove with 95% confidence that the nicotine level in the cigarettes is different than 14 mg/cigarette.

This statement sounds like poor English (basically a double negative), but the wording was very carefully chosen. We begin with the assumption that the cigarettes have 14 mg of nicotine, and we fail to prove any differently. This is similar to a jury returning a verdict of “Not Guilty” in a criminal trial. Notice that the verdict is not that the defendant was innocent, simply that guilt was not proven beyond a “reasonable doubt.” In hypothesis testing, the level of “reasonable doubt” is determined when the confidence level is set.

Examples

Let’s try another two-tailed test. This test is similar in nature to example 5.1.

Example 5.2
A certain analytical procedure is being tested for the presence of measurement bias. Twenty measurements are made on a solution whose concentration has been certified at 1.000 µM. The sample mean is 1.010 µM, with an RSD of 5.0% for the individual measurements. Is there any evidence of measurement bias?



First let's set up the null and alternate hypotheses

H0: µx = 1.000 µM   There is no bias in the measurements.

H1: µx ≠ 1.000 µM   Bias exists; two-tailed test.

$\xi_x = 1.000$ µM,   $\bar{x} = 1.010$ µM,   $RSD = 5.0\%$

$s_x = RSD \cdot \bar{x} = 0.0505$ µM

$std\_err = s_x / \sqrt{20} = 0.0113$ µM

Let's use the studentized mean as the test statistic, and calculate the observed test statistic:

$T_{obs} = \dfrac{\bar{x} - \xi_x}{std\_err} = 0.8856$   (the sample mean is this many standard deviations from the true value)

$P_{obs} \approx 0.387$   (the two-tailed P-value of the observed value of the test statistic)

Now we look up the critical values from the t-tables. For 19 degrees of freedom, a 95% confidence level and a two-tailed test, the critical values are −2.0930 and +2.0930. Since the observed value of the test statistic is within the acceptance region, we must accept the null hypothesis. Thus, we cannot prove bias in these measurements at the 95% confidence level.

Note: from the observed P-value for this example, we see that we can only prove H1 with about 61% confidence, at best.
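The test in example 5.2 starts from summary statistics rather than raw data, which makes it easy to script. The sketch below assumes the standard error is the sample standard deviation divided by the square root of n = 20; the variable names are illustrative, and the critical value is copied from the t-table.

```python
import math

n = 20             # number of measurements
mu_true = 1.000    # certified concentration, micromolar
xbar = 1.010       # sample mean, micromolar
rsd = 0.050        # relative standard deviation of the individual measurements

s_x = rsd * xbar                     # sample standard deviation, micromolar
std_err = s_x / math.sqrt(n)         # standard error of the mean, micromolar
t_obs = (xbar - mu_true) / std_err   # studentized mean

t_crit = 2.0930    # t-table value: 19 degrees of freedom, 95%, two-tailed
bias_proven = abs(t_obs) > t_crit

print(f"T_obs = {t_obs:.4f}; bias proven at the 95% level? {bias_proven}")
```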

Now let’s try a one-tailed test.

Example 5.3
It is suspected that a series of tests of blood alcohol level proves that the alcohol level is above the legal limit of 0.10%. The measurements are:

0.106 0.118 0.097 0.127 0.134 0.141

Do these measurements prove legal intoxication with 95% confidence?

As always, the first step is to set up the null and alternate hypotheses. In this case, we should use the following:

null H0: µx = 0.10 %   “blood alcohol level at the legal limit (assuming no bias)”
alternate H1: µx > 0.10 %   “blood alcohol level above the legal limit”

It may be a little difficult to see why the null hypothesis should be that the blood alcohol level is exactly 0.10 %. In setting up the hypotheses, it is best to always ask yourself, what is it that I want to test? What are the possible conclusions? The answers to these questions determine the form of the alternate hypothesis; the null hypothesis will follow.

For this example, we want to test whether or not the alcohol level is above the legal limit. Remember that the purpose of the statistical test procedure is actually to test the alternate hypothesis, so that we would propose as the alternate hypothesis that the alcohol level is too high. The nature of the testing procedure is such that we either prove or fail to prove this hypothesis; i.e., our conclusion will be either that we can prove that the alcohol level is too high (a “guilty” verdict) or that we cannot prove an excessive alcohol level (“not guilty”). These conclusions are proper for our intentions in this example. Since we propose that µx > 0.10 % is our alternate hypothesis statement, the corresponding null hypothesis is µx = 0.10 %.

The other thing to notice about the form of H1 in this example is that it results in a one-tailed test. This will affect the critical values (and the P-value, if we calculate it). Let’s continue with our testing procedure. We can proceed by calculating the observed test statistic.

$x = (0.106,\ 0.118,\ 0.097,\ 0.127,\ 0.134,\ 0.141)^T$ %

$\bar{x} = \operatorname{mean}(x) = 0.1205$ %

$std\_err = \operatorname{stdev}(x) / \sqrt{6} = 0.0069$ %

Let's calculate the observed test statistic:

$T_{obs} = \dfrac{\bar{x} - 0.10\,\%}{std\_err} = 2.9865$   (studentized measurement mean)

$P_{obs} = 0.0153$   (the probability of seeing a value larger than Tobs is 1.53%)

The P-value is standard output for many statistical programs. In this case, the one-tailed P-value is 1.53%, which means that we can prove H1 at the 98.47% confidence level, if we desired; certainly at the 95% level we may reject H0 and accept H1. However, it is difficult to use t-tables to calculate Pobs, so we will confirm this decision using the critical value approach.

For a one-tailed test, there is only a single critical value, as shown in the next figure.



[Figure 5.2, upper panel: null distribution plotted against the test statistic (horizontal axis: Test Statistic, from −4 to 4), with 95% of the area to the left of the single critical value. Lower panel: decision regions for H0: µ = k and H1: µ > k; accept H0 below the critical value, accept H1 above it.]

Figure 5.2: An example of a one-tailed test. There is only a single critical value. The top figure shows the null distribution. The critical value is chosen such that the area under the curve to the left of the critical value equals the appropriate confidence level (95% for this example). The lower figure shows the decision process: if the observed test statistic is larger than the critical value, Tobs > Tcrit, then the null hypothesis is rejected and the alternate hypothesis is proven.

Recall that the null distribution is the probability distribution of the test statistic, T, assuming that H0 is true. As the upper figure shows, we must choose the critical value such that, for the null distribution,

P(Tobs < Tcrit) = CL

where “CL” is the chosen confidence level. For our example, we have chosen a confidence level of 95%. We can determine the critical value from the t-tables:

one-tailed critical value: $T_{crit} = t_{\nu,\alpha} = t_{5,\,0.05}$

where ν is the appropriate number of degrees of freedom, and α is the area in the right tail of the t-distribution. We determine the value of α from the confidence level: $CL = (1 - \alpha) \cdot 100\%$.

For our example, the t-tables tell us that the critical value is Tcrit = 2.0150. If you recall, the observed value of the test statistic was 2.9865; since this is larger than the critical value, we reject the null hypothesis and accept the alternate hypothesis. Our conclusion is:

Assuming no measurement bias, the data show that the blood alcohol level is above the legal limit (at the 95% confidence level).
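The whole one-tailed test for example 5.3 can be sketched in a few lines. The critical value is taken from the t-table; the one-tailed P-value is approximated by numerically integrating the t-distribution density with Simpson's rule, which is an illustrative substitute for a statistics package. The function names are hypothetical.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Blood alcohol measurements from example 5.3 (percent)
x = [0.106, 0.118, 0.097, 0.127, 0.134, 0.141]
limit = 0.10                                   # legal limit, percent

std_err = statistics.stdev(x) / math.sqrt(len(x))
t_obs = (statistics.mean(x) - limit) / std_err
df = len(x) - 1

t_crit = 2.0150                                # t-table value: 5 df, alpha = 0.05, one-tailed
p_one_tailed = t_upper_tail(t_obs, df)
h1_accepted = t_obs > t_crit                   # one-tailed decision rule for H1: mean > limit

print(f"T_obs = {t_obs:.4f}, one-tailed P = {p_one_tailed:.4f}, accept H1? {h1_accepted}")
```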



Errors in Hypothesis Testing

Introduction

Since they involve random variables, there is always an element of uncertainty in hypothesis tests. Specifically, there is always a chance that the conclusion of a test is in error. This uncertainty is the reason that you must specify a confidence level when you perform statistical tests. Choosing the confidence level allows you to determine the degree of the uncertainty in your test: basically, you can control the likelihood that your conclusion is correct. As we will see, the confidence level also indirectly determines the ability of the statistical test to detect and label small differences as “significant.”

How can the conclusion from a hypothesis test be in error? For tests with a single null hypothesis, H0, and a single alternate hypothesis, H1, the following table shows all the possibilities:

| reality | decision: accept H0 (“negative” result) | decision: accept H1 (“positive” result) |
|---|---|---|
| H1 is not true: | correct | false positive |
| H1 is true: | false negative | correct |

Let’s illustrate with an example. Let’s say someone undergoes a pregnancy test. Now the realityof the matter is that the person is either pregnant or she isn’t .The test will either decide in favorof pregnancy (called a “positive” test result) or will decide that the subject is not pregnancy (a“negative” result).

We can draw an analogy to statistical hypothesis tests. We begin with the assumption (the nullhypothesis) that the subject is “not pregnant.” The alternate hypothesis, the one we want to test,is that the subject is pregnant. A conclusion in favor of pregnancy (H1 is accepted) is considered apositive test result; however, if the subject actually is not pregnant (H0 is actually true), then ourconclusion is in error. This situation − an incorrect acceptance of H1 − is called a false positive.On the other hand, if the conclusion of the test is that the subject is not pregnant (H0 is accepted),and this conclusion is in error (H1 is actually true), then the test gives a false negative.

In the remainder of this section, we will describe how to calculate the probability that the result of a hypothesis test is in error (either a false positive or false negative).

False Positive Errors

All of the hypothesis tests presented so far in this chapter have been of the following type: the null hypothesis is

H0: µx = k the true measurement mean is some fixed value, k

while the alternate hypothesis is one of the following:



H1: µx ≠ k the true measurement mean is not some fixed value, k (a two-tailed test)

H1: µx > k the true measurement mean is larger than some fixed value, k (a one-tailed test)

H1: µx < k the true measurement mean is smaller than some fixed value, k (a one-tailed test)

The decision criterion of the test is the following: if the observed test statistic, Tobs, is outside of the interval defined by the critical value(s), then we reject H0 and accept H1. A false positive occurs when Tobs is outside the H0 acceptance region when, in fact, H0 is true. The probability of a false positive is controlled by choosing the appropriate confidence level in a statistical test. To be exact,

CL = 1 − α

where CL is the chosen confidence level, and α is the probability of a false positive. In other words, when testing at the 90% confidence level, there is a 10% chance of falsely accepting H1.

Let’s imagine that we are comparing a mean value, µx, to a fixed value k. Unknown to us, the null hypothesis is actually true. The following figure shows the null distribution of the test statistic, i.e., the probability distribution of the test statistic when the null hypothesis is actually true.

[Figure: the null distribution plotted against the test statistic (axis from −4 to 4), with the two critical values marked and an area of α/2 shaded in each tail.]

Figure 5.3: choosing the critical values for a two-tailed test. If Tobs occurs between the critical values, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area in both tails is the probability of a false positive: it is the probability that Tobs does not fall between the critical values, even though it “should,” since H0 is true.

Now we can see how the critical values are chosen for two-tailed tests: each tail must contain an area of α/2, so that the total probability of a false positive is α, the desired value.
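As a rough check on this reasoning, the Python sketch below simulates a two-tailed test many times with H0 true and counts how often Tobs lands outside the critical values. It assumes the test statistic follows a standard normal (z) distribution, which is a simplification chosen for illustration; the function name, the number of trials, and the random seed are arbitrary choices and do not come from the text.

```python
import random
from statistics import NormalDist


def simulate_false_positive_rate(alpha, n_trials=200_000, seed=1):
    """Estimate how often a two-tailed z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)
    # Upper critical value: leaves an area of alpha/2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # ASSUMPTION: under H0 the test statistic is standard normal.
    rejections = sum(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(n_trials))
    return rejections / n_trials


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        rate = simulate_false_positive_rate(alpha)
        print(f"CL = {1 - alpha:.0%}: simulated false positive rate = {rate:.4f}")
```

The simulated rates should come out close to α in each case, differing only by sampling noise.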



Now let’s consider the probability of false positive error for a one-tailed test. In such a test, there is only a single critical value. Let’s imagine that we are testing for values that are greater than a fixed value, k; in other words, our alternate hypothesis is H1: µx > k. The next figure shows the null distribution, together with the critical value and the probability of false positive.

[Figure: the null distribution plotted against the test statistic (axis from −4 to 4), with a single critical value marked and an area of α shaded in the upper tail.]

Figure 5.4: choosing the critical value for a one-tailed test. If Tobs is less than the critical value, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area in the upper tail is the probability of a false positive. Note that the critical value was chosen such that the probability of false positive, α, is the same as in figure 5.3.

To summarize, we set the probability of false positive error when we choose the confidence level. We must then choose the critical values according to our desired value of α. This means that, for a two-tailed test, the area in each tail of the null distribution must be α/2; for a one-tailed test, the area in the single tail (since there is only one critical value) will be α.
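The rule just summarized can be sketched in a few lines of Python. This is only an illustration under the assumption that the null distribution is the standard normal (z) distribution; for a t statistic the same areas would be looked up in a t-table instead. The function name and the tail labels ("two", "upper", "lower") are hypothetical.

```python
from statistics import NormalDist


def critical_values(confidence_level, tails="two"):
    """Return the z critical value(s) for the chosen confidence level."""
    alpha = 1 - confidence_level          # probability of a false positive
    z = NormalDist()                      # ASSUMPTION: standard normal null distribution
    if tails == "two":
        upper = z.inv_cdf(1 - alpha / 2)  # area of alpha/2 in each tail
        return (-upper, upper)
    if tails == "upper":                  # H1: mean > k
        return (z.inv_cdf(1 - alpha),)    # area of alpha in the upper tail
    if tails == "lower":                  # H1: mean < k
        return (z.inv_cdf(alpha),)        # area of alpha in the lower tail
    raise ValueError("tails must be 'two', 'upper', or 'lower'")


if __name__ == "__main__":
    for cl in (0.90, 0.95, 0.99):
        two = critical_values(cl, "two")
        one = critical_values(cl, "upper")
        print(f"CL = {cl:.0%}: two-tailed ±{two[1]:.4f}, one-tailed {one[0]:.4f}")
```

At the same confidence level, the one-tailed critical value is smaller in magnitude than the two-tailed ones, because the whole of α sits in a single tail.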

False Negative Errors

A false negative occurs when we incorrectly accept H0 when we should actually reject H0 and accept H1. In other words, the alternate hypothesis is actually true, but the test statistic still falls within the acceptance region (so that the null hypothesis is accepted). The next figure shows the probability distribution of the test statistic when the alternate hypothesis is true.



[Figure: the true distribution of the test statistic when H1 is true, plotted against the test statistic (axis from −2 to 7). The critical value separates the “accept H0” region on the left from the “accept H1” region on the right; the shaded area to the left of the critical value is β, the probability of a false negative.]

Figure 5.5: This figure shows the probability distribution (not the null distribution) of the test statistic in a situation when the alternate hypothesis is actually true (in this case, µx > k). However, if the test statistic is less than the critical value, shown in the figure, then the null hypothesis will be accepted: this would be a false negative error. The shaded area shows the probability, β, of this occurring.

As we see in the figure, even when the alternate hypothesis is true, there is some chance that the test statistic will be less than the critical value. This chance is the probability of false negative error, β.
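The shaded area can be computed once a value is postulated for the true mean. The sketch below does this for an upper-tailed test, assuming a z statistic with unit standard deviation whose distribution is simply shifted when H1 is true; the function name and the example shifts are hypothetical.

```python
from statistics import NormalDist


def false_negative_probability(shift, alpha):
    """beta for an upper-tailed z-test.

    shift: (true mean - k) / standard error, i.e. the mean of the test
           statistic when H1 is true (a postulated value).
    alpha: chosen probability of a false positive.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # ASSUMPTION: when H1 is true the statistic is Normal(shift, 1).
    # beta is the area of that distribution below the critical value.
    return NormalDist(mu=shift, sigma=1.0).cdf(z_crit)


if __name__ == "__main__":
    for shift in (0.5, 1.0, 2.0, 3.0, 4.0):
        beta = false_negative_probability(shift, alpha=0.05)
        print(f"shift = {shift:.1f} standard errors: beta = {beta:.3f}")
```

The larger the postulated difference between the true mean and k, the smaller β becomes.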

In order to calculate β, we must know the value of the population parameter, µx. We can always calculate the value of β for some hypothetical situation in which we postulate a value for the population parameter. This type of exercise would give us some idea of how “sensitive” our testing procedure is to situations in which the null hypothesis is false. The next example illustrates this point.



Example 5.4

You wish to develop a procedure to test for bias in the analysis of fluoride in water. During the analytical procedure, three independent measurements are obtained on a sample, and averaged to determine the fluoride concentration. The standard solution to be used in the test is known to contain 0.45 w/w% F, and the RSD of the entire analytical procedure is known to be 0.10 (i.e., 10% RSD for the average of the three measurements).

(a) What are the critical values that can be used to determine if there is bias in a measurement?

(b) What values of population measurement mean, µx, would result in a 90% probability that bias will be detected? In other words, what bias would result in acceptance (with 90% probability) of the alternate hypothesis in part (a)?

The true fluoride concentration, ξx, of the standard solution is 0.45 w/w%. The analytical procedure in this situation consists of obtaining three measurements and averaging them to obtain a point estimate of the fluoride concentration. We can calculate the standard error of the mean of three measurements:

ξx = 0.45 w/w%     RSD = 0.10

σoverall = RSD · ξx = 0.0450 w/w%

(the true standard error, a population parameter, is known)

The null and alternate hypotheses will be

H0: µx = ξx there is no measurement bias

H1: µx ≠ ξx measurement bias exists (two-tailed test)

One thing is different about this hypothesis test, compared to all the others we have done: the true (i.e., population) standard deviation of the mean, σ(x̄n), is known. Thus, the test statistic will be the standardized difference between the mean of three measurements and the true concentration of the solution:

test statistic T = (x̄3 − ξx) / σ(x̄3)

where x̄3 is the mean of 3 measurements. Assuming that x̄3 is distributed normally, T will follow a normal distribution with a standard deviation of one. The null distribution, which assumes that µx = ξx, will follow a z-distribution (i.e., a standard normal distribution).

Let’s set our confidence level at 99%; in other words, we are limiting the probability of false positives to 1%: α = 0.01. Now we can find the critical values. From the z-tables, we see that z0.005 ≈ 2.575 (you should verify this; the actual value is 2.5758, as reported by Excel). Our decision rules for this hypothesis test are:

• if −2.5758 < Tobs < 2.5758, then accept H0. We cannot prove measurement bias with 99% confidence.



• if Tobs < −2.5758 or Tobs > 2.5758, then reject H0 and accept H1. We can prove bias with 99% confidence.
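These two rules can be written as a short Python function. The sketch uses the known standard error from this example (0.0450 w/w%) and the true concentration of 0.45 w/w%; the function name and the observed means in the usage loop are made-up values for illustration, not data from the example.

```python
from statistics import NormalDist


def bias_test(mean_of_three, true_conc=0.45, std_error=0.0450,
              confidence_level=0.99):
    """Two-tailed z-test for bias; returns (T_obs, decision)."""
    alpha = 1 - confidence_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 2.5758 at 99%
    # Standardized difference between the observed mean and the true value.
    t_obs = (mean_of_three - true_conc) / std_error
    if abs(t_obs) > z_crit:
        return t_obs, "reject H0, accept H1 (bias detected)"
    return t_obs, "accept H0 (bias not shown)"


if __name__ == "__main__":
    # HYPOTHETICAL observed means, in w/w%.
    for observed in (0.50, 0.60, 0.32):
        t_obs, decision = bias_test(observed)
        print(f"mean = {observed:.2f} w/w%: T_obs = {t_obs:+.3f} -> {decision}")
```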

In this instance, it is useful to note that there is an equivalent way of stating these decision rules: if the observed measurement mean, x̄3, is more than 2.5758 standard errors from the true concentration, ξx, then we have evidence of bias.

lower critical value = ξx − zcrit · σoverall = 0.3341 w/w%

upper critical value = ξx + zcrit · σoverall = 0.5659 w/w%

Alternate decision rules

• if 0.3341 w/w% < x̄3 < 0.5659 w/w%, then we must accept H0

• if x̄3 < 0.3341 w/w% or x̄3 > 0.5659 w/w%, then we reject H0 and accept H1 at the 99% confidence level
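The critical values in concentration units follow directly from ξx, the RSD, and the z critical value. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and the calculation assumes, as the example does, that the standard error of the mean is known exactly.

```python
from statistics import NormalDist


def critical_limits(true_conc, std_error, confidence_level):
    """Lower and upper critical values for the observed mean (two-tailed)."""
    alpha = 1 - confidence_level
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * std_error
    return true_conc - half_width, true_conc + half_width


XI = 0.45                 # true fluoride concentration, w/w%
RSD = 0.10                # relative standard deviation of the mean of three
SIGMA_OVERALL = RSD * XI  # standard error of the mean: 0.0450 w/w%

lower, upper = critical_limits(XI, SIGMA_OVERALL, 0.99)
print(f"lower critical value = {lower:.4f} w/w%")  # 0.3341
print(f"upper critical value = {upper:.4f} w/w%")  # 0.5659
```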

You should realize that these rules are no different from the first ones; they would result in exactly the same conclusion for a given set of data. These rules just give another way of looking at the hypothesis test process.

Now let’s look at part (b). We want to find the measurement population mean, µx, that would result in a 90% chance that measurement bias would be detected. Let’s imagine that there is actually a certain amount of positive bias in the measurements. The probability that the bias will actually be detected is the area under the probability distribution curve that is greater than the upper critical value. In other words, if we want to find the minimum amount of positive bias that will be detected with a 90% probability, we need to find the measurement mean, µx, that satisfies:

P(x̄3 > 0.5659 w/w%) = 0.90

This situation is shown in the following figure.



[Figure: the probability distribution of the measurement mean plotted against the measurement mean in w/w% (axis from 0.0 to 0.8). Dashed vertical lines mark the two critical values, with “accept H1” regions outside them and the “accept H0” region between them; the shaded area inside the acceptance region is β = 0.10.]

Figure 5.6: The critical values associated with the decision rules for two-tailed bias detection at the 99% confidence level are represented by the dashed vertical lines. The probability distribution describes the mean of three positively biased measurements, and results in β = 0.10; in other words, for measurements described by this distribution, there is a 10% chance of a false negative result to bias testing at the 99% confidence level.

From the z-tables, we know that z0.90 = −1.2816 gives a right-tailed area of 0.90. We must solve for µx in the following expression:

(xcrit − µx) / σ(x̄3) = z0.90 = −1.2816

where xcrit is the upper critical value for the testing procedure, and σ(x̄3) is the standard error of the mean of three measurements. Solving for µx gives

µx = xcrit − z0.90 · σ(x̄3) = xcrit + 1.2816 · σ(x̄3)

This is the mean of the probability distribution shown in the figure. Substituting 0.5659 w/w% for the critical value, and a standard error of 0.0450 w/w%, gives µx = 0.6236 w/w%. This corresponds to a bias, γx, of

γx = µx − ξx = 0.1736 w/w%

If you repeat this procedure to find the negative bias that gives β = 0.10, you will find that a bias of γx = −0.1736 w/w% will give the desired false negative probability value.
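The part (b) calculation can be checked numerically. The sketch below solves for the population mean that places 90% of the distribution above the upper critical value, then confirms the detection probability. It assumes a normal distribution for the mean of three measurements and, like the text, neglects the very small chance of falling below the lower critical value; the function name is hypothetical.

```python
from statistics import NormalDist


def mean_for_detection(x_crit, std_error, detection_prob):
    """Population mean giving P(observed mean > x_crit) = detection_prob."""
    # z value with a right-tailed area of detection_prob (-1.2816 for 0.90).
    z_right = NormalDist().inv_cdf(1 - detection_prob)
    return x_crit - z_right * std_error


XI = 0.45        # true concentration, w/w%
X_CRIT = 0.5659  # upper critical value at 99% confidence, w/w%
SE = 0.0450      # standard error of the mean of three, w/w%

mu_x = mean_for_detection(X_CRIT, SE, 0.90)
bias = mu_x - XI
# Check: area of Normal(mu_x, SE) above the upper critical value.
p_detect = 1 - NormalDist(mu_x, SE).cdf(X_CRIT)

print(f"mu_x = {mu_x:.4f} w/w%")          # 0.6236
print(f"bias = {bias:.4f} w/w%")          # 0.1736
print(f"P(detect) = {p_detect:.2f}")      # 0.90
```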

In other words, our calculations tell us that when testing for bias at the 99% confidence level under these conditions, we have a 90% chance of detecting a bias of 0.1736 w/w%. This is useful information. If, for example, the “sensitivity” of our hypothesis test for bias detection is unacceptable, then we have two options: lower our confidence level from 99% (which would move our critical values closer together) or average more measurements to decrease our standard error. We could also try to improve the precision of our method, so that the standard deviation of the individual measurements is smaller.
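To get a feel for these options, the sketch below tabulates the smallest bias detected with 90% probability as the confidence level or the number of averaged measurements changes. Two assumptions go beyond the text and are flagged in the code: the standard deviation of a single measurement is inferred as 0.0450 · √3 w/w%, and the standard error is taken to scale as 1/√n. The function name and the values of n are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_bias(std_error, confidence_level, detection_prob=0.90):
    """Smallest bias detected with probability detection_prob (two-tailed z-test).

    Neglects the far tail of the distribution, as the worked example does.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - (1 - confidence_level) / 2)
    z_power = z.inv_cdf(detection_prob)   # 1.2816 for 0.90
    return (z_crit + z_power) * std_error


SE_THREE = 0.0450                   # standard error for n = 3, from the example
# ASSUMPTION: single-measurement sigma inferred from SE_THREE; SE scales as 1/sqrt(n).
SIGMA_SINGLE = SE_THREE * sqrt(3)

if __name__ == "__main__":
    print("Effect of confidence level (n = 3):")
    for cl in (0.99, 0.95, 0.90):
        print(f"  CL = {cl:.0%}: detectable bias = {detectable_bias(SE_THREE, cl):.4f} w/w%")

    print("Effect of averaging more measurements (CL = 99%):")
    for n in (3, 6, 12):
        se_n = SIGMA_SINGLE / sqrt(n)
        print(f"  n = {n:2d}: detectable bias = {detectable_bias(se_n, 0.99):.4f} w/w%")
```

For n = 3 at the 99% confidence level the function returns the 0.1736 w/w% found above; both lowering the confidence level and increasing n make that figure smaller.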

Summary: Choosing the Confidence Level

Choosing the confidence level directly determines the critical values and the value of α, the probability of a false positive error. Let’s consider a two-tailed test:

H0: µx = k

H1: µx ≠ k

for which there are two critical values, represented on the following number line:

Choosing a larger confidence level will cause the critical values to move further apart. True, this means that there is less chance of a false positive error; however, the power of the test to detect small differences between µx and k has been decreased. In other words, there is a greater chance of a false negative error (i.e., β has increased).
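The trade-off can be seen numerically by holding the true difference fixed and varying the confidence level. The sketch below assumes a two-tailed z-test with a known standard error; the function name is hypothetical, and the difference and standard error in the usage loop are borrowed from Example 5.4 purely for illustration.

```python
from statistics import NormalDist


def beta_two_tailed(difference, std_error, confidence_level):
    """Probability of a false negative for a two-tailed z-test.

    difference: postulated true value of (mean - k).
    """
    z = NormalDist()
    alpha = 1 - confidence_level
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = difference / std_error
    # Area of the shifted distribution lying between the two critical values.
    return z.cdf(z_crit - shift) - z.cdf(-z_crit - shift)


if __name__ == "__main__":
    for cl in (0.90, 0.95, 0.99, 0.999):
        beta = beta_two_tailed(0.1736, 0.0450, cl)
        print(f"CL = {cl:.1%}: alpha = {1 - cl:.3f}, beta = {beta:.3f}")
```

As α is made smaller, β grows for the same true difference, which is the compromise described above.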

Thus there is always a compromise to consider in choosing the confidence level; values of 95% and 99% are very common. The value chosen may depend on the potential consequences of errors. Consider the following situations:

• in employee drug testing, no employer wants to deal with false accusations. In such a situation, a high confidence level (99% or even higher) might be appropriate, because the consequences of a false positive (wrongly accusing an employee of taking drugs) are perceived to be more severe than missing the borderline cases.

• in screening patients for HIV, the consequences of a false negative (incorrectly concluding that the patient is not infected) are very severe. In this case, the confidence level might be set relatively low. To be sure, there will be an increase in false positives, but a separate, independent test can be performed on these patients.



Chapter Checkpoint

The following terms/concepts were introduced in this chapter:

one-tailed test, two-tailed test, null distribution, test statistic, null hypothesis, statistical significance, hypothesis test, statistical hypothesis, false negative, significance test, false positive, significance level, critical value, rejection region, alternate hypothesis, P-value, acceptance region

In addition to being able to understand and use these terms, after mastering this chapter, you should be able to

• use formal hypothesis testing procedures to determine if there is a significant difference between a normally-distributed random variable and a fixed value, using either a one- or two-tailed test

• interpret P-values from a hypothesis test

• explain trade-offs in choosing a confidence level

