
Statistics Lecture Notes – Tests of Hypothesis. Bautista


TESTS OF HYPOTHESIS

INTRODUCTION

Consider an experiment of tossing a coin 100 times. From the discussions on probability,

we are able to compute the theoretical probability of getting a head as 1/2, which is also the

same probability we have for getting a tail. Thus, assuming we have a fair coin, we would

expect 50 heads and 50 tails from our experiment. However, this obviously does not always

happen.

Let’s say we did this experiment and ended up with 48 heads and 52 tails. We would say

that this outcome would still be acceptable since the values are close to 50. What happens

though if we conduct the experiment and it results in only 46 heads? 42 heads? 30 heads?

What values would lead us to believe that the coin is not fair?

Our concern then in hypothesis testing is looking for the critical values beyond which an

observed result would lead us to reject our original notion or assumption. These critical values are

determined using different distributions, depending on the parameter being tested. Some of the

distributions we will be using are the Z-distribution, T-distribution, χ2-distribution, and the F-

distribution.

So if we decide to test the fairness of our coin, we may test if the proportion of heads in

the experiment, denoted by p, is actually equal to 1/2. Alternatively, we can also test if the mean

number of heads that appear in the experiment, denoted by μ, is actually equal to 50. These

statements which we test are called null hypotheses. A null hypothesis is the “default”

statement, and it is the one we are testing.

In most cases, we would want to reject the null hypothesis in favor of what we call the

alternative hypothesis. This is the hypothesis stating that there is a change in the original

parameter we are testing. Let’s say we would like to test the mean number of heads that appear

in our sample experiment. Our null hypothesis, denoted by HO, would be

HO: μ = 50

We note that the null hypothesis always contains the equal sign, as this is the “default”

statement. Given this null hypothesis, our possible alternative hypotheses, denoted by H1 or HA,

would be

H1: μ > 50, H1: μ < 50, H1: μ ≠ 50

The first two alternative hypotheses are the ones we use in one-tailed tests, while the

last one is the one we use in two-tailed tests. Thus if our alternative hypothesis is H1: μ ≠ 50, it

means we are testing if μ > 50 or μ < 50.

Given our data, if we have sufficient evidence that the null hypothesis is not true, then

we reject the null hypothesis in favor of the alternative hypothesis. We may also say that we

accept the alternative hypothesis. Let’s say we establish critical values of 40 and 60 for our

previous example. This means that if we conduct the experiment and it results in fewer than 40

heads, or more than 60 heads, then we have sufficient evidence against our null hypothesis that μ = 50 and hence we

reject it.

On the other hand, if the experiment results in say, 43 heads, then this value is still

inside our acceptance interval and hence we say that we still accept the null hypothesis. Note

that acceptance of the null hypothesis doesn’t mean that it is true; it only means that there was

insufficient evidence to reject it.

Now if the experiment results in 35 heads, we say that we reject the null hypothesis. The

reason for this is either a rare event has occurred, or the null hypothesis is actually not true.

Hence, there is still room for error when we do these tests of hypothesis. When we reject

the null hypothesis when in fact it is true, then we have committed a type I error. The probability

of this error is denoted by α, and is also called the level of significance.

The other type of error is when we fail to reject the null hypothesis when in fact it is false.

This type of error is called a type II error. The probability of this error is denoted by β. The

power of the test is computed by 1 – β. These notions are summarized in the following table.

Decision      Null Hypothesis is True    Null Hypothesis is False

Accept HO     Correct Decision           Type II Error (β)

Reject HO     Type I Error (α)           Correct Decision

We would want to minimize these errors as much as possible. However, for a fixed sample size, as we decrease

the probability of a type I error, we increase the probability of a type II error, and vice versa.

Hence we try to find a balance between these two types of errors. In general, if we increase the

sample size, then the probabilities of both errors can be decreased.
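
The trade-off between the two errors can be seen numerically in the coin example. The sketch below is only an illustration under stated assumptions: it uses the exact binomial distribution, the acceptance interval of 40 to 60 heads discussed earlier, a narrower interval of 42 to 58 heads chosen here for comparison, and a hypothetical biased coin with P(head) = 0.6 for computing β.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_accept(low, high, n, p):
    """P(low <= X <= high) for X ~ Binomial(n, p): the chance we do not reject HO."""
    return sum(binom_pmf(k, n, p) for k in range(low, high + 1))

n = 100
# alpha: we reject HO although the coin is fair (p = 0.5)
alpha_wide = 1 - prob_accept(40, 60, n, 0.5)
alpha_narrow = 1 - prob_accept(42, 58, n, 0.5)
# beta: we fail to reject HO although the coin is biased (assumed p = 0.6)
beta_wide = prob_accept(40, 60, n, 0.6)
beta_narrow = prob_accept(42, 58, n, 0.6)

print(f"accept 40..60: alpha = {alpha_wide:.4f}, beta = {beta_wide:.4f}")
print(f"accept 42..58: alpha = {alpha_narrow:.4f}, beta = {beta_narrow:.4f}")
```

Widening the acceptance interval lowers α but raises β, which is the balance described above.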

STEPS IN HYPOTHESIS TESTING

The following steps will be followed for each of our tests of hypothesis. For organization

and neatness of our solutions, these steps should be outlined when solving our examples and

exercises.

1. Write the null and alternative hypotheses.

2. Indicate the level of significance.

3. Establish the critical regions and the rejection criterion.

4. Compute the test statistic.

5. Decide the conclusion of the test.

The information required in each of these steps should be given in the problem. The test

statistics for each test of hypothesis will be outlined in the next sections.

TESTS CONCERNING MEANS

For tests on single means, we are concerned with testing whether the mean μ is

really equal to a predetermined value μO. Thus the null hypothesis μ = μO would be tested

against the alternative hypothesis μ < μO or μ > μO for a one-tailed test, and against μ ≠ μO for a

two-tailed test.

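For tests on means for two populations, we would be testing the null hypothesis $\mu_1 - \mu_2 = d_0$ against the alternatives $\mu_1 - \mu_2 < d_0$ and $\mu_1 - \mu_2 > d_0$ for the one-tailed test, and $\mu_1 - \mu_2 \neq d_0$ for the two-tailed test. For paired observations, the null hypothesis would be $\mu_D = d_0$ against the alternatives $\mu_D < d_0$ and $\mu_D > d_0$ for the one-tailed test and $\mu_D \neq d_0$ for the two-tailed test.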

In these tests, we will be using the same cases, statistics, and distributions outlined in

the previous chapter. Hence, if we are testing for a single mean where the population standard

deviation, σ, is known, we may use the Z distribution, where our critical values are $z_{1-\alpha/2}$ and $z_{\alpha/2}$ for a two-tailed test, where $z_{1-\alpha/2} = -z_{\alpha/2}$ is the z-value which gives an area of α/2 to the left and $z_{\alpha/2}$ is the z-value which gives an area of α/2 to the right.

Thus if our computed test statistic falls to the left of $z_{1-\alpha/2}$ or to the right of $z_{\alpha/2}$, then this would result in a rejection of the null hypothesis. For the one-tailed tests, the following critical values apply:

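For the one-tailed test μ < μO, the null hypothesis would be rejected if the test statistic is less than $z_{1-\alpha} = -z_{\alpha}$, which is the z-value which gives an area of α to the left. Note that we use α instead of α/2 since we are only testing on one side of the standard normal distribution. Similarly, for the one-tailed test μ > μO, the null hypothesis would be rejected if the computed test statistic is greater than $z_{\alpha}$, which is the z-value which gives an area of α to the right.

[Figure: standard normal curve for a two-tailed test, with area 1 - α in the center, area α/2 in each tail, and critical values z1-α/2 and zα/2.]

[Figure: standard normal curve for a one-tailed test, with area α in the rejection tail and area 1 - α in the remainder.]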
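
Instead of reading the standard normal table, the critical z-values can also be computed. The following is a minimal sketch using `NormalDist` from Python's standard library; the function name `z_critical` and its `tail` argument are our own choices for illustration.

```python
from statistics import NormalDist

def z_critical(alpha, tail):
    """Critical value(s) of the standard normal distribution.

    tail: "lower" for H1: mu < mu0, "upper" for H1: mu > mu0,
          "two" for H1: mu != mu0.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "lower":
        return std_normal.inv_cdf(alpha)           # area alpha to the left
    if tail == "upper":
        return std_normal.inv_cdf(1 - alpha)       # area alpha to the right
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)  # area alpha/2 to the right
        return -upper, upper                       # symmetric about zero
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

print(z_critical(0.05, "lower"))
print(z_critical(0.05, "upper"))
print(z_critical(0.01, "two"))
```

The computed values may differ slightly in the third decimal place from values interpolated from a printed table.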

Example 1: A manufacturer of sports equipment has developed a new synthetic fishing line that

he claims has a mean breaking strength of 8 kilograms with a standard deviation of 0.5

kilograms. Test the hypothesis that μ = 8 kg against the alternative that μ ≠ 8 kg if a random

sample of 50 lines is tested and found to have a mean breaking strength of 7.8 kg. Use a 0.01

level of significance.

Solution:

We follow the five steps listed above:

1. HO: μ = 8 kg (null hypothesis)

H1: μ ≠ 8 kg (alternative hypothesis)

2. α = 0.01 (level of significance)

3. Based on the standard normal table, the critical values for α = 0.01 would be z = -2.575

and z = 2.575. Thus we reject HO if our test statistic z is less than -2.575 or greater than

2.575 (z < -2.575 or z > 2.575).

4. From the previous section, we use a similar test statistic when the population standard deviation σ is given (or if n ≥ 30),

$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$

Thus, using the values given in the problem, we compute our test statistic z to be

$z = \dfrac{7.8 - 8}{0.5/\sqrt{50}} = -2.83$

Since -2.83 is less than -2.575, we have sufficient evidence to reject the null hypothesis, at the 0.01 level of significance.

5. Conclusion: We reject the manufacturer’s claim that the new fishing line’s mean breaking strength is 8 kg. At a 0.01 level of significance, we have sufficient evidence to say that the true mean breaking strength is not equal to 8 kg; since the sample mean is below 8 kg, it appears to be less than 8 kg.
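
The five steps of Example 1 can also be checked numerically. This is a minimal sketch using only the values given in the problem; the variable names are ours, and the critical value is computed with Python's `NormalDist` rather than read from the table.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma = 8.0, 0.5      # claimed mean and standard deviation (kg)
n, xbar = 50, 7.8          # sample size and sample mean (kg)
alpha = 0.01               # level of significance

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 2.576; the table gives 2.575
z = (xbar - mu0) / (sigma / sqrt(n))           # test statistic
reject = z < -z_crit or z > z_crit             # two-tailed rejection criterion

print(f"z = {z:.2f}, critical values = -{z_crit:.3f} and {z_crit:.3f}, reject HO: {reject}")
```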

The rest of the test statistics for the other cases for single populations and two

populations are listed in the table below. Notice the similarity with the statistics used in the

previous chapter.

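Null Hypothesis HO and Test Statistic

Single Population

HO: μ = μO
Case 1: If σ is known, or n ≥ 30
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (with s in place of σ when σ is unknown)

HO: μ = μO
Case 2: If σ is unknown, and n < 30
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with $v = n - 1$

Two Populations

HO: $\mu_1 - \mu_2 = d_0$
Case 1: If $\sigma_1$ and $\sigma_2$ are known, or unknown but $n_1 \geq 30$ and $n_2 \geq 30$
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ (with $s_1$, $s_2$ in place of $\sigma_1$, $\sigma_2$ when these are unknown)

HO: $\mu_1 - \mu_2 = d_0$
Case 2: If $\sigma_1$ and $\sigma_2$ are unknown and $n_1 < 30$ and $n_2 < 30$, but the variances are assumed equal
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{1/n_1 + 1/n_2}}$, where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ and $v = n_1 + n_2 - 2$

HO: $\mu_1 - \mu_2 = d_0$
Case 3: If $\sigma_1$ and $\sigma_2$ are unknown and $n_1 < 30$ and $n_2 < 30$, but the variances are assumed unequal
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, with $v = \dfrac{(s_1^2/n_1 + s_2^2/n_2)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$

Paired Observations

HO: $\mu_D = d_0$
$t = \dfrac{\bar{d} - d_0}{s_d/\sqrt{n}}$, with $v = n - 1$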

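We also note that in general, if the computed test statistic is z, we compare it with $z_{1-\alpha}$ for a lower tail test or with $z_{\alpha}$ for an upper tail test, and with $z_{1-\alpha/2}$ or $z_{\alpha/2}$ for a two-tailed test (the subscript gives the area to the right of the value, so $z_{1-\alpha} = -z_{\alpha}$). If the computed test statistic is t, then we look for the critical values $t_{1-\alpha} = -t_{\alpha}$, $t_{\alpha}$, and $\pm t_{\alpha/2}$ from the T distribution table, with the corresponding degrees of freedom, v, listed in the previous table.

Alternative Hypothesis    Reject HO if

μ < μO                    $z < z_{1-\alpha}$

μ > μO                    $z > z_{\alpha}$

μ ≠ μO                    $z < z_{1-\alpha/2}$ or $z > z_{\alpha/2}$

(with t in place of z when the test statistic is t)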

We also note that when testing the equality of the means of two populations, we test if the difference $\mu_1 - \mu_2$ is equal to a specified value $d_0$. If we are specifically testing if the means of the two populations are equal, then we set $d_0 = 0$, and proceed with testing the hypothesis that $\mu_1 = \mu_2$.

Example 2: A random sample of 100 recorded deaths in the United States during the past year

showed an average life span of 71.8 years, with a standard deviation of 8.9 years. Does this

seem to indicate that the average life span today is greater than 70 years? Use a 0.05 level of

significance.

Example 3: The average length of time for students to register for fall classes at a certain college

has been known to be 50 minutes with a standard deviation of 10 minutes. A new registration

procedure using modern computing machines is being implemented. If a random sample of 12

students had an average registration time of 42 minutes with a standard deviation of 11.9

minutes under the new system, test the hypothesis that the population mean is now less than

50, using a level of significance of 0.05. Assume the population of times to be normal.
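
One way to set up Example 3 is sketched below. We assume the small-sample case (n < 30, with the sample standard deviation 11.9 used in the test statistic), so the statistic is t with n - 1 = 11 degrees of freedom. Python's standard library has no t-distribution, so the critical value 1.796 is entered by hand from a t-table; it is a looked-up value, not a computed one.

```python
from math import sqrt

mu0 = 50.0                   # mean registration time under the old system (minutes)
n, xbar, s = 12, 42.0, 11.9  # sample size, sample mean, sample standard deviation
t_crit = 1.796               # t-table value for alpha = 0.05, v = n - 1 = 11 (looked up)

t = (xbar - mu0) / (s / sqrt(n))   # test statistic with v = 11 degrees of freedom
reject = t < -t_crit               # lower-tail test, H1: mu < 50

print(f"t = {t:.2f}, critical value = {-t_crit}, reject HO: {reject}")
```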

Example 4: A course in mathematics is taught to 12 students by the conventional classroom

procedure. A second group of 10 students was given the same course by means of

programmed materials. At the end of the semester the same examination was given to each

group. The 12 students meeting in the classroom made an average grade of 85 with a standard

deviation of 4, while the 10 students using programmed materials made an average of 81 with a

standard deviation of 5. Test the hypothesis that the two methods of learning are equal using a

0.10 level of significance. Assume the populations to be approximately normal with equal

variances.
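
A sketch of Example 4 using the pooled-variance t statistic (two small samples, variances assumed equal) follows. The null hypothesis is taken as μ1 - μ2 = 0, and the critical value 1.725 for α/2 = 0.05 with 12 + 10 - 2 = 20 degrees of freedom is entered by hand from a t-table.

```python
from math import sqrt

n1, xbar1, s1 = 12, 85.0, 4.0   # conventional classroom group
n2, xbar2, s2 = 10, 81.0, 5.0   # programmed materials group
d0 = 0.0                        # HO: mu1 - mu2 = 0
t_crit = 1.725                  # t-table value for alpha/2 = 0.05, v = 20 (looked up)

# Pooled variance estimate, valid under the equal-variance assumption
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = ((xbar1 - xbar2) - d0) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
reject = t < -t_crit or t > t_crit   # two-tailed test

print(f"sp^2 = {sp2:.2f}, t = {t:.2f}, reject HO: {reject}")
```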

Example 5: To determine whether membership in a fraternity is beneficial or detrimental to one’s

grades, the following grade-point averages were collected over a period of 5 years:


                 Year

                 1     2     3     4     5

Fraternity       2.0   2.0   2.3   2.1   2.4

Non-fraternity   2.2   1.9   2.5   2.3   2.4

Assuming the populations to be normal, test at the 0.05 level of significance whether

membership in a fraternity is detrimental to one’s grades.
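
Example 5 is a paired-observations problem, since the two grade-point averages are matched by year. The sketch below takes the differences as fraternity minus non-fraternity, so "detrimental" corresponds to the lower-tail alternative μD < 0. The critical value 2.132 for α = 0.05 with 4 degrees of freedom is entered by hand from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

fraternity     = [2.0, 2.0, 2.3, 2.1, 2.4]
non_fraternity = [2.2, 1.9, 2.5, 2.3, 2.4]
t_crit = 2.132   # t-table value for alpha = 0.05, v = n - 1 = 4 (looked up)

# Differences d = fraternity - non-fraternity for each year
d = [f - nf for f, nf in zip(fraternity, non_fraternity)]
n = len(d)
t = mean(d) / (stdev(d) / sqrt(n))   # HO: mu_D = 0, so d0 = 0
reject = t < -t_crit                 # lower-tail test, H1: mu_D < 0

print(f"mean d = {mean(d):.2f}, s_d = {stdev(d):.4f}, t = {t:.2f}, reject HO: {reject}")
```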

TESTS CONCERNING PROPORTION

Next, we may be interested in testing the proportion of successes in a population. For

instance, a politician may be interested in the proportion of citizens who will vote for him, or a

manufacturing firm may be interested in the proportion of defectives that arise from a sample of

products. For these cases we will be testing the null hypothesis

$p = p_0$

where p is the population proportion and $p_0$ is the specified value of the proportion being tested.

The possible alternative hypotheses are

$p < p_0$, $p > p_0$, and $p \neq p_0$.

Again we are conducting a binomial experiment on a sample, counting the number of

successes, and determining through this value whether or not to reject our null hypothesis.

To find our critical values, we would be using the binomial probabilities listed in the

binomial probability table. However, in most practical applications, the sample size is greater

than 30, and when n > 30, the normal probability distribution can be used to approximate the

binomial distribution. This is an easier way of getting critical values, and would prove to be

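accurate as long as we have a large sample size and $p_0$ is not close to 0 or 1.

We then compute the test statistic

$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$, where $\hat{p} = x/n$ is the sample proportion of successes and $q_0 = 1 - p_0$,

and compare it with the critical values from the standard normal distribution. The rejection criteria are summarized below:

Alternative Hypothesis    Reject HO if

$p < p_0$                 $z < z_{1-\alpha}$

$p > p_0$                 $z > z_{\alpha}$

$p \neq p_0$              $z < z_{1-\alpha/2}$ or $z > z_{\alpha/2}$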

We may also be interested in testing the equality of two proportions. If so we would be

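testing the null hypothesis

$p_1 = p_2$

against the alternatives

$p_1 < p_2$, $p_1 > p_2$, and $p_1 \neq p_2$,

where $p_1$ and $p_2$ are the proportions of success in the two populations.

The test statistic to be used is

$z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q}\,(1/n_1 + 1/n_2)}}$

where $\hat{p}_1$ and $\hat{p}_2$ are the proportions of success from the two samples, with the respective sample sizes $n_1$ and $n_2$, and $\hat{p}$ is computed as

$\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$

where $x_1$ and $x_2$ are the number of successes from each sample. Lastly, $\hat{q} = 1 - \hat{p}$. The rejection criteria are summarized below.

Alternative Hypothesis    Reject HO if

$p_1 < p_2$               $z < z_{1-\alpha}$

$p_1 > p_2$               $z > z_{\alpha}$

$p_1 \neq p_2$            $z < z_{1-\alpha/2}$ or $z > z_{\alpha/2}$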

Example 1: A commonly prescribed drug on the market for relieving nervous tension is believed

to be only 60% effective. Experimental results with a new drug administered to a random

sample of 100 adults who were suffering from nervous tension showed that 70 received relief. Is

this sufficient evidence to conclude that the new drug is superior to the one commonly

prescribed? Use a 0.05 level of significance.
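
A sketch of this example using the normal approximation to the binomial follows. We take the null hypothesis as p = 0.6 and the alternative as p > 0.6 (the new drug is superior); the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.60          # effectiveness of the commonly prescribed drug
n, x = 100, 70     # sample size and number who received relief
alpha = 0.05

p_hat = x / n                                  # sample proportion
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # normal approximation to the binomial
z_crit = NormalDist().inv_cdf(1 - alpha)       # upper-tail critical value, about 1.645
reject = z > z_crit                            # H1: p > 0.6

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject HO: {reject}")
```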

Example 2: A vote is to be taken among the residents of a town and the surrounding county to

determine whether a civic center will be constructed. To determine if there is a significant

difference in the proportion of town voters and county voters favoring the proposal, a poll is

taken. If 120 of 200 town voters favor the proposal and 240 of 500 county residents favor it,

would you agree that the proportion of town voters favoring the proposal is higher than the

proportion of county voters? Use a 0.025 level of significance.
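
The two-proportion test for this example can be sketched as follows, taking population 1 as the town voters and population 2 as the county voters, with the alternative p1 > p2. The pooled estimate combines the successes of both samples.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 120, 200   # town voters in favor, town sample size
x2, n2 = 240, 500   # county voters in favor, county sample size
alpha = 0.025

p1_hat, p2_hat = x1 / n1, x2 / n2
p_hat = (x1 + x2) / (n1 + n2)        # pooled estimate of the common proportion
q_hat = 1 - p_hat
z = (p1_hat - p2_hat) / sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))
z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value, about 1.96
reject = z > z_crit                        # H1: p1 > p2

print(f"pooled p = {p_hat:.4f}, z = {z:.2f}, reject HO: {reject}")
```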

TESTS CONCERNING VARIANCES

We may also be interested in testing the uniformity of a certain population, or in

comparing the uniformity of two populations. For a single population, we would be testing if the

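population variance $\sigma^2$ is equal to a specified value $\sigma_0^2$. Hence the null hypothesis would be

$\sigma^2 = \sigma_0^2$

against the alternatives

$\sigma^2 < \sigma_0^2$, $\sigma^2 > \sigma_0^2$, and $\sigma^2 \neq \sigma_0^2$.

Assuming the population distribution to be approximately normal, we can use the χ2 test statistic given by

$\chi^2 = \dfrac{(n - 1)s^2}{\sigma_0^2}$

where $s^2$ is the sample variance, and n is the sample size. We then compare this statistic with the critical values taken from the χ2-table with $v = n - 1$ degrees of freedom, where the subscript of a critical value gives the area to its right. The rejection criteria are as follows:

Alternative Hypothesis          Reject HO if

$\sigma^2 < \sigma_0^2$         $\chi^2 < \chi^2_{1-\alpha}$

$\sigma^2 > \sigma_0^2$         $\chi^2 > \chi^2_{\alpha}$

$\sigma^2 \neq \sigma_0^2$      $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$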

Example 1: A manufacturer of car batteries claims that the life of his batteries has a variance

equal to 0.81 (in years squared). If a random sample of 10 of these batteries has a variance of 1.44,

do you think that $\sigma^2 > 0.81$? Use a 0.05 level of significance.
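
A sketch of this example follows, taking the alternative hypothesis to be σ² > 0.81 (assumed here as the natural one-sided alternative for this claim). The critical value 16.919 for α = 0.05 with 9 degrees of freedom is entered by hand from a χ2-table, since Python's standard library has no chi-square distribution.

```python
sigma0_sq = 0.81     # claimed population variance
n, s_sq = 10, 1.44   # sample size and sample variance
chi2_crit = 16.919   # chi-square table value for alpha = 0.05, v = n - 1 = 9 (looked up)

chi2 = (n - 1) * s_sq / sigma0_sq   # test statistic
reject = chi2 > chi2_crit           # upper-tail test, H1: sigma^2 > 0.81

print(f"chi-square = {chi2:.2f}, critical value = {chi2_crit}, reject HO: {reject}")
```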

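When testing the equality of the variances of two populations, the null hypothesis is

$\sigma_1^2 = \sigma_2^2$

which will be tested against any of the alternatives

$\sigma_1^2 < \sigma_2^2$, $\sigma_1^2 > \sigma_2^2$, and $\sigma_1^2 \neq \sigma_2^2$.

For independent random samples of size $n_1$ and $n_2$ for the two populations, the test statistic is given by

$f = \dfrac{s_1^2}{s_2^2}$

where $s_1^2$ is the variance of the first sample and $s_2^2$ is the variance of the second sample. This statistic will then be compared to critical values taken from the F-distribution. These rejection criteria are summarized below:

Alternative Hypothesis            Reject HO if

$\sigma_1^2 < \sigma_2^2$         $f < f_{1-\alpha}(v_1, v_2)$

$\sigma_1^2 > \sigma_2^2$         $f > f_{\alpha}(v_1, v_2)$

$\sigma_1^2 \neq \sigma_2^2$      $f < f_{1-\alpha/2}(v_1, v_2)$, or $f > f_{\alpha/2}(v_1, v_2)$

Note that $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$.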

Example 2: In testing the equality of the population means in Example 4 under Tests

Concerning Means, we assumed that the two population variances are equal but unknown. Are

we justified in making this assumption? Use a 0.10 level of significance.
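
A sketch of this example follows, using the standard deviations 4 and 5 and the sample sizes 12 and 10 from Example 4 under Tests Concerning Means. The two F-table values below are approximate look-ups entered by hand and should be verified against your own table; the lower critical value uses the reciprocal relation between the lower tail with (v1, v2) degrees of freedom and the upper tail with (v2, v1) degrees of freedom.

```python
n1, s1 = 12, 4.0   # conventional classroom group: size, standard deviation
n2, s2 = 10, 5.0   # programmed materials group: size, standard deviation
v1, v2 = n1 - 1, n2 - 1   # degrees of freedom: 11 and 9

# Approximate F-table values for alpha/2 = 0.05 (looked up, not computed)
f_upper = 3.10       # area 0.05 to the right, with (11, 9) degrees of freedom
f_lower = 1 / 2.90   # area 0.05 to the left: reciprocal of the (9, 11) upper value

f = s1**2 / s2**2                     # test statistic
reject = f < f_lower or f > f_upper   # two-tailed test at alpha = 0.10

print(f"f = {f:.2f}, acceptance interval = ({f_lower:.2f}, {f_upper:.2f}), reject HO: {reject}")
```

Since f falls well inside the interval, small inaccuracies in the looked-up values would not change the decision here.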



P-VALUE APPROACH

The p-value approach uses a single value called the p-value to determine whether or not to reject the null hypothesis. This is the output generated by most statistical software. The p-value is the probability, computed assuming that the null hypothesis is true, of getting a test statistic at least as extreme as the one observed in the sample. Hence, if we have a low p-value (close to 0), we may reject the null hypothesis, and fail to reject it if the p-value is quite large.

Determining how small the p-value must be in order to reject the null hypothesis is not

easy, and may involve some subjectivity. As a general measure though, we usually compare the

p-value with the level of significance. Hence, we reject the null hypothesis if the p-value is less

than the level of significance, α.

For example, if our level of significance is 0.05, and we have a p-value of 0.0341, then

we would reject the null hypothesis. However, what happens if we have a p-value of say,

0.0612? We may say that we fail to reject the null hypothesis because the p-value is not less

than the level of significance. However, we may also want to reject the null hypothesis because

we still have a relatively low p-value (that is, if the null hypothesis were true, we would get a result

at least this extreme only 6.12% of the time). This is where subjectivity comes in, and different

conclusions may be made depending on the study being conducted.
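
For a z statistic, the p-value can be computed directly from the standard normal distribution. The following is a minimal sketch; the function name `p_value_z` is our own, and the data are those of the fishing-line example under Tests Concerning Means.

```python
from math import sqrt
from statistics import NormalDist

def p_value_z(z, tail):
    """p-value for a z statistic; tail is 'lower', 'upper', or 'two'."""
    cdf = NormalDist().cdf
    if tail == "lower":
        return cdf(z)                     # area to the left of z
    if tail == "upper":
        return 1 - cdf(z)                 # area to the right of z
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))      # area in both tails beyond |z|
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

z = (7.8 - 8) / (0.5 / sqrt(50))   # fishing-line example, z is about -2.83
p = p_value_z(z, "two")
alpha = 0.01
print(f"p-value = {p:.4f}, reject HO at alpha = {alpha}: {p < alpha}")
```

The decision agrees with the critical-value approach: the p-value is below 0.01, so the null hypothesis is rejected.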

EXERCISES

1. The average height of females in the freshman class of a certain college has been 162.5

centimeters with a standard deviation of 6.9 centimeters. Is there reason to believe that

there has been a change in the average height if a random sample of 50 females in the

present freshman class has an average height of 165.2 centimeters, using a 0.05 level

of significance?

2. Test the hypothesis that the average content of containers of a particular lubricant is 10

liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1,

9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use a 0.01 level of significance and assume that the

distribution of contents is normal.

3. A manufacturer claims that the average tensile strength of thread A exceeds the average

tensile strength of thread B by at least 12 kilograms. To test this claim, 50 pieces of each

type of thread are tested under similar conditions. Type A thread had an average tensile

strength of 86.7 kilograms with a standard deviation of 6.28 kilograms, while type B

thread had an average tensile strength of 77.8 kilograms with a standard deviation of

5.61 kilograms. Test the manufacturer’s claim using a 0.05 level of significance.

4. A study is made to see if increasing the substrate concentration has an appreciable

effect on the velocity of a chemical reaction. With the substrate concentration of 1.5

moles per liter, the reaction was run 15 times with an average velocity of 7.5 micromoles

per 30 minutes and a standard deviation of 1.5. With a substrate concentration of 1.0

moles per liter, 12 runs were made yielding an average velocity of 8.8 micromoles per 30

minutes and a sample standard deviation of 1.2. Would you say that the increase in

substrate concentration increases the mean velocity by more than 0.5 micromoles per



30 minutes? Use a 0.01 level of significance and assume the populations to be

approximately normally distributed with equal variances.

5. A taxi company is trying to decide whether the use of radial tires instead of belted tires

improves fuel economy. Twelve cars were equipped with radial tires and driven over a

prescribed test course. Without changing drivers, the same cars were then equipped

with regular belted tires and driven once again over the same test course. The gasoline

consumption, in kilometers per liter, was recorded as follows:

Car    Kilometers Per Liter

       Radial Tires    Belted Tires

1      4.2             4.1

2      4.7             4.9

3      6.6             6.2

4      7.0             6.9

5      6.7             6.8

6      4.5             4.4

7      5.7             5.7

8      6.0             5.8

9      7.4             6.9

10     4.9             4.7

11     6.1             6.0

12     5.2             4.9

At the 0.025 level of significance, can we conclude that cars equipped with radial tires

give better fuel economy than those equipped with belted tires? Assume the populations

to be normally distributed.

6. A soft-drink dispensing machine is said to be out of control if the variance of the contents

exceeds 1.15 deciliters. If a random sample of 25 drinks from this machine has a

variance of 2.03 deciliters, does this indicate at the 0.05 level of significance that the

machine is out of control? Assume that the contents are approximately normally

distributed.

7. A study is conducted to compare the lengths of time it takes men and women to

assemble a certain product. Past experience indicates that the distribution of times for

both men and women are approximately normal but the variance of the times for women

is less than that for men. A random sample of time for 11 men and 14 women produced

the following data: the variance for men was 6.1 while the variance for women was 5.3.

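Test the hypothesis that $\sigma_1^2 = \sigma_2^2$ against the alternative $\sigma_1^2 > \sigma_2^2$, where subscripts 1 and 2 refer to men and women respectively, using a 0.01 level of

significance.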

8. The gas company claims that two thirds of the houses in a certain city are heated by

natural gas. Do we have reason to doubt this claim if, in a random sample of 1000

houses in this city, it is found that 618 are heated by natural gas? Use a 0.02 level of

significance.

9. A geneticist is interested in the proportion of males and females in a population that have a certain minor blood disorder. In a random sample of 100 males, 31 are found to be afflicted, whereas only 24 of 100 females tested appear to have the disorder. Can we



conclude at the 0.01 level of significance that the proportion of men in the population afflicted with this blood disorder is significantly greater than the proportion of women afflicted?

BUSINESS APPLICATIONS

10. A firm is studying the delivery times of two raw material suppliers. The firm is basically satisfied with supplier A and is prepared to stay with that supplier. However, if the firm finds that the mean delivery time of supplier B is less than that of supplier A, it will begin making raw material purchases from supplier B. Assume that 50 independent samples from each supplier show the delivery time of supplier A to be 14 days with a standard deviation of 3 days, while the delivery time of B is shown to be 12.5 days with a standard deviation of 2 days. Testing at a 0.05 level of significance, should the firm switch to supplier B or not?

11. Starting annual salaries for individuals with master’s and bachelor’s degrees were collected in two independent random samples. The average starting salary of a random sample of 60 individuals with master’s degrees showed a mean of $45,000 with a standard deviation of $4000, while those with bachelor’s degrees had a mean starting salary of $35,000 with a standard deviation of $3500. Test at a 0.05 level of significance if those with master’s degrees had a significantly higher average starting salary than those individuals with bachelor’s degrees.

12. Starting annual salaries for individuals entering the public accounting and financial planning professions were presented in Fortune, June 26, 1995. The starting salaries for a sample of 12 public accountants and a sample of 14 financial planners follow, with data in thousands of dollars. Public Accountant:

30.6 31.2 28.9 35.2 25.1 33.2 31.3 35.3 31.0 30.1 29.9 24.4

Financial Planner:

31.6 26.6 25.5 25.0 25.9 32.9 26.9 25.8 27.5 29.6 23.9 26.9

24.4 25.5

Test if the starting annual salaries of the accountants and financial planners are equal at

a 0.05 level of significance.

13. Rental car gasoline prices per gallon were sampled at eight major airports. Data for Hertz and National car rental companies follow (USA Today, April 4, 2000).

Airport Hertz National

Boston Logan 1.55 1.56

Chicago O’Hare 1.62 1.59

Los Angeles 1.72 1.78

Miami 1.65 1.49

New York (JFK) 1.72 1.51

New York (LaGuardia) 1.67 1.50

Orange County 1.68 1.77

Washington 1.52 1.41



Test at a 0.05 level of significance if there is a difference between the gas prices of the

two rental car companies.

14. Figure Perfect Inc., is a women’s figure salon that specializes in weight reduction programs. Weights for a sample of clients before and after a 6-week introductory program are shown here.

Client Weight Before Weight After

1 140 132

2 160 158

3 210 195

4 148 152

5 190 180

6 170 164

Determine at a 0.05 level of significance whether the introductory program provided a

statistically significant weight loss.

15. Yahoo! Internet Life sponsored surveys in several metropolitan areas to estimate the proportion of adults using the Internet at work (USA Today, May 7, 2000). Results showed 96 of 240 Washington D.C. adults use the Internet at work, while 80 of 250 San Francisco adults use the Internet at work. Do the sample results indicate that the population proportion of adults using the Internet at work in Washington D.C. is greater than the population proportion in San Francisco? Use a 0.05 level of significance.

16. A Business Week/Harris survey asked senior executives at large corporations their opinions about the economic outlook for the future. One question was “Do you think that there will be an increase in the number of full-time employees at your company over the next 12 months?” In May 1997, 220 of 400 executives answered yes, while in December 1996, 192 of 400 executives had answered yes. Test at a 0.04 level of significance if there was a significant increase in the proportion of executives who answered yes from December 1996 to May 1997.

17. The standard deviation in the 12-month earnings per share for 10 companies in the airline industry was 4.27, and the standard deviation in the 12-month earnings per share for 7 companies in the automotive industry was 2.27. Conduct a test for equal variances at a 0.10 level of significance.

18. On the basis of data provided by a Romac salary survey, the variance in annual salaries for seniors in public accounting firms is approximately 2.1 and the variance in annual salaries for managers in public accounting firms is 11.1 (data in thousands of dollars). Assuming that the salary data were based on samples of 25 seniors and 25 managers, test the hypothesis that the population variances of the salaries are equal. Use a 0.02 level of significance.
