jay-eamon-reyes-mendros
Chapter Three Describing Data: Numerical Measures
GOALS
When you have completed this chapter, you will be able to:
ONE Calculate the arithmetic mean, median, mode, weighted mean, and the geometric mean.
TWO Explain the characteristics, uses, advantages, and disadvantages of each measure of location.
THREE Identify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution.
FOUR Compute and interpret the range, the mean deviation, the variance, and the standard deviation of ungrouped data.
FIVE Explain the characteristics, uses, advantages, and disadvantages of each measure of dispersion.
SIX Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations.
Characteristics of the Mean
The Arithmetic Mean is the most widely used measure of location and shows the central value of the data.
The major characteristics of the mean are:
It is calculated by summing the values and dividing by the number of values.
It requires the interval scale.
All values are used.
It is unique.
The sum of the deviations from the mean is 0.
Population Mean
For ungrouped data, the Population Mean is the sum of all the population values divided by the total number of population values:
$$\mu = \frac{\sum X}{N}$$
where µ is the population mean, N is the total number of observations, X is a particular value, and Σ indicates the operation of adding.
Example 1
A Parameter is a measurable characteristic of a population.
The Kiers family owns four cars. The following is the current mileage on each of the four cars: 56,000; 23,000; 42,000; 73,000.
Find the mean mileage for the cars.
$$\mu = \frac{\sum X}{N} = \frac{56{,}000 + \dots + 73{,}000}{4} = 48{,}500$$
Sample Mean
For ungrouped data, the sample mean is the sum of all the sample values divided by the number of sample values:
$$\bar{X} = \frac{\sum X}{n}$$
where n is the total number of values in the sample.
Example 2
A statistic is a measurable characteristic of a sample.
A sample of five executives received the following bonus last year ($000):
14.0, 15.0, 17.0, 16.0, 15.0
$$\bar{X} = \frac{\sum X}{n} = \frac{14.0 + \dots + 15.0}{5} = \frac{77}{5} = 15.4$$
Properties of the Arithmetic Mean
Every set of interval-level and ratio-level data has a mean.
All the values are included in computing the mean.
A set of data has a unique mean.
The mean is affected by unusually large or small data values.
The arithmetic mean is the only measure of location where the sum of the deviations of each value from the mean is zero.
Example 3
Consider the set of values: 3, 8, and 4. The mean is 5. Illustrating the fifth property:
$$\sum (X - \bar{X}) = (3 - 5) + (8 - 5) + (4 - 5) = 0$$
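The definitions in Examples 1 to 3 can be checked with a few lines of Python. This is a minimal sketch; the helper name `mean` and the variable names are illustrative choices, not part of the original slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)

# Example 1: mileage of the four Kiers family cars (a population).
mileage = [56_000, 23_000, 42_000, 73_000]
population_mean = mean(mileage)

# Example 2: bonuses of five sampled executives, in $000.
bonuses = [14.0, 15.0, 17.0, 16.0, 15.0]
sample_mean = mean(bonuses)

# Example 3: the deviations from the mean sum to zero.
values = [3, 8, 4]
deviations = [x - mean(values) for x in values]

print(population_mean, round(sample_mean, 1), sum(deviations))
```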
Weighted Mean
The Weighted Mean of a set of numbers X1, X2, ..., Xn, with corresponding weights w1, w2, ..., wn, is computed from the following formula:
$$\bar{X}_w = \frac{w_1X_1 + w_2X_2 + \dots + w_nX_n}{w_1 + w_2 + \dots + w_n}$$
Example 4
During a one hour period on a hot Saturday afternoon cabana boy Chris served fifty drinks. He sold five drinks for $0.50, fifteen for $0.75, fifteen for $0.90, and fifteen for $1.15.
Compute the weighted mean of the price of the drinks.
$$\bar{X}_w = \frac{5(\$0.50) + 15(\$0.75) + 15(\$0.90) + 15(\$1.15)}{5 + 15 + 15 + 15} = \frac{\$44.50}{50} = \$0.89$$
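The weighted mean in Example 4 can be reproduced as follows. This is a sketch under the assumption that the fourth price is $1.15, the figure used in the worked computation; the function name `weighted_mean` is hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

prices = [0.50, 0.75, 0.90, 1.15]   # price per drink, in dollars
drinks_sold = [5, 15, 15, 15]       # weights: number of drinks sold at each price
mean_price = weighted_mean(prices, drinks_sold)
print(f"{mean_price:.2f}")
```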
The Median
The Median is the midpoint of the values after they have been ordered from the smallest to the largest.
There are as many values above the median as below it in the data array.
The median is found at the (n+1)/2 ranked observation. For an even set of values, that position falls between the two middle numbers, and the median is their arithmetic average.
The median (continued)
The ages for a sample of five college students are: 21, 25, 19, 20, 22.
Arranging the data in ascending order gives: 19, 20, 21, 22, 25.
Thus the median is 21.
Example 5
The heights of four basketball players, in inches, are: 76, 73, 80, 75.
Arranging the data in ascending order gives: 73, 75, 76, 80.
The median is found at the (n+1)/2 = (4+1)/2 = 2.5th data point, halfway between 75 and 76.
Thus the median is 75.5.
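The two median examples can be sketched in Python as below. The function name `median` is illustrative; it sorts the data and then either takes the middle value or averages the two middle values.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values

ages = [21, 25, 19, 20, 22]     # five college students
heights = [76, 73, 80, 75]      # four basketball players, in inches
print(median(ages), median(heights))
```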
Properties of the Median
There is a unique median for each data set.
It is not affected by extremely large or small values and is therefore a valuable measure of location when such values occur.
It can be computed for ratio-level, interval-level, and ordinal-level data.
It can be computed for an open-ended frequency distribution if the median does not lie in an open-ended class.
The Mode: Example 6
The Mode is another measure of location and represents the value of the observation that appears most frequently.
Example 6: The exam scores for ten nursing students are: 81, 93, 84, 75, 68, 87, 81, 75, 81, 87. Because the score of 81 occurs the most often, it is the mode.
Data can have more than one mode. A data set with two modes is called bimodal, one with three modes trimodal, and so on.
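A short sketch of finding the mode, including the multi-modal case, is given below. The helper name `modes` is hypothetical; it returns every value that ties for the highest frequency.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

exam_scores = [81, 93, 84, 75, 68, 87, 81, 75, 81, 87]
print(modes(exam_scores))        # a single mode
print(modes([1, 1, 2, 2, 3]))    # a made-up bimodal data set for illustration
```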
The Relative Positions of the Mean, Median, and Mode
Symmetric distribution: A distribution having the same shape on either side of the center.
Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution. It can be negatively skewed or positively skewed.
The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution
Zero skewness: Mean = Median = Mode
[Figure: symmetric curve with the mode, median, and mean at the same central point]
The Relative Positions of the Mean, Median, and Mode: Right Skewed Distribution
• Positively skewed: Mean and median are to the right of the mode.
Mean > Median > Mode
[Figure: right-skewed curve with the mode, median, and mean marked from left to right]
The Relative Positions of the Mean, Median, and Mode: Left Skewed Distribution
• Negatively skewed: Mean and median are to the left of the mode.
Mean < Median < Mode
[Figure: left-skewed curve with the mean, median, and mode marked from left to right]
Measures of Dispersion
Dispersion refers to the spread or variability in the data.
Measures of dispersion include the following: range, mean deviation, variance, and standard deviation.
Range = Largest value – Smallest value
Example 9
The following are the scores of 25 Master of Nursing students on the first long examination.
65.0 60.0 59.0 81.0 73.0
48.0 95.0 63.0 92.0 53.0
78.0 56.0 79.0 95.0 78.0
80.0 89.0 79.0 97.0 69.0
65.0 77.0 80.0 83.0 75.0
Highest score: 97. Lowest score: 48.
Range = Highest value – Lowest value = 97 – 48 = 49
Mean Deviation
Mean Deviation: The arithmetic mean of the absolute values of the deviations from the arithmetic mean.
$$MD = \frac{\sum |X - \bar{X}|}{n}$$
The main features of the mean deviation are:
All values are used in the calculation.
It is not unduly influenced by large or small values.
Example 10
The weights of a sample of crates containing books for the bookstore (in pounds) are:
103, 97, 101, 106, 103
Find the mean deviation.
The mean is $\bar{X} = 102$.
The mean deviation is:
$$MD = \frac{\sum |X - \bar{X}|}{n} = \frac{|103 - 102| + \dots + |103 - 102|}{5} = \frac{1 + 5 + 1 + 4 + 1}{5} = 2.4$$
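Example 10 can be reproduced with a small function. This is an illustrative sketch; `mean_deviation` is a hypothetical helper name.

```python
def mean_deviation(values):
    """Average absolute distance of the values from their arithmetic mean."""
    if not values:
        raise ValueError("mean_deviation requires at least one value")
    center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)

crate_weights = [103, 97, 101, 106, 103]   # pounds
md = mean_deviation(crate_weights)
print(md)
```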
Variance and Standard Deviation
Variance: The arithmetic mean of the squared deviations from the mean.
Standard deviation: The square root of the variance.
Population Variance
The major characteristics of the Population Variance are:
All values are used in the calculation.
It is influenced by extreme values, because each deviation from the mean is squared.
Variance and standard deviation
Population Variance formula:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
X is the value of an observation in the population
µ is the arithmetic mean of the population
N is the number of observations in the population
Population Standard Deviation formula:
$$\sigma = \sqrt{\sigma^2}$$
Sample variance and standard deviation
Sample variance (s²)
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Sample standard deviation (s)
$$s = \sqrt{s^2}$$
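Example 11
The hourly wages earned by a sample of five students are:
$7, $5, $11, $8, $6.
Find the sample variance and standard deviation.
$$\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$$
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$$
$$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$$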
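The sample variance and standard deviation of Example 11 can be checked as follows. The function `sample_variance` is a hypothetical helper written out to mirror the formula; the standard library's `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance requires at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

wages = [7, 5, 11, 8, 6]          # hourly wages in dollars
s2 = sample_variance(wages)       # sample variance
s = sqrt(s2)                      # sample standard deviation
library_s2 = variance(wages)      # standard-library value for comparison
print(round(s2, 2), round(s, 2))
```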
Chebyshev’s theorem
Chebyshev’s theorem: For any set of observations, the proportion of the values that lie within k standard deviations of the mean is at least
$$1 - \frac{1}{k^2}$$
where k is any constant greater than 1.
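As a rough illustration, the Chebyshev bound can be compared with the proportion actually observed in a data set. The sketch below uses the 25 examination scores from Example 9 and treats them as a population (hence `pstdev`); the function names are hypothetical.

```python
from statistics import fmean, pstdev

def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    center = fmean(values)
    spread = pstdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)

scores = [65, 60, 59, 81, 73, 48, 95, 63, 92, 53, 78, 56, 79,
          95, 78, 80, 89, 79, 97, 69, 65, 77, 80, 83, 75]

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), proportion_within(scores, k))
```

The observed proportion is never below the bound, which is what the theorem guarantees for any data set.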
Interpretation and Uses of the Standard Deviation
Empirical Rule: For any symmetrical, bell-shaped distribution:
About 68% of the observations will lie within ±1s of the mean
About 95% of the observations will lie within ±2s of the mean
Virtually all the observations will be within ±3s of the mean
[Figure: Bell-shaped curve showing the relationship between µ and σ. The horizontal axis is marked at µ-3σ, µ-2σ, µ-1σ, µ, µ+1σ, µ+2σ, and µ+3σ; about 68% of the observations lie within µ±1σ, 95% within µ±2σ, and 99.7% within µ±3σ.]
The Mean of Grouped Data
The Mean of a sample of data organized in a frequency distribution is computed by the following formula:
$$\bar{X} = \frac{\sum fX}{n}$$
From the example on hours spent in studying, we have
Hours studying Class Midpoint(X) Frequency(f) fX
30.2 – 35.1 32.65 1 32.65
25.2 – 30.1 27.65 3 82.95
20.2 – 25.1 22.65 7 158.55
15.2 – 20.1 17.65 11 194.15
10.2 – 15.1 12.65 8 101.2
∑fX=569.5
Therefore, the mean hours spent in studying is 569.5 / 30 = 18.98 hours
The Median of Grouped Data
The Median of a sample of data organized in a frequency distribution is computed by:
$$\text{Median} = L + \left(\frac{\frac{n}{2} - CF}{f}\right)(i)$$
where L is the lower limit of the median class, CF is the cumulative frequency preceding the median class, f is the frequency of the median class, and i is the class width.
Finding the Median Class
To determine the median class for grouped data
Construct a cumulative frequency distribution.
Divide the total number of data values by 2.
Determine which class will contain this value. For example, if n=30, 30/2 = 15, then determine which class will contain the 15th value.
From the example on hours spent in studying
Hours Studying   Lower Limit   f   Cumulative Frequency
30.2 – 35.1 30.2 1 30
25.2 – 30.1 25.2 3 29
20.2 – 25.1 20.2 7 26
15.2 – 20.1 15.2 11 19
10.2 – 15.1 10.2 8 8
n/2 = 30/2 = 15. Counting from the lowest class, the second class (15.2 – 20.1), with a cumulative frequency of 19, is the median class. The lower limit is 15.2. The cumulative frequency (CF) that precedes the median class is 8. The frequency (f) of the median class is 11. The class width is i = 5.
Therefore, the median is:
Median = 15.2 + [(15 - 8)/11](5) = 18.4
The Mode of Grouped Data
The Mode for grouped data is approximated by the formula
$$\text{Mode} = L + \left(\frac{d_1}{d_1 + d_2}\right)(i)$$
where
L = lower limit of the modal class interval
d1 = difference between the frequency of the modal class interval and that of the next class lower in value
d2 = difference between the frequency of the modal class interval and that of the next class higher in value
i = class width
From the example on the hours spent in studying, we have
• the modal class is 15.2 – 20.1, with the highest frequency of 11 (L = 15.2)
• d1 = 11 - 8 = 3
• d2 = 11 - 7 = 4
• i = 5
Therefore, the mode is
Mode = 15.2 + [3/(3+4)](5) = 17.3
Since mean = 18.98 > median = 18.4 > mode = 17.3, the distribution of the number of hours spent in studying is positively skewed.
Variance for Grouped Data
$$s^2 = \frac{n\sum fx^2 - \left(\sum fx\right)^2}{n(n - 1)}$$
where
n = sample size
f = frequency
x = class marks
From the example on hours spent in studying, we have
Hours studying X f fX fX2
30.2 – 35.1 32.65 1 32.65 1066.02
25.2 – 30.1 27.65 3 82.95 2293.57
20.2 – 25.1 22.65 7 158.55 3591.16
15.2 – 20.1 17.65 11 194.15 3426.75
10.2 – 15.1 12.65 8 101.2 1280.18
∑fX=569.5 ∑fX2=11657.68
Therefore, the variance is
$$s^2 = \frac{30(11657.68) - (569.5)^2}{30(30 - 1)} = \frac{25400.15}{870} \approx 29.20$$
And the standard deviation is
S = (29.20)^(1/2) = 5.4
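The grouped-data formulas for the mean, median, mode, and variance can be applied together to the hours-studying frequency distribution. The sketch below is one possible arrangement, with classes listed from lowest to highest; all variable names are illustrative. Because it works with unrounded products, the variance may differ from a hand computation in the second decimal place.

```python
from math import sqrt

# Hours-studying distribution, lowest class first.
lower_limits = [10.2, 15.2, 20.2, 25.2, 30.2]
midpoints = [12.65, 17.65, 22.65, 27.65, 32.65]
freqs = [8, 11, 7, 3, 1]
width = 5.0
n = sum(freqs)

# Mean: sum of f*X divided by n.
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
grouped_mean = sum_fx / n

# Median: locate the first class whose cumulative frequency reaches n/2.
cumulative_before = 0
for median_idx, f in enumerate(freqs):
    if cumulative_before + f >= n / 2:
        break
    cumulative_before += f
grouped_median = (lower_limits[median_idx]
                  + (n / 2 - cumulative_before) / freqs[median_idx] * width)

# Mode: class with the highest frequency, adjusted by its two neighbours.
modal_idx = freqs.index(max(freqs))
below = freqs[modal_idx - 1] if modal_idx > 0 else 0
above = freqs[modal_idx + 1] if modal_idx < len(freqs) - 1 else 0
d1 = freqs[modal_idx] - below
d2 = freqs[modal_idx] - above
grouped_mode = lower_limits[modal_idx] + d1 / (d1 + d2) * width

# Variance: [n * sum(f*x^2) - (sum(f*x))^2] / [n * (n - 1)].
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
grouped_variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
grouped_sd = sqrt(grouped_variance)

print(round(grouped_mean, 2), round(grouped_median, 1),
      round(grouped_mode, 1), round(grouped_sd, 1))
```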
Chapter Four
One-Sample Tests of Hypothesis
GOALS
When you have completed this chapter, you will be able to:
ONE Define a hypothesis and hypothesis testing.
TWO Describe the five step hypothesis testing procedure.
THREE Distinguish between a one-tailed and a two-tailed test of hypothesis.
FOUR Conduct a test of hypothesis about a population mean.
FIVE Conduct a test of hypothesis about a population proportion.
SIX Define Type I and Type II errors.
What is a Hypothesis?
A HYPOTHESIS is defined by Webster as “a tentative theory or supposition provisionally adopted to explain certain facts and to guide in the investigation of others”.
A statistical hypothesis is an assertion or statement, which may or may not be true, concerning one or more populations.
Example of Statistical Hypotheses
• A leading drug in the treatment of hypertension has an advertised therapeutic success rate of 84%. A medical researcher believes he has found a new drug for treating hypertensive patients that has a higher therapeutic success rate than the leading drug but with fewer side effects. He should assume that it is no better than the leading drug and then set out to reject this contention. The two statements, "the new drug is no better than the old one" (p = 0.84) and "the new drug is better than the old one" (p > 0.84), are examples of statistical hypotheses.
What is Hypothesis Testing?
Hypothesis testing is a procedure, based on sample evidence and probability theory, used to determine whether the hypothesis is a reasonable statement and should not be rejected, or is unreasonable and should be rejected.
Types of Hypothesis
1. Null Hypothesis (Ho) - the hypothesis that two or more variables are not related or that two or more statistics (e.g. means for two different groups) are not significantly different. It is a negation of the theory that the researcher would like to derive. In the above example, the statement ‘the new drug is no better than the old one’ is an example of a null hypothesis. It is usually constructed to enable the researcher to evaluate his own theory or the research hypothesis. In other words, the null hypothesis is stated with the sole purpose of rejecting it, thereby accepting the research hypothesis. The equality symbol (=) is commonly used in stating the null hypothesis.
2. Alternative Hypothesis (H1) – the hypothesis derived from the theory of the investigator. It generally states a specified relationship between two or more variables or that two or more statistics significantly differ. In other words, it is the operational statement of the investigator’s research hypothesis. The second statement ‘the new drug is better than the old one’ (above example) is an example of an alternative hypothesis. The symbols commonly used are >, < and ≠.
Two ways of stating the alternative hypothesis:
Predictive H1 (One-tailed or directional) – specifies the type of relationship existing between two or more variables (e.g. direct or inverse relationship) or specifies the direction of the difference between two or more statistics (e.g. µ1 > µ2 or µ1 < µ2).
Non-predictive H1 (Two-tailed or non-directional) – does not specify the type of relationship or the direction of the difference (e.g. µ1 ≠ µ2).
One-Tailed and Two-Tailed Tests
A test of any hypothesis where the alternative hypothesis is predictive such as
Ho: µ1 = µ2
H1: µ1 > µ2 or µ1 < µ2
is called a one-tailed test. The rejection region for the alternative hypothesis µ1 > µ2 lies entirely in the right tail of the distribution, while the rejection region for the alternative hypothesis µ1 < µ2 lies entirely in the left tail.
A test of any hypothesis where the alternative hypothesis is non-predictive such as
Ho: µ1 = µ2
H1: µ1 ≠ µ2
is called a two-tailed test, since the rejection region is split into two equal parts placed in each tail of the distribution.
More Examples of Stating Hypothesis
Example 1. Suppose you want to study the association between job satisfaction of the employees and the labor turnover in a certain private hospital.
Ho: There is no significant relationship between job satisfaction and labor turnover in a certain private hospital or, stated differently, job satisfaction does not significantly affect labor turnover.
H1: There is an inverse relationship between job satisfaction and labor turnover; more specifically, when job satisfaction decreases, labor turnover increases. (predictive)
More Examples . . .
Example 2. A researcher is conducting a study to determine if suicide incidence among teenagers can be attributed to drug use.
Ho: There is no significant difference between the suicide rates of teenagers who use drugs and those who do not.
H1: Suicide rates of teenagers who use drugs are significantly higher than the suicide rates of non-users. (predictive)
H1: There is a significant difference between the suicide rates of teenagers who use drugs and those who do not. (non-predictive)
More Examples . . .
Example 3. A social researcher is conducting a study to determine if the level of women’s participation in community extension programs of the barangay is affected by their educational attainment, occupation, income, civil status, and age.
Ho : The level of women’s participation in community extension programs is not affected by their educational attainment, occupation, income, civil status and age.
H1 : The level of women’s participation in community extension programs is affected by their educational attainment, occupation, income, civil status and age.
Hypothesis Testing
Step 1: State null and alternate hypotheses
Step 2: Select a level of significance
Step 3: Identify the test statistic
Step 4: Formulate a decision rule
Step 5: Take a sample, arrive at a decision
Outcome: Do not reject null, or reject null and accept alternate
Step One: State the null and alternate hypotheses
Three possibilities regarding means:
H0: µ = 0, H1: µ ≠ 0
H0: µ = 0, H1: µ > 0
H0: µ = 0, H1: µ < 0
The null hypothesis always contains equality.
Step Two: Select a Level of Significance.
Level of Significance: The probability of rejecting the null hypothesis when it is actually true; the level of risk in so doing.
Type I Error: Rejecting the null hypothesis when it is actually true (α).
Type II Error: Accepting the null hypothesis when it is actually false (β).
Risk table
Null Hypothesis    Researcher accepts Ho    Researcher rejects Ho
Ho is true         Correct decision         Type I error (α)
Ho is false        Type II error (β)        Correct decision
Step Three: Select the test statistic.
Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.
Examples: z, t, F, χ²
z Distribution as a test statistic
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
The z value is based on the sampling distribution of $\bar{X}$, which is normally distributed when the sample is reasonably large (recall Central Limit Theorem).
Step Four: Formulate the decision rule.
Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.
[Figure: Sampling distribution of the statistic z, a right-tailed test, .05 level of significance. The "do not reject" region (probability = .95) lies to the left of the critical value 1.65; the region of rejection (probability = .05) lies to the right.]
Decision Rule
Reject the null hypothesis and accept the alternate hypothesis if
Computed -z < Critical -z
or
Computed z > Critical z
Using the p-Value in Hypothesis Testing
p-Value: The probability, assuming that the null hypothesis is true, of finding a value of the test statistic at least as extreme as the computed value for the test.
It is calculated from the probability distribution function or by computer.
Decision Rule
If the p-Value is larger than or equal to the significance level, α, H0 is not rejected.
If the p-Value is smaller than the significance level, α, H0 is rejected.
Interpreting p-values
.05 < p ≤ .10: SOME evidence Ho is not true
.01 < p ≤ .05: STRONG evidence Ho is not true
.001 < p ≤ .01: VERY STRONG evidence Ho is not true
Step Five: Make a decision.
Test Concerning Means (σ is known or n ≥ 30) – a sample mean is to be compared with the population mean.
a.) Ho: µ = µo, H1: µ < µo
b.) Ho: µ = µo, H1: µ > µo
c.) Ho: µ = µo, H1: µ ≠ µo
Test Statistic:
$$z = \frac{(\bar{x} - \mu_o)\sqrt{n}}{\sigma}$$
Rejection Region:
a.) Z < -Zα   b.) Z > Zα   c.) Z < -Zα/2 or Z > Zα/2
An Example
A private university hypothesized that the mean starting monthly salary of its graduates is P8000 with a standard deviation of P1500. A sample of 25 employed graduates showed an average starting salary of P7500. Test this hypothesis at the 5% level of significance.
Solution:
Ho: µ = 8000
H1: µ ≠ 8000
Level of significance: α = 0.05, so α/2 = 0.05/2 = 0.025 in each tail (two-tailed)
Rejection Region: Z > 1.96 or Z < -1.96 (refer to Table A.1 Appendix)
Computation of the Test Statistic:
Given: x̄ = 7500, σ = 1500, n = 25
$$z = \frac{(\bar{x} - \mu_o)\sqrt{n}}{\sigma} = \frac{(7500 - 8000)\sqrt{25}}{1500} = -1.67$$
Decision: Since the value of the test statistic (z = -1.67) is greater than the tabular value of -1.96, do not reject Ho. The sample does not provide sufficient evidence against the claim of the private university.
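The starting-salary example can be reproduced with the standard library's `statistics.NormalDist`, which supplies the critical value and a p-value. This is a sketch, assuming σ = 1500 as stated in the problem; the function name `one_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """z statistic for a sample mean against a hypothesized population mean."""
    return (sample_mean - mu0) * sqrt(n) / sigma

alpha = 0.05
z = one_sample_z(sample_mean=7500, mu0=8000, sigma=1500, n=25)

standard_normal = NormalDist()                        # mean 0, standard deviation 1
critical = standard_normal.inv_cdf(1 - alpha / 2)     # two-tailed critical value, about 1.96
p_value = 2 * standard_normal.cdf(-abs(z))            # two-tailed p-value
reject_h0 = abs(z) > critical

print(round(z, 2), round(critical, 2), round(p_value, 4), reject_h0)
```

The p-value route gives the same decision as the critical-value route: the p-value exceeds 0.05, so Ho is not rejected.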
Test Concerning Means (σ is unknown and n < 30)
a.) Ho: µ = µo, H1: µ < µo
b.) Ho: µ = µo, H1: µ > µo
c.) Ho: µ = µo, H1: µ ≠ µo
Test Statistic:
$$t = \frac{(\bar{x} - \mu_o)\sqrt{n}}{s}$$
Rejection Region:
a.) t < -tα   b.) t > tα   c.) t < -tα/2 or t > tα/2
An Example
A perfume company claims that their best selling brand, Blue Ginger, has an average purity of 87.9%. Twenty bottles were examined and found to have an average purity of 85.6% with a standard deviation of 8.3%. Test at the 1% significance level whether the perfume company is telling the truth.
Solution:
HO: The average purity of Blue Ginger is equal to 87.9% (µ = 0.879).
H1: The average purity of Blue Ginger is less than 87.9% (µ < 0.879).
Significance Level: 1%
Rejection Region: t < -2.539 (with α = 0.01 and 19 degrees of freedom, i.e. 20 – 1 = 19)
Computation of the Test Statistic: t – test
x̄ = 0.856, µO = 0.879, s = 0.083, n = 20
$$t = \frac{\bar{x} - \mu_O}{s / \sqrt{n}} = \frac{0.856 - 0.879}{0.083 / \sqrt{20}} = -1.24$$
Decision: Since -1.24 is outside the rejection region, we do not reject HO at the 1% significance level. There is not enough evidence that the average purity of Blue Ginger is less than 87.9%, so the data do not contradict the perfume company's claim.
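The t statistic for the Blue Ginger example can be computed as below. The Python standard library has no t distribution, so this sketch takes the critical value 2.539 from the text rather than computing it; `one_sample_t` is a hypothetical helper name.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, s, n):
    """t statistic for a sample mean when the population sigma is unknown."""
    return (sample_mean - mu0) * sqrt(n) / s

n = 20
t = one_sample_t(sample_mean=0.856, mu0=0.879, s=0.083, n=n)
degrees_of_freedom = n - 1

critical_t = 2.539            # t(0.01, 19), taken from the text's table lookup
reject_h0 = t < -critical_t   # left-tailed test

print(round(t, 2), degrees_of_freedom, reject_h0)
```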
Test Concerning Proportion
Tests of hypotheses concerning proportions are required in many areas. A drug manufacturer is certainly interested in knowing what fraction of the patients recover after taking the medicine. In business, manufacturing firms are concerned about the proportion of defective items when a shipment is made. A social researcher might be interested in determining the proportion of people favoring the legalization of abortion in the Philippines.
Steps in testing a proportion
1. Ho: p = po
2. H1: p < po, p > po or p ≠ po
3. Choose a level of significance of size α.
4. Establish the rejection region.
5. Computation:
$$z = \frac{\hat{p} - p_o}{\sqrt{\dfrac{p_o(1 - p_o)}{n}}}$$
where p̂ = sample proportion, po = hypothesized population proportion, and n = sample size.
6. Make a decision, that is, reject Ho if z falls in the rejection region; otherwise do not reject Ho.
An Example A television station intends to stop airing a “telenovela” show, “Maganda ang Buhay”, if less than 35% of its intended televiewers watch the said program. A random sample of 1500 households yields an average viewing percentage of 32.9%. Should the television company pull out the airing of the show? Use the 5% significance level.
Solution:
HO: The average viewing percentage of the telenovela “Maganda ang Buhay” is equal to 35% (p = .35).
H1: The average viewing percentage of the telenovela “Maganda ang Buhay” is less than 35% (p < .35).
Significance Level: 5%
Rejection Region: z < -zα, that is, z < -1.65
Computation of Test Statistics: z – test
n = 1500, p̂ = 0.329, pO = 0.35
$$z = \frac{\hat{p} - p_O}{\sqrt{\dfrac{p_O(1 - p_O)}{n}}} = \frac{0.329 - 0.35}{\sqrt{\dfrac{0.35(1 - 0.35)}{1500}}} = -1.71$$
Decision: Since -1.71 is within the rejection region, we reject HO at the 5% significance level. Therefore, the average viewing percentage of the telenovela “Maganda ang Buhay” is less than 35% and the television company should pull out the airing of the show.
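The telenovela example can be sketched in Python as follows. The function name `one_proportion_z` is hypothetical; the critical value comes from `statistics.NormalDist` (about -1.645, which the text rounds to -1.65).

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(p_hat, p0, n):
    """z statistic for a sample proportion against a hypothesized proportion."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

alpha = 0.05
z = one_proportion_z(p_hat=0.329, p0=0.35, n=1500)
critical = NormalDist().inv_cdf(alpha)   # left-tail critical value
reject_h0 = z < critical

print(round(z, 2), round(critical, 3), reject_h0)
```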
Exercises
1. A sample of size 60 has a mean of 12.8 and a standard deviation of 2.5. Test the hypothesis that the population mean is 12 using a.) a one-tailed test at 0.01 level and b.) a two-tailed test at 0.05 level.
2. Suppose it is known that the mean annual income of assembly line workers in a certain plant is P100,000 with a standard deviation of P7000. You suspect that workers with active union interests have higher than average incomes and take a random sample of 85 of these active members, obtaining a mean of P105,000. Can you say that active union members have significantly higher incomes? Use 5% level of significance.
3. Last year the employees of the city sanitation department donated an average of 250 pesos to the volunteer rescue squad. Test the hypothesis at 0.01 level that the average contribution this year is still 250 if a random sample of 15 employees showed an average of 265 pesos with a standard deviation of 15 pesos. Assume that the donations are approximately normally distributed.
4. Suppose that in the past 40% of all adults favored capital punishment. Do we have reason to believe that the proportion of adults favoring capital punishment today has increased if, in a random sample of 200 adults, 120 favor capital punishment? Use a 0.05 level of significance.
5. The average height of males in the freshmen class of a certain university has been 65.8 inches, with a standard deviation of 3.2 inches. Is there reason to believe that there has been an increase in the average height if a random sample of 40 males in the present freshmen class have an average height of 67.5 inches? Use a 1% level of significance.
6. A researcher knows that the average height of Filipino women is 1.525 meters. A random sample of 25 women was taken and was found to have a mean height of 1.572 meters, with a standard deviation of .12 meters. Is there reason to believe that the 25 women in the sample are significantly taller than the others at .05 level of significance?
7. A random sample of 100 recorded deaths in the Philippines during the past year showed an average life span of 71.8 years with a standard deviation of 8.9 years. Does this seem to indicate that the average life span today is greater than 70 years? Use a 0.05 level of significance.
8. A mayoralty candidate in a certain city expects that approximately 60% of the voters will favor him in the coming election. To support his claim, he let a social researcher conduct a survey consisting of 100 randomly selected voters in the different barangays comprising the city. Results showed that 70 interviewed voters favor this particular mayoralty candidate. Is this sufficient evidence to conclude that the proportion of voters favoring this candidate in the coming election is higher than what he expected? Use .01 level of significance.
Chapter Five
Two-Sample Tests of Hypothesis
GOALS
When you have completed this chapter, you will be able to:
ONE Conduct a test of hypothesis about the difference between two independent population means.
TWO Conduct a test of hypothesis regarding the difference in two population proportions.
THREE Conduct a test of hypothesis about the mean difference between paired or dependent observations.
FOUR Understand the difference between dependent and independent samples.
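Difference-of-Means Test (σ1 and σ2 are known) – two sample means are being compared.
a.) Ho: µ1 = µ2, H1: µ1 < µ2
b.) Ho: µ1 = µ2, H1: µ1 > µ2
c.) Ho: µ1 = µ2, H1: µ1 ≠ µ2
Test Statistic:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
where d0 is the hypothesized difference µ1 - µ2 (d0 = 0 under the Ho above).
Rejection Region:
a.) Z < -Zα   b.) Z > Zα   c.) Z < -Zα/2 or Z > Zα/2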
An Example A university investigation team conducted a study to determine whether car ownership of students affects their academic performance. A random sample of 50 students who are non-car owners showed a grade point average (GPA) of 85 with a standard deviation of 10.2 while a group of 60 car owners got an average grade of 80 with a standard deviation of 8.9. Do the data provide sufficient evidence to indicate that the non-car owners are better than car owners in terms of academic performance? Test using 5 % level of significance.
Solution:
Let µ1 = mean grade for non-car owners
µ2 = mean grade for car owners
Ho: µ1 = µ2
H1: µ1 > µ2 (one-tailed)
Level of significance: α = 0.05
Rejection Region: Z > 1.645 (refer to Table A.1 Appendix)
Computation of the Test Statistic:
Given: x̄1 = 85, s1 = 10.2, n1 = 50; x̄2 = 80, s2 = 8.9, n2 = 60
The value of the z statistic, using the sample standard deviations in place of σ1 and σ2, will be
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(85 - 80) - 0}{\sqrt{\dfrac{(10.2)^2}{50} + \dfrac{(8.9)^2}{60}}} = 2.71$$
Decision: Since the value of the test statistic (z = 2.71) is greater than the tabular value of 1.645, reject Ho. The data provide sufficient evidence to indicate that non-car owners have better academic performance than car owners.
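The car-ownership example can be checked with a short script. This sketch follows the worked computation in using the sample standard deviations as stand-ins for σ1 and σ2; the function name `two_sample_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, sd1, n1, x2, sd2, n2, d0=0.0):
    """z statistic for the difference of two independent means."""
    return ((x1 - x2) - d0) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

alpha = 0.05
z = two_sample_z(x1=85, sd1=10.2, n1=50, x2=80, sd2=8.9, n2=60)
critical = NormalDist().inv_cdf(1 - alpha)   # right-tail critical value, about 1.645
reject_h0 = z > critical

print(round(z, 2), round(critical, 3), reject_h0)
```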
Difference-of-Means Test (σ1 = σ2 = σ but unknown)
a.) Ho: µ1 = µ2, H1: µ1 < µ2
b.) Ho: µ1 = µ2, H1: µ1 > µ2
c.) Ho: µ1 = µ2, H1: µ1 ≠ µ2
Test Statistic:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where
$$S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
with degrees of freedom v = n1 + n2 - 2
Rejection Region:
a.) t < -tα   b.) t > tα   c.) t < -tα/2 or t > tα/2
An Example A businessman is planning to put up either a computer store or a video store. He randomly selected 10 computer stores and 10 video stores. The average weekly incomes are P45,000 and P53,000 respectively with corresponding standard deviations of P17,500 and P15,000. At the 1% significance level, indicate whether he should be convinced to put up a computer store or a video store.
Solution:
Let µ1 = mean weekly income of computer stores
µ2 = mean weekly income of video stores
Ho: µ1 = µ2
H1: µ1 ≠ µ2 (two-tailed)
Significance Level: 1%
Rejection Region: t < -t(.005, 18) or t > t(.005, 18), that is, t < -2.878 or t > 2.878
Computation of Test Statistics: t – test
n1 = n2 = 10, x̄1 = P45,000, x̄2 = P53,000, s1 = P17,500, s2 = P15,000
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{45{,}000 - 53{,}000}{\sqrt{\dfrac{(10 - 1)(17{,}500)^2 + (10 - 1)(15{,}000)^2}{10 + 10 - 2}}\,\sqrt{\dfrac{1}{10} + \dfrac{1}{10}}} = -1.1$$
Decision: Since -1.1 is not within the rejection region, we do not reject HO at the 1% significance level. The data show no significant difference between the mean weekly incomes of computer stores and video stores, so they give the businessman no statistical reason to prefer one over the other.
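The pooled-variance t statistic for the store example can be computed as below. This is a sketch: the critical value 2.878 is taken from the text because the standard library has no t distribution, and `pooled_t` is a hypothetical helper name.

```python
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2, d0=0.0):
    """Pooled-variance t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)   # pooled standard deviation
    t = ((x1 - x2) - d0) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, df

t, df = pooled_t(x1=45_000, s1=17_500, n1=10, x2=53_000, s2=15_000, n2=10)

critical_t = 2.878                 # t(0.005, 18), taken from the text's table lookup
reject_h0 = abs(t) > critical_t    # two-tailed test

print(round(t, 1), df, reject_h0)
```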
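Testing the Difference of Two Proportions
a.) Ho: P1 = P2, H1: P1 < P2
b.) Ho: P1 = P2, H1: P1 > P2
c.) Ho: P1 = P2, H1: P1 ≠ P2
Test Statistic:
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$\hat{p}_1 = \frac{x_1}{n_1}, \qquad \hat{p}_2 = \frac{x_2}{n_2}, \qquad \hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$
Rejection Region:
a.) Z < -Zα   b.) Z > Zα   c.) Z < -Zα/2 or Z > Zα/2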
An Example
Consider random samples of 85 married couples last year and another 100 couples this year. Data show that 70 of last year’s married couples and 95 of this year’s married couples are entrepreneurs. Using the 1% significance level, can it be stated that the proportion of entrepreneurs has significantly increased from last year to this year?
Solution: Let P1 and P2 be the true proportions of married couples who are entrepreneurs last year and this year, respectively.
HO: The proportion of entrepreneurs has not significantly increased from last year to this year (p1 = p2).
H1: The proportion of entrepreneurs has significantly increased from last year to this year (p1 < p2).
Significance Level: 1%
Rejection Region: z < -zα = -z(.01), that is, z < -2.33 (one-tailed)
Computation of Test Statistics: z – test
n1 = 85, n2 = 100, p̂1 = 70/85 = 0.823, p̂2 = 95/100 = 0.95
$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{85(0.823) + 100(0.95)}{85 + 100} = 0.89$$
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{0.823 - 0.95}{\sqrt{(0.89)(1 - 0.89)\left(\dfrac{1}{85} + \dfrac{1}{100}\right)}} \approx -2.75$$
Decision: Since -2.75 is within the rejection region, reject HO at the 1% significance level. Therefore, the proportion of entrepreneurs has significantly increased from last year to this year.
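The entrepreneur example can be reproduced from the raw counts. In this sketch the proportions are not rounded before use, so the statistic comes out near -2.76, slightly different from a hand computation with rounded intermediate values; the decision is the same. The function name `two_proportion_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference of two proportions, using the pooled estimate."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # equals (n1*p1 + n2*p2) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

alpha = 0.01
z = two_proportion_z(x1=70, n1=85, x2=95, n2=100)
critical = NormalDist().inv_cdf(alpha)   # left-tail critical value, about -2.33
reject_h0 = z < critical

print(round(z, 2), round(critical, 2), reject_h0)
```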
Dependent Samples
Samples are called dependent if any one of the following cases is true.
1. Before and after experiments - an experimental variable is being measured in terms of its effectiveness. This can be done by comparing the observations taken from the same group before and after the introduction of the experimental variable. If significant difference exists, then the experimental variable is said to be effective.
2. Two different groups matched pair by pair with respect to some relevant characteristics. The primary purpose of matching is to ensure that the observations differ only because of the experimental variable being tested.
• A study is conducted to determine the effect of fraternity membership to the performance of the students. One way to measure the effect of fraternity membership (experimental variable) is to consider say for instance 30 students and their performance (grades) before and after membership to fraternity are being compared (case 1). Another way is to compare the performance of two groups of students, those who are fraternity members and those who are non-fraternity members. The students in the first group should be carefully matched with the students in the second group with respect to some relevant characteristics such as IQ level, study habits, gender, etc. This will ensure that if a significant difference in the performance exists, then this could solely be attributed to the fact that the first group of students are fraternity members and the second group are non-members.
a.) Ho: μ1 = μ2     b.) Ho: μ1 = μ2     c.) Ho: μ1 = μ2
    H1: μ1 < μ2         H1: μ1 > μ2         H1: μ1 ≠ μ2
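Test Statistic:

$$t = \frac{(\bar{d} - d_0)\sqrt{n}}{S_d}$$

where:

$$S_d = \sqrt{\frac{n\sum d_i^2 - \left(\sum d_i\right)^2}{n(n-1)}}$$

with v = n - 1 degrees of freedom, where n is the number of pairs.

Rejection Region: a.) t < -t_α     b.) t > t_α     c.) t < -t_{α/2} or t > t_{α/2}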
Example. If you wished to measure the effectiveness of a new diet, you would weigh the dieters at the start and at the finish of the program.
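As a minimal sketch of the dependent-samples t statistic for a diet study like this one, the code below applies the formulas for t and S_d to made-up data. The weights are hypothetical (they do not come from the slides), and the helper name `paired_t` is an assumption for illustration.

```python
import math

def paired_t(before, after, d0=0.0):
    """Paired t statistic: t = (d_bar - d0) * sqrt(n) / S_d, with n - 1 degrees of freedom."""
    d = [b - a for b, a in zip(before, after)]  # difference for each pair
    n = len(d)
    d_bar = sum(d) / n
    # Computational formula for the standard deviation of the differences
    s_d = math.sqrt((n * sum(x * x for x in d) - sum(d) ** 2) / (n * (n - 1)))
    t = (d_bar - d0) * math.sqrt(n) / s_d
    return t, n - 1

# Hypothetical weights (kg) of six dieters at the start and finish of a program
before = [82.0, 90.5, 77.0, 95.0, 88.5, 79.0]
after = [80.0, 87.0, 77.5, 91.0, 86.0, 78.0]

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The resulting t would then be compared with the tabular t value for n - 1 degrees of freedom, using whichever of the three rejection regions matches the alternative hypothesis.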
Chapter Six
Analysis of Variance
GOALS
When you have completed this chapter, you will be able to:
ONE Understand the purpose of Analysis of Variance.
TWO Discuss the general idea of analysis of variance.
THREE Organize data into a one-way
FOUR Define and understand the terms treatments and blocks.
FIVE Conduct a test of hypothesis among three or more treatment means.
Purpose of ANOVA
• The Analysis of Variance (ANOVA for short) is a technique designed to test whether or not more than two sample means differ significantly from each other.
• In the preceding chapter, we learned that the t-test or the z-test is used to test the significance of the difference between two sample means. ANOVA, which can test the equality of several means simultaneously, is therefore an extension of the t-test or the z-test, each of which can handle only two means at a time.
Illustration
• Suppose, for example, that three methods of teaching (A, B, C) are being compared in terms of effectiveness and are applied to three groups of students. If the researcher uses the t-test, he will have to test each of the following pairs separately: A and B, A and C, and B and C. In other words, the researcher would apply the t-test formula three times, which takes considerable time and effort. There is also a possibility that none of the pairs differ significantly, in which case that time and effort are wasted.
Purpose of ANOVA
• Now, if the researcher uses ANOVA, he can test all three sample means simultaneously, and the test stops if the conclusion is that there is no significant difference. However, if ANOVA shows a significant difference among the three sample means (A, B, C), then the t-test must be used to find which pairs of means differ significantly. Another way is to use Duncan's Multiple Range Test (DMRT), which is beyond the scope of this manual.
Why the name analysis of variance?
• The phrase analysis of variance means that we analyze the total variation in a set of observations and attribute it to specific sources or causes of variation. With reference to the above example, two such sources of variation might be (1) actual differences among the three methods, reflected in the varying performance of the three groups of students (treatment), and (2) chance, which in problems like this is usually called experimental error.
Underlying assumptions for ANOVA
ANOVA requires the following conditions:
• The sampled populations follow the normal distribution.
• The populations have equal standard deviations.
• The samples are independent.
Steps in ANOVA
• The following are the steps and computations involved in the Analysis of Variance technique.
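Ho: μ1 = μ2 = μ3 = . . . = μk (for k sample means)
H1: At least two means differ significantly
Test Statistic: $F = \dfrac{MSTr}{MSE}$
Rejection Region: F > F tabular (α, k-1, n-k)
where

$$MSTr = \text{Mean Square (Treatment)} = \frac{SSTr}{k-1}$$

$$MSE = \text{Mean Square (Error)} = \frac{SSE}{n-k}$$

Computational Formulas

$$TSS = \text{Total Sum of Squares} = \sum_{i=1}^{n} x_i^2 - \frac{(\text{Grand Total})^2}{n}$$

$$SSTr = \text{Sum of Squares (Treatment)} = \frac{(ct_1)^2}{n_1} + \frac{(ct_2)^2}{n_2} + \cdots + \frac{(ct_k)^2}{n_k} - \frac{(\text{Grand Total})^2}{n}$$

$$SSE = \text{Sum of Squares (Error)} = TSS - SSTr$$

Here $ct_j$ denotes the total of the observations in treatment (column) $j$, $n_j$ the number of observations in that treatment, and $n$ the total number of observations.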
Analysis of Variance (ANOVA) Table

| Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | F Value |
|---|---|---|---|---|
| Treatment | k-1 | SSTr | MSTr | F = MSTr/MSE |
| Error | n-k | SSE | MSE | |
| Total | n-1 | TSS | | |
Example 1
ECP Restaurants specialize in meals for families. The owner recently developed a new meat loaf dinner. Before making it a part of the regular menu, she decides to test it in several of her restaurants.
She would like to know if there is a difference in the mean number of dinners sold per day at Restaurants A, B, and C. Use the .05 significance level.

Number of Dinners Sold by Restaurant

| Day | Restaurant A | Restaurant B | Restaurant C |
|---|---|---|---|
| Day 1 | 13 | 10 | 18 |
| Day 2 | 12 | 12 | 16 |
| Day 3 | 14 | 13 | 17 |
| Day 4 | 12 | 11 | 17 |
| Day 5 | - | - | 17 |
Solution:
Ho: μA = μB = μC
H1: At least two means differ significantly.
Level of significance is .05.
Rejection Region: The numerator degrees of freedom, k-1, equal 3-1 or 2. The denominator degrees of freedom, n-k, equal 13-3 or 10. The critical value of F at 2 and 10 degrees of freedom is 4.10. Thus, Ho is rejected if F > 4.10.
Using the data provided, the ANOVA calculations follow.
Computations
$$TSS = \sum_{i=1}^{n} x_i^2 - \frac{(\text{Grand Total})^2}{n} = 13^2 + 12^2 + \cdots + 17^2 - \frac{(182)^2}{13} = 2634 - 2548 = 86$$

$$SSTr = \frac{(ct_1)^2}{n_1} + \frac{(ct_2)^2}{n_2} + \frac{(ct_3)^2}{n_3} - \frac{(\text{Grand Total})^2}{n} = \left(\frac{51^2}{4} + \frac{46^2}{4} + \frac{85^2}{5}\right) - \frac{(182)^2}{13} = 2624.25 - 2548 = 76.25$$

$$SSE = TSS - SSTr = 86 - 76.25 = 9.75$$
ANOVA Table

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F |
|---|---|---|---|---|
| Treatments | 76.25 | 3-1=2 | 76.25/2=38.125 | 38.125/.975=39.103 |
| Error | 9.75 | 13-3=10 | 9.75/10=.975 | |
| Total | 86.00 | 13-1=12 | | |
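The computational formulas for TSS, SSTr, SSE and F can be checked against the restaurant data with a short sketch. The function name `one_way_anova` and the dictionary keys are illustrative choices, not part of the slides. The critical F of 4.10 is copied from the text rather than computed.

```python
def one_way_anova(groups):
    """One-way ANOVA using the computational (column-total) formulas."""
    values = [x for g in groups for x in g]
    n = len(values)   # total number of observations
    k = len(groups)   # number of treatments
    correction = sum(values) ** 2 / n  # (Grand Total)^2 / n
    tss = sum(x * x for x in values) - correction
    sstr = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = tss - sstr
    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return {"TSS": tss, "SSTr": sstr, "SSE": sse,
            "MSTr": mstr, "MSE": mse, "F": mstr / mse}

# Dinners sold per day at Restaurants A, B and C (from the example)
dinners = {
    "A": [13, 12, 14, 12],
    "B": [10, 12, 13, 11],
    "C": [18, 16, 17, 17, 17],
}

result = one_way_anova(list(dinners.values()))
critical_f = 4.10  # F at (2, 10) degrees of freedom, alpha = .05, from the text
reject_h0 = result["F"] > critical_f

for name, value in result.items():
    print(f"{name}: {value:.3f}")
print("Reject Ho:", reject_h0)
```

Groups of unequal size are handled naturally, because each column total is divided by its own number of observations.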
Decision: Since the computed F of 39.103 is greater than the critical F of 4.10, the decision is to reject the null hypothesis and conclude that at least two of the treatment means are not the same.
The mean number of meals sold at the three locations is not the same. Specifically, as shown by Duncan's Multiple Range Test, the mean number of meals sold at Restaurant C is significantly higher than at A and B, but A and B do not differ significantly.
The ANOVA tables that follow are from the SPSS and Excel systems.
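SPSS output: ANOVA (dependent variable: volsold)

| | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Between Groups | 76.250 | 2 | 38.125 | 39.103 | .000 |
| Within Groups | 9.750 | 10 | .975 | | |
| Total | 86.000 | 12 | | | |

SPSS output: Duncan (volsold), subsets for alpha = 0.05

| Restaurant | N | Subset 1 | Subset 2 |
|---|---|---|---|
| 2 | 4 | 11.5000 | |
| 1 | 4 | 12.7500 | |
| 3 | 5 | | 17.0000 |
| Sig. | | .094 | 1.000 |

Means for groups in homogeneous subsets are displayed.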
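Excel output: Anova: Single Factor

SUMMARY

| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| Aynor | 4 | 51 | 12.75 | 0.92 |
| Loris | 4 | 46 | 11.50 | 1.67 |
| Lander | 5 | 85 | 17.00 | 0.50 |

ANOVA

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 76.25 | 2 | 38.13 | 39.10 | 2E-05 | 4.10 |
| Within Groups | 9.75 | 10 | 0.98 | | | |
| Total | 86.00 | 12 | | | | |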
When the null hypothesis that the means are equal is rejected, we want to know which treatment means differ.
Some common post hoc tests are Duncan's Multiple Range Test (DMRT) and Tukey's Test.
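To illustrate the idea of looking for the pairs that differ, the sketch below computes an ordinary pooled two-sample t statistic for each pair of restaurants. This is a simple stand-in chosen for illustration: it is not DMRT or Tukey's test, it makes no adjustment for running several tests, and the helper name `pooled_t` is an assumption.

```python
import math
from itertools import combinations
from statistics import mean, variance

# Dinners sold per day at Restaurants A, B and C (from Example 1)
dinners = {
    "A": [13, 12, 14, 12],
    "B": [10, 12, 13, 11],
    "C": [18, 16, 17, 17, 17],
}

def pooled_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / std_error, n1 + n2 - 2

# One t statistic per pair: (A, B), (A, C), (B, C)
pairwise = {
    (a, b): pooled_t(dinners[a], dinners[b])
    for a, b in combinations(dinners, 2)
}

for (a, b), (t, df) in pairwise.items():
    print(f"{a} vs {b}: t = {t:.3f}, df = {df}")
```

The pattern agrees with the Duncan output: the A versus B statistic is small, while both comparisons involving C are large in absolute value.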
Chapter Seven
Linear Regression and Correlation
GOALS
When you have completed this chapter, you will be able to:
ONE Draw a scatter diagram.
TWO Understand and interpret the terms dependent variable and independent variable.
THREE Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
FOUR Conduct a test of hypothesis to determine if the population coefficient of correlation is different from zero.
FIVE Calculate the least squares regression line and interpret the slope and intercept values.
SIX Construct and interpret a confidence interval and prediction interval for the dependent variable.
SEVEN Set up and interpret an ANOVA table.
Correlation Analysis
Correlation Analysis is a group of statistical techniques to measure the association between two variables.
A Scatter Diagram is a chart that portrays the relationship between two variables.
The Dependent Variable is the variable being predicted or estimated.
The Independent Variable provides the basis for estimation. It is the predictor variable.
[Figure: scatter diagram titled "Advertising Minutes and $ Sales", with Advertising Minutes on the horizontal axis and Sales ($ thousands) on the vertical axis]
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.
• Also called Pearson's r and Pearson's product moment correlation coefficient.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Values of -1.00 or 1.00 indicate perfect and strong correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
[Figures: four scatter diagrams of Y against X, both axes running from 0 to 10, titled "Perfect Negative Correlation", "Perfect Positive Correlation", "Zero Correlation", and "Strong Positive Correlation"]
Formula for r
We calculate the coefficient of correlation from the following formula.

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
• It is the square of the coefficient of correlation.
• It ranges from 0 to 1.
• It does not give any information on the direction of the relationship between the variables.
Example
Suppose an investigator wants to determine the extent to which a relationship exists between size of income and the number of years of education an individual has completed. The following table shows the data for ten individuals. Compute and interpret r.

| Individual # | Y, Income (thousand pesos) | X, Education (years) |
|---|---|---|
| 1 | 45 | 20 |
| 2 | 63 | 19 |
| 3 | 36 | 16 |
| 4 | 52 | 20 |
| 5 | 29 | 12 |
| 6 | 33 | 14 |
| 7 | 48 | 16 |
| 8 | 55 | 18 |
| 9 | 72 | 20 |
| 10 | 66 | 22 |
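Solution
Preliminary computations from the above data: n = 10, Σx = 177, Σy = 499, Σxy = 9,173, Σx² = 3,221, Σy² = 26,793. Therefore,

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{10(9173) - (177)(499)}{\sqrt{\left[10(3221) - (177)^2\right]\left[10(26793) - (499)^2\right]}} = .83$$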
The value of r², which is (.83)² = .6889, means that 68.89% of the total variability in income is explained by its linear relationship with education.
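The computation of r and r² can be reproduced directly from the ten data pairs. The sketch below follows the computational formula for r. The function name `pearson_r` is illustrative, and because the code does not round r before squaring it, r² comes out slightly higher than the value obtained by squaring .83.

```python
import math

# Data from the example: years of education (X) and income in thousand pesos (Y)
education = [20, 19, 16, 20, 12, 14, 16, 18, 20, 22]
income = [45, 63, 36, 52, 29, 33, 48, 55, 72, 66]

def pearson_r(x, y):
    """Coefficient of correlation from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(education, income)
r_squared = r ** 2

print(f"r = {r:.2f}, r squared = {r_squared:.4f}")
```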
Testing for the Significance of r
The sample correlation coefficient r is a value computed from a random sample of n pairs of measurements. Different random samples of size n from the same population will generally produce different values of r. Therefore, we need to test the significance of the computed r value. The null hypothesis is ρ = 0 against the alternative that ρ ≠ 0, where ρ is the population coefficient of correlation. Rejection of Ho leads to the conclusion that the relationship is significant at the chosen alpha level. Failure to reject Ho implies that the relationship is not significant and can be attributed to chance.
From the above result, test the null hypothesis that there is no linear relationship between income and education. Use a 0.05 level of significance.
Solution:
Ho: ρ = 0
H1: ρ ≠ 0 (two-tailed)
Level of Significance: α = 0.05 (α/2 = 0.025 in each tail)
Rejection Region: t > 2.306 or t < -2.306 (with n - 2 = 8 degrees of freedom)
Computation of the Test Statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.83\sqrt{8}}{\sqrt{1-(.83)^2}} = 4.21$$

Decision: Since the value of the test statistic (t = 4.21) is greater than the tabular value of 2.306, reject the null hypothesis of no linear relationship and conclude that there is a significant relationship between education and income. Specifically, the higher the educational attainment, the higher the income earned.
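The test statistic for the significance of r is a one-line formula, sketched below with the rounded r = .83 and n = 10 from the example. The helper name `correlation_t` is illustrative, and the critical value 2.306 is copied from the text rather than computed.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, n - 2

t_stat, df = correlation_t(0.83, 10)

# Two-tailed critical t at alpha = 0.05 with 8 degrees of freedom, from the text
critical_t = 2.306
reject_h0 = abs(t_stat) > critical_t

print(f"t = {t_stat:.2f}, df = {df}, reject Ho: {reject_h0}")
```

Using the absolute value of t covers both tails of the rejection region in one comparison.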
Regression Analysis
In Regression Analysis we use the independent variable (X) to estimate the dependent variable (Y).
• The relationship between the variables is linear.
• Both variables must be at least interval scale.
• The least squares criterion is used to determine the equation. That is, the term $\sum (Y - \hat{Y})^2$ is minimized.
Assumptions Underlying Linear Regression
• For each value of X, there is a group of Y values, and these Y values are normally distributed.
• The means of these normal distributions of Y values all lie on the straight line of regression.
• The standard deviations of these normal distributions are the same.
• The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
Regression Analysis
The regression equation is $\hat{Y} = a + bX$ (1)
where
• $\hat{Y}$ is the predicted value of Y for any X.
• a is the Y-intercept. It is the estimated Y value when X = 0.
• b is the slope of the line, or the average change in $\hat{Y}$ for each change of one unit in X.
The least squares principle is used to obtain a and b.
Deterministic Model
Model (1) is called a deterministic model. It gives an exact relationship between x and y. This model expresses that y is determined exactly by x, and that for a given value of x there is one and only one value of y.
Probabilistic Model
However, in many cases the relationship between variables is not exact. For instance, if y is food expenditure and x is income, then model (1) would state that food expenditure is determined by income only and that all households with the same income spend the same amount on food. But food expenditure is also determined by many other variables, such as household size, taste, and preference, which explains why different households with the same income spend different amounts on food. Hence, to take these variables into consideration and to make our model complete, we add another term to the right side of model (1), called the random error term (ε).
• The complete regression model is written as y = A + Bx + ε (2)
The regression model (2) is called a probabilistic model or a statistical relationship.
The random error term (ε) is included in the model to represent the following two phenomena.
• Missing or omitted variables. As mentioned earlier, food expenditure is affected by many variables other than income. The random error term is included to capture the effect of all those missing variables that have not been included in the model.
• Random variation. Human behavior is unpredictable. For example, a household may have many parties during the month and may spend more than usual on food during that month. This variation in food expenditure may be called random variation.
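The difference between the deterministic and the probabilistic model can be illustrated with a tiny simulation. Everything numeric here is hypothetical: the values of A, B and the error spread are made up for illustration, and the error term is assumed to be normally distributed.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

A = 2.0      # hypothetical intercept
B = 0.25     # hypothetical slope (food expenditure per unit of income)
SIGMA = 1.5  # hypothetical standard deviation of the random error term

def deterministic(x):
    """Model (1): y is determined exactly by x."""
    return A + B * x

def probabilistic(x):
    """Model (2): the same line plus a random error term."""
    return A + B * x + random.gauss(0, SIGMA)

income = 40  # five hypothetical households with the same income
exact = [deterministic(income) for _ in range(5)]
with_error = [probabilistic(income) for _ in range(5)]

print("Deterministic:", exact)
print("Probabilistic:", [round(y, 2) for y in with_error])
```

Under model (1) every household with the same income gets the same food expenditure, while under model (2) the values scatter around the line.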
Regression Analysis
The equations used to determine a and b by least squares are:

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

$$a = \bar{y} - b\bar{x}$$
• An Example
We have already established a significant linear relationship between income and education (i.e., number of years of education). We may now formulate an equation that allows us to predict the income of a person given the number of years of education completed. Referring to the above example, we find that n = 10, ΣXi = 177, ΣYi = 499, ΣXi² = 3,221, ΣYi² = 26,793, ΣXiYi = 9,173, $\bar{Y}$ = 49.9, and $\bar{X}$ = 17.7.
Therefore,

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(9173) - (177)(499)}{10(3221) - (177)^2} = 3.87$$

and

$$a = \bar{y} - b\bar{x} = 49.9 - 3.87(17.7) = -18.60$$

The model for prediction purposes is given by

$$\hat{y} = -18.60 + 3.87x$$

The equation can be interpreted as follows: each additional year of educational attainment corresponds to an increase of 3.87, or 3,870 pesos, in a person's income.

We can use the regression equation to estimate values of Y. The estimated income of a person who has spent 16 years in education will be:

$$\text{Income} = -18.60 + 3.87(16) = 43.32$$
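The slope, intercept and prediction can be reproduced from the raw data with a short sketch. The function name `least_squares` is illustrative. The code carries the slope at full precision, so the intercept and the prediction can differ in the second decimal place from a hand computation that rounds the slope to 3.87 first.

```python
# Data from the example: years of education (X) and income in thousand pesos (Y)
education = [20, 19, 16, 20, 12, 14, 16, 18, 20, 22]
income = [45, 63, 36, 52, 29, 33, 48, 55, 72, 66]

def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y_bar - b * x_bar
    return a, b

a, b = least_squares(education, income)
predicted_income = a + b * 16  # estimated income for 16 years of education

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"Predicted income at 16 years: {predicted_income:.2f} thousand pesos")
```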
Standard Error of Estimate (SEE)
– It is a measure of the variability around the regression line, i.e., the dispersion of the observed values about the line.
– It tells how much variation there is in the dependent variable between the raw (observed) value and the expected value from the regression.
– The SEE allows us to generate a confidence interval around the regression line, as we did in the estimation of means.
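The slides do not give a formula for the SEE. A common textbook definition, assumed here, is the square root of the sum of squared residuals divided by n - 2. The sketch below applies that assumed formula to the income and education data; the function names are illustrative.

```python
import math

# Data from the income/education example
education = [20, 19, 16, 20, 12, 14, 16, 18, 20, 22]
income = [45, 63, 36, 52, 29, 33, 48, 55, 72, 66]

def least_squares(x, y):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(p * q for p, q in zip(x, y))
    sum_x2 = sum(p * p for p in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b

def standard_error_of_estimate(x, y):
    """Assumed definition: sqrt( sum (y - y_hat)^2 / (n - 2) )."""
    a, b = least_squares(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))

see = standard_error_of_estimate(education, income)
print(f"Standard error of estimate: {see:.2f} thousand pesos")
```

The divisor n - 2 reflects the two quantities (a and b) estimated from the data.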
Exercise
It is generally known that the number of road accidents is inversely related to road width. The following data show the results of a study of the number of accidents occurring annually on roads of different widths:

| Road width in feet (X) | 75 | 52 | 60 | 33 | 22 | 40 | 70 | 35 | 55 | 80 |
| Number of accidents (Y) | 40 | 84 | 55 | 92 | 90 | 86 | 38 | 88 | 78 | 32 |

– Draw the scatter diagram.
– Find the correlation coefficient and test for its significance.
– Establish a regression equation of the form Y = a + bX.
– Predict the number of accidents for a 50 ft road width.