COURSE MANUAL
Descriptive Statistics STA 111
University of Ibadan Distance Learning Centre Open and Distance Learning Course Series Development
Copyright © 2009, Revised in 2015 by Distance Learning Centre, University of Ibadan, Ibadan. All rights reserved. No part of this publication may be reproduced, stored in a retrieval System, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.
ISBN 978-021-269-8
General Editor: Prof. Bayo Okunade
University of Ibadan Distance Learning Centre University of Ibadan,
Nigeria
Telex: 31128NG
Tel: +234 (80775935727) E-mail: [email protected]
Website: www.dlc.ui.edu.ng
Vice-Chancellor’s Message
The Distance Learning Centre is building on a solid tradition of over two decades of service in the provision of the External Studies Programme, and now Distance Learning Education, in Nigeria and beyond. The Distance Learning mode to which we are committed gives many deserving Nigerians access to higher education, especially those who, by the nature of their engagements, do not have the luxury of full-time education. Recently, it has also contributed in no small measure to providing places for the teeming Nigerian youths who, for one reason or another, could not gain admission into the conventional universities.
These course materials have been written by writers specially trained in ODL course delivery. The writers have made great efforts to provide up to date information, knowledge and skills in the different disciplines and ensure that the materials are user-friendly.
In addition to the provision of course materials in print and e-format, a lot of Information Technology input has also gone into the deployment of course materials. Most of them can be downloaded from the DLC website and are available in audio format, which you can also download to your mobile phone, iPod or MP3 player, among other devices, so that you can listen to the audio study sessions. Some of the study session materials have been scripted and are being broadcast on the university’s Diamond Radio FM 101.1, while others have been delivered and captured in audio-visual format in a classroom environment for use by our students. Detailed information on availability and access is available on the website. We will continue in our efforts to provide and review course materials for our courses.
However, for you to take advantage of these formats, you will need to improve your I.T. skills and develop the requisite distance learning culture. It is well known that, for the efficient and effective provision of distance learning education, the availability of appropriate and relevant course materials is a sine qua non. So also is the availability of multiple platforms for the convenience of our students. It is in fulfilment of this that a series of course materials is being written to enable our students to study at their own pace and convenience.
It is our hope that you will put these course materials to the best use.
Prof. Abel Idowu Olayinka
Vice-Chancellor
Foreword
As part of its vision of providing education for “Liberty and Development” for Nigerians and the International Community, the University of Ibadan Distance Learning Centre has recently embarked on a vigorous repositioning agenda aimed at embracing a holistic and all-encompassing approach to the delivery of its Open Distance Learning (ODL) programmes. Thus we are committed to global best practices in distance learning provision. Apart from providing efficient administrative and academic support for our students, we are committed to providing educational resource materials for their use. We are convinced that, without up-to-date, learner-friendly and distance-learning-compliant course materials, there cannot be any basis to lay claim to being a provider of distance learning education. Indeed, the availability of appropriate course materials in multiple formats is the hub of any distance learning provision worldwide.
In view of the above, we are vigorously pursuing, as a matter of priority, the provision of credible, learner-friendly and interactive course materials for all our courses. We commissioned teams of experts to author and review the course materials, and their outputs were subjected to rigorous peer review to ensure a high standard. The approach emphasizes not only cognitive knowledge, but also the skills and humane values which are at the core of education, even in an ICT age.
The development of the materials which is on-going also had input from experienced editors and illustrators who have ensured that they are accurate, current and learner-friendly. They are specially written with distance learners in mind. This is very important because, distance learning involves non-residential students who can often feel isolated from the community of learners.
It is important to note that, for a distance learner to excel there is the need to source and read relevant materials apart from this course material. Therefore, adequate supplementary reading materials as well as other information sources are suggested in the course materials.
Apart from the responsibility for you to read this course material with others, you are also advised to seek assistance from your course facilitators, especially academic advisors, during your study, even before the interactive session, which is by design for revision. Your academic advisors will assist you using convenient technology including Google Hangouts, YouTube, Talk Fusion, etc., but you have to take advantage of these. It will also be of immense advantage if you complete assignments as and when due, so as to receive the necessary feedback as a guide.
The implication of the above is that, a distance learner has a responsibility to develop requisite distance learning culture which includes diligent and disciplined self-study, seeking available administrative and academic support and acquisition of basic information technology skills. This is why you are encouraged to develop your computer
skills by taking advantage of the training that the Centre provides, and to put these skills into use.
In conclusion, it is envisaged that the course materials would also be useful for the regular students of tertiary institutions in Nigeria who are faced with a dearth of high quality textbooks. We are therefore, delighted to present these titles to both our distance learning students and the university’s regular students. We are confident that the materials will be an invaluable resource to all. We would like to thank all our authors, reviewers and production staff for the high quality of work.
Best wishes.
Professor Bayo Okunade
Director
Course Development Team
Content Authoring: Shittu O. I.
Content Editor: Prof. Remi Raji-Oyelade
Production Editor: Ogundele Olumuyiwa Caleb
Learning Design/Assessment Authoring: Folajimi Olambo Fakoya
Managing Editor: Ogunmefun Oladele Abiodun
General Editor: Prof. Bayo Okunade
Course Introduction
The goals of a public enterprise, corporate body or individual are achieved if decisions are based
on accurate, reliable and timely information called ‘data’. The massive data will be useful only
when they are organized, summarized and presented in a manner that enhances the
comprehension of the actual situation on ground, making clear the significant relationship among
the variables under investigation. This will ensure that trends and pattern of movement of
individual variables are determined.
Descriptive statistics therefore is an aspect of statistics that deals with the compilation and
presentation of data not necessarily for the purpose of rigorous statistical analysis but simply to
provide concise information on which decisions can be taken.
The purpose of this lecture note, therefore, is to introduce you to the discipline called
‘Statistics’, its nature, scope and coverage. The various methods of data presentation are
discussed; simple summaries of data, as well as the different methods of making comparisons
among variables, especially those measured in different units, are treated in detail. The method of
interpreting results is also given.
Objectives
By the time you have finished this lecture note, you should be able to:
1. explain the use of statistics in our day-to-day activities;
2. present data in tables, charts and diagrams;
3. identify and calculate the measures of location suitable for a particular set of data, based on the purpose of the inquiry;
4. calculate the measures of variation and the coefficient of variation;
5. examine the shape of a distribution for normality, skewness and kurtosis;
6. explain the various methods of data collection and the situations under which each of them can be used;
7. obtain the linear regression model from bivariate data;
8. calculate and interpret the correlation coefficient;
9. discuss the concepts of rate, ratio and proportion;
10. calculate different types of indices from given data; and
11. discuss the considerations, uses and limitations of consumer price indices.
Table of Contents
Study Session 1: What is Statistics?
    Introduction
    Learning Outcomes for Study Session 1
    1.1 Meaning of statistics
    1.2 Branches of Statistics
    1.3 Uses of Statistics
    1.4 Terms and Concepts in Statistics
        1.4.1 Population Sample and Variate
        1.4.2 What is Data?
    Summary
    Self-Assessment Question (SAQs) for Study Session 1
        SAQ 1.1 (Tests Learning Outcomes 1.1)
        SAQ 1.2 (Tests Learning Outcomes 1.2)
        SAQ 1.3 (Tests Learning Outcomes 1.3)
        SAQ 1.4 (Tests Learning Outcomes 1.4)
    Notes on SAQ
    References
Study Session 2: Presentation of Data
    Introduction
    Learning Outcomes for Study Session 2
    2.1 Ways of Presenting a Mass of Data
    2.2 Frequency Table
        2.2.1 Cumulative Curve (OGIVE)
    2.3 Simple descriptive analysis of data in tables and diagrams
        2.3.1 Histogram
        2.3.2 Stem plots (Stem and Leave Plots)
        2.3.3 Back-to-back stemplot
        2.3.4 Box Plot (Box and Whiskers Plot)
    Summary
    Self-Assessment Question (SAQs) for Study Session 2
        SAQ 2.1 (Tests Learning Outcomes 2.1)
        SAQ 2.2 (Tests Learning Outcomes 2.2)
        SAQ 2.3 (Tests Learning Outcomes 2.3)
    Notes on SAQ
    Reference
Study Session 3: Measure of the Centre of a Set of Observations
    Introduction
    Learning Outcomes for Study Session 3
    3.1 Measures of Central Tendency
    3.2 Mean
        3.2.1 Calculation of Mean from Grouped Data
    3.3 Median
        3.3.1 Calculation of Median From a grouped data
    3.4 Mode
        3.4.1 Calculation of Mode from Grouped Data
    3.5 Partition Values
    3.6 Other Measures of Central Tendency
        3.6.1 Other Partition Values from Grouped Data
    Summary
    Self-Assessment Question (SAQs) for Study Session 3
        SAQ 3.1 (Tests Learning Outcomes 3.1)
        SAQ 3.2 (Tests Learning Outcomes 3.2)
        SAQ 3.3 (Tests Learning Outcomes 3.3)
        SAQ 3.4 (Tests Learning Outcomes 3.4)
        SAQ 3.5 (Tests Learning Outcomes 3.5)
        SAQ 3.6 (Tests Learning Outcomes 3.6)
    Notes on SAQ
    References
Study Session 4: Measures of Dispersion/Variation
    Introduction
    Learning Outcomes for Study Session 4
    4.1 Variation and its Measures
    4.2 The Range
    4.3 The Mean Absolute Deviation
    4.4 The Variance
    4.5 Standard Deviation
    4.6 Coding Method
    Summary
    Self-Assessment Question (SAQs) for Study Session 4
        SAQ 4.1 (Tests Learning Outcomes 4.1)
        SAQ 4.2 (Tests Learning Outcomes 4.2)
        SAQ 4.3 (Tests Learning Outcomes 4.3)
        SAQ 4.4 (Tests Learning Outcomes 4.4)
        SAQ 4.5 (Tests Learning Outcomes 4.5)
        SAQ 4.6 (Tests Learning Outcomes 4.6)
    Notes on SAQ
    References
Study Session 5: Algebraic Treatment of Mean and Variance
    Introduction
    Learning Outcomes for Study Session 5
    5.1 Pooled Mean and Variance
    5.2 Adjusting Values of Mean and Standard Deviations for Mistakes
    Self-Assessment Question (SAQs) for Study Session 5
        SAQ (Tests Learning Outcomes)
    References
Study Session 6: Measure of Skewness and Kurtosis
    Introduction
    Learning outcomes for study session 6
    6.1 Define skewness and kurtosis
    6.2 Calculating measure of skewness and kurtosis from simple series and grouped data
    6.3 Determining whether a set of data; is normally distributed, the direction of skewness and the level of peakedness
    Summary for study session 6
    Self-Assessment Questions (SAQs) for Study Session 6
        SAQ 6.1-6.2
    Reference
Study Session 7: Methods of Collecting Statistical Data
    Introduction
    Learning outcomes for study session 7
    7.1 The various methods of data collection
    7.2 Limitations of Data Collection in Nigeria
    Summary for study session 7
    Self-Assessment Questions (SAQs) for Study Session 7
        SAQ 7.1-7.2
    References
Study Session 8: Regression Analysis
    Introduction
    Learning outcomes for Study Session 8
    8.1 Regression Analysis
    8.2 Types of Regression Models
    8.3 The Simple Regression Model
    8.4 Estimation of Parameters
    8.5 Testing the Significance of the Model
    Summary for Study Session 8
    Self-Assessment Questions (SAQs) for Study Session 8
        SAQ 8.1-8.5
    References
Study Session 9: Correlation and Association
    Introduction
    Learning outcomes for Study Session 9
    9.1 Correlation
    9.2 Coefficient of Rank Correlation
    Summary for Study Session 9
    Self-Assessment Questions (SAQs) for Study Session 9
        SAQ 9.1-9.2
    References
Study Session 10: Proportions, Rates and Indices
    Introduction
    Learning Outcomes for Study Session 10
    10.1 Proportion, Rates and indices
    10.2 Consideration for an Index Number
    10.3 Methods of Construction of Price Index
    10.4 Uses of Consumer Price Index
    Summary for Study Session 10
    Self-Assessment Questions (SAQs) for Study Session 10
        SAQ 10.1-10.4
    References
Study Session 11: Time Series Analysis
    Introduction
    Learning Outcomes for Study Session 11
    11.1 Time Series Analysis
    11.2 Methods of Analysis of Time Series data
    11.3 Components of Time Series
    Summary for Study Session 11
    Self-Assessment Questions (SAQs) for Study Session 11
        SAQ 11.1-11.3
    References
Study Session 1 What is Statistics?
Introduction
Statistics is a universal subject used in all disciplines and in all areas of human endeavour. The
word statistics was originally applied only to such data as the state required for its official
purposes. To a layman, it also refers to any set of quantitative data relating to a particular
measurement, whether or not that data is of interest.
The systematic collection of official statistics for political purposes originated in Germany
towards the end of the 18th Century, by comparing data such as population, industrial and
agricultural output. Also in England, a collection of numerical data enabled government
departments to predict levels of revenues and expenditure with more precision than before.
Learning Outcomes for Study Session 1
When you have studied this session, you should be able to:
1.1 Explain the meaning of statistics;
1.2 Discuss the nature, scope and coverage of statistics;
1.3 Mention the use of statistics in our day-to-day activities;
1.4 Define terms and concepts that would facilitate understanding of this course.
1.1 Meaning of statistics
The earliest origin of statistics lies in the desire of rulers to count the number of inhabitants or
measure the value of taxable land in their domains. This has developed into the careful measurement
of weight and distance, and the counting of physical quantities and items, in many disciplines such as
agriculture and the life and behavioural sciences.
Thus, the study of statistics is essential for sound reasoning, precise judgment and
objective decisions based on up-to-date, accurate and reliable data.
Box 1.1: Meaning of Statistics
Statistics can simply be defined as the “science of data”. It is the science of collecting,
organizing and interpreting numerical facts, which we call data.
Most of us, especially those in the media (reporters), have little or nothing to do with a large mass
of data.
Statistics is also the science and practice of developing human knowledge through the use of
empirical data expressed in quantitative form. It is based on statistical theory. It is a branch of
applied mathematics where randomness and uncertainty are modelled by probability theory
(Wikipedia Encyclopedia).
In Nigeria, official data collection and usage started with the Statistical Act of 1947, which
established the Department of Statistics in the office of the Governor General of the Federation.
Thus, many researchers, educationalists, businessmen and government agencies at the national,
state or local level rely on data to answer fundamental questions pertaining to their operations
and programmes. In fact, there can be no meaningful science without statistics.
In-Text Question
Statistics could also be defined as:
a. A structural science
b. Codes that help programmers in programming
c. A branch of applied mathematics where randomness and uncertainty are modelled by
probability theory
d. None of the above
In-Text Answer
c. A branch of applied mathematics where randomness and uncertainty are modelled by
probability theory
1.2 Branches of Statistics
The science of data, “statistics”, can be divided into three broad parts which are not mutually
exclusive, viz.: descriptive statistics, statistical methods and statistical inference.
Descriptive Statistics
It is the act of summarizing and giving a descriptive account of numerical information in the
form of reports, charts and diagrams. The goal of descriptive statistics is to gain information
from collected data. It begins with the collection of data, by either counting or measurement, in an
inquiry.
It involves the summary of specific aspects of the data, such as the average value and a measure of
spread. Suitable graphs, diagrams and charts are then used to gain an understanding and a clear
interpretation of the phenomenon under investigation, keeping firmly in mind where the data
come from.
Statistical Method
This is a device for classifying data and making clear the relationships between the variables under
consideration. This can be achieved by using statistical tools and formulae. It ranges from
the computation of simple summaries of data (mean, median, mode, etc.) to the complex modelling
used in policy formulation.
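As a minimal sketch of the simple summaries mentioned above, the short Python example below computes the mean, median and mode of a small data set. The scores are hypothetical and chosen only for illustration, and Python's standard `statistics` module is one of several tools that could be used.

```python
import statistics

# Hypothetical test scores for ten students (illustrative data only).
scores = [52, 61, 61, 58, 70, 45, 61, 66, 58, 73]

mean_score = statistics.mean(scores)      # arithmetic average of the scores
median_score = statistics.median(scores)  # middle value of the sorted scores
mode_score = statistics.mode(scores)      # most frequently occurring score

print("Mean:", mean_score)
print("Median:", median_score)
print("Mode:", mode_score)
```

Each of these measures is treated in detail, including its calculation from grouped data, in Study Session 3.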
Statistical Inference
This is the act of drawing conclusions about a population from the quantities computed
from a representative sample of it. It is a process of making inferences or generalizing about the
population under certain conditions and assumptions. Statistical inference involves the processes
of parameter estimation and hypothesis testing.
1.3 Uses of Statistics
Statistics is used in many of our day-to-day activities, including the following:
1. Planning and decision making by individuals, states, business organizations, research
institutions, etc.
2. Forecasting and prediction for the future based on a good model, provided that its basic
assumptions are not violated.
3. Project implementation and control; this is especially useful in ongoing projects such as
network analysis, construction of roads and bridges, and implementation of government
programs and policies.
4. Monitoring and evaluation of plans, projects, programmes and policy initiatives. It also
assists in monitoring and evaluating the activities of government programmes.
1.4 Terms and Concepts in Statistics
There are many terms and concepts in statistics that we need to learn in order to
understand the subject better. The terms and concepts discussed below are used
daily in the field of statistics.
1.4.1 Population, Sample and Variate
In the earlier part of this study session, we explained that the main aim of statistics is to gain
information about a population. We therefore need to know what a population is:
Population: A population is the collection of items under investigation. It may be finite
(countable) or infinite (uncountable).
Parameter: A parameter is a summary quantity computed from a population, e.g. the population
mean ($\mu$), the population variance ($\sigma^2$), etc.
Sample: A sample is a representative part of a population observed for the purpose of making a
scientific statement or taking decisions about the population. A good sample must be randomly
selected and adequate.
A sample can be random or purposive. A random sample may be obtained by tossing a coin,
throwing a die, drawing discs from a container or using a table of random numbers. A purposive
(judgmental) sample is obtained when members of a population are selected by discretion or
personal judgment.
Statistic: A statistic is a summary quantity calculated from a sample for the purpose of
drawing conclusions about the related population, e.g. the sample mean ($\bar{x}$), the sample
variance ($s^2$), etc.
The characteristics of units in the population can be measured or counted (quantitative), e.g.
weight, height, age, number of cars. They can also be observed (qualitative, or attributes), e.g.
colour of eyes, beauty, complexion, etc.
Variate: A variate (variable) is any quantity or attribute whose value varies from one unit of
observation to another. A quantitative variate may be discrete or continuous.
Continuous Variate: A continuous variate is a variate which may take all values within a given
range. Its values are obtained by measurements e.g. height, volume, time, examination score etc.
Discrete Random Variate: A discrete random variate is one whose value changes by steps. Its
value may be obtained by counting. It normally takes integer values e.g. number of cars, number
of chairs.
1.4.2 What is Data?
Having defined statistics as the science of data, it is necessary at this juncture to ask ourselves,
the pertinent question: What is data?
Data: Data can be described as a mass of unprocessed information obtained from the measurement
or counting of a characteristic or phenomenon. In their raw form, data are usually massive and
disorderly. They become meaningful only when they have been reduced to some kind of
order by means of tables or diagrams.
Statistical data: These are data obtained through objective measurement or enumeration of
characteristics using the state of the art equipment that is precise and unbiased. Such data when
subjected to statistical analysis produce results with high precision.
Sources of Statistical Data
Statistical data can be obtained from
1. Census: Complete enumeration of all the units of the population.
2. Surveys: The study of a representative part of a population.
3. Experimentation: Observation from experiments carried out in laboratories and research
centres.
Types of Data
Data can be categorized as internal or external data.
Internal Data
When data is collected from within the organization and used in the organization concerned, it is
called internal data. Examples are data from accounts and internal records of an establishment.
External Data
If data is collected from outside the organization, it is called external data. Examples are data
from journals not published by the organization itself. There are two major sources of statistical
data: the internal source and the external source.
Primary Data
These are data generated first hand, that is, data obtained directly from respondents by personal
interview, measurement or observation.
Secondary Data
These are data obtained from publications, newspapers, magazines and annual reports. They are
usually summarized data used for a purpose other than the one for which they were originally collected.
Summary
In Study Session 1, you have learnt that:
1. The study of statistics is essential for sound reasoning, precise judgment and objective
decision making in the face of up-to-date, accurate and reliable data.
2. Statistics can be defined as the science of collecting, organizing and interpreting numerical
facts, which we call data.
3. The branches of statistics are descriptive statistics, statistical methods and statistical
inference.
4. Statistics is used in many of our day-to-day activities.
5. A population is the collection of items under investigation.
6. A parameter is a summary quantity computed from a population.
7. A variate (variable) is any quantity or attribute whose value varies from one unit of
observation to another.
8. Data can be described as a mass of unprocessed information obtained from the measurement
or counting of a characteristic or phenomenon.
9. Data can be categorized as internal or external data.
Self-Assessment Questions (SAQs) for Study Session 1
Now that you have completed this study session, you can assess how well you have achieved its
Learning Outcomes by answering the following questions. Write your answers in your Study
Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your
answers with the Notes on the Self-Assessment Questions at the end of this Module.
SAQ 1.1 (Tests Learning Outcomes 1.1)
What is the meaning of Statistics?
SAQ 1.2 (Tests Learning Outcomes 1.2)
List the branches of Statistics
SAQ 1.3 (Tests Learning Outcomes 1.3)
Mention three uses of statistics.
SAQ 1.4 (Tests Learning Outcomes 1.4)
1. Define population.
2. What are the sources of statistical data?
Notes on SAQ
SAQ 1.1
Statistics can simply be defined as the “science of data”. It is the science of collecting, organizing
and interpreting numerical facts, which we call data.
SAQ 1.2
Descriptive statistics, statistical methods and statistical inference
SAQ 1.3
I. Planning and decision making by individuals, states, business organizations, research
institutions, etc.
II. Forecasting and prediction for the future based on a good model, provided that its basic
assumptions are not violated.
III. It assists in monitoring and evaluation of the activities of government programs.
SAQ 1.4
1. A population is the collection of items under investigation
2.
i. Census: Complete enumeration of all the units of the population.
ii. Surveys: The study of a representative part of a population.
iii. Experimentation: Observation from experiments carried out in laboratories and research
centres.
References
Brookes, B.C. and Dick, W. F. L. (1969): An Introduction to Statistical Method, 2nd Edition, H. E.
B. Publishers.
Moore, D.S. and McCabe, G. P. (1993): Introduction to the practice of Statistics; 2nd Edition;
New York: W. H. Freeman and Company.
Adamu, S. O. and Johnson, T. L. (1997): Statistics for Beginners, Book 1: SAAL Publications.
Study Session 2 Presentation of Data
Introduction
The aim of this study session is to introduce the various methods of presenting statistical data.
Presentation of data in tables, charts and diagrams facilitates understanding of the important
features of the data.
Learning Outcomes for Study Session 2
When you have studied this session, you should be able to:
2.1 Explain the various ways of presenting a mass of data;
2.2 Construct a frequency table;
2.3 Explain and carry out simple descriptive analysis of data in tables and diagrams.
2.1 Ways of Presenting a Mass of Data
Numerical information (data) about the characteristics of a variable is often
massive and complex when collected. More often than not, it is necessary to present data in tables, charts and
diagrams in order to have a clear understanding of the data, and to illustrate the relationship
existing between the variables being examined.
We shall discuss the frequency table, cumulative frequency table, stem plot, box plot and
histogram, assuming that you are already familiar with other graphs such as the pie chart, frequency
curve, frequency polygon, etc.
In-Text Question
Why is it necessary to present data in tables, charts and diagrams?
a. To have a clear understanding of the data and illustrate the relationship between variables
b. To break information into pieces
c. To allow a blind man to understand the data
d. To have a clear understanding of the data
In-Text Answer
a.) To have a clear understanding of the data and illustrate the relationship between variables
2.2 Frequency Table
The first step in examining a set of data for a single quantitative variable intelligently is to
construct a frequency table. This is a tabular arrangement of data into various classes
together with their corresponding frequencies.
Procedure
Given a set of observations $x_1, x_2, \ldots, x_N$ for a single variable:
1. Find the range (R), i.e. the difference between the largest and smallest values of the data.
2. Determine the number of classes (K), depending on the size of the data.
3. Find the class interval (C), i.e. the range divided by the number of classes.
4. Tally, i.e. assign the values to classes.
5. Find the class frequencies.
Note: With the advent of computers, all these steps can be accomplished easily.
Example 2.1: The following are the scores of 40 students in Mathematics test:
50, 08, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
57, 67, 53, 70, 69, 63, 65, 66, 22, 27
Construct a frequency table for the above data.
Solution
Range: 98 – 08 = 90
No. of classes = 10
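$\text{Class interval} = \dfrac{\text{Range}}{\text{No. of classes}} = \dfrac{90}{10} = 9$ (rounded up to a class width of 10 in the tables that follow)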
Working Table
Table 2.1
Class       Tally        Frequency
1 – 10      I            1
11 – 20     II           2
21 – 30     IIII         4
31 – 40     IIIII        5
41 – 50     IIIII I      6
51 – 60     IIIII IIII   9
61 – 70     IIIII II     7
71 – 80     III          3
81 – 90     II           2
91 – 100    I            1
Frequency Table
Table 2.2
Score          Frequency
01 up to 10    1
11 up to 20    2
21 up to 30    4
31 up to 40    5
41 up to 50    6
51 up to 60    9
61 up to 70    7
71 up to 80    3
81 up to 90    2
91 up to 100   1
Total          40
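The tallying procedure can also be carried out by computer. The following is a minimal Python sketch, assuming classes of width 10 that start at 1, as in Table 2.2; the function name `frequency_table` and its parameters are illustrative and not part of the course text.

```python
from collections import Counter

# Scores of the 40 students in Example 2.1.
scores = [
    50, 8, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
    38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
    57, 67, 53, 70, 69, 63, 65, 66, 22, 27,
]

def frequency_table(values, width=10, start=1):
    """Count values in classes start..start+width-1, start+width..., and so on.

    Assumes every value is at least `start` (illustrative helper).
    """
    # Class index 0 holds 1-10, index 1 holds 11-20, etc.
    counts = Counter((v - start) // width for v in values)
    table = {}
    for k in range(max(counts) + 1):
        lower = start + k * width
        table[(lower, lower + width - 1)] = counts.get(k, 0)
    return table

table = frequency_table(scores)
for (lower, upper), f in table.items():
    print(f"{lower:>3} - {upper:<3}  {f}")
print("Total", sum(table.values()))
```

The printed frequencies can be compared line by line with Table 2.2.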
2.2.1 Cumulative Curve (OGIVE)
The graph of the cumulative frequency of a single variable is called an OGIVE. It is drawn by
plotting the cumulative frequency against the upper class boundary of each class interval. From the
OGIVE it is possible to obtain the median, the quartiles and the inter-quartile range (IQR).
Example 2.2: Using the data in Example 2.1, construct the cumulative frequency curve.
Solution
Table 2.3
Score          Frequency   Cumulative Frequency
01 up to 10    1           1
11 up to 20    2           3
21 up to 30    4           7
31 up to 40    5           12
41 up to 50    6           18
51 up to 60    9           27
61 up to 70    7           34
71 up to 80    3           37
81 up to 90    2           39
91 up to 100   1           40
Total          40
[Diagram 2.1: OGIVE. Vertical axis: cumulative frequency (Cum. Freq.); horizontal axis: Score.]
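The cumulative frequencies in Table 2.3 are running totals of the class frequencies. A short Python sketch of that calculation is given below; the variable names are illustrative, and the upper class boundaries (10.5, 20.5, ...) assume the scores are whole numbers.

```python
from itertools import accumulate

# Upper class limits and frequencies from Table 2.2.
upper_limits = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
frequencies = [1, 2, 4, 5, 6, 9, 7, 3, 2, 1]

# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))

# Points to plot for the ogive: (upper class boundary, cumulative frequency).
# The boundary is taken as the upper limit plus 0.5 (assumes integer scores).
ogive_points = [(u + 0.5, cf) for u, cf in zip(upper_limits, cumulative)]

for boundary, cf in ogive_points:
    print(boundary, cf)
```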
Example 2.3
The following data represent the ages (in years) of people living in a housing estate in Ibadan.
30, 31 17 16 6 2 8 43 18 18 32 33
9 18 33 19 21 13 14 13 14 6 45 52 61
23 26 14 15 14 15 27 19 36 37 11 12
11 12 20 39 40 20 63 69 64 29 28 27
15
Present the above data in a frequency table using a suitable class interval.
Solution
Maximum value = 69
Minimum value = 2
Range = 69 – 2 = 67
A choice of 10 classes will result in some classes with zero frequencies, while the choice of 6
classes is more reasonable, with at least one item in each class. In practice, it is easy to determine
the number of classes for a given set of data. We are using K = 6 as our number of classes.
$\text{Class interval} = \dfrac{R}{K} = \dfrac{67}{6} \approx 11.17$. For convenience, a class width of 10 is used in Table 2.4.
Table 2.4
(1)        (2)                       (3)             (4)              (5)              (6)
Class      Tally                     Frequency (F)   Relative         Cumulative       CRF
                                                     Frequency (RF)   Frequency (CF)
1 – 10     IIIII                     5               0.10             5                0.10
11 – 20    IIIII IIIII IIIII IIIII   20              0.40             25               0.50
21 – 30    IIIII IIII                9               0.18             34               0.68
31 – 40    IIIII III                 8               0.16             42               0.84
41 – 50    III                       3               0.06             45               0.90
51 – 60    I                         1               0.02             46               0.92
61 – 70    IIII                      4               0.08             50               1.00
Total                                50              1.00
It is pertinent to define the columns in the frequency table for better understanding.
Class interval is a subdivision of the total range of values which a (continuous) variable
may take.
Class frequency is the number of observations of the variate which fall in a given
interval (column 3).
Relative frequency for a class is the actual frequency of the class divided by the total
frequency (column 4). Sometimes, it is better to work with relative frequencies
(especially in the calculation of probability values).
Cumulative frequency of a class is the sum of all the frequencies before the class up to
and including the frequency of that class (column 5).
Relative cumulative frequency: When the cumulative frequency of a class is expressed as
a proportion of the total frequency, what we have is called the relative cumulative frequency
(column 6). It is sometimes called the distribution function.
Box 2.1: Observations from the Table
The data have been summarized and we now have a clearer picture of the distribution of the ages
of the inhabitants of the Estate.
Exercise
Now answer the following questions from the table.
i. How many residents are aged between 11 and 30 years?
ii. How many residents are aged above 30 years?
iii. What is the probability that a person selected at random from the Estate will be less than
31 years old?
Answers
(i) 29 (ii) 16 (iii) 0.68
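The relative and cumulative columns of Table 2.4, and the three answers above, can be checked with a few lines of Python. This is only a sketch that starts from the class frequencies printed in Table 2.4; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies as printed in Table 2.4 (classes 1-10, 11-20, ..., 61-70).
frequencies = [5, 20, 9, 8, 3, 1, 4]

total = sum(frequencies)                                   # N
relative = [f / total for f in frequencies]                # column 4 (RF)
cumulative = list(accumulate(frequencies))                 # column 5 (CF)
cumulative_relative = [cf / total for cf in cumulative]    # column 6 (CRF)

# Exercise questions answered from the table.
aged_11_to_30 = frequencies[1] + frequencies[2]   # classes 11-20 and 21-30
aged_above_30 = sum(frequencies[3:])              # classes 31-40 and above
prob_under_31 = cumulative[2] / total             # CRF of the 21-30 class

print(aged_11_to_30, aged_above_30, round(prob_under_31, 2))
```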
2.3 Simple Descriptive Analysis of Data in Tables and Diagrams
Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph. Tables,
charts and graphs should, ideally, be self-explanatory. The reader should be able to understand
them without detailed reference to the text, since users may well pick things up
from the tables or graphs without reading the whole text. Below are some ways of analysing data.
2.3.1 Histogram
A histogram is a chart used for presenting the frequency distribution of the values of a variable
(assuming the variate is of the continuous type).
A histogram is a group of rectangles, one drawn above each class interval, such that the area of each
rectangle is proportional to the frequency of the observations falling in the corresponding class
interval. The chart is constructed by plotting the values of the variable along the X-axis and the
frequencies along the Y-axis.
Vertical lines are drawn at the lower and upper class boundaries of each class up to the
frequency of the class. Horizontal lines representing the width of each class interval are then drawn on top
of the vertical lines.
Where the class intervals are not all of the same width, the heights must be adjusted so that the
area of each rectangle represents the frequency.
Draw the histogram of the data in Example 2.3 above.
[Diagram 2.2: Histogram of Ages. Vertical axis: Frequency; horizontal axis: Ages (in years), from 1 to 70.]
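The height adjustment for unequal class intervals mentioned above can be sketched in Python as follows. The classes and frequencies here are hypothetical values chosen only for illustration; the height of each rectangle is taken as frequency divided by class width, so that area equals frequency.

```python
# Hypothetical classes with unequal widths: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 20), (20, 40, 12)]

def bar_heights(classes):
    """Height of each rectangle = frequency / class width (frequency density)."""
    return [f / (upper - lower) for lower, upper, f in classes]

heights = bar_heights(classes)

# Check: area (height x width) recovers the frequency of each class.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

print(heights)
print(areas)
```

With equal class widths the division changes every height by the same factor, which is why the frequencies themselves can be plotted in that case.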
2.3.2 Stem Plots (Stem-and-Leaf Plots)
In statistics, a stemplot (or stem-and-leaf plot) is a graphical display of quantitative data that is
similar to a histogram and is useful in visualizing the shape of a distribution. It was invented by J.
W. Tukey (1915 – 2000). Stemplots contain more information than histograms because,
unlike in a histogram where bars are used, the individual data values are displayed in a table-like
format, in order of increasing magnitude. A basic stemplot contains two columns separated by a
vertical line. The left column contains the stems and the right column contains the leaves.
Constructing a Stemplot
To construct a stemplot, take note of the following steps:
I. The observations must first be sorted in ascending order.
II. It must be determined what the stems will represent and what the leaves will represent.
Typically, the leaf contains the last digit of the number and the stem contains all of the
other digits. In the case of very large or very small numbers, the data values may be
rounded to a particular place value (such as the hundreds place) that will be used for the
leaves; the remaining digits to the left of the rounded place value are used as the stems.
The stemplot is drawn with two columns separated by a vertical line. The stems are listed to the
left of the vertical line. It is important that each stem is listed only once and that no numbers are
skipped, even if it means that some stems will have no leaves. The leaves are listed in increasing
order in a row to the right of each stem.
Example 2.4
Present the following data in a stem-and-leaf plot
68 66 72 75 76 106 54 57 56 63 59 66 68
64 88 84 81
Solution
Table 2.5
Stem | Leaf
   5 | 4 6 7 9
   6 | 3 4 6 8 8
   7 | 2 2 5 6
   8 | 1 4 8
   9 |
  10 | 6
Example 2.5
Given the weight of 20 rams at the end of two weeks feeding on a special diet as follows:
46, 59, 35, 41, 46, 21, 24, 33, 40, 45, 49, 53, 48, 54, 61, 36, 70, 58, 47, 12
Make a stem plot for these data
Solution
The stem plot is given below
1 | 2
2 | 1 4
3 | 3 5 6
4 | 0 1 5 6 6 7 8 9
5 | 3 4 8 9
6 | 1
7 | 0
Important Features
i. It is easy to locate the centre of the distribution, i.e. median = 46.
ii. It is also possible to examine the shape of the distribution. Turn the stem plot on its side so
that the larger observations fall on the right (e.g. the above distribution is symmetric).
It is likewise possible to read off the median, the first quartile (q1), the third quartile (q3) and the
inter-quartile range (IQR).
iii. It is also possible to look for deviations from the overall shape of the data, e.g. outliers.
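The construction steps for a stemplot can be sketched in Python as shown below, using the ram weights listed in Example 2.5. The helper name `stem_and_leaf` is illustrative; it assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict
from statistics import median

# Weights of the 20 rams as listed in Example 2.5.
weights = [46, 59, 35, 41, 46, 21, 24, 33, 40, 45,
           49, 53, 48, 54, 61, 36, 70, 58, 47, 12]

def stem_and_leaf(values):
    """Return {stem: [leaves in increasing order]} with no stem skipped."""
    leaves = defaultdict(list)
    for v in sorted(values):            # step I: sort in ascending order
        stem, leaf = divmod(v, 10)      # step II: stem = tens, leaf = last digit
        leaves[stem].append(leaf)
    # List every stem between the smallest and largest, even if it has no leaves.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

plot = stem_and_leaf(weights)
for stem, leaf_list in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaf_list)}")
print("median =", median(weights))
```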
2.3.3 Back-to-back stemplot
Back-to-back stemplots are used to compare two distributions side by side. This type of double
stemplot contains three columns, each separated by a vertical line. The centre column contains
the stems. The first and third columns each contain the leaves of a different distribution. The
leaves of the distribution in the leftmost column are aligned to the right and are
listed in increasing order from right to left. Here is an example of a back-to-back stemplot
comparing the distribution of the weight of cows with the distribution of the weight of rams.
Example 2.6
Suppose 20 cows were fed with the same special feed as in Example 2.5. The back-to-back stem
plot is shown below:
Table 2.6
Weight of Cow | Stem | Weight of Ram
            0 |  1   | 2
            1 |  2   | 1 4
            2 |  3   | 3 5 6
          3 1 |  4   | 0 1 5 6 6 7 8 9
        5 4 2 |  5   | 3 4 8 9
7 6 5 5 4 2 1 |  6   | 1
          4 2 |  7   | 0
              |  8   |
            1 |  9   |
Observations
i. Weight of Ram is symmetric
ii. Weight of Cow is skewed to the right
iii. There is an outlier in the weight of cow (i.e. 91 kg.)
NOTE: Stem plot works well for small set of data especially when the observations are all
greater than zero.
2.3.4 Box Plot (Box and Whiskers Plot)
This is a chart that looks like a box when drawn. Box plots are most useful when comparing two or
more sets of sample data. A box plot shows the centre and spread of the data, gives a clear
picture of the symmetry of a data set and shows outliers very clearly. It is constructed by first
calculating the median and the 1st and 3rd quartiles.
In-Text Question
The box plot chart is most useful when comparing two or more sets of sample data. True or False
In-Text Answer
True
In a box plot, the ends of the box are at the quartiles, so that the length of the box is the
inter-quartile range. The median is marked by a line within the box. The ‘whiskers’ are the two lines
outside the box that extend to the smallest and largest observations. Outliers are shown as dots
outside the whiskers.
Example 2.7
Consider the data in example 2.6 above. Construct the box plots.
Solution
[Diagram 2.3: Boxplots. Fig. 1.2a: Weight of Ram; Fig. 1.2b: Weight of Cow; common vertical scale from 0 to 10.]
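The quantities needed to draw a box plot (smallest value, first quartile, median, third quartile, largest value) can be computed as sketched below for the ram weights listed in Example 2.5. This sketch uses Python's `statistics.quantiles` with the "inclusive" method; several conventions for locating quartiles exist, so values worked by hand from another rule may differ slightly.

```python
from statistics import quantiles

# Weights of the 20 rams as listed in Example 2.5.
weights = [46, 59, 35, 41, 46, 21, 24, 33, 40, 45,
           49, 53, 48, 54, 61, 36, 70, 58, 47, 12]

# Quartiles: the three cut points that divide the ordered data into four parts.
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")

summary = {
    "min": min(weights),   # end of the lower whisker
    "q1": q1,              # lower end of the box
    "median": q2,          # line inside the box
    "q3": q3,              # upper end of the box
    "max": max(weights),   # end of the upper whisker
}
iqr = q3 - q1              # length of the box

print(summary, "IQR =", iqr)
```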
In a box plot, the centre, the inter-quartile range and the spread are immediately apparent. However,
the box plot is generally inferior to the stem plot or histogram in that it shows only the centre and
the partition values; it tells nothing about the shape of the distribution or the other values in the
data set.
A stem plot (for a large data set) provides a clearer display of a single distribution, especially when
accompanied by the median and quartiles as numerical signposts.
Summary
In Study Session 2, you have learnt that:
1. It is necessary to present data in tables, charts and diagrams in order to have a clear
understanding of the data
2. The first step in examining intelligently a set of data for a single quantitative variable is
by constructing a frequency table
3. Frequency table is a tabular arrangement of data into various classes together with their
corresponding frequencies
4. The graph of the cumulative frequency of a single variable is called an OGIVE
5. Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph.
6. Histogram is a chart used for presenting the frequency distribution of the values of a
variable
7. Stemplot is a graphical display of quantitative data that is useful in visualizing the shape
of distribution
8. Back-to-back stemplots are used to compare two distributions side-by-side.
9. Box Plot is useful when comparing two or more sets of sample data.
Self-Assessment Questions (SAQs) for Study Session 2
Now that you have completed this study session, you can assess how well you have achieved its
Learning Outcomes by answering the following questions. Write your answers in your Study
Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your
answers with the Notes on the Self-Assessment Questions at the end of this Module.
SAQ 2.1 (Tests Learning Outcomes 2.1)
Why is it necessary to present data in tables, charts and diagrams?
SAQ 2.2 (Tests Learning Outcomes 2.2)
1. The first step in examining a set of data for a single quantitative variable is by
constructing a frequency table. True or False
2. What is a Frequency Table
SAQ 2.3 (Tests Learning Outcomes 2.3)
Mention three ways of analysing data.
Notes on SAQ
SAQ 2.1
They give a clear understanding of the data and illustrate the relationship existing between the
variables being examined.
SAQ 2.2
1. True
2. This is a tabular arrangement of data into various classes together with their
corresponding frequencies.
SAQ 2.3
i. Histogram
ii. Stemplot
iii. Box plot
Reference
Brookes, B. C. and Dick, W. F. (1969): An Introduction to Statistical Method, Second Edition.
H. E. B. Paperback.
Clarke, G. M. and Cooke, D. (1993): A Basic Course in Statistics, Third Edition. London:
Arnold & Stoughton.
Moore, D. S. and McCabe, G. P. (1993): Introduction to the Practice of Statistics, Second Edition.
New York: W. H. Freeman and Company.
Study Session 3 Measure of the Centre of a Set of Observations
Introduction
The primary aim of any investigator is to obtain a simple summary value (average) that can be
used to describe all the observations in a set. Thus an average is a single value that can
represent all the observations in a distribution.
The most representative value is one that is at the centre of the distribution. Such values are otherwise
referred to as measures of location or measures of central tendency.
Learning Outcomes for Study Session 3
When you have studied this session, you should be able to:
3.1 Discuss the term measures of central tendency;
3.2 Explain and calculate the mean;
3.3 Explain and calculate the median;
3.4 Explain and calculate the mode from grouped data;
3.5 Define and calculate the partition values;
3.6 Discuss other measures of central tendency.
3.1 Measures of Central Tendency
These are measures of the centre of a distribution. They are single values that give a description
of the data. They are also referred to as measures of location. Some of them are the
arithmetic mean, mode, median, geometric mean and harmonic mean. We shall discuss them
one after the other. They are otherwise known as descriptive statistics.
In-Text Question
Measures of central tendency are multiple values that give a description of the data. True or
False
In-Text Answer
False
However, a descriptive statistic should possess the following desirable properties.
A descriptive statistic should
1. Be single-valued
2. Be algebraically tractable
3. Consider every observed value
3.2 Mean
The average (arithmetic mean) of a set of observations is the sum of the observations divided by
the number of observations. If the n observations are denoted by $x_1, x_2, x_3, \ldots, x_n$, the mean is
defined by
$$\bar{X} = \frac{1}{n}\left(x_1 + x_2 + x_3 + \cdots + x_n\right)$$
Or, in a compact notation, it can be written as
$$\bar{X} = \frac{1}{n}\sum x_i$$
The above formula is for the simple series and is most useful when few (n < 20) observations are
considered.
Example 3.1
Here are the ages of 15 students in a class: 16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17,
20. Calculate the mean.
Solution
The average age of the students is
$$\bar{X} = \frac{1}{n}\sum x_i = \frac{1}{15}\left[16 + 18 + \cdots + 17 + 20\right] = \frac{276}{15} = 18.4$$
i.e. approximately 18 years 5 months.
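As a quick check of the simple-series formula, the same calculation can be written in Python. This is a minimal sketch using the 15 ages listed in Example 3.1; the variable names are illustrative.

```python
# Ages of the 15 students listed in Example 3.1.
ages = [16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17, 20]

n = len(ages)          # number of observations
total = sum(ages)      # sum of the observations
mean_age = total / n   # arithmetic mean = (1/n) * sum of x_i

print(n, total, mean_age)
```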
3.2.1 Calculation of Mean from Grouped Data
We have seen in Study Session 2 that a large set of observations can be summarized into a
frequency table, which elicits some information about the data. This makes the computation of
the mean from grouped data very easy.
The mean of a set of N observations of a discrete (continuous) variate, grouped so that the value
$X_i$ ($X_i$ is the centre of the interval), i = 1, 2, …, K, occurs with frequency $f_i$, is
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{K} f_i X_i; \qquad N = \sum_{i=1}^{K} f_i$$
In a grouped frequency table for a continuous variate, the $X_i$'s are the centres of the intervals (i.e. the average
of the upper and lower class boundaries of a class), otherwise known as class marks.
Example 3.2
Given the frequency distribution of a random variable X as follows:
Table 3.1
Group      Frequency
1 – 5      2
6 – 10     4
11 – 15    8
16 – 20    5
21 – 25    3
26 – 30    1
Total      23
Find the mean of the distribution.
Solution
Find the class mark of each class by adding the lower and upper class boundaries of the
class and dividing by 2.
Table 3.2
Group      Class Mark (X)   f     fX
1 – 5      3                2     6
6 – 10     8                4     32
11 – 15    13               8     104
16 – 20    18               5     90
21 – 25    23               3     69
26 – 30    28               1     28
Total                       23    329
$$N = \sum f = 23, \qquad \bar{X} = \frac{\sum fX}{N} = \frac{329}{23} = 14.304$$
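The grouped-data calculation in Table 3.2 can be sketched in Python as follows. The class mark is taken here as the average of the two class limits, which gives the same values as the class boundaries for these classes; the variable names are illustrative.

```python
# Class limits and frequencies from Table 3.1.
groups = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
frequencies = [2, 4, 8, 5, 3, 1]

# Class mark X = midpoint of each class.
class_marks = [(lower + upper) / 2 for lower, upper in groups]

# fX column and its total.
fx = [f * x for f, x in zip(frequencies, class_marks)]

N = sum(frequencies)          # total frequency
grouped_mean = sum(fx) / N    # mean = sum(fX) / N

print(class_marks, fx, N, round(grouped_mean, 3))
```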
Use of Assumed Mean
Sometimes large values of the variable are involved in the calculation of the mean. In order to make
the computation easier, we may assume one of the values as the mean. The revised formula
for the mean is then as follows.
If the assumed mean is A, then
$$\bar{X} = A + \frac{\sum d}{n}, \quad \text{where } d = X - A$$
If a constant factor C is used, then
$$\bar{X} = A + \left(\frac{\sum U}{n}\right) C$$
For grouped data,
$$\bar{X} = A + \left(\frac{\sum fU}{\sum f}\right) C$$
where
$$U = \frac{X - A}{C}$$
Example 3.3
The exact pension allowance paid (in Naira) to 25 workers of a company is given in the table
below.
Table 3.3
Pension (in N)   No. of Persons (f)
25               7
30               5
35               6
40               4
45               3
Calculate the mean using an assumed mean 35 and 5 as the common factor.
Solution:
Table 3.4
Pension (in N)   No. of Persons (f)   U = (X – A)/C   fU
25               7                    –2              –14
30               5                    –1              –5
35               6                    0               0
40               4                    1               4
45               3                    2               6
Total            25                                   –9
Let A = 35, C = 5
$$\bar{X} = 35 + \left(\frac{-9}{25}\right) \times 5 = 35 - 1.8 = 33.20$$
Example 3.4
Consider the data in example 2.3, using a suitable assumed mean and constant factor, compute
the mean.
Table 3.5
Group      f     X     X – A   U = (X – A)/C   fU
1 – 10     5     5     –10     –1              –5
11 – 20    20    15    0       0               0
21 – 30    9     25    10      1               9
31 – 40    8     35    20      2               16
41 – 50    3     45    30      3               9
51 – 60    1     55    40      4               4
61 – 70    4     65    50      5               20
Total      50                                  53
A = 15
C = 10
$$\bar{X} = 15 + \left(\frac{53}{50}\right) \times 10 = 15 + 10.6 = 25.6$$
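The assumed-mean method with a constant factor can be sketched in Python as below. The function name `coded_mean` is illustrative; it takes the values (or class marks), their frequencies, the assumed mean A and the constant factor C, and applies the grouped-data formula given earlier in this section.

```python
def coded_mean(values, frequencies, assumed_mean, factor):
    """Mean by the assumed-mean method: A + (sum(fU) / sum(f)) * C."""
    # U = (X - A) / C for every value or class mark.
    u = [(x - assumed_mean) / factor for x in values]
    sum_fu = sum(f * ui for f, ui in zip(frequencies, u))
    return assumed_mean + (sum_fu / sum(frequencies)) * factor

# Example 3.3: pensions with A = 35 and C = 5.
mean_3_3 = coded_mean([25, 30, 35, 40, 45], [7, 5, 6, 4, 3], 35, 5)

# Example 3.4: class marks as used in Table 3.5, with A = 15 and C = 10.
mean_3_4 = coded_mean([5, 15, 25, 35, 45, 55, 65], [5, 20, 9, 8, 3, 1, 4], 15, 10)

print(round(mean_3_3, 2), round(mean_3_4, 2))
```

Whatever values are chosen for A and C, the result is the same as the direct grouped mean; the choice only affects how simple the hand arithmetic is.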
Note:
It is always easier to select the class mark with the largest frequency as the assumed mean.
Merits
The mean is an average that considers all the observations in the data set. It is simple and easy to
compute and it is the most widely used average.
Demerits
Its value is greatly affected by extremely large or extremely small observations.
3.3 Median
The median is an average of position. It is the value of the variable that divides a distribution
into two equal parts when the values are arranged in order of magnitude.
To compute the median of a distribution:
i. Arrange all observations in order of size, from smallest to largest.
ii. If n (the number of observations) is odd, the median $\tilde{X}$ is the centre observation in the
ordered list. The location of the median is the $\left(\frac{n+1}{2}\right)$th item:
$$\tilde{X} = X_{\left(\frac{n+1}{2}\right)}$$
iii. If n is even, the median $\tilde{X}$ is the average of the two middle observations in the ordered
list, i.e.
$$\tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}$$
Example 3.5 (n is even)
The values of a random variable X are given as 11, 10, 13, 9, 13, 14, 16, and 20. Find the
median.
Solution
In an array: 9, 10, 11, 13, 13, 14, 16, 20. Since n = 8 is even,
$$\tilde{X} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2} = \frac{X_4 + X_5}{2} = \frac{13 + 13}{2} = 13$$
Example 3.6 (n is odd)
The values of a random variable X are given as 9, 7, 5, 20, 2, 12 and 1. Find the median.
In an array: 1, 2, 5, 7, 9, 12, 20
n = 7 is odd, therefore the median is
$$\tilde{X} = X_{\left(\frac{7+1}{2}\right)} = X_4 = 7$$
Note: The occurrence of 7 in the above example is just a coincidence it could have been any
other value in the middle of the data set.
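The two cases (n odd and n even) can be sketched in a single Python function. The function name `median_of` is illustrative; it sorts the data first and then applies the rules stated above for a simple series.

```python
def median_of(values):
    """Median of a simple series: middle item if n is odd,
    average of the two middle items if n is even."""
    ordered = sorted(values)        # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # The ((n + 1) / 2)th item; index mid in zero-based counting.
        return ordered[mid]
    # The (n / 2)th and (n / 2 + 1)th items; indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

median_3_5 = median_of([11, 10, 13, 9, 13, 14, 16, 20])   # Example 3.5, n even
median_3_6 = median_of([9, 7, 5, 20, 2, 12, 1])           # Example 3.6, n odd

print(median_3_5, median_3_6)
```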
3.3.1 Calculation of Median from Grouped Data
The formula for calculating the median from grouped data is defined as
$$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w$$
where $L_m$ = lower limit (lower class boundary) of the median class
$f_m$ = frequency of the median class
$N = \sum f$ is the total frequency
$Cf_b$ = cumulative frequency before the median class
$w$ = class width.
Example 3.7
The table below shows the length of 100 rods (in inches) produced in a factory
Table 3.6
Length (inches)   Number of rods (f)
1 – 2             1
3 – 4             8
5 – 6             26
7 – 8             38
9 – 10            19
11 – 12           7
13 – 14           1
Calculate the median
Solution
The first thing to do is to obtain the cumulative frequency distribution as follows:
Table 3.7
Class      f     Cumulative Frequency (cf)
1 – 2      1     1
3 – 4      8     9
5 – 6      26    35
7 – 8      38    73
9 – 10     19    92
11 – 12    7     99
13 – 14    1     100
i. Determine $\frac{N}{2} = \frac{100}{2} = 50$; clearly the median value belongs to the class
(7 – 8).
ii. The lower class boundary ($L_m$) of the median class is 6.5.
iii. The frequency of the median class ($f_m$) is 38.
iv. The cumulative frequency before the median class ($Cf_b$) is 35.
v. The class interval (w) is 2, and the median is obtained as
$$\tilde{X} = 6.5 + \left(\frac{50 - 35}{38}\right) \times 2 = 6.5 + 0.789 = 7.289 \approx 7.29 \text{ (2 d.p.)}$$
Example 3.8
The following data represent the weight of products manufactured in a factory (in kg).
Table 3.8
Weight       Number of Products
45 – 54      1
55 – 64      3
65 – 74      5
75 – 84      18
85 – 94      33
95 – 104     25
105 – 114    21
115 – 124    12
125 – 134    5
135 – 144    2
Calculate the median.
Solution
First obtain the cumulative frequency distribution as in Example 3.7.
The following can then be obtained from the table, as in Example 3.7:
$\frac{N}{2} = \frac{125}{2} = 62.5$; $Cf_b = 60$;
$L_m = 94.5$, $f_m = 25$, $w = 10$ (i.e. 104.5 – 94.5)
$$\tilde{X} = 94.5 + \left(\frac{62.5 - 60}{25}\right) \times 10 = 94.5 + 1 = 95.5$$
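The grouped-median steps used in Examples 3.7 and 3.8 can be sketched in Python as follows. The function name `grouped_median` is illustrative; it assumes equal class widths and takes the lower class boundaries, the class frequencies and the class width as inputs.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, frequencies, width):
    """Median = Lm + ((N/2 - Cfb) / fm) * w for grouped data."""
    half = sum(frequencies) / 2                      # N / 2
    cumulative = list(accumulate(frequencies))
    # Median class: first class whose cumulative frequency reaches N / 2.
    m = next(i for i, cf in enumerate(cumulative) if cf >= half)
    cf_before = cumulative[m - 1] if m > 0 else 0    # Cfb
    return lower_boundaries[m] + (half - cf_before) / frequencies[m] * width

# Example 3.7: rod lengths, classes 1-2, 3-4, ..., 13-14.
median_3_7 = grouped_median(
    [0.5, 2.5, 4.5, 6.5, 8.5, 10.5, 12.5], [1, 8, 26, 38, 19, 7, 1], 2)

# Example 3.8: product weights, classes 45-54, 55-64, ..., 135-144.
median_3_8 = grouped_median(
    [44.5 + 10 * k for k in range(10)], [1, 3, 5, 18, 33, 25, 21, 12, 5, 2], 10)

print(round(median_3_7, 3), round(median_3_8, 3))
```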
Merit
1. It is easy to calculate
2. It is easy to understand by many people.
3. Its value is not affected by extreme values; thus it is a resistant measure of central
tendency.
4. It is a good measure of location in a skewed distribution.
Demerit
1. It does not take into consideration all the values of the variable.
3.4 Mode
The mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location. Unlike the arithmetic mean, it is not a unique measure of location. In
some cases it may not exist, and when it exists there may be more than one (e.g. a bimodal
distribution).
Let us see how the mode can be obtained from discrete data.
Example 3.9
Consider the data in Example 3.5. The modal value is 13, since it is the only value that occurs
twice.
Example 3.10
Consider the data in example 3.6.
The mode does not exist.
Example 3.11
From Example 2 the mode is X = 2 i.e. the value with the highest frequency.
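Finding the mode of discrete data amounts to counting how often each value occurs. The Python sketch below uses an illustrative helper, `modes`, which returns every value with the highest count and, as a convention assumed here to match Example 3.10, returns an empty list when no value is repeated.

```python
from collections import Counter

def modes(values):
    """Return the value(s) occurring most often; [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs once: the mode does not exist
    return sorted(v for v, c in counts.items() if c == highest)

modes_3_5 = modes([11, 10, 13, 9, 13, 14, 16, 20])   # data of Example 3.5
modes_3_6 = modes([9, 7, 5, 20, 2, 12, 1])           # data of Example 3.6

print(modes_3_5, modes_3_6)
```

A list with two entries would indicate a bimodal set of data.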
3.4.1 Calculation of Mode from Grouped Data
The mode of a grouped distribution can be obtained either
i. from the frequency curve by finding the value at the highest point or
ii. By calculation using the following formula.
From grouped data the mode is defined as
$$\hat{X} = L_m + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$$
where $L_m$ = lower limit (lower class boundary) of the modal class
$\Delta_1$ = difference between the frequency of the modal class and that of the class before it
$\Delta_2$ = difference between the frequency of the modal class and that of the class above it
$w$ = the class width.
38
Example 3.12
From the data in Example 3.7, calculate the mode.
i. The modal class is the one with the highest frequency, i.e. (7 – 8).
ii. $L_m$ = 6.5
$\Delta_1$ = 38 – 26 = 12
$\Delta_2$ = 38 – 19 = 19
w = 2
$\hat{X} = 6.5 + \left(\frac{12}{12 + 19}\right) \times 2$
= 6.5 + 0.774
= 7.27
Example 3.13
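Also consider the data in Example 3.8. The modal class is 85 – 94 (frequency 33), so $L_m$ = 84.5, $\Delta_1$ = 33 – 18 = 15, $\Delta_2$ = 33 – 25 = 8 and w = 10. The mode is obtained as
$\hat{X} = 84.5 + \left(\frac{15}{15 + 8}\right) \times 10$
= 84.5 + 6.52
= 91.02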
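As a check on the grouped-mode formula, here is a minimal Python sketch. The function name `grouped_mode` is an illustrative assumption; the sketch assumes equal class widths and a single modal class, and treats a missing neighbouring class as having frequency 0.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode of grouped data: Lm + (d1 / (d1 + d2)) * w."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])  # index of the modal class
    before = freqs[m - 1] if m > 0 else 0
    after = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - before  # modal frequency minus the frequency of the class before
    d2 = freqs[m] - after   # modal frequency minus the frequency of the class after
    return lower_boundaries[m] + d1 / (d1 + d2) * width


# Data of Table 3.8 (Example 3.13)
boundaries = [44.5 + 10 * i for i in range(10)]
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]
mode_38 = grouped_mode(boundaries, freqs, 10)
print(round(mode_38, 2))  # 91.02
```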
Merit
1. The mode is easily understood by many people.
2. It is easy to calculate.
3. It is the most suitable measure of location when the distribution is highly skewed, e.g. the
distribution of wages of workers in a factory.
Demerit
1. It is not a unique measure of location.
2. It may present a misleading picture of the distribution.
3. It does not take into account all the available data.
3.5 Partition Values
We have seen in section (3.2) that the median is an average that divides a distribution into two
equal parts. There are other quantities that divide a set of data (in an array) into other numbers of
equal parts. The data must first be arranged in order of magnitude. Some of the partition
values are the quartiles, deciles and percentiles.
Quartiles divide a set of data in an array into four equal parts.
For simple series
First quartile: $Q_1$ = the $\left(\frac{N}{4}\right)$th item
Second quartile: $Q_2 = \tilde{X}$ = median = the $\left(\frac{N}{2}\right)$th item
Third quartile: $Q_3$ = the $\left(\frac{3N}{4}\right)$th item
For grouped data
i. First Quartile
$Q_1 = l_{q_1} + \left(\frac{\frac{N}{4} - Cfb}{f_{q_1}}\right) w_1$
where $l_{q_1}$ = lower limit of the first quartile class
$f_{q_1}$ = frequency of the $q_1$ class
$w_1$ = width of the $q_1$ class
$Cfb$ = cumulative frequency below the $q_1$ class
ii. Third Quartile
$Q_3 = l_{q_3} + \left(\frac{\frac{3N}{4} - Cfb}{f_{q_3}}\right) w_3$
where $l_{q_3}$ = lower limit of the third quartile class
$f_{q_3}$ = frequency of the $q_3$ class
$w_3$ = width of the $q_3$ class
$Cfb$ = cumulative frequency below the $q_3$ class
3.6 Other Measures of Central Tendency
Other measures of central tendency include the midrange, harmonic mean and geometric mean.
Midrange
The value halfway between the smallest and the largest observation in a set of data is called the
midrange or range midpoint. It is obtained by adding the smallest and the largest observations together and
dividing the result by 2.
Example 3.14
Find the midrange of the following data: 1, 5, 7, 15, 12, 9, 7
Solution
Smallest observation = 1
Largest observation = 15
Midrange = $\frac{1 + 15}{2} = 8$
Example 3.15
Find the midrange of the following data representing the number of children in 12 households in
Agbowo area of Ibadan.
4, 2, 1, 0, 2, 6, 2, 3, 5, 1,
Solution
Smallest observation = 0, largest observation = 6
Midrange = $\frac{0 + 6}{2} = 3$
Usefulness
Information on midrange of temperature reading by Meteorologists is used by visitors in the
tourism industry.
Limitations
It takes into account only the extreme observations.
Geometric Mean
Given observations X1, X2, ..., Xn of a random variable X, the geometric mean, denoted by GM,
is defined as the nth root of the product of the n observations in the set, i.e.
$GM = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$
Example 3.16
Find the geometric mean of the data in Example 3.14.
Solution
$GM = \sqrt[7]{1 \times 5 \times 7 \times 15 \times 12 \times 9 \times 7}$
$= \sqrt[7]{396900}$
= 6.31
Example 3.17
Obtain the geometric mean of the data in Example 3.15.
Solution
$GM = \sqrt[10]{4 \times 2 \times 1 \times 0 \times \cdots \times 5 \times 1}$
= 0 (since zero is one of the observations)
Usefulness
Geometric mean is very useful in the computation of rates and indices e.g. Computation of price
indices, etc.
Limitation
1. It cannot be usefully calculated when zero is one of the observations, since the result is then always 0.
2. It is not a readily used measure of location.
Harmonic Mean
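Given the observations x1, x2, ..., xn of a random variable X, the harmonic mean, denoted by
HM, is defined as the reciprocal of the mean of the reciprocals of the observations, i.e.
$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$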
Example 3.18
Find the harmonic mean of the data in Example 3.14.
Solution
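$HM = \frac{1}{\frac{1}{7}\left(\frac{1}{1} + \frac{1}{5} + \frac{1}{7} + \frac{1}{15} + \frac{1}{12} + \frac{1}{9} + \frac{1}{7}\right)}$
$= \frac{7}{1.7468}$
= 4.01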
Example 3.19
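Find the harmonic mean of the data in Example 3.15.
Solution
$HM = \frac{1}{\frac{1}{10}\left(\frac{1}{4} + \frac{1}{2} + \frac{1}{1} + \frac{1}{0} + \cdots + \frac{1}{1}\right)}$
= 0 (since 0 is one of the observations; strictly, $\frac{1}{0}$ is undefined, so the value 0 is a limiting convention)
Note: for positive observations that are not all equal, HM < GM < AM.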
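The ordering of the three means can be verified numerically. The sketch below uses Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`) on the data of Example 3.14; the variable names are illustrative.

```python
from statistics import geometric_mean, harmonic_mean, mean

data = [1, 5, 7, 15, 12, 9, 7]  # data of Example 3.14

midrange = (min(data) + max(data)) / 2  # (smallest + largest) / 2
am = mean(data)                         # arithmetic mean
gm = geometric_mean(data)               # 7th root of the product of the values
hm = harmonic_mean(data)                # n divided by the sum of reciprocals

print(midrange, am, round(gm, 2), round(hm, 2))
print(hm < gm < am)  # True for positive data that are not all equal
```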
Usefulness
Harmonic mean is used in the calculation of rates e.g. average speed.
Limitations
1. It is hardly used in practice.
2. It cannot be calculated when zero is one of the observation in the set.
3.6.1 Other Partition Values from Grouped Data
The other partition values that can be calculated from grouped data are the deciles and the
percentiles.
Deciles are those values that divide a distribution into ten equal parts. They are denoted by $D_i$,
i = 1, 2, ..., 9, i.e. D1, D2, D3, ..., D9.
For grouped data, decile two ($D_2$) is defined as
$D_2 = L_{D_2} + \left(\frac{\frac{2N}{10} - cfb}{f_{D_2}}\right) w_2$
where
$L_{D_2}$ = lower limit of the decile two class
$f_{D_2}$ = frequency of the decile two class
$w_2$ = width of the decile two class
cfb = cumulative frequency below the decile two class
Percentiles are those values that divide a distribution into one hundred equal parts. They are
denoted by P1, P2, P3, ..., P99. For a grouped distribution the 65th percentile is defined as
$P_{65} = L_{P_{65}} + \left(\frac{\frac{65N}{100} - cfb}{f_{P_{65}}}\right) w$
where
$L_{P_{65}}$ = lower limit of the 65th percentile class
$f_{P_{65}}$ = frequency of the 65th percentile class
$w$ = width of the 65th percentile class
cfb = cumulative frequency below the 65th percentile class
Example 3.20
Consider the data in Example 3.8.
Calculate the
i. first quartile (Q1)
ii. third quartile (Q3)
iii. 4th decile (D4)
iv. 45th percentile (P45)
Solution
From the table in Example 3.8, the cumulative frequencies (cf) are:
Table 3.9
Class        f     cf
45 – 54      1     1
55 – 64      3     4
65 – 74      5     9
75 – 84      18    27
85 – 94      33    60
95 – 104     25    85
105 – 114    21    106
115 – 124    12    118
125 – 134    5     123
135 – 144    2     125
i. $Q_1 = L_{q_1} + \left(\frac{\frac{N}{4} - cfb}{f_{q_1}}\right) w$
$= 84.5 + \left(\frac{31.25 - 27}{33}\right) \times 10$
= 84.5 + 1.29
= 85.79
ii. $Q_3 = L_{q_3} + \left(\frac{\frac{3N}{4} - cfb}{f_{q_3}}\right) w$
$= 104.5 + \left(\frac{93.75 - 85}{21}\right) \times 10$
= 104.5 + 4.17
= 108.67
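iii. $D_4 = L_{D_4} + \left(\frac{\frac{4N}{10} - cfb}{f_{D_4}}\right) w$
Here $\frac{4N}{10} = 50$, which falls in the class 85 – 94 (cumulative frequency 60), so
$D_4 = 84.5 + \left(\frac{50 - 27}{33}\right) \times 10$
= 84.5 + 6.97
= 91.47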
iv. $P_{45} = L_{P_{45}} + \left(\frac{\frac{45N}{100} - cfb}{f_{P_{45}}}\right) w$
$= 84.5 + \left(\frac{56.25 - 27}{33}\right) \times 10$
= 84.5 + 8.86
= 93.36
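Quartiles, deciles and percentiles of grouped data all follow the same pattern, so one helper can compute any of them. The following Python sketch is illustrative only: the function name `grouped_quantile` is an assumption, equal class widths are assumed, and the i-th decile is taken as the value below which i/10 of the observations fall.

```python
def grouped_quantile(p, lower_boundaries, freqs, width):
    """Value below which a proportion p of grouped data falls:
    L + ((p * N - cfb) / f) * w, using the class that contains position p * N.
    """
    n = sum(freqs)
    target = p * n
    cfb = 0  # cumulative frequency below the current class
    for lower, f in zip(lower_boundaries, freqs):
        if cfb + f >= target:
            return lower + (target - cfb) / f * width
        cfb += f
    raise ValueError("p must lie between 0 and 1")


# Data of Table 3.8
boundaries = [44.5 + 10 * i for i in range(10)]
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]

q1 = grouped_quantile(0.25, boundaries, freqs, 10)   # first quartile
q3 = grouped_quantile(0.75, boundaries, freqs, 10)   # third quartile
d4 = grouped_quantile(0.40, boundaries, freqs, 10)   # 4th decile
p45 = grouped_quantile(0.45, boundaries, freqs, 10)  # 45th percentile
print(round(q1, 2), round(q3, 2), round(d4, 2), round(p45, 2))
```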
Summary
In Study Session 3, you have learnt that:
1. Measures of central tendency are single values that give a description of the data.
2. The arithmetic mean of a set of observations is the sum of the observations
divided by the number of observations.
3. The mean is an average that considers all the observations in the data set
4. The median is an average of position.
5. Median is a good measure of location in a skewed distribution.
6. The Mode is the value of the variable that occurs most often in a set of data
7. The mode is not a unique measure of location
8. The partition values are: the quartile, deciles and percentiles.
Self-Assessment Question (SAQs) for Study Session 3 Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
SAQ 3.1 (Tests Learning Outcomes 3.1)
List four measures of central tendency
SAQ 3.2 (Tests Learning Outcomes 3.2)
1. What is an arithmetic mean?
2. What is the formula for the Calculation of Mean from Grouped Data?
SAQ 3.3 (Tests Learning Outcomes 3.3)
What is the formula for the Calculation of Median From a grouped data?
SAQ 3.4 (Tests Learning Outcomes 3.4)
1. Define Mode
2. Give three demerit of mode
SAQ 3.5 (Tests Learning Outcomes 3.5)
Name the partition values
SAQ 3.6 (Tests Learning Outcomes 3.6)
Mention the usefulness of the midrange and Geometric mean
Notes on SAQ SAQ 3.1
Arithmetic mean, mode, median, geometric mean
SAQ 3.2
1. The arithmetic mean of a set of observation is the sum of the observation divided by the
number of observation
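2. $\bar{X} = \frac{1}{N}\sum_{i=1}^{K} f_i X_i$, where $N = \sum_{i=1}^{K} f_i$
SAQ 3.3
$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cfb}{f_m}\right) w$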
SAQ 3.4
1. The Mode is the value of the variable that occurs most often in a set of data. It is the most
unstable measure of location
2.
i. It is not a unique measure of location,
ii. It presents a misleading picture of the distribution
iii. It does not take into account all the available data
SAQ 3.5
The quartile, deciles and percentiles
SAQ 3.6
Information on midrange of temperature reading by Meteorologists is used by visitors in the
tourism industry.
Geometric mean is very useful in the computation of rates and indices e.g. Computation of price
indices
Study Session 4 Measures of Dispersion/Variation
Introduction
Dispersion (variation) is the degree of scatter of the individual values of a variable about a
central value such as the median or the mean.
In this Study Session we shall discuss the range, semi-interquartile range, mean deviation from
the mean and from the median, variance and standard deviation.
Learning Outcomes for Study Session 4
When you have studied this session, you should be able to:
4.1 Explain the meaning of variation and its measures
4.2 Explain the Range
4.3 Explain Mean deviation and its Calculation
4.4 Explain The variance and its calculation
4.5 Explain Standard Deviation
4.6 Explain the use of coding method when dealing with large values of a variable
4.1 Variation and its Measures
Weight, like so many other things, is not static or unchanging. Not everyone who is 5 feet tall weighs
100 pounds; there is some variability. When reporting these numbers or reviewing them for a
project, a researcher needs to understand how much difference there is in the scores. This is
where we will look at measures of variability.
Box 4.1: Definition of Variation
Variation can be defined as a way to show how data is dispersed, or spread out.
Several measures of variation are used in statistics; they will be discussed in the course of this
study session.
4.2 The Range
This is the simplest measure of variation. It is the difference between the largest and the smallest
value in a set of data.
Range = X(max) – X(min)
The range is thus a measure which is very easy to determine and use. It is reasonably efficient for
small samples (n < 10); for larger samples it is a poor measure because it ignores all the values in between. It is commonly used in
statistical quality control.
However, the range may fail to discriminate if the distributions are of different types.
Semi-Interquartile Range: this is half the difference between the third and first quartiles. It is a good
measure of spread for use with the midrange and the quartiles.
$S.I.R = \frac{Q_3 - Q_1}{2}$
4.3 The Mean Absolute Deviation
Mean deviation is the mean absolute deviation from the center. The measure of the center could be
the arithmetic mean or the median. It can be shown that the mean deviation of a distribution is least
when the deviations are taken from the median. Given a set of observations X1, X2, ..., XN, the mean deviation
from the arithmetic mean is defined by:
$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} |X_i - \bar{X}|}{N}$ for simple series
For grouped data
$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$
Example 4.1
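Below are the ages of 10 heads of household randomly selected from a community:
54, 59, 35, 41, 46, 25, 47, 60, 54, 46
Find the (i) Range (ii) Mean (iii) Mean deviation from the mean (iv) Mean deviation
from the median.
Solution
i. Range = 60 – 25 = 35
ii. Mean: $\bar{X} = \frac{\sum X}{n} = \frac{54 + 59 + \cdots + 46}{10} = \frac{467}{10}$
= 46.7
iii. Mean deviation from the mean: $MD_{\bar{X}} = \frac{\sum |X - \bar{X}|}{n}$
$= \frac{|54 - 46.7| + |59 - 46.7| + \cdots + |46 - 46.7|}{10}$
$= \frac{7.3 + 12.3 + 11.7 + 5.7 + 0.7 + 21.7 + 0.3 + 13.3 + 7.3 + 0.7}{10}$
$= \frac{81}{10}$ = 8.10
iv. Array: 25, 35, 41, 46, 46, 47, 54, 54, 59, 60
Median = $\frac{X_{(n/2)} + X_{(n/2 + 1)}}{2} = \frac{46 + 47}{2} = 46.5$
Mean deviation from the median: $MD_{\tilde{X}} = \frac{|54 - 46.5| + |59 - 46.5| + \cdots + |46 - 46.5|}{10}$
$= \frac{7.5 + 12.5 + 11.5 + 5.5 + 0.5 + 21.5 + 0.5 + 13.5 + 7.5 + 0.5}{10}$
$= \frac{81}{10}$
= 8.1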
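The four results of Example 4.1 can be reproduced with a few lines of Python. This is a minimal sketch using the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]  # data of Example 4.1
n = len(data)

data_range = max(data) - min(data)               # largest minus smallest value
x_bar = mean(data)                               # arithmetic mean
x_med = median(data)                             # average of the two middle values
md_mean = sum(abs(x - x_bar) for x in data) / n  # mean deviation from the mean
md_med = sum(abs(x - x_med) for x in data) / n   # mean deviation from the median

print(data_range, x_bar, x_med, round(md_mean, 2), round(md_med, 2))
```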
Example 4.2
The table below shows the frequency distribution of the scores of 42 students in an STA 111 test.
Table 4.1
Scores     No. of Students (f)
0 – 10     2
10 – 20    5
20 – 30    8
30 – 40    12
40 – 50    9
50 – 60    5
60 – 70    1
Find the mean deviation from the mean for the data.
Solution
Table 4.2
Classes    X     f     fX     X – X̄     |X – X̄|   f|X – X̄|
0 – 10     5     2     10     –29.52    29.52     59.04
10 – 20    15    5     75     –19.52    19.52     97.60
20 – 30    25    8     200    –9.52     9.52      76.16
30 – 40    35    12    420    0.48      0.48      5.76
40 – 50    45    9     405    10.48     10.48     94.32
50 – 60    55    5     275    20.48     20.48     102.40
60 – 70    65    1     65     30.48     30.48     30.48
Total            42    1450                       465.76
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{1450}{42} = 34.52$
Mean deviation: $MD_{\bar{X}} = \frac{\sum f|X - \bar{X}|}{\sum f}$
$= \frac{465.76}{42}$
= 11.089
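For grouped data the same calculation weights each class mark by its frequency. The sketch below is illustrative (variable names are assumptions) and uses the unrounded mean, so the last decimal place can differ slightly from a hand calculation that rounds the mean to 34.52.

```python
mids = [5, 15, 25, 35, 45, 55, 65]  # class marks of Table 4.1
freqs = [2, 5, 8, 12, 9, 5, 1]      # number of students per class

n = sum(freqs)
x_bar = sum(f * x for f, x in zip(freqs, mids)) / n
md = sum(f * abs(x - x_bar) for f, x in zip(freqs, mids)) / n

print(n, round(x_bar, 2), round(md, 2))  # 42 34.52 11.09
```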
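4.4 The Variance
The variance of a set of observations is the average of the squared deviations from the mean.
Let x1, x2, x3, ..., xn be a random sample from a population. The sample variance, $S^2$, is
defined as:
$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$, where $\bar{X} = \frac{\sum X_i}{n}$,
for discrete data or simple series.
For grouped data, the sample variance is defined as:
$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}$
Another formula for calculating the variance can be derived from the above as follows:
$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$
$nS^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2$
$= \sum_{i=1}^{n}(X_i^2 - 2X_i\bar{X} + \bar{X}^2)$
$= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2$
$= \sum X_i^2 - n\bar{X}^2$ (since $\sum X_i = n\bar{X}$)
Therefore $S^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$
However, for grouped data
$S^2 = \frac{1}{\sum f_i}\sum f_i X_i^2 - \bar{X}^2$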
4.5 Standard Deviation The standard deviation is the square root of the variance. It is sometimes referred to as the root
mean squared deviation from the mean (RMSD).
It should be noted that the variance is measured in units of X2 rather than X. This makes it
difficult to understand the size of the variance. A measure of variability that is closely related to
variance but expressed in the same unit of observation is called Standard Deviation.
In-Text Question
Standard deviation could be defined as?
a. The cube root of the variance
b. The square root of the variance
c. Both the square and cube root of the variance
d. Fourth root of the variance
In-Text Answer
b.) The square root of the variance
Standard deviation is the positive square root of the variance. It is defined as
$S = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}}$
or
$S = \sqrt{\frac{\sum X_i^2}{n} - \bar{X}^2}$
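Example 4.2
Consider the data in Example 4.1. Calculate the standard deviation and the coefficient of variation.
Solution:
i. Standard deviation: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$
$= \sqrt{\frac{(54 - 46.7)^2 + \cdots + (46 - 46.7)^2}{10}} = \sqrt{\frac{1076.1}{10}}$
= 10.37
ii. Coefficient of variation: $C.V = 100 \times \frac{S}{\bar{X}}$
$= 100 \times \frac{10.37}{46.7}$
= 22.21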
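The standard deviation with divisor n corresponds to `statistics.pstdev` in Python. The following sketch (variable names are illustrative) computes it for the data of Example 4.1, checks it against the shortcut formula, and derives the coefficient of variation.

```python
from math import sqrt
from statistics import mean, pstdev

data = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]  # data of Example 4.1
n = len(data)

x_bar = mean(data)
s = pstdev(data)  # square root of the mean squared deviation (divisor n)
s_shortcut = sqrt(sum(x * x for x in data) / n - x_bar ** 2)  # shortcut formula
cv = 100 * s / x_bar  # coefficient of variation, in percent

print(round(s, 2), round(s_shortcut, 2), round(cv, 2))  # 10.37 10.37 22.21
```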
Comparison of Dispersion: Two distributions with different means or units of
measurement are compared using the coefficient of variation.
Definition: The coefficient of variation (C.V) is a dimensionless quantity that measures
relative variation, so that series observed in different units can be compared.
It is defined as the ratio of the standard deviation to the mean of a set of data, expressed as a
percentage,
i.e. $C.V = 100 \times \frac{S}{\bar{X}}$
The distribution with the smaller C.V is said to be the more consistent (less variable).
4.6 Coding Method
This method is used when large values of the variable are involved in the calculation.
One of the values (or class marks) is chosen as the assumed mean (A) and
a common factor (C) is determined. The values of the variable $X_i$ (or class marks) are transformed
using the code:
$U_i = \frac{X_i - A}{C}$
The mean is then $\bar{X} = A + C\,\frac{\sum f_i U_i}{\sum f_i}$, and the formula for calculating the variance becomes
$S^2 = \frac{C^2}{\sum f_i}\left[\sum f_i U_i^2 - \frac{\left(\sum f_i U_i\right)^2}{\sum f_i}\right]$
Example 4.3
Given the following grouped data, compute the (i) mean, (ii) standard deviation and
(iii) coefficient of variation, using an assumed mean of 77 and 5 as the common factor.
Table 4.3
Class      f
50 – 54    1
55 – 59    2
60 – 64    10
65 – 69    12
70 – 74    18
75 – 79    25
80 – 84    9
85 – 89    6
90 – 94    4
95 – 99    3
Total      90
Solution
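Table 4.5
Classes    f     Class Mark (X)   X – A   U = (X – A)/C   fU     U²    fU²
50 – 54    1     52               –25     –5              –5     25    25
55 – 59    2     57               –20     –4              –8     16    32
60 – 64    10    62               –15     –3              –30    9     90
65 – 69    12    67               –10     –2              –24    4     48
70 – 74    18    72               –5      –1              –18    1     18
75 – 79    25    77               0       0               0      0     0
80 – 84    9     82               5       1               9      1     9
85 – 89    6     87               10      2               12     4     24
90 – 94    4     92               15      3               12     9     36
95 – 99    3     97               20      4               12     16    48
Total      90                                             –40          330
A = 77, C = 5
$\bar{X} = A + C\,\frac{\sum fU}{\sum f}$
$= 77 + 5\left(\frac{-40}{90}\right)$
= 77 – 2.22
= 74.78
$S^2 = \frac{C^2}{\sum f}\left[\sum fU^2 - \frac{\left(\sum fU\right)^2}{\sum f}\right]$
$= \frac{25}{90}\left[330 - \frac{(-40)^2}{90}\right]$
$= \frac{25}{90}(330 - 17.78)$
= 86.73
$S = \sqrt{S^2} = \sqrt{86.73}$
= 9.31
Coefficient of variation: $CV = 100 \times \frac{S}{\bar{X}}$
$= 100 \times \frac{9.31}{74.78}$
= 12.45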
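A useful safeguard with the coding method is to compare the coded results with a direct calculation on the class marks; the two must agree. The sketch below does this for the data of Table 4.3. It assumes the coded formulas mean = A + C·ΣfU/Σf and variance = C²[ΣfU²/Σf − (ΣfU/Σf)²]; the variable names are illustrative.

```python
from math import sqrt

marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]  # class marks of Table 4.3
freqs = [1, 2, 10, 12, 18, 25, 9, 6, 4, 3]
A, C = 77, 5                                      # assumed mean and common factor

n = sum(freqs)
u = [(x - A) / C for x in marks]                  # coded values
sum_fu = sum(f * ui for f, ui in zip(freqs, u))
sum_fu2 = sum(f * ui ** 2 for f, ui in zip(freqs, u))

mean_coded = A + C * sum_fu / n
var_coded = C ** 2 * (sum_fu2 / n - (sum_fu / n) ** 2)

# Direct calculation on the class marks, for comparison
mean_direct = sum(f * x for f, x in zip(freqs, marks)) / n
var_direct = sum(f * (x - mean_direct) ** 2 for f, x in zip(freqs, marks)) / n

sd = sqrt(var_coded)
cv = 100 * sd / mean_coded
print(n, sum_fu, sum_fu2)
print(round(mean_coded, 2), round(var_coded, 2), round(sd, 2), round(cv, 2))
```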
Summary In Study Session 4, you have learnt that:
1. Variation is a way to show how data is dispersed, or spread out.
2. The range is the simplest measure of variation
3. The range is the difference between the largest and the smallest value in a set of data
4. Mean deviation is the mean absolute deviation from the center
5. The variance of a set of observations is the average of the squared deviation from the
mean
6. The standard deviation is the square root of the variance.
7. The Standard Deviation is also referred to as the root mean squared deviation from the
mean
8. The Coding Method is used when larger values of the variable are involved in calculation
Self-Assessment Question (SAQs) for Study Session 4 Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
SAQ 4.1 (Tests Learning Outcomes 4.1)
Define variation
SAQ 4.2 (Tests Learning Outcomes 4.2)
What is a range?
SAQ 4.3 (Tests Learning Outcomes 4.3)
What is the formula for calculating a mean deviation in a group data?
SAQ 4.4 (Tests Learning Outcomes 4.4)
What is a variance?
SAQ 4.5 (Tests Learning Outcomes 4.5)
The standard deviation is sometimes referred to as?
a. The root mean squared deviation from the mean (RMSD)
b. The root mean square
c. The cube root mean square of the deviation (CMSD)
d. Standard means of measurement
SAQ 4.6 (Tests Learning Outcomes 4.6)
When is the coding method used in calculations?
Notes on SAQ SAQ 4.1
Variation can be defined as a way to show how data is dispersed, or spread out.
SAQ 4.2
This is the simplest measure of variation. It is the difference between the largest and the smallest
value in a set of data.
SAQ 4.3
$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$
SAQ 4.4
The variance of a set of observations is the average of the squared deviation from the mean
SAQ 4.5
a.) The root mean squared deviation from the mean (RMSD)
SAQ 4.6
This method is used when larger values of the variable are involved in calculation.
Study Session 5 Algebraic Treatment of Mean and Variance
Introduction
It is sometimes necessary to adjust the values of the mean and variance to correct for mistakes. It may also be
desired to combine these statistics without recourse to the individual observations of the variable.
The various methods of doing this will be discussed in this study session.
Learning Outcomes for Study Session 5 When you have studied this session, you should be able to:
5.1 Calculate the pooled mean of two or more variables
5.2 Adjust the values of mean, variances and standard deviation for mistakes
5.1 Pooled Mean and Variance
You have learnt how to compute the mean and variance from univariate data. Sometimes, you
may have information about the mean and variance of two or more variates and wish to
find the combined mean and variance. This can be achieved without using the individual values
of the variables.
Given two sets of data consisting of n1 and n2 items, with means $\bar{X}_1$ and $\bar{X}_2$ and variances $S_1^2$
and $S_2^2$ respectively, the combined mean is defined by
$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$
and, when the two sets have the same mean, the combined variance is
$S_{12}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
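Suppose we have, for i = 1, 2, 3, ..., k,
$n_i$ = number of observations in variable i,
$\bar{X}_i$ = mean of variable i,
$S_i^2$ = variance of variable i.
Then the pooled (combined) mean is defined as
$\bar{X}_{12 \ldots k} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \cdots + n_k\bar{X}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i\bar{X}_i}{\sum_{i=1}^{k} n_i}$
The pooled (combined) variance is given by
$\hat{\sigma}^2_{12 \ldots k} = \frac{(n_1 - 1)S_1^2 + \cdots + (n_k - 1)S_k^2 + n_1(\bar{X}_1 - \bar{X}_{12 \ldots k})^2 + \cdots + n_k(\bar{X}_k - \bar{X}_{12 \ldots k})^2}{n_1 + n_2 + \cdots + n_k - k}$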
Example 5.1
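The means and standard deviations of two variables of 100 and 150 items are 50 and 5, and 40 and 6,
respectively. Find the standard deviation of all the 250 items taken together.
Solution
$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{100(50) + 150(40)}{250}$
= 44
$\hat{\sigma}^2_{12} = \frac{99(5)^2 + 149(6)^2 + 100(50 - 44)^2 + 150(40 - 44)^2}{248}$
$= \frac{13839}{248}$
= 55.80
$\hat{\sigma}_{12} = \sqrt{55.80}$
= 7.47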
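The pooled formulas translate directly into code. The following Python sketch follows the general pooled-variance formula as stated in this section, with weights (n_i − 1) on the group variances and divisor Σn_i − k; the function names are illustrative assumptions, and other texts use slightly different divisors.

```python
from math import sqrt


def pooled_mean(ns, means):
    """Combined mean: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(ns, means)) / sum(ns)


def pooled_variance(ns, means, variances):
    """Combined variance with (n_i - 1) weights and divisor sum(n_i) - k."""
    grand = pooled_mean(ns, means)
    within = sum((n - 1) * v for n, v in zip(ns, variances))
    between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    return (within + between) / (sum(ns) - len(ns))


# Data of Example 5.1: two groups of 100 and 150 items
ns, means, sds = [100, 150], [50, 40], [5, 6]
combined_mean = pooled_mean(ns, means)
combined_sd = sqrt(pooled_variance(ns, means, [s ** 2 for s in sds]))
print(combined_mean, round(combined_sd, 2))
```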
Example 5.2
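A survey was conducted at three locations in a community to study a single variable. At each
location, the sample size ($n_i$), the mean ($\bar{X}_i$) and the standard deviation ($\sigma_i$) are given in the
following table.
Table 5.1
Location    I      II     III
n_i         200    250    300
X̄_i         95     10     15
σ_i         3      4      5
Obtain the combined mean and standard deviation for the variable in all the three locations.
Solution
$\bar{X}_{123} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3}$
$= \frac{200(95) + 250(10) + 300(15)}{200 + 250 + 300} = \frac{26000}{750}$
= 34.7 (approximately 35)
$\hat{\sigma}^2_{123} = \frac{n_1(\bar{X}_1 - \bar{X}_{123})^2 + n_2(\bar{X}_2 - \bar{X}_{123})^2 + n_3(\bar{X}_3 - \bar{X}_{123})^2 + (n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2 + (n_3 - 1)\sigma_3^2}{n_1 + n_2 + n_3 - 3}$
$= \frac{200(95 - 34.7)^2 + 250(10 - 34.7)^2 + 300(15 - 34.7)^2 + 199(9) + 249(16) + 299(25)}{747}$
$= \frac{727218 + 152522.5 + 116427 + 13250}{747}$
$= \frac{1009417.5}{747}$
= 1351.30
$\hat{\sigma}_{123} = \sqrt{1351.30}$
= 36.76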
5.2 Adjusting Values of Mean and Standard Deviations for Mistakes
Sometimes mistakes occur in the computation of mean and variance of a set of data when a
correct value in the original data is replaced by an incorrect one. Instead of going through the
entire process to correct such mistakes, some simple algebraic adjustment can be made as shown
in the following examples.
Example 5.3
The mean and standard deviation of a set of 100 observations were worked out as 40 and 5
respectively by a student who by mistake took the value 50 in place of 40 for one observation.
Recalculate the correct mean and standard deviation.
Solution
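n = 100; $\bar{X}$ = 40; $\sigma^2$ = 25
$\bar{X} = \frac{\sum X}{n}$
$40 = \frac{\sum X}{100}$
Incorrect: ∑X = 4000
Correct: ∑X = 4000 – 50 + 40 = 3990
Corrected mean $\bar{X} = \frac{3990}{100}$ = 39.90
$\sigma^2 = \frac{\sum X^2}{n} - \bar{X}^2$
$25 = \frac{\sum X^2}{100} - 40^2$
2500 = ∑X² – 160,000
Incorrect: ∑X² = 162,500
Correct: ∑X² = 162,500 – 50² + 40²
= 161,600
Correct $\sigma^2 = \frac{161600}{100} - (39.90)^2$
= 1616 – 1592.01
= 23.99
Correct $\sigma = \sqrt{23.99}$
= 4.90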
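The adjustment in Example 5.3 generalizes to any number of wrongly recorded values: recover ∑X and ∑X² from the reported mean and standard deviation, swap the wrong values for the right ones, and recompute. The following Python sketch is illustrative; the function name `correct_mean_sd` is an assumption, and the variance is taken with divisor n as in the example.

```python
from math import sqrt


def correct_mean_sd(n, mean, sd, wrong, right):
    """Corrected (mean, sd) after replacing the values in `wrong` by those in `right`."""
    sum_x = n * mean - sum(wrong) + sum(right)
    # From sd^2 = sum(X^2)/n - mean^2, the reported sum of squares is n * (sd^2 + mean^2)
    sum_x2 = n * (sd ** 2 + mean ** 2) - sum(w * w for w in wrong) + sum(r * r for r in right)
    new_mean = sum_x / n
    new_var = sum_x2 / n - new_mean ** 2
    return new_mean, sqrt(new_var)


# Example 5.3: the value 50 was used in place of 40
new_mean, new_sd = correct_mean_sd(100, 40, 5, wrong=[50], right=[40])
print(round(new_mean, 2), round(new_sd, 2))  # 39.9 4.9
```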
Self-Assessment Question (SAQs) for Study Session 5 Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions. Write your answers in your study
Diary and discuss them with your Tutor at the next study Support Meeting. You can check your
answers with the Notes on the Self-Assessment questions at the end of this Module.
SAQ (Tests Learning Outcomes)
1. Find the mean median and mode of the following observation:
5, 6, 10, 15, 22, 16, 6, 10, 6
2. The six numbers 4, 9, 8, 7, 4 and X, have mean of 7. Find the value of X and hence
calculate the coefficient of variation for the six numbers.
3. The arithmetic mean of five observations is 4.4 and the variance is 8.24. If 3 of the 5
observations are 1, 2 and 6, find the other two.
4. The mean and standard deviation of 120 items were found by a student to be 60, and 5
respectively. If at the time of calculation, two items were wrongly recorded as 45 and 55,
instead of 54 and 70. Find the correct mean and standard deviation.
Study session 6: Measure of Skewness and Kurtosis
Introduction A fundamental task in many statistical analyses is to characterize the location and variability of a
data set. A further characterization of the data includes skewness and kurtosis.
In this Study session, you will learn the definition of skewness and kurtosis, you will also learn
how to calculate measure of skewness and kurtosis from simple series and grouped data.
Learning outcomes for study session 6 At the end of this study session you should be able to:
6.1 Define skewness and kurtosis;
6.2 Calculate measure of skewness and kurtosis from simple series and grouped data;
6.3 Determine whether a set of data; is normally distributed, the direction of skewness and the
level of peakedness and Interpret your result.
6.1 Define skewness and kurtosis
Skewness is a measure of asymmetry, or more precisely, the lack of symmetry. A
distribution, or data set, is symmetric if it looks the same to the left and right of the center
point. For univariate data X1, X2, ..., XN, the formula for skewness is
For discrete data,
Skewness $= \frac{\sum_{i=1}^{N}(X_i - \bar{X})^3}{(N - 1)s^3}$
For grouped data
Skewness $= \frac{\sum f_i(X_i - \bar{X})^3}{(N - 1)s^3}$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness
near zero.
Negative values for the skewness indicate data that are skewed to the left, and
positive values for the skewness indicate data that are skewed to the right. By skewness to the left,
we mean that the left tail is long relative to the right tail. Similarly, skewness to the right means
that the right tail is long relative to the left tail.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather
rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the
mean rather than a sharp peak. A uniform distribution would be the extreme case. Kurtosis is
the standardized fourth central moment of a distribution.
The histogram is an effective graphical technique for showing both the skewness and kurtosis of
a data set.
For univariate data X1, X2, ..., XN, the formula for kurtosis is:
For discrete data,
Kurtosis $= \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4}$
For grouped data
Kurtosis $= \frac{\sum f_i(X_i - \bar{X})^4}{(N - 1)s^4}$
where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.
In-text Question
_____________ is a measure of symmetry, or more precisely, the lack of symmetry?
a) Skewness
b) Kurtosis
c) Grouped data
d) Sample series
In-text Answer
a) Skewness
6.2 Calculating measure of skewness and kurtosis from simple series and grouped data
Excess Kurtosis: The kurtosis for a standard normal distribution is three. For this reason,
excess kurtosis is defined as
For discrete data:
$K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$
For grouped data:
$K = \frac{\sum f_i(X_i - \bar{X})^4}{(N - 1)s^4} - 3$
The standard normal distribution has excess kurtosis of zero. Positive excess kurtosis indicates a
“peaked” distribution and negative excess kurtosis indicates a “flat” distribution.
The peakedness of a distribution can be shown as in Diagram 6.1.
In-text Question
A distribution, or data set, is symmetric if it looks the same to the left and right of the center
point. True or False?
a) False
b) True
c) None of the above
d) All of the above
In-text Answer
b) True
Diagram 6.1 - Peakedness of a Distribution (curves A, B and C)
A -------------- Leptokurtic
B -------------- Mesokurtic - Normal
C -------------- Platykurtic
Example 6.1
Twelve numbers generated from a computer are as follows:
10, 43, 67, 89, 70, 80, 62, 80, 03, 42, 71, 35
a. Obtain the measures of skewness and kurtosis.
b. Interpret your result.
Solution
Table 6.1
X      X – X̄     (X – X̄)²    (X – X̄)³       (X – X̄)⁴
03     –51.3     2631.69     –135005.697    6925792.26
10     –44.3     1962.49     –86938.307     3851367.00
35     –19.3     372.49      –7189.057      138748.80
42     –12.3     151.29      –1860.867      22888.66
43     –11.3     127.69      –1442.897      16304.74
62     7.7       59.29       456.533        3515.30
67     12.7      161.29      2048.383       26014.46
70     15.7      246.49      3869.893       60757.32
71     16.7      278.89      4657.463       77779.63
80     25.7      660.49      16974.593      436247.04
80     25.7      660.49      16974.593      436247.04
89     34.7      1204.09     41781.923      1449832.73
Total  652       8516.68     –145673.444    13445494.99
(the total in the first column is ∑X = 652)
$\bar{X} = \frac{\sum X}{n} = \frac{652}{12} = 54.3$
$S = \sqrt{\frac{8516.68}{11}}$
= 27.825
Skewness $= \frac{\sum (x_i - \bar{x})^3}{(N - 1)S^3}$
$= \frac{-145673.444}{11 \times (27.825)^3}$
$= \frac{-145673.444}{236972.6385}$
= –0.6147
That is, the distribution is negatively skewed.
Kurtosis $= \frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{(N - 1)S^4}$
$= \frac{13445494.99}{11 \times (27.825)^4}$
$= \frac{13445494.99}{6593763.67}$
= 2.039
Excess kurtosis K = Kurtosis – 3
= 2.039 – 3
= –0.961
i.e. platykurtic.
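The skewness and kurtosis of a simple series can be computed directly from the definitions used in this session (divisor N − 1 and the sample standard deviation s). The following Python sketch is illustrative; because it uses the unrounded mean rather than 54.3, its results differ slightly in the third decimal place from a hand calculation with the rounded mean.

```python
from math import sqrt

data = [10, 43, 67, 89, 70, 80, 62, 80, 3, 42, 71, 35]  # data of Example 6.1
n = len(data)
x_bar = sum(data) / n

# Sample standard deviation with divisor n - 1, as used in this session
s = sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

skewness = sum((x - x_bar) ** 3 for x in data) / ((n - 1) * s ** 3)
kurtosis = sum((x - x_bar) ** 4 for x in data) / ((n - 1) * s ** 4)
excess = kurtosis - 3  # a negative value indicates a platykurtic distribution

print(round(skewness, 2), round(kurtosis, 2), round(excess, 2))
```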
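In-text Question
The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is
defined as ____________ ?
a) $\frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{(N - 1)S^4}$
b) $\frac{\sum (x_i - \bar{x})^3}{(N - 1)S^3}$
c) $\bar{X} = \frac{\sum X}{n} = \frac{652}{12} = 54.3$
d) $K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$
In-text Answer
d) $K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$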
6.3 Determining whether a set of data is normally distributed, the direction of skewness and the level of peakedness
Example 6.2
Given the data below:
Table 6.2
Class    f
10-14    1
15-19    4
20-24    8
25-29    19
30-34    35
35-39    20
40-44    7
45-49    5
50-54    1
a. Draw the histogram for the above data.
b. Obtain the measures of
i. Skewness
ii. Kurtosis
c. Interpret your result.
Solution
Diagram 6.2: Histogram of the data in Table 6.2 (horizontal axis: class boundaries from 9.5 to 54.5; vertical axis: frequency, 0 to 40)
Table 6.3
Class    Mid-point (X)   f      fX      X – X̄    (X – X̄)²   (X – X̄)³     (X – X̄)⁴
10-14    12              1      12      –20.1    404.01     –8120.601    163224.08
15-19    17              4      68      –15.1    228.01     –3442.951    51988.56
20-24    22              8      176     –10.1    102.01     –1030.301    10406.04
25-29    27              19     513     –5.1     26.01      –132.651     676.52
30-34    32              35     1120    –0.1     0.01       –0.001       0.0001
35-39    37              20     740     4.9      24.01      117.649      576.48
40-44    42              7      294     9.9      98.01      970.299      9605.96
45-49    47              5      235     14.9     222.01     3307.949     49288.44
50-54    52              1      52      19.9     396.01     7880.599     156823.92
Total                    100    3210
∑fX = 3210
∑f = 100
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{3210}{100} = 32.1$
Table 6.4
f(X – X̄)²    f(X – X̄)³     f(X – X̄)⁴
404.01       –8120.601     163224.08
912.04       –13771.804    207954.24
816.08       –8242.408     83248.32
494.19       –2520.369     12853.88
0.35         –0.035        0.0035
480.20       2352.980      11529.60
686.07       6792.093      67241.72
1110.05      16539.745     246442.20
396.01       7880.599      156823.92
5299.00      910.200       949317.96   (totals)
$S^2 = \frac{\sum f(X - \bar{X})^2}{N - 1} = \frac{5299}{99} = 53.53$
S = 7.316
Skewness $= \frac{\sum f(x_i - \bar{x})^3}{(N - 1)S^3}$
$= \frac{910.200}{99 \times 391.58}$
= 0.0235
Kurtosis $= \frac{\sum f(x_i - \bar{x})^4}{(N - 1)S^4}$
$= \frac{949317.96}{99 \times 2864.8}$
$= \frac{949317.96}{283615.2}$
= 3.35
Excess kurtosis K = Kurtosis – 3
= 0.35
Since Skewness = 0.0235, Kurtosis = 3.35 and Excess Kurtosis = 0.35,
the distribution is near normal: it is almost symmetric, and the small positive excess kurtosis indicates a peak slightly sharper than that of the normal curve, i.e. the distribution is
mildly leptokurtic.
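For grouped data the same definitions apply with each class mark weighted by its frequency. The following Python sketch (variable names are illustrative) recomputes the measures for Table 6.2 without intermediate rounding, which is a convenient check on the hand-built columns.

```python
from math import sqrt

mids = [12, 17, 22, 27, 32, 37, 42, 47, 52]  # class mid-points of Table 6.2
freqs = [1, 4, 8, 19, 35, 20, 7, 5, 1]

n = sum(freqs)
x_bar = sum(f * x for f, x in zip(freqs, mids)) / n


def weighted_power_sum(k):
    """Sum of f * (x - mean)^k over all classes."""
    return sum(f * (x - x_bar) ** k for f, x in zip(freqs, mids))


s = sqrt(weighted_power_sum(2) / (n - 1))
skewness = weighted_power_sum(3) / ((n - 1) * s ** 3)
kurtosis = weighted_power_sum(4) / ((n - 1) * s ** 4)
excess = kurtosis - 3  # a positive value indicates a leptokurtic distribution

print(round(x_bar, 1), round(s, 3), round(skewness, 4), round(kurtosis, 2))
```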
Summary for study session 6 In this study session, you have learnt:
1. The concept of Skewness and Kurtosis.
2. How to distinguish between Kurtosis and excess kurtosis and their interpretations.
3. Useful examples were given to illustrate the different formulae for their computation.
Self-Assessment Questions (SAQs) for Study Session 6 Now that you have completed this study session, you can assess how well you have achieved its
Learning outcomes by answering the following questions.
SAQ 6.1-6.2
1. Consider the data in post test question 3 in chapter 4, obtain the measure of skewness and
kurtosis.
2. Consider the data in post test question 1 in chapter 4, obtain a measure of Excess
Kurtosis and interpret your results.
3. Consider the post test question 1 in chapter 5, calculate the measure of Skewness and
interpret your result.
Study Session 7: Methods of Collecting Statistical Data
Introduction
In the previous session, you learnt the various ways in which a set of data can be summarized, calculated some descriptive statistics, examined the shape of a distribution, and saw how summaries can be combined and corrected for errors.
In this study session, you will learn about the various methods that can be employed in the collection of statistical data.
Learning outcomes for Study Session 7
At the end of this study session, you should be able to:
7.1 Explain the various methods of data collection
7.2 Discuss the problems of data collection in Nigeria.
7.1 The various methods of data collection
Data collection is an activity aimed at getting information to satisfy some decision objectives or for the purpose of scientific inquiry. The process of data collection varies with the nature of the inquiry, the objective of the study and the characteristics of the unit of inquiry.
Methods of Data Collection
There are five broad methods of data collection. They are:
Figure 7.1: Methods of data collection
1. Documentary Sources: It is sometimes possible to answer some of the questions a survey is intended to cover from available data.
An enquiry concerned with the leisure activities of a town's population may well begin by getting statistical data about the use made of the local libraries, attendance at cinemas, and membership of clubs and societies.
A mass of information on the subjects commonly studied in social surveys is available in historical documents, statistical reports, records of institutions and other surveys.
Government departments possess a mass of information relating to individuals. Examples are census schedules, employment records, insurance cards, health records etc.
The main difficulty is that a survey researcher can hardly expect to gain access to these materials.
Some materials are collected in the form of case records by psychiatrists, social workers etc., and these are of interest to sociologists and psychologists. Such materials have limitations for the research worker in that they represent only a highly specialized population, i.e. only the cases that happen to come before social workers.
There are personal documents which come directly from the informants, such as diaries, autobiographies and surveys. These give insight into personal character, experiences and beliefs that formal interviewing can hardly achieve.
The possibility of any investigator bias affecting their contents is eliminated. The use of this method has many difficulties, e.g.:
a. How to get the documents.
b. How to get a representative collection of documents.
Some people are better at writing letters and essays than others, not everybody can produce documents, and documents are at their best when unsolicited. Data are usually collected by copying out the relevant items from the available records.
2. Observation: Observation, the classic method of scientific enquiry, is defined as the accurate watching and noting of phenomena as they occur in nature. The observer positions himself to observe the life and activities of a community. How he positions himself depends on:
a. The nature and size of the community.
b. What he wishes to observe.
c. His own personality and skill.
An example where this method is suitable is the traffic census. Actual measurement or counting also comes under the heading of observation. Examples occur in statistical quality control.
Problems
i. If the characteristics of the population are to be inferred from those of a sample, the sample should ideally be randomly selected.
ii. Instructing an investigator to observe people of all types (men and women of different ages, social classes etc.) does not make the sample a random one. It does not ensure that the resultant group is representative.
iii. The observer can hardly be expected to observe and note everything relevant to the subject.
iv. His selection of the aspects of behaviour and environment which he notes may follow certain channels.
v. If what he is studying is very familiar to him, he may fail to note what seems normal.
In-text Question
___________ as a method of data collection is defined as accurate watching and classic method
of scientific enquiry as they occur in nature.
a) Problems
b) Merits
c) Observation
d) Demerits
In-text Answer
c) Observation
Merits
The advantages of this method are similar to personal interview and the method has some unique
advantages such as:
i. Providing more reliable information.
ii. Supplying of additional and necessary information
Demerits
The disadvantages are also similar to personal interview.
i. It is exceptionally certified.
ii. Highly trained personnel are needed for observation.
iii. Because of scrutiny, it is time consuming.
3. Mail or Postal Questionnaires: This is one of the most widely used methods of data
collection mostly in social surveys. Questionnaires are mailed out to respondents who in turn are
expected to send them back through the post when they are duly completed. The choice of this
method is governed by:
a. Limited resources
b. Economic advantages
c. Potential efficiency.
In-text Question
_________ is one of the most widely used methods of data collection mostly in social surveys.
a) Diary
b) Telephone
c) Interview
d) Mail or Postal Questionnaires
In-text Answer
d) Mail or Postal Questionnaires
Merits
i. It is generally quicker and cheaper than other methods.
ii. It avoids the problems associated with the use of interviewers.
iii. It is useful when information concerning several members of household is required and
allows for some intra-household consultation.
iv. It is useful where considered answers to the questions are wanted rather than immediate ones.
v. Questions of personal or embarrassing nature are answered more willingly and accurately
than when the respondents are together with the interviewer; who is a complete stranger
to them.
vi. The problem of non-contacts in the sense of respondent not being at home is avoided.
Demerits
i. The method can only be considered when the questions are sufficiently simple and
straight forward to be understood with the help of the printed instructions and definitions.
It is unsuitable where the objectives of the survey take a good deal of explanation.
ii. The answers to mail questionnaire have to be accepted as final. There is no opportunity
to probe beyond the given answers.
iii. It is inappropriate where spontaneous (unplanned) answers are wanted or where it is
important that the views of one person only are obtained or where it is essential that one
particular person in each household fills the questionnaires and no one else.
iv. The answers cannot be treated as independent since the respondent can see all the
questions before answering any of them.
v. There is no opportunity to supplement the respondent’s answers by observational data,
his house, appearance, manner etc.
Some of the disadvantages of this method can be overcome by combining it with interview
method.
4. Personal Interview: This is the method used in most surveys. It could be a formal interview, in which set questions are asked and the answers recorded in a standard form, or a less formal one, in which the interviewer is at liberty to vary the sequence of questions, to explain their meanings and to change the wording, or may not have a set of questions at all but only a number of key points around which to build the interview.
The interviewer should possess some vital qualities such as (a) Honesty, (b) Interest, (c) Accuracy, (d) Adaptability, (e) Personality and temperament, (f) Intelligence and education.
Merits
i. The interviewer is free and has more opportunity to restructure questions whenever it is
necessary to do so.
ii. It allows more accurate information to be obtained by asking the respondent for further
explanation.
iii. A skilled interviewer can easily persuade an unwilling respondent. This will increase the
number of responses.
iv. A skilled interviewer will know when to make call backs and then make more effective
efforts.
v. In addition to recording verbal answers, the interviewer can note the non-verbal reactions
of respondents to questions.
vi. It can be used for persons of all educational levels.
vii. It can be used to explore areas in which little information exists.
In-text Question
___________ is the method that is used mainly in most surveys?
a) Intelligence
b) Personal interview
c) Adaptability
d) All of the above
In-text Answer
b) Personal interview
Demerits
i. Personal interviews are expensive to conduct if the sample is widely scattered geographically.
ii. Unscrupulous interviewers may introduce bias by influencing respondents' answers or by recording answers to suit themselves.
iii. The respondent, in order to boost his image or to please the interviewer, may give biased answers.
iv. It may be difficult to interview some individuals, such as high-income and influential people, who are not always available.
v. If recalls are necessary and the sample is large, the survey will take a long time to complete.
vi. Respondents may give inaccurate or false information, whether through lapse of memory, misunderstanding or deliberate intent.
vii. Larger field staff are needed for interviewing.
5. Telephone: This is the method of collecting data through the telephone. Like other methods, it has many advantages, especially in industrialized countries. In a developing country like Nigeria, this method of collecting information cannot be efficient because of the inefficiency of the telephone system.
Merits
i. It is faster than other methods.
ii. It is cheaper to collect information by phone than personal interview.
iii. It is more flexible than postal questionnaires.
iv. It encourages higher response rate than postal questionnaire.
v. Recall of respondents is quicker and easier than any other method.
vi. It is the best method of reaching respondents who are very difficult to contact.
vii. It facilitates recording of replies without causing any embarrassment to the respondent.
viii. It is very suitable for radio and television surveys.
Apart from the fact that the telephone system is not effective in a developing country and
therefore renders the method unsuitable, it has other demerits.
Demerits
a. Survey by telephone is limited to respondents having telephones – an obvious evidence
of bias.
b. If the population is widely located all over the country, cost consideration will limit
extensive coverage of the country.
c. The interviewer may be biased and as a result, influence the respondent.
d. Cost consideration may restrict the number of questions asked or the time given to the respondents to answer the questions.
e. Answers given may not remain confidential, as the telephone line could be bugged or the call overheard.
7.2 Limitations of Data Collection in Nigeria
Generally, secondary data are limited in scope, and the information derived from them may not satisfy all the needs of the researcher. This may lead to a reduction in the scope of the research work, or to the introduction of certain assumptions to fill the gaps created by insufficient information.
In-text Question
Collection of data through the telephone, like other methods, has many advantages. True/False
In-text Answer
True
Summary for Study Session 7
In this study session, you have learnt:
1. The various methods of data collection.
2. The situations under which each of them can be employed, as well as their relative merits and demerits.
3. The problems usually encountered in the process of collecting statistical data.
Self-Assessment Questions (SAQs) for Study Session 7
Now that you have completed this study session, you can assess how well you have achieved its Learning outcomes by answering the following questions.
SAQ 7.1-7.2
1. What is statistical data collection?
2. What are the merits of personal interview?
3. Discuss the demerits of postal questionnaire method.
4. "The observational method of data collection is best in social science research." Discuss.
References
Adamu, S. O. (1978): "The Nigerian Statistical System". Ibadan: University Press.
Adewoye, G. O. and Shittu O.I (1999): “Introduction to Socio-Economic Statistics (Survey
methods and Indicators).” Lagos: Victory Ventures ISBN 978-33867-1-9
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &
Stoughton
Cyprian A. Oyeka (1990): An Introduction to Applied Statistical Method in the Sciences. Enugu:
Nobern Avocation publishing coy.
Moser, C. A. (1968): “Survey Methods in Social Investigation” London: Heinemann Educational
Books Ltd.,
Osuntogun, E. O. (1997): “Introduction to Social and Economic Statistics” Unpublished paper.
Study Session 8: Regression Analysis
Introduction
In this study session, you will be introduced to the theory of linear regression analysis. Different types of relationships will be shown on the scatter plot, and the parameters of the model will be estimated by the method of least squares. An introduction to the test of significance of the regression line will also be given.
Learning outcomes for Study Session 8
At the end of this study session, you should be able to:
8.1 Discuss the concept of regression analysis;
8.2 Identify the types of regression models;
8.3 Estimate the parameters of a regression model; and
8.4 Explain the testing of the significance of the model.
8.1 Regression Analysis
Regression analysis is a statistical tool which helps to study the trend and pattern of movement in
one variable in response to changes in another variable on the basis of an assumed relationship
existing between them. Once this pattern is established, it can be used to predict one variable
from the other.
The variable being predicted is usually referred to as the response (dependent) variable and the other variable is called the explanatory (independent) variable. The values of the explanatory variable are usually fixed and under the control of the investigator, while the values of the response variable depend on the values of the explanatory variable.
Thus regression analysis attempts to determine how changes in the explanatory variable affect the response variable. The variables involved are assumed to be measured and recorded as interval-scale or ratio-scale data. If the variables are strictly qualitative (i.e. attributes), the method of regression cannot be used.
The appropriate method for studying association between two qualitative variables is discussed under correlation and association in Study Session 9.
In-Text Question
The values of the explanatory variable are usually fixed. True or False
In-Text Answer
True
8.2 Types of Regression Models
A regression model may be:
Figure 8.1: Types of regression model
A regression model is simple if there is only one explanatory variable, and multiple if there is more than one explanatory variable.
A regression model is linear if its parameters do not appear as exponents and are not multiplied by other parameters in the model; otherwise, the model is said to be non-linear. The value of the highest power in a model is called the order of the model.
Scatter Plot
The first step in the study of the relationship between two variables is to draw the scatter diagram. It portrays the direction, form and strength of any relationship between quantitative variables. It is drawn by plotting the values of the response variable (on the Y-axis) against the values of the explanatory variable (on the X-axis).
The shape of the scattered points on the graph gives an idea of the type of relationship between the two variables.
Types of Scatter Diagram
[Diagram 8.1: four scatter diagrams, panels (a) to (d), each plotting Y against X]
Diagram 8.1 shows some of the common types of relationship that exist between two variables:
Panel (a) depicts a linear relationship.
Panels (b) and (c) depict non-linear relationships: panel (b) is a quadratic relationship while panel (c) is an exponential relationship.
Panel (d) shows no relationship between variables X and Y, since neither a line nor a curve can be fitted to the scatter plot.
Please note that:
i. Scatter plot cannot be used for more than two variables.
ii. A non-linear regression model can be made linear through appropriate transformation.
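As an illustration of note (ii), consider an exponential curve. The symbols $\alpha$ and $\beta$ below are illustrative names chosen for this sketch, not notation taken from the manual. Taking natural logarithms of both sides gives a model that is linear in $\ln Y$, so a straight line can be fitted to the pairs $(X, \ln Y)$:

```latex
Y = \alpha e^{\beta X}
\quad\Longrightarrow\quad
\ln Y = \ln\alpha + \beta X
```

Here $\ln\alpha$ plays the role of the intercept and $\beta$ the role of the slope.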
In-Text Question
The following are types of regression models except _______________
A. Simple
B. Multiple
C. Short
D. Non-Linear
In-Text Answer
C. Short
8.3 The Simple Regression Model
The simple linear regression model describing the relationship between the response variable (Y) and the explanatory variable (X) can be expressed as

$$Y_i = \beta_0 + \beta_1 X_i + e_i, \qquad i = 1, 2, \ldots, n$$

where there are n observations on both X and Y, and
i. $Y_i$ is the ith observation on Y.
ii. $X_i$ is the ith observation on X.
iii. $\beta_0$ is the intercept (the point at which the regression line cuts the Y-axis, i.e. the value of Y when X = 0).
iv. $\beta_1$ is the slope (regression coefficient) of the line. It gives the rate of change in Y per unit change in X.
v. $e_i$ is the random error term, with mean 0 and variance $\sigma^2$.
The parameters of the model can be estimated by the method of Ordinary Least Squares (OLS).
Basic Assumptions of OLS
i. The relationship between X and Y is assumed to be linear.
ii. The $X_i$'s are predetermined (fixed) values assumed to be measured without error.
iii. The error terms $e_i$ are independent of X, i.e. $E(e_i \mid X) = 0$.
iv. The error term is assumed to be normally distributed with mean zero and variance $\sigma^2$, i.e. $e \sim N(0, \sigma^2)$.
The above assumptions imply that

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

is a random variable with expectation (mean)

$$E(Y_i) = E(\beta_0 + \beta_1 X_i + e_i) = \beta_0 + \beta_1 X_i$$

since $\beta_0$ and $\beta_1$ are constants and $E(e_i) = 0$.
Similarly,

$$V(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 X_i + e_i) = \mathrm{Var}(e_i) = \sigma_e^2$$

where $\sigma_e^2$ can be estimated by

$$S^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}$$
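8.4 Estimation of Parameters
Suppose there are n pairs of observations on X and Y:
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
The assumed linear relationship is

$$Y_i = a + bX_i + e_i \qquad (8.1)$$

where a and b are estimates of $\beta_0$ and $\beta_1$ in the original model. Equation (8.1) can be expressed as

$$e_i = Y_i - a - bX_i \qquad (8.2)$$

Let

$$Q = \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \qquad (8.3)$$

The constants a and b are obtained by minimizing Q with respect to a and b, i.e.

$$\frac{\partial Q}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \qquad (8.4)$$

$$\frac{\partial Q}{\partial b} = -2\sum X_i (Y_i - a - bX_i) = 0 \qquad (8.5)$$

The normal equations 8.4 and 8.5 can be solved simultaneously to obtain

$$a = \bar{Y} - b\bar{X} \qquad (8.6)$$

and

$$b = \frac{SS_{XY}}{SS_{XX}} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \qquad (8.7)$$

and $\sigma_e^2$ can be estimated by

$$S_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 1} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1} \qquad (8.9)$$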
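where $Y_i$ is the observed value of Y and $\hat{Y}_i$ is the estimated value of Y.
The variance of the intercept is

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma_e^2 \sum X_i^2}{n\,SS_{XX}} \qquad (8.10)$$

and the variance of the regression coefficient is

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma_e^2}{SS_{XX}} \qquad (8.11)$$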
Coefficient of Determination
This is the proportion of variation in the response variable (Y) that is explained by the explanatory variable (X).
It is defined by

$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = \frac{b\left(\sum XY - n\bar{X}\bar{Y}\right)}{\sum Y^2 - n\bar{Y}^2} = \frac{\text{Explained variation}}{\text{Total variation}} \qquad (8.12)$$

where $0 \le R^2 \le 1$.
$R^2 = 0$ when $b = 0$, and $R^2 = 1$ when all the points fall on the fitted regression line.
8.5 Testing the Significance of the Model
It is always desirable to test the significance of the model, that is, to examine whether the regression line is a good fit. If the line is a good fit, the points on the scatter diagram fall on the line or lie very close to it.
This can be checked informally by examining the residual plot, i.e. a plot of the residuals $e_i = Y_i - \hat{Y}_i$ against the data points.
A more objective method is to arrange the sums of squares and cross products in an Analysis of Variance (ANOVA) table and carry out Fisher's test (F-test) or Student's t-test as follows:
Specify the Hypothesis
$H_0: b = 0$ (i.e. no relationship between X and Y)
$H_1: b \neq 0$ (i.e. a relationship exists between X and Y)
Choose $\alpha$, the level of significance.
Table 8.1: ANOVA table
SOURCE | df | SS | MS | F-cal
Regression | K - 1 | $SS_R = b\,SS_{XY}$ | $MS_R = SS_R/(K - 1)$ | $MS_R/MS_E$
Error | n - K | $SS_E = SS_Y - SS_R$ | $MS_E = SS_E/(n - K)$ |
Total | n - 1 | $SS_Y = \sum Y_i^2 - (\sum Y_i)^2/n$ | |
Here K is the number of parameters in the model (K = 2 for the simple linear regression model), so that the regression and error degrees of freedom add up to n - 1.
The critical value is $F_{\alpha; V_1, V_2}$, where $V_1 = K - 1$, $V_2 = n - K$, and $\alpha$ = 0.05 or 0.01.
Decision rule
Reject $H_0$ if $F_{cal} > F_{\alpha; V_1, V_2}$ at the $\alpha$ level of significance and conclude that there is enough evidence to show that variables X and Y are related; otherwise do not reject $H_0$.
Example 8.1
The table below shows the weight losses in kilogrammes (Y) of a sample of persons and the number of months (X) they have been on a special weight-reducing diet.
Table 8.2
Y 4 17 14 1 11 22 9 12 4 7
X 7 32 26 1 20 34 17 21 5 12
a. Draw the scatter diagram of the above data.
b. Fit the regression equation of Y on X.
c. Interpret the parameters of your regression model.
d. An individual is known to have been on the special weight-reducing diet for 27 months; estimate his weight loss in kilogrammes.
e. Obtain an estimate of the standard error of the model.
Solution
a. Diagram 8.2: Scatter plot of weight loss in kg (vertical axis) against number of months on the diet (horizontal axis).
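Table 8.3
Y | X | XY | $X^2$ | $Y^2$ | $\hat{Y}$ | $(Y - \hat{Y})^2$
4 | 7 | 28 | 49 | 16 | 4.159 | 0.02528
17 | 32 | 544 | 1024 | 289 | 18.309 | 1.7135
14 | 26 | 364 | 676 | 196 | 14.913 | 0.8336
1 | 1 | 1 | 1 | 1 | 0.763 | 0.0561
11 | 20 | 220 | 400 | 121 | 11.517 | 0.2673
22 | 34 | 748 | 1156 | 484 | 19.441 | 6.5485
9 | 17 | 153 | 289 | 81 | 9.819 | 0.6708
12 | 21 | 252 | 441 | 144 | 12.083 | 0.00689
4 | 5 | 20 | 25 | 16 | 3.027 | 0.9467
7 | 12 | 84 | 144 | 49 | 6.989 | 0.000121
Total: 101 | 175 | 2414 | 4205 | 1397 | | 11.0688
$\bar{Y} = 10.1$
$\bar{X} = 17.5$
b. Regression of Y on X: $\hat{Y} = a + bX$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{2414 - 10(17.5)(10.1)}{4205 - 10(17.5)^2} = \frac{646.5}{1142.5} = 0.566$$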
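$$a = \bar{Y} - b\bar{X} = 10.1 - 0.566(17.5) = 0.197 \text{ (using the unrounded value of } b\text{)}$$

$$\hat{Y} = 0.197 + 0.566X$$

c. a = 0.197 means that when X = 0, Y = 0.197.
b = 0.566 implies that for every additional month spent on the special weight-reducing diet, there is an average weight loss of about 0.57 kilogramme.
d. $\hat{Y} = 0.197 + 0.566(27) = 0.197 + 15.28 = 15.48$ kg.
e. The standard error of the model is

$$S_e = \sqrt{\frac{\sum e_i^2}{n - 1}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}}$$

From the working table, $\sum (Y_i - \hat{Y}_i)^2 = 11.0688$, so

$$S_e = \sqrt{\frac{11.0688}{9}} = 1.109$$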
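The calculations in Example 8.1 can be reproduced with a minimal Python sketch of equations 8.6, 8.7 and 8.9. The function name `fit_line` and the variable names are illustrative choices, and the standard error uses the divisor n - 1 as in this manual (many texts divide by n - 2 instead).

```python
import math

# Data from Table 8.2: months on the diet (x) and weight loss in kg (y).
x = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]
y = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b = ss_xy / ss_xx          # equation 8.7
    a = y_bar - b * x_bar      # equation 8.6
    return a, b


a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
s_e = math.sqrt(sse / (len(x) - 1))   # divisor n - 1, as in equation 8.9
prediction_27 = a + b * 27            # part (d): 27 months on the diet

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted loss after 27 months = {prediction_27:.2f} kg")
print(f"S_e = {s_e:.3f}")
```

Because the code keeps full precision throughout, the last decimal place may differ slightly from the hand calculation.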
Example 8.2
A quality control manager collects 10 samples of iron rods from the production line at regular intervals of time. Each time the average length (Y) and diameter (X) of the rods are measured. The results are given below.
Table 8.4
Average Diameter (X) in mm: 18.1 23.0 17.5 20.2 14.7 13.8 15.1 13.8 16.1 12.6
Average Length (Y) in cm: 8.8 9.5 8.9 9.1 8.6 8.3 8.5 8.2 9.4 7.2
a. Calculate the linear regression of mean length on diameter.
b. Is there any evidence to show that the diameter influences the length of the rods?
c. Calculate the standard error of the regression coefficient.
Solution
n = 10
$\sum X = 164.9$, $\sum Y = 86.5$
$\bar{X} = 16.49$, $\bar{Y} = 8.65$
$\sum X^2 = 2813.85$, $\sum Y^2 = 752.25$
$\sum XY = 1441.85$
Hypothesis
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = 0.05$
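a.
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{1441.85 - 10(16.49)(8.65)}{2813.85 - 10(16.49)^2} = \frac{15.465}{94.649} = 0.163$$

$$a = \bar{Y} - b\bar{X} = 8.65 - 0.163(16.49) = 5.96$$

$$\hat{Y} = 5.96 + 0.163X$$

b.
$$SS_{Total} = \sum (Y - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = 752.25 - 10(8.65)^2 = 4.025$$

$$SS_{Trt} = b\left(\sum XY - n\bar{X}\bar{Y}\right) = 0.163(15.465) = 2.52$$

Table 8.5: ANOVA
Source | df | SS | MS | Fc
Treatment | 1 | 2.52 | 2.52 | 13.40
Error | 8 | 1.50 | 0.188 |
TOTAL | 9 | 4.025 | |
$F_{0.95; 1, 8} = 5.32$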
Conclusion: Since $F_c = 13.40 > F_{0.95; 1, 8} = 5.32$, we reject $H_0$ and conclude that there is evidence, at the 5% level of significance, that the diameter influences the length of the rods.
c.
$$S.E(b) = \sqrt{\frac{\sigma_e^2}{SS_{XX}}} = \sqrt{\frac{MSE}{\sum X^2 - n\bar{X}^2}} = \sqrt{\frac{0.188}{94.649}} = 0.045$$
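The ANOVA steps of Example 8.2 can be sketched in Python as follows, starting from the raw data in Table 8.4. The variable names are illustrative, and the critical value 5.32 is simply the tabulated figure quoted in the example rather than something the code computes. Since the sums are recomputed from the data at full precision, the results may differ slightly from the hand-worked figures.

```python
import math

# Data from Table 8.4.
diameter = [18.1, 23.0, 17.5, 20.2, 14.7, 13.8, 15.1, 13.8, 16.1, 12.6]  # X, mm
length = [8.8, 9.5, 8.9, 9.1, 8.6, 8.3, 8.5, 8.2, 9.4, 7.2]              # Y, cm

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(length) / n
ss_xy = sum(x * y for x, y in zip(diameter, length)) - n * x_bar * y_bar
ss_xx = sum(x ** 2 for x in diameter) - n * x_bar ** 2
ss_total = sum(y ** 2 for y in length) - n * y_bar ** 2

b = ss_xy / ss_xx                      # slope
ss_regression = b * ss_xy              # explained sum of squares (1 df)
ss_error = ss_total - ss_regression    # residual sum of squares (n - 2 df)
ms_error = ss_error / (n - 2)
f_cal = ss_regression / ms_error
se_b = math.sqrt(ms_error / ss_xx)     # standard error of the slope

F_CRITICAL = 5.32  # tabulated F at the 5% level with (1, 8) df, from the example
reject_h0 = f_cal > F_CRITICAL

print(f"b = {b:.3f}, F = {f_cal:.2f}, S.E.(b) = {se_b:.3f}, reject H0: {reject_h0}")
```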
Summary for Study Session 8
In this study session, you have learnt:
1. The theory of regression analysis, and its uses.
2. The difference between linear and non-linear models, and between simple and multiple regression models.
3. The method of Ordinary Least Squares (OLS) for estimating the parameters of a simple linear regression model, and the procedure for carrying out the test of significance of a regression line.
Self-Assessment Questions (SAQs) for Study Session 8
Now that you have completed this study session, you can assess how well you have achieved its Learning outcomes by answering the following questions.
SAQ 8.1-8.5
1. A test was performed to determine the relationship between the chemical content (Y) of a particular solution and the crystallization temperature (X) in degrees. The following quantities were calculated:
n = 20, $\sum X_i = 400$, $\sum Y_i = 220$, $\sum X_i^2 = 8800$, $\sum X_i Y_i = 4300$, $\sum Y_i^2 = 2620$
Assuming a linear relationship $Y_i = \beta_0 + \beta_1 X_i + e_i$:
a. Calculate the least squares estimates of $\beta_0$ and $\beta_1$, each correct to two significant figures.
b. Test the significance of the fitted model at the 5% level of significance.
c. Obtain the standard error of each parameter in the model.
d. A previous similar exercise with n = 15 shows a regression coefficient $\beta_1$ of 0.10 with a standard error of 0.008. Test the hypothesis that the slope of your regression model is the same as that of the previous exercise at the 5% level of significance.
2. Twelve students took two papers in the same subject and the marks in percentages were
as follows:
S/No. 1 2 3 4 5 6 7 8 9 10 11 12
Paper I 65 73 42 52 84 60 70 79 60 83 57 7
Paper II 78 88 60 73 92 77 84 89 70 99 73 8
a. Construct a scatter diagram for the above data.
b. Calculate the regression equation of paper II on paper I.
c. Two boys were each absent for one paper. One scored 63 on paper I, the other 81 on paper II. Estimate the marks of these students in the paper they did not take.
d. Obtain the standard error of your regression coefficient in (b) above.
e. Construct a 95% confidence interval for your regression coefficient in (b) above.
3. A random sample of ten families had the following income and food expenditure (in N per week).
Families A B C D E F G H I J
Family Income 20 30 33 40 15 13 26 38 35 43
Family Expenditure 7 9 8 11 5 4 8 10 9 10
a. Estimate the regression line of food expenditure on income and interpret your results.
b. Obtain the regression line of income on food expenditure and interpret the result.
4. The following results have been obtained from a sample of 11 observations on the value of sales (Y) of a firm and the corresponding prices (X).
$\bar{X} = 519.18$, $\bar{Y} = 217.82$, $\sum X_i^2 = 3134543$, $\sum Y_i^2 = 539512$, $\sum X_i Y_i = 1296836$
a. Estimate the regression line of sales on price and interpret the results.
b. What part of the variation in sales is not explained by the regression line?
5. The following table shows the gross national product (X) and the demand for food (Y), measured in arbitrary units, in an underdeveloped country over the ten-year period 1960-1969.
Year 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
Y 6 7 8 10 8 9 10 9 11 10
X 50 52 55 59 57 58 62 65 68 70
(a) Estimate the food function $Y = b_0 + b_1 X + U$.
(b) What is the meaning of this result?
(c) Compute the coefficient of determination and find the explained and unexplained variation in the food expenditure.
(d) Find the regression of X on Y.
References
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &
Stoughton
Cyprian A. Oyeka (1990): An Introduction to Applied Statistical Method in the Sciences. Enugu:
Nobern Avocation publishing coy.
Connor, L. R and Morrell,(1982) A. J. “Statistics in Theory and Practice”. Seventh Edition,
London: Pitman Books Limited.
Gupta, C. B. (1973)“An Introduction to Statistical Methods” New Delhi: Vikas Publishing House
PVT Ltd.
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, Second Edition.
New Delhi: W.H. Freeman and coy.
Study Session 9: Correlation and Association
Introduction
In Study Session 8, you learnt how to measure the direction and strength of the relationship between the explanatory variable and the response variable for the purpose of predicting one from the other. In this study session, you will learn how to measure the relationship or association between two variables without making any distinction between them and without the aim of prediction.
Learning outcomes for Study Session 9
At the end of this study session, you should be able to:
9.1 Explain the meaning of correlation
9.2 Explain coefficient of rank correlation
9.1 Correlation
Correlation refers to the relationship or association between two or more variables, while the correlation coefficient is a quantity that measures the strength of the linear relationship between two quantitative variables. The measure of relationship between two attributes (qualitative variables) is usually referred to as association. This will be discussed in the next section.
Product Moment Correlation Coefficient
Suppose we have n observations on two variables x and y denoted by
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
The correlation coefficient r for variables X and Y computed from n cases is

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

where r ranges from -1 to +1.
If r = 0, the two variables are uncorrelated.
If r = +1, x and y are said to be perfectly positively (directly) correlated and the regression line is upward sloping on the scatter plot.
If r = -1, x and y are said to be perfectly negatively (inversely) correlated and the regression line is downward sloping on the scatter plot.
If 0 < r < 0.5, x and y are said to be weakly positively correlated.
If 0.5 < r < 1, x and y are said to be strongly positively correlated.
If -0.5 < r < 0, x and y are said to be weakly negatively correlated.
If -1 < r < -0.5, x and y are said to be strongly negatively correlated.
Note that r is also referred to as the product moment correlation coefficient.
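It can be shown from Study Session 8 that

$$b = \frac{SS_{XY}}{SS_{XX}} \qquad \text{and} \qquad r = \frac{SS_{XY}}{\sqrt{SS_{XX}\,SS_{YY}}}$$

Therefore

$$r = \sqrt{\frac{SS_{XX}}{SS_{YY}}}\; b$$

where b is the regression coefficient, $SS_{XX} = \sum X^2 - n\bar{X}^2$ and $SS_{YY} = \sum Y^2 - n\bar{Y}^2$.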
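Example 1:
Consider Example 8.1. Calculate the product moment correlation coefficient for these data.
Solution:
n = 10, $\bar{X} = 17.5$, $\bar{Y} = 10.1$
$\sum XY - n\bar{X}\bar{Y} = 646.5$
$\sum X^2 - n\bar{X}^2 = 1142.5$
$\sum Y^2 - n\bar{Y}^2 = 376.9$

$$r = \frac{646.5}{\sqrt{(1142.5)(376.9)}} = \frac{646.5}{656.21} = 0.985$$

Alternatively,

$$r = \sqrt{\frac{SS_{XX}}{SS_{YY}}}\; b = \frac{33.80}{19.41}(0.566) \approx 0.985$$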
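As a check on this example, the product moment formula can be written as a small Python function. The name `pearson_r` is an illustrative choice; the data are those of Table 8.2 in Example 8.1.

```python
import math


def pearson_r(x, y):
    """Product moment correlation coefficient from the computational formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return ss_xy / math.sqrt(ss_xx * ss_yy)


months = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]   # X in Example 8.1
loss = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]       # Y in Example 8.1

r = pearson_r(months, loss)
print(f"r = {r:.3f}")
```

The function is symmetric in its two arguments, which reflects the point made in the introduction: correlation does not distinguish between an explanatory and a response variable.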
9.2 Coefficient of Rank Correlation
This is a measure of the strength of relationship between two qualitative variables (or attributes). It is also used when exact measurement of quantitative variables is inaccurate, impossible or impracticable.
To obtain the rank correlation coefficient, the observed values of the variables are replaced by their respective ranks, either in ascending or descending order of magnitude.
The coefficient of rank correlation is given by

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference between the ranks of each pair of observations and $-1 \le R \le 1$. The interpretation is the same as for the product moment correlation coefficient.
If there are ties, the average of the tied ranks is assigned to each of the units involved.
Example 9.3
Two judges were asked to assess twelve contestants in a beauty contest. The twelve contestants were ranked according to their performance as follows:
Table 9.1
Contestant 1 2 3 4 5 6 7 8 9 10 11 12
Judge A 11 9 7 10 5 1 4 12 8 3 2 6
Judge B 5 7 11 12 6 4 8 9 10 2 1 3
Is there any agreement between the two judges?
Solution
Table 9.2
n = 12
Contestant 1 2 3 4 5 6 7 8 9 10 11 12
d 6 2 -4 -2 -1 -3 -4 3 -2 1 1 3
$d^2$ 36 4 16 4 1 9 16 9 4 1 1 9
$\sum d^2 = 110$

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(110)}{12(144 - 1)} = 1 - 0.385 = 0.615$$

Comment: There is a fairly strong agreement in the opinions of the judges.
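When the data are already ranks with no ties, as in Table 9.1, the rank correlation formula can be evaluated directly. The sketch below uses an illustrative function name, `spearman_from_ranks`; note that every difference is squared, so the first contestant (d = 6) contributes 36 to the sum.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Rank correlation R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied ranks."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks awarded by the two judges in Table 9.1.
judge_a = [11, 9, 7, 10, 5, 1, 4, 12, 8, 3, 2, 6]
judge_b = [5, 7, 11, 12, 6, 4, 8, 9, 10, 2, 1, 3]

sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_a, judge_b))
rho = spearman_from_ranks(judge_a, judge_b)
print(f"sum of d^2 = {sum_d2}, R = {rho:.3f}")
```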
Example 9.4
A study was conducted to determine the relationship between the level of smoking, measured by the number of sticks of cigarette smoked per day (X), and a Tercim index of health (Y). The following data were obtained on a random sample of 10 male smokers.
Table 9.3
X 8 20 15 12 15 9 16 10 12 8
Y 4 5 5 7 10 13 8 6 3 8
Calculate the Spearman rank correlation coefficient and comment on your result.
Solution
Table 9.4
$R_X$ 9.5 1 3.5 5.5 3.5 8 2 7 5.5 9.5
$R_Y$ 9 7.5 7.5 5 2 1 3.5 6 10 3.5
d 0.5 -6.5 -4 0.5 1.5 7 -1.5 1 -4.5 6
$d^2$ 0.25 42.25 16 0.25 2.25 49 2.25 1 20.25 36
$\sum d^2 = 169.5$

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(169.5)}{10(100 - 1)} = 1 - 1.027 = -0.027$$

Comment
The above result shows that there is a weak negative association between smoking habit and the reported health index.
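The ranking step in Example 9.4, where tied values share the average of their ranks, can be sketched in Python as below. The function names `average_ranks` and `spearman` are illustrative, and ranking is done in descending order (largest value = rank 1) to match Table 9.4. With ties the simple formula is only an approximation to the correlation of the ranks, but it is the one used in this session.

```python
def average_ranks(values):
    """Rank in descending order (largest = 1); tied values get the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Rank correlation from the d^2 formula, using average ranks for ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sticks = [8, 20, 15, 12, 15, 9, 16, 10, 12, 8]   # X in Table 9.3
health = [4, 5, 5, 7, 10, 13, 8, 6, 3, 8]        # Y in Table 9.3

rank_x = average_ranks(sticks)
rank_y = average_ranks(health)
rho = spearman(sticks, health)
print(rank_x)
print(rank_y)
print(f"R = {rho:.3f}")
```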
Summary for Study Session 9
In Study Session 9, you have learnt about:
1. The concept of correlation.
2. The distinction between correlation and association.
3. The association between qualitative variables and attributes.
4. The method of interpretation of the coefficients.
Self-Assessment Questions (SAQs) for Study Session 9
Now that you have completed this study session, you can assess how well you have achieved its Learning outcomes by answering the following questions.
SAQ 9.1-9.2
A group of sportsmen take part in a competition which includes two gymnasium tests: squat jumps and chins. The score for each exercise is the number performed in one minute. The scores of eight sportsmen taken from this group are given below:
Sportsmen A B C D E F G H
Squat jumps 47 72 60 44 56 63 71 64
Chins 25 48 30 40 27 35 30 34
a. Calculate the Spearman coefficient of rank correlation between these two sets of scores.
b. The overall winner of the gymnasium tests is the sportsman with the highest total score
when the number of squat jumps is added to the number of chins.
Determine the total scores and state which sportsman was the winner.
c. The rank correlation between the total scores and the number of squat jumps is 0.86 for the data above. Calculate the rank correlation between the total score and the number of chins. If, to save time, only one exercise was to be used in future, state, giving a reason, which one you would recommend.
d. Consider the data in example 8.2
i. Calculate the coefficient of Spearman’s rank correlation
ii. Comment on your result.
References Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL
Publications ISBN: 978-34411-3-2
Connor, L. R and Morrell,( ) A. J. “Statistics in Theory and Practice”. Seventh Edition, London:Pitman Books Limited.
Olubosoye O.E, Olaomi J.O and Shittu O.I (2002):’Statistics for Engineering , Physical and Biological Sciences”. Ibadan:A Divine Touch Publications. ISBN: 978- 35606-7-0
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method. Second Edition Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition. London: Arnold & Stoughton
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, second Edition. New York: W.H. Freeman and coy.
117
Study Session 10: Proportions, Rates and Indices
Introduction
Rates, ratios and indices have become very important in the descriptive analysis of certain events and characteristics. They are especially useful in the study of vital characteristics such as prices, deaths, births, population growth, epidemics, etc.
In this study session, you will be introduced to the three concepts, their uses and applications using some sample data, with particular emphasis on price indices.
Learning Outcomes for Study Session 10
At the end of this study session, you should be able to:
10.1 Explain the meaning of the terms proportion, rate and indices;
10.2 Explain items to be taken into consideration when constructing an index number
10.3 Discuss the different methods of construction of price index
10.4 Identify the uses of consumer price index
10.1 Proportion, Rates and Indices
Definition: A proportion is the ratio of the number of items with a certain characteristic, n(X), to the total number of items exposed to that characteristic, N.
It is defined as
$$P(X) = \frac{n(X)}{N}$$
The above expresses the chance of occurrence of the characteristic (i.e. the probability of event X).
Example
The voting-age population (people 18 years and above) in a ward consists of 550 males and 600 females. What is the proportion of males?
Solution
n(males) = 550
Total population: N = 550 + 600 = 1150
Proportion of males:
$$P(\text{males}) = \frac{n(\text{males})}{N} = \frac{550}{1150} = 0.478$$
Rates
When a proportion refers to the number of events or cases occurring during a certain period of time, it becomes a rate and is usually expressed as so many per 1000. Thus the birth rate is the number of births per 1000 population in a year.
Similarly, we have the death rate, migration rate, marriage rate, etc. Some examples shall be given later to illustrate this concept.
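As a small illustration of the two definitions, the sketch below computes the proportion of males in the ward example and a rate per 1000. The function names and the figure of 46 births are hypothetical, chosen only to show the arithmetic; they are not data from this manual.

```python
def proportion(n_x, total):
    """P(X) = n(X) / N."""
    return n_x / total


def rate_per_1000(events, population):
    """Number of events per 1000 population during the period considered."""
    return 1000 * events / population


males, females = 550, 600
p_males = proportion(males, males + females)
print(round(p_males, 3))  # proportion of males in the ward

births = 46  # hypothetical number of births in one year (illustration only)
birth_rate = rate_per_1000(births, males + females)
print(birth_rate)         # births per 1000 population
```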
Index Number
An index is a real number that measures the rate of increase or decrease in the wage, production, value, quantity, price or volume of a certain phenomenon in the current period relative to a specific period in the past (the base period). It is usually expressed as a percentage.
An Index number is a device for estimating trends in prices, wages, production and other
economic variables.
In its simplest form, an index number represents a special kind of average or a weighted average, compiled from a sample of items judged to be representative of the whole. In this study session, our focus shall be on the construction of the consumer price index, since the principles and methods that will be discussed apply equally to indices of sales, production, wages, value and quantity.
In-Text Question
When proportion refers to the number of events or cases occurring during certain period of time,
it becomes a _______
A. Rate
B. An index
C. Ratio
D. Map
In-Text Answer
A. Rate
10.2 Consideration for an Index Number
Quite a number of methods and formulae are used in the computation of index numbers; there are, however, a number of criteria that must be satisfied.
A good index number:
a. Should be simple in conception.
b. Should be easily interpreted, so that the man on the street can understand an index that tries to measure the changing cost of the things he bought in a particular year.
As mentioned earlier in this study session, an index number is a special kind of average that considers the prices of many commodities expressed in different units, or quantities also measured in different units. The commodities could also carry different weights in the “basket” of goods considered for the index. All these constitute the problems usually encountered in the construction of an index number.
Thus, in the construction of a price index, the following factors are considered:
Figure 10.1: factors that determine price index
a. Choice of Item
A decision should be taken on the items to be included in an index. Each commodity included should be (i) relevant, (ii) representative, (iii) reliable and (iv) comparable over a period of time.
b. Source of Data
A decision should also be taken on the source of data for the items composing the “basket” to be used in the construction of the index number: should the prices of commodities be collected from a local market, a supermarket or an urban market? Great care should be taken to ensure that prices are collected from a popular market that is patronized by different categories of people and where the majority of the selected commodities can be found.
c. The Base Period
A base year is a reference period. The chosen year should generally be a fairly “normal” year,
free of occurrence of unusual events such as war, famine, prolonged strike or hyper-inflation. If
it is difficult to select a year in particular, the average of a series of years can be taken.
d. The Weight
Different weights (units of measure) are used in different parts of the country for a particular commodity. For instance, ‘congo’ is used in the Western part of Nigeria, ‘mudu’ in the North and ‘tin’ in the East. For the purpose of constructing an index number, the weights used in the different regions need to be harmonized to a single unit.
In-Text Question
Choice of item can determine price index. True or False
In-Text Answer
True
10.3 Methods of Construction of Price Index
There are different methods of constructing a price index. Some of them are given below.
Let Pn represent the price in the current year,
P0 represent the price in the base year,
qn represent the quantity in the current year, and
q0 represent the quantity in the base year.
1. Price Relative: This is the simplest method of calculating an index number. It is defined as the price in the current year expressed as a percentage of the price in the base period for a single commodity. The base period is always taken as 100.
$$PR = \frac{P_n}{P_0} \times 100$$
2. Simple Aggregate Method: This method considers the price of a basket of goods and services in the current year relative to that of the base period. It is denoted by:
$$SAM = \frac{\sum P_n}{\sum P_0} \times 100$$
Limitation: It attaches equal weight to all commodities. It does not take into account the relative importance of the commodities.
3. Simple Average Relative Method: This is the sum of the price relatives divided by the number of items considered. It is denoted by
$$SAP = \frac{\sum (P_n / P_0)}{N} \times 100$$
where N is the number of items.
Its limitation is the same as in (2).
4. Weighted Simple Average Relative Method: To circumvent the problem of assigning equal weight to different items, each price relative is multiplied by a weight W. The weighted index is given as
$$WSAR = \frac{\sum W (P_n / P_0)}{\sum W} \times 100$$
Example 10.1
The following are the prices of commodities A, B and C in 1975 and 1985.
Using 1975 as the base year,
Table 10.1
Commodity 1975 1985
A 40 50
B 12 35
C 45 95
Calculate i. Price relative for each item
ii. Simple aggregate price index
Solution
Table 10.2
Commodity 1975 (P0) 1985 (Pn) Pn/P0 x 100
A 40 50 50/40 x 100 = 125.0
B 12 35 35/12 x 100 = 291.7
C 45 95 95/45 x 100 = 211.1
Total 97 180 627.8
Simple aggregate price index:
$$SAM = \frac{\sum P_n}{\sum P_0} \times 100 = \frac{180}{97} \times 100 = 185.6$$
Simple average of relatives:
$$SAP = \frac{\sum (P_n / P_0)}{N} \times 100 = \frac{6.278}{3} \times 100 = 209.3$$
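The three unweighted measures can be computed directly from the prices in Table 10.1. The following is a minimal sketch, assuming the earlier year in the table is the base year; the function and variable names are our own.

```python
base = {"A": 40, "B": 12, "C": 45}     # prices in the earlier year (P0)
current = {"A": 50, "B": 35, "C": 95}  # prices in the later year (Pn)


def price_relatives(p0, pn):
    """Pn / P0 x 100 for each commodity."""
    return {item: pn[item] / p0[item] * 100 for item in p0}


def simple_aggregate_index(p0, pn):
    """Sum of current prices over sum of base prices, x 100."""
    return sum(pn.values()) / sum(p0.values()) * 100


def simple_average_of_relatives(p0, pn):
    """Mean of the price relatives (each already expressed per 100)."""
    relatives = price_relatives(p0, pn)
    return sum(relatives.values()) / len(relatives)


for item, pr in price_relatives(base, current).items():
    print(item, round(pr, 1))
print(round(simple_aggregate_index(base, current), 1))
print(round(simple_average_of_relatives(base, current), 1))
```

Because neither function uses quantities or weights, both summary indices treat the three commodities as equally important, which is the limitation noted for methods (2) and (3).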
Example 10.2
The prices of some staple foods in 1980 and 1996, with the corresponding weights, are given below. Take 1980 = 100.
Table 10.3
Staple Foods Weight Price (1980) Price (1996)
Elubo 3 1.25 10.50
Gari 5 4.00 12.50
Rice 1 35.00 75.00
Beans 3 12.00 38.00
Yam 2 5.00 8.50
Compute i. the price relatives
ii. simple aggregate price index
iii. weighted average price index
Solution
Table 10.4
Staple Foods Weight (W) 1980 (P0) 1996 (Pn) Pn/P0 x 100 W(Pn/P0)
Elubo 3 1.25 10.50 840.0 25.20
Gari 5 4.00 12.50 312.5 15.63
Rice 1 35.00 75.00 214.3 2.14
Beans 3 12.00 38.00 316.7 9.50
Yam 2 5.00 8.50 170.0 3.40
Total 14 57.25 144.50 55.87
Simple Aggregate Price Index:
$$\frac{\sum P_n}{\sum P_0} \times 100 = \frac{144.50}{57.25} \times 100 = 252.4$$
Weighted Average Relative Index:
$$\frac{\sum W (P_n / P_0)}{\sum W} \times 100 = \frac{55.87}{14} \times 100 = 399.1$$
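The weighted average of relatives can be sketched in the same way for the staple foods of Table 10.3. This is an illustration under stated assumptions: the base price of yam is taken as 5.00 (the value consistent with the column total of 57.25), and the names `foods` and `weighted_average_of_relatives` are our own.

```python
# name: (weight W, 1980 price P0, 1996 price Pn)
foods = {
    "Elubo": (3, 1.25, 10.50),
    "Gari": (5, 4.00, 12.50),
    "Rice": (1, 35.00, 75.00),
    "Beans": (3, 12.00, 38.00),
    "Yam": (2, 5.00, 8.50),  # base price assumed to be 5.00
}


def weighted_average_of_relatives(items):
    """Sum of W * (Pn / P0) divided by the sum of W, x 100."""
    total_weight = sum(w for w, _, _ in items.values())
    weighted_sum = sum(w * pn / p0 for w, p0, pn in items.values())
    return weighted_sum / total_weight * 100


for name, (w, p0, pn) in foods.items():
    print(name, round(pn / p0 * 100, 1), round(w * pn / p0, 2))
print(round(weighted_average_of_relatives(foods), 1))
```

Items with a large weight and a large price relative (here Elubo and Gari) dominate the weighted index, which is why it differs so much from the simple aggregate index.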
5. Laspeyres' Price Index: This is a weighted method of constructing an index number. It assumes that the pattern of consumption has not changed over the years with the change in price, so the base-year quantities are used as weights. It is denoted by:
$$P_L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100$$
Limitation: Since the base-year quantities reflect an outmoded purchasing pattern, the index gives undue weight to items that have increased in price. Laspeyres' price index therefore tends to overestimate the rise in prices.
6. Paasche's Price Index: This method assumes that the consumption pattern of the consumer has changed in the current year, so the current-year quantities are used as weights. It is denoted by:
$$P_P = \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100$$
Limitation: Since people tend to spend less on goods that have risen in price, the current-year weighting procedure (Paasche's) gives undue weight to items that have fallen in price. It therefore tends to understate the rise in prices, that is, to underestimate the price index.
7. Fisher's Ideal Index Number: This method overcomes the problems of the Paasche's and Laspeyres' indices. It is considered the most efficient method of constructing an index number. It is the geometric mean of the Laspeyres' and Paasche's price indices, $\sqrt{P_L \cdot P_P}$:
$$\text{Fisher's Ideal Index} = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \times \frac{\sum P_n q_n}{\sum P_0 q_n}} \times 100$$
8. Marshall-Edgeworth Price Index: This method takes into account the pattern of consumption in both the current and base periods. It uses the arithmetic mean of the base and current period quantities as weight. It is given by:
$$ME_p = \frac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100$$
Example 10.3
The prices and quantities demanded of commodities A, B and C in the base year (1960) and the current year (1970) are given below.
Table 10.5
Commodity 1960 Price (Po) 1960 Quantity (Qo) 1970 Price (Pn) 1970 Quantity (Qn)
A 4 50 10 40
B 3 10 9 2
C 2 5 4 2
Construct index numbers of price from the data using
i. Laspeyres' method
ii. Paasche's method
iii. Marshall-Edgeworth method and
iv. Fisher's ideal index number
Solution
Table 10.6
Commodity Po Qo Pn Qn PoQo PnQo PnQn PoQn
A 4 50 10 40 200 500 400 160
B 3 10 9 2 30 90 18 6
C 2 5 4 2 10 20 8 4
Total 240 610 426 170
i. Laspeyres':
$$P_L = \frac{\sum P_n Q_o}{\sum P_o Q_o} \times 100 = \frac{610}{240} \times 100 = 254.2$$
ii. Paasche's:
$$P_P = \frac{\sum P_n Q_n}{\sum P_o Q_n} \times 100 = \frac{426}{170} \times 100 = 250.6$$
iii. Marshall-Edgeworth:
$$ME = \frac{\sum P_n (Q_o + Q_n)}{\sum P_o (Q_o + Q_n)} \times 100 = \frac{610 + 426}{240 + 170} \times 100 = 252.7$$
iv. Fisher's:
$$F = \sqrt{P_L \times P_P} = \sqrt{254.2 \times 250.6} = 252.4$$
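The four weighted indices of Example 10.3 can be reproduced with the sketch below. It is only an illustration of the formulas in Section 10.3; the data layout and the function names are our own choices.

```python
from math import sqrt

# commodity: (P0, Q0, Pn, Qn) from Table 10.5
data = {
    "A": (4, 50, 10, 40),
    "B": (3, 10, 9, 2),
    "C": (2, 5, 4, 2),
}


def laspeyres(d):
    """Base-year quantities as weights."""
    return (sum(pn * q0 for p0, q0, pn, qn in d.values())
            / sum(p0 * q0 for p0, q0, pn, qn in d.values()) * 100)


def paasche(d):
    """Current-year quantities as weights."""
    return (sum(pn * qn for p0, q0, pn, qn in d.values())
            / sum(p0 * qn for p0, q0, pn, qn in d.values()) * 100)


def marshall_edgeworth(d):
    """Sum of base and current quantities as weights."""
    return (sum(pn * (q0 + qn) for p0, q0, pn, qn in d.values())
            / sum(p0 * (q0 + qn) for p0, q0, pn, qn in d.values()) * 100)


def fisher(d):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return sqrt(laspeyres(d) * paasche(d))


print(round(laspeyres(data), 1))           # 254.2
print(round(paasche(data), 1))             # 250.6
print(round(marshall_edgeworth(data), 1))  # 252.7
print(round(fisher(data), 1))              # 252.4
```

As expected from the limitations stated above, the Laspeyres value is the largest, the Paasche value the smallest, and the Fisher and Marshall-Edgeworth values lie between them.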
10.4 Uses of Consumer Price Index
1. Consumer price indices among others are used to measure change in retail prices of
specific quantity of goods and services in a given geographical region over a period of
time.
2. It helps in wage and salary negotiation and adjustments of allowances.
3. Government agencies use consumer price indices to formulate wage policy, price control
policy, taxation and general economic policy.
4. Changes in purchasing power and real income can be measured using the consumer price
indices.
5. Use in international comparison
6. Construction of Human Suffering Index.
7. Construction of cost of living index.
In-Text Question
Changes in purchasing power and real income can be measured using the consumer price
indices. True or False
In-Text Answer
True
Summary for Study Session 10
In Study Session 10, you have learnt about:
1. The various methods of constructing an index number
2. The problem associated with the construction of index number
3. The uses of the consumer price index
Self-Assessment Questions (SAQs) for Study Session 10
Now that you have completed this study session, you can assess how well you have achieved its learning outcomes by answering the following questions.
SAQ 10.1 -10.4
1. a. Explain what is meant by an index number
b. What are the uses of consumer price index?
c. The price relatives for palm oil and kerosene are shown in the table
Commodity Price Relative (1961) Price Relative (1962)
Palm Oil 100 108
Kerosene 100 114
Assuming that palm oil is twice as important as kerosene, what is the price index for 1962, taking 1961 = 100?
2. Five feed components are to be used in the construction of an animal feedstuff index number. From the figures given in the following table, calculate a Laspeyres' price index, taking 1964 = 100.
Component 1964 Price per ton 1964 Consumption (tons) 1970 Price per ton 1970 Consumption (tons)
A 40 3,600 41 2,750
B 39 2,750 53 1,500
C 38 2,050 35 2,350
D 37 500 30 750
E 36 1,475 24 2,850
3. Given the following data on commodities A, B, C and D:
Commodity Base Year Po Base Year qo Current Year Pn Current Year qn
A 10 12 12 15
B 7 15 5 20
C 5 24 9 20
D 16 5 14 5
Show that Fisher’s ideal index is 115.7
130
References Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL
Publications. ISBN: 978-34411-3-2
Adewoye, G. O. and Shittu O.I (1999): “Introduction to Socio-Economic Statistics (Survey
methods and Indicators).” Lagos:Victory Ventures. ISBN 978-33867-1-9
Connor, L. R and Morrell,(1982) A. J. “Statistics in Theory and Practice”. Seventh Edition,
London: Pitman Books Limited,
Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third Edition. London: Arnold &
Stoughton.
Moore D.S and Mc cabe G.P (1993): Introduction to the Practice of Statistics. Second edition.
New York: W.H. Freeman and coy.
131
Study Session 11: Time Series Analysis
Introduction
Time series analysis is the application of time series techniques to time-structured data, usually referred to as time series data. Time series data is the record of observations measuring a certain quantity of interest at regular or irregular intervals of time.
The observations may be recorded daily, weekly, quarterly, bi-annually or yearly. A time series is a realization or sample function from a certain stochastic process. Time series occur in many fields such as Agriculture, Engineering, Business and Economics, Geophysics, Medical Sciences, Meteorology, Quality Control, Social Sciences, and so on.
In this study session, you will learn about time series data, methods of analysis of time series data and the components of a time series.
Learning Outcomes for Study Session 11
At the end of this study session, you should be able to:
11.1 Define and identify a time series data;
11.2 Identify the methods of analysis of time series data
11.3 Estimate and isolate the components of a time series
11.1 Time Series Analysis
The goal of time series analysis is to identify, within a given class of flexible models, a model that reasonably approximates the time-structured relationship of the process that generates the data. The original use of time series analysis was primarily as an aid to forecasting. In recent times, the task has grown to the extent that time series analysts develop reasonably simple models capable of describing the system that generates the time-structured data, making reliable forecasts for the future and testing hypotheses.
Uses of Time Series Models
Time series analysis is the study of the time-structured relationship in a variable. This involves
the use of the basic tools to analyze a given time series data with a view to:
Construct simple mathematical systems that explain the time-structured relationship in
the economic and social series in a concise way.
Use the model to explain the behavior of the series and make reliable forecast for the
future on the basis of the dynamic dependence of the series on the past values.
Thus time series analysis provides a basis for economic and business planning, production and system planning, and the control and optimization of industrial processes. The intrinsic nature of a time series is that its observations are dependent or correlated, and the order of the observations is therefore important.
Since life must be understood looking backwards and must be lived looking forward, time series analysis provides useful tools that help to predict the future by approximating models that use past data.
A discrete time series is one where observations are taken at discrete, specific time intervals, usually equally spaced, e.g. interest rates, yields, volume of sales and production. Such series arise in fields such as Agriculture, Business circles, etc.
A continuous time series consists of observations taken at any time t ($t \in T$) in the index set T. This type of series is common in Engineering, Geophysics and the Medical Sciences.
In-Text Question
Discrete time series is one where observations are taken at discrete specific time intervals,
usually equally spaced. True or False
In-Text Answer
True
11.2 Methods of Analysis of Time Series Data
Time series data can be analyzed using either the deterministic method or the dynamic method.
Deterministic Method
A time series is said to be deterministic if future values are determined exactly by some mathematical function. For example
(i) $X_t = a + bt$
(ii) X = Cos(2t)
where a and b are constants and t is time, which is fixed.
(iii) $X_t$ expressed in terms of its components $T_t$, $S_t$, $C_t$ and $I_t$,
where Tt is the trend component; St is the seasonal component; Ct is the cyclical component and It is the irregular component.
Dynamic (Non-deterministic) Method
A time series is said to be non-deterministic if future values can only be determined in terms of a probability distribution guided by some assumptions. For example:
(i) $X_t = a + bt + \varepsilon_t$
(ii) $X_t = A \sin(\omega t + \theta)$
where $\varepsilon_t$ is normally distributed with mean zero and variance unity, A is a constant and $\theta$ is a random variable from a uniform distribution on the interval $[-\pi, \pi]$, independent of A.
This method involves the use of the autocorrelation function (ACF), the partial autocorrelation function (PACF) and the correlogram in the discrete (time) domain. Analysis in the frequency domain is carried out using extensions of the Fourier method (Fourier transforms) and the spectral density function.
In-Text Question
A time series is said to be non-deterministic if future values can only be determined in terms of a
probability distribution guided by some assumptions. True or False
In-Text Answer
True
Deterministic Time Series Analysis
The analysis of a time series depends on the type of system that generates the data. Analysis in the time domain refers to the analysis of discrete time series.
Simple Descriptive Analysis
Most social and economic data including data generated in medicine are time structured. They
need to be summarized with a view to make inference about the system that generates the data.
Time Plot
The first and most important diagnostic tool of time series data is the time plot. It is a graphical
representation of a time series data. It is constructed by plotting the observation on the
vertical axis against time on the horizontal axis. When properly drawn, it shows up the
important features of the series such as trend, seasonality, discontinuities and outliers. The time
plot of the data gives an idea of the type of model that is suitable for the data.
It could also indicate whether it would be necessary to transform the observed data to achieve
certain stable conditions suitable for meaningful analysis and inference.
Fig. 1.1: Time plot of a series (Xt on the vertical axis against time t on the horizontal axis)
11.3 Components of Time Series
Movements in time-structured data are governed by some peculiar and inherent forces, which may be characterized by their regularity or periodicity and their effect on the entire series. The forces could also be due to changes in the social, economic, psychological or environmental characteristics of the system.
The patterns generated by these forces are referred to as the components of a time series. Some of the components are: the trend (Tt), seasonal movement (St), cyclical movement (Ct) and the irregular movement (It).
Trend or Secular Movement
This is the long-term movement of a series in the same direction over a long period of time. It is usually characterized by a continuous increase or decrease in the values of a variable over time. This movement is generally referred to as secular variation or secular movement. A line can be drawn freehand through the plotted points on the graph of such a time series stretching over a long period; such a line is called the trend. It is denoted by (Tt). The time plot below shows the trend in a series.
Fig. 1.2: Time plot showing Trend
Trend can be upward or downward. Upward trend is displayed in the time plot. This type of plot
is expected from sales of a commodity where increase is always expected.
Seasonal Variation
This refers to identical or almost identical patterns which a time series appears to follow during corresponding months of successive years, mainly due to recurring events that take place annually. The movement appears to be periodic (it exhibits variation at a fixed time within a given interval of time). Many time series, such as sales figures and temperature readings, exhibit variation that is periodic annually.
There are factors responsible for this repetitive pattern year after year, and the major factor is weather conditions. During winter, more woollen clothes are sold in the UK and some other parts of the world. Also, regardless of any increasing trend in the sales of ice cream, more ice cream is sold during summer than during winter.
Seasonal variation is denoted by (St). A time plot of monthly rainfall is given below. A season is completed within one year; therefore a complete cycle is detected in the time plot below. Seasonal variation is also found in quarterly data. In that case, a complete cycle is completed within the four quarters that make up a year.
Fig. 1.3 : Time plot showing seasonal variation
Cyclical Variation
Cyclical variation refers to a long-term oscillation about the trend, which may or may not be periodic, due to some other physical causes. The movement may or may not follow exactly the same pattern after equal intervals of time. Examples include daily variation in temperature and rainfall as well as some social and economic variables. A cyclical variation is denoted by (Ct).
Fig. 1.4: Time Plot showing Cyclical Variation
Irregular Variation
This refers to the erratic or sporadic movement of a time series due to the occurrence of random, chance events which are unforeseen; hence, it cannot be isolated directly. Such movements are not deterministic. These variations may or may not be random.
Although it is assumed that these chance events produce short-term variation, they can be very intense and may result in a new cyclical or other variation. Included among these random factors are such events as strikes, floods, volcanic eruptions, earthquakes, fire outbreaks, sudden changes in government policy and so on. It is denoted by (It).
Method of Combining Components
The task of the statistician is to segregate each of these factors in so far as this is possible. By isolating or removing individual components, the impact of each of the components may be assessed. It may happen that not all of the components are present.
Traditionally, it is possible to decompose a time series into the trend, seasonal, cyclical and irregular components, using either the
Additive model: $X_t = T_t + S_t + C_t + I_t$
or the Multiplicative model: $X_t = T_t \times S_t \times C_t \times I_t$
where Tt is the trend component; St is the seasonal component; Ct is the cyclical component and It is the irregular component.
The resulting trend equation can be used for forecasting, while the original data can be de-seasonalized.
For example, the trend can be estimated using either the
(i) k-point moving average,
(ii) semi-average, or
(iii) least squares method.
Assuming all these methods are familiar to us, the least squares method uses the normal equation
$$X_t = a + bt + \varepsilon_t$$
with the assumption that the error terms are independent and not serially correlated. Otherwise, the regression equation is spurious, i.e. the parameters of the model are biased and inconsistent due to the presence of a lagged dependent variable, and the estimated OLS standard error is invalid.
Decomposition of a time series can be achieved using any of the following models:
(a) Additive model
Xt = Tt + St + Ct + It
(b) Multiplicative model
Xt = Tt . St . Ct . It
(c) Mixed model
Xt = Tt StCt + It
or Xt = Tt + St . Ct . It
We shall concentrate first on (a) and (b).
The additive model assumes that the actual values are the sum of the four separate effects. This
assumption is probably true when short periods are involved or where the rate of growth or
decline in the trend is small as may be shown in the time plot.
The multiplicative model suggests that the actual values are the product of the separate effects.
This model is indicated when there is a marked (or sharp) growth or decline in a time series data
as may be shown in the time plot.
Decomposition of the Components
Either of these models may be used to effect the decomposition of the time series. The idea is to
decompose a time series into each of the basic components, analyze each component separately
and then recombine them in order to describe the variation in the series as a whole.
The process involves systematic evaluation of each component from the data. The first stage is
usually to estimate the trend and eliminate it from each time period from the actual data by
subtraction or division to give a de-trended series.
A de-trended series can be obtained using:
Additive model: Xt – Tt = St + Ct + It or
Multiplicative model: $X_t / T_t = S_t \times C_t \times I_t$
Estimation of Seasonal Indices
The first step in the estimation of the seasonal effect is to obtain the deviation from the trend, Xt – Tt (for the additive model), or the ratio to trend, $X_t / T_t$ (for the multiplicative model).
The de-trended series is averaged day by day, month by month, etc. to produce an estimate of the seasonal components. Depending on whether the seasonal effect is thought to be additive or multiplicative, the deviations (or ratios) are arranged in a table with a view to obtaining the averages, otherwise referred to as the seasonal indices St.
For the additive model, the condition $\sum_{i=1}^{K} S_i = 0$ is imposed. That is, the seasonal effects (indices) over the quarters add up to zero because, if there were no seasonal effect, we would expect Xt – Tt = 0. If the means do not sum to zero, their sum is averaged over the quarters / months / weeks / days, and the seasonal effects are adjusted by subtracting this average from each mean to obtain the adjusted means (i.e. the seasonal effects). That is, $S_i = \frac{1}{n}\sum d_i$, where the $d_i$ are the deviations from trend for season i; but if $\sum S_i \neq 0$, say $\sum S_i = m$, then $g = m / K$ and therefore
$$\sum_{i=1}^{K} (S_i - g) = 0 \qquad (1.3)$$
For the multiplicative model, the condition $\sum_{j=1}^{S} S_j = S$ is imposed, where S is the number of seasons in a year (the number of quarters in a quarterly series). That is, the sum of the seasonal effects over the year is S. The ratio of the actual values (Xt) to the trend (Tt) is obtained as $X_t / T_t$ because, if there were no seasonal effect, we would expect $X_t / T_t = 1$ for each time period. Thus
$$\frac{1}{S}\sum_{j=1}^{S} S_j = 1 \qquad (1.4)$$
The averaging procedure which produces the seasonal components follows the same pattern as in the additive model, except for the adjustment to the averages, which corrects their sum to S. This is achieved by dividing S by the sum of the averages and multiplying the resulting quotient by the unadjusted averages.
Let $\sum S_j = C$; then the adjusted index is $S_j \times S / C$. Thus the de-trended, de-seasonalized series can be obtained by eliminating the trend (Tt) and seasonal components (St) for each time period from the actual data by subtraction or division, depending on whether the additive or multiplicative model was used.
Additive model: Xt – Tt – St = Ct + It
Multiplicative model: $X_t / (T_t S_t) = C_t \times I_t$
The de-seasonalized series is obtained after the seasonally adjusted data have been calculated. The residual ratio is obtained either by dividing the seasonally adjusted figures $X_t / S_t$ by the trend values or by dividing the de-trended series $X_t / T_t$ by the respective seasonal indices.
Finally, the cyclical variation (Ct) can be found by smoothing the joint Ct and It components and is eliminated as before.
Residual irregular components (It) can be obtained by subtraction or division:
Additive model: Xt – Tt – St – Ct = It
Multiplicative model: $X_t / (T_t S_t C_t) = I_t$
Although the general method of decomposition has included the four possible components which
make up a time series, it should be noted that it is not a rule for all the four to be present. If
annual data are being used, there can be no seasonal component. Similarly, if short periods of
time are involved, the cyclical components can be ignored. In both cases one of the steps
outlined in the decomposition of time series above may be omitted.
Prediction / Forecasting
The essence of decomposing a time series is for a statistician to measure the effect of each
component and to make meaningful and reliable forecast; taking into consideration the effect of
the component on the forecast values for different time periods.
Thus, if a multiplicative model was used, a sensible predictor for the period k might be
$$\hat{X}_t(k) = \hat{T}_{t+k} \times \hat{S}_{t+k}$$
where $\hat{T}_t$ and $\hat{S}_t$ are the estimated trend and seasonal effects respectively.
Similarly, if the additive model was used, the predictor for period k might be
$$\hat{X}_t(k) = \hat{T}_{t+k} + \hat{S}_{t+k}$$
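The two predictors can be sketched in code once a trend and a set of seasonal indices are available. The sketch below makes several assumptions that are not part of the text: the trend is taken to be linear and fitted by least squares with time numbered t = 1, 2, ..., n; the seasonal indices are listed in the order of the seasons starting from t = 1; and the small series and indices used are hypothetical.

```python
def fit_linear_trend(x):
    """Least-squares estimates (a, b) of the trend X_t = a + b*t, t = 1..n."""
    n = len(x)
    t = range(1, n + 1)
    t_mean = sum(t) / n
    x_mean = sum(x) / n
    b = (sum((ti - t_mean) * (xi - x_mean) for ti, xi in zip(t, x))
         / sum((ti - t_mean) ** 2 for ti in t))
    a = x_mean - b * t_mean
    return a, b


def forecast(a, b, seasonal, t, multiplicative=False):
    """Combine the estimated trend and seasonal effect for time period t."""
    trend = a + b * t
    effect = seasonal[(t - 1) % len(seasonal)]  # season that period t falls in
    return trend * effect if multiplicative else trend + effect


series = [2, 4, 6, 8]                  # hypothetical observations
a, b = fit_linear_trend(series)
additive_indices = [1, -1]             # hypothetical, sum to zero
multiplicative_indices = [1.5, 0.5]    # hypothetical, sum to the number of seasons

print(forecast(a, b, additive_indices, 5))
print(forecast(a, b, multiplicative_indices, 5, multiplicative=True))
```

Any other trend estimate (for example, an extrapolated moving average) could be substituted for the fitted line without changing the way the seasonal effect is combined with it.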
Example
The data below gives the monthly sales of umbrella in XYZ company from 2004 – 2011
Table 11.1
2004 2005 2006 2007 2008 2009 2010 2011
JAN 10 15 10 8 10 8 9 10
FEB 18 12 10 9 9 12 13 8
MAR 22 13 9 12 10 10 3 11
APRIL 8 15 20 14 12 15 8 13
MAY 16 11 10 19 10 12 15 8
JUN 10 16 18 20 18 11 15 11
JUL 18 22 16 25 28 16 18 15
AUG 20 30 20 25 30 13 19 17
SEP 15 20 21 17 15 9 10 11
OCT 10 15 18 15 17 18 18 4
NOV 14 25 16 15 15 22 23 15
DEC 11 10 14 7 7 10 9 8
(a) Use a suitable average to decompose the series into trend and seasonal component, hence or
otherwise forecast the sales for 2012 – 2013 using the additive model.
(b) Which is the most appropriate model in the sense of providing the better forecast?
Solution:
The first thing to do is to construct the time plot in order to view the maximum and minimum
values, examine the existence of outliers and fluctuations.
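The moving totals, centred moving averages and additive seasonal indices for the data of Table 11.1 can be sketched as follows. This is an illustration, not the manual's own procedure: the function names are ours, and each centred average is aligned with the seventh month of its first 12-month window (the usual centring convention). Table 11.2 lists the same averages one row earlier, so the deviations and indices produced here need not coincide with its columns 6 and 7.

```python
# Monthly umbrella sales from Table 11.1, one list per year (January to December).
sales_by_year = {
    2004: [10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11],
    2005: [15, 12, 13, 15, 11, 16, 22, 30, 20, 15, 25, 10],
    2006: [10, 10, 9, 20, 10, 18, 16, 20, 21, 18, 16, 14],
    2007: [8, 9, 12, 14, 19, 20, 25, 25, 17, 15, 15, 7],
    2008: [10, 9, 10, 12, 10, 18, 28, 30, 15, 17, 15, 7],
    2009: [8, 12, 10, 15, 12, 11, 16, 13, 9, 18, 22, 10],
    2010: [9, 13, 3, 8, 15, 15, 18, 19, 10, 18, 23, 9],
    2011: [10, 8, 11, 13, 8, 11, 15, 17, 11, 4, 15, 8],
}
series = [v for year in sorted(sales_by_year) for v in sales_by_year[year]]


def moving_totals(x, k=12):
    """Sum of every run of k consecutive observations."""
    return [sum(x[i:i + k]) for i in range(len(x) - k + 1)]


def centred_moving_average(x, k=12):
    """Add the moving totals in pairs and divide by 2k (k even)."""
    totals = moving_totals(x, k)
    return [(totals[i] + totals[i + 1]) / (2 * k) for i in range(len(totals) - 1)]


def additive_seasonal_indices(x, k=12):
    """Average the deviations from trend by season, then adjust to sum to zero."""
    trend = centred_moving_average(x, k)
    sums, counts = [0.0] * k, [0] * k
    for i, t_value in enumerate(trend):
        pos = i + k // 2                  # observation the average is centred on
        sums[pos % k] += x[pos] - t_value
        counts[pos % k] += 1
    means = [s / c for s, c in zip(sums, counts)]
    g = sum(means) / k                    # average excess to remove
    return [m - g for m in means]


indices = additive_seasonal_indices(series)
deseasonalized = [v - indices[i % 12] for i, v in enumerate(series)]
print(moving_totals(series)[:3])
print([round(s, 2) for s in indices])
```

The adjustment by `g` is the step described for the additive model: the mean deviations are shifted so that the twelve seasonal indices add up to zero.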
Table 11.2: Computation of the Trend, Seasonal Indices and De-seasonalized Data
Col. 1 Col. 2 Col. 3 Col. 4 Col. 5 Col. 6 Col. 7 Col. 8
Year/Month Sales 12-Point MT Add-in-pairs Moving Average Dev. from Trend Seasonal Indices De-seasonalized data
2005 JAN 10 -4.44 14.4
FEB 18 -3.97 22.0
MAR 22 -4.62 26.6
APRIL 8 -0.45 8.4
MAY 16 -2.15 18.2
JUN 10 172 349 14.54 -4.54 0.66 9.3
JUL 18 177 348 14.50 3.50 5.72 12.3
AUG 20 171 333 13.88 6.13 7.85 12.2
SEP 15 162 331 13.79 1.21 0.74 14.3
OCT 10 169 333 13.88 -3.88 1.33 8.7
NOV 14 164 334 13.92 0.08 4.09 9.9
DEC 11 170 344 14.33 -3.33 -4.76 15.8
2006 JAN 15 174 358 14.92 0.08 -4.44 19.4
FEB 12 184 373 15.54 -3.54 -3.97 16.0
MAR 13 189 383 15.96 -2.96 -4.62 17.6
APRIL 15 194 399 16.63 -1.63 -0.45 15.4
MAY 11 205 409 17.04 -6.04 -2.15 13.2
JUN 16 204 403 16.79 -0.79 0.66 15.3
JUL 22 199 396 16.50 5.50 5.72 16.3
AUG 30 197 390 16.25 13.75 7.85 22.2
SEP 20 193 391 16.29 3.71 0.74 19.3
OCT 15 198 395 16.46 -1.46 1.33 13.7
NOV 25 197 396 16.50 8.50 4.09 20.9
DEC 10 199 392 16.33 -6.33 -4.76 14.8
2006 JAN 10 193 376 15.67 -5.67 -4.44 14.4
FEB 10 183 367 15.29 -5.29 -3.97 14.0
MAR 9 184 371 15.46 -6.46 -4.62 13.6
APRIL 20 187 365 15.21 4.79 -0.45 20.4
MAY 10 178 360 15.00 -5.00 -2.15 12.2
JUN 18 182 362 15.08 2.92 0.66 17.3
JUL 16 180 359 14.96 1.04 5.72 10.3
AUG 20 179 361 15.04 4.96 7.85 12.2
SEP 21 182 358 14.92 6.08 0.74 20.3
OCT 18 176 361 15.04 2.96 1.33 16.7
NOV 16 185 372 15.50 0.50 4.09 11.9
DEC 14 187 383 15.96 -1.96 -4.76 18.8
2007 JAN 8 196 397 16.54 -8.54 -4.44 12.4
FEB 9 201 398 16.58 -7.58 -3.97 13.0
MAR 12 197 391 16.29 -4.29 -4.62 16.6
APRIL 14 194 387 16.13 -2.13 -0.45 14.4
MAY 19 193 379 15.79 3.21 -2.15 21.2
JUN 20 186 374 15.58 4.42 0.66 19.3
JUL 25 188 376 15.67 9.33 5.72 19.3
AUG 25 188 374 15.58 9.42 7.85 17.2
SEP 17 186 370 15.42 1.58 0.74 16.3
OCT 15 184 359 14.96 0.04 1.33 13.7
NOV 15 175 348 14.50 0.50 4.09 10.9
DEC 7 173 349 14.54 -7.54 -4.76 11.8
2008 JAN 10 176 357 14.88 -4.88 -4.44 14.4
FEB 9 181 360 15.00 -6.00 -3.97 13.0
MAR 10 179 360 15.00 -5.00 -4.62 14.6
APRIL 12 181 362 15.08 -3.08 -0.45 12.4
MAY 10 181 362 15.08 -5.08 -2.15 12.2
JUN 18 181 360 15.00 3.00 0.66 17.3
JUL 28 179 361 15.04 12.96 5.72 22.3
AUG 30 182 364 15.17 14.83 7.85 22.2
SEP 15 182 367 15.29 -0.29 0.74 14.3
OCT 17 185 372 15.50 1.50 1.33 15.7
NOV 15 187 367 15.29 -0.29 4.09 10.9
DEC 7 180 348 14.50 -7.50 -4.76 11.8
2009 JAN 8 168 319 13.29 -5.29 -4.44 12.4
FEB 12 151 296 12.33 -0.33 -3.97 16.0
MAR 10 145 291 12.13 -2.13 -4.62 14.6
APRIL 15 146 299 12.46 2.54 -0.45 15.4
MAY 12 153 309 12.88 -0.88 -2.15 14.2
JUN 11 156 313 13.04 -2.04 0.66 10.3
JUL 16 157 315 13.13 2.88 5.72 10.3
AUG 13 158 309 12.88 0.13 7.85 5.2
SEP 9 151 295 12.29 -3.29 0.74 8.3
OCT 18 144 291 12.13 5.88 1.33 16.7
NOV 22 147 298 12.42 9.58 4.09 17.9
DEC 10 151 304 12.67 -2.67 -4.76 14.8
2010 JAN 9 153 312 13.00 -4.00 -4.44 13.4
FEB 13 159 319 13.29 -0.29 -3.97 17.0
MAR 3 160 320 13.33 -10.33 -4.62 7.6
APRIL 8 160 321 13.38 -5.38 -0.45 8.4
MAY 15 161 321 13.38 1.63 -2.15 17.2
JUN 15 160 321 13.38 1.63 0.66 14.3
JUL 18 161 317 13.21 4.79 5.72 12.3
AUG 19 156 320 13.33 5.67 7.85 11.2
SEP 10 164 333 13.88 -3.88 0.74 9.3
OCT 18 169 331 13.79 4.21 1.33 16.7
NOV 23 162 320 13.33 9.67 4.09 18.9
DEC 9 158 313 13.04 -4.04 -4.76 13.8
2011 JAN 10 155 308 12.83 -2.83 -4.44 14.4
FEB 8 153 307 12.79 -4.79 -3.97 12.0
MAR 11 154 294 12.25 -1.25 -4.62 15.6
APRIL 13 140 272 11.33 1.67 -0.45 13.4
MAY 8 132 263 10.96 -2.96 -2.15 10.2
JUN 11 131 0.66 10.3
JUL 15 5.72 9.3
AUG 17 7.85 9.2
SEP 11 0.74 10.3
OCT 4 1.33 2.7
NOV 15 4.09 10.9
DEC 8 -4.76 12.8
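The arithmetic behind Cols. 3 to 6 of Table 11.2 can be sketched as follows, using only the first 24 monthly sales. The function names and the `OFFSET` constant are illustrative assumptions; `OFFSET = 5` simply reproduces the table's layout, in which the first moving average is written against JUN of the first year.

```python
# Sales for the first 24 months of Table 11.2 (Col. 2), in order.
sales = [10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11,
         15, 12, 13, 15, 11, 16, 22, 30, 20, 15, 25, 10]


def moving_totals(values, width=12):
    """Col. 3: sum of each run of `width` consecutive observations."""
    return [sum(values[i:i + width]) for i in range(len(values) - width + 1)]


def add_in_pairs(totals):
    """Col. 4: sum of each pair of neighbouring moving totals."""
    return [a + b for a, b in zip(totals, totals[1:])]


totals = moving_totals(sales)
pairs = add_in_pairs(totals)

# Col. 5: each paired sum covers 24 monthly values, so divide by 24.
moving_avg = [p / 24 for p in pairs]

# Col. 6: sales minus trend. OFFSET = 5 follows the table's placement
# (first average against the 6th month); a strictly centred placement
# would put it against the 7th month (OFFSET = 6).
OFFSET = 5
deviations = [sales[i + OFFSET] - ma for i, ma in enumerate(moving_avg)]

for i in range(3):
    print(totals[i], pairs[i], round(moving_avg[i], 2), round(deviations[i], 2))
```

With all 96 observations in `sales`, the same functions would generate the whole of Cols. 3 to 6.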
Table 11.3: Seasonal Indices
Year            JAN     FEB     MAR   APRIL     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
2004              -       -       -       -       -   -4.54    3.50    6.13    1.21   -3.88    0.08   -3.33
2005           0.08   -3.54   -2.96   -1.63   -6.04   -0.79    5.50   13.75    3.71   -1.46    8.50   -6.33
2006          -5.67   -5.29   -6.46    4.79   -5.00    2.92    1.04    4.96    6.08    2.96    0.50   -1.96
2007          -8.54   -7.58   -4.29   -2.13    3.21    4.42    9.33    9.42    1.58    0.04    0.50   -7.54
2008          -4.88   -6.00   -5.00   -3.08   -5.08    3.00   12.96   14.83   -0.29    1.50   -0.29   -7.50
2009          -5.29   -0.33   -2.13    2.54   -0.88   -2.04    2.88    0.13   -3.29    5.88    9.58   -2.67
2010          -4.00   -0.29  -10.33   -5.38    1.63    1.63    4.79    5.67   -3.88    4.21    9.67   -4.04
2011          -2.83   -4.79   -1.25    1.67   -2.96       -       -       -       -       -       -       -
Total        -31.13  -27.83  -32.42   -3.21  -15.13    4.58   40.00   54.88    5.13    9.25   28.54  -33.38
AVG           -4.45   -3.98   -4.63   -0.46   -2.16    0.65    5.71    7.84    0.73    1.32    4.08   -4.77   -0.10
Adjustment     0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01    0.01   -0.01
S.I           -4.44   -3.97   -4.62   -0.45   -2.15    0.66    5.72    7.85    0.74    1.33    4.09   -4.76    0.00
The final entry of the AVG row, -0.10, is the sum of the twelve monthly averages. Each average is therefore adjusted by +0.01 (0.10/12 is approximately 0.01) so that the seasonal indices (S.I) sum to 0.00.
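The construction of the seasonal indices in Table 11.3 can be sketched as below: average the deviations from trend month by month, then shift every average by the same amount so that the twelve indices sum to zero. The function name `seasonal_indices` and the use of `None` for months with no moving average are illustrative assumptions. Because the deviations are entered already rounded to two decimals, the results may differ from the table in the second decimal place.

```python
MONTHS = ["JAN", "FEB", "MAR", "APRIL", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Deviations from trend, one row per year in the order of Table 11.3.
# None marks a month for which no moving average is available.
DEVIATIONS = [
    [None, None, None, None, None, -4.54, 3.50, 6.13, 1.21, -3.88, 0.08, -3.33],
    [0.08, -3.54, -2.96, -1.63, -6.04, -0.79, 5.50, 13.75, 3.71, -1.46, 8.50, -6.33],
    [-5.67, -5.29, -6.46, 4.79, -5.00, 2.92, 1.04, 4.96, 6.08, 2.96, 0.50, -1.96],
    [-8.54, -7.58, -4.29, -2.13, 3.21, 4.42, 9.33, 9.42, 1.58, 0.04, 0.50, -7.54],
    [-4.88, -6.00, -5.00, -3.08, -5.08, 3.00, 12.96, 14.83, -0.29, 1.50, -0.29, -7.50],
    [-5.29, -0.33, -2.13, 2.54, -0.88, -2.04, 2.88, 0.13, -3.29, 5.88, 9.58, -2.67],
    [-4.00, -0.29, -10.33, -5.38, 1.63, 1.63, 4.79, 5.67, -3.88, 4.21, 9.67, -4.04],
    [-2.83, -4.79, -1.25, 1.67, -2.96, None, None, None, None, None, None, None],
]


def seasonal_indices(rows):
    """Return (indices, adjustment) for an additive model."""
    averages = []
    for m in range(12):
        column = [row[m] for row in rows if row[m] is not None]
        averages.append(sum(column) / len(column))
    # Spread the excess equally so that the indices sum to zero.
    adjustment = -sum(averages) / 12
    return [avg + adjustment for avg in averages], adjustment


indices, adjustment = seasonal_indices(DEVIATIONS)
for month, si in zip(MONTHS, indices):
    print(f"{month:<6}{si:7.2f}")
```

Subtracting the index for a month from that month's sales gives the de-seasonalized value in Col. 8 of Table 11.2.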
Fig. 1.5: Time Plot of Sales of Umbrella, Moving Average and De-seasonalized Data (horizontal axis: months, 2004 to 2011; vertical axis: 0 to 35)
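Part (a) asks for forecasts for 2012 to 2013 under the additive model, where the predictor is the estimated trend plus the seasonal index of the target month. The sketch below is one possible way to obtain them, assuming (this is not stated in the manual) that the trend is a straight line fitted by least squares to the de-seasonalized data. The names `fit_line` and `forecast` are illustrative.

```python
# Monthly sales from Table 11.1 in chronological order (JAN 2004 .. DEC 2011).
SALES = [
    10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11,   # 2004
    15, 12, 13, 15, 11, 16, 22, 30, 20, 15, 25, 10,  # 2005
    10, 10, 9, 20, 10, 18, 16, 20, 21, 18, 16, 14,   # 2006
    8, 9, 12, 14, 19, 20, 25, 25, 17, 15, 15, 7,     # 2007
    10, 9, 10, 12, 10, 18, 28, 30, 15, 17, 15, 7,    # 2008
    8, 12, 10, 15, 12, 11, 16, 13, 9, 18, 22, 10,    # 2009
    9, 13, 3, 8, 15, 15, 18, 19, 10, 18, 23, 9,      # 2010
    10, 8, 11, 13, 8, 11, 15, 17, 11, 4, 15, 8,      # 2011
]

# Seasonal indices, JAN .. DEC (S.I row of Table 11.3).
SEASONAL = [-4.44, -3.97, -4.62, -0.45, -2.15, 0.66,
            5.72, 7.85, 0.74, 1.33, 4.09, -4.76]


def fit_line(y):
    """Least-squares line y = a + b*t for t = 1..len(y); returns (a, b)."""
    n = len(y)
    t_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y, start=1))
    sxx = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    b = sxy / sxx
    return y_mean - b * t_mean, b


# Remove the seasonal effect, then fit the (assumed linear) trend.
deseasonalized = [x - SEASONAL[i % 12] for i, x in enumerate(SALES)]
a, b = fit_line(deseasonalized)


def forecast(k, n=len(SALES)):
    """Additive predictor k months ahead: trend at n+k plus seasonal index."""
    t = n + k
    return a + b * t + SEASONAL[(t - 1) % 12]


forecasts = [forecast(k) for k in range(1, 25)]  # JAN 2012 .. DEC 2013
for k, value in enumerate(forecasts, start=1):
    print(k, round(value, 1))
```

A different trend estimate (for example, extrapolating the moving average) would change the forecasts but not the additive structure of the predictor.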
Summary for Study Session 11
In this study session, you have learnt about:
1. The components of a time series, illustrated with charts.
2. The additive and multiplicative methods of analysing time series data, with examples.
3. The procedure for constructing seasonal indices and de-seasonalized data, with
examples illustrating the techniques.
Self-Assessment Questions (SAQs) for Study Session 11
Now that you have completed this study session, you can assess how well you have achieved its
learning outcomes by answering the following questions.
SAQ 11.1 - 11.3
1. Explain clearly the reasons for analysing a time series.
2. Sixteen successive observations of a given time series are:
1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6, 1.5,
0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2
(i) Obtain the time plot of the observations.
(ii) Use a 3-point moving average to estimate the trend values.
3. For the following time series:
Year $Y_t$
1990 2.4
1991 3.6
1992 5.4
1993 7.8
1994 11.6
1995 17.3
(i) Fit a linear trend to the above data.
(ii) Fit a quadratic trend.
References Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition
Published by H.E.B.Paperback.
Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &
Stoughton
Cyprian A. Oyeka (1990): An Introduction to Applied Statistical Method in the Sciences. Enugu:
Nobern Avocation publishing coy.
Gupta, C. B. (1973)“An Introduction to Statistical Methods” New Delhi: Vikas Publishing House
PVT Ltd.
Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, Second Edition.
New Delhi: W.H. Freeman and coy.
Shittu, O. I. and Yaya, O. S. (2011): “Introduction to Time Series Analysis”, Babs-Tunde
Intercontinental Print, Nigeria. ISBN 978-33867-1-9. pp. 282