COURSE MANUAL

Descriptive Statistics STA 111

University of Ibadan Distance Learning Centre Open and Distance Learning Course Series Development

Copyright © 2009, Revised in 2015 by Distance Learning Centre, University of Ibadan, Ibadan. All rights reserved. No part of this publication may be reproduced, stored in a retrieval System, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN 978-021-269-8

General Editor: Prof. Bayo Okunade

University of Ibadan Distance Learning Centre University of Ibadan,

Nigeria

Telex: 31128NG

Tel: +234 (80775935727) E-mail: [email protected]

Website: www.dlc.ui.edu.ng

Vice-Chancellor’s Message

The Distance Learning Centre is building on a solid tradition of over two decades of service in the provision of the External Studies Programme and now Distance Learning Education in Nigeria and beyond. The Distance Learning mode to which we are committed is giving many deserving Nigerians access to higher education, especially those who by the nature of their engagement do not have the luxury of full-time education. Recently, it has also been contributing in no small measure to providing places for the teeming Nigerian youths who, for one reason or another, could not gain admission into the conventional universities.

These course materials have been written by writers specially trained in ODL course delivery. The writers have made great efforts to provide up to date information, knowledge and skills in the different disciplines and ensure that the materials are user-friendly.

In addition to the provision of course materials in print and e-format, a lot of Information Technology input has also gone into the deployment of course materials. Most of them can be downloaded from the DLC website and are available in audio format, which you can also download onto your mobile phone, iPod or MP3 player, among other devices, to allow you to listen to the audio study sessions. Some of the study session materials have been scripted and are being broadcast on the university’s Diamond Radio FM 101.1, while others have been delivered and captured in audio-visual format in a classroom environment for use by our students. Detailed information on availability and access is available on the website. We will continue in our efforts to provide and review course materials for our courses.

However, for you to take advantage of these formats, you will need to improve your I.T. skills and develop the requisite distance learning culture. It is well known that, for efficient and effective provision of distance learning education, the availability of appropriate and relevant course materials is a sine qua non. So also is the availability of multiple platforms for the convenience of our students. It is in fulfilment of this that a series of course materials is being written to enable our students to study at their own pace and convenience.

It is our hope that you will put these course materials to the best use.

Prof. Abel Idowu Olayinka

Vice-Chancellor

Foreword

As part of its vision of providing education for “Liberty and Development” for Nigerians and the international community, the University of Ibadan Distance Learning Centre has recently embarked on a vigorous repositioning agenda aimed at embracing a holistic and all-encompassing approach to the delivery of its Open Distance Learning (ODL) programmes. Thus we are committed to global best practices in distance learning provision. Apart from providing efficient administrative and academic support for our students, we are committed to providing educational resource materials for the use of our students. We are convinced that, without up-to-date, learner-friendly and distance learning compliant course materials, there cannot be any basis to lay claim to being a provider of distance learning education. Indeed, availability of appropriate course materials in multiple formats is the hub of any distance learning provision worldwide.

In view of the above, we are vigorously pursuing, as a matter of priority, the provision of credible, learner-friendly and interactive course materials for all our courses. We commissioned teams of experts to author and review the course materials, and their outputs were subjected to rigorous peer review to ensure quality. The approach not only emphasizes cognitive knowledge, but also skills and humane values which are at the core of education, even in an ICT age.

The development of the materials which is on-going also had input from experienced editors and illustrators who have ensured that they are accurate, current and learner-friendly. They are specially written with distance learners in mind. This is very important because, distance learning involves non-residential students who can often feel isolated from the community of learners.

It is important to note that, for a distance learner to excel there is the need to source and read relevant materials apart from this course material. Therefore, adequate supplementary reading materials as well as other information sources are suggested in the course materials.

Apart from the responsibility for you to read this course material with others, you are also advised to seek assistance from your course facilitators, especially academic advisors, during your study, even before the interactive session, which is by design for revision. Your academic advisors will assist you using convenient technology including Google Hangouts, YouTube, Talk Fusion, etc., but you have to take advantage of these. It is also going to be of immense advantage if you complete assignments as and when due so as to have the necessary feedback as a guide.

The implication of the above is that a distance learner has a responsibility to develop the requisite distance learning culture, which includes diligent and disciplined self-study, seeking available administrative and academic support, and the acquisition of basic information technology skills. This is why you are encouraged to develop your computer skills by availing yourself of the training opportunities that the Centre provides and putting these skills to use.

In conclusion, it is envisaged that the course materials would also be useful for the regular students of tertiary institutions in Nigeria who are faced with a dearth of high quality textbooks. We are therefore, delighted to present these titles to both our distance learning students and the university’s regular students. We are confident that the materials will be an invaluable resource to all. We would like to thank all our authors, reviewers and production staff for the high quality of work.

Best wishes.

Professor Bayo Okunade

Director

Course Development Team

Content Authoring: Shittu O. I.

Content Editor: Prof. Remi Raji-Oyelade

Production Editor: Ogundele Olumuyiwa Caleb

Learning Design/Assessment Authoring: Folajimi Olambo Fakoya

Managing Editor: Ogunmefun Oladele Abiodun

General Editor: Prof. Bayo Okunade

Course Introduction

The goals of a public enterprise, corporate body or individual are achieved if decisions are based on accurate, reliable and timely information called ‘data’. A mass of data is useful only when it is organized, summarized and presented in a manner that enhances comprehension of the actual situation on the ground and makes clear the significant relationships among the variables under investigation. This ensures that the trends and patterns of movement of individual variables are determined.

Descriptive statistics is therefore an aspect of statistics that deals with the compilation and presentation of data, not necessarily for the purpose of rigorous statistical analysis but simply to provide concise information on which decisions can be taken.

The purpose of this lecture note is therefore to introduce you to the discipline called ‘Statistics’, its nature, scope and coverage. The various methods of data presentation are discussed; simple summaries of data, as well as the different methods of making comparisons among variables, especially those measured in different units, are treated in detail. The method of interpretation of results is also given.

Objectives

By the time you have finished this lecture note, you should be able to:

1. explain the use of statistics in our day-to-day activities;
2. present data in tables, charts and diagrams;
3. identify and calculate the measures of location suitable for a particular set of data based on the purpose of the inquiry;
4. calculate the measures of variation and the coefficient of variation;
5. examine the shape of a distribution for normality, skewness and kurtosis;
6. know and explain the various methods of data collection and the situations under which each of them can be used;
7. obtain the linear regression model from bivariate data;
8. calculate and interpret the correlation coefficient;
9. discuss the concepts of rate, ratio and proportion;
10. calculate different types of indices from given data; and
11. discuss the considerations, uses and limitations of consumer price indices.

Table of Contents

Study Session 1 What is Statistics?
    Introduction
    Learning Outcomes for Study Session 1
    1.1 Meaning of statistics
    1.2 Branches of Statistics
    1.3 Uses of Statistics
    1.4 Terms and Concepts in Statistics
        1.4.1 Population, Sample and Variate
        1.4.2 What is Data?
    Summary
    Self-Assessment Question (SAQs) for Study Session 1
        SAQ 1.1 (Tests Learning Outcomes 1.1)
        SAQ 1.2 (Tests Learning Outcomes 1.2)
        SAQ 1.3 (Tests Learning Outcomes 1.3)
        SAQ 1.4 (Tests Learning Outcomes 1.4)
    Notes on SAQ
    References

Study Session 2 Presentation of Data
    Introduction
    Learning Outcomes for Study Session 2
    2.1 Ways of Presenting a Mass of Data
    2.2 Frequency Table
        2.2.1 Cumulative Curve (OGIVE)
    2.3 Simple descriptive analysis of data in tables and diagrams
        2.3.1 Histogram
        2.3.2 Stem plots (Stem and Leave Plots)
        2.3.3 Back-to-back stemplot
        2.3.4 Box Plot (Box and Whiskers Plot)
    Summary
    Self-Assessment Question (SAQs) for Study Session 2
        SAQ 2.1 (Tests Learning Outcomes 2.1)
        SAQ 2.2 (Tests Learning Outcomes 2.2)
        SAQ 2.3 (Tests Learning Outcomes 2.3)
    Notes on SAQ
    Reference

Study Session 3 Measure of the Centre of a Set of Observations
    Introduction
    Learning Outcomes for Study Session 3
    3.1 Measures of Central Tendency
    3.2 Mean
        3.2.1 Calculation of Mean from Grouped Data
    3.3 Median
        3.3.1 Calculation of Median From a grouped data
    3.4 Mode
        3.4.1 Calculation of Mode from Grouped Data
    3.5 Partition Values
    3.6 Other Measures of Central Tendency
        3.6.1 Other Partition Values from Grouped Data
    Summary
    Self-Assessment Question (SAQs) for Study Session 3
        SAQ 3.1 (Tests Learning Outcomes 3.1)
        SAQ 3.2 (Tests Learning Outcomes 3.2)
        SAQ 3.3 (Tests Learning Outcomes 3.3)
        SAQ 3.4 (Tests Learning Outcomes 3.4)
        SAQ 3.5 (Tests Learning Outcomes 3.5)
        SAQ 3.6 (Tests Learning Outcomes 3.6)
    Notes on SAQ
    References

Study Session 4 Measures of Dispersion/Variation
    Introduction
    Learning Outcomes for Study Session 4
    4.1 Variation and its Measures
    4.2 The Range
    4.3 The Mean Absolute Deviation
    4.4 The Variance
    4.5 Standard Deviation
    4.6 Coding Method
    Summary
    Self-Assessment Question (SAQs) for Study Session 4
        SAQ 4.1 (Tests Learning Outcomes 4.1)
        SAQ 4.2 (Tests Learning Outcomes 4.2)
        SAQ 4.3 (Tests Learning Outcomes 4.3)
        SAQ 4.4 (Tests Learning Outcomes 4.4)
        SAQ 4.5 (Tests Learning Outcomes 4.5)
        SAQ 4.6 (Tests Learning Outcomes 4.6)
    Notes on SAQ
    References

Study Session 5 Algebraic Treatment of Mean and Variance
    Introduction
    Learning Outcomes for Study Session 5
    5.1 Pooled Mean and Variance
    5.2 Adjusting Values of Mean and Standard Deviations for Mistakes
    Self-Assessment Question (SAQs) for Study Session 5
        SAQ (Tests Learning Outcomes)
    References

Study session 6: Measure of Skewness and Kurtosis
    Introduction
    Learning outcomes for study session 6
    6.1 Define skewness and kurtosis
    6.2 Calculating measure of skewness and kurtosis from simple series and grouped data
    6.3 Determining whether a set of data; is normally distributed, the direction of skewness and the level of peakedness
    Summary for study session 6
    Self-Assessment Questions (SAQs) for Study Session 6
        SAQ 6.1-6.2
    Reference

Study session 7: Methods of Collecting Statistical Data
    Introduction
    Learning outcomes for study session 7
    7.1 The various methods of data collection
    7.2 Limitations of Data Collection in Nigeria
    Summary for study session 7
    Self-Assessment Questions (SAQs) for Study Session 7
        SAQ 7.1-7.2
    References

Study Session 8: Regression Analysis
    Introduction
    Learning outcomes for Study Session 8
    8.1 Regression Analysis
    8.2 Types of Regression Models
    8.3 The Simple Regression Model
    8.4 Estimation of Parameters
    8.5 Testing the Significance of the Model
    Summary for Study Session 8
    Self-Assessment Questions (SAQs) for Study Session 8
        SAQ 8.1-8.5
    References

Study Session 9: Correlation and Association
    Introduction
    Learning outcomes for Study Session 9
    9.1 Correlation
    9.2 Coefficient of Rank Correlation
    Summary for Study Session 9
    Self-Assessment Questions (SAQs) for Study Session 9
        SAQ 9.1-9.2
    References

Study Session 10: Proportions, Rates and Indices
    Introduction
    Learning Outcomes for Study Session 10
    10.1 Proportion, Rates and indices
    10.2 Consideration for an Index Number
    10.3 Methods of Construction of Price Index
    10.4 Uses of Consumer Price Index
    Summary for Study Session 10
    Self-Assessment Questions (SAQs) for Study Session 10
        SAQ 10.1-10.4
    References

Study Session 11: Time Series Analysis
    Introduction
    Learning Outcomes for Study Session 11
    11.1 Time Series Analysis
    11.2 Methods of Analysis of Time Series data
    11.3 Components of Time Series
    Summary for Study Session 11
    Self-Assessment Questions (SAQs) for Study Session 11
        SAQ 11.1-11.3
    References

Study Session 1 What is Statistics?

Introduction

Statistics is a universal subject used in all disciplines and in all areas of human endeavour. The word statistics was originally applied only to such data as the state required for its official purposes. To a layman, it also refers to any set of quantitative data relating to a particular measurement, whether or not that data is of interest.

The systematic collection of official statistics for political purposes originated in Germany towards the end of the 18th century, with the comparison of data such as population and industrial and agricultural output. Also, in England, the collection of numerical data enabled government departments to predict levels of revenue and expenditure with more precision than before.

Learning Outcomes for Study Session 1

When you have studied this session, you should be able to:

1.1 Explain the meaning of statistics;

1.2 Discuss the nature, scope and coverage of statistics;

1.3 Mention the use of statistics in our day-to-day activities;

1.4 Define terms and concepts that would facilitate understanding of this course.

1.1 Meaning of statistics

The earliest origin of statistics lies in the desire of rulers to count the number of inhabitants or measure the value of taxable land in their domains. This has developed into the careful measurement of weight and distance and the counting of physical quantities and items in many disciplines, such as agriculture and the life and behavioral sciences.

The study of statistics is therefore essential for sound reasoning, precise judgment and objective decisions based on up-to-date, accurate and reliable data.

Box 1.1: Meaning of Statistics

Statistics can simply be defined as the “science of data”. It is the science of collecting, organizing and interpreting numerical facts, which we call data.

Most of us, especially media reporters, have little or nothing to do with a large mass of data.

Statistics is also the science and practice of developing human knowledge through the use of empirical data expressed in quantitative form. It is based on statistical theory. It is a branch of applied mathematics where randomness and uncertainty are modelled by probability theory (Wikipedia Encyclopedia).

In Nigeria, official data collection and its usage started with the Statistical Act of 1947, which established the Department of Statistics in the office of the Governor General of the Federation.

Thus, many researchers, educationalists, businessmen and government agencies at the national, state or local level rely on data to answer fundamental questions pertaining to their operations and programs. In fact, there can be no meaningful science without statistics.

In-Text Question

Statistics could also be defined as:

a. A structural science

b. Codes that help programmers in programming

c. A branch of applied mathematics where randomness and uncertainty are modelled by probability theory

d. None of the above

In-Text Answer

c. A branch of applied mathematics where randomness and uncertainty are modelled by probability theory

1.2 Branches of Statistics

The science of data, “statistics”, can be divided into three broad parts which are not mutually exclusive, viz.: descriptive statistics, statistical methods and statistical inference.

Descriptive Statistics

It is the act of summarizing and giving a descriptive account of numerical information in the form of reports, charts and diagrams. The goal of descriptive statistics is to gain information from collected data. It begins with the collection of data by either counting or measurement in an inquiry.

It involves the summary of specific aspects of the data, such as the average value and a measure of spread. Suitable graphs, diagrams and charts are then used to gain understanding and a clear interpretation of the phenomenon under investigation, keeping firmly in mind where the data comes from.
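As a rough illustration of summarizing an average value and a measure of spread, the short Python sketch below uses the standard `statistics` module. The data set and the variable names are hypothetical and were chosen only for this example; they do not come from the course material.

```python
import statistics

# Hypothetical data: units sold per day over one week (illustrative only).
daily_sales = [12, 15, 11, 18, 14, 20, 15]

# Average value: the arithmetic mean of the observations.
mean_sales = statistics.mean(daily_sales)

# Two simple measures of spread: the range and the standard deviation.
# pstdev treats the seven values as the whole population of interest.
data_range = max(daily_sales) - min(daily_sales)
std_dev = statistics.pstdev(daily_sales)

print(f"Mean: {mean_sales}")
print(f"Range: {data_range}")
print(f"Standard deviation: {std_dev:.2f}")
```

The mean locates the centre of the week's sales, while the range and standard deviation indicate how far individual days stray from it.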

Statistical Method

This is a device for classifying data and making clear the relationships between the variables under consideration. This can be achieved by using statistical tools and formulae. It ranges from the computation of simple summaries of data (mean, median, mode, etc.) to the complex modelling used in policy formulation.

Statistical Inference

This is the act of making a deductive statement about a population from the quantities computed

from its representative sample. It is a process of making inference or generalizing about the

population under certain conditions and assumptions. Statistical inference involves the processes

of estimation of parameters and hypothesis testing.

1.3 Uses of Statistics

Statistics is used in many of our day-to-day activities, including the following:

1. Planning and decision making by individuals, states, business organizations, research institutions, etc.

2. Forecasting and prediction for the future based on a good model provided that its basic

assumptions are not violated.

3. Project implementation and control; this is especially useful in ongoing projects such as

network analysis, construction of roads and bridges, and implementation of government

programs and policies

4. Monitoring and evaluation of plans, projects, programmes and policy initiatives. It also assists in the monitoring and evaluation of the activities of government programmes.


1.4 Terms and Concepts in Statistics

There are many terms and concepts in statistics that we need to learn in order to understand the subject. The terms and concepts discussed below are used daily in the field of statistics.

1.4.1 Population, Sample and Variate

In the earlier part of this study session, we explained that the main aim of statistics is to gain

information about a population. We may want to know what the population is:

Population: A population is the collection of items under investigation. It may be finite (countable) or infinite (uncountable).

Parameter: A parameter is a summary quantity computed from a population, e.g. the population mean ($\mu$), the population variance ($\sigma^2$), etc.

Sample: A sample is a representative part of a population observed for the purpose of making a

scientific statement or taking decisions about the population. A good sample must be randomly

selected and adequate.

A sample can be random or purposive. A random sample may be obtained by tossing a coin,

throwing a die, drawing discs from a container or using a table of random numbers. A purposive

judgmental sample is obtained when members of a population are selected by discretion or

personal judgment

Statistic: A statistic is a summary quantity calculated from a sample for the purpose of drawing conclusions about the related population, e.g. the sample mean ($\bar{x}$), the sample variance ($s^2$), etc.

The characteristics of units in the population can be measured or counted (quantitative), e.g. weight, height, age, number of cars. They can also be observed (qualitative, or attributes), e.g. color of eyes, beauty, complexion, etc.

Variate: A variate (variable) is any quantity or attribute whose value varies from one unit of observation to another. A quantitative variate may be discrete or continuous.

Continuous Variate: A continuous variate is a variate which may take all values within a given

range. Its values are obtained by measurements e.g. height, volume, time, examination score etc.


Discrete Random Variate: A discrete random variate is one whose value changes by steps. Its

value may be obtained by counting. It normally takes integer values e.g. number of cars, number

of chairs.

1.4.2 What is Data?

Having defined statistics as the science of data, it is necessary at this juncture to ask ourselves,

the pertinent question: What is data?

Data: Data can be described as a mass of unprocessed information obtained from the measurement or counting of a characteristic or phenomenon. In their raw form, data are usually massive and disorderly. They become meaningful only when they have been reduced to some kind of order by means of tables or diagrams.

Statistical data: These are data obtained through objective measurement or enumeration of

characteristics using the state of the art equipment that is precise and unbiased. Such data when

subjected to statistical analysis produce results with high precision.

Sources of Statistical Data

Statistical data can be obtained from

1. Census - Complete enumeration of all the unit of the population

2. Surveys - the study of representative part of a population.

3. Experimentation: Observation from experiments carried out in laboratories and research

centres.

Types of Data

Data can be categorized as internal or external data.

Internal Data

When data is collected from within the organization and used in the organization concerned, it is

called internal data. Examples are data from accounts and internal records of an establishment.


External Data

If data is collected from outside the organization, it is called external data. Examples are data

from journals not published by the organization itself. There are two major sources of statistical

data: the internal source and the external source.

Primary Data

These are data generated first hand, i.e. data obtained directly from respondents by personal interview, measurement or observation.

Secondary Data

These are data obtained from publication, newspapers, magazines and annual reports. They are

usually summarized data used for a purpose other than the intended one.

Summary

In Study Session 1, you have learnt that:

1. The study of statistics is essential for sound reasoning, precise judgment and objective

decision in the face of up-to-date accurate and reliable data.

2. Statistics can be defined as the science of collecting, organizing and interpreting numerical facts, which we call data.

3. The branches of statistics are descriptive statistics, statistical methods and statistical inference.

4. Statistics could be used for a lot of our day to day activities

5. A population is the collection of items under investigation

6. A parameter is a summary / quantity computed from a population

7. A variate (variables) is any quantity or attributes whose value varies from one unit of

observation to another

8. Data can be described as a mass of unprocessed information obtained from the measurement or counting of a characteristic or phenomenon.

9. Data can be categorized as internal or external data.


Self-Assessment Questions (SAQs) for Study Session 1

Now that you have completed this study session, you can assess how well you have achieved its Learning Outcomes by answering the following questions. Write your answers in your Study Diary and discuss them with your Tutor at the next Study Support Meeting. You can check your answers with the Notes on the Self-Assessment Questions at the end of this Module.

SAQ 1.1 (Tests Learning Outcomes 1.1)

What is the meaning of Statistics?

SAQ 1.2

List the branches of Statistics.

SAQ 1.3

Mention three uses of statistics.

SAQ 1.4

1. Define population.

2. What are the sources of statistical data?

Notes on SAQ

SAQ 1.1

Statistics can simply be defined as the “science of data”. It is the science of collecting, organizing and interpreting numerical facts, which we call data.

SAQ 1.2

Descriptive statistics, statistical methods and statistical inference

SAQ 1.3

I. Planning and decision making by individuals, states, business organizations, research institutions, etc.

II. Forecasting and prediction for the future based on a good model, provided that its basic assumptions are not violated.

III. Monitoring and evaluation of the activities of government programs.

SAQ 1.4

1. A population is the collection of items under investigation

2.

i. Census - Complete enumeration of all the unit of the population

ii. Surveys - the study of representative part of a population.

iii. Experimentation: Observation from experiments carried out in laboratories and research

centres.


Study Session 2 Presentation of Data

Introduction

The aim of this study session is to introduce the various methods of presenting statistical data. Presentation of data in tables, charts and diagrams facilitates understanding of the important features of the data.


2.1 Explain the various ways of presenting a mass of data;

2.2 Construct a frequency table;

2.3 Explain and Carry out simple descriptive analysis of data in tables and diagrams.

2.1 Ways of Presenting a Mass of Data

Numerical information (data) about the characteristics of a variable, when collected, is often massive and complex. More often than not, it is necessary to present data in tables, charts and diagrams in order to have a clear understanding of the data and to illustrate the relationship existing between the variables being examined.

We shall discuss the frequency table, cumulative frequency table, stem plot, box plot and histogram, assuming that we are already familiar with other graphs such as the pie chart, frequency curve, frequency polygon, etc.

In-Text Question

Why is it necessary to present data in tables, charts and diagrams?

a. To have a clear understanding of the data and illustrate the relationship between variables

b. To break information into pieces

c. Allow a blind man understand the data

d. To have a clear understanding of the data


In-Text Answer

a.) To have a clear understanding of the data and illustrate the relationship between variables

2.2 Frequency Table

The first step in examining a set of data for a single quantitative variable intelligently is to construct a frequency table. This is a tabular arrangement of data into various classes together with their corresponding frequencies.

Procedure

Given a set of observations x1, x2, …, xN for a single variable:

1. Find the range (R), i.e. the difference between the largest and smallest values of the data.

2. Determine the number of classes (K), depending on the size of the data.

3. Find the class interval (C), i.e. the range divided by the number of classes.

4. Tally, i.e. assign the values to classes.

5. Find the class frequencies.

Note: With the advent of computers, all these steps can be accomplished easily.

Example 2.1: The following are the scores of 40 students in Mathematics test:

50, 08, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,

38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,

57, 67, 53, 70, 69, 63, 65, 66, 22, 27

Construct a frequency table for the above data.

Solution

Range = 98 – 08 = 90

No. of classes = 10

\[ \text{Class interval} = \frac{\text{Range}}{\text{No. of classes}} = \frac{90}{10} = 9 \]

Working Table

Table 2.1

Class       Tally          Frequency
1 – 10      I              1
11 – 20     II             2
21 – 30     IIII           4
31 – 40     IIIII          5
41 – 50     IIIII I        6
51 – 60     IIIII IIII     9
61 – 70     IIIII II       7
71 – 80     III            3
81 – 90     II             2
91 – 100    I              1

Frequency Table

Table 2.2

Score           Frequency
01 up to 10     1
11 up to 20     2
21 up to 30     4
31 up to 40     5
41 up to 50     6
51 up to 60     9
61 up to 70     7
71 up to 80     3
81 up to 90     2
91 up to 100    1
Total           40
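The tallying steps above can also be carried out by computer. The following is a minimal Python sketch, not a prescribed method: the function name `frequency_table` and its parameters are illustrative assumptions, and the classes are taken to be the ten inclusive intervals 1 – 10, 11 – 20, …, 91 – 100 used in Table 2.2.

```python
# Scores of the 40 students from Example 2.1.
scores = [
    50, 8, 14, 20, 46, 23, 26, 47, 32, 31, 48, 40, 49, 40, 41,
    38, 51, 86, 55, 82, 56, 72, 60, 98, 59, 76, 55, 80, 52, 63,
    57, 67, 53, 70, 69, 63, 65, 66, 22, 27,
]


def frequency_table(values, first_lower, width, n_classes):
    """Return (lower, upper, frequency) for each inclusive class."""
    table = []
    for i in range(n_classes):
        lower = first_lower + i * width
        upper = lower + width - 1
        # Tally: count the observations that fall inside this class.
        count = sum(1 for v in values if lower <= v <= upper)
        table.append((lower, upper, count))
    return table


table = frequency_table(scores, first_lower=1, width=10, n_classes=10)
for lower, upper, count in table:
    print(f"{lower:>3} up to {upper:<3}  {count}")
print("Total", sum(count for _, _, count in table))
```

The printed frequencies can be compared line by line with Table 2.2.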


2.2.1 Cumulative Curve (OGIVE)

The graph of the cumulative frequency of a single variable is called an OGIVE. It is drawn by plotting the cumulative frequency against the upper class boundary of each class interval. From the OGIVE it is possible to obtain the median, the quartiles and the inter-quartile range (IQR).

Example 2.2: Using the data in Example 2.1, construct the cumulative frequency curve.

Solution

Table 2.3

Score           Frequency    Cumulative Frequency
01 up to 10     1            1
11 up to 20     2            3
21 up to 30     4            7
31 up to 40     5            12
41 up to 50     6            18
51 up to 60     9            27
61 up to 70     7            34
71 up to 80     3            37
81 up to 90     2            39
91 up to 100    1            40
Total           40
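As an illustration, the cumulative frequencies and the points of the OGIVE can be generated with a running sum. This is a sketch only: the variable names are assumptions, and the upper class boundaries are taken as 10.5, 20.5, …, 100.5 for the classes in the table.

```python
from itertools import accumulate

# Class frequencies from the frequency table of Example 2.1.
frequencies = [1, 2, 4, 5, 6, 9, 7, 3, 2, 1]

# Assumed upper class boundaries of the classes 1-10, 11-20, ..., 91-100.
upper_boundaries = [10.5 + 10 * i for i in range(len(frequencies))]

# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))

# Each OGIVE point pairs an upper class boundary with its cumulative frequency.
ogive_points = list(zip(upper_boundaries, cumulative))
for boundary, cf in ogive_points:
    print(boundary, cf)
```

Plotting these pairs and joining them with a smooth curve gives the cumulative frequency curve.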

Diagram 2.1: OGIVE (cumulative frequency plotted against score)

Example 2.3

The following data represent the ages (in years) of people living in a housing estate in Ibadan.

30, 31 17 16 6 2 8 43 18 18 32 33

9 18 33 19 21 13 14 13 14 6 45 52 61

23 26 14 15 14 15 27 19 36 37 11 12

11 12 20 39 40 20 63 69 64 29 28 27

15

Present the above data in a frequency table using a suitable class interval.

Solution

Maximum value = 69

Minimum value = 2

Range = 69 – 2 = 67

A choice of 10 classes will result in some classes with zero frequencies while the choice of 6

classes is more reasonable with at least one item in each class. In practice, it is easy to determine

the number of classes for a given set of data. We are using K = 6 as our number of classes.

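\[ \text{Class interval} = \frac{R}{K} = \frac{67}{6} \approx 11.17 \]

For convenience, a class width of 10 is used, which gives the seven classes in Table 2.4.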

Table 2.4

(1)        (2)                        (3)          (4)              (5)               (6)
Class      Tally                      Frequency    Relative         Cumulative        CRF
                                      (F)          Frequency (RF)   Frequency (CF)
1 – 10     IIIII                      5            0.10             5                 0.10
11 – 20    IIIII IIIII IIIII IIIII    20           0.40             25                0.50
21 – 30    IIIII IIII                 9            0.18             34                0.68
31 – 40    IIIII III                  8            0.16             42                0.84
41 – 50    III                        3            0.06             45                0.90
51 – 60    I                          1            0.02             46                0.92
61 – 70    IIII                       4            0.08             50                1.00
Total                                 50           1.00

It is pertinent to define the columns in the frequency table for better understanding.

Class interval is a sub-division of the total range of values which a (continuous) variable

may take.

Class frequency is the number of observations of the variate which falls in a given

interval (column 3)

Relative frequency for a class is the actual frequency of the class divided by total

frequency. (Column 4). Sometimes, it is better to work with relative frequencies

[especially in the calculation of probability values].

Cumulative frequency of a class is the sum of all the frequencies before the class up to

and including the frequency of that class (column 5).

Relative Cumulative Frequency: When the cumulative frequency of a class is expressed as a proportion of the total frequency, what we have is called the relative cumulative frequency (column 6). It is sometimes called the distribution function.

Box 2.1: Observations from the Table

The data have been summarized and we now have a clearer picture of the distribution of the ages

of inhabitants of the Estate.

Exercise

Now answer the following questions from the table.

i. How many residents are aged between 11 and 30 years?

ii. How many residents are aged above 30 years?

iii. What is the probability that a person selected at random from the Estate will be less than 31 years old?

Answers

(i) 29 (ii) 16 (iii) 0.68
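The derived columns of Table 2.4 and the three answers above can be checked with a short computation. The sketch below starts from the class frequencies in column 3 of the table; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Class frequencies (column 3 of Table 2.4) for classes 1-10, ..., 61-70.
freq = [5, 20, 9, 8, 3, 1, 4]
total = sum(freq)

relative = [f / total for f in freq]              # column 4 (RF)
cumulative = list(accumulate(freq))               # column 5 (CF)
cum_relative = [c / total for c in cumulative]    # column 6 (CRF)

# (i) residents aged 11 to 30: classes 11-20 and 21-30.
aged_11_to_30 = freq[1] + freq[2]
# (ii) residents aged above 30: everyone beyond the first three classes.
aged_above_30 = total - cumulative[2]
# (iii) probability of being less than 31: CRF at the class 21-30.
p_under_31 = cum_relative[2]

print(aged_11_to_30, aged_above_30, round(p_under_31, 2))
```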

2.3 Simple descriptive analysis of data in tables and diagrams

Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph. Tables, charts and graphs should, ideally, be self-explanatory. The reader should be able to understand them without detailed reference to the text, because users may well pick things up from the tables or graphs without reading the whole text. Below are some ways of presenting and analysing data.

2.3.1 Histogram

Histogram is a chart used for presenting the frequency distribution of the values of a variable.

(Assuming the variate is a continuous type).

A histogram is a group of rectangles drawn above each class interval such that the area of each

rectangle is proportional to frequency of the observations falling in the corresponding class

interval. The chart is constructed by plotting the values of the variable along the X-axis and the

frequencies along the Y-axis.

Vertical lines are drawn at the lower and upper class boundary of each class up to the

frequencies. Horizontal lines representing the width of each class interval are then drawn on top

of each vertical line.

In a situation where the class intervals are not the same, the height must be adjusted so that the

area represents the frequency.

Draw the histogram of the data in Example 2.3 above.

Diagram 2.2: Histogram of Ages (frequency plotted against ages in years)

2.3.2 Stem plots (Stem-and-Leaf Plots)

In statistics, a stemplot (or stem-and-leaf plot) is a graphical display of quantitative data that is similar to a histogram and is useful in visualizing the shape of a distribution. It was invented by J. W. Tukey (1915 – 2000). Stemplots contain more information than histograms because, unlike in a histogram where bars are used, the individual data values are displayed in a table-like format, in order of increasing magnitude. A basic stemplot contains two columns separated by a vertical line. The left column contains the stems and the right column contains the leaves.

Constructing a Stemplot

To construct a stemplot, take note of the following steps:

I. The observations must first be sorted in ascending order.

II. It must be determined what the stems will represent and what the leaves will represent.

Typically, the leaf contains the last digit of the number and the stem contains all of the

other digits (in the case of very large or very small numbers, the data values may be

rounded to a particular place value (such as the hundreds place) that will be used for the

leaves. The remaining digits to the left of the rounded place value are used as the stems).

The stemplot is drawn with two columns separated by a vertical line. The stems are listed to the

left of the vertical line. It is important that each stem is listed only once and that no numbers are

skipped, even if it means that some stems will have no leaves. The leaves are listed in increasing

order in a row to the right of each stem.

Example 2.4

Present the following data in a stem-and-leaf plot

68 66 72 75 76 106 54 57 56 63 59 66 68

64 88 84 81

Solution

Table 2.5

Stem | Leaf
   5 | 4 6 7 9
   6 | 3 4 6 6 8 8
   7 | 2 5 6
   8 | 1 4 8
   9 |
  10 | 6
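The construction steps described earlier can be sketched in Python as follows. This is an illustration under stated assumptions rather than a standard routine: the function name `stemplot` is hypothetical, the data are taken to be non-negative whole numbers, the leaf is the last digit and the stem is made up of the remaining digits.

```python
from collections import defaultdict

# Data from Example 2.4.
data = [68, 66, 72, 75, 76, 106, 54, 57, 56, 63, 59, 66, 68, 64, 88, 84, 81]


def stemplot(values):
    """Return the rows of a stem-and-leaf plot as strings."""
    leaves = defaultdict(list)
    for v in sorted(values):              # step I: sort in ascending order
        stem, leaf = divmod(v, 10)        # step II: split off the last digit
        leaves[stem].append(leaf)
    rows = []
    # List every stem once, including stems that have no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}".rstrip())
    return rows


rows = stemplot(data)
print("\n".join(rows))
```

Note that stem 9 is printed even though it has no leaves, as required by the rule that no stem may be skipped.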

Example 2.5

Given the weight of 20 rams at the end of two weeks feeding on a special diet as follows:

46, 59, 35, 41, 46, 21, 24, 33, 40, 45, 49, 53, 48, 54, 61, 36, 70, 58, 47, 12

Make a stem plot for these data

Solution

The stem plot is given below

1 | 2
2 | 1 4
3 | 3 5 6
4 | 0 1 5 6 6 7 8 9
5 | 3 4 8 9
6 | 1
7 | 0

Important Features

i. It is easy to locate the centre of the distribution, i.e. median = 46.

ii. It is also possible to examine the shape of the distribution. Turn the stem plot on its side so that the larger observations fall on the right (e.g. the above distribution is symmetric). It is likewise possible to read off the median, the first quartile (q1), the third quartile (q3) and the inter-quartile range (IQR).

iii. It is also possible to look for deviations from the overall shape of the data, e.g. outliers.

2.3.3 Back-to-back stemplot

Back-to-back stemplots are used to compare two distributions side-by-side. This type of double

stemplot contains three columns, each separated by a vertical line. The center column contains

the stems. The first and third columns, each contain the leaves of a different distribution. The

numbers for the leaves of the distribution in the leftmost column are aligned to the right and are

listed in increasing order from right to left. Here is an example of a back-to-back stemplot

comparing the distribution of the weight of cow to another distribution weight of ram.


Example 2.6

Suppose 20 cows were fed with the same special feed as in Example 2.5; the back-to-back stem plot is shown below:

Table 2.6

Weight Of Cow Weight Of Ram

0

1

2

31

542

7655421

42

1

1

2

3

4

5

6

7

8

9

2

14

356

01566789

3489

1

1

Observations

i. Weight of Ram is symmetric

ii. Weight of Cow is skewed to the right

iii. There is an outlier in the weight of cow (i.e. 91 kg.)

NOTE: Stem plot works well for small set of data especially when the observations are all

greater than zero.

2.3.4 Box Plot (Box and Whiskers Plot)

This is a chart that looks like a box when drawn. It is most useful when comparing two or more sets of sample data. A box plot shows the center and spread of the data, gives a clear picture of the symmetry of a data set and shows outliers very clearly. It is constructed by first calculating the median and the 1st and 3rd quartiles.

In-Text Question

The box plot chart is most useful when comparing two or more sets of sample data. True or False

In-Text Answer

True

In a box plot, the ends of the box are at the quartiles, so that the length of the box is the inter-quartile range. The median is marked by a line within the box. The ‘whiskers’ are the two lines outside the box that extend to the smallest and largest observations. Outliers are shown as dots outside the whiskers.
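The five values needed to draw a box plot (smallest observation, first quartile, median, third quartile, largest observation) can be computed as in the sketch below. It uses the ram weights of Example 2.5 and assumes one common convention, in which each quartile is the median of the lower or upper half of the ordered data; other quartile conventions exist and give slightly different values. The function name `five_number_summary` is illustrative.

```python
from statistics import median

# Weights of the 20 rams from Example 2.5.
rams = [46, 59, 35, 41, 46, 21, 24, 33, 40, 45,
        49, 53, 48, 54, 61, 36, 70, 58, 47, 12]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For odd n the middle observation is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


smallest, q1, med, q3, largest = five_number_summary(rams)
print(smallest, q1, med, q3, largest)
print("IQR =", q3 - q1)   # length of the box
```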

Example 2.7

Consider the data in example 2.6 above. Construct the box plots.

Solution

Diagram 2.3: Boxplots of Weight of Ram (Fig. 1.2a) and Weight of Cow (Fig. 1.2b)

In a box plot, the center, the inter-quartile range, the spread are immediately apparent. However,

the box plot is generally inferior to the stem plot or histogram in that it shows only the center and

the partition values; it tells nothing about the shape of the distribution and other values in the

data set.

A stem plot (for large data set) provides a clearer display of a single distribution especially, when

accompanied by the median and quartile as numerical sign post.


1. It is necessary to present data in tables, charts and diagrams in order to have a clear

understanding of the data

2. The first step in examining intelligently a set of data for a single quantitative variable is

by constructing a frequency table

3. Frequency table is a tabular arrangement of data into various classes together with their

corresponding frequencies

4. The graph of the cumulative frequency of a single variable is called an OGIVE

5. Data can be presented in the text, in a table, or pictorially as a chart, diagram or graph.

6. Histogram is a chart used for presenting the frequency distribution of the values of a

variable

7. Stemplot is a graphical display of quantitative data that is useful in visualizing the shape

of distribution

8. Back-to-back stemplots are used to compare two distributions side-by-side.

9. Box Plot is useful when comparing two or more sets of sample data.







SAQ 2.1

Why is it necessary to present data in tables, charts and diagrams?

SAQ 2.2

1. The first step in examining a set of data for a single quantitative variable is by constructing a frequency table. True or False

2. What is a Frequency Table?

SAQ 2.3

Mention three ways of analysing data.

Notes on SAQ

SAQ 2.1

They give a clear understanding of the data and illustrate the relationship existing between the variables being examined.

SAQ 2.2

1. True

2. This is a tabular arrangement of data into various classes together with their corresponding frequencies.

SAQ 2.3

i. Histogram

ii. Stemplot

iii. Boxplot


Study Session 3 Measure of the Centre of a Set of Observations

Introduction

The primary aim of any investigator is to obtain a simple summary value (average) that can be used to describe all the observations in a set. Thus an average is a single value that can represent all the observations in a distribution.

The most representative value is one that is at the center of the distribution. Such values are otherwise referred to as measures of location or measures of central tendency.


3.1 Discuss the term measures of central tendency;

3.2 Explain and calculate the mean

3.3 Explain and calculate Median

3.4 Explain and calculate mode from a grouped data

3.5 Define and calculate the partition values.

3.6 Discuss Other Measures of Central Tendency


3.1 Measures of Central Tendency

These are measures of the center of a distribution. They are single values that give a description of the data. They are also referred to as measures of location. Some of them are the arithmetic mean, mode, median, geometric mean and harmonic mean. We shall discuss them one after the other. They are otherwise known as descriptive statistics.

In-Text Question

Measures of central tendency are multiple values that give a description of the data. True or

False

In-Text Answer

False

However, a descriptive statistic should possess the following desirable properties.

A descriptive statistic should

1. Be single-valued

2. Be algebraically tractable

3. Should consider every observed value

3.2 Mean

The average (arithmetic mean) of a set of observations is the sum of the observations divided by the number of observations. Given n observations denoted by x1, x2, x3, …, xn, the mean is defined by

\[ \bar{X} = \frac{1}{n}\,(x_1 + x_2 + x_3 + \dots + x_n) \]

Or, in a compact notation, it can be written as

\[ \bar{X} = \frac{1}{n}\sum x_i \]

The above formula is for the simple series and is most useful when few (n < 20) observations are

considered.

Example 3.1

Here are the ages of 15 students in a class: 16, 18, 20, 21, 22, 19, 17, 18, 19, 17, 17, 18, 17, 17, 20. Calculate the mean.

Solution

The average age of the students is

\[ \bar{X} = \frac{1}{n}\sum x_i = \frac{1}{15}\,[16 + 18 + \dots + 17 + 20] = \frac{276}{15} = 18.4 \]

i.e. approximately 18 years 5 months.

3.2.1 Calculation of Mean from Grouped Data

We have seen in Study Session 2 that a large set of observations can be summarized into a frequency table, which elicits some information about the data. This makes the computation of the mean from grouped data very easy.

The mean of a set of N observations of a discrete (continuous) variate, grouped so that the value Xi (Xi is the centre of the interval), i = 1, 2, …, K, occurs with frequency fi, is

\[ \bar{X} = \frac{1}{N}\sum_{i=1}^{K} f_i X_i ; \qquad N = \sum_{i=1}^{K} f_i \]

In a grouped frequency table for a continuous variate, the Xi’s are the centers of the intervals (i.e. the average of the upper and lower class boundaries of a class), otherwise known as class marks.

Example 3.2

Given the frequency distribution of a random variable X as follows:

Table 3.1

Group      Frequency
1 – 5      2
6 – 10     4
11 – 15    8
16 – 20    5
21 – 25    3
26 – 30    1
Total      23

Find the mean of the distribution.

Solution

Find the class mark of a particular class by adding the lower and upper class boundaries of the

class and divide by 2.

Table 3.2

Group      Class Mark (X)    f     fx
1 – 5      3                 2     6
6 – 10     8                 4     32
11 – 15    13                8     104
16 – 20    18                5     90
21 – 25    23                3     69
26 – 30    28                1     28
Total                        23    329

\[ N = \sum f = 23, \qquad \bar{X} = \frac{\sum fx}{N} = \frac{329}{23} = 14.304 \]
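The same calculation can be written as a short Python sketch. The names below are illustrative assumptions; the class mark is computed as the midpoint of each group's limits, which equals the midpoint of its class boundaries.

```python
# Groups and frequencies from Example 3.2.
groups = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]
freq = [2, 4, 8, 5, 3, 1]

# Class mark X: midpoint of each group.
class_marks = [(lower + upper) / 2 for lower, upper in groups]

# Column fx, then mean = sum(fx) / sum(f).
fx = [f * x for f, x in zip(freq, class_marks)]
n = sum(freq)
mean = sum(fx) / n

print(class_marks)
print(sum(fx), n, round(mean, 3))
```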

Use of Assumed Mean

Sometimes large values of the variable are involved in the calculation of the mean. In order to make the computation easier, we may assume one of the values as the mean. If the assumed mean is A, then the revised formula for the mean is

\[ \bar{X} = A + \frac{\sum d}{n}, \quad \text{where } d = X - A \]

If a constant factor C is used, then

\[ \bar{X} = A + \left(\frac{\sum U}{n}\right) C \]

For grouped data,

\[ \bar{X} = A + \left(\frac{\sum fU}{\sum f}\right) C \]

where

\[ U = \frac{X - A}{C} \]

Example 3.3

The exact pension allowance paid (in Naira) to 25 workers of a company is given in the table below.

Table 3.3

Pension in N    No. of Persons (f)
25              7
30              5
35              6
40              4
45              3

Calculate the mean using an assumed mean of 35 and 5 as the common factor.

Solution:

Table 3.4

Pension in N    No. of Persons (f)    U = (X – A)/C    fU
25              7                     –2               –14
30              5                     –1               –5
35              6                     0                0
40              4                     1                4
45              3                     2                6
Total           25                                     –9

Let A = 35, C = 5

\[ \bar{X} = 35 + \left(\frac{-9}{25}\right) \times 5 = 33.20 \]
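As a check, the assumed-mean method should give the same answer as the direct formula. The sketch below illustrates this with the pension data; the function name `assumed_mean` and its parameter names are hypothetical.

```python
# Pension values (in Naira) and number of persons from Example 3.3.
values = [25, 30, 35, 40, 45]
freq = [7, 5, 6, 4, 3]


def assumed_mean(values, freq, a, c):
    """Mean by the assumed-mean method: A + C * sum(fU) / sum(f)."""
    coded = [(x - a) / c for x in values]               # U = (X - A) / C
    total_fu = sum(f * u for f, u in zip(freq, coded))  # sum of fU
    return a + c * total_fu / sum(freq)


coded_mean = assumed_mean(values, freq, a=35, c=5)

# Direct calculation for comparison: sum(fX) / sum(f).
direct_mean = sum(f * x for f, x in zip(freq, values)) / sum(freq)

print(round(coded_mean, 2), round(direct_mean, 2))
```

Both routes give the same mean; the choice of A and C only changes how easy the arithmetic is by hand.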

Example 3.4

Consider the data in example 2.3, using a suitable assumed mean and constant factor, compute

the mean.

Table 3.5

Group      f     X     X – A    U = (X – A)/C    fU
1 – 10     5     5     –10      –1               –5
11 – 20    20    15    0        0                0
21 – 30    9     25    10       1                9
31 – 40    8     35    20       2                16
41 – 50    3     45    30       3                9
51 – 60    1     55    40       4                4
61 – 70    4     65    50       5                20
Total      50                                    53

A = 15, C = 10

\[ \bar{X} = 15 + \left(\frac{53}{50}\right) \times 10 = 15 + 10.6 = 25.6 \]

Note:

It is always easier to select the class mark with the largest frequency as the assumed mean.

Merits

The mean is an average that considers all the observations in the data set. It is simple and easy to

compute and it is the most widely used average.

Demerits

Its value is greatly affected by extremely large or extremely small observations.

3.3 Median

The median is an average of position. It is the value of the variable that divides a distribution into two equal parts when the values are arranged in order of magnitude.

To compute the median of a distribution:

i. Arrange all observations in order of size, from smallest to largest.

ii. If n (the number of observations) is odd, the median $\tilde{X}$ is the center observation in the ordered list. The location of the median is

\[ \tilde{X} = X_{\left(\frac{n+1}{2}\right)\text{th}} \ \text{item.} \]

iii. If n is even, the median $\tilde{X}$ is the average of the two middle observations in the ordered list, i.e.

\[ \tilde{X} = \frac{X_{\left(\frac{n}{2}\right)\text{th}} + X_{\left(\frac{n}{2}+1\right)\text{th}}}{2} \]

Example 3.5 (n is even)

The values of a random variable X are given as 11, 10, 13, 9, 13, 14, 16, and 20. Find the

median.

Solution

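In an array: 9, 10, 11, 13, 13, 14, 16, 20. Since n = 8 is even,

\[ \tilde{X} = \frac{X_{\left(\frac{n}{2}\right)\text{th}} + X_{\left(\frac{n}{2}+1\right)\text{th}}}{2} = \frac{X_{4\text{th}} + X_{5\text{th}}}{2} = \frac{13 + 13}{2} = 13 \]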

Example 3.6 (n is odd)

The values of a random variable X are given as 9, 7, 5, 20, 2, 12 and 1. Find the median.

In an array: 1, 2, 5, 7, 9, 12, 20

n = 7 is odd, therefore the median is

\[ \tilde{X} = X_{\left(\frac{7+1}{2}\right)\text{th}} = X_{4\text{th}} = 7 \]

Note: The occurrence of 7 in the above example is just a coincidence it could have been any

other value in the middle of the data set.
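The two cases (n odd and n even) can be combined in one small function. The sketch below is illustrative and applies it to the data of Examples 3.5 and 3.6; the function name `median` is chosen here for convenience.

```python
def median(values):
    """Median of a simple series: middle item, or mean of the two middle items."""
    data = sorted(values)           # arrange in order of size
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]            # the ((n + 1) / 2)th item
    # n even: average of the (n / 2)th and (n / 2 + 1)th items
    return (data[mid - 1] + data[mid]) / 2


print(median([11, 10, 13, 9, 13, 14, 16, 20]))   # Example 3.5
print(median([9, 7, 5, 20, 2, 12, 1]))           # Example 3.6
```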

3.3.1 Calculation of Median from Grouped Data

The formula for calculating the median from grouped data is defined as

\[ \tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cf_b}{f_m}\right) w \]

where Lm = lower class boundary of the median class

fm = frequency of the median class

N = ∑f is the total frequency

Cfb = cumulative frequency before the median class

w = class width.

Example 3.7

The table below shows the length of 100 rods (in inches) produced in a factory

Table 3.6

Length (inches)    Number of rods (f)
1 – 2              1
3 – 4              8
5 – 6              26
7 – 8              38
9 – 10             19
11 – 12            7
13 – 14            1

Calculate the median.

Calculate the median

Solution

The first thing to do is to obtain the cumulative frequency distribution as follows:

Table 3.7

Class      f     Cumulative Frequency (cf)
1 – 2      1     1
3 – 4      8     9
5 – 6      26    35
7 – 8      38    73
9 – 10     19    92
11 – 12    7     99
13 – 14    1     100

i. Determine N/2 = 100/2 = 50; clearly the median value belongs to the class (7 – 8).

ii. The lower class boundary (Lm) of the median class is 6.5.

iii. The frequency of the median class (fm) is 38.

iv. The cumulative frequency before the median class (Cfb) is 35.

v. The class interval (w) is 2, and the median is obtained as

\[ \tilde{X} = 6.5 + \left(\frac{50 - 35}{38}\right) \times 2 = 6.5 + 0.789 = 7.289 \approx 7.29 \ \text{(2 d.p.)} \]

Example 3.8

The following data represent the weight of products manufactured in a factory (in kg).

Table 3.8

Weight       Number of Products
45 – 54      1
55 – 64      3
65 – 74      5
75 – 84      18
85 – 94      33
95 – 104     25
105 – 114    21
115 – 124    12
125 – 134    5
135 – 144    2

Calculate the median.

Solution

First obtain the cumulative frequency distribution as in Example 3.7. The following can then be obtained from the table:

\[ \frac{N}{2} = \frac{125}{2} = 62.5 ; \qquad Cf_b = 60 \]

Lm = 94.5, fm = 25, w = 10 (i.e. 104.5 – 94.5)

\[ \tilde{X} = 94.5 + \left(\frac{62.5 - 60}{25}\right) \times 10 = 94.5 + 1 = 95.5 \]
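The grouped-median procedure of Examples 3.7 and 3.8 can be sketched in Python as below. The function name `grouped_median` and its argument layout (a list of lower class boundaries, a list of frequencies and a common class width) are assumptions made for illustration.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: Lm + ((N/2 - Cfb) / fm) * w."""
    half = sum(frequencies) / 2            # N / 2
    cum_before = 0                         # Cfb for the class being examined
    for boundary, f in zip(lower_boundaries, frequencies):
        # The median class is the first class whose cumulative
        # frequency reaches N / 2.
        if f > 0 and cum_before + f >= half:
            return boundary + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("frequencies must contain at least one positive count")


# Example 3.7: rod lengths, classes 1-2, 3-4, ..., 13-14 (width 2).
rods = [1, 8, 26, 38, 19, 7, 1]
rod_boundaries = [0.5 + 2 * i for i in range(len(rods))]
print(round(grouped_median(rod_boundaries, rods, 2), 3))

# Example 3.8: product weights, classes 45-54, ..., 135-144 (width 10).
weights = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]
weight_boundaries = [44.5 + 10 * i for i in range(len(weights))]
print(grouped_median(weight_boundaries, weights, 10))
```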

Merit

1. It is easy to calculate

2. It is easy to understand by many people.

3. Its value is not affected by extreme values; thus it is a resistant measure of central

tendency.

4. It is a good measure of location in a skewed distribution.

Demerit

1. It does not take into consideration all the values of the variable.

3.4 Mode

The mode is the value of the variable that occurs most often in a set of data. It is the most unstable measure of location. Unlike the arithmetic mean, it is not a unique measure of location. In some cases it may not exist, and when it exists there may be more than one mode (e.g. a bimodal distribution).

Let us see how the mode can be obtained from discrete data.

Example 3.9

Consider the data in example 3.5 the modal value is 13. Since it is the only value that occurred

twice.

Example 3.10

Consider the data in example 3.6.

The mode does not exist.

Example 3.11

From Example 2 the mode is X = 2 i.e. the value with the highest frequency.

3.4.1 Calculation of Mode from Grouped Data

The mode of a grouped distribution can be obtained either

i. from the frequency curve, by finding the value at the highest point, or

ii. by calculation, using the following formula.

From grouped data the mode is defined as

\[ \hat{X} = L_m + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w \]

where Lm = lower class boundary of the modal class

Δ1 = difference between the frequency of the modal class and that of the class before it

Δ2 = difference between the frequency of the modal class and that of the class after it

w = class width.

Example 3.12

From the data in Example 3.7, calculate the mode:

i. The modal class is the one with the highest frequency, i.e. (7 – 8).

ii. Lm = 6.5

Δ1 = 38 – 26 = 12

Δ2 = 38 – 19 = 19

w = 2

$\hat{X} = 6.5 + \left(\frac{12}{12 + 19}\right) \times 2$

= 6.5 + 0.774

= 7.27

Example 3.13

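Also consider the data in Example 3.8. The modal class is 85 – 94, so Lm = 84.5, Δ1 = 33 – 18 = 15, Δ2 = 33 – 25 = 8 and w = 10. The mode is obtained as

$\hat{X} = 84.5 + \left(\frac{15}{15 + 8}\right) \times 10$

= 84.5 + 6.52

= 91.02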
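The grouped-mode formula can be sketched in Python as follows. The function name `grouped_mode` is an assumption for illustration; the sketch assumes equal class widths and a single modal class, and treats a missing neighbouring class as having frequency zero.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode of a grouped distribution: Lm + d1 / (d1 + d2) * w."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    before = freqs[m - 1] if m > 0 else 0                # frequency of the class before
    after = freqs[m + 1] if m < len(freqs) - 1 else 0    # frequency of the class after
    d1 = freqs[m] - before
    d2 = freqs[m] - after
    return lower_boundaries[m] + d1 / (d1 + d2) * width

# Table 3.8 (Example 3.8): weights of products in kg
boundaries = [44.5 + 10 * i for i in range(10)]
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]
mode_weight = grouped_mode(boundaries, freqs, width=10)
print(round(mode_weight, 2))
```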

Merit

1. The mode is easily understood by many people.

2. It is easy to calculate.


Demerit

1. It is not a unique measure of location.

2. It presents a misleading picture of the distribution.

3. It does not take into account all the available data

4. It is the most ideal measure of location when the distribution is highly skewed. e.g.

distribution of wages of workers in a factory.

3.5 Partition Values
We have seen that the median is an average that divides a distribution into two equal parts. In the same way, there are other quantities that divide a set of data (in an array) into a number of equal parts. Such data must first be arranged in order of magnitude. Some of the partition values are the quartiles, deciles and percentiles.

Quartiles divide a set of data in an array into four equal parts.

For simple series

First quartile: $Q_1 = X_{\left(\frac{N}{4}\right)\text{th item}}$

$Q_2 = \tilde{X}$ = median = $X_{\left(\frac{N}{2}\right)\text{th item}}$

Third quartile: $Q_3 = X_{\left(\frac{3N}{4}\right)\text{th item}}$

For grouped data

i. First Quartile

$Q_1 = l_{q_1} + \left(\frac{\frac{N}{4} - Cfb_1}{f_{q_1}}\right) w_1$

where lq1 = lower class boundary of the first quartile class

fq1 = frequency of the q1 class

w1 = width of the q1 class

Cfb1 = cumulative frequency below the q1 class

ii. Third Quartile

$Q_3 = l_{q_3} + \left(\frac{\frac{3N}{4} - Cfb_3}{f_{q_3}}\right) w_3$

where lq3 = lower class boundary of the third quartile class

fq3 = frequency of the q3 class

w3 = width of the q3 class

Cfb3 = cumulative frequency below the q3 class

3.6 Other Measures of Central Tendency
Other measures of central tendency include the midrange, harmonic mean and geometric mean.

Midrange

The value halfway between the smallest and the largest observation in a set of data is called the midrange or range midpoint. It is obtained by adding the smallest and the largest observations together and dividing the result by 2.

Example 3.14

Find the midrange of the following data: 1, 5, 7, 15, 12, 9, 7

Solution

Smallest observation = 1

Largest observation = 15

$\text{Midrange} = \frac{1 + 15}{2} = 8$

Example 3.15

Find the midrange of the following data representing the number of children in 12 households in

Agbowo area of Ibadan.

4, 2, 1, 0, 2, 6, 2, 3, 5, 1,

Solution

$\text{Midrange} = \frac{0 + 6}{2} = 3$

Usefulness

Information on midrange of temperature reading by Meteorologists is used by visitors in the

tourism industry.

Limitations

It takes into account only the extreme observations.

Geometric Mean

Given observations X1, X2, ---, Xn of a random variable X, the geometric mean, denoted by GM, is defined as the nth root of the product of the n observations in the set, i.e.

$GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$

Example 3.16

Find the geometric mean of the data in Example 3.14.

Solution

$GM = \sqrt[7]{1 \times 5 \times 7 \times 15 \times 12 \times 9 \times 7}$

$= \sqrt[7]{396900}$

= 6.31

Example 3.17

Obtain the geometric mean of the data in Example 3.15.

Solution

$GM = \sqrt[10]{4 \times 2 \times 1 \times 0 \times \cdots \times 5 \times 1}$

= 0 (since zero is one of the observations)

Usefulness

Geometric mean is very useful in the computation of rates and indices e.g. Computation of price

indices, etc.

Limitation

1. It cannot be meaningfully calculated when the value zero is one of the observations to be used.

2. It is not a readily used measure of location.

Harmonic Mean

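Given the observations x1, x2, ----, xn of a random variable X, the harmonic mean, denoted by HM, is defined as the reciprocal of the mean of the reciprocals of the observations, i.e.

$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n}\frac{1}{x_i}}$

Example 3.18

Find the harmonic mean of the data in Example 3.14.

Solution

$HM = \frac{1}{\frac{1}{7}\left(\frac{1}{1} + \frac{1}{5} + \frac{1}{7} + \frac{1}{15} + \frac{1}{12} + \frac{1}{9} + \frac{1}{7}\right)}$

= 4.01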

Example 3.19

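Find the harmonic mean of the data in Example 3.15.

Solution

$HM = \frac{1}{\frac{1}{10}\left(\frac{1}{4} + \cdots + \frac{1}{0} + \cdots + \frac{1}{1}\right)}$

= 0 (since 0 is one of the observations; strictly, 1/0 is undefined, which is why the harmonic mean cannot be calculated for such data)

Note: for positive observations that are not all equal, HM < GM < AM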
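The three measures in this section can be computed directly from their definitions. The following is a minimal Python sketch using the data of Example 3.14; the function names are chosen for this illustration only, and the geometric and harmonic means are assumed to be applied to strictly positive data.

```python
import math

def midrange(data):
    """Halfway point between the smallest and largest observations."""
    return (min(data) + max(data)) / 2

def geometric_mean(data):
    """nth root of the product of n positive observations."""
    return math.prod(data) ** (1 / len(data))

def harmonic_mean(data):
    """Reciprocal of the mean of the reciprocals of positive observations."""
    return len(data) / sum(1 / x for x in data)

data = [1, 5, 7, 15, 12, 9, 7]  # Example 3.14
am = sum(data) / len(data)      # arithmetic mean, for comparison
mr, gm, hm = midrange(data), geometric_mean(data), harmonic_mean(data)
print(mr, round(gm, 2), round(hm, 2), round(am, 2))
print(hm < gm < am)             # ordering stated in the Note
```

For data containing a zero, `harmonic_mean` as written raises a `ZeroDivisionError`, which mirrors the limitation stated in the text.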

Usefulness

Harmonic mean is used in the calculation of rates e.g. average speed.


Limitations

1. It is hardly used in practice.

2. It cannot be calculated when zero is one of the observation in the set.

3.6.1 Other Partition Values from Grouped Data

The other partition values that can be calculated from grouped data are the Deciles and the

percentiles.

Deciles are those values that divide a distribution into ten equal parts. They are denoted by Di, i = 1, 2, ---, 9, i.e. D1, D2, D3, …, D9.

For grouped data the second decile (D2) is defined as

$D_2 = L_{D_2} + \left(\frac{\frac{2N}{10} - cfb_{D_2}}{f_{D_2}}\right) w_{D_2}$

where

LD2 = lower class boundary of the second decile class

fD2 = frequency of the second decile class

wD2 = width of the second decile class

cfbD2 = cumulative frequency below the second decile class

Percentiles are those values that divide a distribution into one hundred equal parts. They are denoted by P1, P2, P3, ….., P99. For a grouped distribution the 65th percentile is defined as

$P_{65} = L_{P_{65}} + \left(\frac{\frac{65N}{100} - cfb_{P_{65}}}{f_{P_{65}}}\right) w_{P_{65}}$

where

LP65 = lower class boundary of the 65th percentile class

fP65 = frequency of the 65th percentile class

wP65 = width of the 65th percentile class

cfbP65 = cumulative frequency below the 65th percentile class

Example 3.20

Consider the data in Example 3.8.

Calculate the

i. first quartile (Q1)

ii. third quartile (Q3)

iii. 4th decile (D4)

iv. 45th percentile (P45)

Solution

From the table in Example 3.8 (N = 125):

Table 3.9

Class          f      cf
45 – 54        1      1
55 – 64        3      4
65 – 74        5      9
75 – 84        18     27
85 – 94        33     60
95 – 104       25     85
105 – 114      21     106
115 – 124      12     118
125 – 134      5      123
135 – 144      2      125

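i. N/4 = 31.25, which falls in the class 85 – 94.

$Q_1 = L_{q_1} + \left(\frac{\frac{N}{4} - cfb_{q_1}}{f_{q_1}}\right) w = 84.5 + \left(\frac{31.25 - 27}{33}\right) \times 10$

= 84.5 + 1.29

= 85.79

ii. 3N/4 = 93.75, which falls in the class 105 – 114.

$Q_3 = L_{q_3} + \left(\frac{\frac{3N}{4} - cfb_{q_3}}{f_{q_3}}\right) w = 104.5 + \left(\frac{93.75 - 85}{21}\right) \times 10$

= 104.5 + 4.17

= 108.67

iii. 4N/10 = 50, which falls in the class 85 – 94.

$D_4 = L_{D_4} + \left(\frac{\frac{4N}{10} - cfb_{D_4}}{f_{D_4}}\right) w = 84.5 + \left(\frac{50 - 27}{33}\right) \times 10$

= 84.5 + 6.97

= 91.47

iv. 45N/100 = 56.25, which falls in the class 85 – 94.

$P_{45} = L_{P_{45}} + \left(\frac{\frac{45N}{100} - cfb_{P_{45}}}{f_{P_{45}}}\right) w = 84.5 + \left(\frac{56.25 - 27}{33}\right) \times 10$

= 84.5 + 8.86

= 93.36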
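Quartiles, deciles and percentiles of grouped data all use the same interpolation, differing only in the fraction p of N that is located (1/4, 3/4, i/10 or i/100). A minimal Python sketch follows; the function name `grouped_quantile` and its arguments are assumptions for illustration, and equal class widths are assumed.

```python
from itertools import accumulate

def grouped_quantile(p, lower_boundaries, freqs, width):
    """Partition value at fraction p (0 < p < 1) of a grouped distribution."""
    n = sum(freqs)
    target = p * n                                  # position p*N in the ordered data
    cum = list(accumulate(freqs))                   # cumulative frequencies
    k = next(i for i, cf in enumerate(cum) if cf >= target)
    cfb = cum[k - 1] if k > 0 else 0                # cumulative frequency below the class
    return lower_boundaries[k] + (target - cfb) / freqs[k] * width

# Data of Table 3.8 / Table 3.9
boundaries = [44.5 + 10 * i for i in range(10)]
freqs = [1, 3, 5, 18, 33, 25, 21, 12, 5, 2]

q1 = grouped_quantile(1 / 4, boundaries, freqs, 10)      # first quartile
q3 = grouped_quantile(3 / 4, boundaries, freqs, 10)      # third quartile
d4 = grouped_quantile(4 / 10, boundaries, freqs, 10)     # fourth decile
p45 = grouped_quantile(45 / 100, boundaries, freqs, 10)  # 45th percentile
print(round(q1, 2), round(q3, 2), round(d4, 2), round(p45, 2))
```

Because partition values are ordered, a quick sanity check on any hand calculation is that Q1 ≤ D4 ≤ P45 ≤ Q3.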


1. Measures of central tendency are single values that give a description of the data.

2. The arithmetic mean of a set of observations is the sum of the observations divided by the number of observations.

3. The mean is an average that considers all the observations in the data set

4. The median is an average of position.

5. Median is a good measure of location in a skewed distribution.

6. The Mode is the value of the variable that occurs most often in a set of data

7. The mode is not a unique measure of location

8. The partition values are: the quartile, deciles and percentiles.






List four measures of central tendency


1. What is an arithmetic mean?

2. What is the formula for the Calculation of Mean from Grouped Data?



What is the formula for the Calculation of Median From a grouped data?


1. Define Mode

2. Give three demerit of mode


Name the partition values


Mention the usefulness of the midrange and Geometric mean


Arithmetic mean, mode, median, geometric mean

SAQ 3.2

1. The arithmetic mean of a set of observation is the sum of the observation divided by the

number of observation

2. $\bar{X} = \frac{1}{N}\sum_{i=1}^{K} f_i X_i$; $N = \sum_{i=1}^{K} f_i$

SAQ 3.3

$\tilde{X} = L_m + \left(\frac{\frac{N}{2} - Cfb}{f_m}\right) w$

SAQ 3.4

1. The Mode is the value of the variable that occurs most often in a set of data. It is the most

unstable measure of location

2.

i. It is not a unique measure of location,

ii. It presents a misleading picture of the distribution

iii. It does not take into account all the available data

SAQ 3.5

The quartile, deciles and percentiles

SAQ 3.6

Information on midrange of temperature reading by Meteorologists is used by visitors in the

tourism industry.

Geometric mean is very useful in the computation of rates and indices e.g. Computation of price

indices


Study Session 4 Measures of Dispersion/Variation

Introduction
Dispersion (or variation) is the degree of scatter of the individual values of a variable about a central value such as the median or the mean.

In this Study Session we shall discuss the range, semi-interquartile range, mean deviation from the mean and from the median, variance and standard deviation.


4.1 Explain the meaning of variation and its measures

4.2 Explain the Range

4.3 Explain Mean deviation and its Calculation

4.4 Explain The variance and its calculation

4.5 Explain Standard Deviation

4.6 Explain the use of coding method when dealing with large values of a variable

4.1 Variation and its Measures
Weight, like so many other things, is not static or unchanging. Not everyone who is 5 feet tall weighs 100 pounds; there is some variability. When reporting these numbers or reviewing them for a project, a researcher needs to understand how much difference there is in the scores. This is where we will look at measures of variability.

Box 4.1: Definition of Variation

Variation can be defined as a way to show how data is dispersed, or spread out.

Several measures of variation are used in statistics; these will be discussed in the course of this study session.

4.2 The Range
This is the simplest measure of variation. It is the difference between the largest and the smallest value in a set of data.

Range = X(max) – X(min)

The range is thus a measure which is very easy to determine and use. The range is reasonably efficient only for small samples (n ≤ 10); for larger samples it is not a good measure because it ignores all the values in between. It is commonly used in statistical quality control.

However, the range may fail to discriminate if the distributions are of different types.

Semi-Interquartile Range: this is half the difference between the third and first quartiles. It is a good measure of spread for the midrange and the quartiles.

$S.I.R = \frac{Q_3 - Q_1}{2}$

4.3 The Mean Absolute Deviation
Mean deviation is the mean absolute deviation from the center. The measure of the center could be the arithmetic mean or the median. It can be shown that the mean deviation of a distribution is least when the deviations are taken from the median. Given a set of values X1, X2, ….., XN, the mean deviation from the arithmetic mean is defined by:

$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} |X_i - \bar{X}|}{N}$ for simple series

For grouped data

$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$

Example 4.1

Below is the average of 10 Heads of household randomly selected from a community

54, 59, 35, 41, 46, 25, 47, 60, 54, 46

Find the (i) Range (ii) Mean (iii) Mean deviation from the mean (iv) Mean deviation

from the median.

Solution

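i. Range = 60 – 25 = 35

ii. Mean: $\bar{X} = \frac{\sum X}{n} = \frac{54 + 59 + \cdots + 46}{10}$ = 46.7

iii. Mean deviation from the mean:

$MD_{\bar{X}} = \frac{\sum |X - \bar{X}|}{n} = \frac{|54 - 46.7| + |59 - 46.7| + \cdots + |46 - 46.7|}{10}$

= (7.3 + 12.3 + 11.7 + 5.7 + 0.7 + 21.7 + 0.3 + 13.3 + 7.3 + 0.7) / 10

= 81/10 = 8.10

iv. Array: 25, 35, 41, 46, 46, 47, 54, 54, 59, 60

Median: $\tilde{X} = \frac{X_{(n/2)} + X_{(n/2 + 1)}}{2} = \frac{46 + 47}{2}$ = 46.5

Mean deviation from the median:

$MD_{\tilde{X}} = \frac{|54 - 46.5| + |59 - 46.5| + \cdots + |46 - 46.5|}{10}$

= (7.5 + 12.5 + 11.5 + 5.5 + 0.5 + 21.5 + 0.5 + 13.5 + 7.5 + 0.5) / 10

= 81/10

= 8.1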
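The calculations in Example 4.1 can be checked with a short Python sketch. The helper name `mean_deviation` is an assumption for this illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

def mean_deviation(data, centre):
    """Mean absolute deviation of the data from a chosen centre."""
    return sum(abs(x - centre) for x in data) / len(data)

values = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]  # data of Example 4.1

data_range = max(values) - min(values)
md_from_mean = mean_deviation(values, mean(values))
md_from_median = mean_deviation(values, median(values))
print(data_range, round(md_from_mean, 2), round(md_from_median, 2))
```

Passing the centre as an argument makes it easy to compare deviations from the mean and from the median on the same data.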

Example 4.2

The table below shows the frequency distribution of the scores of 42 students in an STA 111 test.

Table 4.1

Scores      No. of Students (f)
0 – 10      2
10 – 20     5
20 – 30     8
30 – 40     12
40 – 50     9
50 – 60     5
60 – 70     1

Find the mean deviation from the mean for the data.

Solution

Table 4.2

Classes     X     f     fX      X – X̄      |X – X̄|    f|X – X̄|
0 – 10      5     2     10      -29.52     29.52      59.04
10 – 20     15    5     75      -19.52     19.52      97.60
20 – 30     25    8     200     -9.52      9.52       76.16
30 – 40     35    12    420     0.48       0.48       5.76
40 – 50     45    9     405     10.48      10.48      94.32
50 – 60     55    5     275     20.48      20.48      102.40
60 – 70     65    1     65      30.48      30.48      30.48
Total             42    1450                          465.76

$\bar{X} = \frac{\sum fX}{\sum f} = \frac{1450}{42}$ = 34.52

Mean deviation: $MD_{\bar{X}} = \frac{\sum f|X - \bar{X}|}{\sum f} = \frac{465.76}{42}$

= 11.089
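The grouped calculation can be sketched in Python using class midpoints as the representative values. The function name `grouped_mean_deviation` is an assumption for illustration. Because the sketch does not round the mean to 34.52 before taking deviations, its result may differ from the hand calculation in the third decimal place.

```python
def grouped_mean_deviation(midpoints, freqs):
    """Mean and mean deviation from the mean for grouped data."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    md = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints)) / n
    return mean, md

midpoints = [5, 15, 25, 35, 45, 55, 65]  # class marks of Table 4.1
freqs = [2, 5, 8, 12, 9, 5, 1]           # number of students in each class
mean_score, md_score = grouped_mean_deviation(midpoints, freqs)
print(round(mean_score, 2), round(md_score, 2))
```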


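4.4 The Variance
The variance of a set of observations is the average of the squared deviations from the mean.

Let x1, x2, x3, ----, xn be a random sample from a population. The sample variance, S², is defined as:

$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$, where $\bar{X} = \frac{\sum X_i}{n}$, for discrete data or simple series.

For grouped data, the sample variance is defined as:

$S^2 = \frac{\sum f_i (X_i - \bar{X})^2}{\sum f_i}$

Another formula for calculating the variance can be derived from the above as follows:

$S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$

$nS^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2$

$nS^2 = \sum_{i=1}^{n}(X_i^2 - 2\bar{X}X_i + \bar{X}^2)$

$= \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2$

$= \sum X_i^2 - n\bar{X}^2$ (since $\sum X_i = n\bar{X}$)

Therefore $S^2 = \frac{1}{n}\sum X_i^2 - \bar{X}^2$

However, for grouped data

$S^2 = \frac{\sum f_i X_i^2}{\sum f_i} - \bar{X}^2$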

4.5 Standard Deviation
The standard deviation is the square root of the variance. It is sometimes referred to as the root mean squared deviation from the mean (RMSD).

It should be noted that the variance is measured in units of X² rather than X. This makes it difficult to interpret the size of the variance. A measure of variability that is closely related to the variance but expressed in the same unit as the observations is called the Standard Deviation.

In-Text Question

Standard deviation could be defined as?

a. The cube root of the variance

b. The square root of the variance

c. Both the square and cube root of the variance

d. Fourth root of the variance

In-Text Answer

b.) The square root of the variance

Standard deviation is the positive square root of the variance. It is defined as

$S = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}}$ or $S = \sqrt{\frac{\sum X_i^2}{n} - \bar{X}^2}$

Example 4.2


Consider the data in Example 4.1. Calculate the standard deviation and the coefficient of variation.

Solution:

i. Standard deviation: $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} = \sqrt{\frac{(54 - 46.7)^2 + \cdots + (46 - 46.7)^2}{10}}$

= 10.37

ii. Coefficient of variation: $C.V = \frac{S}{\bar{X}} \times 100 = \frac{10.37}{46.7} \times 100$

= 22.21
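The following Python sketch checks this example and also confirms numerically that the definitional and shortcut variance formulas of Section 4.4 agree. It is an illustration only: `pstdev` from the standard `statistics` module uses the divisor n, which matches the definition of S in this study session, and the variable names are chosen for this example.

```python
from statistics import mean, pstdev

values = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46]  # data of Example 4.1
n = len(values)
x_bar = mean(values)

# Definition: average squared deviation from the mean (divisor n)
var_definition = sum((x - x_bar) ** 2 for x in values) / n
# Shortcut: mean of the squares minus the square of the mean
var_shortcut = sum(x * x for x in values) / n - x_bar ** 2

sd = pstdev(values)     # population-form standard deviation (divisor n)
cv = 100 * sd / x_bar   # coefficient of variation, in percent
print(round(var_definition, 2), round(var_shortcut, 2), round(sd, 2), round(cv, 2))
```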

Comparison of Dispersion: Two distributions with different means or units of measurement are compared using the coefficient of variation.

Definition: The Coefficient of Variation (C.V) is a dimensionless quantity that measures the relative variation between two series observed in different units.

It is defined as the ratio of the standard deviation to the mean of a set of data, expressed as a percentage,

i.e. $C.V = \frac{S}{\bar{X}} \times 100$

The distribution with the smaller C.V is said to be the more consistent (less variable) of the two.

4.6 Coding Method
This is the method used when large values of the variable are involved in the calculation.

This is achieved by choosing one of the values (or class marks) as the assumed mean (A) and determining the common factor (C). The values of the variable Xi (or class marks) are transformed using the code:

$U_i = \frac{X_i - A}{C}$

Thus the formulae for calculating the mean and the variance become

$\bar{X} = A + C\,\frac{\sum f_i U_i}{\sum f_i}$

$S^2 = C^2\left[\frac{\sum f_i U_i^2}{\sum f_i} - \left(\frac{\sum f_i U_i}{\sum f_i}\right)^2\right]$

Example 4.3

Given the following grouped data, compute the (i) mean, (ii) standard deviation and (iii) coefficient of variation, using an assumed mean of 77 and a common factor of 5.

Table 4.3

Class       f
50 – 54     1
55 – 59     2
60 – 64     10
65 – 69     12
70 – 74     18
75 – 79     25
80 – 84     9
85 – 89     6
90 – 94     4
95 – 99     3
Total       90

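Solution

Table 4.5

Classes     f     Class Mark (X)    X – A    U = (X – A)/C    fU     U²    fU²
50 – 54     1     52                -25      -5               -5     25    25
55 – 59     2     57                -20      -4               -8     16    32
60 – 64     10    62                -15      -3               -30    9     90
65 – 69     12    67                -10      -2               -24    4     48
70 – 74     18    72                -5       -1               -18    1     18
75 – 79     25    77                0        0                0      0     0
80 – 84     9     82                5        1                9      1     9
85 – 89     6     87                10       2                12     4     24
90 – 94     4     92                15       3                12     9     36
95 – 99     3     97                20       4                12     16    48
Total       90                                                -40          330

A = 77, C = 5

$\bar{X} = A + C\,\frac{\sum fU}{\sum f} = 77 + 5\left(\frac{-40}{90}\right)$

= 77 – 2.22

= 74.78

$S^2 = C^2\left[\frac{\sum fU^2}{\sum f} - \left(\frac{\sum fU}{\sum f}\right)^2\right] = 25\left[\frac{330}{90} - \left(\frac{-40}{90}\right)^2\right]$

= 25 (3.667 – 0.198)

= 86.73

$S = \sqrt{S^2} = \sqrt{86.73}$

= 9.31

Coefficient of variation: $CV = \frac{S}{\bar{X}} \times 100 = \frac{9.31}{74.78} \times 100$

= 12.45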
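The coding method can be sketched in Python and compared against a direct calculation on the class marks; both routes must give the same mean and standard deviation, which makes this a useful check on hand-computed tables. The function name `coded_mean_sd` is an assumption for illustration, and the divisor Σf is used for the variance, as in Section 4.4.

```python
import math

def coded_mean_sd(class_marks, freqs, assumed_mean, common_factor):
    """Mean and standard deviation via the coding U = (X - A) / C."""
    n = sum(freqs)
    u = [(x - assumed_mean) / common_factor for x in class_marks]
    sum_fu = sum(f * ui for f, ui in zip(freqs, u))
    sum_fu2 = sum(f * ui * ui for f, ui in zip(freqs, u))
    mean = assumed_mean + common_factor * sum_fu / n
    variance = common_factor ** 2 * (sum_fu2 / n - (sum_fu / n) ** 2)
    return mean, math.sqrt(variance)

class_marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]  # Table 4.3 class marks
freqs = [1, 2, 10, 12, 18, 25, 9, 6, 4, 3]
coded_mean, coded_sd = coded_mean_sd(class_marks, freqs, assumed_mean=77, common_factor=5)

# Direct calculation on the class marks, for comparison
n = sum(freqs)
direct_mean = sum(f * x for f, x in zip(freqs, class_marks)) / n
direct_sd = math.sqrt(sum(f * (x - direct_mean) ** 2 for f, x in zip(freqs, class_marks)) / n)
cv = 100 * coded_sd / coded_mean
print(round(coded_mean, 2), round(coded_sd, 2), round(cv, 2))
```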


1. Variation is a way to show how data is dispersed, or spread out.

2. The range is the simplest measure of variation

3. The range is the difference between the largest and the smallest value in a set of data

4. Mean deviation is the mean absolute deviation from the center

5. The variance of a set of observations is the average of the squared deviation from the

mean

6. The standard deviation is the square root of the variance.

7. The Standard Deviation is also referred to as the root mean squared deviation from the

mean

8. The Coding Method is used when larger values of the variable are involved in calculation







Define variation


What is a range?


What is the formula for calculating the mean deviation for grouped data?


What is a variance?


The standard deviation is sometimes referred to as?

a. The root mean squared deviation from the mean (RMSD)

b. The root mean square

c. The cube root mean square of the deviation (CMSD)

d. Standard means of measurement


When is the coding method used in calculations?



Variation can be defined as a way to show how data is dispersed, or spread out.

SAQ 4.2

This is the simplest measure of variation. It is the difference between the largest and the smallest

value in a set of data.

SAQ 4.3

$MD_{\bar{X}} = \frac{\sum_{i=1}^{N} f_i |X_i - \bar{X}|}{\sum_{i=1}^{N} f_i}$

SAQ 4.4

The variance of a set of observations is the average of the squared deviation from the mean

SAQ 4.5

a.) The root mean squared deviation from the mean (RMSD)

SAQ 4.6

This method is used when larger values of the variable are involved in calculation.


Study Session 5 Algebraic Treatment of Mean and Variance

Introduction
It is advisable to adjust the values of the mean and variance when mistakes are found in the data. It may also be desirable to combine these statistics without recourse to the individual observations of the variable. The various methods of doing this will be discussed in this study session.


5.1 Calculate the pooled mean of two or more variables

5.2 Adjust the values of mean, variances and standard deviation for mistakes

5.1 Pooled Mean and Variance
You have learnt how to compute the mean and variance from univariate data. Sometimes, you may have information about the means and variances of two or more variates and wish to find the combined mean and variance. This can be achieved without using the individual values of the variables.

Given two sets of data consisting of n1 and n2 items, with means $\bar{X}_1$ and $\bar{X}_2$ and variances $S_1^2$ and $S_2^2$ respectively, the combined mean is defined by

$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$

and, when the two sets have the same mean, the combined variance is

$S_{12}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$

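Suppose we have, for i = 1, 2, 3, …, k,

ni = number of observations in variable i,

$\bar{X}_i$ = mean of variable i,

$S_i^2$ = variance of variable i.

Then the pooled (combined) mean is defined as

$\bar{X}_{12 \ldots k} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \cdots + n_k\bar{X}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{\sum_{i=1}^{k} n_i}$

The pooled (combined) variance is given by

$S_{12 \ldots k}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \cdots + (n_k - 1)S_k^2 + n_1(\bar{X}_1 - \bar{X}_{12 \ldots k})^2 + \cdots + n_k(\bar{X}_k - \bar{X}_{12 \ldots k})^2}{n_1 + n_2 + \cdots + n_k - k}$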

Example 5.1

The means and standard deviations of two variables of 100 and 150 items are 50 and 5, and 40 and 6, respectively. Find the standard deviation of all the 250 items taken together.

Solution

$\bar{X}_{12} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{100(50) + 150(40)}{250}$

= 44

$S_{12}^2 = \frac{99(5)^2 + 149(6)^2 + 100(50 - 44)^2 + 150(40 - 44)^2}{248} = \frac{13839}{248}$

= 55.80

$S_{12} = \sqrt{55.80}$

= 7.47

Example 5.2

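A survey was conducted at three locations in a community to study a single variable. At each location, the sample size (ni), the mean $\bar{X}_i$ and the standard deviation σi are given in the following table.

Table 5.1

Location     I      II     III
ni           200    250    300
X̄i           95     10     15
σi           3      4      5

Obtain the combined mean and standard deviation for the variable in all the three locations.

Solution

$\bar{X}_{123} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + n_3\bar{X}_3}{n_1 + n_2 + n_3} = \frac{200(95) + 250(10) + 300(15)}{200 + 250 + 300} = \frac{26000}{750}$

= 34.7 (or approximately 35)

$\sigma_{123}^2 = \frac{(n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2 + (n_3 - 1)\sigma_3^2 + n_1(\bar{X}_1 - \bar{X}_{123})^2 + n_2(\bar{X}_2 - \bar{X}_{123})^2 + n_3(\bar{X}_3 - \bar{X}_{123})^2}{n_1 + n_2 + n_3 - 3}$

$= \frac{199(9) + 249(16) + 299(25) + 200(95 - 34.7)^2 + 250(10 - 34.7)^2 + 300(15 - 34.7)^2}{747}$

$= \frac{13250 + 727218 + 152522.5 + 116427}{747}$

= 1351.30

$\sigma_{123} = \sqrt{1351.30}$

= 36.76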
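The pooled mean and pooled variance formulas of this section can be sketched in Python as below. The function name `pooled_mean_variance` is an assumption for illustration, and the divisor n1 + … + nk – k follows the formula given in this study session. The sketch uses the unrounded pooled mean, so results for Example 5.2 may differ slightly from a hand calculation that rounds the mean to 34.7.

```python
import math

def pooled_mean_variance(sizes, means, variances):
    """Combine group sizes, means and variances without the raw observations."""
    total = sum(sizes)
    k = len(sizes)
    pooled_mean = sum(n * m for n, m in zip(sizes, means)) / total
    within = sum((n - 1) * v for n, v in zip(sizes, variances))              # (n_i - 1) S_i^2 terms
    between = sum(n * (m - pooled_mean) ** 2 for n, m in zip(sizes, means))  # n_i (mean_i - pooled mean)^2 terms
    return pooled_mean, (within + between) / (total - k)

# Example 5.1: two groups, standard deviations 5 and 6 (variances 25 and 36)
mean_12, var_12 = pooled_mean_variance([100, 150], [50, 40], [25, 36])
sd_12 = math.sqrt(var_12)
print(mean_12, round(var_12, 2), round(sd_12, 2))
```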


5.2 Adjusting Values of Mean and Standard Deviations for Mistakes

Sometimes mistakes occur in the computation of mean and variance of a set of data when a

correct value in the original data is replaced by an incorrect one. Instead of going through the

entire process to correct such mistakes, some simple algebraic adjustment can be made as shown

in the following examples.

Example 5.3

The mean and standard deviation of a set of 100 observations were worked out as 40 and 5

respectively by a student who by mistake took the value 50 in place of 40 for one observation.

Recalculate the correct mean and standard deviation.

Solution

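n = 100; $\bar{X}$ = 40; σ² = 25

$\bar{X} = \frac{\sum X}{n}$, so $40 = \frac{\sum X}{100}$

Incorrect: ∑X = 4000

Correct: ∑X = 4000 – 50 + 40 = 3990

Corrected mean: $\bar{X} = \frac{3990}{100}$ = 39.90

$\sigma^2 = \frac{\sum X^2}{n} - \bar{X}^2$

$25 = \frac{\sum X^2}{100} - 40^2$

2500 = ∑X² – 160,000

∑X² = 162,500

Correct ∑X² = 162,500 – 50² + 40²

= 161,600

Correct $\sigma^2 = \frac{161600}{100} - (39.90)^2$

= 1616 – 1592.01

= 23.99

$\sigma = \sqrt{23.99}$

= 4.90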
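The adjustment in Example 5.3 can be sketched in Python. The function name `correct_mean_sd` and its arguments are assumptions for illustration; the sketch recovers ΣX and ΣX² from the reported mean and standard deviation (variance with divisor n), swaps the wrong values for the right ones, and recomputes. It accepts lists so that several wrongly recorded items can be corrected at once.

```python
import math

def correct_mean_sd(n, mean, sd, wrong_values, right_values):
    """Adjust a reported mean and standard deviation for wrongly recorded items."""
    sum_x = n * mean                   # reported sum of X
    sum_x2 = n * (sd ** 2 + mean ** 2) # reported sum of X^2, from var = sum_x2/n - mean^2
    sum_x += sum(right_values) - sum(wrong_values)
    sum_x2 += sum(r * r for r in right_values) - sum(w * w for w in wrong_values)
    new_mean = sum_x / n
    new_var = sum_x2 / n - new_mean ** 2
    return new_mean, math.sqrt(new_var)

# Example 5.3: the value 50 was used in place of 40
new_mean, new_sd = correct_mean_sd(100, 40, 5, wrong_values=[50], right_values=[40])
print(round(new_mean, 2), round(new_sd, 2))
```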





SAQ (Tests Learning Outcomes)

1. Find the mean, median and mode of the following observations:

5, 6, 10, 15, 22, 16, 6, 10, 6

2. The six numbers 4, 9, 8, 7, 4 and X, have mean of 7. Find the value of X and hence

calculate the coefficient of variation for the six numbers.

3. The arithmetic mean of five observations is 4.4 and the variance is 8.24. If three of the five observations are 1, 2 and 6, find the other two.

4. The mean and standard deviation of 120 items were found by a student to be 60 and 5 respectively. At the time of calculation, two items were wrongly recorded as 45 and 55 instead of 54 and 70. Find the correct mean and standard deviation.


Study session 6: Measure of Skewness and Kurtosis

Introduction A fundamental task in many statistical analyses is to characterize the location and variability of a

data set. A further characterization of the data includes skewness and kurtosis.

In this Study session, you will learn the definition of skewness and kurtosis, you will also learn

how to calculate measure of skewness and kurtosis from simple series and grouped data.

Learning outcomes for study session 6 At the end of this study session you should be able to:

6.1 Define skewness and kurtosis;

6.2 Calculate measure of skewness and kurtosis from simple series and grouped data;

6.3 Determine whether a set of data is normally distributed, the direction of skewness and the level of peakedness, and interpret your result.

6.1 Define skewness and kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. For univariate data X1, X2, -----, XN, the formula for skewness is

For discrete data,

Skewness: $\frac{\sum_{i=1}^{N}(X_i - \bar{X})^3}{(N - 1)s^3}$

For grouped data

Skewness: $\frac{\sum f(X_i - \bar{X})^3}{(N - 1)s^3}$

where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness

near zero.

Negative values for the skewness indicate data that are skewed left, and positive values for the skewness indicate data that are skewed right. By skewness to the left, we mean that the left tail is long relative to the right tail. Similarly, skewness to the right means that the right tail is long relative to the left tail.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case. Kurtosis is the standardized fourth central moment of a distribution.

The histogram is an effective graphical technique for showing both the skewness and kurtosis of

data set.

For univariate data X1, X2, -----, XN, the formula for kurtosis is:

For discrete data

Kurtosis: $\frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4}$

For grouped data

Kurtosis: $\frac{\sum f(X_i - \bar{X})^4}{(N - 1)s^4}$

where $\bar{X}$ is the mean, s is the standard deviation, and N is the number of data points.

In-text Question

_____________ is a measure of symmetry, or more precisely, the lack of symmetry?

a) Skewness

b) Kurtosis

c) Grouped data

d) Sample series


In-text Answer

a) Skewness

6.2 Calculating measure of skewness and kurtosis from simple series and grouped data

Excess Kurtosis: The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is defined as

For discrete data:

Excess Kurtosis: $K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$

For grouped data:

Excess Kurtosis: $K = \frac{\sum f(X_i - \bar{X})^4}{(N - 1)s^4} - 3$

The standard normal distribution has excess kurtosis of zero. Positive kurtosis indicates a

“peaked” distribution and negative kurtosis indicates a “flat” distribution.

The peakedness of a distribution can be shown as in the diagram below:

In-text Question

A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. True/False

a) False

b) True

c) None of the above

d) All of the above

In-text Answer

b) True

Diagram 6.1 - Peakedness of a Distribution

[Diagram: three frequency curves labelled A, B and C]

A -------------- Leptokurtic

B -------------- Mesokurtic - Normal

C -------------- Platykurtic

Example 6.1

Twelve numbers generated from a computer are as follows:

10, 43, 67, 89, 70, 80, 62, 80, 03, 42, 71, 35

a. Obtain the measures of skewness and kurtosis.

b. Interpret your result.


Solution

Table 6.1

X        Xi – X̄     (Xi – X̄)²    (Xi – X̄)³       (Xi – X̄)⁴
03       -51.3      2631.69      -135005.697     6925792.26
10       -44.3      1962.49      -86938.307      3851367.00
35       -19.3      372.49       -7189.057       138748.80
42       -12.3      151.29       -1860.867       22888.66
43       -11.3      127.69       -1442.897       16304.74
62       7.7        59.29        456.533         3515.30
67       12.7       161.29       2048.383        26014.46
70       15.7       246.49       3869.893        60757.32
71       16.7       278.89       4657.463        77779.63
80       25.7       660.49       16974.593       436247.04
80       25.7       660.49       16974.593       436247.04
89       34.7       1204.09      41781.923       1449832.73
Total    652        8516.68      -145673.444     13445494.99

$\bar{X} = \frac{\sum X}{n} = \frac{652}{12}$ = 54.3

$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N - 1}} = \sqrt{\frac{8516.68}{11}}$

= 27.825

Skewness: $\frac{\sum (x_i - \bar{x})^3}{(N - 1)S^3} = \frac{-145673.444}{11 \times (27.825)^3}$

$= \frac{-145673.444}{236972.6385}$

= -0.6147

That is, the distribution is negatively skewed.

Kurtosis: $\frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{(N - 1)S^4} = \frac{13445494.99}{11 \times (27.825)^4}$

$= \frac{13445494.99}{6593763.668}$

= 2.039

Excess Kurtosis: K = 2.039 – 3

= -0.961

i.e. platykurtic.
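The skewness and kurtosis formulas of Section 6.1 can be sketched in Python as follows. The function name `skewness_kurtosis` is an assumption for illustration. The sketch uses the unrounded mean (652/12) rather than 54.3, so its results agree with the worked example only to about two decimal places.

```python
import math

def skewness_kurtosis(data):
    """Skewness, kurtosis and excess kurtosis using the (N - 1) divisor of Section 6.1."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    skew = sum((x - mean) ** 3 for x in data) / ((n - 1) * s ** 3)
    kurt = sum((x - mean) ** 4 for x in data) / ((n - 1) * s ** 4)
    return skew, kurt, kurt - 3

numbers = [10, 43, 67, 89, 70, 80, 62, 80, 3, 42, 71, 35]  # Example 6.1
skew, kurt, excess = skewness_kurtosis(numbers)
print(round(skew, 2), round(kurt, 2), round(excess, 2))
```

A negative skewness together with a negative excess kurtosis corresponds to the interpretation given above: negatively skewed and platykurtic.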

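In-text Question

The kurtosis for a standard normal distribution is three. For this reason, excess kurtosis is defined as ____________ ?

a) $\frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{(N - 1)S^4}$

b) $\frac{\sum (x_i - \bar{x})^3}{(N - 1)S^3}$

c) $\bar{X} = \frac{\sum X}{n} = \frac{652}{12} = 54.3$

d) $K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$

In-text Answer

d) $K = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^4}{(N - 1)s^4} - 3$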

6.3 Determining whether a set of data is normally distributed, the direction of skewness and the level of peakedness

Example 6.2

Given the data below:

Table 6.2

Class      f
10-14      1
15-19      4
20-24      8
25-29      19
30-34      35
35-39      20
40-44      7
45-49      5
50-54      1

a. Draw the histogram for the above data.

b. Obtain the measure of

i. Skewness

ii. Kurtosis

c. Interpret your result.

Solution

Diagram 6.2

[Histogram of the data in Table 6.2: frequency (0 to 40) on the vertical axis; class boundaries 9.5, 14.5, 19.5, …, 54.5 on the horizontal axis]

Table 6.3

Class    Mid-Point Xi    f      fXi     Xi – X̄    (Xi – X̄)²    (Xi – X̄)³    (Xi – X̄)⁴
10-14    12              1      12      -20.1     404.01       -8120.60     163224.08
15-19    17              4      68      -15.1     228.01       -3442.95     51988.56
20-24    22              8      176     -10.1     102.01       -1030.30     10406.04
25-29    27              19     513     -5.1      26.01        -132.65      676.52
30-34    32              35     1120    -0.1      0.01         -0.001       0.0001
35-39    37              20     740     4.9       24.01        117.65       576.48
40-44    42              7      294     9.9       98.01        970.30       9605.96
45-49    47              5      235     14.9      222.01       3307.95      49288.44
50-54    52              1      52      19.9      396.01       7880.60      156823.92
Total                    100    3210

$\sum fX_i$ = 3210

$\sum f$ = 100

$\bar{X} = \frac{\sum fX_i}{\sum f} = \frac{3210}{100}$ = 32.1

Table 6.4

Class    f(Xi – X̄)²    f(Xi – X̄)³    f(Xi – X̄)⁴
10-14    404.01        -8120.6       163224.08
15-19    912.04        -13771.8      207954.24
20-24    816.08        -8242.4       83248.32
25-29    494.19        -2520.35      12853.88
30-34    0.35          -0.035        0.0035
35-39    480.20        2353.00       11529.6
40-44    686.07        6792.093      67241.72
45-49    1110.05       16539.75      246442.2
50-54    396.01        7880.599      156823.92
Total    5299.00       910.257       949317.96

$S^2 = \frac{\sum f(X_i - \bar{X})^2}{N - 1} = \frac{5299}{99}$ = 53.53

S = 7.316

Skewness: $\frac{\sum f(x_i - \bar{x})^3}{(N - 1)S^3} = \frac{910.257}{99 \times 391.58}$

= 0.0235

Kurtosis: $\frac{\sum f(x_i - \bar{x})^4}{(N - 1)S^4} = \frac{949317.96}{99 \times 2864.8}$

$= \frac{949317.96}{283615.5}$

= 3.3

Excess Kurtosis: K = 3.3 – 3

= 0.3

Since Skewness = 0.0235, Kurtosis = 3.3 and Excess Kurtosis = 0.3, the distribution is near normal. The positive excess kurtosis indicates a slightly peaked distribution, i.e. leptokurtic.
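The grouped-data versions of the skewness and kurtosis formulas can be sketched in Python using class midpoints weighted by their frequencies. The function name `grouped_skewness_kurtosis` is an assumption for illustration; because it carries full precision throughout, its results may differ from a hand calculation in the later decimal places.

```python
import math

def grouped_skewness_kurtosis(midpoints, freqs):
    """Skewness and kurtosis of grouped data with the (N - 1) divisor."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    s = math.sqrt(sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1))
    skew = sum(f * (x - mean) ** 3 for f, x in zip(freqs, midpoints)) / ((n - 1) * s ** 3)
    kurt = sum(f * (x - mean) ** 4 for f, x in zip(freqs, midpoints)) / ((n - 1) * s ** 4)
    return mean, s, skew, kurt

midpoints = [12, 17, 22, 27, 32, 37, 42, 47, 52]  # class marks of Table 6.2
freqs = [1, 4, 8, 19, 35, 20, 7, 5, 1]
g_mean, g_sd, g_skew, g_kurt = grouped_skewness_kurtosis(midpoints, freqs)
print(round(g_mean, 1), round(g_sd, 3), round(g_skew, 3), round(g_kurt, 2))
```

A skewness close to zero and a kurtosis close to three are what one expects from data that are approximately normal.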

Summary for study session 6 In this study session, you have learnt:

1. The concept of Skewness and Kurtosis.

2. How to distinguish between Kurtosis and excess kurtosis and their interpretations.

3. Useful examples were given to illustrate the different formulae for their computation.

Self-Assessment Questions (SAQs) for Study Session 6 Now that you have completed this study session, you can assess how well you have achieved its

Learning outcomes by answering the following questions.

SAQ 6.1-6.2

1. Consider the data in post test question 3 in chapter 4, obtain the measure of skewness and

kurtosis.

2. Consider the data in post test question 1 in chapter 4, obtain a measure of Excess

Kurtosis and interpret your results.

3. Consider the post test question 1 in chapter 5, calculate the measure of Skewness and

interpret your result.


Study Session 7: Methods of Collecting Statistical Data

Introduction
In the previous sessions, you learnt the various ways in which a set of data can be summarized, calculated some descriptive statistics, examined the shape of a distribution, and saw how summaries can be combined and corrected for errors.

In this Study session, you will learn about the various methods that can be employed in the

collection of statistical data.

Learning outcomes for study session 7 At the end of this lecture, you should be able to:

7.1 Explain the various methods of data collection

7.2 Discuss the problems of data collection in Nigeria.

7.1 The Various Methods of Data Collection

Data collection is an activity aimed at obtaining information to satisfy some decision objectives or for the purpose of scientific inquiry. The process of data collection varies with the nature of the inquiry, the objective of the study and the characteristics of the unit of inquiry.

Methods of Data Collection

There are five broad methods of data collection: documentary sources, observation, mail or postal questionnaires, personal interview and telephone.

Figure 7.1: Methods of data collection

1. Documentary Sources: It is sometimes possible to answer some of the questions a survey is intended to cover from data that are already available.

An enquiry concerned with the leisure activities of a town's population may very well begin by obtaining statistical data on the use made of the local libraries, attendance at cinemas, and membership of clubs and societies.

A mass of information on the subjects commonly studied in social surveys is available in historical documents, statistical reports, records of institutions and other surveys.

Government departments possess a mass of information relating to individuals, such as census schedules, employment records, insurance cards and health records.

The main difficulty is that a survey researcher can hardly expect to gain access to these materials.

Some materials are collected in the form of case records by psychiatrists, social workers and others, and are of interest to sociologists and psychologists. Such materials have limitations for the research worker in that they represent only a highly specialized population, i.e. only the cases that happen to come before social workers.

There are personal documents which come directly from the informants, such as diaries, autobiographies and surveys. These give insight into personal character, experiences and beliefs that formal interviewing can hardly achieve.

The possibility of any investigator bias affecting their contents is eliminated. However, the use of this method has many difficulties, e.g.:

a. How to get the documents.

b. How to get a representative collection of documents.

Some people are better at writing letters and essays than others, not everybody can produce documents, and documents are at their best when unsolicited. Data are usually collected by copying out the relevant items from the available records.

2. Observation: Observation, the classic method of scientific enquiry, is defined as the accurate watching and noting of phenomena as they occur in nature. The observer positions himself and observes the life and activities of a community. How the observer positions himself depends on:

a. The nature and size of the community.

b. What he wishes to observe.

c. His own personality and skill.

An example where this method is suitable is in the case of traffic censuses. Actual measurement

or counting also comes under the heading of observations. Examples occur in statistical quality

control.

Problems

i. If the characteristics of the population are to be inferred from those of sample, the sample

should ideally be randomly selected.

ii. To instruct an investigator to observe people of all types, men and women of different

ages, social class etc. does not make the sample a random one. It does not ensure that the

resultant group is representative.

iii. The observer can hardly be expected to observe and note everything relevant to the subject.

iv. His selection of the aspects of behaviour and environment which he notes may follow certain channels.

v. If what he is studying is very familiar to him, he may fail to note what appears normal.

In-text Question

___________, the classic method of scientific enquiry, is defined as the accurate watching and noting of phenomena as they occur in nature.

a) Problems

b) Merits

c) Observation

d) Demerits

In-text Answer

c) Observation

Merits

The advantages of this method are similar to those of the personal interview. In addition, the method has some unique advantages, such as:

i. Providing more reliable information.

ii. Supplying additional and necessary information.

Demerits

The disadvantages are also similar to those of the personal interview.

i. It is exceptionally certified.

ii. Highly trained personnel are needed for observation.

iii. Because of the close scrutiny involved, it is time-consuming.

3. Mail or Postal Questionnaires: This is one of the most widely used methods of data

collection mostly in social surveys. Questionnaires are mailed out to respondents who in turn are

expected to send them back through the post when they are duly completed. The choice of this

method is governed by:

a. Limited resources

b. Economic advantages

c. Potential efficiency.

In-text Question

_________ is one of the most widely used methods of data collection mostly in social surveys.

a) Diary

b) Telephone

c) Interview

d) Mail or Postal Questionnaires

In-text Answer

d) Mail or Postal Questionnaires

Merits

i. It is generally quicker and cheaper than other methods.

ii. It avoids the problems associated with the use of interviewers.

iii. It is useful when information concerning several members of household is required and

allows for some intra-household consultation.

iv. It is useful where the questions demand considered rather than immediate answers.

v. Questions of a personal or embarrassing nature are answered more willingly and accurately than when the respondent is face to face with an interviewer who is a complete stranger.

vi. The problem of non-contacts in the sense of respondent not being at home is avoided.

Demerits

i. The method can only be considered when the questions are sufficiently simple and straightforward to be understood with the help of the printed instructions and definitions.

It is unsuitable where the objectives of the survey take a good deal of explanation.

ii. The answers to mail questionnaire have to be accepted as final. There is no opportunity

to probe beyond the given answers.

iii. It is inappropriate where spontaneous (unplanned) answers are wanted or where it is

important that the views of one person only are obtained or where it is essential that one

particular person in each household fills the questionnaires and no one else.

iv. The answers cannot be treated as independent since the respondent can see all the

questions before answering any of them.

v. There is no opportunity to supplement the respondent's answers with observational data, such as his house, appearance and manner.

Some of the disadvantages of this method can be overcome by combining it with interview

method.

4. Personal Interview: This is the method used in most surveys. It could be a formal interview, in which set questions are asked and the answers recorded in a standard form, or a less formal one, in which the interviewer is at liberty to vary the sequence of questions, to explain their meaning or to change the wording. The interviewer may even have no set questions at all, only a number of key points around which to build the interview.

The interviewer should possess some vital qualities, such as (a) honesty, (b) interest, (c) accuracy, (d) adaptability, (e) personality and temperament, and (f) intelligence and education.

Merits

i. The interviewer is free and has more opportunity to restructure questions whenever it is

necessary to do so.

ii. It allows more accurate information to be obtained by asking the respondent for further

explanation.

iii. A skilled interviewer can easily persuade an unwilling respondent. This will increase the

number of responses.

iv. A skilled interviewer will know when to make call backs and then make more effective

efforts.

v. In addition to recording verbal answers, the interviewer can note the non-verbal reactions

of respondents to questions.

vi. It can be used for persons of all educational levels.

vii. It can be used to explore areas in which little information exists.

In-text Question

___________ is the method that is used mainly in most surveys?

a) Intelligence

b) Personal interview

c) Adaptability

d) All of the above

In-text Answer

b) Personal interview

Demerits

i. Personal interviews are expensive to conduct if the sample to be taken is widely scattered

geographically.

ii. Unscrupulous interviewers may introduce bias by influencing the respondent's answers or by recording them to suit themselves.

iii. The respondent may give biased answers in order to boost his image or to please the interviewer.

iv. It may be difficult to interview some individuals, such as high-income and influential people, who are not always available.

v. If recalls are necessary and the sample is large, the survey will take more time to complete.

vi. Respondents may give inaccurate or false information owing to lapses in memory or misunderstanding, or deliberately.

vii. A larger field staff is needed for interviewing.

5. Telephone: This is the method of collecting data through the telephone. Like other methods, it has many advantages, especially in industrialized countries. In a developing country like Nigeria, this method of collecting information cannot be efficient because of the inefficiency of the telephone system.

Merits

i. It is faster than other methods.

ii. It is cheaper to collect information by phone than personal interview.

iii. It is more flexible than postal questionnaires.

iv. It encourages higher response rate than postal questionnaire.

v. Recall of respondents is quicker and easier than with any other method.

vi. It is the best method of reaching very difficult respondents.

vii. It facilitates recording of replies without causing any embarrassment to the respondent.

viii. It is very suitable for radio and television surveys.

Apart from the fact that the telephone system is not effective in a developing country and

therefore renders the method unsuitable, it has other demerits.

Demerits

a. A survey by telephone is limited to respondents who have telephones, an obvious source of bias.

b. If the population is widely located all over the country, cost consideration will limit

extensive coverage of the country.

c. The interviewer may be biased and as a result, influence the respondent.

d. Cost consideration may restrict the number of questions asked or the time given to the

respondents to answer the questions.

e. Answers given may not be treated in confidence, as the telephone could be bugged or the conversation eavesdropped on.

7.2 Limitations of Data Collection in Nigeria

Generally, secondary data are limited in scope, and the information derived from them may not satisfy all the needs of the researcher. This may lead to a reduction in the scope of the research work or to the introduction of certain assumptions to fill the gaps created by insufficient information.

In-text Question

Collection of data by telephone, like other methods, has many advantages. True/False

In-text Answer

True

Summary for Study Session 7

In this study session, you have learnt:

1. The various methods of data collection.

2. The situations under which each of them can be employed, as well as their relative merits and demerits.

3. The problems usually encountered in the process of collecting statistical data.



SAQ 7.1-7.2

1. What is statistical data collection?

2. What are the merits of personal interview?

3. Discuss the demerits of postal questionnaire method.

4. "The observational method of data collection is best in social science research." Discuss.

92

References Adamu, S. O. (1978): “The Nigerian Statistical System”. Ibadan: University Press.

Adewoye, G. O. and Shittu O.I (1999): “Introduction to Socio-Economic Statistics (Survey

methods and Indicators).” Lagos: Victory Ventures ISBN 978-33867-1-9

Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition


Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition London: Arnold &

Stoughton



Moser, C. A. (1968): “Survey Methods in Social Investigation” London: Heinemann Educational

Books Ltd.,

Osuntogun, E. O. (1997): “Introduction to Social and Economic Statistics” Unpublished paper.

93

Study Session 8: Regression Analysis

Introduction

In this study session, you will be introduced to the theory of linear regression analysis. Different types of relationships will be shown on the scatter plot, and the parameters of the model will be estimated by the method of least squares. An introduction to the test of significance of the regression line will also be given.

Learning Outcomes for Study Session 8

At the end of this study session, you should be able to:

8.1 Discuss the concept of regression analysis;

8.2 Identify the types of regression models

8.3 Estimate the parameters of a regression model; and

8.4 Explain the testing of the significance of the model

8.1 Regression Analysis

Regression analysis is a statistical tool which helps to study the trend and pattern of movement in

one variable in response to changes in another variable on the basis of an assumed relationship

existing between them. Once this pattern is established, it can be used to predict one variable

from the other.

The variable being predicted is usually referred to as the response (dependent) variable and the

other variable is called the explanatory (independent) variable. The values of the explanatory

variable are usually fixed and under the control of the investigator while the values of the

response variable are determined by the values of the explanatory variables.

Thus regression analysis attempts to determine how changes in the explanatory variable affect the response variable. The variables involved are assumed to be measured and recorded as interval-scale or ratio-scale data. If the variables are strictly qualitative (i.e. attributes), the method of regression cannot be used.

The appropriate method for studying the association between two qualitative variables will be discussed in Study Session 9.

In-Text Question

The values of the explanatory variable are usually fixed. True or False

In-Text Answer

True

8.2 Types of Regression Models

A regression model may be simple or multiple, and linear or non-linear.

Figure 8.1: Types of regression model

A regression model is simple if there is only one explanatory variable, and multiple if there is more than one explanatory variable.

A regression model is linear if its parameters carry no exponents and are not multiplied by other parameters in the model; otherwise, the model is said to be non-linear. The highest power appearing in a model is called the order of the model.

Scatter Plot

The first step in the study of the relationship between two variables is to draw the scatter diagram. It portrays the direction, form and strength of any relationship between two quantitative variables. It is drawn by plotting the values of the response variable (on the Y-axis) against the values of the explanatory variable (on the X-axis).

The shape of the scattered points on the graph gives an idea of the type of relationship between

the two variables.

Types of Scatter Diagram

Diagram 8.1: Scatter plots of Y against X, panels (a) to (d)

Diagram 8.1 shows some of the common types of relationships that exist between two variables:

Panel (a) depicts a linear relationship.

Panels (b) and (c) depict non-linear relationships: panel (b) is a quadratic relationship, while panel (c) is an exponential relationship.

Panel (d) shows no relationship between variables X and Y, since neither a line nor a curve can be fitted to the scatter plot.

Please note that:

i. Scatter plot cannot be used for more than two variables.

ii. A non-linear regression model can be made linear through appropriate transformation.
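
Note (ii) can be illustrated with a short worked case. Assuming, purely for illustration, an exponential model with parameters $\alpha$ and $\beta$ (symbols not used elsewhere in this session) and ignoring the error term, taking natural logarithms of both sides gives a model that is linear in X:

$$Y = \alpha e^{\beta X} \quad\Longrightarrow\quad \ln Y = \ln\alpha + \beta X$$

Regressing $\ln Y$ on X would then give estimates of the intercept $\ln\alpha$ and the slope $\beta$.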

In-Text Question

The following are types of regression models except _______________

A. Simple

B. Multiple

C. Short

D. Non-Linear

In-Text Answer

C. Short

8.3 The Simple Regression Model

The simple linear regression model describing the relationship between the response variable (Y) and the explanatory variable (X) can be expressed as

$$Y_i = \beta_0 + \beta_1 X_i + e_i, \qquad i = 1, 2, \ldots, n$$

where there are n observations on both X and Y and

i. $Y_i$ is the ith observation on Y.

ii. $X_i$ is the ith observation on X.

$\beta_0$ is the intercept (the point at which the regression line cuts the Y-axis, i.e. the value of Y when X = 0).

$\beta_1$ is the slope (regression coefficient) of the line. It gives the rate of change in Y per unit change in X.

$e_i$ is the random error term, with mean 0 and variance $\sigma^2$.

The parameters of the model can be estimated by the method of Ordinary Least Squares (OLS).

Basic Assumptions of OLS

i. The relationship between X and Y is assumed to be linear.

ii. The $X_i$'s are predetermined (fixed) values assumed to be measured without error.

iii. The error terms $e_i$ are independent of X, i.e. $E(e_i X_i) = 0$.

iv. The error term is assumed to be normally distributed with mean zero and variance $\sigma^2$, i.e. $e_i \sim N(0, \sigma^2)$.

The above assumptions imply that

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

is a random variable with expectation (mean)

$$E(Y_i) = E(\beta_0 + \beta_1 X_i + e_i) = \beta_0 + \beta_1 X_i$$

since $\beta_0$ and $\beta_1$ are constants and $E(e_i) = 0$.

Similarly,

$$V(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 X_i + e_i) = \mathrm{Var}(e_i) = \sigma_e^2$$

where $\sigma_e^2$ can be estimated by

$$S^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}$$

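8.4 Estimation of Parameters

Suppose there are n pairs of observations on X and Y: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

The assumed linear relationship is

$$Y_i = a + bX_i + e_i \qquad (8.1)$$

where a and b are estimates of $\beta_0$ and $\beta_1$ in the original model. Equation (8.1) can be expressed as

$$e_i = Y_i - a - bX_i \qquad (8.2)$$

Let

$$Q = \sum e_i^2 = \sum (Y_i - a - bX_i)^2 \qquad (8.3)$$

The constants a and b can be obtained by minimizing Q with respect to a and b, i.e.

$$\frac{\partial Q}{\partial a} = -2\sum (Y_i - a - bX_i) = 0 \qquad (8.4)$$

$$\frac{\partial Q}{\partial b} = -2\sum (Y_i - a - bX_i)X_i = 0 \qquad (8.5)$$

The normal equations (8.4) and (8.5) can be solved simultaneously to obtain

$$a = \bar{Y} - b\bar{X} \qquad (8.6)$$

and

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{SS_{XY}}{SS_{XX}} \qquad (8.7)$$

and $\sigma_e^2$ can be estimated by

$$S_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 1} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1} \qquad (8.9)$$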

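where $Y_i$ is the observed value of Y and $\hat{Y}_i$ is the estimated value of Y.

With $SS_{XX} = \sum X^2 - n\bar{X}^2$, the variance of the intercept is

$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma_e^2 \sum X_i^2}{n\,SS_{XX}} \qquad (8.10)$$

and the variance of the regression coefficient is

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma_e^2}{SS_{XX}} \qquad (8.11)$$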
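
The step from the normal equations (8.4) and (8.5) to the estimates (8.6) and (8.7) can be sketched as follows. Dividing (8.4) by -2 and expanding the sum gives

$$\sum Y_i = na + b\sum X_i \quad\Longrightarrow\quad a = \bar{Y} - b\bar{X}$$

Doing the same with (8.5) and substituting this expression for a gives

$$\sum X_i Y_i = a\sum X_i + b\sum X_i^2 = (\bar{Y} - b\bar{X})\,n\bar{X} + b\sum X_i^2 \quad\Longrightarrow\quad b = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sum X_i^2 - n\bar{X}^2}$$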

Coefficient of Determination

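This is the proportion of variation in the response variable (Y) that is explained by the explanatory variable (X).

It is defined by

$$R^2 = \frac{SS_{\mathrm{Regression}}}{SS_{\mathrm{Total}}} = \frac{b\left(\sum_{i=1}^{n} XY - n\bar{X}\bar{Y}\right)}{\sum_{i=1}^{n} Y^2 - n\bar{Y}^2} = \frac{\text{Explained variation}}{\text{Total variation}} \qquad (8.12)$$

where $0 \le R^2 \le 1$.

$R^2 = 0$ when $b = 0$, and $R^2 = 1$ when all the points fall on the fitted regression line.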
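
As a rough illustration of equation (8.12), the sketch below computes the coefficient of determination in plain Python. The function name `r_squared` is an assumption made for this example, and the data are the weight-loss figures of Table 8.2 (Example 8.1, later in this session).

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear fit, following (8.12)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of cross products and squares
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    b = s_xy / s_xx               # least-squares slope
    ss_regression = b * s_xy      # explained variation
    return ss_regression / s_yy   # explained / total variation


# Data of Table 8.2: months on the diet (X) and weight loss in kg (Y)
months = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]
weight_loss = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]

print(f"R^2 = {r_squared(months, weight_loss):.3f}")
```

For these data the value is about 0.97, i.e. roughly 97% of the variation in weight loss is explained by the number of months on the diet.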

8.5 Testing the Significance of the Model

It is always desirable to test the significance of the model, that is, to examine whether the regression line is a good fit. If the line is a good fit, then all the points on the scatter diagram must fall on the line or lie very close to it.

This can be done by examining the residual plot, i.e. the plot of the residual errors $e_i = Y_i - \hat{Y}_i$ against the data points.

The most objective method is to arrange the sums of squares and cross products in an Analysis of Variance (ANOVA) table and carry out Fisher's test (F-test) or Student's test (t-test) as follows:

Specify the hypotheses

$H_0: \beta_1 = 0$ (i.e. no relationship between X and Y)

$H_1: \beta_1 \ne 0$ (i.e. a relationship exists between X and Y)

Choose $\alpha$, the level of significance.

Table 8.1: ANOVA table

Source        df       SS                                  MS                       F-cal
Regression    K – 1    $SS_R = b\,SS_{XY}$                 $MS_R = SS_R/(K - 1)$    $MS_R/MS_E$
Error         n – K    $SS_E = SS_Y - SS_R$                $MS_E = SS_E/(n - K)$
Total         n – 1    $SS_Y = \sum Y_i^2 - n\bar{Y}^2$

Here K is the number of parameters in the model (K = 2 for simple linear regression).

The critical value is $F_{\alpha, V_1, V_2}$, where $V_1 = K - 1$, $V_2 = n - K$ and $\alpha$ = 0.05 or 0.01.

Decision rule

Reject $H_0$ if $F_{cal} > F_{\alpha, V_1, V_2}$ at level of significance $\alpha$ and conclude that there is enough evidence to show that variables X and Y are related; otherwise, do not reject $H_0$.

Example 8.1

The table below shows the weight losses in kilograms (Y) of a sample of persons and the number of months (X) they have been on a special weight-reducing diet.

Table 8.2

Y    4   17   14   1   11   22    9   12   4    7
X    7   32   26   1   20   34   17   21   5   12

a. Draw the scatter diagram of the above data.

b. Fit the regression equation of Y on X.

c. Interpret the parameters of your regression model.

d. An individual is known to have been on the special weight-reducing diet for 27 months. Estimate his weight loss in kilograms.

e. Obtain an estimate of the standard error of the model.

Solution

a. Diagram 8.2: Scatter plot of weight loss in kilograms (vertical axis) against number of months on the diet (horizontal axis)

Table 8.3

Y      X      XY      $X^2$    $Y^2$    $\hat{Y}$    $(Y - \hat{Y})^2$
4      7      28      49       16       4.159        0.02528
17     32     544     1024     289      18.309       1.7135
14     26     364     676      196      14.913       0.8336
1      1      1       1        1        0.763        0.0561
11     20     220     400      121      11.517       0.2673
22     34     748     1156     484      19.441       6.5485
9      17     153     289      81       9.819        0.6708
12     21     252     441      144      12.083       0.00689
4      5      20      25       16       3.027        0.9467
7      12     84      144      49       6.989        0.000121
101    175    2414    4205     1397                  11.0688

$\bar{Y}$ = 10.1

$\bar{X}$ = 17.5

b. Regression of Y on X: $\hat{Y} = a + bX$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{2414 - 10(17.5)(10.1)}{4205 - 10(17.5)^2} = \frac{646.5}{1142.5} = 0.566$$

$$a = \bar{Y} - b\bar{X} = 10.1 - 0.5659(17.5) = 0.197$$

$$\hat{Y} = 0.197 + 0.566X$$

c. a = 0.197 means that when X = 0, Y = 0.197.

b = 0.566 implies that for every additional month spent on the special weight-reducing diet, there is an average weight loss of about 0.57 kilograms.

d. $\hat{Y} = 0.197 + 0.566(27) = 0.197 + 15.28 = 15.48$ kg.

e. $$S_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 1}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 1}}$$

From the working table, $\sum (Y_i - \hat{Y}_i)^2 = 11.0688$, so

$$S_e = \sqrt{\frac{11.0688}{9}} = 1.109$$
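
The hand computation in Example 8.1 can be checked with a short script. The following is a minimal sketch in plain Python; the function names `fit_simple_ols` and `residual_std_error` are assumptions made for this illustration, and the divisor n - 1 follows equation (8.9) as used in this session.

```python
from math import sqrt


def fit_simple_ols(x, y):
    """Least-squares estimates (a, b) for y = a + b*x, using (8.6) and (8.7)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    b = s_xy / s_xx          # slope, equation (8.7)
    a = y_bar - b * x_bar    # intercept, equation (8.6)
    return a, b


def residual_std_error(x, y, a, b):
    """Square root of the residual sum of squares over n - 1, as in (8.9)."""
    n = len(x)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return sqrt(sse / (n - 1))


# Data of Table 8.2: months on the diet (X) and weight loss in kg (Y)
months = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]
weight_loss = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]

a, b = fit_simple_ols(months, weight_loss)
s_e = residual_std_error(months, weight_loss, a, b)

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"estimated loss after 27 months = {a + b * 27:.2f} kg")
print(f"S_e = {s_e:.3f}")
```

Because the script carries unrounded intermediate values, its results should agree with the hand calculation to the number of decimal places quoted in the solution.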

Example 8.2

A quality control manager collects 10 samples of iron rods from the production line at regular intervals of time. Each time, the average length (Y) and diameter (X) of the rods are measured. The results are given below.

Table 8.4

Average diameter (X) in mm    18.1  23.0  17.5  20.2  14.7  13.8  15.1  13.8  16.1  12.6
Average length (Y) in cm       8.8   9.5   8.9   9.1   8.6   8.3   8.5   8.2   9.4   7.2

a. Calculate the linear regression of mean length on diameter.

b. Is there any evidence to show that the diameter influences the length of the rods?

c. Calculate the standard error of the regression coefficient.

n = 10

$\sum X = 164.9$, $\sum Y = 86.5$

$\bar{X} = 16.49$, $\bar{Y} = 8.65$

$\sum X^2 = 2813.85$, $\sum Y^2 = 752.25$

$\sum XY = 1441.85$

Hypothesis

$H_0: \beta_1 = 0$

$H_1: \beta_1 \ne 0$

$\alpha = 0.05$

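a. $$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{1441.85 - 10(16.49)(8.65)}{2813.85 - 10(16.49)^2} = \frac{15.465}{94.649} = 0.163$$

$$a = \bar{Y} - b\bar{X} = 8.65 - 0.163(16.49) = 5.96$$

$$\hat{Y} = 5.96 + 0.163X$$

b. $$SS_{Total} = \sum (Y - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = 752.25 - 10(8.65)^2 = 4.025$$

$$SS_{Trt} = b\left(\sum XY - n\bar{X}\bar{Y}\right) = 0.163(15.465) = 2.52$$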

Table 8.5: ANOVA

Source       df    SS       MS       $F_c$
Treatment    1     2.52     2.52     13.40
Error        8     1.50     0.188
Total        9     4.025

$F_{0.95, 1, 8} = 5.32$

Conclusion: Since $F_c > F_{0.95, 1, 8}$, we reject $H_0$ and conclude that there is sufficient evidence to show that the diameter influences the length of the rods at the 5% level of significance.

c. $$S.E.(b) = \sqrt{\frac{\sigma_e^2}{SS_{XX}}} = \sqrt{\frac{MSE}{\sum X^2 - n\bar{X}^2}} = \sqrt{\frac{0.188}{94.649}} = 0.045$$
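
The ANOVA steps of Example 8.2 can also be sketched in code. The function below is an illustrative assumption (its name and return structure are not from the text); it starts from the corrected sums already worked out in the solution and takes the critical value 5.32 from the F table rather than computing it. Because it carries unrounded intermediate values, the F ratio comes out near 13.5 rather than the hand value of 13.40, and the standard error is correspondingly close to, but not exactly, 0.045; the conclusion is unchanged.

```python
from math import sqrt


def anova_simple_regression(n, s_xy, s_xx, s_yy):
    """ANOVA quantities for a simple linear regression, built from the
    corrected sums S_XY, S_XX and S_YY."""
    b = s_xy / s_xx                 # slope estimate
    ss_reg = b * s_xy               # regression (explained) sum of squares
    ss_err = s_yy - ss_reg          # error (unexplained) sum of squares
    df_reg, df_err = 1, n - 2       # degrees of freedom, as in Table 8.5
    ms_reg = ss_reg / df_reg
    ms_err = ss_err / df_err
    return {
        "b": b,
        "F": ms_reg / ms_err,          # calculated F ratio
        "se_b": sqrt(ms_err / s_xx),   # standard error of the slope
    }


# Corrected sums from the solution of Example 8.2
result = anova_simple_regression(n=10, s_xy=15.465, s_xx=94.649, s_yy=4.025)
f_critical = 5.32  # tabulated 5% point of F with (1, 8) degrees of freedom

print(f"F = {result['F']:.2f}, reject H0: {result['F'] > f_critical}")
print(f"S.E.(b) = {result['se_b']:.4f}")
```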

Summary for Study Session 8

In Study Session 8, you have learnt:

1. The theory of regression analysis and its uses.

2. The difference between linear and non-linear models, and between simple and multiple regression models.

3. The method of Ordinary Least Squares (OLS) for estimating the parameters of a simple linear regression model, and the procedure for testing the significance of a regression line.



SAQ 8.1-8.5

1. A test was performed to determine the relationship between the chemical content (Y) of a particular solution and the crystallization temperature (X) in degrees. The following quantities were calculated:

n = 20, $\sum X_i = 400$, $\sum Y_i = 220$, $\sum X_i^2 = 8800$, $\sum X_i Y_i = 4300$, $\sum Y_i^2 = 2620$

Assuming a linear relationship $Y_i = \beta_0 + \beta_1 X_i + e_i$:

a. Calculate the least squares estimates of $\beta_0$ and $\beta_1$, each correct to two significant figures.

b. Test the significance of the fitted model at the 5% level of significance.

c. Obtain the standard errors of the parameters in the model.

d. A previous similar exercise with n = 15 shows a regression coefficient $\beta_1$ of 0.10 with a standard error of 0.008. Test the hypothesis that the slope of your regression model is the same as that of the previous exercise at the 5% level of significance.

2. Twelve students took two papers in the same subject, and their marks in percentages were as follows:

S/No.       1   2   3   4   5   6   7   8   9   10   11   12
Paper I    65  73  42  52  84  60  70  79  60   83   57    7
Paper II   78  88  60  73  92  77  84  89  70   99   73    8

a. Construct a scatter diagram for the above data.

b. Calculate the regression equation of Paper II on Paper I.

c. Two boys were each absent for one paper. One scored 63 on Paper I, the other 81 on Paper II. Estimate the marks of these students in the paper they did not take.

d. Obtain the standard error of your regression coefficient in (b) above.

e. Construct a 95% confidence interval for your regression coefficient in (b) above.

3. A random sample of ten families had the following income and food expenditure (in N per week).

Families              A   B   C   D   E   F   G   H   I   J
Family income        20  30  33  40  15  13  26  38  35  43
Family expenditure    7   9   8  11   5   4   8  10   9  10

a. Estimate the regression line of food expenditure on income and interpret your results.

b. Obtain the regression line of income on food expenditure and interpret the result.

4. The following results have been obtained from a sample of 11 observations on the value of sales (Y) of a firm and the corresponding prices (X).

$\bar{X} = 519.18$, $\bar{Y} = 217.82$, $\sum X_i^2 = 3134543$, $\sum Y_i^2 = 539512$, $\sum X_i Y_i = 1296836$

a. Estimate the regression line of sales on price and interpret the results.

b. What is the part of the variation in sales which is not explained by the regression line?

5. The following table shows the gross national product (X) and the demand for food (Y), measured in arbitrary units, in an underdeveloped country over the ten-year period 1960 – 1969.

Year   1960  1961  1962  1963  1964  1965  1966  1967  1968  1969
Y         6     7     8    10     8     9    10     9    11    10
X        50    52    55    59    57    58    62    65    68    70

(a) Estimate the food function $Y = b_0 + b_1 X + U$.

(b) What is the meaning of this result?

(c) Compute the coefficient of determination and find the explained and unexplained variation in the food expenditure.

(d) Find the regression of X on Y.

109

References Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition



Stoughton



Connor, L. R and Morrell,(1982) A. J. “Statistics in Theory and Practice”. Seventh Edition,

London: Pitman Books Limited.

Gupta, C. B. (1973)“An Introduction to Statistical Methods” New Delhi: Vikas Publishing House

PVT Ltd.

Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, Second Edition.

New Delhi: W.H. Freeman and coy.

110

Study Session 9: Correlation and Association

Introduction

In Study Session 8, you learnt how to measure the direction and strength of the relationship between the explanatory variable and the response variable for the purpose of predicting one from the other. In this study session, you will learn how to measure the relationship or association between two variables without distinguishing between them and without the aim of prediction.

Learning Outcomes for Study Session 9

At the end of this study session, you should be able to:

9.1 Explain the meaning of correlation

9.2 Explain coefficient of rank correlation

9.1 Correlation

Correlation refers to the relationship or association between two or more variables, while the correlation coefficient is a quantity that measures the strength of the linear relationship between two quantitative variables. The measure of relationship between two attributes (qualitative variables) is usually referred to as association. This will be discussed in the next section.

Product Moment Correlation Coefficient

Suppose we have n observations on two variables x and y denoted by

(x1, y1) (x2, y2)… (xn, yn)

The correlation coefficient r for variables X and Y computed from n cases is

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$

where r ranges from -1 to +1.

If r = 0, the two variables are uncorrelated.

If r = +1, x and y are said to be perfectly positively (directly) correlated, and the regression line is upward sloping on the scatter plot.

If r = -1, x and y are said to be perfectly negatively (inversely) correlated, and the regression line is downward sloping on the scatter plot.

If 0 < r < 0.5, x and y are said to be weakly positively correlated.

If 0.5 < r < 1, x and y are said to be strongly positively correlated.

If -0.5 < r < 0, x and y are said to be weakly negatively correlated.

If -1 < r < -0.5, x and y are said to be strongly negatively correlated.

Note that r is also referred to as the product moment correlation coefficient.

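It can be shown from Study Session 8 that

$$b = \frac{S_{XY}}{S_X^2} \quad \text{and} \quad r = \frac{S_{XY}}{S_X S_Y}$$

where $S_{XY} = \sum XY - n\bar{X}\bar{Y}$, $S_X^2 = \sum X^2 - n\bar{X}^2$ and $S_Y^2 = \sum Y^2 - n\bar{Y}^2$. Therefore

$$r = \sqrt{\frac{S_X^2}{S_Y^2}}\; b$$

where b is the regression coefficient.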

Example 9.1

Consider Example 8.1. Calculate the product moment correlation coefficient for the data.

Solution:

n = 10, $\bar{X} = 17.5$, $\bar{Y} = 10.1$

$\sum XY - n\bar{X}\bar{Y} = 646.5$

$\sum X^2 - n\bar{X}^2 = 1142.5$

$\sum Y^2 - n\bar{Y}^2 = 376.9$

$$r = \frac{646.5}{\sqrt{(1142.5)(376.9)}} = \frac{646.5}{656.21} = 0.985$$

Alternatively,

$$r = \sqrt{\frac{S_X^2}{S_Y^2}}\; b = \frac{33.80}{19.41}(0.5659) = 0.985$$
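
The same coefficient can be obtained directly from the raw data of Example 8.1. The sketch below is a minimal plain-Python illustration of the product moment formula; the function name `pearson_r` is an assumption made for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation coefficient from raw paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sum of cross products and corrected sums of squares
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return s_xy / sqrt(s_xx * s_yy)


# Data of Example 8.1: months on the diet (X) and weight loss in kg (Y)
months = [7, 32, 26, 1, 20, 34, 17, 21, 5, 12]
weight_loss = [4, 17, 14, 1, 11, 22, 9, 12, 4, 7]

print(f"r = {pearson_r(months, weight_loss):.3f}")
```

The value is close to +1, which indicates a strong positive linear relationship between time on the diet and weight loss.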

9.2 Coefficient of Rank Correlation

This is a measure of the strength of the relationship between two qualitative variables (or attributes). It is also used when exact measurement of quantitative variables is inaccurate, impossible or impracticable.

To obtain the rank correlation coefficient, the observed values of the variables are replaced by their respective ranks, in either ascending or descending order of magnitude.

The coefficient of rank correlation is given by

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d is the difference between the ranks of each pair of observations and $-1 \le R \le 1$. The interpretation is the same as for the product moment correlation coefficient.

If there are ties, the average of the ranks is assigned to the units involved.

Example 9.3

Two judges were asked to assess twelve contestants in a beauty contest. The twelve contestants were ranked according to their performance as follows:

Table 9.1

Contestant    1   2   3   4   5   6   7   8   9  10  11  12
Judge A      11   9   7  10   5   1   4  12   8   3   2   6
Judge B       5   7  11  12   6   4   8   9  10   2   1   3

Is there any agreement between the two judges?

Solution

Table 9.2

n = 12

Contestant    1   2   3   4   5   6   7   8   9  10  11  12
d             6   2  -4  -2  -1  -3  -4   3  -2   1   1   3
$d^2$        36   4  16   4   1   9  16   9   4   1   1   9

$\sum d^2 = 110$

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(110)}{12(144 - 1)} = 1 - 0.385 = 0.615$$

Comment: There is a fairly strong agreement in the opinions of the judges.

Example 9.4

A study was conducted to determine the relationship between the level of smoking, measured by the number of sticks of cigarette smoked per day (X), and a certain index of health (Y). The following data were obtained from a random sample of 10 male smokers.

Table 9.3

X    8   20   15   12   15    9   16   10   12    8
Y    4    5    5    7   10   13    8    6    3    8

Calculate the Spearman rank correlation coefficient and comment on your result.

Solution

Table 9.4

$R_X$    9.5    1       3.5   5.5    3.5    8    2      7   5.5     9.5
$R_Y$    9      7.5     7.5   5      2      1    3.5    6   10      3.5
d        0.5    -6.5    -4    0.5    1.5    7    -1.5   1   -4.5    6
$d^2$    0.25   42.25   16    0.25   2.25   49   2.25   1   20.25   36

$\sum d^2 = 169.5$

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(169.5)}{10(100 - 1)} = 1 - 1.027 = -0.027$$

Comment

The above result shows that there is a weak negative association between smoking habit and the reported health index.
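
The ranking with ties in Example 9.4 can be reproduced with a short script. The sketch below makes two assumptions that mirror the worked solution: rank 1 goes to the largest value, and tied values share the average of the ranks they occupy, after which the simple formula $1 - 6\sum d^2 / (n(n^2 - 1))$ is applied without any further tie correction. The function names are illustrative only.

```python
def average_ranks(values):
    """Rank values in descending order (1 = largest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2   # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_r(x, y):
    """Rank correlation using the sum of squared rank differences."""
    n = len(x)
    rx, ry = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Data of Table 9.3
cigarettes = [8, 20, 15, 12, 15, 9, 16, 10, 12, 8]
health_index = [4, 5, 5, 7, 10, 13, 8, 6, 3, 8]

print(average_ranks(cigarettes))
print(f"R = {spearman_r(cigarettes, health_index):.3f}")
```

The printed ranks should match the $R_X$ row of Table 9.4, and the coefficient is slightly below zero, in line with the comment on the example.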

Summary for Study Session 9

In Study Session 9, you have learnt about:

1. The concept of correlation.

2. The Distinction between correlations

3. The association between qualitative variables and attributes.

4. The method of interpretation of coefficient



SAQ 9.1-9.2

A group of sportsmen take part in a competition which includes two gymnasium tests: squat jumps and chins. The score for each exercise is the number performed in one minute. The scores of eight sportsmen taken from this group are given below:

Sportsmen      A   B   C   D   E   F   G   H
Squat jumps   47  72  60  44  56  63  71  64
Chins         25  48  30  40  27  35  30  34

a. Calculate the Spearman coefficient of rank correlation between these two sets of scores.

b. The overall winner of the gymnasium tests is the sportsman with the highest total score when the number of squat jumps is added to the number of chins. Determine the total scores and state which sportsman was the winner.

c. The rank correlation between the total scores and the number of squat jumps is 0.86 for the data above. Calculate the rank correlation between the total scores and the number of chins. If, to save time, only one exercise were to be used in future, state, giving a reason, which one you would recommend.

d. Consider the data in example 8.2

i. Calculate the coefficient of Spearman’s rank correlation

ii. Comment on your result.

References Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL

Publications ISBN: 978-34411-3-2

Connor, L. R and Morrell,( ) A. J. “Statistics in Theory and Practice”. Seventh Edition, London:Pitman Books Limited.

Olubosoye O.E, Olaomi J.O and Shittu O.I (2002):’Statistics for Engineering , Physical and Biological Sciences”. Ibadan:A Divine Touch Publications. ISBN: 978- 35606-7-0

Brookes C B and Dick W.F (1969): An Introduction to Statistical Method. Second Edition Published by H.E.B.Paperback.

Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third edition. London: Arnold & Stoughton

Moore D.S and Mc Cabe G.P (1993): Introduction to the Practise of Statistics, second Edition. New York: W.H. Freeman and coy.

117

Study Session 10: Proportions, Rates and Indices

Introduction

Rates, ratios and indices have become very important in the descriptive analysis of certain events and characteristics. They are especially useful in the study of vital characteristics such as price, death, birth, population growth, epidemics, etc.

In this study session, you will be introduced to the three concepts, their uses and applications, using some sample data, with particular emphasis on price indices.

Learning Outcomes for Study Session 10

At the end of this study session, you should be able to:

10.1 Explain the meaning of the terms proportion, rate and indices;

10.2 Explain items to be taken into consideration when constructing an index number

10.3 Discuss the different methods of construction of price index

10.4 Identify the uses of consumer price index

10.1 Proportion, Rates and Indices

Definition: Proportion is the ratio of the number of items with a certain characteristic, n(X), to the total number of items exposed to that characteristic (N).

It is defined as

$$P(X) = \frac{n(X)}{N}$$

The above expresses the chance of occurrence of such a characteristic (i.e. the probability of event X).

Example

If the voting-age population (people 18 years and above) in a ward consists of 550 males and 600 females, what is the proportion of males?

Solution

n(males) = 550

Total population: N = 550 + 600 = 1150

Proportion of males:

$$P(\text{Males}) = \frac{n(\text{males})}{N} = \frac{550}{1150} = 0.478$$

Rates

When a proportion refers to the number of events or cases occurring during a certain period of time, it becomes a rate and is usually expressed as so many per 1000. Thus we refer to the birth rate as the number of births per 1000 population in a year.

So also we have death rate, migration rate, marriage rate, etc. Some examples shall be given to illustrate this concept later.
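The two definitions can be tried out numerically. The sketch below reuses the ward figures from the proportion example; the birth figures (46 births in a population of 1150 over one year) are hypothetical numbers chosen only for illustration, and the function names are not from the course text.

```python
def proportion(count, total):
    """P(X) = n(X) / N."""
    return count / total


def rate_per_thousand(events, population):
    """Number of events in the period expressed per 1000 population."""
    return 1000 * events / population


# Figures from the ward example: 550 males and 600 females.
p_males = proportion(550, 550 + 600)

# Hypothetical figures, for illustration only.
birth_rate = rate_per_thousand(46, 1150)

print(round(p_males, 3), round(birth_rate, 1))
```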

Index Number

An index is a real number that measures the rate of increase or decrease in wage, production value, quantity, price, or volume of a certain phenomenon in the current period relative to a specific period in the past (a base period). It is usually expressed as a percentage.

An index number is a device for estimating trends in prices, wages, production and other economic variables.

In its simplest form, an index number represents a special kind of average or a weighted average, compiled from a sample of items judged to be representative of a whole. In this study session, our focus shall be on the construction of the consumer price index, since the principles and methods that will be discussed apply equally to indices of sales, production, wage, value and quantity.

In-Text Question

When proportion refers to the number of events or cases occurring during certain period of time,

it becomes a _______

A. Rate

B. An index

C. Ratio

D. Map

In-Text Answer

A. Rate

10.2 Consideration for an Index Number

Quite a number of methods and formulae are used in the computation of index numbers; there are, however, a number of criteria that must be satisfied.

A good index number:

a. Should be simple in conception.

b. Should be easily interpreted, so that the man on the street can understand an index that tries to measure the changing cost of the things he bought in a particular year.

As mentioned earlier in this study session, an index number is a special kind of average that considers the prices of many commodities expressed in different units, or quantities also measured in different units. The commodities could also be of different weights in the “basket” of goods considered for the index. All these constitute the problems usually encountered in the construction of an index number.

Thus, in the construction of a price index, the following factors are considered:

Figure 10.1: Factors that determine price index

a. Choice of Item

A decision should be taken on the items to be included in an index. Each commodity to be included should be (i) relevant, (ii) representative, (iii) reliable and (iv) comparable over a period of time.

b. Source of Data

A decision should also be taken on the source of data for the items composing the “basket” to be used in the construction of the index number: should the prices of commodities be collected from a local market, a supermarket or an urban market? Great care should be taken to ensure that prices are collected from a popular market that is patronized by different categories of people and where the majority of the selected commodities can be found.

c. The Base Period

A base year is a reference period. The chosen year should generally be a fairly “normal” year,

free of occurrence of unusual events such as war, famine, prolonged strike or hyper-inflation. If

it is difficult to select a year in particular, the average of a series of years can be taken.

d. The Weight

Different weights are used in different parts of the country for a particular commodity. For instance, “congo” is used in the Western part of Nigeria, ‘mudu’ in the North and ‘tin’ in the East. For the purpose of constructing an index number, the weights used in different regions need to be harmonized to a single unit.

In-Text Question

Choice of item can determine price index. True or False

In-Text Answer

True

10.3 Methods of Construction of Price Index

There are different methods of constructing a price index. Some of them are given below.

Let $P_n$ represent the price in the current year,

$P_0$ represent the price in the base year,

$q_n$ represent the quantity in the current year, and

$q_0$ represent the quantity in the base year.

1. Price Relative: This is the simplest method of calculating an index number. It is defined as the price in the current year expressed as a percentage of the price in the base period for a single commodity. The base period is always taken as 100.

$$PR = \frac{P_n}{P_0} \times 100$$

2. Simple Aggregate Method: This method considers the price of a basket of goods and services in the current year relative to that of the base period. It is denoted by:

$$SAM = \frac{\sum P_n}{\sum P_0} \times 100$$

Limitation: It attaches equal weight to all commodities. It does not take into account the relative importance of the commodities.

3. Simple Average Relative Method: This is the sum of the price relatives divided by the number of items considered. It is denoted by

$$SAP = \frac{1}{N} \sum \frac{P_n}{P_0} \times 100$$

where N is the number of items.

Its limitation is the same as in (2).

4. Weighted Simple Average Relative Method: To circumvent the problem of assigning equal weight to different items, the weighted average of price relatives is given as

$$WSAR = \frac{\sum W \left( P_n / P_0 \right)}{\sum W} \times 100$$

where W is the weight attached to each item.

Example 10.1

The following are the prices of commodities A, B and C in 1975 and 1985, using 1975 as the base year.

Table 10.1

Commodity 1975 1985

A 40 50

B 12 35

C 45 95

Calculate i. Price relative for each item

ii. Simple aggregative price index

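Solution

Table 10.2

Commodity 1975 1985 Pn/Po x 100

A 40 50 50/40 x 100 = 125

B 12 35 35/12 x 100 = 291.7

C 45 95 95/45 x 100 = 211.1

Total 97 180

Simple aggregate price index:

$$SAM = \frac{\sum P_n}{\sum P_0} \times 100 = \frac{180}{97} \times 100 \approx 185.6$$

Simple average of price relatives:

$$SAP = \frac{1}{N} \sum \frac{P_n}{P_0} \times 100 = \frac{6.28}{3} \times 100 \approx 209.3$$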
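The calculations in Example 10.1 can be reproduced with a few lines of code. This is a minimal sketch, assuming 1975 is the base year; the variable names are illustrative only.

```python
prices_1975 = {"A": 40, "B": 12, "C": 45}   # base-year prices, P0
prices_1985 = {"A": 50, "B": 35, "C": 95}   # current-year prices, Pn

# Price relative for each commodity: Pn / P0 x 100.
price_relatives = {
    item: 100 * prices_1985[item] / prices_1975[item] for item in prices_1975
}

# Simple aggregate method: sum(Pn) / sum(P0) x 100.
simple_aggregate = 100 * sum(prices_1985.values()) / sum(prices_1975.values())

# Simple average of relatives: the mean of the price relatives.
simple_average = sum(price_relatives.values()) / len(price_relatives)

for item, relative in price_relatives.items():
    print(item, round(relative, 1))
print(round(simple_aggregate, 1), round(simple_average, 1))
```

The two summary indices differ because the aggregate method is dominated by the expensive items, while the average of relatives treats each commodity's percentage change equally.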

Example 10.2

Given below are the prices of some staple foods in 1980 and 1996 with the corresponding weights, using 1980 = 100.

Table 10.3

Staple Food Weight Price (1980) Price (1996)

Elubo 3 1.25 10.50

Gari 5 4.00 12.50

Rice 1 35.00 75.00

Beans 3 12.00 38.00

Yam 2 5.00 8.50

Compute i. the price relatives

ii. simple aggregate price index

iii. weighted average price index

Solution

Table 10.4

Staple Food Weight (W) Price (1980) Price (1996) Pn/Po x 100 W(Pn/Po)

Elubo 3 1.25 10.50 840 25.20

Gari 5 4.00 12.50 312.5 15.63

Rice 1 35.00 75.00 214.3 2.14

Beans 3 12.00 38.00 316.7 9.50

Yam 2 5.00 8.50 170 3.40

Total 14 57.25 144.50 55.87

Simple Aggregate Price Index:

$$\frac{\sum P_n}{\sum P_0} \times 100 = \frac{144.50}{57.25} \times 100 = 252.4$$

Weighted Average Relative Index:

$$\frac{\sum W \left( P_n / P_0 \right)}{\sum W} \times 100 = \frac{55.87}{14} \times 100 = 399.1$$
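Because the weighted index involves several products and a ratio, it is easy to slip in the arithmetic. The sketch below recomputes the indices directly from the weights and prices in Table 10.3, taking the 1980 price of yam as 5.00 (an assumption consistent with the column total of 57.25). The function names are illustrative.

```python
# name: (weight, price in 1980, price in 1996)
foods = {
    "Elubo": (3, 1.25, 10.50),
    "Gari": (5, 4.00, 12.50),
    "Rice": (1, 35.00, 75.00),
    "Beans": (3, 12.00, 38.00),
    "Yam": (2, 5.00, 8.50),
}


def simple_aggregate_index(items):
    """sum(Pn) / sum(P0) x 100."""
    total_base = sum(p0 for _, p0, _ in items.values())
    total_current = sum(pn for _, _, pn in items.values())
    return 100 * total_current / total_base


def weighted_relative_index(items):
    """sum(W * Pn / P0) / sum(W) x 100."""
    total_weight = sum(w for w, _, _ in items.values())
    weighted_sum = sum(w * pn / p0 for w, p0, pn in items.values())
    return 100 * weighted_sum / total_weight


relatives = {name: 100 * pn / p0 for name, (w, p0, pn) in foods.items()}
print({name: round(value, 1) for name, value in relatives.items()})
print(round(simple_aggregate_index(foods), 1))
print(round(weighted_relative_index(foods), 1))
```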

5. Laspeyres' Price Index: This is a weighted method of constructing an index number. It assumes that the pattern of consumption has not changed over the years with the change in price, so the base-year quantities are used as weights. It is denoted by:

$$P_L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100$$

Limitation: Since the base-year quantities reflect an outmoded purchasing pattern, the index gives undue weight to items that have increased in price. Laspeyres' price index therefore tends to overestimate the rise in prices.

6. Paasche's Price Index: This method assumes that the consumption pattern of the consumer has changed in the current year, so the current-year quantities are used as weights. It is denoted by:

$$P_P = \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100$$

Limitation: Since people tend to spend less on goods that have risen in price, the current-year weighting procedure (Paasche's) gives undue weight to items that have reduced in price. It therefore tends to understate the rise in prices, that is, to underestimate the price index.

7. Fisher's Ideal Index Number: This method overcomes the problems of the Paasche and Laspeyres indices. It is considered the most efficient method of constructing an index number. It is the geometric mean of the Laspeyres and Paasche price indices, denoted by:

$$\text{Fisher's Ideal Index} = \sqrt{P_L \cdot P_P} = \sqrt{\frac{\sum P_n q_0}{\sum P_0 q_0} \cdot \frac{\sum P_n q_n}{\sum P_0 q_n}} \times 100$$

8. Marshall-Edgeworth Price Index: This method takes into account the pattern of consumption in the current and base periods. It uses the arithmetic mean of the base and current period quantities as weight. It is given by:

$$P_{ME} = \frac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100$$

Example 10.3

The prices and quantities demanded of commodities A, B and C in the current and base years are given below.

Table 10.5

Commodity 1960 Price (Po) 1960 Quantity (qo) 1970 Price (Pn) 1970 Quantity (qn)

A 4 50 10 40

B 3 10 9 2

C 2 5 4 2

Construct index numbers of price from the data above using

i. Laspeyres' method

ii. Paasche's method

iii. Marshall-Edgeworth method and

iv. Fisher's ideal index number

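Solution

Table 10.6

Commodity Po qo Pn qn Poqo Pnqo Pnqn Poqn

A 4 50 10 40 200 500 400 160

B 3 10 9 2 30 90 18 6

C 2 5 4 2 10 20 8 4

Total 240 610 426 170

i. Laspeyres':

$$P_L = \frac{\sum P_n q_0}{\sum P_0 q_0} \times 100 = \frac{610}{240} \times 100 = 254.2$$

ii. Paasche's:

$$P_P = \frac{\sum P_n q_n}{\sum P_0 q_n} \times 100 = \frac{426}{170} \times 100 = 250.6$$

iii. Marshall-Edgeworth:

$$P_{ME} = \frac{\sum P_n (q_0 + q_n)}{\sum P_0 (q_0 + q_n)} \times 100 = \frac{610 + 426}{240 + 170} \times 100 = 252.7$$

iv. Fisher's:

$$\sqrt{P_L \cdot P_P} = \sqrt{254.2 \times 250.6} = 252.4$$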
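The four weighted indices of Example 10.3 share the same four sums of products, so they can be computed together. The following is a minimal sketch using the data of Table 10.5; the names are illustrative and not from the course text.

```python
import math

# commodity: (P0, q0, Pn, qn) for 1960 (base) and 1970 (current)
data = {
    "A": (4, 50, 10, 40),
    "B": (3, 10, 9, 2),
    "C": (2, 5, 4, 2),
}

sum_p0q0 = sum(p0 * q0 for p0, q0, pn, qn in data.values())
sum_pnq0 = sum(pn * q0 for p0, q0, pn, qn in data.values())
sum_pnqn = sum(pn * qn for p0, q0, pn, qn in data.values())
sum_p0qn = sum(p0 * qn for p0, q0, pn, qn in data.values())

laspeyres = 100 * sum_pnq0 / sum_p0q0          # base-year quantities as weights
paasche = 100 * sum_pnqn / sum_p0qn            # current-year quantities as weights
marshall_edgeworth = 100 * (sum_pnq0 + sum_pnqn) / (sum_p0q0 + sum_p0qn)
fisher = math.sqrt(laspeyres * paasche)        # geometric mean of the two

print(round(laspeyres, 1), round(paasche, 1),
      round(marshall_edgeworth, 1), round(fisher, 1))
```

As expected from the limitations stated for each method, the Fisher and Marshall-Edgeworth values fall between the Laspeyres and Paasche values.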

10.4 Uses of Consumer Price Index

1. Consumer price indices among others are used to measure change in retail prices of

specific quantity of goods and services in a given geographical region over a period of

time.

2. It helps in wage and salary negotiation and adjustments of allowances.

3. Government agencies use consumer price indices to formulate wage policy, price control

policy, taxation and general economic policy.

4. Changes in purchasing power and real income can be measured using the consumer price

indices.

5. Use in international comparison

6. Construction of Human Suffering Index.

7. Construction of cost of living index.


In-Text Question

Changes in purchasing power and real income can be measured using the consumer price

indices. True or False

In-Text Answer

True

Summary for Study Session 10

In Study Session 10, you have learnt about:

1. The various methods of constructing an index number

2. The problem associated with the construction of index number

3. The uses of the consumer price index



SAQ 10.1 -10.4

1. a. Explain what is meant by an index number

b. What are the uses of consumer price index?

c. The price relatives for palm oil and kerosene are shown in the table below.

Commodity Price Relative (1961) Price Relative (1962)

Palm Oil 100 108

Kerosene 100 114

Assuming that palm oil is twice as important as kerosene, what is the price index for 1962, taking 1961 = 100?

2. Five feed components are to be used in the construction of an animal feedstuff index number. From the figures given in the following table, calculate a Laspeyres' price index, taking 1964 = 100.

Component 1964 Price per ton 1964 Consumption (tons) 1970 Price per ton 1970 Consumption (tons)

A 40 3,600 41 2,750

B 39 2,750 53 1,500

C 38 2,050 35 2,350

D 37 500 30 750

E 36 1,475 24 2,850

3. Given the following data on commodities A, B, C and D:

Commodity Base Year Po Base Year qo Current Year Pn Current Year qn

A 10 12 12 15

B 7 15 5 20

C 5 24 9 20

D 16 5 14 5

Show that Fisher's ideal index is 115.7

130

References Adamu S.O and Johnson Tinuke L (1998):”Statistics for Beginners: Book 1. Ibadan: SAAL

Publications. ISBN: 978-34411-3-2

Adewoye, G. O. and Shittu O.I (1999): “Introduction to Socio-Economic Statistics (Survey

methods and Indicators).” Lagos:Victory Ventures. ISBN 978-33867-1-9

Connor, L. R and Morrell,(1982) A. J. “Statistics in Theory and Practice”. Seventh Edition,

London: Pitman Books Limited,

Brookes C B and Dick W.F (1969): An Introduction to Statistical Method .Second Edition


Clarke G.M.and Cooke D (1993): A Basic Course in Statistics. Third Edition. London: Arnold &

Stoughton.

Moore D.S and Mc cabe G.P (1993): Introduction to the Practice of Statistics. Second edition.

New York: W.H. Freeman and coy.

131

Study Session 11: Time Series Analysis

Introduction

Time series analysis is the application of time series techniques to time-structured data, usually referred to as time series data. Time series data is the record of observations measuring a certain quantity of interest at regular or irregular intervals of time.

The observations may be recorded daily, weekly, quarterly, yearly or bi-annually. A time series is a realization or sample function from a certain stochastic process. Time series occur in many fields such as Agriculture, Engineering, Business and Economics, Geophysics, Medical Sciences, Meteorology, Quality Control, Social Sciences, and so on.

In this study session, you will learn about time series data, methods of analysis of time series data and the components of a time series.

Learning Outcomes for Study Session 11

At the end of this study session, you should be able to:

11.1 Define and identify time series data;

11.2 Identify the methods of analysis of time series data;

11.3 Estimate and isolate the components of a time series.

11.1 Time Series Analysis

The goal of time series analysis is to identify, within a given class of flexible models, a model that reasonably approximates the time-structured relationship of the process that generates the data. The original use of time series analysis was primarily as an aid to forecasting. In recent times, the task has grown to the extent that time series analysts develop reasonably simple models capable of describing the system that generates the time-structured data, making reliable forecasts for the future and testing hypotheses.

Uses of Time Series Models

Time series analysis is the study of the time-structured relationship in a variable. This involves the use of basic tools to analyze given time series data with a view to:

a. Constructing simple mathematical systems that explain the time-structured relationship in economic and social series in a concise way.

b. Using the model to explain the behavior of the series and make reliable forecasts for the future on the basis of the dynamic dependence of the series on its past values.

Thus time series analysis provides a basis for economic and business planning, production and system planning, and the control and optimization of industrial processes. The intrinsic nature of a time series is that its observations are dependent or correlated, and the order of the observations is therefore important.

Since life must be understood by looking backwards and must be lived by looking forward, time series analysis provides useful tools that help to predict the future by approximating models that use past data.

A discrete time series is one where observations are taken at specific discrete time intervals, usually equally spaced, e.g. interest rates, yields, volume of sales and production. Such series arise in fields such as Agriculture, Business circles, etc.

A continuous time series consists of observations taken at any time t (t ∈ T) in the index set T. This type of series is common in Engineering, Geophysics and Medical Sciences.

In-Text Question

Discrete time series is one where observations are taken at discrete specific time intervals,

usually equally spaced. True or False

In-Text Answer

True

11.2 Methods of Analysis of Time Series Data

Time series data can be analyzed using either the deterministic method or the dynamic method.

Deterministic Method

A time series is said to be deterministic if future values are determined exactly by some mathematical function. For example:

(i) $X_t = a + bt$

(ii) $X_t = \cos(2t)$

where a and b are constants and t is time, which is fixed.

(iii) $X_t = T_t + S_t + C_t + I_t$

where Tt is the trend component, St is the seasonal component, Ct is the cyclical component and It is the irregular component.

Dynamic (Non-deterministic) Method

A time series is said to be non-deterministic if future values can only be determined in terms of a probability distribution guided by some assumptions. For example:

(i) $X_t = a + bt + \varepsilon_t$

(ii) $X_t = A \sin(t + \theta)$

where $\varepsilon_t$ is normally distributed with mean zero and variance unity, A is a constant and $\theta$ is a random variable from a uniform distribution on the interval $[-\pi, \pi]$, independent of A.

This method involves the use of the autocorrelation function (ACF), the partial autocorrelation function (PACF) and the correlogram in the discrete domain. It also involves Fourier transforms in the frequency domain: analysis in the frequency domain is carried out using extensions of the Fourier method and the spectral density function.

In-Text Question

A time series is said to be non-deterministic if future values can only be determined in terms of a

probability distribution guided by some assumptions. True or False

In-Text Answer

True
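The difference between the two kinds of series can be seen by generating values from a straight-line model with and without a random error term. This is a minimal sketch: the constants a = 2.0 and b = 0.5 and the seeds are hypothetical choices made only for illustration.

```python
import random


def deterministic_value(a, b, t):
    """X_t = a + b*t: the value at time t is fixed once a and b are known."""
    return a + b * t


def non_deterministic_value(a, b, t, rng):
    """X_t = a + b*t + e_t, where e_t is drawn from a normal(0, 1) distribution."""
    return a + b * t + rng.gauss(0.0, 1.0)


a, b = 2.0, 0.5          # hypothetical constants
times = range(1, 6)

exact = [deterministic_value(a, b, t) for t in times]

# Two realizations of the same non-deterministic model (different seeds).
realization_1 = [non_deterministic_value(a, b, t, random.Random(1)) for t in times]
realization_2 = [non_deterministic_value(a, b, t, random.Random(2)) for t in times]

print(exact)
print([round(x, 2) for x in realization_1])
print([round(x, 2) for x in realization_2])
```

Re-running the deterministic model always reproduces the same values, while each realization of the non-deterministic model differs, so its future values can only be described by a probability distribution.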


Deterministic Time Series Analysis

The analysis of a time series depends on the type of system that generates the data. Analysis in the time domain refers to the analysis of discrete time series.

Simple Descriptive Analysis

Most social and economic data, including data generated in medicine, are time structured. They need to be summarized with a view to making inferences about the system that generates the data.

Time Plot

The first and most important diagnostic tool for time series data is the time plot. It is a graphical representation of time series data, constructed by plotting the observations $x_t$ on the vertical axis against time t on the horizontal axis. When properly drawn, it shows up the important features of the series such as trend, seasonality, discontinuities and outliers. The time plot of the data gives an idea of the type of model that is suitable for the data.

It could also indicate whether it would be necessary to transform the observed data to achieve certain stable conditions suitable for meaningful analysis and inference.

Fig. 1.1: Time plot of a series (Xt on the vertical axis against time t on the horizontal axis)

11.3 Components of Time Series

Movements in time-structured data are governed by some peculiar and inherent forces, which may be characterized by their regularity or periodicity and their effect on the entire series. The forces could also be due to changes in the social, economic, psychological or environmental characteristics of the system.

The patterns generated by these forces are referred to as the components of a time series. Some of the components are: the trend (Tt), seasonal movement (St), cyclical movement (Ct) and irregular movement (It).

Trend or Secular Movement

This is the long-term movement of a series in the same direction over a long period of time. It is usually characterized by a continuous increase or decrease in the values of a variable over time. This movement is generally referred to as secular variation or secular movement. A line can be freely drawn by hand through the plotted points on the graph of such a time series stretching over a long period; such a line is called the trend. It is denoted by (Tt). The time plot below shows trend in a series.

Fig. 1.2: Time plot showing Trend

Trend can be upward or downward. An upward trend is displayed in the time plot. This type of plot is expected from the sales of a commodity for which an increase is always expected.

Seasonal Variation

This refers to identical or almost identical patterns which a time series appears to follow during corresponding months of successive years, due mainly to recurring events that take place annually. The movement appears to be periodic (it exhibits variation at a fixed time within a given interval of time). Many time series, such as sales figures and temperature readings, exhibit variations which are periodic annually.

There are factors responsible for this repetitive pattern year after year, and the major factor is weather conditions. During winter, more woolen clothes are sold in the UK and some other parts of the world. Also, regardless of an increasing trend in the sales of ice cream, more ice cream is sold during summer than winter.

Seasonal variation is denoted by (St). A time plot of monthly rainfall is given below. A season is completed within one year; therefore a complete cycle is detected in the time plot below. Seasonal variation is also found in quarterly data. In that case, a complete cycle will be completed within the four quarters that make a year.

Fig. 1.3: Time plot showing seasonal variation

Cyclical Variation

Cyclical variation refers to a long-term oscillation about the trend, which may or may not be periodic, due to some other physical causes. The movement may or may not follow exactly similar patterns after equal intervals of time. Examples include daily variation in temperature and rainfall as well as some social and economic variables. A cyclical variation is denoted by (Ct).

Fig. 1.4: Time Plot showing Cyclical Variation

Irregular Variation

This refers to the erratic or sporadic movement of a time series due to the occurrence of random or chance events, which are unforeseen; hence it cannot be isolated directly. Such variations are not deterministic. These variations may or may not be random.

Though it is assumed that these chance events produce short-term variation, they can be very intense and may result in a new cyclical or other variation. Included among these random factors are such events as strikes, floods, volcanic eruptions, earthquakes, fire outbreaks, sudden changes in government policy and so on. It is denoted by (It).

Method of Combining Components

The task of the statistician is to segregate each of these factors in so far as this is possible. By isolating or removing individual components, the impact of each of the components may be assessed. It may happen that not all of the components are present.

Traditionally, it is possible to decompose a time series into the trend, seasonal, cyclical and irregular components, using either of:

Additive model: $X_t = T_t + S_t + C_t + I_t$

or Multiplicative model: $X_t = T_t \cdot S_t \cdot C_t \cdot I_t$

where Tt is the trend component, St is the seasonal component, Ct is the cyclical component and It is the irregular component.

The resulting trend equation can be used for forecasting, while the original data can be de-seasonalized.

For example, the trend can be estimated using either

(i) k-point moving average,

(ii) semi-average, or

(iii) least squares method.

Assuming all these methods are familiar to us, the least squares method uses the equation $X_t = a + bt + \varepsilon_t$ with the assumption that the error terms $\varepsilon_t$ are independent and not serially correlated. Otherwise, the regression equation is spurious, i.e. the parameters of the model are biased and inconsistent due to the presence of a lagged dependent variable, and the estimated OLS standard error is invalid.

Decomposition of a time series can be achieved using any of the following models:

(a) Additive model

Xt = Tt + St + Ct + It

(b) Multiplicative model

Xt = Tt . St . Ct . It


(c) Mixed model

Xt = Tt StCt + It

or Xt = Tt + St . Ct . It

We shall concentrate first on (a) and (b).

The additive model assumes that the actual values are the sum of the four separate effects. This

assumption is probably true when short periods are involved or where the rate of growth or

decline in the trend is small as may be shown in the time plot.

The multiplicative model suggests that the actual values are the product of the separate effects.

This model is indicated when there is a marked (or sharp) growth or decline in a time series data

as may be shown in the time plot.

Decomposition of the Components

Either of these models may be used to effect the decomposition of the time series. The idea is to

decompose a time series into each of the basic components, analyze each component separately

and then recombine them in order to describe the variation in the series as a whole.

The process involves systematic evaluation of each component from the data. The first stage is

usually to estimate the trend and eliminate it from each time period from the actual data by

subtraction or division to give a de-trended series.

The de-trended series can be obtained using:

Additive Model: Xt – Tt = St + Ct + It or

Multiplicative Model: $\frac{X_t}{T_t} = S_t \cdot C_t \cdot I_t$

Estimation of Seasonal Indices

The first step in the estimation of the seasonal effect is to obtain the deviation from the trend, Xt - Tt (for the additive model), or the ratio to trend, $X_t / T_t$ (for the multiplicative model).

The de-trended series is averaged day by day, month by month, etc. to produce an estimate of the seasonal components. Depending on whether the seasonal effect is thought to be additive or multiplicative, the deviations (or ratios) are arranged in a table with a view to obtaining the averages, otherwise referred to as seasonal indices St.

For the additive model, the condition $\sum_{i=1}^{K} S_i = 0$ is imposed, where K is the number of seasons (quarters, months, etc.) in a year. That is, the sum of the seasonal effects (indices) over the quarters adds up to zero because, if there were no seasonal effect, we would expect Xt – Tt = 0. If the means do not sum to zero, their sum is averaged among the quarters / months / days / weeks, and the seasonal effects are adjusted by subtracting (or adding) this average from each mean to obtain the adjusted means (i.e. seasonal effects). That is,

$$S_i = \frac{1}{n} \sum d_i ;$$

but if $\sum S_i \neq 0$, say $\sum S_i = m$, then $g = \frac{m}{K}$ and therefore

$$\sum (S_i - g) = 0 \qquad (1.3)$$

where $d_i$ are the n deviations from trend observed for season i.

For the multiplicative model, the condition $\sum_{j=1}^{S} S_j = S$ is imposed, where S is the number of quarters in a quarterly series. That is, the sum of the seasonal effects over the year is S. The ratio of the actual values (Xt) to the trend (Tt) is obtained as $X_t / T_t$ because, if there were no seasonal effect, we would expect $X_t / T_t = 1$ for each time period. Thus

$$\sum_{j=1}^{S} S_j = 1 + 1 + \dots + 1 = S \qquad (1.4)$$

The averaging procedure which produces the seasonal components follows the same pattern as in the additive model, except for the adjustment to the averages, which corrects their sum to S. This is achieved by summing the unadjusted averages, dividing S by this sum, and multiplying the resulting quotient by each unadjusted average.

Let $C = \frac{S}{\sum_{j=1}^{S} \bar{S}_j}$, where $\bar{S}_j$ are the unadjusted averages; then the adjustment is $S_j = C \bar{S}_j$. Thus the de-trended, de-seasonalized series can be obtained by eliminating the trend (Tt) and seasonal components (St) for each time period from the actual data by subtraction or division, depending on whether the additive or multiplicative model was used.
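The two adjustments to the seasonal averages can be sketched in code. This is an illustration under stated assumptions: the quarterly averages used here are hypothetical numbers, and the function names are not from the course text. For the additive model the average excess is subtracted from each mean so that the indices sum to zero; for the multiplicative model the averages are rescaled so that they sum to the number of seasons.

```python
def adjust_additive(mean_deviations):
    """Subtract g = (sum of means) / K from each mean so the indices sum to 0."""
    g = sum(mean_deviations) / len(mean_deviations)
    return [s - g for s in mean_deviations]


def adjust_multiplicative(mean_ratios):
    """Scale each mean ratio by C = S / (sum of means) so the indices sum to S."""
    c = len(mean_ratios) / sum(mean_ratios)
    return [c * s for s in mean_ratios]


# Hypothetical unadjusted quarterly averages, for illustration only.
raw_additive = [2.0, -1.0, -3.0, 4.0]          # sums to 2, so g = 0.5
raw_multiplicative = [1.10, 0.95, 0.85, 1.20]  # sums to 4.10, not 4

additive_indices = adjust_additive(raw_additive)
multiplicative_indices = adjust_multiplicative(raw_multiplicative)

print(additive_indices, sum(additive_indices))
print([round(s, 3) for s in multiplicative_indices])
```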

Additive model: Xt – Tt – St = Ct + It

Multiplicative model: $\frac{X_t}{T_t S_t} = C_t \cdot I_t$

The de-seasonalized series is obtained after the seasonally adjusted data have been calculated. The residual ratio is obtained either by dividing the seasonally adjusted figures $X_t / S_t$ by the trend values or by dividing the de-trended ratio series $X_t / T_t$ by the respective seasonal indices.

Finally, the cyclical variation (Ct) can be found by smoothing the joint Ct and It components, and is eliminated as before.

The residual irregular component (It) can be obtained by subtraction or division:

Additive model: Xt – Tt – St - Ct = It

Multiplicative model: $\frac{X_t}{T_t S_t C_t} = I_t$

Although the general method of decomposition has included the four possible components which

make up a time series, it should be noted that it is not a rule for all the four to be present. If

annual data are being used, there can be no seasonal component. Similarly, if short periods of

time are involved, the cyclical components can be ignored. In both cases one of the steps

outlined in the decomposition of time series above may be omitted.

Prediction / Forecasting

The essence of decomposing a time series is for a statistician to measure the effect of each

component and to make meaningful and reliable forecast; taking into consideration the effect of

the component on the forecast values for different time periods.

Thus, if a multiplicative model was used, a sensible predictor for period k might be

$$\hat{X}_t(k) = \hat{T}_{t+k} \cdot \hat{S}_{t+k}$$

where $\hat{T}_t$ and $\hat{S}_t$ are the estimated trend and seasonal effects respectively.

Similarly, if the additive model was used, the predictor for period k might be

$$\hat{X}_t(k) = \hat{T}_{t+k} + \hat{S}_{t+k}$$
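A forecast under the additive model adds the estimated seasonal index of the target month to the projected trend value. The sketch below uses the twelve monthly seasonal indices listed in Table 11.2; the trend line T = 15.0 - 0.02t is a hypothetical placeholder (not fitted to the data), used only to show the mechanics.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Monthly seasonal indices as listed in Table 11.2 (they sum to zero).
SEASONAL_INDEX = dict(zip(MONTHS, [-4.44, -3.97, -4.62, -0.45, -2.15, 0.66,
                                   5.72, 7.85, 0.74, 1.33, 4.09, -4.76]))


def linear_trend(t, a=15.0, b=-0.02):
    """Hypothetical trend line T_t = a + b*t; a and b are illustrative, not fitted."""
    return a + b * t


def forecast_additive(t_last, k, trend=linear_trend):
    """Additive predictor: estimated trend at t+k plus the seasonal index at t+k."""
    t_future = t_last + k
    month = MONTHS[(t_future - 1) % 12]   # t = 1 is taken to be a January
    return trend(t_future) + SEASONAL_INDEX[month]


t_last = 96  # eight years of monthly observations
forecasts_next_year = [forecast_additive(t_last, k) for k in range(1, 13)]
print([round(x, 2) for x in forecasts_next_year])
```

For the multiplicative model the same structure applies, with the sum replaced by a product and the seasonal indices expressed as ratios.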

Example

The data below give the monthly sales of umbrellas in XYZ company from 2004 to 2011.

Table 11.1

2004 2005 2006 2007 2008 2009 2010 2011

JAN 10 15 10 8 10 8 9 10

FEB 18 12 10 9 9 12 13 8

MAR 22 13 9 12 10 10 3 11

APRIL 8 15 20 14 12 15 8 13

MAY 16 11 10 19 10 12 15 8

JUN 10 16 18 20 18 11 15 11

JUL 18 22 16 25 28 16 18 15

AUG 20 30 20 25 30 13 19 17

SEP 15 20 21 17 15 9 10 11

OCT 10 15 18 15 17 18 18 4

NOV 14 25 16 15 15 22 23 15

DEC 11 10 14 7 7 10 9 8


(a) Use a suitable average to decompose the series into trend and seasonal component, hence or

otherwise forecast the sales for 2012 – 2013 using the additive model.

(b) Which is the most appropriate model in the sense of providing the better forecast?

Solution:

The first thing to do is to construct the time plot in order to view the maximum and minimum

values, examine the existence of outliers and fluctuations.
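The trend for monthly data can be estimated with a 12-point moving total, adding successive totals in pairs and dividing by 24 to obtain a centred moving average. The sketch below applies these steps to the first two columns of Table 11.1; the function names are illustrative and not from the course text.

```python
# First two years of monthly sales from Table 11.1 (January to December).
sales_year_1 = [10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11]
sales_year_2 = [15, 12, 13, 15, 11, 16, 22, 30, 20, 15, 25, 10]
series = sales_year_1 + sales_year_2


def moving_totals(values, k=12):
    """Sum of each run of k consecutive observations."""
    return [sum(values[i:i + k]) for i in range(len(values) - k + 1)]


def centred_moving_average(values, k=12):
    """Add successive k-point totals in pairs and divide by 2k."""
    totals = moving_totals(values, k)
    paired = [first + second for first, second in zip(totals, totals[1:])]
    return [pair / (2 * k) for pair in paired]


totals = moving_totals(series)
trend = centred_moving_average(series)

print(totals[:4])
print([round(value, 2) for value in trend[:4]])
```

Subtracting each moving-average value from the corresponding observation gives the deviation from trend used for the additive seasonal indices.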

Table 11.2: Computations of the Trend, Seasonal Indices and De-seasonalized Data

Col. 1 Col. 2 Col. 3 Col. 4 Col. 5 Col. 6 Col. 7 Col. 8

Year/Month Sales 12-Point MT Add-in-pairs Moving Average Dev. from Trend Seasonal Indices De-seasonalized data

2005 JAN 10 n/a n/a n/a n/a -4.44 14.4

FEB 18 n/a n/a n/a n/a -3.97 22.0

MAR 22 n/a n/a n/a n/a -4.62 26.6

APRIL 8 n/a n/a n/a n/a -0.45 8.4

MAY 16 n/a n/a n/a n/a -2.15 18.2

JUN 10 172 349 14.54 -4.54 0.66 9.3

JUL 18 177 348 14.50 3.50 5.72 12.3

AUG 20 171 333 13.88 6.13 7.85 12.2

SEP 15 162 331 13.79 1.21 0.74 14.3

OCT 10 169 333 13.88 -3.88 1.33 8.7

NOV 14 164 334 13.92 0.08 4.09 9.9

DEC 11 170 344 14.33 -3.33 -4.76 15.8

2006 JAN 15 174 358 14.92 0.08 -4.44 19.4

FEB 12 184 373 15.54 -3.54 -3.97 16.0

MAR 13 189 383 15.96 -2.96 -4.62 17.6

APRIL 15 194 399 16.63 -1.63 -0.45 15.4

MAY 11 205 409 17.04 -6.04 -2.15 13.2

JUN 16 204 403 16.79 -0.79 0.66 15.3

JUL 22 199 396 16.50 5.50 5.72 16.3

AUG 30 197 390 16.25 13.75 7.85 22.2

SEP 20 193 391 16.29 3.71 0.74 19.3

OCT 15 198 395 16.46 -1.46 1.33 13.7

NOV 25 197 396 16.50 8.50 4.09 20.9

DEC 10 199 392 16.33 -6.33 -4.76 14.8

2007 JAN 10 193 376 15.67 -5.67 -4.44 14.4

FEB 10 183 367 15.29 -5.29 -3.97 14.0

MAR 9 184 371 15.46 -6.46 -4.62 13.6


APRIL 20 187 365 15.21 4.79 -0.45 20.4

MAY 10 178 360 15.00 -5.00 -2.15 12.2

JUN 18 182 362 15.08 2.92 0.66 17.3

JUL 16 180 359 14.96 1.04 5.72 10.3

AUG 20 179 361 15.04 4.96 7.85 12.2

SEP 21 182 358 14.92 6.08 0.74 20.3

OCT 18 176 361 15.04 2.96 1.33 16.7

NOV 16 185 372 15.50 0.50 4.09 11.9

DEC 14 187 383 15.96 -1.96 -4.76 18.8

2008 JAN 8 196 397 16.54 -8.54 -4.44 12.4

FEB 9 201 398 16.58 -7.58 -3.97 13.0

MAR 12 197 391 16.29 -4.29 -4.62 16.6


APRIL 14 194 387 16.13 -2.13 -0.45 14.4

MAY 19 193 379 15.79 3.21 -2.15 21.2

JUN 20 186 374 15.58 4.42 0.66 19.3

JUL 25 188 376 15.67 9.33 5.72 19.3

AUG 25 188 374 15.58 9.42 7.85 17.2

SEP 17 186 370 15.42 1.58 0.74 16.3

OCT 15 184 359 14.96 0.04 1.33 13.7

NOV 15 175 348 14.50 0.50 4.09 10.9

DEC 7 173 349 14.54 -7.54 -4.76 11.8

2009 JAN 10 176 357 14.88 -4.88 -4.44 14.4

FEB 9 181 360 15.00 -6.00 -3.97 13.0

MAR 10 179 360 15.00 -5.00 -4.62 14.6

APRIL 12 181 362 15.08 -3.08 -0.45 12.4

MAY 10 181 362 15.08 -5.08 -2.15 12.2

JUN 18 181 360 15.00 3.00 0.66 17.3

JUL 28 179 361 15.04 12.96 5.72 22.3

AUG 30 182 364 15.17 14.83 7.85 22.2

SEP 15 182 367 15.29 -0.29 0.74 14.3

OCT 17 185 372 15.50 1.50 1.33 15.7

NOV 15 187 367 15.29 -0.29 4.09 10.9

DEC 7 180 348 14.50 -7.50 -4.76 11.8

2010 JAN 8 168 319 13.29 -5.29 -4.44 12.4

FEB 12 151 296 12.33 -0.33 -3.97 16.0

MAR 10 145 291 12.13 -2.13 -4.62 14.6

APRIL 15 146 299 12.46 2.54 -0.45 15.4

MAY 12 153 309 12.88 -0.88 -2.15 14.2

JUN 11 156 313 13.04 -2.04 0.66 10.3

JUL 16 157 315 13.13 2.88 5.72 10.3

AUG 13 158 309 12.88 0.13 7.85 5.2

SEP 9 151 295 12.29 -3.29 0.74 8.3

OCT 18 144 291 12.13 5.88 1.33 16.7

NOV 22 147 298 12.42 9.58 4.09 17.9

DEC 10 151 304 12.67 -2.67 -4.76 14.8

2011 JAN 9 153 312 13.00 -4.00 -4.44 13.4

FEB 13 159 319 13.29 -0.29 -3.97 17.0

MAR 3 160 320 13.33 -10.33 -4.62 7.6

APRIL 8 160 321 13.38 -5.38 -0.45 8.4

MAY 15 161 321 13.38 1.63 -2.15 17.2

JUN 15 160 321 13.38 1.63 0.66 14.3

JUL 18 161 317 13.21 4.79 5.72 12.3

AUG 19 156 320 13.33 5.67 7.85 11.2

SEP 10 164 333 13.88 -3.88 0.74 9.3


OCT 18 169 331 13.79 4.21 1.33 16.7

NOV 23 162 320 13.33 9.67 4.09 18.9

DEC 9 158 313 13.04 -4.04 -4.76 13.8

2012 JAN 10 155 308 12.83 -2.83 -4.44 14.4

FEB 8 153 307 12.79 -4.79 -3.97 12.0

MAR 11 154 294 12.25 -1.25 -4.62 15.6

APRIL 13 140 272 11.33 1.67 -0.45 13.4

MAY 8 132 263 10.96 -2.96 -2.15 10.2

JUN 11 131 0.66 10.3

JUL 15 5.72 9.3

AUG 17 7.85 9.2

SEP 11 0.74 10.3

OCT 4 1.33 2.7

NOV 15 4.09 10.9

DEC 8 -4.76 12.8
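
The arithmetic behind Cols. 3 to 6 of Table 11.2 can be sketched in a few lines of Python. This is only an illustration, not the manual's own procedure: the function and variable names are hypothetical, and only the 2005 and 2006 sales figures from the table are used. The table records the first trend value against JUN 2005, and the sketch mirrors that row alignment as an assumption taken from the table.

```python
# Sales for 2005 and 2006, copied from the Sales column of Table 11.2.
sales = [10, 18, 22, 8, 16, 10, 18, 20, 15, 10, 14, 11,   # 2005
         15, 12, 13, 15, 11, 16, 22, 30, 20, 15, 25, 10]  # 2006

PERIOD = 12  # monthly data, so each moving total spans 12 observations


def moving_totals(series, period):
    """Sum of every run of `period` consecutive observations (Col. 3)."""
    return [sum(series[i:i + period]) for i in range(len(series) - period + 1)]


def centred_moving_average(series, period):
    """Add adjacent moving totals in pairs (Col. 4), then divide by 2 * period (Col. 5)."""
    totals = moving_totals(series, period)
    paired = [a + b for a, b in zip(totals, totals[1:])]
    trend = [p / (2 * period) for p in paired]
    return totals, paired, trend


totals, paired, trend = centred_moving_average(sales, PERIOD)

# Assumption: follow the table, which lists the first trend value
# against JUN 2005 (index 5 of the series).
OFFSET = 5
deviations = [sales[i + OFFSET] - t for i, t in enumerate(trend)]  # Col. 6

print(totals[:3])                             # [172, 177, 171]
print(paired[:2])                             # [349, 348]
print([round(t, 2) for t in trend[:2]])       # [14.54, 14.5]
print([round(d, 2) for d in deviations[:2]])  # [-4.54, 3.5]
```

The printed values agree with the JUN 2005 and JUL 2005 rows of the table, which is a quick way to check any row by hand.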

Table 11.3 Showing Seasonal Indices

Month/Year JAN FEB MAR APRIL MAY JUN JUL AUG SEP OCT NOV DEC

2005 - - - - - -4.54 3.50 6.13 1.21 -3.88 0.08 -3.33

2006 0.08 -3.54 -2.96 -1.63 -6.04 -0.79 5.50 13.75 3.71 -1.46 8.50 -6.33

2007 -5.67 -5.29 -6.46 4.79 -5.00 2.92 1.04 4.96 6.08 2.96 0.50 -1.96

2008 -8.54 -7.58 -4.29 -2.13 3.21 4.42 9.33 9.42 1.58 0.04 0.50 -7.54

2009 -4.88 -6.00 -5.00 -3.08 -5.08 3.00 12.96 14.83 -0.29 1.50 -0.29 -7.50

2010 -5.29 -0.33 -2.13 2.54 -0.88 -2.04 2.88 0.13 -3.29 5.88 9.58 -2.67

2011 -4.00 -0.29 -10.33 -5.38 1.63 1.63 4.79 5.67 -3.88 4.21 9.67 -4.04

2012 -2.83 -4.79 -1.25 1.67 -2.96 - - - - - - -

Total -31.13 -27.83 -32.42 -3.21 -15.13 4.58 40.00 54.88 5.13 9.25 28.54 -33.38

AVG -4.45 -3.98 -4.63 -0.46 -2.16 0.65 5.71 7.84 0.73 1.32 4.08 -4.77 -0.10

Adjustment 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 -0.01

S.I -4.44 -3.97 -4.62 -0.45 -2.15 0.66 5.72 7.85 0.74 1.33 4.09 -4.76 0.00
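
The last three rows of Table 11.3 can be reproduced with the short Python sketch below. It is a hedged illustration rather than the manual's own code: the names are hypothetical, the monthly totals are copied from the Total row, and the divisor 7 assumes seven deviations per month, as the table shows. The adjustment is taken here as the mean of the twelve averages, subtracted from each so that the indices sum to zero.

```python
MONTHS = ["JAN", "FEB", "MAR", "APRIL", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Monthly totals of the deviations from trend (Total row of Table 11.3).
totals = [-31.13, -27.83, -32.42, -3.21, -15.13, 4.58,
          40.00, 54.88, 5.13, 9.25, 28.54, -33.38]

N_YEARS = 7  # assumption: each month has seven deviations in the table

# AVG row: mean deviation for each month.
averages = [t / N_YEARS for t in totals]

# Adjustment: remove the mean of the averages so the indices sum to zero.
correction = sum(averages) / len(averages)
seasonal_indices = [a - correction for a in averages]


def deseasonalize(value, month):
    """Additive model: subtract the month's seasonal index from the observation."""
    return value - seasonal_indices[MONTHS.index(month)]


print(round(sum(averages), 2))                  # -0.1
print([round(s, 2) for s in seasonal_indices])  # matches the S.I row
print(round(deseasonalize(10, "JAN"), 1))       # 14.4 (JAN 2005 in Table 11.2)
```

Subtracting the index, rather than dividing by it, is what makes this the additive version of the adjustment.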

Fig. 1.5: Time Plot of Sales of Umbrella, Moving average and De-seasonalized data

Summary for Study Session 11

In this study session, you have learnt about:

1. The components of a time series, illustrated with charts.

2. The additive and multiplicative methods of analysing time series data, with examples.

3. The procedure for constructing seasonal indices and de-seasonalized data, illustrated with worked examples.



SAQ 11.1 - 11.3

1. Explain clearly the reasons for analyzing time series data.

2. Sixteen successive observations of a given time series are:

1.6, 0.8, 1.2, 0.5, 0.9, 1.1, 1.1, 0.6, 1.5,

0.8, 0.9, 1.2, 0.5, 1.3, 0.8, 1.2

(i) Obtain the time plot of the observations.

(ii) Use a 3-point moving average to estimate the trend values.

3. For the following time series:

Year $Y_t$

1990 2.4

1991 3.6

1992 5.4

1993 7.8

1994 11.6

1995 17.3

(i) Fit a linear trend to the above data and (ii) Fit a Quadratic trend.

References

Brookes, C. B. and Dick, W. F. (1969): An Introduction to Statistical Method, Second Edition. Stoughton.

Gupta, C. B. (1973): An Introduction to Statistical Methods. New Delhi: Vikas Publishing House PVT Ltd.

Moore, D. S. and McCabe, G. P. (1993): Introduction to the Practice of Statistics, Second Edition. New Delhi: W. H. Freeman and Co.

Shittu, O. I. and Yaya, O. S. (2011): Introduction to Time Series Analysis. Babs-Tunde Intercontinental Print, Nigeria. ISBN 978-33867-1-9. pp. 282
