Lesson 1 07 measures of variation


Chapter 1:

Describing Data

Lesson 7: Measures of Variation

TIME FRAME: 1 hour session

OVERVIEW OF LESSON

In this lesson, students see that measures of central tendency alone are not enough to describe
a data set. They examine two data sets representing the returns on two stocks. Here, the means
are the same, and the measures of spread (range, variance and standard deviation) of the two
data sets are also the same. The standard deviation can be viewed as a measure of risk. The main
learning here is that a mixture of the two stocks, whose return is the average of the returns of
the two stocks, carries less risk, because the standard deviation of the mixture is smaller.

LEARNING OUTCOME(S): At the end of the lesson, the learner is able to

calculate some measures of variation,

think of the strengths and limitations of these measures,

provide a sound interpretation of these measures in a business setting.

LESSON OUTLINE:

1. Introduction

2. Case Study: Returns on Stocks

3. Analysis and Comments on Case

DEVELOPMENT OF THE LESSON

(A) Introduction

Discuss with students the importance of thinking of their future, of saving, and of wealth
generation. Explain that a number of people invest money in the stock market as an
alternative way to generate wealth from savings. (Explanatory Note: Stocks
are shares of ownership in a company. When people buy stocks they become part owners of
the company, sharing in both its profits and its losses.)

Mention to students that the past performance of a particular stock may be a useful
guide to what may be expected of its performance in the foreseeable future. (This is, of
course, a very big assumption, but we adopt it for the purposes of this lesson.)

(B) Case Study: Returns on Stocks

Provide the following data to students representing the rates of return for two stocks, which

we’ll call stock A and stock B.


Year    Stock A    Stock B
2005    0.081      0.214
2006    0.231      0.193
2007    0.214      0.133
2008    0.214      0.071
2009    0.181      0.066
2010    0.241      0.081
2011    0.193      0.181
2012    0.133      0.231
2013    0.071      0.214
2014    0.066      0.241

Inform students that the rate of return is defined as the increase in value of the portfolio
(including any dividends or other distributions) during the year divided by its value at the
beginning of the year. For instance, if the parents of Juana dela Cruz invest 50,000 pesos in
a stock at the beginning of the year, and the value of the stock rises to 60,000 pesos by the
end of the year, an increase in value of 10,000 pesos, then the rate of return is
10,000/50,000 = 0.20.
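The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `rate_of_return` and its `distributions` parameter are hypothetical names chosen here, not part of the lesson materials.

```python
def rate_of_return(start_value, end_value, distributions=0.0):
    """Increase in value (plus any dividends or other distributions)
    divided by the value at the beginning of the year."""
    return (end_value + distributions - start_value) / start_value

# The example from the text: 50,000 pesos grows to 60,000 pesos.
gain_example = rate_of_return(50_000, 60_000)

# A hypothetical losing year: 50,000 pesos falls to 45,000 pesos,
# so the rate of return is negative.
loss_example = rate_of_return(50_000, 45_000)

print(gain_example, loss_example)
```

The second call shows why the rate of return can be negative: the end value is below the starting value.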

Explain to students that the rate of return may be positive or negative. It represents the

fraction by which your wealth would have changed had it been invested in that particular

combination of securities.

Divide students into groups of three, and ask them to obtain the average return for each of
the two stocks and the standard deviation of the rates of return.

Tell them to use the historical performance of the stocks as a guide to making an investment
decision. Instruct them to look at summary measures of variability (such as the range and the
standard deviation) of the rates of return, and use these as measures of risk associated with
investing in a given security. Discuss whether it would make any difference if we decide to
invest wholly in stock A, wholly in stock B, or half of our investments in stock A and half in
stock B. Ask them why.

Notes on Calculating Measures of Variation

(i) A simple measure of variation is the range, the difference between the maximum and

minimum values.

The range is the difference between the largest and smallest

values in a data set.

While the range is simple, it only depends on the extremes; it ignores information

about what goes on between the smallest (minimum) and largest (maximum) values

in a data set.
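As a rough illustration of this limitation, the sketch below computes the range of the stock A returns from the case study and then changes one value in the middle of the data. The helper name `data_range` and the altered value 0.100 are assumptions made for the example.

```python
# Rates of return for stock A, 2005 to 2014 (from the case study table).
stock_a = [0.081, 0.231, 0.214, 0.214, 0.181,
           0.241, 0.193, 0.133, 0.071, 0.066]

def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

range_a = data_range(stock_a)

# Change a value that is neither the minimum nor the maximum
# (a hypothetical alteration, for illustration only).
altered = stock_a.copy()
altered[3] = 0.100
range_altered = data_range(altered)

print(round(range_a, 3), round(range_altered, 3))
```

Both ranges are the same, since only the extremes enter the calculation.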


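TECHNICAL NOTES

Given a data set $x_1, x_2, \ldots, x_N$, denote the mean as

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

(a) the variance, denoted as $\sigma^2$, is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

(b) the population standard deviation, denoted as $\sigma$, is the square root of the variance

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$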

(ii) Explain that a more widely used measure of variation is the standard deviation,
although its calculation is more complex. To motivate its calculation, consider a list of
data. Each of the data entries deviates from the average by some amount,

deviation from the average = entry - average

We may want to have a measure of spread based on all these deviations. Getting the

mean of these deviations always yields a value of zero regardless of the values of a

data distribution. However, the average of the absolute value of these deviations is

nontrivial. It is called the mean absolute deviation and is useful for measuring

spread. This measure, alas, does not have very interesting mathematical properties.

An alternative to the mean absolute deviation is the variance, formed by taking the

mean of the squared deviations from the average. Unlike the mean absolute deviation,

the variance has some interesting mathematical properties; but, we omit discussion of

these properties here. The variance has the square of the units of the data. If we take

its square root, we get the standard deviation.

The variance is the mean of the squared deviations from the

average. The standard deviation is the square root of the

variance.
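The chain of ideas above (deviations, their zero mean, the mean absolute deviation, the variance and the standard deviation) can be traced with the stock A returns from the case study. This is a minimal sketch; the variable names are chosen for the illustration, and the variance divides by the number of data values as in the Technical Notes.

```python
from math import sqrt

# Rates of return for stock A, 2005 to 2014 (from the case study table).
stock_a = [0.081, 0.231, 0.214, 0.214, 0.181,
           0.241, 0.193, 0.133, 0.071, 0.066]

n = len(stock_a)
mean_a = sum(stock_a) / n

# Deviation from the average = entry - average.
deviations = [x - mean_a for x in stock_a]

# The mean of the deviations is zero (up to floating-point rounding).
mean_deviation = sum(deviations) / n

# Mean absolute deviation: average of the absolute deviations.
mean_abs_deviation = sum(abs(d) for d in deviations) / n

# Variance: mean of the squared deviations (divides by n).
variance = sum(d ** 2 for d in deviations) / n

# Standard deviation: square root of the variance, in the original units.
std_dev = sqrt(variance)

print(f"mean deviation          = {mean_deviation:.6f}")
print(f"mean absolute deviation = {mean_abs_deviation:.6f}")
print(f"variance                = {variance:.8f}")
print(f"standard deviation      = {std_dev:.6f}")
```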

If a list of data has no variation, that is, all the data items are constant, then the mean

is that constant, and the variance and standard deviation are both zero. Furthermore, if

for a given list of data, the variance and standard deviation turn out to be zero, then

all the deviations from the average must be zero, which means that all the members of


the list are equal. Note that if a data set is rescaled, i.e. if the data are multiplied
by some constant, then the standard deviation of the new data set is the absolute value of
the scaling factor multiplied by the standard deviation of the original data set.
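These two properties can be checked numerically. The sketch below uses `pstdev` from Python's standard `statistics` module, which divides by the number of data values; the constant list and the scaling factors 100 and -100 are assumptions chosen for the illustration.

```python
from statistics import pstdev

# Rates of return for stock A, 2005 to 2014 (from the case study table).
stock_a = [0.081, 0.231, 0.214, 0.214, 0.181,
           0.241, 0.193, 0.133, 0.071, 0.066]

# A list with no variation: every entry is the same constant.
constant = [0.10] * 5
sd_constant = pstdev(constant)

# Rescale the returns: express them in percent, then also flip the sign.
percent = [100 * x for x in stock_a]
flipped = [-100 * x for x in stock_a]

sd_a = pstdev(stock_a)
ratio_percent = pstdev(percent) / sd_a
ratio_flipped = pstdev(flipped) / sd_a

print(sd_constant, ratio_percent, ratio_flipped)
```

Both ratios come out as 100 up to rounding, even for the factor -100, because the standard deviation is never negative.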

The variance and standard deviation are based on all items in the list, and each item is
given its proper weight. They are extremely useful measures of variability as they
measure the average scattering of the data around the mean, i.e. how far the data
fall above and below the average. The variance and standard deviation increase
as the deviations about the mean grow larger, and decrease as these deviations grow
smaller. A small standard deviation (and variance) means a high degree of uniformity
in the observations and of homogeneity in a series.

The variance is most suitable for algebraic manipulations but as was pointed out

earlier, its computation results in squared units. On the other hand, the standard

deviation has a value in the original units of the data. Thus, it serves as the primary

measure of variation just as the mean is the primary measure of central location.

(C) Analysis and Comments on Case

This case introduces the idea of expected return and risk, as measured by the average and

standard deviation of the rates of return to a stock. The main point in the case is to measure

risk of stocks with the standard deviation.

Students should discover that even if two data sets have a common standard deviation,
a mixture of the two portfolios can have a lower standard deviation. This
provides an empirical basis for the familiar investment advice: “do not put all your eggs in
one basket!”

The means, ranges and standard deviations for stocks A and B are identical (mean 0.1625,

range 0.175 and standard deviation 0.065291).

Some students may notice that the rates of return in B are the same as in A except for some

rearrangement in the order, and at this point, they might stop and say it does not make any

difference which one to invest in: A or B (but what about the mixture of A and B?).

There may be other recommendations and observations made by students. The main
discovery of students should be that when we compute the returns that would have been
received for the 50/50 mix (as 0.5*A + 0.5*B), the mean, range and standard deviation of
that sequence are 0.1625, 0.0885 and 0.024862, respectively, so that the mixture gives the
same return with less risk.
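The comparison can be reproduced with a short Python sketch. It assumes the population form of the standard deviation (`statistics.pstdev`, dividing by the number of years), which is the form that matches the figures quoted in this section; the helper name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Rates of return, 2005 to 2014 (from the case study table).
stock_a = [0.081, 0.231, 0.214, 0.214, 0.181,
           0.241, 0.193, 0.133, 0.071, 0.066]
stock_b = [0.214, 0.193, 0.133, 0.071, 0.066,
           0.081, 0.181, 0.231, 0.214, 0.241]

# Year-by-year return of a portfolio split 50/50 between A and B.
mix = [0.5 * a + 0.5 * b for a, b in zip(stock_a, stock_b)]

def summarize(returns):
    """Return (mean, range, population standard deviation)."""
    return mean(returns), max(returns) - min(returns), pstdev(returns)

for name, series in [("Stock A", stock_a), ("Stock B", stock_b), ("50/50 mix", mix)]:
    m, r, s = summarize(series)
    print(f"{name:10s} mean={m:.4f}  range={r:.4f}  sd={s:.6f}")
```

The mix has a smaller spread because the good years of one stock tend to coincide with the weaker years of the other, so the year-by-year averages stay closer to 0.1625.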

This case is an effective way of showing that intuition (that the mix would also not make a

difference) does not always yield something right, and that it is important to have solid data

analytics.


Recommended: It may be helpful to illustrate fast calculation of standard deviation (and

other summary measures) with a spreadsheet application (such as Microsoft Excel), as

indicated in Students’ Notes.


REFERENCES

Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo

Patungan, Nelia Marquez), published by Rex Bookstore.

Bryant−Smith (2009): Practical Data Analysis, Second Edition. McGraw-Hill/Irvine, USA.

“Range as a Measure of Variation” http://www.sharemylesson.com/teaching-resource/range-as-a-measure-of-variation-50009362


ASSESSMENT

1. The mean number of letters in ten students’ names is 8 letters. The range of the data set is 4
letters. Make a line plot showing a data set that fits this description.

One possible answer is

          x
          x
       x  x  x
    x  x  x  x  x
    6  7  8  9  10

x – number of letters of one student

Another is

    x  x  x  x  x
    x  x  x  x  x
    6  7  8  9  10

x – number of letters of one student

2. Gerald, Carmina, and Rodolfo obtained the prices (in pesos) of a jar of peanut butter at
several grocery stores. Below are the data they collected:

100.80 197.60 158.00 131.60 184.40 149.20
136.00 109.60 360.40 122.80 131.60

After analyzing the data, Gerald said, “The prices of peanut butter are pretty similar. The range
is only PHP 30.80.” Carmina said, “You are mistaken! The prices are very different. The range
is PHP 259.60. That is a big difference in terms of peanut butter.” Rodolfo said, “I think you are
both mistaken. The range isn’t a useful measure to describe this set of data.”

a) Explain what you think each person was thinking.


ANSWER:

Gerald did not order the data set from smallest to largest, and erroneously subtracted the first

value (100.80) from the last value (131.60) in the data set.

Carmina found the range correctly by subtracting the smallest value (100.80) from the largest

value (360.40).

Rodolfo noticed that the maximum, 360.40, is an outlier in this set of data. As a result, a range of
PHP 259.60 does not describe the variation of the set of data well, as it was unduly increased
by the extreme value.

b) Who should we agree with? Why?

ANSWER:

We can agree with both Carmina and Rodolfo.

Carmina correctly calculated the range;

Rodolfo astutely observed that while Carmina was correct in her calculations, the range is not
very useful in describing the variability of this set of data, as the range would only be PHP 96.80
if the outlier were removed from the data set.
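The three figures in this item can be checked with a short Python sketch. The variable names are chosen for the illustration; the prices are entered in the order in which they are listed in the question.

```python
# Peanut butter prices in pesos, in the order listed in the question.
prices = [100.80, 197.60, 158.00, 131.60, 184.40, 149.20,
          136.00, 109.60, 360.40, 122.80, 131.60]

# Gerald's mistake: last listed value minus first listed value,
# without ordering the data first.
gerald_range = prices[-1] - prices[0]

# Carmina's (correct) range: largest value minus smallest value.
carmina_range = max(prices) - min(prices)

# Rodolfo's point: drop the largest price and recompute the range.
without_outlier = sorted(prices)[:-1]
range_without_outlier = max(without_outlier) - min(without_outlier)

print(f"Gerald:          {gerald_range:.2f}")
print(f"Carmina:         {carmina_range:.2f}")
print(f"Without outlier: {range_without_outlier:.2f}")
```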

3. Three hundred students taking a first course in Statistics are provided a common final

examination. The following histogram shows the distribution of the final scores.

[Histogram of final examination scores: horizontal axis from 40 to 120; bars labeled A to H]


Suppose the professor gives 30% weight to the Final Examination. What effect would
multiplying all the Final Exam Scores by 30% have on the mean of the Final
Exam Scores? On the standard deviation of the Final Exam Scores?

Answer: The mean is rescaled by 30%, and so is the standard deviation.

Suppose the professor wants to raise the Final Examination Scores. What effect
would adding 5 points to all the Final Exam Scores have on the mean of the Final Exam
Scores? On the standard deviation of the Final Exam Scores?

Answer: The mean goes up by 5 points; the standard deviation stays the same.
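Both answers follow from one short derivation, sketched here in the notation of the Technical Notes. The symbols $a$, $b$ and $y_i$ are introduced only for this illustration: each score $x_i$ is transformed to $y_i = a x_i + b$.

```latex
\[
\mu_y = \frac{\sum_{i=1}^{N} (a x_i + b)}{N} = a\,\mu_x + b
\]
\[
\sigma_y
= \sqrt{\frac{\sum_{i=1}^{N} \bigl(a x_i + b - (a\,\mu_x + b)\bigr)^2}{N}}
= \sqrt{\frac{a^2 \sum_{i=1}^{N} (x_i - \mu_x)^2}{N}}
= |a|\,\sigma_x
\]
```

With $a = 0.30$ and $b = 0$, the mean and the standard deviation are both multiplied by 0.30. With $a = 1$ and $b = 5$, the mean rises by 5 points while the standard deviation is unchanged, because the added constant cancels in every deviation.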

4. In a fitness center, the weights of a certain group of students were taken, and every student was found to weigh 140 pounds. What would be the standard deviation of the distribution of weights?

Answer:

Zero (since the data do not vary).

5. Determine which of the following five statements is true or false and explain briefly.

a. The average and median of any list of data are always close together. (Answer: False)

b. Half of a list of data is always below the average. (Answer: False; this property describes the median, not the average)

c. If entries in a list are doubled, then the average is doubled. (Answer: True)

d. If entries in a list are doubled, then the standard deviation is doubled. (Answer: True)

e. If in a set of data, positive numbers are changed to negative, while negative are changed

to positive, the Standard Deviation changes sign as well. (Answer: False, Standard

deviation is always nonnegative)

Explanatory Note:

Teachers have the option to just ask this assessment orally to the entire class to either introduce or recall the notions of computing the range and of computing the standard

deviation, or to group students and ask them to identify answers, or to give this as

homework, or to use some questions/items here for a chapter examination.


HANDOUT FOR STUDENTS

Computing the Standard Deviation

Consider the following five data representing the difference in scores of two players in a

computer game:

-47, 28, 37, -47, 34

To compute the standard deviation here (in the form that divides by the number of data values), five steps are needed:

(a) compute the mean, i.e. sum the values in the first column and divide by
the number of items, thus yielding: $\bar{x} = 5/5 = 1$

(b) subtract the mean from each of the items (yielding the “deviations from
the average”) and thus obtain the second column in the table below.

Difference in scores    Deviations from    Squared
of two players          the Average        Deviations
-47                     -48                2304
 28                      27                 729
 37                      36                1296
-47                     -48                2304
 34                      33                1089

(c) square the deviations from the average (the items in the second column), and
thus obtain the third column of the table above.

(d) sum the values in the third column and divide by the number of data, thus yielding
Variance = 7722/5 = 1544.4

(e) take the square root of the result from (d): Standard Deviation is nearly 39.3
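Steps (a) to (e) can be followed one at a time in Python. This is a minimal sketch with variable names chosen for the illustration; it divides by the number of data values, as in step (d).

```python
from math import sqrt

# Difference in scores of two players (first column of the table).
scores = [-47, 28, 37, -47, 34]
n = len(scores)

# (a) mean: sum the values and divide by the number of items.
mean_score = sum(scores) / n

# (b) deviations from the average (second column).
deviations = [x - mean_score for x in scores]

# (c) squared deviations (third column).
squared_deviations = [d ** 2 for d in deviations]

# (d) variance: sum of the third column divided by the number of data.
variance = sum(squared_deviations) / n

# (e) standard deviation: square root of the variance.
std_dev = sqrt(variance)

print(mean_score, deviations, squared_deviations)
print(variance, round(std_dev, 1))
```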

In practice, the sum of the values in the third column may be obtained in a much faster
and more efficient way. This calculation involves (a*) summing the squared values of the
first column, which gives 7727, and (b*) subtracting from (a*) the product of the number
of data values and the square of the mean, thus yielding 7727 - (5)(1)^2 = 7722. This is the
so-called computational scheme for obtaining the variance (and standard deviation).
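The computational scheme rests on a short algebraic identity, sketched here with $N$ for the number of data values and $\bar{x}$ for their mean. It uses only the fact that $\sum_{i=1}^{N} x_i = N\bar{x}$.

```latex
\[
\sum_{i=1}^{N} (x_i - \bar{x})^2
= \sum_{i=1}^{N} x_i^2 - 2\bar{x}\sum_{i=1}^{N} x_i + N\bar{x}^2
= \sum_{i=1}^{N} x_i^2 - 2N\bar{x}^2 + N\bar{x}^2
= \sum_{i=1}^{N} x_i^2 - N\bar{x}^2
\]
\[
(-47)^2 + 28^2 + 37^2 + (-47)^2 + 34^2 - 5\,(1)^2 = 7727 - 5 = 7722
\]
```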

TECHNICAL NOTE:

Given a data set $x_1, x_2, \ldots, x_N$, denote the mean as

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

The variance, denoted as $\sigma^2$, is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

while the population standard deviation, denoted as $\sigma$, is the square root of the
variance, i.e.,

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

If the first column of the table were entered in an Excel spreadsheet as in Figure 6.1,
we merely have to enter the command

= STDEVP(A2:A6)

in an empty cell (such as A9) and obtain the “population” standard deviation as
approximately 39.3
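For readers without a spreadsheet, a comparable one-line calculation is available in Python's standard `statistics` module. This is offered as an alternative sketch, not as part of the original handout: `pstdev` divides by the number of data values, like the “population” standard deviation above, while `stdev` divides by one less than that number and so gives a slightly larger value.

```python
from statistics import pstdev, stdev

# Difference in scores of two players (the handout data).
scores = [-47, 28, 37, -47, 34]

population_sd = pstdev(scores)  # divides by N
sample_sd = stdev(scores)       # divides by N - 1

print(round(population_sd, 1), round(sample_sd, 1))
```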
