1 Chapter 3 – Descriptive Statistics Numerical Measures

1

Chapter 3 – Descriptive Statistics

Numerical Measures

2

Chapter Outline

Measures of Central Location Mean Median Mode Percentile (Quartile, Quintile, etc.)

Measures of Variability Range Variance (Standard Deviation, Coefficient of Variation)

3

A Recall

A sample is a subset of a population. Numerical measures calculated for sample data are

called sample statistics. Numerical measures calculated for population data

are called population parameters. A sample statistic is referred to as the A sample statistic is referred to as the point point

estimatorestimator of the corresponding population of the corresponding population parameter.parameter.

4

Mean

As a measure of central location, mean is simply the arithmetic average of all the data values.

The sample mean is the point estimator of the population mean .

x

5

Sample Mean x

n

xx i

The symbol (called sigma) means ‘sum up’.

is the value of th observation in the sample.

n is the number of observations in the sample.

ix i

6

Population Mean

The symbol (called sigma) means ‘sum up’.

is the value of th observation in the sample.

N is the number of observations in the population.

is pronounced as ‘miu’.

ix i

N

xi

7

Sample Mean Example: Sales of Starbucks Stores

50 Starbucks stores are randomly chosen in the NYC. The table below shows the sales of those stores in December 2012.

95 77 97 99 89108 120 78 79 8867 97 97 79 9399 103 106 82 9393 97 95 61 10977 88 100 109 9086 89 97 93 8893 105 87 82 98

119 104 93 104 101118 105 82 73 101

8

Sample Mean

Example: Sales of Starbucks Stores

95 77 97 99 89108 120 78 79 8867 97 97 79 9399 103 106 82 9393 97 95 61 10977 88 100 109 9086 89 97 93 8893 105 87 82 98

119 104 93 104 101118 105 82 73 101

69.9350

685,4

n

xx i

9

Median

Whenever a data set has extreme values, the medianWhenever a data set has extreme values, the median is the preferred measure of central location.is the preferred measure of central location.

A few extremely large incomes or property valuesA few extremely large incomes or property values can inflate the mean since the calculation of meancan inflate the mean since the calculation of mean uses all the data items.uses all the data items.

The median is the measure of location most oftenThe median is the measure of location most often reported for annual income and property value data.reported for annual income and property value data.

The The medianmedian of a data set is the value in the middle of a data set is the value in the middle when the data items are arranged in ascending order.when the data items are arranged in ascending order.

10

Median

1212 1414 1919 2626 27271818 2727

For an For an odd numberodd number of observations: of observations:

in ascending orderin ascending order

2626 1818 2727 1212 1414 2727 1919 7 observations7 observations

the median is the middle value.the median is the middle value.

Median = 19Median = 19

11

Median

For an For an even numbereven number of observations: of observations:

1212 1414 1919 2626 27271818 2727

2626 1818 2727 1212 1414 2727 1919 8 observations8 observations

the median is the average of the middle two values.the median is the average of the middle two values.

Median = (19 + 26)/2 = 22.5Median = (19 + 26)/2 = 22.5

3030

3030 in ascending orderin ascending order

12

Mean vs. Median As noted, extremes values can change means remarkably,As noted, extremes values can change means remarkably, while medians might not be affected much by extremewhile medians might not be affected much by extreme values. Therefore, in that regard, median is a better values. Therefore, in that regard, median is a better representative of central location.representative of central location.

1212 1414 1919 2626 27271818 2727 3030

3030

280280

For the previous example, the median is 22.5 and the mean is 21.6. If we add one large number (280) to the data, the median becomes 26 (the value in the middle). But the mean becomes 50.3. In this case we prefer median to mean as a measure of central location.

13

Mode

The mode of a data set is the value that occurs most frequently.

The greatest frequency can occur at two or more different values.

If the data have exactly two modes, the data are bimodal. If the data have more than two modes, the data are

multimodal. Caution: If the data are bimodal or multimodal, Excel’s

MODE function will incorrectly identify a single mode.

14

Mode

1212 1414 1919 2626 27271818 2727 3030

For the example above, 27 shows up twice while all the other data values show up once. So, the mode is 27.

15

Percentiles

A percentile provides information about how the data are A percentile provides information about how the data are spread over the interval spread over the interval from the smallest value to the from the smallest value to the largest valuelargest value..

Admission test scores for colleges and universities are Admission test scores for colleges and universities are frequently reported in terms of percentiles.frequently reported in terms of percentiles.

The The ppth percentileth percentile of a data set is of a data set is a valuea value such that at least such that at least pp percent of the items are less than or equal to this value percent of the items are less than or equal to this value and at least (100 - and at least (100 - pp) percent of the items are more than or ) percent of the items are more than or equal to this value.equal to this value.

The 50The 50thth percentile is simply the median. percentile is simply the median.

16

Percentiles

Arrange the data in ascending order. Arrange the data in ascending order.

Compute index i, the position of the pth percentile. Compute index i, the position of the pth percentile.

If i is not an integer, round up. The p th percentile is the value in the i th position. If i is not an integer, round up. The p th percentile is the value in the i th position.

If i is an integer, the p th percentile is the average of the values in positions i and i +1. If i is an integer, the p th percentile is the average of the values in positions i and i +1.

ii = ( = (pp/100)/100)nn

17

Percentiles

Find the 75th percentile of the following data

1212 1414 1919 2626 27271818 2929 3030

Note: The data is already in ascending order.

ii = ( = (pp/100)/100)nn = (75/100)8 = 6 = (75/100)8 = 6

So, averaging the 6So, averaging the 6thth and 7 and 7thth data values: data values:7575thth percentile = (27 + 29)/2 = 28 percentile = (27 + 29)/2 = 28

18

Percentiles

Find the 20th percentile of the following data

1212 1414 1919 2626 27271818 2929 3030

Note: The data is already in ascending order.

ii = ( = (pp/100)/100)nn = (20/100)8 = 1.6, which is rounded up to 2. = (20/100)8 = 1.6, which is rounded up to 2.

So, the 20So, the 20thth percentile is simply the 2 percentile is simply the 2ndnd data value, i.e. 14. data value, i.e. 14.

19

Quartiles

Quartiles are specific percentiles.Quartiles are specific percentiles. First Quartile = 25First Quartile = 25thth percentile percentile Second Quartile = 50Second Quartile = 50thth percentile = Median percentile = Median Third Quartile = 75Third Quartile = 75thth percentile percentile

20

Measures of Variability It is often desirable to consider measures of variability (dispersion), as

well as measures of central location. For example, when two stocks provide the same average return of 5% a

year, but stock A’s return is very stable – close to 5% and stock B’s return is volatile ( it could be as low as –10%), are you indifferent with regard to which stock to invest in?

For another example, in choosing supplier A or supplier B we might For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the consider not only the average delivery time for each, but also the variability in delivery time for each.variability in delivery time for each.

21

Measures of Variability

Range Interquartile Range Variance/Standard Deviation Coefficient of Variation

22

Range

The range of a data set is the difference between the largest and smallest data values.

It is the simplest measure of variability. It is very sensitive to the smallest and largest data

values.

23

Range

Example:

1212 1414 1919 2626 27271818 2929 3030

Range = largest value - smallest valueRange = largest value - smallest value = 30 – 12 = 30 – 12 = 8= 8

24

Interquartile Range

The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile.

It is the range of the middle 50% of the data. It overcomes the sensitivity to extreme data

values.

25

Interquartile Range

Example:

1212 1414 1919 2626 27271818 2929 3030

33rdrd Quartile (Q3) = 75 Quartile (Q3) = 75thth percentile = 28 percentile = 2811stst Quartile (Q1) = 25 Quartile (Q1) = 25thth percentile = 16 percentile = 16Interquartile Range = Q3 – Q1 = 28 – 16 = 12Interquartile Range = Q3 – Q1 = 28 – 16 = 12

26

Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation (xi) and the mean ( for a sample, for a population)

The variance is useful in comparing the variability of two or more variables.

x

27

Variance

The variance is the average of the squared differences between each data value and the mean. The variance is calculated as follows:

22

( )x

Ni 2

2

( )xNis

xi x

n2

2

1

( )s

xi x

n2

2

1

( )

for afor asamplesample

for afor apopulationpopulation

28

Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, making it more appropriately interpreted than the variance.

29

Standard Deviation

for asample

for apopulation

s s 2s s 2 2

The standard deviation is computed as follows:

30

Variance and Standard Deviation

1212 1414 1919 2626 27271818 2929 3030

VarianceVariance

Standard DeviationStandard Deviation

98.48

1

2

2

n

xxs i

798.482 ss

Example

31

Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

In a comparison between two data sets with different units or with the same units but a significant difference in magnitude, coefficient of variation should be used instead of variance.

32


100 %s

x

100 %s

x

for asample

for apopulation

100 %

100 %

The coefficient of variation is computed as follows:

33


1212 1414 1919 2626 27271818 2929 3030

%32%100875.21

7%100

x

s

Example

34

Coefficient of Variation Example: Height vs. Weight

In a class of 30 students, the average height is 5’5’’ with a standard deviation of 3’’ and the average weight is 120 lbs with a standard deviation of 20 lbs. Question, in which measure (height or weight) are students more different?

Since height and weight don’t have the same unit, we have to use coefficient of variation to remove the units before comparing the variations in height and weight.

As shown below, students’ weight is more variant than their height.

%6.4%100''5'5

''3%100

height

height

x

s

%7.16%100120

20%100

weight

weight

x

s

35

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shape z-Scores Chebyshev’s Theorem Empirical Rule Detecting Outliers

36

Distribution Shape: Skewness

An important measure of the shape of a distribution is An important measure of the shape of a distribution is called called skewnessskewness..

The formula for the skewness of sample data isThe formula for the skewness of sample data is

Skewness can be easily computed using statistical Skewness can be easily computed using statistical software.software.

3

)2)(1(Skewness

s

xx

nn

n i

3

)2)(1(Skewness

s

xx

nn

n i

37

Distribution Shape: Skewness Symmetric (not skewed)

• Skewness is zero.• Mean and median are equal.

Skewness = 0 Skewness = 0

Rela

tive F

req

uen

cyR

ela

tive F

req

uen

cy

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

00

38

Distribution Shape: Skewness Skewed to the left

• Skewness is negative.• Mean is usually less than the median.

Rela

tive F

req

uen

cyR

ela

tive F

req

uen

cy

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

00

Skewness = Skewness = .33 .33

39

Distribution Shape: Skewness Skewed to the right

• Skewness is positive.• Mean is usually more than the median.

Rela

tive F

req

uen

cyR

ela

tive F

req

uen

cy

.05.05

.10.10

.15.15

.20.20

.25.25

.30.30

.35.35

00

Skewness = .31 Skewness = .31

40

Z-Scores

zx x

sii

zx x

sii

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean.

Excel’s STANDARDIZE function can be used to compute the z-score.

41

Z-Scores

An observation’s z-score is a measure of the relative location of the observation in a data set.

A data value less than the sample mean has a negative z-score.

A data value greater than the sample mean has a positive z-score.

A data value equal to the sample mean has a z-score of zero.

42

Z-Scores

1212 1414 1919 2626 27271818 2929 3030

16.17

875.213088

s

xxz

41.17

875.211211

s

xxz

Example

43

Chebyshev’s Theorem

At leastAt least (1 - 1/ (1 - 1/zz22) of the items in ) of the items in anyany data set data set will be within will be within zz standard deviations of the standard deviations of the mean, I.e. between ( ) and ( ), mean, I.e. between ( ) and ( ), where where z z is any value is any value greater thangreater than 1. 1.

Chebyshev’s theorem requires Chebyshev’s theorem requires zz > 1, but > 1, but zz need notneed not be an integer. be an integer.

xszx szx

44


At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.

45


x Example: Given that = 10 and s = 2, at least what percentage of all the data values falls into 2 standard deviations of the mean?

At least (1-1/22) = 1-1/4 = 75% of all the data values must be between 6 and 14.

= 10-2(2) = 6

= 10+2(2) = 14

szx

szx

46

Empirical Rule

When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.

The empirical rule is based on the normal distribution, which is covered in Chapter 6.

47

Empirical Rule

For data having a bell-shaped distribution:

Expected number of correct answers

About of values of a normal random variableAbout of values of a normal random variable are between are between - - and and + + ..About of values of a normal random variableAbout of values of a normal random variable are between are between - - and and + + ..

68%68%68%68%

About of values of a normal random variableAbout of values of a normal random variable are between are between - 2 - 2 and and + 2 + 2..About of values of a normal random variableAbout of values of a normal random variable are between are between - 2 - 2 and and + 2 + 2..

About of values of a normal random variableAbout of values of a normal random variable are between are between - 3 - 3 and and + 3 + 3..About of values of a normal random variableAbout of values of a normal random variable are between are between - 3 - 3 and and + 3 + 3..

95%95%95%95%

99%99%99%99%

48

Empirical Rule

Expected number of correct answers

xx – – 33 – – 11

– – 22 + 1+ 1

+ 2+ 2 + 3+ 3

About 68%About 68%About 95%About 95%About 99%About 99%

49

Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than –3 or greater than +3 might be considered an outlier.

It might be:• An incorrectly recorded data value• A data value that was incorrectly included in the data

set.• A correctly recorded data value that belongs in the data

set.

50

Measures of Association Between Two Variables

So far, we have examined numerical methods used to summarize the data for one variable at a time.

Often a manager or decision maker is interested in the relationship between two variables.

Two numerical measures of the relationship between two variables are covariance and correlation coefficient.

51

Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

52

Covariance

The covariance is computed as follows:

forforsamplessamples

forforpopulationspopulations

1

))((

n

yyxxs ii

xy 1

))((

n

yyxxs ii

xy

xyi x i yx y

N

( )( )

xy

i x i yx y

N

( )( )

53

Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

54


The correlation coefficient is computed as follows:

forforsamplessamples

forforpopulationspopulations

rs

s sxyxy

x yr

s

s sxyxy

x y

xy

xy

x y

xy

xy

x y

55


The correlation can take on values between –1 and +1.

Values near –1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.

56

Covariance and Correlation Coefficient

Example: Stock Returns The table below presents the monthly returns (in percentage) of

the market index S&P 500 (SPY) and the Apple stock (AAPL) from December 2012 to May 2013.

Date SPY AAPLDec-12 2.17 -6.76Jan-13 0.55 -1.11Feb-13 1.62 0.12Mar-13 0.83 0.01Apr-13 1.01 0.96May-13 0.24 0.10

57

DateDec-12 2.17 -6.76 1.10 -5.65 -6.21Jan-13 0.55 -1.11 -0.52 0.00 0.00Feb-13 1.62 0.12 0.55 1.24 0.68Mar-13 0.83 0.01 -0.24 1.12 -0.27Apr-13 1.01 0.96 -0.06 2.08 -0.12May-13 0.24 0.10 -0.83 1.21 -1.00Average 1.07 -1.11 Total: -6.92

Std. Dev. 0.71 2.84


Example: Stock Returns

x y xxi yyi yyxx ii

58


Example: Stock Returns

• Sample Covariance

• Sample Correlation Coefficient

38.1

16

92.6

1

n

yyxxs ii

xy

68.084.271.0

38.1

yx

xyxy ss

sr

Documents

1 Chapter 3 – Descriptive Statistics Numerical Measures