1 Chapter 3 – Descriptive Statistics Numerical Measures


Page 1: 1 Chapter 3 – Descriptive Statistics Numerical Measures

1

Chapter 3 – Descriptive Statistics

Numerical Measures

Page 2: 1 Chapter 3 – Descriptive Statistics Numerical Measures

2

Chapter Outline

Measures of Central Location: Mean, Median, Mode, Percentile (Quartile, Quintile, etc.)

Measures of Variability: Range, Variance (Standard Deviation, Coefficient of Variation)

Page 3: 1 Chapter 3 – Descriptive Statistics Numerical Measures

3

A Recall

A sample is a subset of a population.

Numerical measures calculated for sample data are called sample statistics.

Numerical measures calculated for population data are called population parameters.

A sample statistic is referred to as the point estimator of the corresponding population parameter.

Page 4: 1 Chapter 3 – Descriptive Statistics Numerical Measures

4

Mean

As a measure of central location, the mean is simply the arithmetic average of all the data values.

The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Page 5: 1 Chapter 3 – Descriptive Statistics Numerical Measures

5

Sample Mean $\bar{x}$

$$\bar{x} = \frac{\sum x_i}{n}$$

The symbol $\Sigma$ (called sigma) means ‘sum up’.

$x_i$ is the value of the $i$th observation in the sample.

$n$ is the number of observations in the sample.
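The sample mean formula can be sketched in a few lines of Python. The data values here are borrowed from the small data set this chapter uses in its median examples; the variable names are illustrative only.

```python
# Sample mean: add up the observations and divide by how many there are.
data = [26, 18, 27, 12, 14, 27, 19, 30]

n = len(data)        # number of observations in the sample
total = sum(data)    # the "sum up" step (sigma over x_i)
x_bar = total / n    # sample mean

print(f"n = {n}, sum = {total}, mean = {x_bar}")
```

The same arithmetic applies to a population mean; only the notation changes.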

Page 6: 1 Chapter 3 – Descriptive Statistics Numerical Measures

6

Population Mean $\mu$

$$\mu = \frac{\sum x_i}{N}$$

The symbol $\Sigma$ (called sigma) means ‘sum up’.

$x_i$ is the value of the $i$th observation in the population.

$N$ is the number of observations in the population.

$\mu$ is the Greek letter mu, pronounced ‘mew’.

Page 7: 1 Chapter 3 – Descriptive Statistics Numerical Measures

7

Sample Mean Example: Sales of Starbucks Stores

Fifty Starbucks stores are randomly chosen in NYC. The table below shows the sales of those stores in December 2012.

 95   77   97   99   89
108  120   78   79   88
 67   97   97   79   93
 99  103  106   82   93
 93   97   95   61  109
 77   88  100  109   90
 86   89   97   93   88
 93  105   87   82   98
119  104   93  104  101
118  105   82   73  101

Page 8: 1 Chapter 3 – Descriptive Statistics Numerical Measures

8

Sample Mean

Example: Sales of Starbucks Stores

$$\bar{x} = \frac{\sum x_i}{n} = \frac{4{,}685}{50} \approx 93.7$$

Page 9: 1 Chapter 3 – Descriptive Statistics Numerical Measures

9

Median

The median of a data set is the value in the middle when the data items are arranged in ascending order.

Whenever a data set has extreme values, the median is the preferred measure of central location.

A few extremely large incomes or property values can inflate the mean since the calculation of the mean uses all the data items.

The median is the measure of location most often reported for annual income and property value data.

Page 10: 1 Chapter 3 – Descriptive Statistics Numerical Measures

10

Median

For an odd number of observations:

26 18 27 12 14 27 19 (7 observations)

12 14 18 19 26 27 27 (in ascending order)

The median is the middle value.

Median = 19

Page 11: 1 Chapter 3 – Descriptive Statistics Numerical Measures

11

Median

For an even number of observations:

26 18 27 12 14 27 19 30 (8 observations)

12 14 18 19 26 27 27 30 (in ascending order)

The median is the average of the middle two values.

Median = (19 + 26)/2 = 22.5

Page 12: 1 Chapter 3 – Descriptive Statistics Numerical Measures

12

Mean vs. Median

As noted, extreme values can change the mean remarkably, while the median might not be affected much by extreme values. In that regard, the median is a better representative of central location.

12 14 18 19 26 27 27 30 280

For the previous example, the median is 22.5 and the mean is 21.6. If we add one large number (280) to the data, the median becomes 26 (the value in the middle), but the mean becomes 50.3. In this case we prefer the median to the mean as a measure of central location.
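The comparison can be reproduced with Python's standard `statistics` module. This is only a sketch of the arithmetic on the slide's data; the variable names are illustrative.

```python
import statistics

data = [12, 14, 18, 19, 26, 27, 27, 30]
with_extreme = data + [280]   # same data plus one extreme value

for label, values in [("original", data), ("with 280", with_extreme)]:
    median = statistics.median(values)
    mean = round(statistics.mean(values), 1)
    print(f"{label}: median = {median}, mean = {mean}")
```

The median moves by a few units while the mean more than doubles, which is the sensitivity the slide describes.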

Page 13: 1 Chapter 3 – Descriptive Statistics Numerical Measures

13

Mode

The mode of a data set is the value that occurs most frequently.

The greatest frequency can occur at two or more different values.

If the data have exactly two modes, the data are bimodal.

If the data have more than two modes, the data are multimodal.

Caution: If the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode.

Page 14: 1 Chapter 3 – Descriptive Statistics Numerical Measures

14

Mode

12 14 18 19 26 27 27 30

For the example above, 27 shows up twice while all the other data values show up once. So, the mode is 27.
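As a hedged illustration, Python's `statistics.multimode` (available from Python 3.8) returns every most-frequent value rather than a single one, so bimodal data are not silently reduced to one mode. The `bimodal` list is a made-up data set, not from the slides.

```python
import statistics

data = [12, 14, 18, 19, 26, 27, 27, 30]
modes = statistics.multimode(data)       # all values with the highest frequency
print(modes)

bimodal = [12, 14, 14, 19, 27, 27]       # hypothetical data with two modes
bimodal_modes = statistics.multimode(bimodal)
print(bimodal_modes)
```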

Page 15: 1 Chapter 3 – Descriptive Statistics Numerical Measures

15

Percentiles

A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

Admission test scores for colleges and universities are frequently reported in terms of percentiles.

The pth percentile of a data set is a value such that at least p percent of the items are less than or equal to this value and at least (100 – p) percent of the items are more than or equal to this value.

The 50th percentile is simply the median.

Page 16: 1 Chapter 3 – Descriptive Statistics Numerical Measures

16

Percentiles

Arrange the data in ascending order.

Compute index i, the position of the pth percentile:

$$i = (p/100)\,n$$

If i is not an integer, round up. The pth percentile is the value in the ith position.

If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
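The procedure above can be sketched as a small Python function. The name `percentile` and the restriction to whole-number p strictly between 0 and 100 are assumptions made for this sketch; the index is kept in integer arithmetic so that "is i an integer?" is decided exactly.

```python
def percentile(data, p):
    """Return the p-th percentile (integer p, 0 < p < 100) using the
    position rule: i = (p/100) * n, round up or average two neighbours."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(data)                  # arrange in ascending order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)   # i = whole + remainder/100
    if remainder:
        # i is not an integer: round up to position whole + 1 (0-based index whole)
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


scores = [12, 14, 18, 19, 26, 27, 29, 30]
print(percentile(scores, 75), percentile(scores, 20))
```

Statistical packages often use interpolation rules instead, so their percentiles can differ slightly from this rule on small data sets.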

Page 17: 1 Chapter 3 – Descriptive Statistics Numerical Measures

17

Percentiles

Find the 75th percentile of the following data

12 14 18 19 26 27 29 30

Note: The data are already in ascending order.

$$i = (p/100)\,n = (75/100)\,8 = 6$$

So, averaging the 6th and 7th data values: 75th percentile = (27 + 29)/2 = 28

Page 18: 1 Chapter 3 – Descriptive Statistics Numerical Measures

18

Percentiles

Find the 20th percentile of the following data

12 14 18 19 26 27 29 30

Note: The data are already in ascending order.

$$i = (p/100)\,n = (20/100)\,8 = 1.6$$, which is rounded up to 2.

So, the 20th percentile is simply the 2nd data value, i.e. 14.

Page 19: 1 Chapter 3 – Descriptive Statistics Numerical Measures

19

Quartiles

Quartiles are specific percentiles.
First Quartile = 25th percentile
Second Quartile = 50th percentile = Median
Third Quartile = 75th percentile

Page 20: 1 Chapter 3 – Descriptive Statistics Numerical Measures

20

Measures of Variability

It is often desirable to consider measures of variability (dispersion), as well as measures of central location.

For example, suppose two stocks provide the same average return of 5% a year, but stock A’s return is very stable (close to 5%) while stock B’s return is volatile (it could be as low as –10%). Are you indifferent as to which stock to invest in?

For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Page 21: 1 Chapter 3 – Descriptive Statistics Numerical Measures

21

Measures of Variability

Range
Interquartile Range
Variance/Standard Deviation
Coefficient of Variation

Page 22: 1 Chapter 3 – Descriptive Statistics Numerical Measures

22

Range

The range of a data set is the difference between the largest and smallest data values.

It is the simplest measure of variability.

It is very sensitive to the smallest and largest data values.

Page 23: 1 Chapter 3 – Descriptive Statistics Numerical Measures

23

Range

Example:

12 14 18 19 26 27 29 30

Range = largest value – smallest value = 30 – 12 = 18

Page 24: 1 Chapter 3 – Descriptive Statistics Numerical Measures

24

Interquartile Range

The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile.

It is the range of the middle 50% of the data.

It overcomes the sensitivity to extreme data values.

Page 25: 1 Chapter 3 – Descriptive Statistics Numerical Measures

25

Interquartile Range

Example:

12 14 18 19 26 27 29 30

3rd Quartile (Q3) = 75th percentile = 28
1st Quartile (Q1) = 25th percentile = 16
Interquartile Range = Q3 – Q1 = 28 – 16 = 12

Page 26: 1 Chapter 3 – Descriptive Statistics Numerical Measures

26

Variance

The variance is a measure of variability that utilizes all the data.

It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).

The variance is useful in comparing the variability of two or more variables.

Page 27: 1 Chapter 3 – Descriptive Statistics Numerical Measures

27

Variance

The variance is the average of the squared differences between each data value and the mean. The variance is calculated as follows:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$ for a sample

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$ for a population

Page 28: 1 Chapter 3 – Descriptive Statistics Numerical Measures

28

Standard Deviation

The standard deviation of a data set is the positive square root of the variance.

It is measured in the same units as the data, which makes it easier to interpret than the variance.

Page 29: 1 Chapter 3 – Descriptive Statistics Numerical Measures

29

Standard Deviation

The standard deviation is computed as follows:

$$s = \sqrt{s^2}$$ for a sample

$$\sigma = \sqrt{\sigma^2}$$ for a population

Page 30: 1 Chapter 3 – Descriptive Statistics Numerical Measures

30

Variance and Standard Deviation

Example

12 14 18 19 26 27 29 30

Variance

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 48.98$$

Standard Deviation

$$s = \sqrt{s^2} = \sqrt{48.98} = 7$$
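The example can be checked with Python's `statistics` module, which provides separate functions for the sample (divide by n - 1) and population (divide by N) versions. This is a sketch of the calculation, not part of the original slides.

```python
import statistics

data = [12, 14, 18, 19, 26, 27, 29, 30]

sample_var = statistics.variance(data)   # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(data)       # square root of the sample variance
pop_var = statistics.pvariance(data)     # same sum, divided by N instead

print(round(sample_var, 2), round(sample_sd, 2), round(pop_var, 2))
```

Treating the same eight values as a whole population gives a smaller variance, because the divisor is N = 8 rather than n - 1 = 7.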

Page 31: 1 Chapter 3 – Descriptive Statistics Numerical Measures

31

Coefficient of Variation

The coefficient of variation indicates how large the standard deviation is in relation to the mean.

In a comparison between two data sets with different units or with the same units but a significant difference in magnitude, coefficient of variation should be used instead of variance.

Page 32: 1 Chapter 3 – Descriptive Statistics Numerical Measures

32

Coefficient of Variation

The coefficient of variation is computed as follows:

$$\left(\frac{s}{\bar{x}} \times 100\right)\%$$ for a sample

$$\left(\frac{\sigma}{\mu} \times 100\right)\%$$ for a population

Page 33: 1 Chapter 3 – Descriptive Statistics Numerical Measures

33

Coefficient of Variation

Example

12 14 18 19 26 27 29 30

$$\frac{s}{\bar{x}} \times 100\% = \frac{7}{21.875} \times 100\% = 32\%$$
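A minimal sketch of the same calculation in Python, using the unrounded sample standard deviation rather than the rounded value 7:

```python
import statistics

data = [12, 14, 18, 19, 26, 27, 29, 30]

x_bar = statistics.mean(data)    # sample mean
s = statistics.stdev(data)       # sample standard deviation
cv = s / x_bar * 100             # coefficient of variation, in percent

print(f"mean = {x_bar}, CV = {cv:.0f}%")
```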

Page 34: 1 Chapter 3 – Descriptive Statistics Numerical Measures

34

Coefficient of Variation Example: Height vs. Weight

In a class of 30 students, the average height is 5’5’’ with a standard deviation of 3’’ and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) do the students differ more?

Since height and weight don’t have the same unit, we have to use the coefficient of variation to remove the units before comparing the variation in height and weight.

As shown below, students’ weight varies more than their height.

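$$\frac{s_{\text{height}}}{\bar{x}_{\text{height}}} \times 100\% = \frac{3''}{5'5''} \times 100\% = 4.6\%$$

$$\frac{s_{\text{weight}}}{\bar{x}_{\text{weight}}} \times 100\% = \frac{20}{120} \times 100\% = 16.7\%$$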
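The one step the height calculation leaves implicit is the unit conversion: 5’5’’ is 65 inches, so the ratio is 3/65. A small sketch, with `coefficient_of_variation` as a hypothetical helper name:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean (unit-free)."""
    return std_dev / mean * 100


mean_height_in = 5 * 12 + 5    # 5 feet 5 inches expressed in inches
cv_height = coefficient_of_variation(3, mean_height_in)
cv_weight = coefficient_of_variation(20, 120)

print(f"height: {cv_height:.1f}%, weight: {cv_weight:.1f}%")
```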

Page 35: 1 Chapter 3 – Descriptive Statistics Numerical Measures

35

Measures of Distribution Shape, Relative Location, and Detecting Outliers

Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers

Page 36: 1 Chapter 3 – Descriptive Statistics Numerical Measures

36

Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for the skewness of sample data is

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$

Skewness can be easily computed using statistical software.
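The skewness formula can be written out directly. The function name and the two small data sets below are made up for illustration; the function assumes at least three observations that are not all equal (otherwise s is zero).

```python
import statistics


def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x_i - x_bar) / s)^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)    # sample standard deviation
    cubed = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


symmetric = [1, 2, 3, 4, 5]       # hypothetical symmetric data
right_tail = [1, 2, 3, 4, 20]     # hypothetical data with one large value

print(sample_skewness(symmetric), sample_skewness(right_tail))
```

Symmetric data give a skewness of zero, and a long right tail gives a positive value; mirroring the data flips the sign.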

Page 37: 1 Chapter 3 – Descriptive Statistics Numerical Measures

37

Distribution Shape: Skewness

Symmetric (not skewed)

• Skewness is zero.
• Mean and median are equal.

[Figure: relative frequency histogram of a symmetric distribution; vertical axis Relative Frequency, 0 to .35; Skewness = 0]

Page 38: 1 Chapter 3 – Descriptive Statistics Numerical Measures

38

Distribution Shape: Skewness

Skewed to the left

• Skewness is negative.
• Mean is usually less than the median.

[Figure: relative frequency histogram of a left-skewed distribution; vertical axis Relative Frequency, 0 to .35; Skewness = –.33]

Page 39: 1 Chapter 3 – Descriptive Statistics Numerical Measures

39

Distribution Shape: Skewness

Skewed to the right

• Skewness is positive.
• Mean is usually more than the median.

[Figure: relative frequency histogram of a right-skewed distribution; vertical axis Relative Frequency, 0 to .35; Skewness = .31]

Page 40: 1 Chapter 3 – Descriptive Statistics Numerical Measures

40

Z-Scores

$$z_i = \frac{x_i - \bar{x}}{s}$$

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean.

Excel’s STANDARDIZE function can be used to compute the z-score.

Page 41: 1 Chapter 3 – Descriptive Statistics Numerical Measures

41

Z-Scores

An observation’s z-score is a measure of the relative location of the observation in a data set.

A data value less than the sample mean has a negative z-score.

A data value greater than the sample mean has a positive z-score.

A data value equal to the sample mean has a z-score of zero.

Page 42: 1 Chapter 3 – Descriptive Statistics Numerical Measures

42

Z-Scores

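Example

12 14 18 19 26 27 29 30

z-score of the largest value (30):

$$z_8 = \frac{x_8 - \bar{x}}{s} = \frac{30 - 21.875}{7} = 1.16$$

z-score of the smallest value (12):

$$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{12 - 21.875}{7} = -1.41$$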
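A short sketch that standardizes every value in the example data set, using the sample mean and sample standard deviation:

```python
import statistics

data = [12, 14, 18, 19, 26, 27, 29, 30]

x_bar = statistics.mean(data)
s = statistics.stdev(data)
z_scores = [(x - x_bar) / s for x in data]   # one z-score per observation

for x, z in zip(data, z_scores):
    print(f"{x:>3}  z = {z:+.2f}")
```

Values below the mean come out negative and values above it positive, and the z-scores of a data set always sum to zero (up to rounding).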

Page 43: 1 Chapter 3 – Descriptive Statistics Numerical Measures

43

Chebyshev’s Theorem

  

At least (1 – 1/z²) of the items in any data set will be within z standard deviations of the mean, i.e. between ($\bar{x} - zs$) and ($\bar{x} + zs$), where z is any value greater than 1.

Chebyshev’s theorem requires z > 1, but z need not be an integer.

Page 44: 1 Chapter 3 – Descriptive Statistics Numerical Measures

44

Chebyshev’s Theorem

  

At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.

Page 45: 1 Chapter 3 – Descriptive Statistics Numerical Measures

45

Chebyshev’s Theorem

  

Example: Given that $\bar{x}$ = 10 and s = 2, at least what percentage of all the data values falls within 2 standard deviations of the mean?

At least (1 – 1/2²) = 1 – 1/4 = 75% of all the data values must be between 6 and 14.

$\bar{x} - zs$ = 10 – 2(2) = 6

$\bar{x} + zs$ = 10 + 2(2) = 14
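The bound and the interval can be sketched as two small functions. The names `chebyshev_bound` and `chebyshev_interval` are hypothetical, chosen for this illustration.

```python
def chebyshev_bound(z):
    """Minimum share of any data set within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z**2


def chebyshev_interval(mean, std_dev, z):
    """Endpoints mean - z*s and mean + z*s."""
    return mean - z * std_dev, mean + z * std_dev


print(chebyshev_bound(2), chebyshev_interval(10, 2, 2))
for z in (1.5, 3, 4):
    print(z, f"{chebyshev_bound(z):.1%}")
```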

Page 46: 1 Chapter 3 – Descriptive Statistics Numerical Measures

46

Empirical Rule

When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to estimate the percentage of data values that fall within a specified number of standard deviations of the mean.

The empirical rule is based on the normal distribution, which is covered in Chapter 6.

Page 47: 1 Chapter 3 – Descriptive Statistics Numerical Measures

47

Empirical Rule

For data having a bell-shaped distribution:

About 68% of values of a normal random variable are between $\mu - \sigma$ and $\mu + \sigma$.

About 95% of values of a normal random variable are between $\mu - 2\sigma$ and $\mu + 2\sigma$.

About 99.7% of values of a normal random variable are between $\mu - 3\sigma$ and $\mu + 3\sigma$.
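These percentages come from the normal distribution. As a sketch, the share of a normal distribution within k standard deviations of its mean equals erf(k/√2), which Python's standard `math.erf` can evaluate; the function name `share_within` is made up for this illustration.

```python
import math


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

Unlike Chebyshev's bound, these figures hold only when the data are close to bell-shaped.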

Page 48: 1 Chapter 3 – Descriptive Statistics Numerical Measures

48

Empirical Rule

[Figure: bell-shaped curve over x, with the horizontal axis marked at μ – 3σ, μ – 2σ, μ – 1σ, μ, μ + 1σ, μ + 2σ and μ + 3σ; about 68% of values lie within ±1σ, about 95% within ±2σ, and about 99.7% within ±3σ]

Page 49: 1 Chapter 3 – Descriptive Statistics Numerical Measures

49

Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than –3 or greater than +3 might be considered an outlier.

It might be:
• An incorrectly recorded data value
• A data value that was incorrectly included in the data set
• A correctly recorded data value that belongs in the data set
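The z-score screen can be sketched as follows. The function name `flag_outliers` and the sample are hypothetical: the sample repeats the chapter's eight values three times and appends one extreme value, purely so that there are enough observations for a z-score beyond 3 to occur.

```python
import statistics


def flag_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs((x - x_bar) / s) > threshold]


# Hypothetical sample: 24 ordinary values plus one extreme value.
sample = [12, 14, 18, 19, 26, 27, 27, 30] * 3 + [280]
print(flag_outliers(sample))
```

A flagged value is only a candidate for review; whether to correct, remove, or keep it depends on which of the three cases above applies.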

Page 50: 1 Chapter 3 – Descriptive Statistics Numerical Measures

50

Measures of Association Between Two Variables

So far, we have examined numerical methods used to summarize the data for one variable at a time.

Often a manager or decision maker is interested in the relationship between two variables.

Two numerical measures of the relationship between two variables are covariance and correlation coefficient.

Page 51: 1 Chapter 3 – Descriptive Statistics Numerical Measures

51

Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

Page 52: 1 Chapter 3 – Descriptive Statistics Numerical Measures

52

Covariance

The covariance is computed as follows:

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ for samples

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$ for populations

Page 53: 1 Chapter 3 – Descriptive Statistics Numerical Measures

53

Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

Page 54: 1 Chapter 3 – Descriptive Statistics Numerical Measures

54

Correlation Coefficient

The correlation coefficient is computed as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$ for samples

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$ for populations

Page 55: 1 Chapter 3 – Descriptive Statistics Numerical Measures

55

Correlation Coefficient

The correlation can take on values between –1 and +1.

Values near –1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.

Page 56: 1 Chapter 3 – Descriptive Statistics Numerical Measures

56

Covariance and Correlation Coefficient

Example: Stock Returns

The table below presents the monthly returns (in percent) of the market index S&P 500 (SPY) and Apple stock (AAPL) from December 2012 to May 2013.

Date      SPY    AAPL
Dec-12    2.17   -6.76
Jan-13    0.55   -1.11
Feb-13    1.62    0.12
Mar-13    0.83    0.01
Apr-13    1.01    0.96
May-13    0.24    0.10

Page 57: 1 Chapter 3 – Descriptive Statistics Numerical Measures

57

Covariance and Correlation Coefficient

Example: Stock Returns

Date        $x$     $y$     $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
Dec-12      2.17   -6.76     1.10             -5.65             -6.21
Jan-13      0.55   -1.11    -0.52              0.00              0.00
Feb-13      1.62    0.12     0.55              1.24              0.68
Mar-13      0.83    0.01    -0.24              1.12             -0.27
Apr-13      1.01    0.96    -0.06              2.08             -0.12
May-13      0.24    0.10    -0.83              1.21             -1.00
Average     1.07   -1.11                       Total:           -6.92
Std. Dev.   0.71    2.84

Page 58: 1 Chapter 3 – Descriptive Statistics Numerical Measures

58

Covariance and Correlation Coefficient

Example: Stock Returns

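• Sample Covariance

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-6.92}{6 - 1} = -1.38$$

• Sample Correlation Coefficient

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-1.38}{0.71 \times 2.84} = -0.68$$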
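The calculation can be sketched in Python from the six monthly returns. The function names are illustrative. Because the slides work from deviations rounded to two decimals, the unrounded results here may differ from –1.38 and –0.68 in the second decimal place.

```python
import math

spy = [2.17, 0.55, 1.62, 0.83, 1.01, 0.24]       # x: S&P 500 monthly returns (%)
aapl = [-6.76, -1.11, 0.12, 0.01, 0.96, 0.10]    # y: Apple monthly returns (%)


def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is s_x^2
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


print(f"covariance = {sample_covariance(spy, aapl):.3f}")
print(f"correlation = {sample_correlation(spy, aapl):.3f}")
```

The negative sign indicates that, over these six months, AAPL's returns tended to be lower in months when SPY's were higher.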