Chapter 3 Descriptive Statistics: Numerical Methods Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin

Chapter 3

Descriptive Statistics: Numerical Methods

Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin

3-2

Descriptive Statistics

3.1 Describing Central Tendency

3.2 Measures of Variation

3.3 Percentiles, Quartiles and Box-and-Whiskers Displays

3.4 Covariance, Correlation, and the Least Square Line (Optional)

3.5 Weighted Means and Grouped Data (Optional)

3.6 The Geometric Mean (Optional)

3-3

3.1 Describing Central Tendency

In addition to describing the shape of a distribution, want to describe the data set’s central tendency◦ A measure of central tendency represents the center or

middle of the data◦ Population mean (μ) is average of the population

measurementsPopulation parameter: a number calculated from all

the population measurements that describes some aspect of the population

Sample statistic: a number calculated using the sample measurements that describes some aspect of the sample

LO3-1: Compute and interpret the mean, median, and mode.

3-4

Measures of Central Tendency

Mean, The average or expected value

Median, Md The value of the middle point of the ordered measurements

Mode, Mo The most frequent value

LO3-1

3-5

The MeanPopulation X1, X2, …, XN

Population Mean

N

X

N

=1ii

Sample x1, x2, …, xn

Sample Mean

x

n

x x

n

=1ii

LO3-1

3-6

The Sample Mean

and is a point estimate of the population mean • It is the value to expect, on average and in the long run

n

xxx

n

xx n

n

ii

...211

For a sample of size n, the sample mean () is defined as

LO3-1

3-7

Example 3.1 Car Mileage Case: Estimating Mileage

Sample mean for first five car mileages from Table 3.1

30.8, 31.7, 30.1, 31.6, 32.1

26.315

3.156

5

1.326.311.307.318.3055

54321

5

1

x

xxxxxx

x ii

LO3-1

3-8

The Median

The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it◦ If the number of measurements is odd, the median

is the middlemost measurement in the ordering◦ If the number of measurements is even, the

median is the average of the two middlemost measurements in the ordering

LO3-1

3-9

Example 3.1 The Car Mileage Case

First five observations from Table 3.1:

30.8, 31.7, 30.1, 31.6, 32.1

In order: 30.1, 30.8, 31.6, 31.7, 32.1

There is an odd so median is one in middle, or 31.6

LO3-1

3-10

The Mode

The mode Mo of a population or sample of measurements is the measurement that occurs most frequently◦ Modes are the values that are observed “most typically”

◦ Sometimes higher frequencies at two or more values If there are two modes, the data is bimodal If more than two modes, the data is multimodal

◦ When data are in classes, the class with the highest frequency is the modal class The tallest box in the histogram

LO3-1

3-11

Relationships Among Mean, Median and Mode

LO3-1

Figure 3.3

3-12

3.2 Measures of Variation

Knowing the measures of central tendency is not enough

Both of the distributions below have identical measures of central tendency

LO3-2: Compute and interpret the range, variance, and standard deviation.

Figure 3.13

3-13

3-14

Measures of Variation

Range Largest minus the smallest measurement

Variance The average of the squared deviations of all the population measurements from the population mean

Standard The square root of the populationDeviation variance

LO3-2

3-15

The Range

Largest minus smallest

Measures the interval spanned by all the data

For the left side of Figure 3.13, largest is 5 and smallest is 3

Range is 5 – 3 = 2 days

LO3-2

3-16

Population Variance and Standard DeviationThe population variance (σ2) is the average

of the squared deviations of the individual population measurements from the population mean (µ)

The population standard deviation (σ) is the positive square root of the population variance

LO3-2

3-17

Variance

For a population of size N, the population variance σ2 is:

For a sample of size n, the sample variance s2 is:

N

xxx

N

xN

N

ii 22

22

11

2

2

11

222

211

2

2

n

xxxxxx

n

xxs n

n

ii

LO3-2

3-18

Standard Deviation

Population standard deviation (σ):

Sample standard deviation (s):

2

2ss

LO3-2

3-19

Example: Chris’s Class Sizes This SemesterData points are: 60, 41, 15, 30, 34Mean is 36 (180/5)Variance is:

Standard deviation is:

4.2165

1082

5

436441255765

36343630361536413660 222222

71.144.216

LO3-2

3-20

Example: Sample Variance and Standard DeviationExample 3.6: data for first five car mileages

from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1The sample mean is 31.26The variance and standard deviation are:

8019.0643.0

643.04

572.24

26.311.3226.316.3126.311.3026.317.3126.318.30

15

2

22222

5

1

2

2

ss

xxs i

i

LO3-2

3-21

The Empirical Rule for Normal Populations

If a population has mean µ and standard deviation σ and is described by a normal curve, then

68.26% of the population measurements lie within one standard deviation of the mean: [µ-σ, µ+σ]

95.44% lie within two standard deviations of the mean: [µ-2σ, µ+2σ]

99.73% lie within three standard deviations of the mean: [µ-3σ, µ+3σ]

LO3-3: Use the EmpiricalRule and Chebyshev’s Theorem to describe variation.

3-22

Chebyshev’s Theorem

Let µ and σ be a population’s mean and standard deviation, then for any value k > 1

At least 100(1 - 1/k2)% of the population measurements lie in the interval [µ-kσ, µ+kσ]

Only practical for non-mound-shaped distribution population that is not very skewed

LO3-3

3-23

z Scores

For any x in a population or sample, the associated z score is

The z score is the number of standard deviations that x is from the mean◦ A positive z score is for x above (greater than) the mean

◦ A negative z score is for x below (less than) the mean

deviation standard

mean

xz

LO3-3

3-24

Coefficient of Variation

Measures the size of the standard deviation relative to the size of the mean

Used to:◦ Compare the relative variabilities of values about the mean

◦ Compare the relative variability of populations or samples with different means and different standard deviations

◦ Measure risk

%100Mean

deviation Standard variationoft Coefficien

LO3-3

3-25

3.3 Percentiles, Quartiles, and Box-and-Whiskers Displays

For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value

The first quartile Q1 is the 25th percentile

The second quartile (median) is the 50th percentileThe third quartile Q3 is the 75th percentile

The interquartile range IQR is Q3 - Q1

LO3-4: Compute and interpret percentiles, quartiles, and box-and-whiskers displays.

3-26

Calculating Percentiles

1. Arrange the measurements in increasing order

2. Calculate the index i=(p/100)n where p is the percentile to find

3. (a) If i is not an integer, round up and the next integer greater than i denotes the pth percentile(b) If i is an integer, the pth percentile is the average of the measurements in the i and i+1 positions

LO3-4

3-27

Percentile Example

i=(10/100)12=1.2Not an integer so round up to 210th percentile is in the second position so 11,070

i=(25/100)12=3Integer so average values in positions 3 and 425th percentile (18,211+26,817)/2 or 22,514

LO3-4

3-28

Five Number Summary

1. The smallest measurement

2. The first quartile, Q1

3. The median, Md

4. The third quartile, Q3

5. The largest measurement

Displayed visually using a box-and-whiskers plot

LO3-4

3-29

Box-and-Whiskers Plots

The box plots the: ◦ First quartile, Q1

◦ Median, Md

◦ Third quartile, Q3

◦ Inner fences

◦ Outer fences

Inner fences◦ Located 1.5IQR away

from the quartiles: Q1 – (1.5 IQR) Q3 + (1.5 IQR)

Outer fences◦ Located 3IQR away

from the quartiles: Q1 – (3 IQR) Q3 + (3 IQR)

LO3-4

3-30

Box-and-Whiskers Plots Continued

The “whiskers” are dashed lines that plot the range of the data

A dashed line drawn from the box below Q1 down to the smallest measurement

Another dashed line drawn from the box above Q3 up to the largest measurement

LO3-4

Figures 3.17 and 3.18

3-31

3-32

3-33

Outliers

Outliers are measurements that are very different from other measurements◦They are either much larger or much smaller than

most of the other measurementsOutliers lie beyond the limits of the box-and-

whiskers plot◦Measurements less than the lower limit or greater

than the upper limit

LO3-4

3-34

3.4 Covariance, Correlation, and the Least Squares Line (Optional)

When points on a scatter plot seem to fluctuate around a straight line, there is a linear relationship between x and y

A measure of the strength of a linear relationship is the covariance sxy

1

1

n

yyxxs

n

iii

xy

LO3-5: Compute and interpret covariance, correlation, and the least squares line (Optional).

3-35

Covariance

A positive covariance indicates a positive linear relationship between x and y◦As x increases, y increases

A negative covariance indicates a negative linear relationship between x and y◦As x increases, y decreases

LO3-5

3-36

Correlation Coefficient

Magnitude of covariance does not indicate the strength of the relationship◦Magnitude depends on the unit of measurement

used for the dataCorrelation coefficient (r) is a measure of the

strength of the relationship that does not depend on the magnitude of the data

yx

xy

ss

sr

LO3-5

3-37

Correlation Coefficient Continued

Sample correlation coefficient r is always between -1 and +1◦Values near -1 show strong negative correlation◦Values near 0 show no correlation◦Values near +1 show strong positive correlation

Sample correlation coefficient is the point estimate for the population correlation coefficient ρ

LO3-5

3-38

Least Squares Line

If there is a linear relationship between x and y, might wish to predict y on the basis of x

This requires the equation of a line describing the linear relationship

Line is calculated based on least squares line◦Discussed in detail in a later chapter

Need to find slope (b1) and y-intercept (b0)

xbyb

s

sb

x

xy

10

21

LO3-5

3-39

3.5 Weighted Means and Grouped Data (Optional)

Sometimes, some measurements are more important than others◦ Assign numerical “weights” to the data

Weights measure relative importance of the valueCalculate weighted mean as

where wi is the weight assigned to the ith measurement xi

i

ii

w

xw

LO3-6: Compute and interpret weighted means and the mean and standard deviation of grouped data (Optional).

3-40

Descriptive Statistics for Grouped Data

Data already categorized into a frequency distribution or a histogram is called grouped data

Can calculate the mean and variance even when the raw data is not available

Calculations are slightly different for data from a sample and data from a population

LO3-6

3-41

Descriptive Statistics for Grouped Data (Sample)Sample mean for grouped data:

Sample variance for grouped data:

fi is the frequency for class i Mi is the midpoint of class i n = Σfi = sample size

n

Mf

f

Mfx ii

i

ii

1

22

n

xMfs ii

LO3-6

3-42

Descriptive Statistics for Grouped Data (Population)Population mean for grouped data:

Population variance for grouped data:

fi is the frequency for class i Mi is the midpoint of class i N = Σfi = population size

N

Mf

f

Mf ii

i

ii

N

xMf ii

22

LO3-6

3-43

3.6 The Geometric Mean (Optional)

For rates of return of an investment, use the geometric mean to give the correct wealth at the end of the investment

Suppose the rates of return (expressed as decimal fractions) are R1, R2, …, Rn for periods 1, 2, …, n

The mean of all these returns is the calculated as the geometric mean:

1111 21 n

ng RRRR

LO3-7: Compute and interpret the geometric mean (Optional).

Documents

Chapter 3 Descriptive Statistics: Numerical Methods Copyright © 2014 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin