
Biostat Chapter Three-3


CHAPTER THREE

Numerical Summary Measures

Numerical Summarizing Data

12/19/2013 1 Yusuf H (BSc, MPHE)


• A frequency distribution is a general picture of the distribution of a variable

• But it cannot indicate the average (middle) value or the spread of the values

• Although these techniques are extremely useful, they do not allow us to make concise, quantitative statements that characterize the distribution of values as a whole


Measures of Central Tendency (MCT)

• Computed from the data of a sample or population

• Convey information regarding the average value

• Facilitate the description and comparison of data sets

• The goal of measures of central tendency is to come up with the one single number that best describes a distribution of scores.


Characteristics of a good MCT

An MCT is good or satisfactory if it possesses the following characteristics.

1. It should be based on all the observations

2. It should not be affected by the extreme values

3. It should be as close to the majority of values as possible

4. It should have a definite value

5. It should not be subjected to complicated and tedious calculations

6. It should be capable of further algebraic treatment


• The most common MCT include:

– Arithmetic Mean

– Median

– Mode

– Others


1. Arithmetic Mean

a) Ungrouped data

• The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location

• The sample mean is the sum of all the observations divided by the number of observations:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


Example. The heart rates for n = 10 patients were as follows (beats per minute):

167, 120, 150, 125, 150, 140, 40, 136, 120, 150

What is the arithmetic mean for the heart rate of these patients?
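The question above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the original slides.

```python
# Heart rates (beats per minute) for the n = 10 patients in the example.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

# Sample mean: sum of the observations divided by the number of observations.
mean_hr = sum(heart_rates) / len(heart_rates)

print(f"sum = {sum(heart_rates)}, n = {len(heart_rates)}")
print(f"mean heart rate = {mean_hr:.1f} beats per minute")
```

The sum is 1298, so the mean is 129.8 beats per minute. The single low value of 40 pulls the mean below most of the other observations.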


b) Grouped data

In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where,

k = the number of class intervals

mi = the mid-point of the ith class interval

fi = the frequency of the ith class interval


Example. Compute the mean age of 169 subjects from the grouped data given below.

Class interval   Mid-point (mi)   Frequency (fi)   mi × fi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total            __               169              5810.5

Mean = 5810.5/169 = 34.38 years
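The grouped-mean calculation can be reproduced directly from the mid-points and frequencies in the table. The sketch below assumes nothing beyond those two columns; the variable names are illustrative.

```python
# Mid-points and frequencies of the six age classes (10-19, ..., 60-69).
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

# Numerator: sum of mid-point times frequency; denominator: total frequency.
total_fm = sum(m * f for m, f in zip(midpoints, frequencies))
n = sum(frequencies)

grouped_mean = total_fm / n
print(f"sum of m*f = {total_fm}, n = {n}")
print(f"grouped mean = {grouped_mean:.2f} years")
```

The quotient 5810.5/169 evaluates to about 34.38 years.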


Properties of the Arithmetic Mean.

• Uniqueness: a data set has one and only one mean

• Simplicity

• Affected by each value

• Extreme values influence the mean

• It is the measure most amenable to algebraic treatment


2. Median

• Divides the data set into two equal parts

• The most common alternative to the mean as a measure of central location


2. Median…

a) Ungrouped data

• Odd number of observations:

– The median will be the middle value when all values are arranged in order of magnitude.

• Even number of observations:

– There is no single middle value but two middle observations.

– In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude.


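If the observations are ordered from smallest to largest, $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, then the median is defined as follows.

$$\tilde{x} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right] & \text{if } n \text{ is even} \end{cases}$$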


• Compute the sample median for the birth weight data (in grams). First arrange the sample in ascending order.

2069 2581 2759 2828 2834 2841 3031 3101 3200 3245

3248 3260 3265 3314 3323 3484 3541 3609 3649 4146

• Since n = 20 is even, median = average of the 10th and 11th observations = (3245 + 3248)/2 = 3246.5 g.
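The same median can be obtained with Python's standard `statistics` module, which sorts the data and averages the two middle values when n is even. This is a small sketch using the 20 birth weights as listed on the slide; the variable names are illustrative.

```python
import statistics

# Birth weights in grams, as listed on the slide (n = 20).
birth_weights = [
    2069, 2581, 2759, 2834, 2828, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]

ordered = sorted(birth_weights)

# With n = 20 the median is the mean of the 10th and 11th ordered values.
tenth, eleventh = ordered[9], ordered[10]
median_weight = statistics.median(birth_weights)

print(f"10th = {tenth}, 11th = {eleventh}, median = {median_weight} g")
```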


• The median is a better description (than the mean) of the majority when the distribution is skewed

• Example

– Data: 14, 89, 93, 95, 96

– Skewness is reflected in the outlying low value of 14

– The sample mean is 77.4

– The median is 93
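A short check of this example, sketched with the standard `statistics` module (variable names are illustrative), shows how far the single low value drags the mean away from the median.

```python
import statistics

# Skewed example data from the slide: one outlying low value (14).
scores = [14, 89, 93, 95, 96]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print(f"mean = {mean_score}, median = {median_score}")
```

Four of the five observations lie between 89 and 96, so the median of 93 describes the majority far better than the mean of 77.4.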


b) Grouped data

• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.

• The first step is to locate the class interval in which the median is located, using the following procedure.

• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2.

• Then, use the following formula.


$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

where,

Lm = lower true class boundary of the interval containing the median

Fc = cumulative frequency of the interval just before the median class interval

fm = frequency of the interval containing the median

W = class interval width

n = total number of observations


Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169


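• n/2 = 84.5 falls in the 3rd class interval (30-39), the first whose cumulative frequency (117) reaches 84.5

• True class boundaries: lower = 29.5, upper = 39.5, so W = 10

• Frequency of the class, fm = 47

• (n/2 – Fc) = 84.5 – 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33 years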
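The steps above can be sketched as a small function. The function name `grouped_median` and its arguments are illustrative assumptions, not something defined in the slides; it simply walks the cumulative frequencies until n/2 is reached and then interpolates within that class.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median from grouped data by linear interpolation within the median class.

    lower_boundaries: true lower class boundary of each interval
    frequencies: frequency of each interval
    width: common class-interval width W
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency table is empty")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        if freq > 0 and cumulative + freq >= half:
            # L_m + ((n/2 - F_c) / f_m) * W
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median class not found")


# Age data from the example: true lower boundaries 9.5, 19.5, ..., 59.5.
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(boundaries, frequencies, width=10)
print(f"grouped median = {median_age:.2f} years")
```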


Properties of the median

• Uniqueness

• It is an average of position

• Simplicity

• Affected more by the number of items than by extreme values

• Insensitive to very large or very small values, unlike the mean

• Weakness: it is determined mainly by the middle points


Quartiles

• Just as the median is the value above and below which half of the data lie, one can define measures above or below which other fractional parts of the data lie.

• The median divides the data into two equal parts (50%)


a) The first quartile (Q1): 25% of all the ranked observations are less than Q1.

b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. The second quartile is the median.

c) The third quartile (Q3): 75% of all the ranked observations are less than Q3.


Percentiles

• Simply divide the data into 100 pieces.

• Percentiles are less sensitive to outliers and not greatly affected by the sample size (n).

• Commonly used percentiles:

– 10, 20, …, 90% (deciles)

– 20, 40, …, 80% (quintiles)

– 25, 50, 75% (quartiles)

– 33.3, 66.7% (tertiles)


3. Mode

• Value which occurs most frequently

• If all observations are different, there is no mode

• More than one mode can occur, e.g. a bimodal distribution has two peaks

• Less amenable (responsive) to mathematical treatment.

• The mode is not often used in biological or medical data.


a) Ungrouped data

• It is the value which occurs most frequently in a set of values.

• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.


Some distributions have more than one mode:

• Unimodal: A distribution with one mode

• Bimodal: A distribution with two modes

• Trimodal: A distribution with three modes

• Empirical formula: $\text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median})$

• Example

• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6

• Mode is 4 “Unimodal”

• Example

• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8

• There are two modes = 2 & 5

• This distribution is said to be “bi-modal”

• Example

• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12

• No mode, since all the values are different
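The three examples can be reproduced with a short helper. The function name `modes` is an illustrative assumption; it follows the rule used on these slides that a data set in which every value occurs only once has no mode.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once, so by the rule above there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(unimodal, bimodal, no_mode)
```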


b) Grouped data

• To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.

• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.


Properties of mode

• An average of position

• Not affected by extreme values

• The most typical value of the distribution

• Not capable of mathematical treatment


Which MCT to use?

• The mean can be used for discrete and continuous data

• The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data

• The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements


• Skewness: If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores.

• Types of Distributions

(a) Symmetric and unimodal distribution — Mean, median, and mode should all be approximately the same


(b) Skewed to the right (positively skewed) —Mean is sensitive to extreme values, so median might be more appropriate

[Figure: positively skewed curve with the mode, median, and mean marked]

(c) Skewed to the left (negatively skewed): as in (b), the mean is pulled toward the extreme values, so the median might be more appropriate

[Figure: negatively skewed curve with the mean, median, and mode marked]

Measures of Central Tendency: The Shape of Distributions

• With perfectly bell shaped distributions, the mean, median, and mode are identical.

• With positively skewed data, the mode is lowest, followed by the median and mean.

• With negatively skewed data, the mean is lowest, followed by the median and mode.


4. Geometric mean (GM)

If $x_1, x_2, \ldots, x_n$ are n positive observed values, then

$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

and

$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}.$$

The geometric mean is generally used with data measured on a logarithmic scale, such as titers of anti-neutrophil immunoglobulin G.


Geometric mean…

• The GM is preferable to the AM if the series of observations contains one or more unusually large values.

• It is obtained by taking the nth root of the product of the n values, i.e., if the values of the observations are denoted by $x_1, x_2, \ldots, x_n$, then

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$$

$$\log GM = \log (x_1 x_2 \cdots x_n)^{1/n} = \frac{1}{n}\log (x_1 x_2 \cdots x_n) = \frac{1}{n}\left\{\log x_1 + \log x_2 + \cdots + \log x_n\right\} = \frac{\sum \log x_i}{n}$$

• The logarithm of the geometric mean is equal to the arithmetic mean of the logarithms of the individual values.

• Example: The geometric mean may be calculated for the following parasite counts per 100 fields of thick films.

7 8 3 14 2 1 440 15 52 6 2 1 1 25

12 6 9 2 1 6 7 3 4 70 20 200 2 50

21 15 10 120 8 4 70 3 1 103 20 90 1 237

• GM = $\sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$

• log GM = (1/42)(log 7 + log 8 + log 3 + … + log 237)

= (1/42)(0.8451 + 0.9031 + 0.4771 + … + 2.3747)

= (1/42)(41.9985)

= 0.9999 ≈ 1.0000

• The anti-log of 0.9999 is about 9.998 ≈ 10, and this is the required geometric mean. By contrast, the arithmetic mean, which is inflated by the high values of 440, 237 and 200, is 39.8 ≈ 40.
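The calculation can be repeated in Python by averaging base-10 logarithms and taking the anti-log. This is a minimal sketch with illustrative variable names; intermediate sums may differ slightly in the last decimals from the hand calculation because of rounding in the four-figure logarithms.

```python
import math

# Parasite counts per 100 fields of thick films (n = 42), as listed above.
counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]

# log GM is the arithmetic mean of the logarithms of the values.
log_gm = sum(math.log10(x) for x in counts) / len(counts)
geometric_mean = 10 ** log_gm

arithmetic_mean = sum(counts) / len(counts)

print(f"log GM = {log_gm:.4f}")
print(f"geometric mean = {geometric_mean:.2f}")
print(f"arithmetic mean = {arithmetic_mean:.2f}")
```

The geometric mean comes out close to 10, while the arithmetic mean is close to 40, which illustrates how strongly a few very large counts inflate the arithmetic mean.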


5. Weighted mean (WM)

• In a weighted mean, separate outcomes have separate influences.

• The influence attached to an outcome is the weight.

• A familiar example is the calculation of a course grade as a weighted average of scores on separate outcomes.
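As an illustration of the course-grade idea, the weighted mean can be written as the sum of weight times value divided by the sum of the weights. The scores and weights below are hypothetical numbers invented for this sketch; they do not come from the slides.

```python
# Hypothetical course components: (score out of 100, weight in percent).
# These values are made up purely for illustration.
components = {
    "assignment": (70, 20),
    "midterm": (80, 30),
    "final exam": (90, 50),
}

# Weighted mean = sum(w_i * x_i) / sum(w_i)
weighted_sum = sum(score * weight for score, weight in components.values())
total_weight = sum(weight for _, weight in components.values())
weighted_mean = weighted_sum / total_weight

# Unweighted mean of the same scores, for comparison.
simple_mean = sum(score for score, _ in components.values()) / len(components)

print(f"weighted mean = {weighted_mean}, simple mean = {simple_mean}")
```

Because the final exam carries half of the total weight, the weighted mean (83) sits closer to the final exam score than the simple mean (80) does.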


Which measure of central tendency is best with a given set of data?

• Two factors are important in making this decision:

– The shape of the distribution of the observations

– The scale of measurement


Measures of Dispersion

Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or median and yet differ considerably in how widely their values are spread.


• MCT are not enough to give a clear understanding about the distribution of the data.

• We need to know something about the variability or spread of the values — whether they tend to be clustered close together, or spread out over a broad range


Measures of Dispersion…

• Measures that quantify the variation or dispersion of a set of data from its central location

• Dispersion refers to the variety exhibited by the values of the data.

• The amount of dispersion is small when the values are close together.

• If all the values are the same, there is no dispersion


• Measures of dispersion include:

– Range

– Inter-quartile range

– Variance

– Standard deviation

– Coefficient of variation


1. Range (R)

• The difference between the largest and smallest observations in a sample.

• Range = Maximum value – Minimum value

• Example

– Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42 – 5 = 37

• A data set with a higher range exhibits more variability
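A brief sketch of the range calculation follows, applied to the example values and to data sets A and B from the start of this section (both have mean 200). The helper name `data_range` is an illustrative assumption.

```python
def data_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)


example = [5, 9, 12, 16, 23, 34, 37, 42]
set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

example_range = data_range(example)
range_a = data_range(set_a)
range_b = data_range(set_b)

print(f"example range = {example_range}")
print(f"range A = {range_a}, range B = {range_b}")
```

Sets A and B share the same mean, but the range of A (49) is almost three times that of B (17), so A is the more variable set.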


Properties of range

• It is the simplest crude measure and can be easily understood

• It takes into account only two values, which makes it a poor measure of dispersion

• Very sensitive to extreme observations

• Extreme values may be unreliable (most likely faulty)

• Not suitable for mathematical treatment


2. Interquartile range (IQR)

• Indicates the spread of the middle 50% of the observations, and is used together with the median

IQR = Q3 - Q1


• Example:

• Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.

IQR = 10.2 kg – 8.8 kg = 1.4 kg

i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.


Properties of IQR:

• It is a simple and versatile measure

• It encloses the central 50% of the observations

• It is not based on all observations but only on two specific values

• Since it excludes the lowest and highest 25% values, it is not affected by extreme values


3. Variance (σ², s²)

• The variance is the average of the squares of the deviations taken from the mean.

• A good measure of dispersion should make use of all the data

• It measures the deviation of each observation from the mean

• It achieves this by averaging the squared deviations from the mean.


• Variance is used to measure the dispersion of values relative to the mean.

• When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.

– Population variance = σ²

– Sample variance = S²


a) Ungrouped data

Let X1, X2, ..., XN be the measurements on N population units, then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \quad \text{where } \mu = \frac{\sum_{i=1}^{N} X_i}{N} \text{ is the population mean.}$$


A sample variance is calculated for a sample of individual values (X1, X2, …, Xn) and uses the sample mean ($\bar{x}$) rather than the population mean µ:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$


b) Grouped data

$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where

mi = the mid-point of the ith class interval

fi = the frequency of the ith class interval

$\bar{x}$ = the sample mean

k = the number of class intervals


Properties of Variance:

• The main disadvantage of the variance is that its unit is the square of the unit of the original measurement values

• The variance of a distribution of weights is not expressed in kg, but in kg²

e.g. mean weight = 36.5 kg, s² = 257 kg²


4. Standard deviation (σ, s)

• It is the square root of the variance.

• This produces a measure having the same scale as that of the individual values.

• It is the most commonly used measure of dispersion

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$

Example

• Areas of sprayable surfaces with DDT from a sample of 15 houses are as follows (m²):

101 105 110 114 115 124 125 125

130 133 135 136 137 140 145

• Find the variance and standard deviation of the above distribution.

• The mean of the sample is 125 m².
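One way to work this example is with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n − 1) denominator. The sketch below also computes the sum of squared deviations by hand so each step is visible; the variable names are illustrative.

```python
import statistics

# Sprayable surface areas (square metres) for the 15 sampled houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

n = len(areas)
mean_area = sum(areas) / n

# Sum of squared deviations from the mean, then divide by n - 1.
sum_sq_dev = sum((x - mean_area) ** 2 for x in areas)
sample_variance = sum_sq_dev / (n - 1)
sample_sd = sample_variance ** 0.5

# Cross-check against the library implementation (also n - 1).
library_variance = statistics.variance(areas)

print(f"mean = {mean_area}, sum of squared deviations = {sum_sq_dev}")
print(f"variance = {sample_variance:.2f} m^2 squared, SD = {sample_sd:.2f} m^2")
```

The sum of squared deviations is 2502, giving a variance of 2502/14 ≈ 178.71 and a standard deviation of about 13.37 m².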


Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Class interval   (mi)    (fi)   (mi − Mean)   (mi − Mean)²   (mi − Mean)² fi
10-19            14.5    4      -19.88        395.21         1580.86
20-29            24.5    66     -9.88         97.61          6442.55
30-39            34.5    47     0.12          0.01           0.68
40-49            44.5    36     10.12         102.41         3686.92
50-59            54.5    12     20.12         404.81         4857.77
60-69            64.5    4      30.12         907.21         3628.86
Total                    169                                 20197.63

Mean = 5810.5/169 = 34.38 years

S² = 20197.63/(169 − 1) = 120.22

SD = √S² = √120.22 = 10.96 years
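The grouped variance formula can be evaluated directly from the mid-points and frequencies, which avoids rounding the mean before squaring the deviations. This is a minimal sketch with illustrative variable names.

```python
import math

# Mid-points and frequencies of the six age classes.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of f_i * (m_i - mean)^2, divided by (sum of f_i) - 1.
weighted_sq_dev = sum(f * (m - grouped_mean) ** 2
                      for m, f in zip(midpoints, frequencies))
grouped_variance = weighted_sq_dev / (n - 1)
grouped_sd = math.sqrt(grouped_variance)

print(f"mean = {grouped_mean:.2f}")
print(f"variance = {grouped_variance:.2f}, SD = {grouped_sd:.2f}")
```

With the unrounded mean of about 34.38, the variance is about 120.22 and the SD about 10.96 years.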


Properties of SD

• The SD has the advantage of being expressed in the same units of measurement as the mean

• SD is considered to be the best measure of dispersion and is used widely

• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of the SD.


5. Coefficient of variation (CV)

• When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.

• It is the best measure for comparing the variability of two series or sets of observations.

• Data with a smaller CV are considered more consistent.


• “Cholesterol is more variable than systolic blood pressure”

              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0

• CV is the ratio of the SD to the mean multiplied by 100: CV = (SD / Mean) × 100.
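The two CV values in the table can be verified with a one-line formula. The function name `coefficient_of_variation` is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = (SD / mean) * 100; the result has no units."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_sbp = coefficient_of_variation(sd=15, mean=130)
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)

print(f"CV of SBP = {cv_sbp:.1f}%")
print(f"CV of cholesterol = {cv_cholesterol:.1f}%")
```

Because the units cancel in the ratio, the two CVs can be compared directly even though the variables are measured on different scales.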


Characteristics of a Distribution

• A distribution is characterized by:

– Location = where the average value lies

– Dispersion = how spread out the values are

– Modality = number of peaks

– Skewness = whether or not a distribution is symmetric


Which Measures to Use?

• For symmetrically distributed data, mean and SD are used to summarize the data.

• However, in skewed distributions it is preferable to use median and quartiles

• Median and quartiles are not easily influenced by extreme values as are means and SDs.


Exercise

• The following frequency distribution table shows the forced expiratory volume in one second (FEV1) for 13 adults suffering from asthma.


1. Find the mean, median and mode.

2. Find Range, quartiles and IQR.

3. Calculate variance and SD.

4. Suppose the FEV1 value for subject 11 was recorded by mistake as 40.2 instead of 4.02. What will happen to the mean, median and mode, respectively?
