Biostat Chapter Three-3


CHAPTER THREE

Numerical Summary Measures

Numerical Summarizing Data

12/19/2013 1 Yusuf H (BSc, MPHE)

• A frequency distribution is a general picture of the distribution of a variable

• However, it cannot indicate the average (middle) value or the spread of the values

• Although these techniques are extremely useful, they do not allow us to make concise, quantitative statements that characterize the distribution of values as a whole


Measures of Central Tendency (MCT)

• Computed from the data of a sample or population

• Convey information regarding the average value

• Facilitate the description or comparison of data sets

• The goal of measures of central tendency is to come up with the one single number that best describes a distribution of scores.


Characteristics of a good MCT

An MCT is good or satisfactory if it possesses the following characteristics.

1. It should be based on all the observations

2. It should not be affected by the extreme values

3. It should be as close to the majority of values as possible

4. It should have a definite value

5. It should not be subjected to complicated and tedious calculations

6. It should be capable of further algebraic treatment


• The most common MCT include:

– Arithmetic Mean

– Median

– Mode

– Others


1. Arithmetic Mean

a) Ungrouped data

• The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location

• The sample mean is the sum of all the observations divided by the number of observations:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The heart rates for n = 10 patients were as follows (beats per minute):

167, 120, 150, 125, 150, 140, 40, 136, 120, 150

What is the arithmetic mean for the heart rate of these patients?
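As a quick check of this example, the calculation can be sketched in Python. The function name `arithmetic_mean` is only illustrative and is not part of the original slides.

```python
# Heart rates (beats per minute) for the n = 10 patients in the example
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_hr = arithmetic_mean(heart_rates)
print(mean_hr)  # 129.8 beats per minute
```

Note how the single low value of 40 pulls the mean below most of the other observations.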


b) Grouped data

In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where,

k = the number of class intervals

mi = the mid-point of the ith class interval

fi = the frequency of the ith class interval


Example. Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   mi·fi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total                             169              5810.5

Mean = 5810.5/169 = 34.38 years
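The grouped-mean calculation can be reproduced with a short Python sketch. The mid-points and frequencies are taken from the table; the function name `grouped_mean` is an illustrative choice.

```python
# Mid-points and frequencies of the six age classes in the table
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]


def grouped_mean(mids, freqs):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    weighted_total = sum(m * f for m, f in zip(mids, freqs))
    return weighted_total / sum(freqs)


weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
mean_age = grouped_mean(midpoints, frequencies)
print(weighted_total)      # 5810.5
print(round(mean_age, 2))  # 34.38
```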


Properties of the Arithmetic Mean.

• Uniqueness: a given data set has one and only one mean

• Simplicity

• Affected by each value

• Extreme values influence the mean

• It is the most amenable to algebraic treatment

2. Median

• Divides the data set into two equal parts

• An alternative measure of central location, next to the mean.

a) Ungrouped data

• Odd number:

– The median will be the middle value when all values are arranged in order of magnitude.

• Even number:

– There is no single middle value but two middle observations.

– In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude.

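If the observations are ordered from smallest to largest, then the median is defined as follows:

$$\tilde{x} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$

where $x_{(i)}$ denotes the ith smallest observation.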

• Compute the sample median for the birth weight data. First arrange the sample in ascending order.

2069 2581 2759 2828 2834 2841 3031 3101 3200 3245

3248 3260 3265 3314 3323 3484 3541 3609 3649 4146

• Since n = 20 is even, median = average of the 10th and 11th observations = (3245 + 3248)/2 = 3246.5 g.
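The odd/even rule for the median can be sketched in Python as follows. The function name `median` is illustrative, and the data are the 20 birth weights listed in this example.

```python
def median(values):
    """Middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)      # arrange in ascending order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # the single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2


birth_weights = [2069, 2581, 2759, 2828, 2834, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

median_weight = median(birth_weights)
print(median_weight)  # 3246.5
```

Because the function sorts its input, the order in which the observations are entered does not matter.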


• The median is a better description (than the mean) of the majority when the distribution is skewed

• Example

– Data: 14, 89, 93, 95, 96

– Skewness is reflected in the outlying low value of 14

– The sample mean is 77.4

– The median is 93


b) Grouped data

• In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.

• The first step is to locate the class interval in which the median is located, using the following procedure.

• Find n/2 and identify the first class interval whose cumulative frequency is at least n/2.

• Then, use the following formula.

$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$

where,

Lm = lower true class boundary of the interval containing the median

Fc = cumulative frequency of the interval just before the median class interval (the row above it in the table)

fm = frequency of the interval containing the median

W = class interval width

n = total number of observations


Example. Compute the median age of 169 subjects from the grouped data.

n/2 = 169/2 = 84.5

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169

• n/2 = 84.5 falls in the 3rd class interval (30-39)

• Lower class boundary = 29.5, upper class boundary = 39.5

• Frequency of the class = 47

• (n/2 - Fc) = 84.5 - 70 = 14.5

• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33 years
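The steps of the grouped-median calculation can be sketched in Python. This is a minimal illustration that assumes equal class widths and uses the true lower class boundaries (9.5, 19.5, ...); the function name `grouped_median` is hypothetical.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cumulative + f >= half:
            # first class whose cumulative frequency reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, frequencies, 10)
print(round(median_age, 2))  # 32.59
```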


Properties of the median

• Uniqueness

• It is an average of position

• Simplicity

• It is affected more by the number of items than by extreme values

• Insensitive to very large or very small values, unlike the mean

• Weakness: it is determined mainly by the middle values

Quartiles

• Just as the median is the value above and below which half of the data lie, one can define measures above or below which other fractional parts of the data lie.

• The median divides the data into two equal parts (50%)


a) The first quartile (Q1): 25% of all the ranked observations are less than Q1.

b) The second quartile (Q2): 50% of all the ranked observations are less than Q2. The second quartile is the median.

c) The third quartile (Q3): 75% of all the ranked observations are less than Q3.
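As an illustration, the three quartiles can be obtained with `statistics.quantiles` from the Python standard library, here applied to the birth weight data used in the median example. Statistical packages use slightly different interpolation rules, so Q1 and Q3 from other software or from a hand calculation may differ a little from these values; Q2 always equals the median.

```python
import statistics

birth_weights = [2069, 2581, 2759, 2828, 2834, 2841, 3031, 3101, 3200, 3245,
                 3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146]

# n=4 splits the ranked data into four equal parts and returns the
# three cut points Q1, Q2 and Q3 (default interpolation method).
q1, q2, q3 = statistics.quantiles(birth_weights, n=4)

print("Q1 =", q1)
print("Q2 =", q2, "(the median)")
print("Q3 =", q3)
```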

Percentiles

• Percentiles divide the ranked data into 100 equal parts.

• Percentiles are less sensitive to outliers and not greatly affected by the sample size (n).

• Commonly used percentiles:

→ 10, 20, …, 90% (deciles)

→ 20, 40, …, 80% (quintiles)

→ 25, 50, 75% (quartiles)

→ 33.3, 66.7% (tertiles)


3. Mode

• Value which occurs most frequently

• If all observations are different, no mode

• More than one mode can occur; for example, a bimodal distribution has two peaks.

• Less amenable (responsive) to mathematical treatment.

• The mode is not often used in biological or medical data.

a) Ungrouped data

• The mode is the value which occurs most frequently in a set of values.

• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.

Some distributions have more than one mode:

• Unimodal: A distribution with one mode

• Bimodal: A distribution with two modes

• Trimodal: A distribution with three modes

• Empirical formula:

$$\text{mean} - \text{mode} = 3\,(\text{mean} - \text{median})$$

• Example

• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6

• Mode is 4 “Unimodal”

• Example

• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8

• There are two modes = 2 & 5

• This distribution is said to be “bi-modal”

• Example

• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12

• No mode, since all the values are different
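The three examples can be checked with a small Python sketch. The helper `modes` is illustrative; it follows the convention used here that a data set in which every value occurs only once has no mode.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)


unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(unimodal)  # [4]
print(bimodal)   # [2, 5]
print(no_mode)   # []
```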

b) Grouped data

• To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency.

• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.

Properties of mode

• An average of position

• Not affected by extreme values

• The most typical value of the distribution

• Not capable of mathematical treatment

Which MCT to use?

• The mean can be used for discrete and continuous data

• The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data

• The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements

• Skewness: If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores.

• Types of Distributions

(a) Symmetric and unimodal distribution: mean, median, and mode should all be approximately the same

[Figure: symmetric, unimodal distribution in which the mean, median and mode coincide]

(b) Skewed to the right (positively skewed): the mean is sensitive to extreme values, so the median might be more appropriate

[Figure: positively skewed distribution with the mode, median and mean marked]

(c) Skewed to the left (negatively skewed): as in (b), the median might be more appropriate than the mean

[Figure: negatively skewed distribution with the mode, median and mean marked]

Measures of Central Tendency: The Shape of Distributions

• With perfectly bell shaped distributions, the mean, median, and mode are identical.

• With positively skewed data, the mode is lowest, followed by the median and mean.

• With negatively skewed data, the mean is lowest, followed by the median and mode.

4. Geometric mean (GM)

If x1, x2, ..., xn are n positive observed values, then

$$GM = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \quad \text{and} \quad \log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}.$$

The geometric mean is generally used with data measured on a logarithmic scale, such as titers of anti-neutrophil immunoglobulin G.

Geometric mean…

• The GM is preferable to the arithmetic mean (AM) if the series of observations contains one or more unusually large values.

• It is obtained by taking the nth root of the product of the n values; i.e., if the values of the observations are denoted by x1, x2, …, xn, then

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$$

• Taking logarithms:

$$\log GM = \frac{1}{n}\log(x_1 x_2 \cdots x_n) = \frac{1}{n}\left(\log x_1 + \log x_2 + \cdots + \log x_n\right) = \frac{\sum_{i=1}^{n}\log x_i}{n}$$

• The logarithm of the geometric mean is equal to the arithmetic mean of the logarithms of the individual values.

• Example: The geometric mean may be calculated for the following parasite counts per 100 fields of thick films.

7 8 3 14 2 1 440 15 52 6 2 1 1 25

12 6 9 2 1 6 7 3 4 70 20 200 2 50

21 15 10 120 8 4 70 3 1 103 20 90 1 237

• $GM = \sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$

• log GM = 1/42 (log 7 + log 8 + log 3 + … + log 237)

= 1/42 (0.8451 + 0.9031 + 0.4771 + … + 2.3747)

= 1/42 (41.9985)

= 0.9999 ≈ 1.0000

• The anti-log of 0.9999 is approximately 9.998 ≈ 10, and this is the required geometric mean. By contrast, the arithmetic mean, which is inflated by the high values of 440, 237 and 200, is 39.8 ≈ 40.
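The comparison between the geometric and arithmetic means can be reproduced with a short Python sketch that works through base-10 logarithms, as in the hand calculation. The function name `geometric_mean` is illustrative; small differences in the last decimals relative to the hand calculation come from rounding the individual logarithms.

```python
import math

# Parasite counts per 100 fields of thick films (42 values)
counts = [7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
          12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
          21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237]


def geometric_mean(values):
    """Anti-log of the arithmetic mean of the base-10 logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


gm = geometric_mean(counts)
am = sum(counts) / len(counts)
print(round(gm))     # 10
print(round(am, 1))  # 39.8
```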


5. Weighted mean (WM)

• In a weighted mean, separate outcomes have separate influences.

• The influence attached to an outcome is the weight.

• A familiar example is the calculation of a course grade as a weighted average of scores on separate outcomes.

Example:
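As a hypothetical illustration of the course-grade case (the scores and weights below are invented for this sketch and do not come from the original slides), a weighted mean can be computed in Python as the sum of weight × value divided by the sum of the weights:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical course: assignment, mid-term and final exam scores,
# contributing 20%, 30% and 50% of the grade respectively.
scores = [80, 70, 90]
weights = [20, 30, 50]

grade = weighted_mean(scores, weights)
print(grade)  # 82.0
```

With equal weights the weighted mean reduces to the ordinary arithmetic mean.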

Which measure of central tendency is best with a given set of data?

• Two factors are important in making this decision:

– The shape of the distribution of the observations

– The scale of measurement

Measures of Dispersion

Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or median but they may be quite different.


• MCT are not enough to give a clear understanding about the distribution of the data.

• We need to know something about the variability or spread of the values — whether they tend to be clustered close together, or spread out over a broad range


Measures of Dispersion…

• Measures that quantify the variation or dispersion of a set of data from its central location.

• Dispersion refers to the variety exhibited by the values of the data.

• The amount of dispersion is small when the values are close together.

• If all the values are the same, there is no dispersion.

• Measures of dispersion include:

– Range

– Inter-quartile range

– Variance

– Standard deviation

– Coefficient of variation

1. Range (R)

• The difference between the largest and smallest observations in a sample.

• Range = Maximum value – Minimum value

• Example

– Data values: 5, 9, 12, 16, 23, 34, 37, 42

– Range = 42 - 5 = 37

• A data set with a larger range exhibits more variability

Properties of range

• It is the simplest, crude measure and can be easily understood

• It takes into account only two values, which makes it a poor measure of dispersion

• Very sensitive to extreme observations

• Extreme values may be unreliable (they are the values most likely to be faulty)

• Not suitable for mathematical treatment

2. Interquartile range (IQR)

• Indicates the spread of the middle 50% of the observations, and is used together with the median

IQR = Q3 - Q1


• Example:

• Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.

IQR = 10.2 kg – 8.8 kg = 1.4 kg

i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.

Properties of IQR:

• It is a simple and versatile measure

• It encloses the central 50% of the observations

• It is not based on all observations but only on two specific values

• Since it excludes the lowest and highest 25% values, it is not affected by extreme values

3. Variance (σ², s²)

• The variance is the average of the squares of the deviations taken from the mean.

• A good measure of dispersion should make use of all the data

• The variance measures the deviation of each observation from the mean

• It does so by averaging the squares of the deviations from the mean.

• Variance is used to measure the dispersion of values relative to the mean.

• When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.

– Population variance = σ²

– Sample variance = S²

a) Ungrouped data

Let X1, X2, ..., XN be the measurements on N population units; then:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}, \quad \text{where } \mu = \frac{\sum_{i=1}^{N} X_i}{N} \text{ is the population mean.}$$

A sample variance is calculated for a sample of individual values (X1, X2, …, Xn) and uses the sample mean $\bar{X}$ rather than the population mean µ:

$$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$$

b) Grouped data

$$S^2 = \frac{\sum_{i=1}^{k}(m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where

mi = the mid-point of the ith class interval

fi = the frequency of the ith class interval

$\bar{x}$ = the sample mean

k = the number of class intervals

Properties of Variance:

• The main disadvantage of the variance is that its unit is the square of the unit of the original measurement values

• The variance of a distribution of weights is not expressed in kg, but in kg²

e.g. mean weight = 36.5 kg, s² = 257 kg²

4. Standard deviation (σ, s)

• It is the square root of the variance.

• This produces a measure having the same scale as that of the individual values.

• Most commonly used

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$

Example

• Areas of sprayable surfaces with DDT from a sample of 15 houses are as follows (m²):

101 105 110 114 115 124 125 125

130 133 135 136 137 140 145

• Find the variance and standard deviation of the above distribution.

• The mean of the sample is 125 m².
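A minimal Python sketch of this calculation is given below, using the sample formula with n - 1 in the denominator. The function name `sample_variance` is illustrative.

```python
import math

# Sprayable surface areas (m^2) for the sample of 15 houses
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


mean_area = sum(areas) / len(areas)
variance = sample_variance(areas)
sd = math.sqrt(variance)

print(mean_area)           # 125.0
print(round(variance, 2))  # 178.71 (in m^4, the square of the original unit)
print(round(sd, 2))        # 13.37 (in m^2)
```

The sum of squared deviations here is 2502, so the variance is 2502/14.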


Example. Compute the variance and SD of the age of 169 subjects from the grouped data.

Class interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5   4      -19.88      395.21       1580.86
20-29            24.5   66     -9.88       97.61        6442.55
30-39            34.5   47     0.12        0.0144       0.68
40-49            44.5   36     10.12       102.41       3686.92
50-59            54.5   12     20.12       404.81       4857.77
60-69            64.5   4      30.12       907.21       3628.86
Total                   169                             20197.64

Mean = 5810.5/169 = 34.38 years

S² = 20197.64/(169 - 1) = 120.22

SD = √S² = √120.22 = 10.96 years
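The grouped variance and SD can also be computed directly in Python. This sketch uses the unrounded mean, so the result can differ in the last decimal from a hand calculation that rounds the mean first; the function name `grouped_variance` is illustrative.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]


def grouped_variance(mids, freqs):
    """Sample variance of grouped data: sum(f_i (m_i - mean)^2) / (sum(f_i) - 1)."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    squared_dev = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs))
    return squared_dev / (n - 1)


age_variance = grouped_variance(midpoints, frequencies)
age_sd = math.sqrt(age_variance)

print(round(age_variance, 2))  # 120.22
print(round(age_sd, 2))        # 10.96
```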


Properties of SD

• The SD has the advantage of being expressed in the same units of measurement as the mean

• SD is considered to be the best measure of dispersion and is used widely

• However, if the units of measurement of the variables in two data sets are not the same, then their variability cannot be compared by comparing the values of the SD.

5. Coefficient of variation (CV)

• When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.

• It is the best measure to compare the variability of two sets of observations.

• A data set with a smaller CV is considered more consistent.

• “Cholesterol is more variable than systolic blood pressure”

              SD          Mean         CV (%)
SBP           15 mmHg     130 mmHg     11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0

• CV is the ratio of the SD to the mean, multiplied by 100: CV = (SD / mean) × 100.
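The two CV values in the table can be verified with a short Python sketch. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = (SD / mean) * 100; unit-free, so different scales can be compared."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return sd / mean * 100


cv_sbp = coefficient_of_variation(15, 130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)  # cholesterol

print(round(cv_sbp, 1))          # 11.5
print(round(cv_cholesterol, 1))  # 20.0
```

Because the units cancel in the ratio, the larger CV for cholesterol supports the statement that it is relatively more variable than systolic blood pressure.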


Characteristics of a Distribution

• A distribution is characterized by:

– Location = the average value

– Dispersion = how spread out the values are

– Modality = number of peaks

– Skewness = whether or not a distribution is symmetric


Which Measures to Use?

• For symmetrically distributed data, mean and SD are used to summarize the data.

• However, in skewed distributions it is preferable to use median and quartiles

• Median and quartiles are not easily influenced by extreme values as are means and SDs.

Exercise

• The following frequency distribution table shows the forced expiratory volume in one second (FEV1) for 13 adults suffering from asthma.

1. Find the mean, median and mode.

2. Find Range, quartiles and IQR.

3. Calculate variance and SD.

4. Suppose the FEV1 value for subject 11 was recorded by mistake as 40.2 instead of 4.02. What will happen to the mean, median and mode, respectively?