Measures of Central Tendency and their dispersion and
applications
Acknowledgement: Dr Muslima Ejaz
LEARNING OBJECTIVES:
• Compute and distinguish between the uses of the measures of central tendency: mean, median and mode.
• Compute and list some uses for the measures of variation or dispersion: range, variance and standard deviation.
• Understand the distinction between the population mean and the sample mean.
• Learn the empirical rule and its application.
9/24/2013
REFERENCES:
• Kuzma, Jan W. and Bohnenblust, Stephen E. Basic Statistics for the Health Sciences. Mayfield Publishing Company, 2001.
• Ott, Lyman. An Introduction to Statistical Methods and Data Analysis. PWS-Kent Publishing Company, 1988.
• The average speed of a car crossing midtown Manhattan during the day is 5.3 miles/hr.
• The average time an American father of a 4-year-old spends alone with his child each day is 42 minutes.
• The average American man is 5 feet 9 inches tall and the average woman is 5 feet 3.6 inches tall.
• The average American man is sick in bed seven days a year, missing 5 days of work.
Measures of Central Tendency (center of the distribution)
• Find a single score that is most typical or most representative of the entire group
• Helpful in comparing groups
• No single measure is representative in every situation; there are three ways of determining central tendency:
  • Mean
  • Median
  • Mode
Mean
• Also called the arithmetic mean or average
• The sum of all scores divided by the number of scores

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Sample Mean
• Add up all the observations given in the data, then divide by the sample size (n)
• The sample size n is the number of observations
Example: Mean
n = 5 systolic blood pressures (mmHg)
• X1 = 120
• X2 = 80
• X3 = 90
• X4 = 110
• X5 = 95
Example: Mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Mean Systolic Blood Pressure:

$$\bar{X} = \frac{495}{5} = 99$$
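The calculation above can be sketched in a few lines of Python. This is only an illustration of the sample mean formula; the names `pressures` and `sample_mean` are made up for this example and are not part of the original slides.

```python
# Systolic blood pressures (mmHg) from the example: X1..X5.
pressures = [120, 80, 90, 110, 95]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


mean_bp = sample_mean(pressures)
print(mean_bp)  # 495 / 5 = 99.0
```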
Pros and Cons of the Mean
• Pros
  • Mathematical center of a distribution.
  • Just as far from the scores above it as it is from the scores below it (the deviations above and below the mean balance out).
  • Does not ignore any information.
• Cons
  • Influenced by extreme scores and skewed distributions.
  • One data point could make a great change in the sample mean.
Example
n = 5 systolic blood pressures (mmHg)
• X1 = 120
• X2 = 180
• X3 = 90
• X4 = 110
• X5 = 95
• Mean Systolic Blood Pressure:

$$\bar{X} = \frac{595}{5} = 119$$
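A short sketch comparing the two blood pressure examples shows how much a single extreme value moves the mean. It uses `statistics.mean` from the Python standard library; the variable names are illustrative only.

```python
from statistics import mean

# The first example, and the same data with X2 changed from 80 to 180.
original = [120, 80, 90, 110, 95]
with_extreme_value = [120, 180, 90, 110, 95]

mean_original = mean(original)            # 495 / 5
mean_extreme = mean(with_extreme_value)   # 595 / 5

# One changed observation out of five shifts the mean by 20 mmHg.
shift = mean_extreme - mean_original
print(mean_original, mean_extreme, shift)
```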
Population Versus Sample Mean
Population: The entire group you want information about.
• For example: the blood pressure of all 18-year-old male medical college students at AKU
Cont…
Sample: A part of the population from which we actually collect information and draw conclusions about the whole population.
• For example: a sample of the blood pressures of five (n = 5) 18-year-old male college students at AKU
Mean
• Population mean ("mu"):

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

  • $\sum$ ("sigma"): the sum of X, add up all scores
  • N: the total number of scores in a population

• Sample mean ("X bar"):

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

  • $\sum$ ("sigma"): the sum of X, add up all scores
  • n: the total number of scores in a sample
The Median
• The score that divides the distribution exactly in half when observations are ordered
• The 50th percentile (50%)
• Goal: determine the exact midpoint
• Located at rank (n + 1)/2 in the ordered observations
• With the scores arranged in order (highest to lowest, or lowest to highest), it is the middle score
Example: Median
Raw data: 110, 90, 80, 95, 120
Ordered: 80, 90, 95, 110, 120
• The median is the middle value when observations are ordered.
• To find the middle, count in (n + 1)/2 scores when observations are ordered lowest to highest.
• Median Systolic BP:
  • (5 + 1)/2 = 3, so the median is the 3rd ordered value, 95 mmHg
Finding the median with an even number of scores.
• With an even number of scores, the median is the average of the middle two observations when observations are ordered.
80, 90, 95, 110, 120, 125
• (95 + 110)/2 = 102.5
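The odd and even cases can be combined in one small function. This is a minimal sketch of the counting rule described above; `median` is written by hand here for illustration (the standard library's `statistics.median` does the same job).

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at rank (n + 1) / 2, which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_odd = median([110, 90, 80, 95, 120])        # 95
median_even = median([80, 90, 95, 110, 120, 125])  # 102.5
print(median_odd, median_even)
```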
Example: Median
80, 90, 95, 110, 220
Median = 95 (the middle value, unaffected by the extreme score 220)
Pros and Cons of Median
• Pros
  • Not influenced by extreme scores or skewed distributions.
  • Easier to compute than the mean.
• Cons
  • Doesn't take actual values into account.
  • As its value is determined solely by its rank, it provides no information about any of the other values within the distribution.
The Mode
• The most frequently occurring score (the score with the highest frequency)
• Applicable to qualitative and quantitative data
• A distribution could be bi-modal or multi-modal
Central Tendency Example: Mode
75, 76, 90, 90, 95, 99, 100, 120, 120, 135, 135, 155, 170, 186, 196, 205, 220
• Mode: most frequent observation
• Mode(s) for Blood Pressure:
  • 90, 120, 135
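Finding every mode amounts to counting how often each value occurs and keeping the values with the highest count. The sketch below uses `collections.Counter`; the helper name `modes` is an assumption made for this example.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    # Note: if every value occurs once, this returns all of them;
    # by the usual convention such data are said to have no mode.
    return sorted(value for value, count in counts.items() if count == highest)


blood_pressures = [75, 76, 90, 90, 95, 99, 100, 120, 120, 135, 135,
                   155, 170, 186, 196, 205, 220]
bp_modes = modes(blood_pressures)
print(bp_modes)  # [90, 120, 135]
```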
Pros and Cons of the Mode
• Pros
  • Easiest to compute and understand.
  • The score comes from the data set.
• Cons
  • Ignores most of the information in a distribution.
  • Small samples may not have a mode.
Using different measures of central tendency
Two factors are important in deciding which measure of central tendency should be used:
• Scale of measurement (ordinal or numerical)
• Shape of the distribution of observations.
  • A distribution can be symmetric, skewed to the right (positively skewed), or skewed to the left (negatively skewed).
Using different measures of central tendency
• In a normal distribution, the mean, median, and mode are the same.
[Figure: normal curve f(x) against x, with the mean, median, and mode all located at µ]
The effect of skew on average.
• In a skewed distribution, the mean is pulled toward the tail: typically above the median when the skew is to the right (positive) and below it when the skew is to the left (negative).
Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
• The mean is used for numerical data and for symmetric distributions.
[Figure: histogram of a symmetric distribution; x-axis: Values (-4 to 4), y-axis: Frequency (0.0 to 0.3)]
Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
• The median is used for ordinal data or for numerical data whose distribution is skewed.
Using different measures of central tendency
The following guidelines help the researcher decide which measure is best with a given set of data:
• The mode is used primarily for nominal or ordinal data or for numerical data with a bimodal distribution.
[Figure: histogram; x-axis: Stress Rating (0 to 10), y-axis: Frequency (0 to 30)]
Measures of Variation
Or
Measures of Dispersion
Measures of Variability
• A single summary figure that describes the spread of observations within a distribution.
[Figure: distributions centrally located at the same value on the horizontal axis, but with substantially different amounts of variability]
Measures of Variability
• Consider the following two data sets on the ages of all patients suffering from bladder cancer (BC) and prostatic cancer (PC).

BC: 39, 45, 36, 40, 35, 38, 47
PC: 27, 52, 18, 33, 70

• The mean age of both groups is 40 years.
• If we do not know the ages of individual patients and are told only that the mean age of the patients in the two groups is the same, we may assume that the patients in the two groups have a similar age distribution.
• However, the variation in the patients' ages in these two groups is very different.
• The ages of the prostatic cancer patients have a much larger variation than the ages of the bladder cancer patients.
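A quick check of the two age lists illustrates the point: the means agree, while the spread between the youngest and oldest patient does not. This is a minimal sketch; the variable names are illustrative, and the max-minus-min spread is used only as the simplest possible measure.

```python
from statistics import mean

# Ages of all patients in each group, as listed above.
bladder_cancer = [39, 45, 36, 40, 35, 38, 47]
prostatic_cancer = [27, 52, 18, 33, 70]

mean_bc = mean(bladder_cancer)
mean_pc = mean(prostatic_cancer)

# Spread between the oldest and youngest patient in each group.
spread_bc = max(bladder_cancer) - min(bladder_cancer)      # 47 - 35
spread_pc = max(prostatic_cancer) - min(prostatic_cancer)  # 70 - 18

print(mean_bc, mean_pc)      # both 40
print(spread_bc, spread_pc)  # 12 versus 52
```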
Measures of Variability
• Measure the "spread" in the data
• Some important measures:
  • Range
  • Mean deviation
  • Variance
  • Standard Deviation
  • Coefficient of variation
Variability
• The purpose of the majority of medical, behavioural and social science research is to explain or account for variance or differences among individuals or groups.
Examples
1. What factors account for the variance (or difference) in IQ among individuals?
2. What factors account for the variance in treatment compliance among different groups of patients?
Range
• The range tells us the span over which the data are distributed, and is only a very rough measure of variability
• Range: the difference between the maximum and minimum scores
80, 90, 95, 110, 120
• Range = 120 − 80 = 40
Range
• Range is the simplest measure of dispersion
• It depends entirely on the extreme scores and doesn't take into consideration the bulk of the observations
Variation
X      $X - \bar{X}$
5      0.00
5      0.00
5      0.00
5      0.00
5      0.00

$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data with no (i.e., zero) variability
Variation
X      $X - \bar{X}$
6      +1.00
4      -1.00
6      +1.00
5      0.00
4      -1.00

$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data with low variability
Variation
X      $X - \bar{X}$
8      +3.00
1      -4.00
9      +4.00
5      0.00
2      -3.00

$\sum X = 25$, $n = 5$, $\bar{X} = 5$
This is an example of data with high variability
Mean deviation
• The best measures of dispersion should:
  • take into account all the scores in the distribution
  • describe the average deviation of all observations from the mean.
• Normally, to find the average we would want to sum all deviations from the mean and then divide by n. Because the signed deviations always sum to zero, the mean deviation averages their absolute values instead, i.e.,

$$\text{Mean deviation} = \frac{\sum |X - \bar{X}|}{n}$$
Mean Deviation
n = 6; $\sum X = 33$; $\bar{X} = \sum X / n = 33/6 = 5.50$

X      $|X - \bar{X}|$
3      |3 − 5.50| = 2.50
5      |5 − 5.50| = 0.50
9      |9 − 5.50| = 3.50
2      |2 − 5.50| = 3.50
8      |8 − 5.50| = 2.50
6      |6 − 5.50| = 0.50

$\sum |X - \bar{X}| = 13$
Mean Deviation = 13/6 = 2.167
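The table above can be reproduced with a short function. This is a sketch of the mean (absolute) deviation as worked in the example; the function name `mean_deviation` is chosen for illustration.

```python
def mean_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


scores = [3, 5, 9, 2, 8, 6]
md = mean_deviation(scores)
print(round(md, 3))  # 13 / 6, about 2.167
```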
Variance & Standard Deviation
• However, if we square each of the deviations from the mean before summing them, we obtain a sum that is not equal to zero
• This is the basis for the measures of variance and standard deviation, the two most common measures of variability (or dispersion) of data
Variance & Standard Deviation (cont)
X      $X - \bar{X}$      $(X - \bar{X})^2$
8      +3.00      9.00
1      -4.00      16.00
9      +4.00      16.00
5      0.00       0.00
2      -3.00      9.00

$\sum X = 25$, $\sum (X - \bar{X}) = 0.00$, $\sum (X - \bar{X})^2 = 50.00$
Note: $\sum (X - \bar{X})^2$ is called the Sum of Squares
Steps to calculate Variance
• Compute the mean.
• Subtract the mean from each observation.
• Square each of the deviations.
• Find the sum of the squares.
• Divide the sum by N to get the variance (for a sample, divide by n − 1 instead).
• Take the square root of the variance to get the standard deviation.
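The steps can be followed one by one in code. The sketch below applies them to the scores 8, 1, 9, 5, 2 from the sum-of-squares table and divides by N, i.e. it treats the five scores as a whole population; the variable names are illustrative.

```python
import math

scores = [8, 1, 9, 5, 2]
n = len(scores)

# Step 1: compute the mean.
mean = sum(scores) / n                          # 25 / 5 = 5.0
# Step 2: subtract the mean from each observation.
deviations = [x - mean for x in scores]         # +3, -4, +4, 0, -3
# Step 3: square each of the deviations.
squared = [d ** 2 for d in deviations]          # 9, 16, 16, 0, 9
# Step 4: find the sum of the squares.
sum_of_squares = sum(squared)                   # 50.0
# Step 5: divide the sum by N to get the (population) variance.
variance = sum_of_squares / n                   # 10.0
# Step 6: take the square root to get the standard deviation.
standard_deviation = math.sqrt(variance)        # about 3.16

print(sum_of_squares, variance, round(standard_deviation, 2))
```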
Few Facts
• The square root of the variance gives the standard deviation (SD); conversely, squaring the SD gives the variance.
• Variance is the average of the squared distances of each value from the mean.
• Why the squared distances and not the actual ones? The sum of the signed distances is always zero; squaring each distance eliminates the negative signs.
• Why take the square root? Since the distances were squared, the units of the resulting numbers are the squares of the units of the original raw data. Taking the square root of the variance puts the SD in the same units as the raw data, i.e., the standard deviation expresses variability in the same units as the data.
Sample Variance
• The sum of squared deviations from the mean divided by n − 1 (an estimate of the population variance):

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
Variance of a Population
• The sum of squared deviations from the mean divided by the number of scores (sigma squared):

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
Standard Deviation Formulas
Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

Sample Standard Deviation

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

A sample standard deviation computed with n in the denominator tends to underestimate the population standard deviation. Using n − 1 in the denominator largely corrects for this and gives us a better estimate of the population standard deviation.
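The effect of the two denominators can be seen with Python's standard `statistics` module, which provides both versions: `pvariance`/`pstdev` divide by N, while `variance`/`stdev` divide by n − 1. The data are the five scores 8, 1, 9, 5, 2 used earlier; treating them first as a population and then as a sample is done purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [8, 1, 9, 5, 2]  # sum of squared deviations = 50

population_variance = pvariance(scores)  # 50 / 5
sample_variance = variance(scores)       # 50 / (5 - 1)

population_sd = pstdev(scores)           # square root of 10
sample_sd = stdev(scores)                # square root of 12.5

# Dividing by n - 1 always gives the larger value of the two.
print(population_variance, sample_variance)
print(round(population_sd, 3), round(sample_sd, 3))
```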
• Sometimes it is of interest to compare the degree of variability in the distribution of a factor from two different populations, or of two different variables from the same population; e.g., SBP (factor) among children and adults (two different populations), or, among adults, whether the distribution of SBP has more spread than that of DBP
Coefficient of variation: expresses the SD as proportion of the mean
• It is a dimensionless measure of relative variation.
• Constructed by dividing the standard deviation by the mean and multiplying by 100:
  CV = (SD / mean) × 100
• It depicts the size of the standard deviation relative to its mean
• Used to compare the variability in one data set with that in another when a direct comparison of standard deviations is not appropriate.
Coefficient of variation
• The formula is: $CV = (s / \bar{X}) \times 100$
• Suppose two samples of human males yield the following results:

              Children    Adults
Mean age      11 yrs      25 yrs
Mean wt       80 lbs      145 lbs
SD            10 lbs      10 lbs
CV            12.5%       6.9%
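The CV row of the table follows directly from the formula. A minimal sketch, with `coefficient_of_variation` as an illustrative function name:

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Weights (lbs): both groups have SD = 10, but different means.
cv_children = coefficient_of_variation(sd=10, mean=80)
cv_adults = coefficient_of_variation(sd=10, mean=145)

print(round(cv_children, 1), round(cv_adults, 1))  # 12.5 and 6.9
```

The same absolute spread of 10 lbs is relatively larger for the children because their mean weight is smaller.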
Using different measures of dispersion
The following guidelines help investigators decide which measure of dispersion is most appropriate for a given set of data:
• The standard deviation is used when the mean is used, i.e., with symmetric distributions of numerical data.
• The range is used with numerical data when the purpose is to emphasize extreme values.
• The coefficient of variation is used when the intent is to compare two numerical distributions measured on different scales.
Empirical Rule
• Specifies the proportion of the spread in terms of the standard deviation
• It applies to the normal (symmetric, bell-shaped) distribution
  • Approx. 68% of the data values will fall within 1 SD of the mean
  • Approx. 95% of the data values will fall within 2 SD of the mean
  • Approx. 99.7% of the data values will fall within 3 SD of the mean
Empirical Rule
[Figure: normal curve showing the approximate percentage of area within given standard deviations of the mean: 68% within 1 SD, 95% within 2 SD, 99.7% within 3 SD]
Assume the distribution of the underlying variable is symmetric and bell shaped (Normal)
Example
• Scores on a National Achievement Exam have a mean of 480 and an SD of 90. If these scores are normally distributed, then
  • approximately 68% will fall between 390 & 570
  • approximately 95% will fall between 300 & 660
  • approximately 99.7% will fall between 210 & 750
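The three intervals are simply the mean plus or minus 1, 2 and 3 standard deviations. A small sketch, with `empirical_rule_intervals` as an illustrative helper name:

```python
def empirical_rule_intervals(mean, sd):
    """Return the (lower, upper) limits for 1, 2 and 3 SD around the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


exam_intervals = empirical_rule_intervals(mean=480, sd=90)
for k, coverage in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    lower, upper = exam_intervals[k]
    print(f"about {coverage} of scores fall between {lower} and {upper}")
```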
Application of the Empirical Rule
Women participating in a three-day experimental diet regime have been demonstrated to have normally distributed weight loss with a mean of 600 g and a standard deviation of 200 g.

a) What percentage of these women will have a weight loss between 400 and 800 g?
b) What percentage of women will lose weight too quickly on the diet (where too much weight is defined as > 1000 g)?
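X: normal with mean 600 g and SD 200 g

a) 400 g to 800 g is within 1 SD of the mean (600 ± 200): ~ 68%
[Figure: normal curve, horizontal axis from 0 to 1200 g]

b) 1000 g is 2 SD above the mean (600 + 2 × 200): about 2.3% of women lose more than 1000 g (the empirical rule alone gives roughly (100% − 95%)/2 = 2.5%)
[Figure: normal curve, horizontal axis from 0 to 1200 g]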
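Both answers can be checked against the exact normal distribution using `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is a sketch for verification only; the variable names are illustrative.

```python
from statistics import NormalDist

# Weight loss in grams: normal with mean 600 and standard deviation 200.
weight_loss = NormalDist(mu=600, sigma=200)

# a) Proportion losing between 400 g and 800 g (within 1 SD of the mean).
p_between = weight_loss.cdf(800) - weight_loss.cdf(400)

# b) Proportion losing more than 1000 g (more than 2 SD above the mean).
p_above = 1 - weight_loss.cdf(1000)

print(f"a) {p_between:.1%}")  # close to the 68% of the empirical rule
print(f"b) {p_above:.1%}")    # close to the 2.3% shown above
```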