Lecture 4 Chapter 2. Numerical descriptors.
Objectives (PSLS Chapter 2)
Describing distributions with numbers
Measure of center: mean and median (Meas. Cent. Award)
Measure of spread: quartiles, standard deviation, IQR (Meas. Var. Award)
The five-number summary and boxplots (SUMS Award)
Dealing with outliers (outliers award)
Choosing among summary statistics (All Numeric Awards)
Organizing a statistical problem (Foundational)
The mean, or arithmetic average
To calculate the average (mean) of a data set, add all values, then
divide by the number of individuals. It is the “center of mass.”
Measure of center: the mean
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
where n is the sample size and x is the variable.
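As a minimal sketch of the formula above, the mean can be computed by hand in Python. The data are the 25 observations used in the examples of this lecture; the names `survival_years` and `mean` are illustrative choices, not part of the lecture.

```python
# The 25 observations used in the examples of this lecture.
survival_years = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
                  3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]


def mean(values):
    """Add all values, then divide by the number of individuals."""
    return sum(values) / len(values)


x_bar = mean(survival_years)
print(round(x_bar, 3))  # 3.312
```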
Measure of center: the median
The median is the midpoint of a distribution—the number such that
half of the observations are smaller, and half are larger.
1) Sort observations from smallest to largest. n = number of observations.
2) The location of the median is (n + 1)/2 in the sorted list.
If n is even, the median is the mean of the two center observations.
Observations (position: value):
1: 0.6, 2: 1.2, 3: 1.6, 4: 1.9, 5: 1.5, 6: 2.1, 7: 2.3, 8: 2.3, 9: 2.5, 10: 2.8, 11: 2.9, 12: 3.3, 13: 3.4, 14: 3.6, 15: 3.7, 16: 3.8, 17: 3.9, 18: 4.1, 19: 4.2, 20: 4.5, 21: 4.7, 22: 4.9, 23: 5.3, 24: 5.6
n = 24, (n + 1)/2 = 12.5
Median = (3.3 + 3.4)/2 = 3.35
If n is odd, the median is the value of the center observation.
Observations (position: value):
1: 0.6, 2: 1.2, 3: 1.6, 4: 1.9, 5: 1.5, 6: 2.1, 7: 2.3, 8: 2.3, 9: 2.5, 10: 2.8, 11: 2.9, 12: 3.3, 13: 3.4, 14: 3.6, 15: 3.7, 16: 3.8, 17: 3.9, 18: 4.1, 19: 4.2, 20: 4.5, 21: 4.7, 22: 4.9, 23: 5.3, 24: 5.6, 25: 6.1
n = 25, (n + 1)/2 = 13
Median = 3.4
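The two steps above can be sketched in Python as follows. The function name `median` and the list names are illustrative. The values are sorted inside the function rather than relying on the order in which they were typed.

```python
def median(values):
    """Median by the (n + 1)/2 location rule."""
    ordered = sorted(values)           # step 1: sort smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]            # odd n: the center observation
    # even n: the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


obs_24 = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
          3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6]
obs_25 = obs_24 + [6.1]

print(round(median(obs_24), 2))  # 3.35
print(median(obs_25))            # 3.4
```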
[Figure: position of the mean and the median for a symmetric distribution and for skewed distributions (left skew, right skew).]
Comparing the mean and the median
The median is a measure of center that is resistant to skew and
outliers. The mean is not.
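A small illustration of resistance, using Python's standard `statistics` module: replacing the largest of the 25 observations (6.1) with the more extreme value 7.9 that appears in the outlier example of this lecture shifts the mean but leaves the median unchanged. The variable names are illustrative.

```python
from statistics import mean, median

base = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
        3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6]

original = base + [6.1]      # largest observation is 6.1
with_outlier = base + [7.9]  # largest observation replaced by 7.9

print(round(mean(original), 3), median(original))          # 3.312 3.4
print(round(mean(with_outlier), 3), median(with_outlier))  # 3.384 3.4
```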
Measure of spread: quartiles
M = median = 3.4
Q1= first quartile = 2.2
Q3= third quartile = 4.35
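Observations (position: value):
1: 0.6, 2: 1.2, 3: 1.6, 4: 1.9, 5: 1.5, 6: 2.1, 7: 2.3, 8: 2.3, 9: 2.5, 10: 2.8, 11: 2.9, 12: 3.3, 13: 3.4, 14: 3.6, 15: 3.7, 16: 3.8, 17: 3.9, 18: 4.1, 19: 4.2, 20: 4.5, 21: 4.7, 22: 4.9, 23: 5.3, 24: 5.6, 25: 6.1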
The first quartile, Q1, is the median
of the values below the median in the
sorted data set.
The third quartile, Q3, is the median
of the values above the median in the
sorted data set.
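The quartile definitions above can be sketched as a median of each half of the sorted data. This follows the hand rule stated here (the median itself is left out of both halves when n is odd); statistical software often uses other interpolation conventions and may return slightly different quartiles. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # positions below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # positions above the median
    return median(lower), median(ordered), median(upper)


obs_25 = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
          3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(obs_25)
print(round(q1, 2), m, round(q3, 2))  # 2.2 3.4 4.35
```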
How fast do skin wounds heal?
Here are the skin healing rate data from 18 newts, measured in micrometers per hour:
28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35
Sorted data:
11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40
Median = ???
Quartiles = ???
Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
To get the standard deviation of a SAMPLE of data:
1) Calculate the variance s^2:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
2) Take the square root to get the standard deviation s:
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
Learn how to obtain the standard deviation of a sample using a spreadsheet.
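Outside a spreadsheet, the same two steps can be sketched in Python. The functions `sample_variance` and `sample_std` are illustrative names; `statistics.stdev` from the standard library also divides by n - 1 and is used here only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Step 1: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Step 2: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


obs_25 = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
          3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

s = sample_std(obs_25)
print(round(s, 3))
print(math.isclose(s, statistics.stdev(obs_25)))  # True
```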
A person’s metabolic rate is the rate at which the body consumes energy. Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).
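$$ \bar{x} = \frac{\sum x_i}{n} = 1600 $$
$$ \sum (x_i - \bar{x})^2 = 214{,}870 $$
$$ df = n - 1 = 6 $$
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{df} = \frac{214{,}870}{6} = 35{,}811.7 $$
$$ s = \sqrt{35{,}811.7} = 189.2 $$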
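The arithmetic of this example can be checked with a few lines of Python. Only the summary numbers stated above are used (n = 7 and a sum of squared deviations of 214,870), since the seven individual metabolic rates are not listed here. The variable names are illustrative.

```python
import math

n = 7                 # sample size (7 men)
sum_sq_dev = 214_870  # sum of squared deviations from the mean, as given above

df = n - 1                  # degrees of freedom
variance = sum_sq_dev / df  # s^2, in Cal^2
s = math.sqrt(variance)     # s, in Cal per 24 hours

print(df, round(variance, 1), round(s, 1))  # 6 35811.7 189.2
```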
Center and spread in boxplots
Observations, largest to smallest (position: value):
25: 6.1, 24: 5.6, 23: 5.3, 22: 4.9, 21: 4.7, 20: 4.5, 19: 4.2, 18: 4.1, 17: 3.9, 16: 3.8, 15: 3.7, 14: 3.6, 13: 3.4, 12: 3.3, 11: 2.9, 10: 2.8, 9: 2.5, 8: 2.3, 7: 2.3, 6: 2.1, 5: 1.5, 4: 1.9, 3: 1.6, 2: 1.2, 1: 0.6
“Five-number summary”:
max = 6.1
Q3 = 4.35
median = 3.4
Q1 = 2.2
min = 0.6
[Figure: boxplot of years until death for Disease X.]
Boxplots and skewed data
[Figure: boxplots of years until death for Disease X and Multiple Myeloma. Boxplots for a symmetric and a right-skewed distribution.]
Boxplots show symmetry or skew.
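The five numbers that a boxplot draws can be computed directly. The sketch below assumes the median-of-halves rule for quartiles used in this lecture; `five_number_summary` is an illustrative name, and plotting software may compute quartiles slightly differently.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


disease_x = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
             3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

summary = five_number_summary(disease_x)
print([round(v, 2) for v in summary])  # [0.6, 2.2, 3.4, 4.35, 6.1]
```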
IQR and outliers
The interquartile range (IQR) is the distance between the first and
third quartiles (the length of the box in the boxplot).
IQR = Q3 – Q1
An outlier is an individual value that falls outside the overall pattern.
How far outside the overall pattern does a value have to fall to be
considered a suspected outlier?
Suspected low outlier: any value < Q1 – 1.5 IQR
Suspected high outlier: any value > Q3 + 1.5 IQR
Q3 = 4.35
Q1 = 2.2
Observations, largest to smallest (position: value):
25: 7.9, 24: 5.6, 23: 5.3, 22: 4.9, 21: 4.7, 20: 4.5, 19: 4.2, 18: 4.1, 17: 3.9, 16: 3.8, 15: 3.7, 14: 3.6, 13: 3.4, 12: 3.3, 11: 2.9, 10: 2.8, 9: 2.5, 8: 2.3, 7: 2.3, 6: 2.1, 5: 1.5, 4: 1.9, 3: 1.6, 2: 1.2, 1: 0.6
[Figure: boxplot of years until death for Disease X.]
Interquartile range: Q3 - Q1 = 4.35 - 2.2 = 2.15
Distance to Q3: 7.9 - 4.35 = 3.55
Individual #25 has a survival of 7.9 years, which is 3.55 years
above the third quartile. This is more than 1.5 IQR = 3.225 years.
Individual #25 is a suspected outlier.
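The 1.5 IQR rule can be sketched in Python and applied to the Disease X data above. The function name `suspected_outliers` is illustrative, and quartiles are computed with the median-of-halves rule from this lecture.

```python
from statistics import median


def suspected_outliers(values):
    """Return values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]


base = [0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3,
        3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6]

print(suspected_outliers(base + [7.9]))  # [7.9]
print(suspected_outliers(base + [6.1]))  # []
```

A flagged value is only a suspect: the rule says where to look, not what to do with the observation.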
Dealing with outliers: Baldi and Moore’s Suggestions
What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:
Human error in recording information
Human error in experimentation or data collection
Unexplainable but apparently legitimate wild observations
Are you interested in ALL individuals?
Are you interested only in typical individuals?
Learn. Does the outlier tell you something interesting about
biology?
Don’t discard outliers just to make your data look better, and don’t act as if they did not exist.
Choosing among summary statistics: B & M
Because the mean is not resistant
to outliers or skew, it is best
used to describe distributions that
are fairly symmetrical and don’t
have outliers.
Plot the mean and use the
standard deviation for error bars.
Otherwise, use the median and the
five-number summary, which can be
plotted as a boxplot.
Describe a distribution with its
S.U.M.S. (shape, unusual points,
middle, and spread).
[Figure: height of 30 women, in inches, shown as a boxplot and as mean ± s.d.]