mariah-hicks
1.1 - Populations, Samples and Processes
1.2 - Pictorial and Tabular Methods in Descriptive Statistics
1.3 - Measures of Location
1.4 - Measures of Variability
Chapter 1: Overview and Descriptive Statistics
Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.
{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
From these values, we can construct a table of the frequencies of each age interval in the dataset, i.e., a frequency table: 4 values fall in [10, 20), 8 values in [20, 30), 5 values in [30, 40), 2 values in [40, 50), and 1 value in [50, 60).
In published journal articles, the original data are almost never shown; instead, they are summarized in tabular form, as below. This summary is called “grouped data.”
Frequency Histogram (bar heights 4, 8, 5, 2, 1): suggests the population may be skewed to the right (i.e., positively skewed).
Class Interval Frequency
[10, 20) 4
[20, 30) 8
[30, 40) 5
[40, 50) 2
[50, 60) 1
Total n = 20
“Endpoint convention”: Here, the left endpoint of each interval is included, but not the right.
Note! Stay away from labels such as “10-20,” “20-30,” “30-40,” etc., which leave it unclear where a boundary value such as 20 belongs.
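The frequency table and its endpoint convention can be checked with a short script. This is a minimal sketch, not part of the original slides; the names `ages`, `edges`, and `frequency_table` are illustrative assumptions.

```python
# Sample of n = 20 ages from the example.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]

# Class boundaries: each class is [left, right), left endpoint included.
edges = [10, 20, 30, 40, 50, 60]


def frequency_table(values, edges):
    """Count how many values fall in each half-open interval [left, right)."""
    table = {}
    for left, right in zip(edges[:-1], edges[1:]):
        table[(left, right)] = sum(left <= v < right for v in values)
    return table


freq = frequency_table(ages, edges)
for (left, right), count in freq.items():
    print(f"[{left}, {right})  {count}")
print("Total n =", sum(freq.values()))
```

Because the comparison is `left <= v < right`, an age of exactly 20 is counted in [20, 30) and nowhere else, which is the ambiguity that the “10-20,” “20-30” labels leave open.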
Often, though, it is preferable to work with proportions, i.e., relative frequencies: divide the frequencies by n = 20.
Class Interval   Frequency   Relative Frequency
[10, 20)         4           4/20 = 0.20
[20, 30)         8           8/20 = 0.40
[30, 40)         5           5/20 = 0.25
[40, 50)         2           2/20 = 0.10
[50, 60)         1           1/20 = 0.05
Total            n = 20      20/20 = 1.00
Relative frequencies are always between 0 and 1, and sum to 1.
Relative Frequency Histogram (bar heights 0.20, 0.40, 0.25, 0.10, 0.05; vertical axis from 0.0 to 0.4)
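As a quick check of the relative frequencies, the division by n can be sketched as follows. The dictionary `freq` simply restates the frequency table; the variable names are assumptions made for illustration.

```python
# Frequencies from the table, keyed by class interval.
freq = {"[10, 20)": 4, "[20, 30)": 8, "[30, 40)": 5, "[40, 50)": 2, "[50, 60)": 1}

n = sum(freq.values())  # sample size, 20

# Relative frequency = frequency / n.
rel_freq = {interval: count / n for interval, count in freq.items()}

for interval, p in rel_freq.items():
    print(f"{interval}  {freq[interval]}/{n} = {p:.2f}")
print(f"Total  {sum(rel_freq.values()):.2f}")
```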
Cumulative relative frequencies (running totals of the relative frequencies):
“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”
Cumulative relative frequencies always increase from 0 to 1.
Example: Approximately what proportion of the sample is under 34 years old? (The grouped data can give only an approximate answer, not an exact one.)
Solution: Assuming the values are spread evenly within each class, [30, 34) contains 4/10 of 0.25 = 0.1, and [0, 30) contains 0.6, so the sum = 0.7.
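The interpolation in the solution can be sketched in code and compared with the exact proportion from the raw sample. This is an illustrative sketch; the function name `proportion_below` is an assumption, and it relies on the values being spread evenly within each class.

```python
from itertools import accumulate

ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
edges = [10, 20, 30, 40, 50, 60]
rel_freq = [0.20, 0.40, 0.25, 0.10, 0.05]

# Cumulative relative frequency at each class boundary (0 at the first edge).
cumulative = [0.0] + list(accumulate(rel_freq))


def proportion_below(x, edges, cumulative):
    """Approximate proportion below x by linear interpolation within a class."""
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        if left <= x < right:
            fraction = (x - left) / (right - left)
            return cumulative[i] + fraction * (cumulative[i + 1] - cumulative[i])


approx = proportion_below(34, edges, cumulative)   # from grouped data only
exact = sum(a < 34 for a in ages) / len(ages)      # from the raw sample
print(f"approximate: {approx:.2f}, exact: {exact:.2f}")
```

The two answers differ because the five values in [30, 40) are not spread evenly: only one of them (31) lies below 34.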
But alas, there is a major problem…
Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval.
{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
(4 values in [10, 20), 8 values in [20, 30), 8 values in [30, 60))
What effect will this have on the histogram?
Class Interval   Relative Frequency
[10, 20)         4/20 = 0.20
[20, 30)         8/20 = 0.40
[30, 60)         8/20 = 0.40
Total            20/20 = 1.00
Relative Frequency Histogram (bar heights 0.20, 0.40, 0.40; vertical axis from 0.0 to 0.4)
The skew no longer appears. The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.
OUTLIERS (A Pain in the Tuches)
• What are they?
Informally, an outlier is a sample data value that is either “much” smaller or “much” larger than the other values.
• How do they arise?
o experimental error
o measurement error
o recording error
o not an error; genuine
• What can we do about them?
o double-check them if possible
o delete them?
o include them… somehow
o perform analysis both ways
IDEA: Instead of making the height of each class rectangle = relative frequency, make the area of each class rectangle = relative frequency.
“Density” = height = relative frequency / width, so that area = height × width = relative frequency.
Class Interval   Width   Relative Frequency   Density (= height)
[10, 20)         10      0.20                 0.20/10 = 0.020
[20, 30)         10      0.40                 0.40/10 = 0.040
[30, 60)         30      0.40                 0.40/30 = 0.0133…
Total                    20/20 = 1.00
Density Histogram (bar heights 0.02, 0.04, 0.0133…; bar areas 0.20, 0.40, 0.40)
Total Area = 1!
The outlier is included, and the overall skewed appearance is restored.
Exercise: What if the outlier were 99 instead of 59?
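The density calculation can be sketched as below. The list `classes` and the function `densities` are illustrative names, not taken from the slides; each class is given as its interval and its relative frequency.

```python
# Each class: ((left, right), relative frequency), with unequal widths allowed.
classes = [((10, 20), 0.20), ((20, 30), 0.40), ((30, 60), 0.40)]


def densities(classes):
    """Bar height for each class: relative frequency divided by class width."""
    return [rel / (right - left) for (left, right), rel in classes]


heights = densities(classes)

# Area of each bar = height * width; the areas recover the relative frequencies.
total_area = sum(h * (right - left)
                 for h, ((left, right), _) in zip(heights, classes))

for ((left, right), rel), h in zip(classes, heights):
    print(f"[{left}, {right})  width = {right - left}  density = {h:.4f}")
print(f"Total area = {total_area:.2f}")
```

For the exercise, the same function can be reused by changing the last interval to end wherever the new outlier requires.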
“Measures of Center”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}
i   Data values x_i   Frequencies f_i
1   70                1
2   80                4
3   90                2
4   100               3
    Total             n = 10
• sample mode
most frequent value = 80
• sample median
“middle” value = (80 + 90) / 2 = 85
• sample mean
average value: $\bar{x} = \frac{1}{n} \sum_{i} x_i f_i = \frac{1}{10}\,[(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$
(Quartiles are found similarly: Q1 = 80, Q2 = 85, Q3 = 100)
“Notation, notation, notation.”
Data values x_i   Frequencies f_i   Relative Frequencies f(x_i) = f_i / n
70                1                 1/10 = 0.1
80                4                 4/10 = 0.4
90                2                 2/10 = 0.2
100               3                 3/10 = 0.3
Total             n = 10            10/10 = 1.0
• sample mean
$\bar{x} = \frac{1}{n} \sum_{i} x_i f_i = \frac{1}{10}\,[(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$
Equivalently, as a “weighted” sample mean:
$\bar{x} = \sum_{i} x_i f(x_i) = (70)\left(\tfrac{1}{10}\right) + (80)\left(\tfrac{4}{10}\right) + (90)\left(\tfrac{2}{10}\right) + (100)\left(\tfrac{3}{10}\right) = 87$
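The three measures of center for the exam scores can be reproduced with Python's standard `statistics` module. This is a small sketch; the variable names are assumptions, and the mean is computed both from frequencies and as a weighted mean to show that the two forms agree.

```python
from statistics import median, mode

scores = [70, 80, 80, 80, 80, 90, 90, 100, 100, 100]
n = len(scores)

# Frequency table: distinct value x_i -> frequency f_i.
freq = {x: scores.count(x) for x in sorted(set(scores))}

sample_mode = mode(scores)      # most frequent value
sample_median = median(scores)  # average of the two middle values when n is even

# Mean from frequencies: (1/n) * sum of x_i * f_i.
mean_from_freq = sum(x * f for x, f in freq.items()) / n

# "Weighted" mean: sum of x_i * f(x_i), with f(x_i) = f_i / n.
mean_weighted = sum(x * (f / n) for x, f in freq.items())

print(sample_mode, sample_median, mean_from_freq)
```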
“Measures of Spread”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}
… but how do we measure the “spread” of a set of values?
First attempt:
• sample range = x_n – x_1 = 100 – 70 = 30. Simple, but…
It ignores all of the data except the extreme points, and is thus far too sensitive to outliers to be of any practical value.
Example: Company employee salaries, including CEO
Can modify with…
• sample interquartile range (IQR) = Q3 – Q1 = 100 – 80 = 20.
We would still prefer a measure that uses all of the data.
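The range and IQR can be sketched as follows. Several quartile conventions exist; the `quartiles` helper below is a hypothetical function that assumes the “median of the lower and upper halves” convention, which reproduces Q1 = 80 and Q3 = 100 for this sample.

```python
from statistics import median

scores = [70, 80, 80, 80, 80, 90, 90, 100, 100, 100]


def quartiles(values):
    """Q1, Q2, Q3 as medians of the lower half, whole sample, and upper half.

    Assumed convention: for odd n the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles(scores)
sample_range = max(scores) - min(scores)  # uses only the two extreme values
iqr = q3 - q1                             # uses only the two quartiles

print(f"range = {sample_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```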
Better attempt: Calculate the average of the “deviations from the mean.”
• sample mean $\bar{x} = 87$
Data values x_i   Frequencies f_i   Deviations from mean x_i – $\bar{x}$
70                1                 70 – 87 = –17
80                4                 80 – 87 = –7
90                2                 90 – 87 = +3
100               3                 100 – 87 = +13
Total             n = 10
$\frac{1}{n} \sum_{i} (x_i - \bar{x})\, f_i = \frac{1}{10}\,[(-17)(1) + (-7)(4) + (3)(2) + (13)(3)] = 0$ ????????
This is not a coincidence: the deviations always sum to 0*, so their average is not a good measure of variability.
* Physically, the sample mean is a “balance point” for the data.
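Why the deviations always sum to 0 follows directly from the definition of the sample mean. A short derivation, in the frequency-table notation and using only the fact that the frequencies sum to n:

```latex
\sum_{i} (x_i - \bar{x})\, f_i
  = \sum_{i} x_i f_i \;-\; \bar{x} \sum_{i} f_i
  = n\bar{x} - \bar{x}\, n
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n} \sum_{i} x_i f_i
\ \text{ and } \ \sum_{i} f_i = n .
```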
Calculate a modified average of the “squared deviations from the mean” (dividing by n – 1 rather than n).
• sample variance
$s^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2 f_i = \frac{1}{9}\,[(-17)^2(1) + (-7)^2(4) + (3)^2(2) + (13)^2(3)] = 112.22$
• sample standard deviation
$s = \sqrt{s^2} = 10.59$
$\bar{x} = 87$ is a “typical” sample value; s = 10.59 is a “typical” distance from the mean.
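The variance and standard deviation of the exam scores can be verified with a few lines. This is a minimal sketch working from the frequency table; the variable names are illustrative.

```python
from math import sqrt

# Frequency table for the exam scores: value x_i -> frequency f_i.
freq = {70: 1, 80: 4, 90: 2, 100: 3}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n

# Numerator of s^2: sum of squared deviations, weighted by frequency.
sum_of_squares = sum((x - mean) ** 2 * f for x, f in freq.items())

variance = sum_of_squares / (n - 1)  # divide by n - 1, not n
std_dev = sqrt(variance)

print(f"mean = {mean}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```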
Grouped Data - revisited
Class Interval   Midpoint x_i   Absolute Frequency f_i
[10, 20)         15             4
[20, 30)         25             8
[30, 60)         45             8
mean: $\bar{x} = \frac{1}{n} \sum_{i} x_i f_i$
variance: $s^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2 f_i$
Use the interval midpoints for $x_i$.
Compare this “grouped mean” with the actual mean.
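The suggested comparison can be carried out with a short script. This sketch applies the two formulas with interval midpoints standing in for the data values and then computes the mean of the raw sample of 20 ages; the variable names are assumptions.

```python
# Raw sample, used only for the comparison at the end.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]

# Grouped data: ((left, right), absolute frequency).
grouped = [((10, 20), 4), ((20, 30), 8), ((30, 60), 8)]

freqs = [f for _, f in grouped]
n = sum(freqs)

# Interval midpoints stand in for the unknown individual values.
midpoints = [(left + right) / 2 for (left, right), _ in grouped]

grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
grouped_var = sum((x - grouped_mean) ** 2 * f
                  for x, f in zip(midpoints, freqs)) / (n - 1)

actual_mean = sum(ages) / len(ages)

print(f"grouped mean = {grouped_mean}, actual mean = {actual_mean}")
print(f"grouped variance = {grouped_var:.2f}")
```

The grouped mean is only an approximation: the wide class [30, 60) is represented by its midpoint 45, although most of its values lie below that.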
median Q2 = ?
Class Interval   Absolute Frequency   Relative Frequency   Density
[10, 20)         4                    0.20                 0.020
[20, 30)         8                    0.40                 0.040
[30, 60)         8                    0.40                 0.01333
Density Histogram (bar heights 0.02, 0.04, 0.0133…; bar areas 0.20, 0.40, 0.40)
Step 1. Identify the interval & rectangle. The cumulative area is 0.20 at 20 and 0.60 at 30, so the median Q lies in [20, 30).
Step 2. Split the rectangle so that 0.5 area lies above and below Q: area 0.3 of this rectangle lies below Q, and 0.1 above.
Step 3. Observe that this rectangle can be split into 4 strips of area 0.1 each.
Step 4. Thus, split the interval into 4 equal parts, each of width (30 – 20)/4 = 2.5, at 22.5, 25, and 27.5. Three strips lie below Q, so Q = 27.5.
…OR…
Step 3. Set up a proportion and solve for Q:
$\frac{Q - 20}{30 - 20} = \frac{0.3}{0.4}$, which gives Q = 27.5.
…OR…
Label the interval endpoints a = 20 and b = 30, and the areas A = 0.3 (below Q) and B = 0.1 (above Q), and use the formula
$Q = \frac{Ab + Ba}{A + B} = \frac{(0.3)(30) + (0.1)(20)}{0.3 + 0.1} = 27.5$
• Other percentiles are done similarly.
• Solve using the cumulative distribution, without the histogram.
• Solve for areas, given Q.
• See posted Lecture Notes!
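The median calculation, and other percentiles, can be sketched as a small function that works from the cumulative relative frequencies without drawing the histogram. The name `grouped_quantile` is hypothetical, and the sketch assumes values are spread evenly within each class.

```python
def grouped_quantile(p, classes):
    """Value Q with proportion p of the total area to its left.

    classes: list of ((left, right), relative_frequency) in increasing order.
    Assumes values are spread evenly within each class.
    """
    cumulative = 0.0
    for (left, right), rel in classes:
        if rel > 0 and cumulative + rel >= p:
            # Fraction of this rectangle's area that must lie below Q.
            fraction = (p - cumulative) / rel
            return left + fraction * (right - left)
        cumulative += rel
    # Fallback for p at (or rounding just past) the top of the distribution.
    return classes[-1][0][1]


classes = [((10, 20), 0.20), ((20, 30), 0.40), ((30, 60), 0.40)]

q2 = grouped_quantile(0.5, classes)
print(f"grouped median Q2 = {q2:.1f}")
```

For p = 0.5 the function stops in [20, 30), needs 0.3 of that rectangle's 0.4 area, and so returns 20 + (0.3/0.4)(10).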
Comments
• $\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)
• Beware of roundoff error!!! There is an alternate, more computationally stable formula for sample variance $s^2$.
• The numerator of $s^2$ is called a sum of squares (SS); the denominator “n – 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”
• A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$. See lecture notes appendix…