STA 291 Summer 2010
Lecture 4
Dustin Lueker

Population Distribution
The population distribution for a continuous variable is usually represented by a smooth curve
  ◦ Like a histogram that gets finer and finer
  ◦ Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate
Symmetric distributions
  ◦ Bell-shaped
  ◦ U-shaped
  ◦ Uniform
Not symmetric distributions
  ◦ Left-skewed
  ◦ Right-skewed
  ◦ Skewed
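
The rectangle analogy can be sketched numerically. The example below is only an illustration under an assumption: it takes the standard normal density as the smooth bell-shaped curve, and the function names `normal_density` and `rectangle_area` are hypothetical. As the rectangles get narrower, the total rectangle area settles toward the area under the curve.

```python
import math

def normal_density(x):
    """Standard normal density, assumed here as an example of a smooth bell-shaped curve."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def rectangle_area(f, a, b, bins):
    """Approximate the area under f on [a, b] with `bins` equal-width rectangles.

    Each rectangle takes its height from the curve at the midpoint of its base.
    """
    width = (b - a) / bins
    return sum(f(a + (k + 0.5) * width) * width for k in range(bins))

# Finer and finer rectangles: the approximation stabilizes as bins increases.
for bins in (2, 10, 100, 1000):
    print(bins, round(rectangle_area(normal_density, -1, 1, bins), 4))
```

The printed areas approach the proportion of a bell-shaped population that lies within one standard deviation of the mean.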

Summarizing Data Numerically
Center of the data
  ◦ Mean
  ◦ Median
  ◦ Mode
Dispersion of the data (sometimes referred to as spread)
  ◦ Variance, Standard deviation
  ◦ Interquartile range
  ◦ Range

Measures of Central Tendency
Mean
  ◦ Arithmetic average
Median
  ◦ Midpoint of the observations when they are arranged in order, smallest to largest
Mode
  ◦ Most frequently occurring value

Sample Mean
Sample size n
Observations x_1, x_2, …, x_n
Sample mean “x-bar” (Σ denotes a sum)

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Population Mean
Population size N
Observations x_1, x_2, …, x_N
Population mean “mu” (Σ denotes a sum)

\[ \mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i \]

Note: This is for a finite population of size N

Mean
Requires numerical values
  ◦ Only appropriate for quantitative data
  ◦ Does not make sense to compute the mean for nominal variables
  ◦ Can be calculated for ordinal variables, but this does not always make sense
    ◦ Should be careful when using the mean on ordinal variables
    ◦ Example: “Weather” (on an ordinal scale)
      Sun=1, Partly Cloudy=2, Cloudy=3, Rain=4, Thunderstorm=5
      Mean (average) weather = 2.8
    ◦ Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale

Mean
Center of gravity for the data set
Sum of the differences from values above the mean is equal to the sum of the differences from values below the mean
  ◦ 3 + 2 + 2 = 3 + 4
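
The balance property can be checked on a small data set. The slide does not give the underlying observations, so the values below are hypothetical, chosen only so that the distances from the mean reproduce 3 + 2 + 2 = 3 + 4.

```python
# Hypothetical data (not from the slide), picked to match 3 + 2 + 2 = 3 + 4.
data = [1, 2, 7, 7, 8]
mean = sum(data) / len(data)

# Total distance from the mean for values above it and for values below it.
above = sum(x - mean for x in data if x > mean)
below = sum(mean - x for x in data if x < mean)

print(mean, above, below)
```

Both totals come out equal, which is what makes the mean the balance point of the data.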

Mean (Average)
Mean
  ◦ Sum of observations divided by the number of observations
Example
  ◦ {7, 12, 11, 18}
  ◦ Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12
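
As a quick check of the definition, here is a minimal sketch in Python. The function name `sample_mean` is an assumption made for illustration.

```python
def sample_mean(observations):
    """Sum of the observations divided by the number of observations."""
    return sum(observations) / len(observations)

print(sample_mean([7, 12, 11, 18]))  # 48 / 4 = 12.0
```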

Mean
Highly influenced by outliers
  ◦ Data points that are far from the rest of the data
  ◦ Example: Monthly income for five people
      1,000   2,000   3,000   4,000   100,000
      Average monthly income = 110,000/5 = 22,000
What is the problem with using the average to describe this data set?

Median
Measurement that falls in the middle of the ordered sample
When the sample size n is odd, there is a middle value
  ◦ It has the ordered index (n+1)/2
    ◦ Ordered index is where that value falls when the sample is listed from smallest to largest
    ◦ An index of 2 means the second smallest value
  ◦ Example: 1.7, 4.6, 5.7, 6.1, 8.3
      n = 5, (n+1)/2 = 6/2 = 3, index = 3
      Median = 3rd smallest observation = 5.7

Median
When the sample size n is even, average the two middle values
  ◦ Example: 3, 5, 6, 9
      n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
      Median = midpoint between 2nd and 3rd smallest observations = (5+6)/2 = 5.5
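
The ordered-index rule for odd and even n can be sketched as one function. This is an illustrative version that assumes a non-empty list of numbers. The name `median` is chosen here for convenience.

```python
def median(observations):
    """Median via the ordered index (n + 1) / 2."""
    ordered = sorted(observations)
    n = len(ordered)
    index = (n + 1) / 2              # 1-based ordered index; ends in .5 when n is even
    lower = ordered[int(index) - 1]  # value at the floor of the index (0-based lookup)
    if n % 2 == 1:
        return lower                 # odd n: a single middle value
    upper = ordered[int(index)]      # even n: the next value up
    return (lower + upper) / 2       # midpoint of the two middle values

print(median([1.7, 4.6, 5.7, 6.1, 8.3]))  # 5.7
print(median([3, 5, 6, 9]))               # 5.5
```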

Mean and Median
For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
The median usually better describes a “typical value” when the sample distribution is highly skewed
Example
  ◦ Monthly income for five people
      1,000   2,000   3,000   4,000   100,000
  ◦ Median monthly income: 3,000 (the 3rd smallest of the five values)
Why is the median better to use with this data than the mean?
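
One way to explore the question is to compute both measures with and without the largest income. The sketch below uses Python's standard `statistics` module. The variable names are assumptions.

```python
import statistics

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]
without_outlier = incomes[:-1]  # drop the 100,000 income

mean_all = statistics.mean(incomes)
median_all = statistics.median(incomes)
mean_trimmed = statistics.mean(without_outlier)
median_trimmed = statistics.median(without_outlier)

# The mean moves from 2,500 to 22,000 once the outlier is included;
# the median only moves from 2,500 to 3,000.
print(mean_all, median_all)
print(mean_trimmed, median_trimmed)
```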

Measures of Central Tendency
Notation: subscripted variables
  ◦ n = # of units in the sample
  ◦ N = # of units in the population
  ◦ x = variable to be measured
  ◦ x_i = measurement of the ith unit
Mean - Arithmetic average
  ◦ Mean of a sample - x̄
  ◦ Mean of a population - μ
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value

Median for Grouped or Ordinal Data
Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618        100

Calculate the Median
n = 177,618, (n+1)/2 = 88,809.5
Median = midpoint between the 88,809th smallest and 88,810th smallest observations
  ◦ Both are in the category “High school only”
Mean wouldn’t make sense here since the variable is ordinal
Median
  ◦ Can be used for interval data and for ordinal data
  ◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
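
The search for the median category can be sketched with cumulative frequencies. The frequencies come from the table above. The function name `median_categories` and the choice to return both middle categories are assumptions made for illustration.

```python
# Frequencies from the table, listed in the order of the ordinal scale.
frequencies = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]

def median_categories(freqs):
    """Return the categories holding the two middle observations.

    For odd n both entries refer to the same (single) middle observation.
    """
    n = sum(count for _, count in freqs)
    lower_index = (n + 1) // 2   # floor of the ordered index (n + 1) / 2
    upper_index = (n + 2) // 2   # ceiling of the ordered index
    found = []
    for index in (lower_index, upper_index):
        cumulative = 0
        for category, count in freqs:
            cumulative += count
            if index <= cumulative:   # the index-th smallest value is in this category
                found.append(category)
                break
    return found

print(median_categories(frequencies))
```

When both entries name the same category, that category is the median.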

Mean vs. Median
Mean
  ◦ Interval data with an approximately symmetric distribution
Median
  ◦ Interval data
  ◦ Ordinal data
Mean is sensitive to outliers, median is not

Mean vs. Median
Symmetric distribution
  ◦ Mean = Median
Skewed distribution
  ◦ Mean lies more toward the direction in which the distribution is skewed

Median
While the median is better than the mean for skewed distributions, there is one large disadvantage to using the median
  ◦ Insensitive to changes within the lower or upper half of the data
  ◦ Example (both samples have median 3)
      1, 2, 3, 4, 5
      1, 2, 3, 100, 100
  ◦ Sometimes, the mean is more informative even when the distribution is skewed
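
The two samples in the example can be compared directly. This short sketch uses Python's standard `statistics` module. The variable names are assumptions.

```python
import statistics

first = [1, 2, 3, 4, 5]
second = [1, 2, 3, 100, 100]

# The medians agree even though the upper halves differ greatly.
median_first = statistics.median(first)
median_second = statistics.median(second)

# The means pick up the change in the upper half.
mean_first = statistics.mean(first)
mean_second = statistics.mean(second)

print(median_first, median_second)
print(mean_first, mean_second)
```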

Example: Keeneland Sales

Deviations
The deviation of the ith observation x_i from the sample mean x̄ is the difference between them, \( (x_i - \bar{x}) \)
  ◦ Sum of all deviations is zero
  ◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation
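
A small numerical check shows why the raw deviations cannot measure variation on their own. The sketch reuses the data {7, 12, 11, 18} from the earlier mean example. The variable names are assumptions.

```python
data = [7, 12, 11, 18]         # data from the earlier mean example
mean = sum(data) / len(data)   # 12.0

deviations = [x - mean for x in data]

sum_dev = sum(deviations)                   # positive and negative deviations cancel
sum_abs = sum(abs(d) for d in deviations)   # sum of the absolute deviations
sum_sq = sum(d ** 2 for d in deviations)    # sum of the squared deviations

print(deviations, sum_dev, sum_abs, sum_sq)
```

With data that do not divide as evenly, the sum of deviations may differ from zero by a tiny floating-point rounding error.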

Sample Variance
Variance of n observations is the sum of the squared deviations, divided by n-1

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

Example

Observation   Mean   Deviation   Squared Deviation
          1      5          -4                  16
          3      5          -2                   4
          4      5          -1                   1
          7      5           2                   4
         10      5           5                  25

Sum of the Squared Deviations = 50
n-1 = 4
Sum of the Squared Deviations / (n-1) = 50/4 = 12.5
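
The columns of the table can be generated step by step. This is a minimal sketch for the observations 1, 3, 4, 7, 10. The variable names and the print layout are assumptions.

```python
observations = [1, 3, 4, 7, 10]
n = len(observations)
mean = sum(observations) / n

print("Obs  Mean  Deviation  Squared")
sum_sq = 0.0
for x in observations:
    deviation = x - mean           # observation minus the sample mean
    sum_sq += deviation ** 2       # accumulate the squared deviations
    print(f"{x:>3}  {mean:>4}  {deviation:>9}  {deviation ** 2:>7}")

variance = sum_sq / (n - 1)        # divide by n - 1 for a sample
print("Sum of the squared deviations:", sum_sq)
print("n - 1:", n - 1)
print("Sample variance:", variance)
```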

Interpreting Variance
Approximately the average of the squared deviations
  ◦ “average squared distance from the mean”
Unit
  ◦ Square of the unit for the original data
Difficult to interpret
  ◦ Solution: take the square root of the variance, so that the unit is the same as for the original data
    ◦ Standard Deviation

Properties of Standard Deviation
s ≥ 0
  ◦ s = 0 only when all observations are the same
If data is collected for the whole population instead of a sample, then n-1 is replaced by N
s is sensitive to outliers
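
Variance and Standard Deviation
Sample
  ◦ Variance
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
  ◦ Standard Deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
Population
  ◦ Variance
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
  ◦ Standard Deviation
\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]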
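
The sample and population versions differ only in the divisor, which the sketch below makes explicit. It reuses the observations 1, 3, 4, 7, 10 from the earlier example and, purely for comparison, also treats them as a whole population. The helper `sum_squared_deviations` is a hypothetical name. The cross-checks use Python's standard `statistics` module.

```python
import math
import statistics

data = [1, 3, 4, 7, 10]  # observations from the earlier variance example

def sum_squared_deviations(values, center):
    """Sum of squared differences between each value and the given center."""
    return sum((x - center) ** 2 for x in values)

mean = sum(data) / len(data)
total = sum_squared_deviations(data, mean)

s2 = total / (len(data) - 1)   # sample variance: divide by n - 1
s = math.sqrt(s2)              # sample standard deviation
sigma2 = total / len(data)     # population variance: divide by N
sigma = math.sqrt(sigma2)      # population standard deviation

# Cross-check against the standard library's sample and population functions.
print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
print(sigma2, statistics.pvariance(data))
print(sigma, statistics.pstdev(data))
```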

Population Parameters and Sample Statistics
Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
  ◦ They are unknown constants that we would like to estimate
Sample mean and sample standard deviation are denoted by x̄ and s
  ◦ They are random variables, because their values vary according to the random sample that has been selected

Empirical Rule
If the data is approximately symmetric and bell-shaped then
  ◦ About 68% of the observations are within one standard deviation from the mean
  ◦ About 95% of the observations are within two standard deviations from the mean
  ◦ About 99.7% of the observations are within three standard deviations from the mean

Example
Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
  ◦ About 68% of the scores are between 850 and 1150
  ◦ About 95% of the scores are between 700 and 1300
  ◦ If you have a score above 1300, you are in the top 2.5%
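
The intervals in this example follow mechanically from the Empirical Rule. The sketch below assumes the bell-shaped approximation holds. The function name `empirical_rule` is hypothetical.

```python
def empirical_rule(mean, sd):
    """Intervals holding roughly 68%, 95% and 99.7% of bell-shaped data."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = empirical_rule(1000, 150)
print(intervals[68])   # (850, 1150)
print(intervals[95])   # (700, 1300)

# 1300 is two standard deviations above the mean. About 5% of scores fall
# outside the 95% interval, and by symmetry half of those are above it.
top_percent = (100 - 95) / 2
print(top_percent)     # 2.5
```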