STA 291 Summer 2010 Lecture 4 Dustin Lueker


Population Distribution
• The population distribution for a continuous variable is usually represented by a smooth curve
  ◦ Like a histogram that gets finer and finer
  ◦ Similar to the idea of using smaller and smaller rectangles to calculate the area under a curve when learning how to integrate
• Symmetric distributions
  ◦ Bell-shaped
  ◦ U-shaped
  ◦ Uniform
• Not symmetric distributions
  ◦ Left-skewed
  ◦ Right-skewed
  ◦ Skewed


Summarizing Data Numerically
• Center of the data
  ◦ Mean
  ◦ Median
  ◦ Mode
• Dispersion of the data (sometimes referred to as spread)
  ◦ Variance, standard deviation
  ◦ Interquartile range
  ◦ Range


Measures of Central Tendency
• Mean
  ◦ Arithmetic average
• Median
  ◦ Midpoint of the observations when they are arranged in order, smallest to largest
• Mode
  ◦ Most frequently occurring value


Sample Mean
• Sample size n
• Observations x1, x2, …, xn
• Sample mean “x-bar”

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Here $\sum$ denotes the sum.


Population Mean
• Population size N
• Observations x1, x2, …, xN
• Population mean “mu”

$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Here $\sum$ denotes the sum.

Note: This is for a finite population of size N


Mean
• Requires numerical values
  ◦ Only appropriate for quantitative data
  ◦ Does not make sense to compute the mean for nominal variables
  ◦ Can be calculated for ordinal variables, but this does not always make sense
    ▪ Be careful when using the mean on ordinal variables
    ▪ Example: “Weather” (on an ordinal scale), coded Sun = 1, Partly Cloudy = 2, Cloudy = 3, Rain = 4, Thunderstorm = 5, giving mean (average) weather = 2.8
    ▪ Another example: “GPA = 3.8” is also a mean of observations measured on an ordinal scale


Mean
• Center of gravity for the data set
• The sum of the distances from the mean of the values above it equals the sum of the distances from the mean of the values below it
  ◦ Example: 3 + 2 + 2 = 3 + 4


Mean (Average)
• Mean
  ◦ Sum of observations divided by the number of observations
• Example
  ◦ {7, 12, 11, 18}
  ◦ Mean = (7 + 12 + 11 + 18)/4 = 48/4 = 12
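The definition of the mean can be sketched in a few lines of Python. The function name `sample_mean` is an illustrative choice, not something from the lecture; the data are the four observations from the example.

```python
def sample_mean(observations):
    """Return the arithmetic mean: the sum of the observations divided by their count."""
    n = len(observations)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(observations) / n


# Lecture example: (7 + 12 + 11 + 18) / 4 = 48 / 4
print(sample_mean([7, 12, 11, 18]))  # 12.0
```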


Mean
• Highly influenced by outliers
  ◦ Outliers are data points that are far from the rest of the data
  ◦ Example: monthly income for five people
    ▪ 1,000  2,000  3,000  4,000  100,000
    ▪ Average monthly income = 110,000/5 = 22,000
• What is the problem with using the average to describe this data set?


Median
• Measurement that falls in the middle of the ordered sample
• When the sample size n is odd, there is a middle value
  ◦ It has the ordered index (n+1)/2
    ▪ The ordered index is where that value falls when the sample is listed from smallest to largest
    ▪ An index of 2 means the second smallest value
  ◦ Example: 1.7, 4.6, 5.7, 6.1, 8.3
    ▪ n = 5, (n+1)/2 = 6/2 = 3, index = 3
    ▪ Median = 3rd smallest observation = 5.7


Median
• When the sample size n is even, average the two middle values
  ◦ Example: 3, 5, 6, 9
    ▪ n = 4, (n+1)/2 = 5/2 = 2.5, index = 2.5
    ▪ Median = midpoint between the 2nd and 3rd smallest observations = (5+6)/2 = 5.5
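The ordered-index rule for odd and even sample sizes can be sketched as follows. The function name `median_by_ordered_index` is hypothetical; the two calls reuse the lecture's odd and even examples.

```python
def median_by_ordered_index(observations):
    """Median via the ordered index (n + 1) / 2, counting from 1."""
    ordered = sorted(observations)  # smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number, so there is a single middle value.
        index = (n + 1) // 2
        return ordered[index - 1]  # convert the 1-based index to Python's 0-based index
    # Even n: (n + 1) / 2 ends in .5, so average the two neighbouring values.
    lower = ordered[n // 2 - 1]  # the (n/2)th smallest value
    upper = ordered[n // 2]      # the (n/2 + 1)th smallest value
    return (lower + upper) / 2


print(median_by_ordered_index([1.7, 4.6, 5.7, 6.1, 8.3]))  # 5.7
print(median_by_ordered_index([3, 5, 6, 9]))               # 5.5
```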


Mean and Median
• For skewed distributions, the median is often a more appropriate measure of central tendency than the mean
• The median usually better describes a “typical value” when the sample distribution is highly skewed
• Example
  ◦ Monthly income for five people: 1,000  2,000  3,000  4,000  100,000
  ◦ Median monthly income: 3,000 (the 3rd smallest of the five values)
• Why is the median better to use with this data than the mean?
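One way to explore the income question is to compute both summaries with Python's standard `statistics` module. This is a small sketch on the five incomes from the example; the variable names are illustrative.

```python
import statistics

incomes = [1_000, 2_000, 3_000, 4_000, 100_000]

mean_income = statistics.mean(incomes)      # 110,000 / 5 = 22,000
median_income = statistics.median(incomes)  # 3rd smallest of 5 values = 3,000
print(f"mean = {mean_income}, median = {median_income}")

# Count how many incomes fall below each summary value.
below_mean = sum(1 for x in incomes if x < mean_income)
below_median = sum(1 for x in incomes if x < median_income)
print(f"{below_mean} of {len(incomes)} incomes are below the mean")      # 4 of 5
print(f"{below_median} of {len(incomes)} incomes are below the median")  # 2 of 5
```

Four of the five people earn less than the mean, which is pulled upward by the single income of 100,000, while the median sits in the middle of the group.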


Measures of Central Tendency
• Notation: subscripted variables
  ◦ n = number of units in the sample
  ◦ N = number of units in the population
  ◦ x = variable to be measured
  ◦ $x_i$ = measurement of the ith unit
• Mean: arithmetic average
  ◦ Mean of a sample: $\bar{x}$
  ◦ Mean of a population: μ
• Median: midpoint of the observations when they are arranged in increasing order
• Mode: most frequent value


Median for Grouped or Ordinal Data
Example: Highest Degree Completed

Highest Degree                                          Frequency   Percentage
Not a high school graduate                                 38,012         21.4
High school only                                           65,291         36.8
Some college, no degree                                    33,191         18.7
Associate, Bachelor, Master, Doctorate, Professional       41,124         23.2
Total                                                     177,618        100


Calculate the Median
• n = 177,618
• (n+1)/2 = 88,809.5
• Median = midpoint between the 88,809th smallest and 88,810th smallest observations
  ◦ Both are in the category “High school only”
• Mean wouldn’t make sense here since the variable is ordinal
• Median
  ◦ Can be used for interval data and for ordinal data
  ◦ Cannot be used for nominal data because the observations cannot be ordered on a scale
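The lookup of the median category can be sketched with cumulative frequencies. The helper name `median_categories` and the list layout are assumptions made for illustration; the labels and counts are those from the Highest Degree Completed table.

```python
def median_categories(categories):
    """Return the categories holding the middle observation(s).

    `categories` is a list of (label, frequency) pairs in their natural order.
    """
    n = sum(freq for _, freq in categories)
    if n == 0:
        raise ValueError("no observations")
    # 1-based positions of the middle observations; they coincide when n is odd.
    lower_pos = (n + 1) // 2
    upper_pos = n // 2 + 1

    def category_at(position):
        cumulative = 0
        for label, freq in categories:
            cumulative += freq
            if position <= cumulative:
                return label
        raise ValueError("position exceeds the number of observations")

    return category_at(lower_pos), category_at(upper_pos)


degrees = [
    ("Not a high school graduate", 38_012),
    ("High school only", 65_291),
    ("Some college, no degree", 33_191),
    ("Associate, Bachelor, Master, Doctorate, Professional", 41_124),
]
# Positions 88,809 and 88,810 both fall after the first 38,012 observations
# and before cumulative count 103,303.
print(median_categories(degrees))  # ('High school only', 'High school only')
```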


Mean vs. Median
• Mean
  ◦ Interval data with an approximately symmetric distribution
• Median
  ◦ Interval data
  ◦ Ordinal data
• Mean is sensitive to outliers, median is not


Mean vs. Median
• Symmetric distribution
  ◦ Mean = Median
• Skewed distribution
  ◦ Mean lies more toward the direction in which the distribution is skewed


Median
• While the median is better than the mean for skewed distributions, there is one large disadvantage to using the median
  ◦ Insensitive to changes within the lower or upper half of the data
  ◦ Example: the data sets 1, 2, 3, 4, 5 and 1, 2, 3, 100, 100 have the same median, 3
  ◦ Sometimes, the mean is more informative even when the distribution is skewed
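A quick check of the two data sets, sketched with Python's standard `statistics` module, shows the median staying put while the mean moves. The variable names are illustrative.

```python
import statistics

original = [1, 2, 3, 4, 5]
changed = [1, 2, 3, 100, 100]  # only the upper half of the data differs

for data in (original, changed):
    print(data, "median =", statistics.median(data), "mean =", statistics.mean(data))

# The median is 3 for both data sets; the mean rises from 3 to 206 / 5 = 41.2.
```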


Example: Keeneland Sales


Deviations
• The deviation of the ith observation $x_i$ from the sample mean $\bar{x}$ is the difference between them, $(x_i - \bar{x})$
  ◦ Sum of all deviations is zero
  ◦ Therefore, we use either the sum of the absolute deviations or the sum of the squared deviations as a measure of variation
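The claim that the deviations always sum to zero follows directly from the definition of the sample mean, since $\sum_{i=1}^{n} x_i = n\bar{x}$. A short derivation, written out as a sketch:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Positive and negative deviations cancel exactly, which is why the raw sum of deviations carries no information about spread.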


Sample Variance
• Variance of n observations is the sum of the squared deviations, divided by n-1

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$


Example

Observation   Mean   Deviation   Squared Deviation
          1      5          -4                  16
          3      5          -2                   4
          4      5          -1                   1
          7      5           2                   4
         10      5           5                  25

Sum of the Squared Deviations = 50
n-1 = 4
Sum of the Squared Deviations / (n-1) = 50/4 = 12.5
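The columns of this worksheet (mean, deviation, squared deviation) can be computed step by step. The sketch below uses the five observations from the example; `sample_variance` is an illustrative name.

```python
import math


def sample_variance(observations):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(observations) / n
    deviations = [x - mean for x in observations]
    squared_deviations = [d ** 2 for d in deviations]
    return sum(squared_deviations) / (n - 1)


data = [1, 3, 4, 7, 10]
variance = sample_variance(data)  # mean 5, squared deviations 16 + 4 + 1 + 4 + 25 = 50
print(variance)             # 12.5
print(math.sqrt(variance))  # standard deviation, roughly 3.54
```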


Interpreting Variance
• The variance is approximately the average of the squared deviations
  ◦ “average squared distance from the mean”
• Unit
  ◦ Square of the unit for the original data
• Difficult to interpret
  ◦ Solution: take the square root of the variance, so the unit is the same as for the original data
    ▪ Standard Deviation


Properties of Standard Deviation
• s ≥ 0
  ◦ s = 0 only when all observations are the same
• If data is collected for the whole population instead of a sample, then n-1 is replaced by N
• s is sensitive to outliers


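Variance and Standard Deviation
• Sample
  ◦ Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
  ◦ Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• Population
  ◦ Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
  ◦ Standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$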
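Python's standard `statistics` module offers both versions of these formulas: `variance` and `stdev` divide by n-1, while `pvariance` and `pstdev` divide by N. The sketch below applies them to an illustrative data set (1, 3, 4, 7, 10), treated first as a sample and then, purely for comparison, as a whole population.

```python
import statistics

data = [1.0, 3.0, 4.0, 7.0, 10.0]  # sum of squared deviations from the mean 5 is 50

sample_var = statistics.variance(data)       # 50 / (n - 1) = 50 / 4 = 12.5
population_var = statistics.pvariance(data)  # 50 / N = 50 / 5 = 10.0

sample_sd = statistics.stdev(data)       # square root of 12.5
population_sd = statistics.pstdev(data)  # square root of 10.0

print(sample_var, population_var)
print(sample_sd, population_sd)
```

For the same data, dividing by n-1 always gives a slightly larger value than dividing by N.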


Population Parameters and Sample Statistics
• Population mean and population standard deviation are denoted by the Greek letters μ (mu) and σ (sigma)
  ◦ They are unknown constants that we would like to estimate
• Sample mean and sample standard deviation are denoted by $\bar{x}$ and s
  ◦ They are random variables, because their values vary according to the random sample that has been selected


Empirical Rule
• If the data is approximately symmetric and bell-shaped, then
  ◦ About 68% of the observations are within one standard deviation from the mean
  ◦ About 95% of the observations are within two standard deviations from the mean
  ◦ About 99.7% of the observations are within three standard deviations from the mean


Example
• Scores on a standardized test are scaled so they have a bell-shaped distribution with a mean of 1000 and standard deviation of 150
  ◦ About 68% of the scores are between 850 and 1150
  ◦ About 95% of the scores are between 700 and 1300
  ◦ If you have a score above 1300, you are in the top 2.5%
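The intervals in this example come from adding and subtracting one, two, and three standard deviations from the mean. A minimal sketch, with `empirical_rule_intervals` as an illustrative helper name:

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate coverage percentage to its (low, high) interval."""
    return {
        percent: (mean - k * sd, mean + k * sd)
        for k, percent in ((1, 68), (2, 95), (3, 99.7))
    }


intervals = empirical_rule_intervals(mean=1000, sd=150)
for percent, (low, high) in intervals.items():
    print(f"about {percent}% of scores lie between {low} and {high}")

# 1300 is two standard deviations above the mean. About 5% of scores lie outside
# the two-standard-deviation interval, and by symmetry half of those lie above it.
top_percent = (100 - 95) / 2
print(f"a score above 1300 is in roughly the top {top_percent}%")  # 2.5
```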
