Copyright © 2014, 2011 Pearson Education, Inc. 2
4.1 Summaries of Numerical Variables
Can 500 different songs fit on the iPod Shuffle?
To answer this question we must understand the typical length of a song and the variation of song sizes around the typical length
We can do this using summary statistics
A Subset of the Data
The Median
Value in the middle of a sorted list of numerical values (a typical value)
Half of the values fall below the median; half fall above
It is the 50th Percentile
Common Percentiles
Lower Quartile = 25th Percentile
Upper Quartile = 75th Percentile
One quarter of the values fall below the lower quartile and one quarter fall above the upper quartile
The Interquartile Range (IQR)
IQR = 75th Percentile – 25th Percentile
A measure of variation based on quartiles
Used to accompany the median
The Range
Range = Maximum – Minimum
Maximum Value = 100th Percentile
Minimum Value = 0th Percentile
Another measure of variation; not preferred because it is based on extreme values
The Five Number Summary
Minimum
Lower Quartile
Median
Upper Quartile
Maximum
The Five Number Summary for Song Sizes
Minimum = 0.148 MB
Lower Quartile = 2.85 MB
Median = 3.5015 MB
Upper Quartile = 4.32 MB
Maximum = 21.622 MB
Summary Statistics for Song Sizes
Median = 3.5015 MB
IQR = 4.32 MB – 2.85 MB = 1.47 MB
Range = 21.622 MB – 0.148 MB = 21.474 MB
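The quartile-based summaries above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slide's numbers: the list `sizes_mb` is hypothetical data, and the "inclusive" quartile convention is an assumption (software packages differ in how they interpolate percentiles).

```python
from statistics import median, quantiles

# Hypothetical song sizes in MB (illustrative only, not the slide's data set).
sizes_mb = [2.1, 2.9, 3.2, 3.5, 3.8, 4.3, 4.9, 6.0, 21.6]

# Quartile conventions differ between packages; "inclusive" is one common choice.
lower_q, mid_q, upper_q = quantiles(sizes_mb, n=4, method="inclusive")

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
five_number = (min(sizes_mb), lower_q, median(sizes_mb), upper_q, max(sizes_mb))

# IQR uses the quartiles; the range uses the two extreme values.
iqr = upper_q - lower_q
value_range = max(sizes_mb) - min(sizes_mb)

print("five-number summary:", five_number)
print("IQR:", round(iqr, 3), "range:", round(value_range, 3))

# Re-checking the arithmetic reported on the slide.
slide_iqr = 4.32 - 2.85
slide_range = 21.622 - 0.148
print("slide IQR:", round(slide_iqr, 3), "slide range:", round(slide_range, 3))
```

Note how the single large value in the hypothetical list inflates the range but leaves the IQR almost untouched, which is why the range is the less preferred measure.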
The Mean (Average)
Arithmetic average; divide the sum of the values by the number of values (another typical value)
The symbol $y$ represents the variable of interest
The symbol $\bar{y}$, read “y bar”, represents the mean
$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$
The Variance ($s^2$)
Is a measure of variation based on the mean
How far a value is from the mean is known as its deviation; the variance is the average of the squared deviations
$$s^2 = \frac{(y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$$
The Standard Deviation (SD)
Is the square root of the variance
Is a measure of variability in the original units of the data (the variance results in squared units)
$$s = \sqrt{s^2}$$
Summary Statistics for Song Sizes
Mean = 3.7794 MB
Variance = 2.584 MB²
SD = 1.607 MB
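The mean, variance, and standard deviation can be computed as sketched below. The data in `sizes_mb` are hypothetical, and the sketch assumes the sample variance divides by n − 1, which is what Python's `statistics.variance` does. The last lines only check that the reported SD is consistent with the reported variance.

```python
import math
from statistics import mean, stdev, variance

# Hypothetical song sizes in MB (illustrative only, not the slide's data set).
sizes_mb = [2.0, 3.0, 4.0, 5.0, 6.0]

# Mean: sum of the values divided by the number of values.
y_bar = mean(sizes_mb)

# Deviations: how far each value lies from the mean.
deviations = [y - y_bar for y in sizes_mb]

# Sample variance: sum of squared deviations divided by n - 1 (assumed convention).
s2_manual = sum(d ** 2 for d in deviations) / (len(sizes_mb) - 1)

# The standard library gives the same quantities directly.
s2 = variance(sizes_mb)
s = stdev(sizes_mb)  # square root of the variance, in the original units (MB)

print("mean:", y_bar, "variance:", s2, "SD:", round(s, 4))

# Consistency check on the slide's figures: SD should be sqrt(variance).
slide_sd = math.sqrt(2.584)
print("sqrt(2.584) =", round(slide_sd, 3))
```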
4M Example 4.1: MAKING M&M’s
Motivation
How many M&M’s are needed to fill a bag labeled to weigh 1.6 ounces?
Method
Data are weights of 72 plain chocolate M&M’s taken from several packages. To get a measure of the amount of variation relative to the typical size, we use the ratio of the standard deviation to the mean (known as the coefficient of variation).
$$c_v = \frac{s}{\bar{y}}$$
Mechanics
Mean Weight = 0.86 gm
SD = 0.04 gm
$c_v$ = 0.04 gm / 0.86 gm ≈ 0.0465
Message
Since the SD is quite small compared to the mean (with a $c_v$ of about 5%), the results suggest that 53 pieces are usually enough to fill a bag.
A bag labeled 1.6 ounces weighs about 45.36 grams. Since there is little variability around the typical weight of an M&M, we can calculate the number of pieces to fill a 1.6 ounce bag as 45.36/0.86 ≈ 52.7, which rounds up to 53.
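The arithmetic in this example can be reproduced as follows. The variable names are illustrative; the numbers are the ones given on the slides.

```python
import math

# Figures from the slides (weights in grams).
mean_weight_g = 0.86   # mean weight of one M&M
sd_weight_g = 0.04     # standard deviation of the weights
bag_weight_g = 45.36   # weight of a bag labeled 1.6 ounces

# Coefficient of variation: SD relative to the mean (unitless).
cv = sd_weight_g / mean_weight_g

# Number of pieces: bag weight divided by the typical piece weight,
# rounded up because a fraction of a piece cannot be added.
pieces_exact = bag_weight_g / mean_weight_g
pieces_needed = math.ceil(pieces_exact)

print("cv:", round(cv, 4))
print("pieces:", round(pieces_exact, 2), "->", pieces_needed)
```

Rounding up rather than to the nearest integer is a design choice here: it keeps the bag at or above the labeled weight when each piece weighs about the mean.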
4.2 Histograms
Histograms
Plot the distribution of a numerical variable by showing counts of values occurring within adjacent intervals
Similar to bar charts but designed for continuous quantitative data (bar charts are only appropriate for discrete categories)
Histogram of Song Sizes
Indicates a few very long songs (outliers)
The graph devotes more than half of its area to show less than 1% of the songs (white space rule: graphs with mostly white space can be improved by changing the interval of the plot to focus on the data rather than the white space)
Histogram of Song Sizes
Using intervals of different lengths yields different histograms
Narrow intervals expose details smoothed over by wider intervals
Most software packages determine the right length to use automatically
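To see how the interval width changes the picture, the sketch below counts values in adjacent intervals for two widths. The helper `histogram_counts` and the data in `sizes_mb` are hypothetical and exist only for illustration; plotting libraries do the same counting internally.

```python
import math
from collections import Counter

def histogram_counts(values, width, start=0.0):
    """Count values in adjacent intervals [start + k*width, start + (k+1)*width).

    Returns a dict mapping each non-empty interval's left edge to its count.
    """
    counts = Counter(math.floor((v - start) / width) for v in values)
    return {round(start + k * width, 10): counts[k] for k in sorted(counts)}

# Hypothetical song sizes in MB (illustrative only, not the slide's data set).
sizes_mb = [2.1, 2.9, 3.2, 3.5, 3.8, 4.3, 4.9, 6.0, 21.6]

narrow = histogram_counts(sizes_mb, width=1.0)  # 1 MB intervals show detail
wide = histogram_counts(sizes_mb, width=5.0)    # 5 MB intervals smooth it away

print("width 1 MB:", narrow)
print("width 5 MB:", wide)
```

With 5 MB intervals most of the hypothetical songs collapse into a single bar, while 1 MB intervals reveal where within that bar the values concentrate.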
Histograms of Song Sizes – Different Intervals
4.3 Boxplots
Combining Boxplots with Histograms
Boxplots locate the median and quartiles and highlight outliers
The median splits the area of the histogram in half (unlike the mean, it is resistant or robust to the effects of outliers)
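A small experiment illustrates what "resistant to outliers" means in practice. The values below are hypothetical; the sketch adds one very large song to a short list and compares how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical song sizes in MB (illustrative only).
typical = [2.9, 3.2, 3.5, 3.8, 4.3]
with_outlier = typical + [21.6]  # add a single unusually large song

# How much each summary moves when the outlier is added.
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print("mean shift:", round(mean_shift, 2))
print("median shift:", round(median_shift, 2))
```

The mean is pulled toward the outlier because every value enters the sum, whereas the median depends only on the values in the middle of the sorted list.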
4.4 Shape of a Distribution
Modes
Position of an isolated peak in a histogram
A histogram with one peak is unimodal; two is bimodal; three or more is multimodal
A flat histogram with all bars about the same height is uniform
Symmetry and Skewness
A distribution is symmetric if the two sides of its histogram are mirror images
A distribution is skewed if one tail of the histogram stretches out farther than the other
Distribution of Song Sizes
The mode lies between 3 and 4 MB
The distribution is right skewed (the right tail stretches out farther than the left tail)
4M Example 4.2: EXECUTIVE COMPENSATION
Motivation
What can we say about the salaries of CEOs in 2010?
Method
Data consist of salaries for 1,766 CEOs, reported in thousands of dollars (obtained from Compustat).
Message
The salaries of CEOs in 2010 range from less than $100,000 into the millions. The distribution is right skewed. The median is $725,000, with half of the salaries falling between $520,000 and $970,000. A few exceed $3,000,000.
Bell-Shaped Distributions and Empirical Rule
A bell-shaped distribution is symmetric and unimodal
The empirical rule uses the standard deviation to describe how data with a bell-shaped distribution cluster around the mean
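The slide does not list the percentages, but the empirical rule is usually stated as roughly 68%, 95%, and 99.7% of values lying within one, two, and three standard deviations of the mean. The sketch below reproduces those figures under the assumption that "bell-shaped" is modeled by an ideal normal distribution; real data will match only approximately. The helper `share_within` is a hypothetical name.

```python
from statistics import NormalDist

# Assumption: an ideal normal model with mean 0 and SD 1 stands in for
# a bell-shaped distribution.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution lying within k SDs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```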
Standardizing
Converting data to z-scores
Z-scores measure the distance from the mean in standard deviations
$$z = \frac{y - \bar{y}}{s}$$
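As a worked illustration, the sketch below standardizes two song sizes using the mean (3.7794 MB) and SD (1.607 MB) reported earlier in the deck. The function name `z_score` is hypothetical.

```python
def z_score(y, mean, sd):
    """Distance of y from the mean, measured in standard deviations."""
    return (y - mean) / sd

# Summary statistics for song sizes, as reported on the slides (MB).
mean_mb = 3.7794
sd_mb = 1.607

# Standardize the largest song and the median song.
z_max = z_score(21.622, mean_mb, sd_mb)
z_median = z_score(3.5015, mean_mb, sd_mb)

print("z for the largest song:", round(z_max, 2))
print("z for the median song:", round(z_median, 2))
```

The largest song sits about 11 standard deviations above the mean, which marks it as an extreme outlier, while the median song has a slightly negative z-score because the mean lies above the median in this right-skewed distribution.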
4.5 Epilog
Can 500 different songs fit on the iPod Shuffle?
Because of variation, not every collection of 500 songs will fit. The longest 500 songs won’t fit. However, based on the typical song size, the amount of variation in song sizes and the shape of its distribution, we can say that most collections of 500 songs will fit!
Best Practices
Be sure that data are numerical when using histograms and summaries such as the mean and standard deviation.
Summarize the distribution of a numerical variable with a graph.
Choose interval widths appropriate to the data when preparing a histogram.
Best Practices (Continued)
Scale your plots to show data, not empty space.
Anticipate what you will see in a histogram.
Label clearly.
Check for gaps.
Pitfalls
Do not use the methods of this chapter for categorical variables.
Do not assume that all numerical data have a bell-shaped distribution.
Do not ignore the presence of outliers.