Why use boxplots?

• ease of construction

• convenient handling of outliers

• construction is not subjective (unlike histograms)

• used with medium or large size data sets (n > 10)

• useful for comparative displays

Disadvantages of boxplots

• do not retain the individual observations

• should not be used with small data sets (n < 10)

How to construct

• find five-number summary

Min Q1 Med Q3 Max

• draw box from Q1 to Q3

• draw median as center line in the box

• extend whiskers to min & max

Modified boxplots

• display outliers

• fences mark off mild & extreme outliers

• whiskers extend to largest (smallest) data value inside the fence

ALWAYS use modified boxplots in this class!!!

Inner fence

• lower inner fence: Q1 – 1.5 IQR

• upper inner fence: Q3 + 1.5 IQR

Any observation outside these fences is an outlier! Put a dot for each outlier.

Interquartile Range (IQR) – the range (length) of the box: IQR = Q3 – Q1

Modified Boxplot . . .

Draw each “whisker” from the quartile to the most extreme observation that is still within the fence!

Outer fence

• lower outer fence: Q1 – 3 IQR

• upper outer fence: Q3 + 3 IQR

Any observation outside these fences is an extreme outlier!

Any observation between the inner and outer fences is considered a mild outlier.
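The fence rules can be sketched in a few lines of Python. This is only an illustration: the function name classify is hypothetical, and the quartiles used (Q1 = 11, Q3 = 22) are the Cancer group's values from the five-number summary given later in these slides.

```python
def classify(value, q1, q3):
    """Label a value using the 1.5*IQR (inner) and 3*IQR (outer) fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme outlier"   # beyond the outer fence
    if value < inner_low or value > inner_high:
        return "mild outlier"      # between the inner and outer fences
    return "not an outlier"        # on or inside the inner fence


# Quartiles taken from the Cancer group's five-number summary in these slides.
q1, q3 = 11, 22
for value in (22, 39, 57, 210):
    print(value, classify(value, q1, q3))
```

With these quartiles the inner fences are -5.5 and 38.5 and the outer fences are -22 and 55, so 39 is a mild outlier while 57 and 210 are extreme outliers.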

For the AP Exam . . .

. . . you just need to find outliers; you DO NOT need to identify them as mild or extreme.

Therefore, you just need to use the 1.5 IQR fences.

A report from the U.S. Department of Justice gave the following percent increase in federal prison populations in 20 northeastern & mid-western states in 1999.

5.9 1.3 5.0 5.9 4.5 5.6 4.1 6.3 4.8 6.9

4.5 3.5 7.2 6.4 5.5 5.3 8.0 4.4 7.2 3.2

Create a modified boxplot. Describe the distribution.

Use the calculator to create a modified boxplot.
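As a cross-check on the calculator, the pieces of the modified boxplot for the prison data can be worked out with a short Python sketch. The helper five_number_summary is a hypothetical name, and it assumes quartiles are found as the medians of the lower and upper halves of the sorted data, which is the convention that reproduces the radon summaries later in these slides; other software may use a different quartile rule.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]    # lower half (middle value left out when n is odd)
    upper = xs[-half:]   # upper half (middle value left out when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


# Percent increases for the 20 states, copied from the slide.
prison = [5.9, 1.3, 5.0, 5.9, 4.5, 5.6, 4.1, 6.3, 4.8, 6.9,
          4.5, 3.5, 7.2, 6.4, 5.5, 5.3, 8.0, 4.4, 7.2, 3.2]

low, q1, med, q3, high = five_number_summary(prison)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as dots; whiskers stop at the most
# extreme values that are still inside the fences.
outliers = [x for x in prison if x < lower_fence or x > upper_fence]
inside = [x for x in prison if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print("five-number summary:", low, q1, med, q3, high)
print("fences:", round(lower_fence, 2), round(upper_fence, 2))
print("outliers:", outliers)
print("whisker ends:", whiskers)
```

Under this quartile rule Q1 is about 4.45, the median about 5.4 and Q3 about 6.35, so the fences sit near 1.6 and 9.2. The value 1.3 falls below the lower fence and is plotted as a dot, and the whiskers run from 3.2 to 8.0.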

[Figures: symmetrical boxplots, an approximately symmetrical boxplot, and a skewed boxplot]

Evidence suggests that a high indoor radon concentration might be linked to the development of childhood cancers. The data that follow are the radon concentrations in two different samples of houses. The first sample consisted of houses in which a child was diagnosed with cancer. Houses in the second sample had no recorded cases of childhood cancer.

Cancer

10 21 5 23 15 11 9 13 27 13 39 22 7

20 45 12 15 3 8 11 18 16 23 16 9 57

16 21 18 38 37 10 15 11 18 210 22 11 16

17 33 10

No Cancer

9 38 11 12 29 5 7 6 8 29 24 12 17

11 11 3 9 33 17 55 11 29 13 24 7 11

21 6 39 29 7 8 55 9 21 9 3 85 11

14

Create parallel boxplots. Compare the distributions.

Cancer’s 5 # Summary:

Min Q1 Med Q3 Max

3 11 16 22 210

IQR = 11

No Cancer’s 5 # Summary:

Min Q1 Med Q3 Max

3 8.5 11.5 26.5 85

IQR = 18

Calculating the fence (Cancer):

Q1 – 1.5 IQR

11 – 1.5*11 = -5.5

Q3 + 1.5 IQR

22 + 1.5*11 = 38.5

Calculating the fence (No Cancer):

Q1 – 1.5 IQR

8.5 – 1.5*18 = -18.5

Q3 + 1.5 IQR

26.5 + 1.5*18 = 53.5
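The fence calculations for both groups can be checked against the raw data with a small Python sketch. The data lists below are read from the slide with the run-together numbers at the row ends split apart, which is an assumption (it gives 42 and 40 houses and reproduces the five-number summaries above). The helper names quartiles and fences_and_outliers are hypothetical, and quartiles are again assumed to be the medians of the lower and upper halves.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    return median(xs[:half]), median(xs[-half:])


def fences_and_outliers(data):
    """Return the 1.5*IQR fences and the sorted values outside them."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (low, high), sorted(x for x in data if x < low or x > high)


cancer = [10, 21, 5, 23, 15, 11, 9, 13, 27, 13, 39, 22, 7,
          20, 45, 12, 15, 3, 8, 11, 18, 16, 23, 16, 9, 57,
          16, 21, 18, 38, 37, 10, 15, 11, 18, 210, 22, 11, 16,
          17, 33, 10]
no_cancer = [9, 38, 11, 12, 29, 5, 7, 6, 8, 29, 24, 12, 17,
             11, 11, 3, 9, 33, 17, 55, 11, 29, 13, 24, 7, 11,
             21, 6, 39, 29, 7, 8, 55, 9, 21, 9, 3, 85, 11,
             14]

for name, group in (("Cancer", cancer), ("No Cancer", no_cancer)):
    fences, outliers = fences_and_outliers(group)
    print(name, "fences:", fences, "outliers:", outliers)
```

The sketch gives fences of -5.5 and 38.5 for the Cancer group and -18.5 and 53.5 for the No Cancer group, in agreement with the hand calculations. The value 55 appears twice in the No Cancer data, so it is listed twice among that group's outliers.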

Creating a Box Plot

[Figure: parallel boxplots of Radon (axis from 0 to 200) for the Cancer and No Cancer groups]

The median radon concentration for the no cancer group is lower than the median for the cancer group. The range of the cancer group is larger than the range for the no cancer group. Both distributions are skewed right. The cancer group has outliers at 39, 45, 57, and 210. The no cancer group has outliers at 55 and 85.

Creating a Box Plot on your Calculator

Knowing about the DATA

• Which terms best represent the data?

– The mean and median best illustrate skewed data

– While variance and standard deviation represent symmetrical data

– Spread – how far away from the mean does the data stretch

– To calculate variance – we need to square the differences between the mean and each data value, add them up, and divide by n – 1

– Variance (s²) – a measure of how far a set of numbers is spread out. A variance of zero indicates that all the values are identical. A small variance indicates a small spread, while a large variance means the numbers are spread out

– Standard Deviation (s) – the square root of the variance; shows how much variation or dispersion from the average exists

Example:

• A person’s metabolic rate is the rate at which the body consumes energy. Metabolic rate is important in studies of weight gain, dieting and exercise. Here are the metabolic rates of 7 men who took part in a study of dieting (units: calories per 24 hours)

• Data: 1792 1666 1362 1614 1460 1867 1439

• Calculating standard deviation and variance on the calculator

• Use 1-Var Stats

• s = 189.240 calories

• s² = 35,811.667
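The calculator values can be reproduced by hand-style steps in Python. This is a minimal sketch that assumes the sample formulas (dividing by n – 1), which is what gives the figures on the slide; the variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Metabolic rates of the 7 men, copied from the slide.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

m = mean(rates)                                # the mean works out to 1600
squared_diffs = [(x - m) ** 2 for x in rates]  # squared distance from the mean
s2 = sum(squared_diffs) / (len(rates) - 1)     # sample variance: divide by n - 1
s = sqrt(s2)                                   # standard deviation: root of variance

print("variance:", round(s2, 3))
print("standard deviation:", round(s, 3))

# The standard library's sample formulas should agree with the manual steps.
print("library check:", round(variance(rates), 3), round(stdev(rates), 3))
```

The squared differences add up to 214,870, and dividing by 6 gives a variance of about 35,811.667 and a standard deviation of about 189.240, matching the calculator output.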