
221 Chapter3 Student


Statistics for Describing, Exploring, and Comparing Data

3-1: Review and Preview

Statistical ways to summarize data.

Measures of central tendency: mean, median, mode, and midrange
o Averages
o Center of the distribution
o Location of the center, middle
o “Although it describes all of us it describes none of us”

Measures of variation (dispersion): range, variance, and standard deviation.
o Spread of the data
o Cluster around the center? (small value, variance)
o More widely spread out? (larger value, variance)

Measures of position: percentiles, deciles, quartiles.
o Where a specific data value falls within the set.
o Comparison of data value to the set.
o Called “norms”

These three types of measures are called traditional statistics. Used to confirm conjectures about the data.

Exploratory data analysis – see what the data will show.
o Box plot
o 5 number summary

3-2: Measures of Center

RECALL: We take a sample from a population. If the population is small, we can collect data from every member (a census); if the population is large, we take a sample.

RECALL: Parameter is a measure found by using all the data values in the population – a characteristic of the population. A statistic is a measure found by using data values from a sample – a characteristic of the sample.

General Rounding Rule: Rounding should not take place until the final answer is calculated. Intermediate rounding increases error in your final calculation. We write down on paper about 3 or 4 decimal places to “show our work” but keep all the digits in the calculator so the final answer is as accurate as possible.

A measure of center is a value at the middle or center of the data set. There are several ways to calculate the “center” of a data set and they don’t all agree. Each measure of center has its benefits as well as its drawbacks. We will look at each measure below:

The mean:

o Arithmetic average is found by adding up all data values and dividing by the total number of values, n. An arithmetic average in statistics is referred to as either:

the sample mean, $\bar{x}$ (read “x-bar”), and it is calculated as follows:

$\bar{x} = \dfrac{\sum x}{n}$

n is the number of sample data values.

Math 121 Chapter 3 Page 1


Or, the population mean, $\mu$ (read “mu”), which is calculated as follows:

$\mu = \dfrac{\sum x}{N}$

N is the number of values in the population.

Rounding Rule for the Mean, Median, and Midrange: The mean, median, and midrange should always have one more decimal place than the raw data.

EXAMPLE 1: Remember our data set of miles traveled from home to be at GU? We will find the average distance traveled.

1279 288 794
329 471 363
833 277 2480
321
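As a check on the hand calculation, the mean formula can be sketched in a few lines of Python. This is only an illustration: the list name `miles` is made up here, and the ten values are assumed to be the same ones that appear in the data array of EXAMPLE 4.

```python
# Miles traveled from home (assumed to be the ten values used in EXAMPLE 1).
miles = [1279, 288, 794, 329, 471, 363, 833, 277, 2480, 321]

# Sample mean: add up all data values and divide by n.
n = len(miles)
total = sum(miles)
mean_miles = total / n

# The raw data are whole numbers, so the mean is reported to one decimal place.
print(f"sum = {total}, n = {n}, mean = {mean_miles:.1f}")
```

The one-decimal format follows the rounding rule for the mean stated above; the full value is kept in `mean_miles` so no intermediate rounding occurs.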

EXAMPLE 2: In a particular math class of thirty math students, they took an exam over Chapter 3. Find the mean exam score for the scores below:

19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

Finding the mean for grouped data (in a frequency table/distribution). Find the class midpoint for each class. Multiply each class midpoint by the frequency for its class. Sum up those products and divide that sum by the total frequency (n).

EXAMPLE 3: Find the mean of the grouped data in a frequency distribution below

Class Limits     Frequency (f)    Class Midpoint (x)    f · x
1 – 500          6                250.5                 1503
501 – 1000       2                750.5                 1501
1001 – 1500      1                1250.5                1250.5
1501 – 2000      0                1750.5                0
2001 – 2500      1                2250.5                2250.5
Total            10                                     6505
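The grouped-data procedure (midpoint, then midpoint times frequency, then divide the sum by n) can be sketched as follows. The variable names are illustrative, and the class limits and frequencies are taken from the table in EXAMPLE 3.

```python
# (lower limit, upper limit, frequency) for each class in EXAMPLE 3.
classes = [
    (1, 500, 6),
    (501, 1000, 2),
    (1001, 1500, 1),
    (1501, 2000, 0),
    (2001, 2500, 1),
]

# Class midpoint = (lower limit + upper limit) / 2.
midpoints = [(low + high) / 2 for low, high, _ in classes]

# Multiply each midpoint by the frequency of its class.
products = [freq * mid for (_, _, freq), mid in zip(classes, midpoints)]

# Divide the sum of the products by the total frequency n.
n = sum(freq for _, _, freq in classes)
grouped_mean = sum(products) / n

print("midpoints:", midpoints)
print("sum of f*x:", sum(products), " n:", n, " mean:", grouped_mean)
```

Because the midpoint stands in for every value in its class, the result is an approximation of the mean of the raw data, not the exact mean.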


The median:
o The median is the “halfway” point in the data set.
o The median is the midpoint of the data array.
o We must arrange the data values in order (data array).
o Then the median will be the middle value for a data set with an odd number of data values or the average of the middle two data values for a data set with an even number of data values.

EXAMPLE 4: Find the median for the data set below.

Raw Data:
1279 288 794
329 471 363
833 277 2480
321

Data Array: 277 288 321 329 363 471 794 833 1279 2480

Median: (even number of values, average of middle two)
Median = (363 + 471)/2 = 834/2 = 417
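The same steps (build the data array, then take the middle value or average the middle two) can be sketched in Python. The function name `median` and the list name `miles` are chosen here for illustration only.

```python
def median(values):
    """Median by the data-array method: sort, then take the middle."""
    data = sorted(values)          # the data array
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                 # odd n: single middle value
        return data[mid]
    # even n: average of the two middle values
    return (data[mid - 1] + data[mid]) / 2


miles = [1279, 288, 794, 329, 471, 363, 833, 277, 2480, 321]
print(sorted(miles))
print(median(miles))
```

For the ten mileage values the two middle entries of the sorted list are 363 and 471, so the function returns their average, matching the hand calculation in EXAMPLE 4.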

EXAMPLE 5: In a particular math class of thirty math students, they took an exam over Chapter 3. The exam scores are as follows:

19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

Find the median test score for the class.

The mode:
o Value that occurs the most often.
o Unimodal – one value occurs more than any other value.
o Bimodal – 2 values occur with the same frequency but more often than the rest.


o Multimodal – more than 2 values occur with the same frequency and more often than other data values.
o “No Mode” – no data value occurs more than once.

EXAMPLE 6: Find the mode of the data set.

1279 288 794
329 471 363
833 277 2480
321

EXAMPLE 7: In a particular math class of thirty math students, they took an exam over Chapter 3. The exam scores are as follows:

19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

Find the mode for the Math exam over Chapter 3.
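The mode definitions above can be sketched with a frequency count. The helper name `modes` is hypothetical, and the second list is a small made-up data set used only to show a bimodal result; an empty list is returned to represent “no mode.”

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means "no mode"."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:               # no data value occurs more than once
        return []
    # every value that reaches the highest frequency is a mode
    return sorted(v for v, c in counts.items() if c == highest)


miles = [1279, 288, 794, 329, 471, 363, 833, 277, 2480, 321]
print(modes(miles))                       # every value occurs once

illustrative = [3, 10, 6, 5, 8, 8, 6, 4]  # made-up values for illustration
print(modes(illustrative))                # two values tie for most frequent
```

One result means unimodal, two means bimodal, and more than two means multimodal. The sketch does not handle an empty input list.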

The midrange:
o Rough estimate of the middle
o Add the lowest and highest values in the data set and divide by 2: midrange = (lowest value + highest value)/2.
o Affected by extreme values!!! (Hence, really rough estimate!)

EXAMPLE 8: Find the midrange of the data set for miles traveled from home to college.

1279 288 794
329 471 363
833 277 2480
321
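A short sketch of the midrange calculation, which also shows how strongly a single extreme value moves it. The helper name `midrange` is illustrative, and dropping the largest value is done here only to demonstrate the sensitivity, not as a recommended practice.

```python
def midrange(values):
    """Midrange = (lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2


miles = [1279, 288, 794, 329, 471, 363, 833, 277, 2480, 321]
print(midrange(miles))

# Same data without the single largest distance, for comparison only.
without_largest = [x for x in miles if x != max(miles)]
print(midrange(without_largest))
```

With all ten values the midrange is (277 + 2480)/2; without the 2480 it drops to (277 + 1279)/2, which is why the midrange is called a really rough estimate.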


Weighted Average:

o $\bar{x} = \dfrac{\sum (w \cdot x)}{\sum w}$, where each value x is multiplied by its weight w.
o Think about weighted means in a table…sort of like calculating a mean for grouped data.
o The classic example of weighted average is calculating GPA!

EXAMPLE 9: You took 5 classes: a 3 credit stats class in which you earned an A, a 4 credit chemistry class in which you earned a B, a 2 credit fitness and nutrition class in which you earned an A, a 3 credit religion class in which you earned a C, and a 3 credit sociology class in which you earned an A. What was your GPA for that semester?

Class        Credits (Weight)    Grade (x)    Weight · Value
Stats        3                   A (4.0)      12
Chem         4                   B (3.0)      12
PE           2                   A (4.0)      8
Religion     3                   C (2.0)      6
Sociology    3                   A (4.0)      12
Total        15                               50
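The GPA table can be reproduced with a short weighted-mean sketch. The tuple layout and variable names are assumptions made for this illustration; the credits and grade points come from the table in EXAMPLE 9.

```python
# (course, credits as the weight w, grade points as the value x)
courses = [
    ("Stats", 3, 4.0),
    ("Chem", 4, 3.0),
    ("PE", 2, 4.0),
    ("Religion", 3, 2.0),
    ("Sociology", 3, 4.0),
]

# Weighted mean = sum of (w * x) divided by sum of w.
weighted_sum = sum(w * x for _, w, x in courses)
total_weight = sum(w for _, w, _ in courses)
gpa = weighted_sum / total_weight

print(weighted_sum, total_weight, round(gpa, 2))
```

The two totals match the bottom row of the table (50 and 15), and the GPA is their ratio.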

EXAMPLE 10: Thirty automobiles were tested for fuel efficiency (in miles per gallon). This frequency distribution was obtained.

Class Boundaries    Frequency (f)    Class Midpoint (x)    f · x
7.5 – 12.5          3                10                    3(10) = 30
12.5 – 17.5         5                15                    75
17.5 – 22.5         15               20                    300
22.5 – 27.5         5                25                    125
27.5 – 32.5         2                30                    60
Total               30               --                    590

a. Find the mean.

b. What is the modal class?

EXAMPLE 11: The heights of 20 highest waterfalls in the world are shown here. Find the mean, median, mode, and midrange.

3212 2800 2625 2540 2499 2425 2307 2151 2123 2000
1904 1841 1650 1612 1536 1388 1215 1198 1182 1170

EXAMPLE 12: A recent survey of a new diet cola reported the following percentages of people who like the taste. Find the weighted mean of the percentages.


Area     % Favored (value)    Number Surveyed (weight)    Weight · Value
1        40                   1000                        40(1000) = 40000
2        30                   3000                        90000
3        50                   800                         40000
Total    --                   4800                        170000

Skewness

A comparison of the mean, median, and mode can give us information pertaining to the skewness.

A distribution is said to be skewed if it is not symmetric. This means the left and right halves of the distribution are not mirror images of one another.

If a distribution is left skewed (negatively skewed) then the distribution tails off on the left-hand side and the mean and median are to the left of the mode.

If a distribution is right skewed (positively skewed) then the distribution tails off on the right-hand side and the mean and median are to the right of the mode.

When a distribution is symmetric (and has a single peak), the mean, median, and mode are equal. Keep in mind that comparing the mean, median, and mode does not always determine the shape of the distribution.


3.3 Measures of Variation

EXAMPLE 1: Below are scores on quizzes in 2 classes. Which class did better?

Class 1      Class 2
3    10      5    7
6    5       8    6
8    8       7    6
6    4       5    6

Class 1           Class 2
mode = 6, 8       mode = 6
median = 6        median = 6

We cannot tell by just the measures of central tendency. So we must look at the measures of variation. The consistency or spread of the data should be taken into account.

Round-Off Rule for Measures of Variation: We carry one more decimal place than is present in the raw data.

The range of a set of data is the difference between the maximum value and minimum value. The range is easy to compute; however, it uses only the two extreme values and does not take into account all values. So, outliers can affect the range.

EXAMPLE 2: Find the range for the data set:

1279 288 794
329 471 363
833 277 2480
321

Range = Max – Min = 2480 – 277 = 2203

The standard deviation of a set of sample values is a measure of variation of values about the mean.

Sample standard deviation:

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Short-Cut formula for Standard deviation:

$s = \sqrt{\dfrac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$

o The standard deviation is a measure of variation of all values from the mean.
o Outliers can dramatically increase the standard deviation.
o The units of the standard deviation, s, are the same as the units of the original data values.
o The value of the standard deviation is never negative. It is zero only when all of the data values are the same; otherwise it is positive.

EXAMPLE 3: Find the mean and standard deviation of the given data (in minutes) by hand and then with calculator:

a) 50, 50, 50, 50, 50

x (min)    $x - \bar{x}$    $(x - \bar{x})^2$
50
50
50
50
50
Total

b) 46, 50, 50, 50, 54

x (min)    $x - \bar{x}$    $(x - \bar{x})^2$
46
50
50
50
54
Total

c) 5, 50, 50, 50, 95

x (min)    $x - \bar{x}$    $(x - \bar{x})^2$
5          -45              2025
50         0                0
50         0                0
50         0                0
95         45               2025
Total                       4050
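The “by hand” table method can be mirrored in code and compared with a built-in routine, much as the example compares hand work with a calculator. This is a sketch: `sample_std` is a name chosen here, while `statistics.stdev` is the Python standard-library function for the sample standard deviation (divisor n - 1).

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation built the same way as the table."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(x - mean) ** 2 for x in values]   # (x - mean)^2 column
    return math.sqrt(sum(squared_devs) / (n - 1))


datasets = {
    "a": [50, 50, 50, 50, 50],
    "b": [46, 50, 50, 50, 54],
    "c": [5, 50, 50, 50, 95],
}

for label, data in datasets.items():
    s = sample_std(data)
    # The standard-library result should agree with the table method.
    assert math.isclose(s, statistics.stdev(data), abs_tol=1e-9)
    print(label, "mean =", statistics.mean(data), " s =", round(s, 1))
```

All three data sets have the same mean, so the difference between them shows up only in the standard deviation, which grows as the values spread farther from 50.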

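Population Standard Deviation:

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

The variance of a set of values is a measure of variation equal to the square of the standard deviation.

Sample Variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

Population Variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$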

EXAMPLE 4: In the preceding example we found that for 46, 50, 50, 50, 54 (minutes), the standard deviation was 2.8 minutes. Find the variance of that same example.


Range Rule of Thumb:

For Estimating a Value of the Standard Deviation, s: To roughly estimate the standard deviation from a collection of known sample data, use

$s \approx \dfrac{\text{range}}{4}$

where range = (maximum value) – (minimum value).

EXAMPLE 5: In the preceding example we found that for 46, 50, 50, 50, 54 (minutes), the standard deviation was 2.8 minutes. Use the range rule of thumb to determine if this standard deviation is a reasonable value.

For Interpreting a Known Value of the Standard Deviation: If the standard deviation is known, use it to find rough estimates of the minimum and maximum “usual” sample values by using the following:

Minimum “usual” value = (mean) – 2 × (standard deviation)
Maximum “usual” value = (mean) + 2 × (standard deviation)

EXAMPLE 6: Past results from the National Health Survey suggest that the pulse rates (beats per minute) for women have a mean of 76.0 and a standard deviation of 12.5. Find the minimum and maximum “usual” pulse rates. Then determine whether a pulse rate of 113 would be considered “unusual.”
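The range rule of thumb for “usual” values is simple enough to sketch directly. The function name `usual_range` is made up for this illustration; the mean, standard deviation, and pulse rate are the figures quoted in EXAMPLE 6.

```python
def usual_range(mean, std):
    """Rough minimum and maximum "usual" values: mean -/+ 2 standard deviations."""
    return mean - 2 * std, mean + 2 * std


low, high = usual_range(76.0, 12.5)

pulse = 113
is_unusual = pulse < low or pulse > high

print(f"usual range: {low} to {high}")
print(f"pulse of {pulse} unusual? {is_unusual}")
```

A value is flagged as unusual only when it falls outside the interval; values exactly at either end still count as usual under this rule.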

Empirical (or 68-95-99.7) Rule for Data with a Bell-Shaped Distribution

This rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply.

About 68% of all values fall within 1 standard deviation of the mean.

About 95% of all values fall within 2 standard deviations of the mean.

About 99.7% of all values fall within 3 standard deviations of the mean.


EXAMPLE 7: IQ scores have a bell-shaped distribution with a mean of 100 and variance of 225. What percentage of IQ scores are between 70 and 130?

Chebyshev’s Theorem

The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - \dfrac{1}{K^2}$, where K is any positive number greater than 1. For K = 2 and K = 3, we get the following:

At least ¾ (or 75%) of all values lie within 2 standard deviations of the mean.

At least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean.

EXAMPLE 9: On a particular exam, the average score has been 65 with a standard deviation of 5. According to Chebyshev’s Theorem, find the percentage of students having a score between 40 and 90.
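Chebyshev’s bound can be sketched as a one-line function. The name `chebyshev_bound` is hypothetical; the mean, standard deviation, and score limits are the ones given in the example, and the sketch assumes the interval is symmetric about the mean, which is checked in the code.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


mean, std = 65, 5
low, high = 40, 90

# Number of standard deviations from the mean to each end of the interval.
k = (high - mean) / std
assert (mean - low) / std == k     # interval is symmetric about the mean

print(f"K = {k}, at least {chebyshev_bound(k):.0%} of scores")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, so it is a guaranteed minimum and not an estimate.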

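The coefficient of variation (or CVar) for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean, and is given by the following:

Sample: $CVar = \dfrac{s}{\bar{x}} \cdot 100\%$

Population: $CVar = \dfrac{\sigma}{\mu} \cdot 100\%$

Note: We cannot directly compare the standard deviations of data sets measured in different units. The coefficient of variation has no units, so it allows that comparison.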

EXAMPLE 10: Using the sample height and weight data for the 40 males we find the statistics below:

Height:

70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2
68.3 65.6 63.0 68.3 73.1 67.6 68.0 71.0
61.3 76.2 66.3 69.7 65.4 70.0 62.9 68.5
68.3 69.4 69.2 68.0 71.9 66.1 72.4 73.0
68.0 68.7 70.3 63.7 71.1 65.6 68.3 66.3

Weight:

169.1 144.2 179.3 175.8 152.6 166.8 135.0 201.5
175.2 139.0 156.3 186.6 191.1 151.3 209.4 237.1
176.7 220.6 166.1 137.4 164.2 162.4 151.8 144.1
204.6 193.8 172.9 161.9 174.8 169.8 213.3 198.0
173.3 214.5 137.1 119.5 189.1 164.7 170.1 151.0

          Mean, $\bar{x}$    Standard Deviation, s
Height    68.34 in           3.02 in
Weight    172.55 lb          26.33 lb

Because we can’t compare standard deviations measured in different units, we need to find the coefficient of variation for each. The weight is more varied than the height, with a CVar of 15.3% compared to a CVar of 4.4% for the height of the 40 men.
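The two coefficients of variation can be checked with a short sketch. The function name is chosen here for illustration, and the inputs are the summary statistics quoted in the table for this example.

```python
def coefficient_of_variation(std, mean):
    """Standard deviation relative to the mean, expressed as a percent."""
    return std / mean * 100


cv_height = coefficient_of_variation(3.02, 68.34)    # inches / inches
cv_weight = coefficient_of_variation(26.33, 172.55)  # pounds / pounds

print(f"height CVar = {cv_height:.1f}%")
print(f"weight CVar = {cv_weight:.1f}%")
```

Because the units cancel in each ratio, the two percentages can be compared directly even though one data set is in inches and the other in pounds.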

3.4 Measures of Relative Standing and Boxplots

When we talk about relative standing within a data set, we are talking about the location of a data value in comparison with the other data values.

You are probably familiar with percentiles as your results on standardized exams (SAT, etc.) and childhood growth (height, weight, etc.) are most often given as a percentile. In addition to percentiles, we will discuss quartiles and z-scores.

Z-Score

A z-score is simply a standardized score that is found by converting a value to a standardized scale. This means we are really determining how many standard deviations a particular data value, x, falls from the mean.

Sample: $z = \dfrac{x - \bar{x}}{s}$

Population: $z = \dfrac{x - \mu}{\sigma}$

Rounding Rule for Z-scores: We round a z-score to two decimal places (that matches the z-chart we will use later).

EXAMPLE 1: Referring back to EXAMPLE 10 in section 3.3, we would like to know if it is more extreme for a man to be 7 feet tall, to be 5 feet tall, or to weigh 300 pounds.

In order to do this we cannot make a direct comparison as these are two different types of measures. However, if we “standardize” each value then we can make a comparison.

Recall the following summary data for each set:

          Mean, $\bar{x}$    Standard Deviation, s
Height    68.34 in           3.02 in
Weight    172.55 lb          26.33 lb


Now we will calculate the z-score for each:

Height of 84”: $z = \dfrac{84 - 68.34}{3.02} \approx 5.19$

Height of 60”: $z = \dfrac{60 - 68.34}{3.02} \approx -2.76$

Weight of 300 lbs: $z = \dfrac{300 - 172.55}{26.33} \approx 4.84$

When we look at the z-scores we can say that a height of 7 feet (84”) is the most extreme because it falls 5.19 standard deviations above (+) the mean. A height of 5 feet (60”) is 2.76 standard deviations below (-) the mean and a weight of 300 pounds is 4.84 standard deviations above (+) the mean.

Usual and Unusual Values:
Usual values have z-scores that are between -2 and 2, inclusive.
Unusual values fall outside of that range – z-scores below -2 or above 2.

NOTE: A negative z-score means the data value is less than the mean (below the mean) and a positive z-score means the data value exceeds the mean (above the mean).

EXAMPLE 2: Referring back to EXAMPLE 1, is a height of 72” usual or unusual?

$z = \dfrac{72 - 68.34}{3.02} \approx 1.21$

Since this value falls between -2 and 2, we say it is a usual or typical value. (The question asks whether it is unusual to find a man who is 6 feet tall. Common sense says it is not, which agrees with the z-score.)
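The z-score calculations and the usual/unusual test can be sketched together. The function names and the `cases` list are illustrative; the means and standard deviations are the summary statistics for the 40 men given above.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x falls from the mean."""
    return (x - mean) / std


def is_unusual(z):
    """Unusual values have z-scores below -2 or above 2."""
    return z < -2 or z > 2


HEIGHT = (68.34, 3.02)     # mean and standard deviation, inches
WEIGHT = (172.55, 26.33)   # mean and standard deviation, pounds

cases = [
    ("height 84 in", 84, HEIGHT),
    ("height 60 in", 60, HEIGHT),
    ("height 72 in", 72, HEIGHT),
    ("weight 300 lb", 300, WEIGHT),
]

for label, x, (mean, std) in cases:
    z = z_score(x, mean, std)
    verdict = "unusual" if is_unusual(z) else "usual"
    print(f"{label}: z = {z:.2f} ({verdict})")
```

Because every value is converted to the same standardized scale, the heights and the weight can be ranked against one another by the size of their z-scores.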

Percentiles

A percentile is a measure of location which divides the data set into 100 groups with about 1% of the values in each group. We denote percentiles as $P_1, P_2, \ldots, P_{99}$. If we talk about $P_{35}$, then we mean the 35th percentile, and this means that about 35% of the data values will lie below this value. If we talk about $P_{50}$, then we mean the 50th percentile, and this means that 50% of the data values lie below this value – the median. Additionally, just as when finding the median, when we find percentiles we must order the data set first.

There are two ways that we want to talk about percentiles:
1. We may want to know what percentile corresponds to a known data value; or
2. We may want to know what data value corresponds to a particular percentile.

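To find the percentile corresponding to a known data value x, we compute

percentile of x = $\dfrac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

(Round to the nearest whole number)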

EXAMPLE 3: In a particular math class of thirty math students, they took an exam over Chapter 3. The exam scores are as follows:


19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

What percentile corresponds to an exam score of 73?

What percentile corresponds to an exam score of 45?

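To find the data value corresponding to a stated percentile we must define a few notations:

n = total number of values in the data set
k = percentile being used (for example, k = 25 for the 25th percentile)
L = locator that gives the position of a value in the ordered data set
$P_k$ = kth percentile

First, we calculate the value of the location or position, L, as follows:

$L = \dfrac{k}{100} \cdot n$

Now, there are two options for L:
1. It is NOT a whole number and then we round UP to the next whole number and locate the data value that occupies that position; OR
2. It is a whole number and then we find the average of the Lth and (L + 1)st numbers in the ordered data set.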
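Both directions of the percentile procedure can be sketched as follows. The function names are hypothetical, k is assumed to be a whole number from 1 to 99, and the ten mileage values from section 3-2 are used as sample input. Integer arithmetic is used for the locator so that the whole-number test is exact.

```python
def percentile_value(values, k):
    """Data value for the kth percentile, using the locator L = k/100 * n."""
    data = sorted(values)                   # percentiles need ordered data
    n = len(data)
    whole, remainder = divmod(k * n, 100)   # L = whole + remainder/100
    if remainder == 0:
        # L is a whole number: average the Lth and (L + 1)st values.
        return (data[whole - 1] + data[whole]) / 2
    # L is not whole: round up to position whole + 1 (list index `whole`).
    return data[whole]


def percentile_of(values, x):
    """Percentile of a known value x: percent of values less than x."""
    below = sum(1 for v in values if v < x)
    percent = below / len(values) * 100
    return int(percent + 0.5)               # round half up to a whole number


miles = [1279, 288, 794, 329, 471, 363, 833, 277, 2480, 321]
print(percentile_value(miles, 25))
print(percentile_value(miles, 50))
print(percentile_value(miles, 75))
print(percentile_of(miles, 471))
```

For k = 50 the locator is a whole number, so the result is the average of the 5th and 6th ordered values, which is the same as the median. Other textbooks and software use slightly different percentile conventions, so results from a calculator or spreadsheet may not match this rule exactly.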

EXAMPLE 4: In a particular math class of thirty math students, they took an exam over Chapter 3. The exam scores are as follows:

19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

Find the 25th percentile for the Chapter 3 exam scores.


Find the 75th percentile for the Chapter 3 exam scores.

Quartiles

Recall that there are 99 percentiles that divide the data set up into 100 groups. There are 3 quartiles that divide the data set up into 4 groups. Quartiles are measures of location, just as percentiles are, but each group contains about 25% of the data. How can we connect quartiles back to percentiles?

$Q_1 = P_{25}$ (lower quartile)
$Q_2 = P_{50}$ (median)
$Q_3 = P_{75}$ (upper quartile)

So, when we find quartiles, we really aren’t finding anything new, just thinking back to the corresponding percentile.

EXAMPLE 5: In a particular math class of thirty math students, they took an exam over Chapter 3. The exam scores are as follows:

19 86 96 99 99 45 74 76 80 9867 43 76 87 97 98 90 76 42 797 34 95 76 35 76 29 45 76 80

Find the quartiles for the data set.

The interquartile range (IQR) is the difference between the upper ($Q_3$) and lower ($Q_1$) quartiles: IQR = $Q_3 - Q_1$. Approximately half the data values fall within the interquartile range.

The semi-interquartile range is the IQR divided by 2: $(Q_3 - Q_1)/2$.

The midquartile is the sum of the upper and lower quartiles divided by 2: $(Q_3 + Q_1)/2$.


We use a diagram called a boxplot to visually display the extreme values (minimum and maximum), the quartiles (lower, median, upper) and the IQR over a number line. We draw a box that shows the IQR with a line through the box at the median. Then we draw in “whiskers” that extend from the box out to the extreme values.

The 5 number summary is just an ordered listing, written in parentheses, of the important values that are used in the box plot:

(minimum, $Q_1$, median, $Q_3$, maximum)

EXAMPLE 6: Consider the data set and find the minimum value, the lower quartile, the median, the upper quartile, and the maximum value. Find the interquartile range.

0 1 1 2 2 2
3 3 3 3 4 4
4 4 4 5 5 5
5 5 6 6 6 6
7 7 7 8 8 9

Minimum: _______________

Lower Quartile: ___________

Median: ________________

Upper Quartile: __________

Maximum: ______________

IQR: ________________________________

EXAMPLE 7: Use the values we found in EXAMPLE 6 and construct a boxplot for the data set.

EXAMPLE 8: Find the 5 number summary and construct a box plot for the Math exam scores.
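A 5 number summary can be checked with a short sketch that applies the locator rule from the percentile section. The function names are illustrative, and the input is assumed to be the thirty values from 0 to 9 listed with EXAMPLE 6.

```python
def percentile_value(values, k):
    """kth percentile by the locator rule L = k/100 * n (k from 1 to 99)."""
    data = sorted(values)
    whole, remainder = divmod(k * len(data), 100)
    if remainder == 0:                      # whole-number L: average two values
        return (data[whole - 1] + data[whole]) / 2
    return data[whole]                      # otherwise round L up


def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum) with Q1 = P25, Q2 = P50, Q3 = P75."""
    return (
        min(values),
        percentile_value(values, 25),
        percentile_value(values, 50),
        percentile_value(values, 75),
        max(values),
    )


# Assumed data set for EXAMPLE 6 (thirty values).
data = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,
        5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]

summary = five_number_summary(data)
iqr = summary[3] - summary[1]
print(summary, "IQR =", iqr)
```

The box of the boxplot runs from the second to the fourth entry of the summary, the line inside the box sits at the third entry, and the whiskers reach to the first and last entries.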


Outliers and Modified Boxplots

An outlier is a value that falls “far” from what would be considered normal data values. We will define outliers in terms of the interquartile range (IQR). A data value will qualify as an outlier if either of the following conditions is met:

1. The data value is larger than $Q_3 + 1.5 \cdot IQR$; or
2. The data value is smaller than $Q_1 - 1.5 \cdot IQR$.

A modified boxplot is constructed in a manner that is much the same as a regular or skeletal box plot but with the following modifications:

1. An asterisk is used to identify all outlier data values.
2. The whiskers extend only as far as the maximum and/or minimum data value(s) that are not considered outliers.

EXAMPLE 9: Construct a modified boxplot for the data below:

0 2 4 4 4 5 5 5 5 5 9
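The numbers needed for a modified boxplot can be found with a sketch like the one below. It assumes the quartiles are taken as the 25th and 75th percentiles by the locator rule and that the outlier limits are 1.5 × IQR beyond the quartiles; the function and variable names are made up for this illustration.

```python
def percentile_value(values, k):
    """kth percentile by the locator rule L = k/100 * n (k from 1 to 99)."""
    data = sorted(values)
    whole, remainder = divmod(k * len(data), 100)
    if remainder == 0:                      # whole-number L: average two values
        return (data[whole - 1] + data[whole]) / 2
    return data[whole]                      # otherwise round L up


data = [0, 2, 4, 4, 4, 5, 5, 5, 5, 5, 9]

q1 = percentile_value(data, 25)
q3 = percentile_value(data, 75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr                # smaller than this: outlier
upper_limit = q3 + 1.5 * iqr                # larger than this: outlier

outliers = [x for x in data if x < lower_limit or x > upper_limit]
inside = [x for x in data if lower_limit <= x <= upper_limit]

# Whiskers stop at the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)

print("Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
print("limits:", lower_limit, upper_limit)
print("outliers (plotted with asterisks):", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

In this data set the whiskers end at the same values as the edges of the box, so the modified boxplot shows a box with no visible whiskers and asterisks at the outliers.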
